
[BUG] - Dependency Issue in Language Modeling with nn.Transformer and torchtext Tutorial #2895


Closed
jamesdhope opened this issue Jun 1, 2024 · 7 comments · Fixed by #2910

jamesdhope commented Jun 1, 2024

Issue and Suggested Fix

Please can this helpful tutorial be updated with the HF `from datasets import load_dataset` and merged into main with the following workaround for the dependency issue:

import torch
# Patch the symbol removed from newer torch releases so that the
# torchdata import below does not fail with an ImportError
torch.utils.data.datapipes.utils.common.DILL_AVAILABLE = torch.utils._import_utils.dill_available()
import torchdata
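
For reference, here is a minimal sketch of what the HF-based loading could look like. This is an assumption on my part: it uses the `wikitext` dataset with the `wikitext-2-v1` config from the Hugging Face Hub, which should mirror what torchtext's WikiText2 downloaded.

# Hypothetical replacement for torchtext's WikiText2 using HF datasets.
# Assumes the "wikitext" dataset / "wikitext-2-v1" config on the HF Hub.
from datasets import load_dataset

wikitext = load_dataset("wikitext", "wikitext-2-v1")

# Each row has a single "text" field; expose the train split as an
# iterator of raw lines, like WikiText2(split='train') used to return.
train_iter = (row["text"] for row in wikitext["train"])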

Asset

https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/9cf2d4ead514e661e20d2070c9bf7324/transformer_tutorial.ipynb#scrollTo=TY5T9Gic_qih

Describe the bug

ImportError                               Traceback (most recent call last)
<ipython-input-26-b02c7921f3b1> in <cell line: 5>()
      3 from torchtext.vocab import build_vocab_from_iterator
      4 
----> 5 train_iter = WikiText2(split='train')
      6 tokenizer = get_tokenizer('basic_english')
      7 vocab = build_vocab_from_iterator(map(tokenizer, train_iter), specials=['<unk>'])

6 frames
/usr/local/lib/python3.10/dist-packages/torchdata/datapipes/iter/util/cacheholder.py in <module>
     22     portalocker = None
     23 
---> 24 from torch.utils.data.datapipes.utils.common import _check_unpickable_fn, DILL_AVAILABLE
     25 
     26 from torch.utils.data.graph import traverse_dps

ImportError: cannot import name 'DILL_AVAILABLE' from 'torch.utils.data.datapipes.utils.common' (/usr/local/lib/python3.10/dist-packages/torch/utils/data/datapipes/utils/common.py)

---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.

To view examples of installing some common dependencies, click the
"Open Examples" button below.
---------------------------------------------------------------------------

Describe your environment

Google Colab environment. I have replicated the issue locally with the same pip package versions.

cc @sekyondaMeta @svekars @kit1980

@loganthomas (Contributor)

I'm happy to take a crack at this, but I'm not sure I understand the issue.

the HF from datasets import load_dataset

What does HF mean?

with the dependency workaround:

import torch
torch.utils.data.datapipes.utils.common.DILL_AVAILABLE = torch.utils._import_utils.dill_available()
import torchdata

Does this need to go in the pytorch source code or in the tutorial itself?

I tried a simple sniff test:

import torch
from torch import Tensor
from torch.utils.data import dataset

# Work around the removed DILL_AVAILABLE symbol before importing torchdata
torch.utils.data.datapipes.utils.common.DILL_AVAILABLE = torch.utils._import_utils.dill_available()
import torchdata

from torchtext.datasets import WikiText2
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

train_iter = WikiText2(split='train')
tokenizer = get_tokenizer('basic_english')
vocab = build_vocab_from_iterator(map(tokenizer, train_iter), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])

def data_process(raw_text_iter: dataset.IterableDataset) -> Tensor:
    """Converts raw text into a flat Tensor."""
    data = [torch.tensor(vocab(tokenizer(item)), dtype=torch.long) for item in raw_text_iter]
    return torch.cat(tuple(filter(lambda t: t.numel() > 0, data)))

# ``train_iter`` was "consumed" by the process of building the vocab,
# so we have to create it again
train_iter, val_iter, test_iter = WikiText2()
train_data = data_process(train_iter)
val_data = data_process(val_iter)
test_data = data_process(test_iter)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def batchify(data: Tensor, bsz: int) -> Tensor:
    """Divides the data into ``bsz`` separate sequences, removing extra elements
    that wouldn't cleanly fit.

    Arguments:
        data: Tensor, shape ``[N]``
        bsz: int, batch size

    Returns:
        Tensor of shape ``[N // bsz, bsz]``
    """
    seq_len = data.size(0) // bsz
    data = data[:seq_len * bsz]
    data = data.view(bsz, seq_len).t().contiguous()
    return data.to(device)

batch_size = 20
eval_batch_size = 10
train_data = batchify(train_data, batch_size)  # shape ``[seq_len, batch_size]``
val_data = batchify(val_data, eval_batch_size)
test_data = batchify(test_data, eval_batch_size)

and got the following error in the Colab notebook:

HTTPError                                 Traceback (most recent call last)
<ipython-input-12-0398103be9c1> in <cell line: 10>()
      8 train_iter = WikiText2(split='train')
      9 tokenizer = get_tokenizer('basic_english')
---> 10 vocab = build_vocab_from_iterator(map(tokenizer, train_iter), specials=['<unk>'])
     11 vocab.set_default_index(vocab['<unk>'])
     12 

54 frames
/usr/local/lib/python3.10/dist-packages/requests/models.py in raise_for_status(self)
   1019 
   1020         if http_error_msg:
-> 1021             raise HTTPError(http_error_msg, response=self)
   1022 
   1023     def close(self):

HTTPError: 403 Client Error: Forbidden for url: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip
This exception is thrown by __iter__ of HTTPReaderIterDataPipe(skip_on_error=False, source_datapipe=OnDiskCacheHolderIterDataPipe, timeout=None)

Also, it looks like advanced_source/ddp_pipeline.py might suffer from the same issue (https://pytorch.org/tutorials/intermediate/pipeline_tutorial.html).
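
Since the 403 suggests the upstream S3 host for WikiText-2 is gone, one possible fix is to feed the pipeline above from HF datasets instead of the torchtext download. A sketch only, assuming the `wikitext` / `wikitext-2-v1` config and its train/validation/test splits on the Hugging Face Hub:

# Sketch: source the text from HF datasets rather than the dead S3 URL.
# Assumes config "wikitext-2-v1" with train/validation/test splits.
from datasets import load_dataset

wikitext = load_dataset("wikitext", "wikitext-2-v1")

def split_iter(split: str):
    # Yield raw lines of text, like the old torchtext iterators did
    return (row["text"] for row in wikitext[split])

# Reuse the tokenizer, vocab builder, and data_process defined above
vocab = build_vocab_from_iterator(map(tokenizer, split_iter("train")), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])

train_data = data_process(split_iter("train"))
val_data = data_process(split_iter("validation"))
test_data = data_process(split_iter("test"))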

jamesdhope (Author) commented Jun 5, 2024 via email

@loganthomas (Contributor)

/assigntome

svekars (Contributor) commented Jun 5, 2024

We no longer support this tutorial, as torchtext is no longer maintained. Can you please create a redirect file called beginner_source/transformer_tutorial.rst with the following content:

Language Modeling with nn.Transformer and torchtext
===================================================

The content is deprecated.

.. raw:: html

   <meta http-equiv="refresh" content="0; url=https://pytorch.org/tutorials/">

jamesdhope (Author) commented Jun 5, 2024 via email

svekars (Contributor) commented Jun 6, 2024

I do believe the purpose of this tutorial was to demonstrate using torchtext with PyTorch. We actually don't have the source file of the tutorial in the repo anymore.

@loganthomas (Contributor)

@jamesdhope I'll be submitting a PR shortly to deprecate this tutorial. However, it does look like other tutorials make use of the Wikitext-2 dataset without torchtext.
