[BUG] - Dependency Issue in Language Modeling with nn.Transformer and torchtext Tutorial #2895
Comments
I'm happy to take a crack at this, but I'm not sure I understand the issue.

> the HF from datasets import load_dataset

What does HF mean?

> with the dependency workaround:
> import torch
> torch.utils.data.datapipes.utils.common.DILL_AVAILABLE = torch.utils._import_utils.dill_available()
> import torchdata

Does this need to go in the pytorch source code or in the tutorial itself?

I tried a simple sniff test:

# imports needed for the snippet to run
import torch
from torch import Tensor
from torch.utils.data import dataset

torch.utils.data.datapipes.utils.common.DILL_AVAILABLE = torch.utils._import_utils.dill_available()
import torchdata
from torchtext.datasets import WikiText2
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

train_iter = WikiText2(split='train')
tokenizer = get_tokenizer('basic_english')
vocab = build_vocab_from_iterator(map(tokenizer, train_iter), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])

def data_process(raw_text_iter: dataset.IterableDataset) -> Tensor:
    """Converts raw text into a flat Tensor."""
    data = [torch.tensor(vocab(tokenizer(item)), dtype=torch.long) for item in raw_text_iter]
    return torch.cat(tuple(filter(lambda t: t.numel() > 0, data)))

# ``train_iter`` was "consumed" by the process of building the vocab,
# so we have to create it again
train_iter, val_iter, test_iter = WikiText2()
train_data = data_process(train_iter)
val_data = data_process(val_iter)
test_data = data_process(test_iter)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def batchify(data: Tensor, bsz: int) -> Tensor:
    """Divides the data into ``bsz`` separate sequences, removing extra elements
    that wouldn't cleanly fit.

    Arguments:
        data: Tensor, shape ``[N]``
        bsz: int, batch size

    Returns:
        Tensor of shape ``[N // bsz, bsz]``
    """
    seq_len = data.size(0) // bsz
    data = data[:seq_len * bsz]
    data = data.view(bsz, seq_len).t().contiguous()
    return data.to(device)

batch_size = 20
eval_batch_size = 10
train_data = batchify(train_data, batch_size)  # shape ``[seq_len, batch_size]``
val_data = batchify(val_data, eval_batch_size)
test_data = batchify(test_data, eval_batch_size)

and got the following error in the Colab notebook:

HTTPError                                 Traceback (most recent call last)
<ipython-input-12-0398103be9c1> in <cell line: 10>()
      8 train_iter = WikiText2(split='train')
      9 tokenizer = get_tokenizer('basic_english')
---> 10 vocab = build_vocab_from_iterator(map(tokenizer, train_iter), specials=['<unk>'])
     11 vocab.set_default_index(vocab['<unk>'])
     12

54 frames
/usr/local/lib/python3.10/dist-packages/requests/models.py in raise_for_status(self)
   1019
   1020         if http_error_msg:
-> 1021             raise HTTPError(http_error_msg, response=self)
   1022
   1023     def close(self):

HTTPError: 403 Client Error: Forbidden for url: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip
This exception is thrown by __iter__ of HTTPReaderIterDataPipe(skip_on_error=False, source_datapipe=OnDiskCacheHolderIterDataPipe, timeout=None)

Also looks like advanced_source/ddp_pipeline.py might suffer from the same issue. (https://pytorch.org/tutorials/intermediate/pipeline_tutorial.html)
|
Hey Logan,

Thanks for your reply. The fix needs to go to the tutorial, not the source. The issue is that the source is not accessible. I am suggesting replacing it with a Hugging Face open dataset, such as WikiText2.

James
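For illustration, a minimal sketch of that dataset swap, assuming the Hugging Face Hub copy of WikiText-2 (the dataset, config, and field names below are assumptions, not taken from the tutorial):

# Sketch: load WikiText-2 from the Hugging Face Hub instead of the inaccessible S3 URL.
# Requires `pip install datasets`.
from datasets import load_dataset

wikitext = load_dataset('wikitext', 'wikitext-2-v1')   # splits: 'train', 'validation', 'test'

# Each example is a dict with a 'text' field, so an iterator of raw lines
# (the shape the tutorial's tokenizer/vocab code expects) is just:
train_iter = (example['text'] for example in wikitext['train'])

The rest of the tutorial's preprocessing could then stay as-is, since it only needs an iterable of raw text lines.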
|
/assigntome |
We no longer support this tutorial, as torchtext is no longer maintained. Can you please create a redirect file called beginner_source/transformer_tutorial.rst with the following content:

Language Modeling with nn.Transformer and torchtext
====================================================

The content is deprecated.

.. raw:: html

   <meta http-equiv="refresh" content="0; url=https://pytorch.org/tutorials/">
|
Torchtext is only used for the vocab and tokenizer. Could this be swapped out for an alternative library?
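For illustration, one possible swap is to drop the library dependency entirely and build the vocab with plain Python and torch. A minimal sketch (not the tutorial's code; the whitespace tokenizer below only approximates torchtext's basic_english tokenizer):

# Sketch: replace torchtext's get_tokenizer/build_vocab_from_iterator with plain Python.
from collections import Counter
import torch

def basic_english_tokenize(line: str) -> list[str]:
    # Crude stand-in for torchtext's 'basic_english' tokenizer.
    return line.lower().split()

def build_vocab(lines, specials=('<unk>',)) -> dict[str, int]:
    # Count tokens over the corpus and assign each token an integer id,
    # reserving the first ids for special tokens such as '<unk>'.
    counter = Counter()
    for line in lines:
        counter.update(basic_english_tokenize(line))
    itos = list(specials) + sorted(counter)
    return {token: idx for idx, token in enumerate(itos)}

# Usage, given an iterable of raw text lines:
# stoi = build_vocab(train_lines)
# unk = stoi['<unk>']
# ids = torch.tensor([stoi.get(tok, unk) for tok in basic_english_tokenize(line)], dtype=torch.long)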
|
I do believe the purpose of this tutorial was to use torchtext with pytorch. We actually don't have the source file of the tutorial in the repo anymore. |
@jamesdhope I'll be submitting a PR shortly to deprecate this tutorial. However, it does look like other tutorials make use of the |
Issue and Suggested Fix
Please can this helpful tutorial be updated to load the data with the Hugging Face datasets library (from datasets import load_dataset) and merged into main with the dependency issue workaround:

import torch
torch.utils.data.datapipes.utils.common.DILL_AVAILABLE = torch.utils._import_utils.dill_available()
import torchdata
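A hedged sketch of what the suggested fix might look like end to end: read WikiText-2 from the Hugging Face Hub and convert each split to a flat tensor of token ids, mirroring the tutorial's data_process step. The dataset, config, and field names are assumptions; the tokenizer and vocab still come from torchtext as in the tutorial.

import torch
from torch import Tensor
from datasets import load_dataset
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

wikitext = load_dataset('wikitext', 'wikitext-2-v1')   # splits: train / validation / test
tokenizer = get_tokenizer('basic_english')
vocab = build_vocab_from_iterator(
    (tokenizer(ex['text']) for ex in wikitext['train']), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])

def data_process(lines) -> Tensor:
    """Converts an iterable of raw text lines into a flat Tensor of token ids."""
    data = [torch.tensor(vocab(tokenizer(line)), dtype=torch.long) for line in lines]
    return torch.cat(tuple(filter(lambda t: t.numel() > 0, data)))

train_data = data_process(ex['text'] for ex in wikitext['train'])
val_data = data_process(ex['text'] for ex in wikitext['validation'])
test_data = data_process(ex['text'] for ex in wikitext['test'])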
Asset
https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/9cf2d4ead514e661e20d2070c9bf7324/transformer_tutorial.ipynb#scrollTo=TY5T9Gic_qih
Describe the bug
The tutorial no longer runs as published: the WikiText2 download used by torchtext fails with HTTP 403 Forbidden (see the HTTPError traceback quoted in this thread), and torchdata cannot be imported without the DILL_AVAILABLE workaround above.
Describe your environment
Google Colab environment. I have replicated the issue locally with the same pip package versions.
cc @sekyondaMeta @svekars @kit1980