[BUG] - Dependency Issue in Language Modeling with nn.Transformer and torchtext Tutorial #2895
Comments
I'm happy to take a crack at this, but I'm not sure I understand the issue.

> the HF from datasets import load_dataset

What does HF mean?

> with the dependency workaround:
> import torch
> torch.utils.data.datapipes.utils.common.DILL_AVAILABLE = torch.utils._import_utils.dill_available()
> import torchdata

Does this need to go in the pytorch source code or in the tutorial itself?

I tried a simple sniff test:

# imports needed for the snippet to run
import torch
from torch import Tensor
from torch.utils.data import dataset

torch.utils.data.datapipes.utils.common.DILL_AVAILABLE = torch.utils._import_utils.dill_available()
import torchdata
from torchtext.datasets import WikiText2
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

train_iter = WikiText2(split='train')
tokenizer = get_tokenizer('basic_english')
vocab = build_vocab_from_iterator(map(tokenizer, train_iter), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])

def data_process(raw_text_iter: dataset.IterableDataset) -> Tensor:
    """Converts raw text into a flat Tensor."""
    data = [torch.tensor(vocab(tokenizer(item)), dtype=torch.long) for item in raw_text_iter]
    return torch.cat(tuple(filter(lambda t: t.numel() > 0, data)))

# ``train_iter`` was "consumed" by the process of building the vocab,
# so we have to create it again
train_iter, val_iter, test_iter = WikiText2()
train_data = data_process(train_iter)
val_data = data_process(val_iter)
test_data = data_process(test_iter)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def batchify(data: Tensor, bsz: int) -> Tensor:
    """Divides the data into ``bsz`` separate sequences, removing extra elements
    that wouldn't cleanly fit.

    Arguments:
        data: Tensor, shape ``[N]``
        bsz: int, batch size

    Returns:
        Tensor of shape ``[N // bsz, bsz]``
    """
    seq_len = data.size(0) // bsz
    data = data[:seq_len * bsz]
    data = data.view(bsz, seq_len).t().contiguous()
    return data.to(device)

batch_size = 20
eval_batch_size = 10
train_data = batchify(train_data, batch_size)  # shape ``[seq_len, batch_size]``
val_data = batchify(val_data, eval_batch_size)
test_data = batchify(test_data, eval_batch_size)

and got the following error in the Colab notebook:

HTTPError                                 Traceback (most recent call last)
<ipython-input-12-0398103be9c1> in <cell line: 10>()
      8 train_iter = WikiText2(split='train')
      9 tokenizer = get_tokenizer('basic_english')
---> 10 vocab = build_vocab_from_iterator(map(tokenizer, train_iter), specials=['<unk>'])
     11 vocab.set_default_index(vocab['<unk>'])
     12

54 frames
/usr/local/lib/python3.10/dist-packages/requests/models.py in raise_for_status(self)
   1019
   1020         if http_error_msg:
-> 1021             raise HTTPError(http_error_msg, response=self)
   1022
   1023     def close(self):

HTTPError: 403 Client Error: Forbidden for url: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip
This exception is thrown by __iter__ of HTTPReaderIterDataPipe(skip_on_error=False, source_datapipe=OnDiskCacheHolderIterDataPipe, timeout=None)

Also looks like advanced_source/ddp_pipeline.py might suffer from the same issue. (https://pytorch.org/tutorials/intermediate/pipeline_tutorial.html)
|
Hey Logan,

Thanks for your reply. The fix needs to go to the tutorial, not the source. The issue is that the source is not accessible. I am suggesting replacing it with a Hugging Face open dataset, such as WikiText2.

James
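For illustration, a minimal sketch of that dataset swap, assuming the Hugging Face Hub copy of WikiText-2 (the dataset, config, and field names below are assumptions, not taken from the tutorial):

# Sketch: load WikiText-2 from the Hugging Face Hub instead of the inaccessible S3 URL.
# Requires `pip install datasets`.
from datasets import load_dataset

wikitext = load_dataset('wikitext', 'wikitext-2-v1')   # splits: 'train', 'validation', 'test'

# Each example is a dict with a 'text' field, so an iterator of raw lines
# (the shape the tutorial's tokenizer/vocab code expects) is just:
train_iter = (example['text'] for example in wikitext['train'])

The rest of the tutorial's preprocessing could then stay as-is, since it only needs an iterable of raw text lines.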
|
/assigntome |
We no longer support this tutorial, as torchtext is no longer maintained. Can you please create a redirect file called beginner_source/transformer_tutorial.rst with the following content:

Language Modeling with nn.Transformer and torchtext
====================================================

The content is deprecated.

.. raw:: html

   <meta http-equiv="refresh" content="0; url=https://pytorch.org/tutorials/">
|
Torchtext is only used for the vocab and tokenizer. Could this be swapped out for an alternative library?
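For illustration, one possible swap is to drop the library dependency entirely and build the vocab with plain Python and torch. A minimal sketch (not the tutorial's code; the whitespace tokenizer below only approximates torchtext's basic_english tokenizer):

# Sketch: replace torchtext's get_tokenizer/build_vocab_from_iterator with plain Python.
from collections import Counter
import torch

def basic_english_tokenize(line: str) -> list[str]:
    # Crude stand-in for torchtext's 'basic_english' tokenizer.
    return line.lower().split()

def build_vocab(lines, specials=('<unk>',)) -> dict[str, int]:
    # Count tokens over the corpus and assign each token an integer id,
    # reserving the first ids for special tokens such as '<unk>'.
    counter = Counter()
    for line in lines:
        counter.update(basic_english_tokenize(line))
    itos = list(specials) + sorted(counter)
    return {token: idx for idx, token in enumerate(itos)}

# Usage, given an iterable of raw text lines:
# stoi = build_vocab(train_lines)
# unk = stoi['<unk>']
# ids = torch.tensor([stoi.get(tok, unk) for tok in basic_english_tokenize(line)], dtype=torch.long)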
|
I do believe the purpose of this tutorial was to use torchtext with pytorch. We actually don't have the source file of the tutorial in the repo anymore. |
@jamesdhope I'll be submitting a PR shortly to deprecate this tutorial. However, it does look like other tutorials make use of the |
Issue and Suggested Fix
Please can this helpful tutorial be updated to load the data with the Hugging Face datasets library (from datasets import load_dataset) and merged into main with the dependency issue workaround:

import torch
torch.utils.data.datapipes.utils.common.DILL_AVAILABLE = torch.utils._import_utils.dill_available()
import torchdata
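A hedged sketch of what the suggested fix might look like end to end: read WikiText-2 from the Hugging Face Hub and convert each split to a flat tensor of token ids, mirroring the tutorial's data_process step. The dataset, config, and field names are assumptions; the tokenizer and vocab still come from torchtext as in the tutorial.

import torch
from torch import Tensor
from datasets import load_dataset
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

wikitext = load_dataset('wikitext', 'wikitext-2-v1')   # splits: train / validation / test
tokenizer = get_tokenizer('basic_english')
vocab = build_vocab_from_iterator(
    (tokenizer(ex['text']) for ex in wikitext['train']), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])

def data_process(lines) -> Tensor:
    """Converts an iterable of raw text lines into a flat Tensor of token ids."""
    data = [torch.tensor(vocab(tokenizer(line)), dtype=torch.long) for line in lines]
    return torch.cat(tuple(filter(lambda t: t.numel() > 0, data)))

train_data = data_process(ex['text'] for ex in wikitext['train'])
val_data = data_process(ex['text'] for ex in wikitext['validation'])
test_data = data_process(ex['text'] for ex in wikitext['test'])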
Asset
https://colab.research.google.com/github/pytorch/tutorials/blob/gh-pages/_downloads/9cf2d4ead514e661e20d2070c9bf7324/transformer_tutorial.ipynb#scrollTo=TY5T9Gic_qih
Describe the bug
The tutorial no longer runs as published: the WikiText2 download used by torchtext fails with HTTP 403 Forbidden (see the HTTPError traceback quoted in this thread), and torchdata cannot be imported without the DILL_AVAILABLE workaround above.
Describe your environment
Google Colab environment. I have replicated the issue locally with the same pip package versions.
cc @sekyondaMeta @svekars @kit1980