
Commit 17db3bb

spacy-init
1 parent efa42b4 commit 17db3bb

File tree: 4 files changed, +139 -0 lines changed

src/spacy_nlp/ner_1.log

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
The abstract from the paper is the following:

Universal Image Segmentation is not a new concept. Past attempts to unify image segmentation in the last decades include scene parsing, panoptic segmentation, and, more recently, new panoptic architectures. However, such panoptic architectures do not truly unify image segmentation because they need to be trained individually on the semantic, instance, or panoptic segmentation to achieve the best performance. Ideally, a truly universal framework should be trained only once and achieve SOTA performance across all three image segmentation tasks. To that end, we propose OneFormer, a universal image segmentation framework that unifies segmentation with a multi-task train-once design. We first propose a task-conditioned joint training strategy that enables training on ground truths of each domain (semantic, instance, and panoptic segmentation) within a single multi-task training process. Secondly, we introduce a task token to condition our model on the task at hand, making our model task-dynamic to support multi-task training and inference. Thirdly, we propose using a query-text contrastive loss during training to establish better inter-task and inter-class distinctions. Notably, our single OneFormer model outperforms specialized Mask2Former models across all three segmentation tasks on ADE20k, CityScapes, and COCO, despite the latter being trained on each of the three tasks individually with three times the resources. With new ConvNeXt and DiNAT backbones, we observe even more performance improvement. We believe OneFormer is a significant step towards making image segmentation more universal and accessible.

Tips:

OneFormer requires two inputs during inference: image and task token.
During training, OneFormer only uses panoptic annotations.
If you want to train the model in a distributed environment across multiple nodes, you should update the get_num_masks function inside the OneFormerLoss class of modeling_oneformer.py. When training on multiple nodes, this should be set to the average number of target masks across all nodes, as can be seen in the original implementation here.
One can use OneFormerProcessor to prepare input images and task inputs for the model, as well as optional targets. OneFormerProcessor wraps OneFormerImageProcessor and CLIPTokenizer into a single instance that both prepares the images and encodes the task inputs.
To get the final segmentation, depending on the task, you can call post_process_semantic_segmentation(), post_process_instance_segmentation(), or post_process_panoptic_segmentation(). All three tasks can be solved using the OneFormerForUniversalSegmentation output; panoptic segmentation accepts an optional label_ids_to_fuse argument to fuse instances of the target object(s) (e.g. sky) together.

The figure below illustrates the architecture of OneFormer. Taken from the original paper.
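
A minimal inference sketch of the processor and post-processing flow described in the tips above, assuming the standard transformers OneFormer API; the checkpoint name (shi-labs/oneformer_ade20k_swin_tiny) and the image path are illustrative assumptions:

from PIL import Image
from transformers import OneFormerProcessor, OneFormerForUniversalSegmentation

# The processor wraps the image processor and the CLIP tokenizer for the task token.
processor = OneFormerProcessor.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")
model = OneFormerForUniversalSegmentation.from_pretrained("shi-labs/oneformer_ade20k_swin_tiny")

image = Image.open("example.jpg")  # placeholder input image

# Two inputs at inference time: the image and a task token ("semantic", "instance", or "panoptic").
inputs = processor(images=image, task_inputs=["semantic"], return_tensors="pt")
outputs = model(**inputs)

# Post-process according to the task; here, semantic segmentation.
semantic_map = processor.post_process_semantic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]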

src/spacy_nlp/ner_2.log

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
A transformer is a deep learning architecture -+--->
that relies on the parallel multi-head attention mechanism.[1] The modern transformer was proposed in the 2017 paper titled 'Attention Is All You Need' by Ashish Vaswani et al. of the Google Brain team. It is notable for requiring less training time than previous recurrent neural architectures, such as long short-term memory (LSTM),[2] and its later variations have been prevalently adopted for training large language models on large (language) datasets, such as the Wikipedia corpus and Common Crawl, by virtue of the parallelized processing of the input sequence.[3] Input text is split into n-grams encoded as tokens, and each token is converted into a vector by looking it up in a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished. Though the transformer paper was published in 2017, the softmax-based attention mechanism was proposed earlier, in 2014, by Bahdanau, Cho, and Bengio for machine translation,[4][5] and the Fast Weight Controller, similar to a transformer, was proposed in 1992 by Schmidhuber.[6][7][8]

This architecture is now used not only in -+---> natural language processing and computer vision,[9] but also in audio[10] and multi-modal processing. It has also led to the development of pre-trained systems, such as generative pre-trained transformers (GPTs)[11] and BERT[12] (Bidirectional Encoder Representations from Transformers).

Timeline of natural language processing models

Timeline

In 1990, the Elman network, using a -+---> recurrent network, encoded each word in a training set as a vector, called a static "word embedding", and the whole vocabulary as a vector database, allowing it to perform such tasks as sequence prediction that are beyond the power of a simple multilayer perceptron. A shortcoming of the static embeddings was that they didn't differentiate between multiple meanings of same-spelt words.[13]

In 1992, the Fast Weight Controller was published by Jürgen Schmidhuber.[6] It learns to answer queries by programming the attention weights of another neural network through outer products of key vectors and value vectors called FROM and TO. The Fast Weight


Fine-tuning BERT for named-entity recognition

In this notebook, we are going to use -+---> BertForTokenClassification which is included in the Transformers library by HuggingFace. This model has BERT as its base architecture, with a token classification head on top, allowing it to make predictions at the token level, rather than the sequence level. Named entity recognition is typically treated as a token classification problem, so that's what we are going to use it for.
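
A minimal sketch of that token-classification setup, assuming the bert-base-uncased checkpoint and an illustrative NER label set (neither is specified in this log):

from transformers import BertTokenizerFast, BertForTokenClassification

labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]  # example tag set, an assumption
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=len(labels))

# The token-classification head yields one logit vector per input token,
# which is how NER becomes a per-token classification problem.
encoding = tokenizer("HuggingFace is based in New York City", return_tensors="pt")
logits = model(**encoding).logits  # shape: (1, sequence_length, num_labels)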

This tutorial uses the idea of -+---> transfer learning, i.e. first pretraining a large neural network in an unsupervised way, and then fine-tuning that neural network on a task of interest. In this case, BERT is a neural network pretrained on 2 tasks: masked language modeling and next sentence prediction. Now, we are going to fine-tune this network on a NER dataset. Fine-tuning is supervised learning, so this means we will need a labeled dataset.

If you want to know more about -+---> BERT, I suggest the following resources:

the original paper
Jay Alammar's blog post as well as his tutorial
Chris McCormick's YouTube channel
Abhishek Kumar Mishra's YouTube channel

The following notebook largely follows the same structure as the tutorials by Abhishek Kumar Mishra. For his tutorials on the Transformers library, see his GitHub repository.

NOTE: this notebook assumes basic knowledge about deep learning, BERT, and native PyTorch. If you want to learn more about Python, deep learning, and PyTorch, I highly recommend cs231n by Stanford University and the FastAI course by Jeremy Howard et al. Both are freely available on the web.

Now, let's move on to the real stuff!

Importing Python Libraries and preparing the environment

This notebook assumes that you have the following libraries installed:

pandas
numpy
sklearn
pytorch
transformers
seqeval

src/spacy_nlp/reqmts_spacy_1.log

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
aiohttp==3.8.4
aiosignal==1.3.1
annotated-types==0.5.0
async-timeout==4.0.2
attrs==22.2.0
blis==0.7.10
catalogue==2.0.9
certifi @ file:///croot/certifi_1671487769961/work/certifi
charset-normalizer==3.1.0
click==8.1.3
confection==0.1.2
cymem==2.0.7
Flask==2.2.3
frozenlist==1.3.3
idna==3.4
itsdangerous==2.1.2
Jinja2==3.1.2
langcodes==3.3.0
MarkupSafe==2.1.2
multidict==6.0.4
murmurhash==1.0.9
numpy==1.25.2
openai==0.27.2
packaging==23.1
pathy==0.10.2
preshed==3.0.8
pydantic==2.3.0
pydantic_core==2.6.3
requests==2.28.2
smart-open==6.3.0
spacy==3.6.1
spacy-legacy==3.0.12
spacy-loggers==1.0.4
srsly==2.4.7
thinc==8.1.12
tqdm==4.65.0
typer==0.9.0
typing_extensions==4.7.1
urllib3==1.26.15
wasabi==1.1.2
Werkzeug==2.2.3
yarl==1.8.2

src/spacy_nlp/src/test_1.py

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
# SOURCE -- https://realpython.com/natural-language-processing-spacy-python/#installation-of-spacy

import spacy

nlp = spacy.load("en_core_web_sm")
print(type(nlp))  # <class 'spacy.lang.en.English'>

"""
To start processing your input, you construct a Doc object.
A Doc object is a sequence of Token objects representing a lexical token.
Each Token object has information about a
particular piece—typically one word—of text.
"""

introduction_doc = nlp("This tutorial is about Natural Language Processing in spaCy.")
print(type(introduction_doc))  # <class 'spacy.tokens.doc.Doc'>

# On each Token object, the .text attribute returns the text contained within that token.
ls_test = [token.text for token in introduction_doc]
print(ls_test)
# ['This', 'tutorial', 'is', 'about', 'Natural', 'Language', 'Processing', 'in', 'spaCy', '.']

import pathlib

#file_name = "/home/dhankar/temp/08_23/spacy_1/ner_1.log"
file_name = "/home/dhankar/temp/08_23/spacy_1/ner_2.log"

introduction_doc = nlp(pathlib.Path(file_name).read_text(encoding="utf-8"))
#print([token.text for token in introduction_doc])  # OK

# Default sentence segmentation from the pipeline.
sentences = list(introduction_doc.sents)
print(len(sentences))
for sentence in sentences:
    print(f"{sentence[:5]} ----> FIRST FIVE TOKENS ONLY")

# Custom sentence-boundary component -+---> set_custom_boundaries
from spacy.language import Language

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    """Mark the token following each "-+--->" marker as a sentence start."""
    for token in doc[:-1]:
        if token.text == "-+--->":
            doc[token.i + 1].is_sent_start = True
    return doc

custom_nlp = spacy.load("en_core_web_sm")
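
# Possible next step (a sketch, following the Real Python tutorial cited above, not part of
# this commit): add the registered component before the parser and re-run segmentation.
custom_nlp.add_pipe("set_custom_boundaries", before="parser")

custom_doc = custom_nlp(pathlib.Path(file_name).read_text(encoding="utf-8"))
custom_sentences = list(custom_doc.sents)
print(len(custom_sentences))  # the "-+--->" markers now also open new sentences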

0 commit comments
