Using fine-tuned Gensim Word2Vec Embeddings with Torchtext and Pytorch
This is a quick little hack I came up with while working on a Sequence-to-Sequence architecture on scientific documents recently. In case you are unaware, Torchtext is a Python library that makes preprocessing of text data immensely easy. This involves creating a vocabulary, padding sequences to equal length, generating vector embeddings, and creating batch iterators. All of these are essential tasks in any NLP workflow and can take hours to code by hand. Torchtext, on the other hand, helps you get up and running in under an hour.
An essential factor in improving any NLP model performance is choosing the correct word embeddings. These embeddings help capture the context of each word in your particular dataset, which helps your model understand each word better.
Vector Embeddings with TorchText
Torchtext handles creating vector embeddings for words in your dataset in the following way. You first create a Field object that defines how the text in your dataset is going to be pre-processed.
from torchtext.data import Field  # older torchtext; newer releases moved Field to torchtext.legacy.data and later removed it
import spacy

def tokenize(sentence):
    sentence = sentence.strip()
    sentence = sentence.lower()
    sentence = sentence.replace('\n', '')
    return [tok.text for tok in en.tokenizer(sentence)]

en = spacy.load('en_core_web_sm')
TEXT = Field(tokenize=tokenize, lower=True, init_token='<sos>', eos_token='<eos>')
Once you load your respective dataset using this TEXT Field, the next step is to create a vocabulary based on all the unique words it encountered. This is also the step at which the Field needs to know what the vector embeddings for each of those words would be. You have the following options:
Option 1: Randomly initialized embeddings
If no specific vector embeddings are specified, Torchtext attaches no pre-trained vectors to the vocabulary, so the model's embedding layer starts from random values which get updated during training through backpropagation.
TEXT.build_vocab(trn)
(where trn is an instance of the torchtext TabularDataset class created by loading the raw text corpus.)
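To make the vocabulary step a bit more concrete, here is a rough pure-Python sketch of what building a vocabulary boils down to: count the tokens, drop the rare ones, and give every surviving token an index. This is not Torchtext's actual implementation; the function name build_vocab_sketch, its arguments, and the ordering of the special tokens are assumptions made for illustration.

```python
from collections import Counter

def build_vocab_sketch(tokenized_docs, min_freq=1,
                       specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Hypothetical helper: map tokens to indices, special tokens first."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    itos = list(specials)
    # Most frequent tokens first; ties are broken alphabetically.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in specials:
            itos.append(tok)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

docs = [["deep", "learning", "rocks"], ["deep", "learning"], ["deep"]]
itos, stoi = build_vocab_sketch(docs, min_freq=2)
print(itos)  # ['<unk>', '<pad>', '<sos>', '<eos>', 'deep', 'learning']
```

The token 'rocks' appears only once, so with min_freq=2 it never gets an index of its own. This is the same kind of filtering that min_freq performs in TEXT.build_vocab.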
Option 2: Pre-trained Glove vectors
Glove is one of the most popular types of vector embeddings used for NLP tasks. Many pre-trained Glove embeddings have been trained on large amounts of news articles, Twitter data, blogs, etc. Fortunately, Torchtext works great with Glove embeddings, and using them is as easy as just passing the specific pre-trained embeddings you want.
pre_trained_vector_type = 'glove.6B.200d'
TEXT.build_vocab(trn, vectors = pre_trained_vector_type)
Option 3: Fine-tuning Glove vectors
Sometimes you have a very specific dataset that contains words not present in the pre-trained Glove vocabulary. This is something I was struggling with, since I was working with scientific texts which contained unusual words such as ‘neuralnetworks’, ‘reinforcement’, ‘backpropagation’, and many more. Using pre-trained vectors wouldn’t have helped, since they wouldn’t contain such words and Torchtext would give them uninformative placeholder vectors (all zeros by default, unless you pass a different unk_init). An approach mentioned in this article describes how to fine-tune pre-trained Glove vectors using the Mittens Python library to include such unique words.
import torchtext.vocab as vocab

fine_trained_vectors = vocab.Vectors('path to fine-tuned file')
TEXT.build_vocab(trn, vectors=fine_trained_vectors)
Note: The approach described in the article linked above is very inefficient since it creates a co-occurrence matrix that is stored in memory. It only works if you have loads of RAM to spare. I didn’t have that luxury, and it exceeded the RAM limit on Google Colab as well, so I would suggest Option 4, which is much easier and more efficient.
Option 4: Fine-Tuned Word2Vec vectors
Word2Vec vectors can be fine-tuned on your dataset easily with the help of the gensim library:
import gensim
import numpy as np

# WORD2VEC
W2V_SIZE = 300
W2V_WINDOW = 7
W2V_EPOCH = 100
W2V_MIN_COUNT = 2

# Collect corpus for training word embeddings
# (train is assumed to be a pandas DataFrame with summary and title columns)
documents = [tokenize(_text) for _text in np.array(train.summary)]
documents = documents + [tokenize(_text) for _text in np.array(train.title)]

# Build the Word2Vec vocabulary (gensim 3.x API; gensim 4+ renames size to vector_size)
w2v_model = gensim.models.word2vec.Word2Vec(size=W2V_SIZE, window=W2V_WINDOW, min_count=W2V_MIN_COUNT)
w2v_model.build_vocab(documents)
words = w2v_model.wv.vocab.keys()  # gensim 4+: w2v_model.wv.key_to_index
vocab_size = len(words)
print("Vocab size", vocab_size)

# Train Word Embeddings and save
w2v_model.train(documents, total_examples=len(documents), epochs=W2V_EPOCH)
w2v_model.save('embeddings.txt')
However, you cannot load the saved embeddings.txt file with vocab.Vectors(…) like the Glove embeddings. Despite the .txt name, w2v_model.save writes a binary (pickled) file, and the information Gensim stores at the start of the file for loading leads to an encoding error when it is read with vocab.Vectors(…). This is how the file looks:
€ cgensim.models.word2vec
Word2Vec
q ) q }q (X max_final_vocabq NX callbacksq )X loadq cgensim.utils
call_on_class_only
q X wvq cgensim.models.keyedvectors
Word2VecKeyedVectors
q ) q }q
(X vectorsq cnumpy.core.multiarray
_reconstruct
qcnumpy
ndarray
q
K …q c_codecs
encode
q X bq X latin1q †q Rq ‡q Rq (K M¿hM, †q cnumpy
dtype
q X f4q K K ‡q Rq (K X <q NNNJ???ÿJ???ÿK tq b‰h X£9á ÓNî¾ Ê ¿ Ž*?îà–¿6Pù¾©ãŒ¿*aH>L £>Y 3¿F ë¾ßý¾^¢Q?w› ?ª˜©>*¹#¿¼ðÇ¿ †I¿ƒrD?$ 羚´û¿¹W>½ïSÁ¾L\ƾ\» ¾q z?0p ?pËg?)ns»¤ýӾ󈓿0©|<…®¿eÒ$¿}Gl¿˜Cø¿‡JÈ?Ê©ò>õf‘>F>̾ £Ã?²É=Î v¿ŒI ?Ò—ú=`% ¿”4Ž>“UÅ= xc? ¿¶;o¿º7@¾3 ˆ¾“S ¿UL >ˆz°¿ñjê¾³Þ ¿û£/¿X…~½$[Œ¿5às>ÒÝw¿?׊¿oÅT¿(»w?)€À¾ 3¿ P ¿^ >£Ôº? ôÿ% T>/{R=³ÿ†>§ Ú¾öÞä>ïeM=x³ ?Dð\¿Ž±L¿–•õ¿ÆÆ ?dÀt> ©$¿ù R¿gë ¿¥ ½¾kYg>̽¼`s#¿ã»å=ä|ø¿Á u>u 1¿„L¢¾” w¿»®¢¾ÂÚ®¿–J÷½ñ 2?n¾+¿ý V¿‚‿Õ\©½q° ? œØ¾ s
¿ :¿ ¿Œ
.....
.....
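For contrast, Glove-style vector files are plain text: one token per line, followed by its vector components separated by spaces. The sketch below writes and re-reads that layout with toy data, just to show the difference from the pickled dump above. The helper names write_text_vectors and read_text_vectors and the toy vectors are made up for illustration; this is not Torchtext's own parser.

```python
import io

def write_text_vectors(handle, vectors):
    """Hypothetical helper: write one 'token v1 v2 ...' line per entry."""
    for token, values in vectors.items():
        handle.write(token + " " + " ".join(f"{v:.4f}" for v in values) + "\n")

def read_text_vectors(handle):
    """Hypothetical helper: parse the layout written above back into a dict."""
    parsed = {}
    for line in handle:
        token, *values = line.rstrip("\n").split(" ")
        parsed[token] = [float(v) for v in values]
    return parsed

toy_vectors = {"neural": [0.1, -0.2, 0.3], "network": [0.0, 0.5, -0.5]}

buffer = io.StringIO()  # stands in for a real file on disk
write_text_vectors(buffer, toy_vectors)
text = buffer.getvalue()
print(text)

buffer.seek(0)
round_trip = read_text_vectors(buffer)
```

Every line of this format is human-readable, which is why a loader that expects it chokes on the pickle header shown above.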
To work around this issue, we need to leverage the gensim Word2Vec class to set the vectors in the Torchtext TEXT Field.
Step 1: We first build the vocabulary in the TEXT Field as before. However, we need to use the same minimum word frequency as the Word2Vec model, so that both filter out the same rare words:
import torchtext.vocab as vocab
from tqdm import tqdm_notebook

# build vocab
TEXT.build_vocab(trn, min_freq=W2V_MIN_COUNT)
Step 2: Load the saved embeddings.txt file using gensim.
w2v_model = gensim.models.word2vec.Word2Vec.load('embeddings.txt')
Step 3: We set the vectors manually for each word in the vocabulary using TEXT.vocab.set_vectors(…). It accepts the following arguments (according to the Torchtext documentation):
- stoi: A dictionary of string to the index of the associated vector in the vectors input argument. This can be obtained using TEXT.vocab.stoi.
- vectors: An indexed iterable that, given an input index, returns a FloatTensor representing the vector for the token associated with the index. For example, vectors[stoi["string"]] should return the vector for "string".
- dim: The dimensionality of the vectors. This is given by W2V_SIZE when training the Word2Vec embeddings.
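The way these three arguments fit together can be sketched in plain Python. This is a simplified model of the contract described in the list above, not Torchtext's source code. The name set_vectors_sketch is hypothetical, plain lists stand in for tensors, and the zero fallback for tokens missing from stoi is an assumption.

```python
def set_vectors_sketch(itos, stoi, vectors, dim):
    """Hypothetical helper: build one row per vocabulary entry.

    itos    -- vocabulary tokens in index order
    stoi    -- maps a token to the position of its vector inside `vectors`
    vectors -- indexable collection of vectors
    dim     -- expected length of every vector
    """
    table = []
    for token in itos:
        idx = stoi.get(token)
        if idx is None:
            # Assumed fallback: tokens without a vector get all zeros.
            table.append([0.0] * dim)
            continue
        row = list(vectors[idx])
        if len(row) != dim:
            raise ValueError(f"vector for {token!r} has length {len(row)}, expected {dim}")
        table.append(row)
    return table

itos = ["<unk>", "cat", "dog"]
source_stoi = {"cat": 0, "dog": 1}
source_vectors = [[1.0, 2.0], [3.0, 4.0]]

table = set_vectors_sketch(itos, source_stoi, source_vectors, dim=2)
print(table)  # [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
```

The point to take away is that vectors[stoi[token]] must be the vector for that exact token, so the order in which you collect the vectors matters.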
Special tokens, i.e. <unk>, <pad>, <eos>, <sos>, which are used for sequence-to-sequence or language generation tasks, are not considered when training the Word2Vec embeddings but are present in the TEXT Field vocabulary. Therefore, these special tokens and any other extra tokens in the vocabulary are initialized with an all-zero vector embedding.
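import torch

word2vec_vectors = []
# Walk the vocabulary in index order so that word2vec_vectors[i] lines up with TEXT.vocab.itos[i]
for token in tqdm_notebook(TEXT.vocab.itos):
    if token in w2v_model.wv.vocab:  # gensim 4+: w2v_model.wv.key_to_index
        word2vec_vectors.append(torch.FloatTensor(w2v_model.wv[token]))
    else:
        # special tokens and anything Word2Vec never saw get an all-zero vector
        word2vec_vectors.append(torch.zeros(W2V_SIZE))

TEXT.vocab.set_vectors(TEXT.vocab.stoi, word2vec_vectors, W2V_SIZE)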
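A quick sanity check after a loop like this is to count how many vocabulary tokens actually received a Word2Vec vector and which ones fell back to zeros. If the minimum frequencies match, ideally only the special tokens should be missing. Below is a small sketch of such a check on toy data; the helper embedding_coverage and the token lists are hypothetical stand-ins for TEXT.vocab.itos and the Word2Vec vocabulary.

```python
def embedding_coverage(vocab_tokens, pretrained_tokens):
    """Hypothetical helper: return (covered fraction, tokens without a vector)."""
    known = set(pretrained_tokens)
    missing = [tok for tok in vocab_tokens if tok not in known]
    covered = len(vocab_tokens) - len(missing)
    return covered / len(vocab_tokens), missing

# Toy stand-ins for TEXT.vocab.itos and the tokens known to the Word2Vec model.
vocab_tokens = ["<unk>", "<pad>", "<sos>", "<eos>", "neural", "network", "gradient"]
pretrained_tokens = ["neural", "network", "gradient", "descent"]

ratio, missing = embedding_coverage(vocab_tokens, pretrained_tokens)
print(f"coverage: {ratio:.2%}")
print("zero-initialized tokens:", missing)
```

If ordinary words show up in the missing list, it usually means the two vocabularies were filtered or tokenized differently.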
Loading the Vector Embeddings with Pytorch
Finally, load these vector embeddings into a Pytorch model using the nn.Embedding layer.
import torch
import torch.nn as nn
pre_trained_emb = torch.FloatTensor(TEXT.vocab.vectors)
embedding = nn.Embedding.from_pretrained(pre_trained_emb)
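One detail worth knowing here: nn.Embedding.from_pretrained freezes the weights by default, so the vectors will not be updated during training unless you pass freeze=False. The sketch below shows this with a tiny made-up weight matrix standing in for TEXT.vocab.vectors; the tensor values and variable names are illustrative only.

```python
import torch
import torch.nn as nn

# Toy weight matrix: 3 tokens, 2 dimensions (stand-in for TEXT.vocab.vectors).
weights = torch.tensor([[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]])

frozen = nn.Embedding.from_pretrained(weights)                   # freeze=True by default
trainable = nn.Embedding.from_pretrained(weights, freeze=False)  # weights receive gradients

batch = torch.tensor([[1, 2, 0]])  # one sequence of three token indices
output = frozen(batch)             # looks up one row per index

print(output.shape)                    # torch.Size([1, 3, 2])
print(frozen.weight.requires_grad)     # False
print(trainable.weight.requires_grad)  # True
```

Whether to freeze is a judgment call: frozen vectors keep what Word2Vec learned, while trainable ones can adapt to the task but may drift on small datasets.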
If you run into any issues, please leave a response below. Hope this helped :)