Previously, we built a simple PV-DBOW-‘like’ model (https://amsterdam.luminis.eu/2017/02/21/coding-doc2vec/). We made a couple of choices, e.g., about how to generate training batches, how to compute the loss function, etc. In this blog post, we’ll take a look at the choices made in the popular gensim library. First, we’ll convince ourselves that we indeed implemented more or less the same thing :-). Then, by looking at the differences, we’ll get ideas to improve and extend our own implementation (of course, this could work both ways ;-)). The first extension we are interested in is inferring a document vector for a new document. We’ll discuss how the gensim implementation achieves this.
Disclaimer: You will notice that we’ve written this blog post in a somewhat dry, bullet-point style. You may use it for reference if you ever want to work on doc2vec. We plan to, anyway. If you see mistakes in our eyeball-interpretation of what gensim does, feel free to (gently) correct us; please refer to the same git commit version of the code against which we wrote this blog post, and use line numbers to point to code.
Code walk through of gensim’s PV-DBOW
We’ll start with the non-optimized Python module doc2vec (https://github.com/RaRe-Technologies/gensim/blob/8766edcd8e4baf3cfa08cdc22bb25cb9f2e0b55f/gensim/models/doc2vec.py). Note that the link is to the specific version against which this blog post was written. To narrow it down, and to stay as close as possible to our own PV-DBOW implementation, we’ll first postulate some assumptions:
- We’ll initialize the Doc2Vec class as follows:
d2v = Doc2Vec(dm=0, **kwargs). That is, we’ll use the PV-DBOW flavour of doc2vec.
- We’ll use just one, unique, ‘document tag’ for each document.
- We’ll use negative sampling.
The first thing to note about the Doc2Vec class is that it subclasses the Word2Vec class, overriding some of its methods. By prefixing methods with the class name, we’ll denote exactly which method is called. By deduction, the superclass object is then initialized as follows, in lines 640-643:
Word2Vec(sg=1, null_word=0, **kwargs)
sg stands for Skip-Gram. Remember from elsewhere on the net that the Skip-Gram Word2Vec model is trained to predict surrounding words (for any word in a corpus).
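As a reminder of what Skip-Gram training examples look like, here is a minimal sketch in plain Python. The function name `skipgram_pairs` and the fixed window size are our own assumptions for illustration; this is not gensim code.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs: each word predicts its neighbours.

    Hypothetical helper; a fixed window is a simplification of this sketch.
    """
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield center, tokens[j]


pairs = list(skipgram_pairs(["the", "cat", "sat"], window=1))
```

In PV-DBOW, as we will see, the center word is replaced by the document tag.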
Word2Vec.train() is called: a model is trained. This method takes care of some parallelisation that we will not go into here. At some point, however,
Doc2Vec._do_train_job() is called: in a single job a number of documents is trained on. Since we have self.sg = 1,
Doc2Vec.train_document_dbow() is called there, for each document in the job.
In this method, the model is trained to predict each word in the document. For this,
Word2Vec.train_sg_pair() is used. Only, instead of two words, this method now receives the document tag and a word: the task is to correctly predict each word given the document tag. In this method, weights are updated. It seems, then, that at each stochastic gradient descent iteration, only one training example is used.
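To make the idea of a single-example update concrete, here is a rough sketch in plain Python of one SGD step with negative sampling on a (document tag, word) pair. It is not gensim’s code: the names `train_pair` and `word_score`, the parameters, and the toy weights are all hypothetical, and details such as how the negative words are sampled are left out.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_pair(doc_vec, out_weights, word_idx, negatives, alpha=0.025):
    """One SGD step on a single (document tag, word) example.

    Hypothetical helper for illustration; not gensim's train_sg_pair().
    doc_vec      the document's input weights, updated in place
    out_weights  one output row per vocabulary word, updated in place
    word_idx     index of the word to predict (label 1)
    negatives    indices of sampled noise words (label 0)
    """
    dim = len(doc_vec)
    doc_update = [0.0] * dim
    targets = [(word_idx, 1.0)] + [(n, 0.0) for n in negatives]
    for idx, label in targets:
        row = out_weights[idx]
        score = sigmoid(sum(d * w for d, w in zip(doc_vec, row)))
        g = (label - score) * alpha  # prediction error times learning rate
        for k in range(dim):
            doc_update[k] += g * row[k]  # uses row[k] before it is changed
            row[k] += g * doc_vec[k]
    # The document vector is only moved after all targets have been scored.
    for k in range(dim):
        doc_vec[k] += doc_update[k]


def word_score(doc_vec, out_weights, word_idx):
    return sigmoid(sum(d * w for d, w in zip(doc_vec, out_weights[word_idx])))


# Toy example: a 2-dimensional hidden space and a 3-word vocabulary.
doc_vec = [0.1, -0.2]
out_weights = [[0.3, 0.1], [-0.2, 0.4], [0.05, 0.05]]
before = word_score(doc_vec, out_weights, 0)
train_pair(doc_vec, out_weights, word_idx=0, negatives=[1])
after = word_score(doc_vec, out_weights, 0)
```

After the step, the score of the word that occurs in the document goes up a little and the score of the sampled noise word goes down; rows of words that were not involved stay untouched.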
Comparison of ours and gensim’s Doc2Vec implementation
By just eyeballing the code, at first sight, the following similarities and differences stand out:
- The network architecture used seems the same in both implementations: the input layer has as many neurons as there are documents in the corpus, there is one hidden layer, and the output layer has the vocabulary size.
- Neither implementation uses regularisation.
- One training example (document, term) per SGD iteration is used by gensim, whereas we allow computing the loss function over multiple training examples.
- All terms in a document are offered to SGD right after each other in gensim, whereas we generate batches consisting of several term windows from different documents.
- In gensim, the order in which training documents are offered is the same order each epoch; we randomize the order of term windows again each epoch.
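The difference in how training examples are ordered can be illustrated with a small sketch. Both generators below are hypothetical simplifications: they work on plain (document id, term) pairs rather than term windows, and neither is taken from gensim or from our earlier code.

```python
import random


def sequential_examples(docs):
    """gensim-like order as we read it: all terms of a document, one
    example at a time, documents in corpus order."""
    for doc_id, terms in enumerate(docs):
        for term in terms:
            yield doc_id, term


def shuffled_batches(docs, batch_size, rng):
    """Our approach, simplified: batches that mix examples from different
    documents, reshuffled on every call (that is, every epoch)."""
    examples = list(sequential_examples(docs))
    rng.shuffle(examples)
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


docs = [["a", "b"], ["c"]]
in_order = list(sequential_examples(docs))
batches = list(shuffled_batches(docs, batch_size=2, rng=random.Random(0)))
```

Both approaches see exactly the same examples in an epoch; only the order and the grouping into SGD steps differ.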
Inferring document vectors
Given a new, unseen document, using
Doc2Vec.infer_vector(), a document vector can be estimated anyway. How? Well, the idea is that we keep the word classifier that operates on the hidden space fixed. In other words, we keep the weights between the hidden layer and the output layer fixed. Now, we train a new mini-network with just one input neuron (the new document id). We optimize the network such that the document gets an optimal position in the hidden space. In other words, again, we train only the weights that connect this new document id to the hidden layer. How does gensim initialize the weights for the new input neuron? Randomly, with small values, just as was done for the initial documents that were trained on. The training procedure consists of a fixed number of steps (we can choose how many). At each step, each word in the document is offered as a training example, one after the other.
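The inference procedure described above could be sketched roughly as follows, in plain Python. This is our own illustration under stated assumptions, not gensim’s Doc2Vec.infer_vector(): the function name `infer_vector_sketch`, its parameters, the initialization formula, the constant learning rate and the uniform sampling of noise words are all hypothetical simplifications.

```python
import math
import random


def infer_vector_sketch(doc_words, out_weights, steps=5, alpha=0.1,
                        n_negatives=2, seed=0):
    """Estimate a vector for an unseen document; out_weights is only read.

    Hypothetical helper, not gensim's Doc2Vec.infer_vector().
    doc_words    vocabulary indices of the words in the new document
    out_weights  trained hidden-to-output rows, one per vocabulary word
    """
    rng = random.Random(seed)
    vocab_size, dim = len(out_weights), len(out_weights[0])
    # Small random starting weights (this exact formula is an assumption).
    doc_vec = [(rng.random() - 0.5) / dim for _ in range(dim)]
    for _ in range(steps):
        for word in doc_words:  # one training example per word
            noise = [rng.randrange(vocab_size) for _ in range(n_negatives)]
            targets = [(word, 1.0)] + [(n, 0.0) for n in noise if n != word]
            update = [0.0] * dim
            for idx, label in targets:
                row = out_weights[idx]
                dot = sum(d * w for d, w in zip(doc_vec, row))
                score = 1.0 / (1.0 + math.exp(-dot))
                g = (label - score) * alpha
                for k in range(dim):
                    update[k] += g * row[k]
            # Only the new document's weights move; the classifier is fixed.
            for k in range(dim):
                doc_vec[k] += update[k]
    return doc_vec


# Toy classifier: word 0 and word 2 point in opposite directions.
out_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
frozen = [row[:] for row in out_weights]
new_doc_vec = infer_vector_sketch([0, 0, 0], out_weights)
```

In this toy setup, a document consisting only of word 0 ends up on the side of the hidden space where word 0 scores higher than word 2, while the output weights are left exactly as they were.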
We’ve seen that gensim’s implementation and ours implement roughly the same thing, although there are a number of differences. This strengthens our confidence in our small ‘proof-of-concept’ implementation of doc2vec (https://amsterdam.luminis.eu/2017/02/21/coding-doc2vec/). We’ve also eyeballed how gensim’s doc2vec implementation manages to infer a document vector for an unseen document; we are now in a position to extend our own implementation to do the same. Of course, you can do it yourself, too!