So what is Doc2vec and where does it come from? In recent years some Google papers were published by Tomas Mikolov and friends about a neural network that could be trained to produce so-called paragraph vectors [1, 2, 3, 9]. The authors did not release software with their research papers. So others have tried to implement it. Doc2vec is an implementation of paragraph vectors by the authors of gensim, a much used library for numerical methods in the field of natural language processing (NLP). The name is different, but it is the same algorithm: doc2vec just sounds better than paragraph vectors. It is also a nicer name, because doc2vec builds upon the same neural network architectures that underly those other famous algorithms that go by the name word2vec. If you don’t know word2vec, Google it, there are plenty of resources where you can learn about it. Resources to learn about doc2vec, however, are just a bit less abundant, so in this post we’ll google it for you.
First, what does doc2vec do? Well, it gives you vectors of a fixed length–to be determined by you–that can represent text fragments of varying size, such as sentences, paragraphs, or documents. It achieves this by training a small neural network to perform a prediction task. To train a network, you need labels. In this case, the labels will come from the texts themselves. After the network has been trained, you can re-use a part of it, and this part will give you your sentence / paragraph / document vectors. These vectors can then be used in various algorithms, including document classification . One of the success factors for using doc2vec will be the answer to this: the task you are using the doc2vec vectors for, is it related to the way doc2vec was trained, in a useful way?
What benefits does doc2vec offer over other methods? There are many ways to represent sentences, paragraphs or documents as a fixed size vector. The simplest way is to create a vocabulary of all the words in a corpus, and represent each document by a vector that has an entry for each word in the vocabulary. But such a vector would be quite large, and it would be quite sparse, too (it would contain many zeroes). Some algorithms have difficulty working with sparse and high dimensional vectors. doc2vec yields vectors of a more manageable size, as determined by you. Again, there are many algorithms that do this for you, such as LDA , LSI , or Siamese CBOW , to name a recent one by a former colleague. To argue for the one or the other, what researchers would normally do is implement the prediction task they care about with several algorithms, and then measure which algorithm performed best. For example, in  paragraph vectors are compared to LDA for various tasks; the authors conclude that paragraph vectors outperform LDA. This does not mean that doc2vec will always be best for your particular application. But perhaps it is worth trying out. Running experiments with doc2vec is one way of learning about what it does, when you can use it, when it works well, and when it is less useful.
In terms of implementations, we’ve already mentioned the Doc2Vec class in gensim . There’s also an implementation in deeplearning4j . And Facebook’s fastText may have an implementation, too . Since I like working with Tensorflow, I’ve googled “doc2vec tensorflow” and found a promising, at first sight clean and consise implementation . And a nice discussion about a few lines of Tensorflow code as well, with the discussion shifting to the gensim implementation . Implementations in other low level neural network frameworks may exist.
Zooming in just a bit, it turns out that doc2vec is not just one algorithm, but rather it refers to a small group of alternative algorithms. And these are of course extended in new research. For example, in , a modified version of a particular doc2vec algorithm is used, according to a blog post about the paper . In that blog post, some details on the extension in  are given, based on correspondence with the authors of . An implementation of that extension is believed not to be there yet by the author of . In general, it may be impossible to recreate exactly the same algorithms as the authors of the original papers used. Rather, studying concrete implementations is another way of learning about how doc2vec algorithms work.
A third way of learning more is reading. If you know a bit about how neural networks work, you can start by checking the original papers [1, 2, 3, 9]. There are some notes on the papers, too, in blogs by various people [10, 14]. The papers and the blog posts leave some details to the reader, and are not primarily intended as lecture material. The Stanford course on deep learning for NLP has some good lecture notes on some of the algorithms leading up to doc2vec , but doc2vec itself is not covered. There are enough posts explaining how to use the gensim Doc2Vec class [5, 6, 8, 12]. Some of these posts do include some remarks on the workings of Doc2Vec [5, 6, 8] or even perform experiments with it [6, 8, 12]. But they do not really drill down to the details of the neural net itself. I could not find a blog post explaining the neural net layout in , or reporting on experiments with .
Now that you have come this far, wouldn’t it be nice to set out to take a look at how doc2vec, the algorithm, works? With the aim to add some detail and elaboration to the concise exposition in the original papers. And perhaps we can add and discuss some working code, if not too much of it is needed! Stay tuned for more on this.
- Quoc Le and Tomas Mikolov. Distributed Representations of Sentences and Documents. http://arxiv.org/pdf/1405.4053v2.pdf
- Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
- Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
- Andrew N. Dai, Christopher Olah, Quoc V. Le. Document Embedding with Paragraph Vectors, NIPS 2014.
- Tom Kenter, Alexey Borisov, Maarten de Rijke. Siamese CBOW: Optimizing Word Embeddings for Sentence Representations. ACL 2016.