As for the first difference pointed out in the article, one of the CS224D lectures on word2vec did address it:
https://youtu.be/aRqn8t1hLxs?t=2650
It was also mentioned later in the lecture that having two vectors representing each word is meant to make the optimisation easier (so it's kind of a trick); at the end, the two vectors learnt will have to be averaged over in order to reach a single vector for each word.
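The averaging step described above can be sketched as follows. This is a toy illustration, not the lecture's code: the matrix names `W_in` and `W_out` and the vocabulary/dimension sizes are assumptions, but the idea matches the lecture — the model learns an input (center-word) matrix and an output (context-word) matrix, and a single embedding per word is obtained by averaging the two:

```python
import numpy as np

# Assumed setup: word2vec learns two matrices, each of shape
# (vocab_size, dim) -- one row per word.
vocab_size, dim = 5, 4
rng = np.random.default_rng(0)
W_in = rng.normal(size=(vocab_size, dim))   # input (center-word) vectors
W_out = rng.normal(size=(vocab_size, dim))  # output (context-word) vectors

# Average the two learnt vectors to get one final vector per word.
W_final = (W_in + W_out) / 2.0
print(W_final.shape)  # (5, 4)
```

In practice some implementations simply discard the output vectors instead of averaging; both choices yield one vector per word.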
To be fair, the fact that each word is represented by two vectors was also mentioned in the original paper describing word2vec:
https://arxiv.org/pdf/1310.4546.pdf
On page 3, just beneath equation (2).
Why so surprised?