Continuous Bag of Words (CBoW)
Predicts a middle word given its surrounding context words, i.e. predict the word $w_t$ in the middle, with index $t$
Use a sliding window of size $2m+1$: the $m$ words on each side of position $t$ form the context
Input: Start with one-hot vectors $x_{t+j} \in \{0,1\}^{|V|}$ for each context position $j \in \{-m, \dots, -1, 1, \dots, m\}$, where $|V|$ is the vocabulary size
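A minimal sketch of the one-hot input encoding in plain numpy, assuming a hypothetical toy vocabulary (the words and the `one_hot` helper are illustrative, not part of the original notes):

```python
import numpy as np

# Hypothetical toy vocabulary; |V| = 6.
vocab = ["the", "cat", "sat", "on", "a", "mat"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str, vocab_size: int) -> np.ndarray:
    """Return the one-hot vector x in {0,1}^|V| for a word."""
    x = np.zeros(vocab_size)
    x[word_to_idx[word]] = 1.0
    return x

x = one_hot("cat", len(vocab))
print(x)  # [0. 1. 0. 0. 0. 0.]
```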
Embedding Layer:
- A shared linear transformation $W$ of size $d \times |V|$, applied to each input one-hot vector $x_{t+j}$
- We get a dense vector $e_{t+j} = W x_{t+j}$ of size $d$, the size of the new word embedding, often chosen so that $d \ll |V|$ (typically a few hundred)
- This layer learns the word embeddings: since each $x_{t+j}$ is one-hot, the product $W x_{t+j}$ selects a single column of $W$, so the columns of $W$ are the learned embeddings (see the sketch below)
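A numpy sketch of the embedding layer, assuming toy sizes $|V| = 6$ and $d = 4$ and random stand-in values for $W$; it also verifies that multiplying by a one-hot vector is just a column lookup:

```python
import numpy as np

vocab_size, d = 6, 4                   # |V| = 6, embedding size d = 4 (toy values)
rng = np.random.default_rng(0)
W = rng.normal(size=(d, vocab_size))   # shared embedding matrix W, shape d x |V|

x = np.zeros(vocab_size)
x[1] = 1.0                             # one-hot vector for the word with index 1

e = W @ x                              # dense embedding e, shape (d,)
# Because x is one-hot, W @ x simply selects column 1 of W:
assert np.allclose(e, W[:, 1])
```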
Context Aggregation Layer:
- Sums all vectors $e_{t+j}$, which combines the contextual meaning vectors together
- Non-trainable layer, which produces $h_t = \sum_{j=-m,\, j \neq 0}^{m} e_{t+j}$
- Roughly encapsulates the meaning of the entire incomplete sentence of $2m$ input words (see the sketch below)
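A sketch of the aggregation step under the same toy assumptions (random stand-in embeddings); note that some formulations average instead of summing, which only rescales $h_t$ by $1/(2m)$:

```python
import numpy as np

d, m = 4, 2
rng = np.random.default_rng(0)
# The 2m context embeddings e_{t+j}, stacked as rows (toy random values).
E_context = rng.normal(size=(2 * m, d))

h = E_context.sum(axis=0)       # h_t = sum_j e_{t+j}; no parameters, nothing to train
h_avg = E_context.mean(axis=0)  # averaging variant, identical up to a 1/(2m) factor
```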
Prediction Layer:
- Final linear layer (matrix $U$, size $|V| \times d$) followed by a softmax function
- Produces an output $o \in \mathbb{R}^{|V|}$ with $o_k \geq 0$ so that $\sum_{k=1}^{|V|} o_k = 1$
- Predicts a probability distribution over the entire vocabulary $V$ for the missing word $w_t$ (see the sketch below)
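A numpy sketch of the prediction layer, again with toy sizes and random stand-in values for $U$ and $h_t$; the `softmax` helper is written out to show the normalization:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax: exponentiate, then normalize to sum to 1."""
    z = z - z.max()
    expz = np.exp(z)
    return expz / expz.sum()

vocab_size, d = 6, 4
rng = np.random.default_rng(0)
U = rng.normal(size=(vocab_size, d))  # output matrix U, shape |V| x d
h = rng.normal(size=d)                # aggregated context vector h_t

o = softmax(U @ h)                    # o in R^|V|, with o_k >= 0 and sum_k o_k = 1
assert np.isclose(o.sum(), 1.0)
```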
Training:
- Use a sliding window over a large text corpus to generate (context words, middle word) pairs
- Classification loss is preferably a negative logarithm of the probability assigned to the true middle word: $$L_t = -\log P(w_t \mid w_{t-m}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+m}) = -\log o_{w_t}$$ where $o_{w_t}$ is the component of the output distribution $o$ corresponding to the true middle word $w_t$
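Putting the pieces together, a runnable end-to-end sketch of CBoW training in plain numpy, assuming a toy corpus, small hyperparameters, and manually derived gradients (full softmax, no negative sampling); all names and values are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy corpus and vocabulary (hypothetical example data).
corpus = "the cat sat on the mat the cat lay on the mat".split()
vocab = sorted(set(corpus))
word_to_idx = {w: i for i, w in enumerate(vocab)}
ids = [word_to_idx[w] for w in corpus]

V, d, m, lr = len(vocab), 8, 2, 0.1
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(d, V))  # embedding matrix
U = rng.normal(scale=0.1, size=(V, d))  # output matrix

for epoch in range(50):
    total_loss = 0.0
    # Slide the window over positions t that have m words on both sides.
    for t in range(m, len(ids) - m):
        ctx = ids[t - m:t] + ids[t + 1:t + m + 1]  # 2m context word ids
        target = ids[t]

        h = W[:, ctx].sum(axis=1)         # aggregate context embeddings
        o = softmax(U @ h)                # predicted distribution over V
        total_loss += -np.log(o[target])  # L_t = -log o_{w_t}

        # Backprop through softmax + cross-entropy: d(logits) = o - onehot(target)
        dlogits = o.copy()
        dlogits[target] -= 1.0
        dh = U.T @ dlogits                # gradient w.r.t. the summed context vector
        U -= lr * np.outer(dlogits, h)
        for c in ctx:                     # the same dh flows back to every e_{t+j}
            W[:, c] -= lr * dh
```

With a realistic vocabulary the full softmax over $|V|$ is expensive, which is why practical implementations such as word2vec replace it with hierarchical softmax or negative sampling.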