ELMo

Embeddings from Language Models - Deep Contextualised Word Representations

ELMo takes an entire sentence as input and produces a separate embedding for each word in it.

ELMo decomposes word inputs into characters, which are then represented as one-hot vectors and fed as inputs. This handles OOV words naturally and keeps the input dimensionality lower than with word-level one-hot vectors.
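As a rough sketch (PyTorch; the byte-level character vocabulary and padding length here are my own illustrative assumptions, not the original model's exact settings), a word becomes a small matrix of one-hot character vectors:

```python
# A minimal sketch of turning one word into one-hot character vectors.
import torch
import torch.nn.functional as F

def word_to_one_hot(word, n_chars=256, max_len=10):
    # Map each character to its byte id, pad the word to a fixed length,
    # then expand every id into a one-hot vector.
    char_ids = list(word.encode("utf-8"))[:max_len]
    char_ids += [0] * (max_len - len(char_ids))          # 0 used as padding id (assumed)
    ids = torch.tensor(char_ids)
    return F.one_hot(ids, num_classes=n_chars).float()   # (max_len, n_chars)

print(word_to_one_hot("cats").shape)   # torch.Size([10, 256])
# An unseen word still maps to valid character vectors, so there is no OOV problem.
```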

  1. Input words as lists of characters into a character-level network
  2. To get the embedding of a word, input the whole sentence for context, then take only the vector corresponding to that word
  3. Use multiple layers of RNNs at the word level
  4. Use highway layers as transitions in the middle
  5. Train the whole network on the task of predicting the next word in the sentence (a sketch of this objective follows the list)
  6. Extract the embedding layers as feature representations
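A minimal sketch of step 5 (PyTorch; `word_vectors` stands in for the output of the character-level network, and all names and sizes are assumptions, not the original configuration): the network is trained as a bidirectional language model, predicting the next word from the forward states and the previous word from the backward states.

```python
import torch
import torch.nn as nn

vocab_size, word_dim, hidden = 10_000, 2048, 512

fwd_lstm = nn.LSTM(word_dim, hidden, batch_first=True)   # left-to-right
bwd_lstm = nn.LSTM(word_dim, hidden, batch_first=True)   # right-to-left
to_vocab = nn.Linear(hidden, vocab_size)
loss_fn  = nn.CrossEntropyLoss()

def bilm_loss(word_vectors, word_ids):
    # word_vectors: (batch, sent_len, word_dim) from the character-level network
    # word_ids:     (batch, sent_len) gold word indices
    h_fwd, _ = fwd_lstm(word_vectors)
    h_bwd, _ = bwd_lstm(torch.flip(word_vectors, dims=[1]))
    h_bwd = torch.flip(h_bwd, dims=[1])
    # Forward direction predicts the *next* word, backward the *previous* one.
    fwd = loss_fn(to_vocab(h_fwd[:, :-1]).reshape(-1, vocab_size),
                  word_ids[:, 1:].reshape(-1))
    bwd = loss_fn(to_vocab(h_bwd[:, 1:]).reshape(-1, vocab_size),
                  word_ids[:, :-1].reshape(-1))
    return fwd + bwd
```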

Architecture

Character-level vectors go through 1D Convolutional Neural Networks with different kernel sizes - the original model used kernels of size 1, 2, 3, 4, 5, 6, 7 with 32, 32, 64, 128, 256, 512, 1024 channels respectively.

The output of each convolution is max-pooled over character positions and the results are concatenated. This concatenated vector of size 2048 can be used as a first word embedding; it captures character-level features, but does not benefit from sentence context yet.
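A minimal sketch of this character CNN (PyTorch; the module and argument names are my own, but the kernel sizes and channel counts follow the numbers above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharCNN(nn.Module):
    def __init__(self, n_chars=256,
                 kernels=(1, 2, 3, 4, 5, 6, 7),
                 channels=(32, 32, 64, 128, 256, 512, 1024)):
        super().__init__()
        self.n_chars = n_chars
        self.convs = nn.ModuleList(
            [nn.Conv1d(n_chars, c, kernel_size=k) for k, c in zip(kernels, channels)]
        )

    def forward(self, char_ids):                       # (batch, word_len)
        x = F.one_hot(char_ids, self.n_chars).float()  # (batch, word_len, n_chars)
        x = x.transpose(1, 2)                          # channels-first for Conv1d
        # Each convolution is max-pooled over character positions, then all
        # pooled vectors are concatenated: 32+32+64+128+256+512+1024 = 2048.
        pooled = [conv(x).max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)                # (batch, 2048)

words = torch.randint(0, 256, (3, 10))   # 3 words, padded to 10 characters each
print(CharCNN()(words).shape)            # torch.Size([3, 2048])
```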

Next, we want a layer that propagates context between words. We can use bidirectional LSTMs, with two hidden states running in opposite directions:

$$\overrightarrow{h}_k = \mathrm{RNN}_r(w_1, w_2, \dots, w_k)$$
$$\overleftarrow{h}_k = \mathrm{RNN}_l(w_L, w_{L-1}, \dots, w_k)$$
$$h_k = [\overrightarrow{h}_k ; \overleftarrow{h}_k]$$

with $L$ being the length of the sentence. These hidden states are what propagate context across the sentence.
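A minimal sketch of the Bi-LSTM layer (PyTorch; the layer count and sizes are assumptions). With `bidirectional=True`, PyTorch already returns the concatenation $h_k = [\overrightarrow{h}_k ; \overleftarrow{h}_k]$ at every position:

```python
import torch
import torch.nn as nn

word_dim, hidden = 2048, 512
bilstm = nn.LSTM(word_dim, hidden, num_layers=2,
                 batch_first=True, bidirectional=True)

sentence = torch.randn(1, 7, word_dim)   # one sentence of 7 word vectors
h, _ = bilstm(sentence)
print(h.shape)                           # torch.Size([1, 7, 1024]) = 2 * hidden
```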

2025-04-28_23-16-40_ELMo_Bi-LSTMs.png

We also add feedforward layers in the middle to allow a smoother transition from the character-level CNN output to the Bi-LSTM input. This can be done with a succession of Linear layers, or with a variation called a Highway Layer.
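A minimal sketch of a highway layer (PyTorch; names are my own): a learned gate mixes a nonlinear transform of the input with the input itself, so the layer can pass information through unchanged when that helps.

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        h = torch.relu(self.transform(x))   # candidate transformation H(x)
        g = torch.sigmoid(self.gate(x))     # gate T(x) in [0, 1]
        return g * h + (1 - g) * x          # carry the input where the gate is low

x = torch.randn(4, 2048)
print(Highway(2048)(x).shape)               # torch.Size([4, 2048])
```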

2025-04-28_23-33-56_ELMo_Complete.png