Attention Layer
Enhances the important parts of the input data and fades out the rest, so that the network devotes more computing power to the important data. The technique mimics cognitive attention.
Used extensively in Transformers
Attention Coefficients
They identify which elements of the input sequence are relevant for computing an element of the output sequence.
We want a matrix of attention coefficients, where entry $(i, j)$ measures how relevant input element $j$ is to output element $i$.
Multi-Head Self-Attention Layer
First, obtain embeddings for each word
- Can be context-aware, as in ELMo, where a bidirectional RNN is used for the input embedding to propagate context between the words of an input sequence
- Can be simple embeddings with positional encoding, as sketched below
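As a rough illustration of the second option, a minimal NumPy sketch of sinusoidal positional encoding (the formulation from the original Transformer paper; the sequence length and dimensions are arbitrary choices for the example):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding as in 'Attention Is All You Need'."""
    positions = np.arange(seq_len)[:, None]                       # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                            # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                              # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                         # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                         # odd dimensions: cosine
    return pe

# Illustrative only: 5 tokens with random "simple" embeddings of dimension 8.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 8))
inputs = embeddings + sinusoidal_positional_encoding(5, 8)        # position-aware inputs
```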
Scaled Dot-Product Attention
Input vectors $x_i$ are transformed into query, key, and value vectors $q_i$, $k_i$, $v_i$:
- $q_i$ is the query: the element for which we are trying to compute a new, contextually aware representation
- $k_j$ refers to keys: labels associated with all available pieces of information, which the query will be compared against to determine relevance
- $v_j$ refers to values: the actual pieces of information/content associated with each key
These are the result of three different learned weight matrices, i.e. $q_i = W^Q x_i$, $k_i = W^K x_i$, $v_i = W^V x_i$, where $W^Q$, $W^K$, $W^V$ are learned parameters.
We compute the similarity score between a query and each key:
- Computed by the dot product $q_i \cdot k_j$ (the dot-product/cosine-angle formula) for all pairs of query/key words
We can divide by $\sqrt{d_k}$, the square root of the dimension of the keys, to prevent large dot products from dominating the softmax (dividing by the dimension itself is an alternative).
Then, apply a Softmax function to give a distribution over the similarities.
These softmax outputs are thus the attention coefficients: $\alpha_{ij} = \mathrm{softmax}_j\big(q_i \cdot k_j / \sqrt{d_k}\big)$.
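A minimal NumPy sketch of these steps, computing the attention coefficient matrix from given query and key vectors (the shapes are illustrative assumptions):

```python
import numpy as np

def attention_coefficients(Q: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Each row holds the attention coefficients for one query."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # dot-product similarity, scaled by sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))        # 4 query vectors of dimension d_k = 8
K = rng.normal(size=(6, 8))        # 6 key vectors
alpha = attention_coefficients(Q, K)   # shape (4, 6); alpha[i, j] = relevance of key j to query i
```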
Self Attention
When the queries, keys, and values are all derived from the same input sequence, this is called self-attention.
The output of a single self-attention "head" for a word is a weighted sum of the value vectors, weighted by the attention coefficients.
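Building on the previous sketch, a single self-attention head might look as follows; the projection matrices W_q, W_k, W_v stand in for the learned parameters and are drawn randomly here purely for illustration:

```python
import numpy as np

def self_attention_head(X, W_q, W_k, W_v):
    """Single head: project X into queries/keys/values, then take the
    softmax-weighted sum of the value vectors."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    alpha = weights / weights.sum(axis=-1, keepdims=True)
    return alpha @ V                        # each output row is a weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                                     # 5 input embeddings, d_model = 16
W_q, W_k, W_v = (rng.normal(size=(16, 8)) for _ in range(3))     # placeholder weights, d_k = d_v = 8
out = self_attention_head(X, W_q, W_k, W_v)                      # shape (5, 8)
```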
Multi-Head
The self-attention calculation is repeated several times in parallel, each repetition forming a "head" with its own learned query/key/value weight matrices.
These outputs are then concatenated together and linearly transformed back to the original dimension of the input vectors.
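A sketch of this multi-head step under the same assumptions, with each head holding its own (randomly initialised placeholder) projections and an output matrix W_o mapping the concatenation back to the model dimension:

```python
import numpy as np

def head(X, W_q, W_k, W_v):
    """One self-attention head, as in the previous sketch."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def multi_head_self_attention(X, heads_params, W_o):
    """Run each head, concatenate the outputs, project back to d_model."""
    outputs = [head(X, W_q, W_k, W_v) for (W_q, W_k, W_v) in heads_params]
    return np.concatenate(outputs, axis=-1) @ W_o

rng = np.random.default_rng(0)
d_model, n_heads = 16, 4
d_head = d_model // n_heads                        # 4 dimensions per head
X = rng.normal(size=(5, d_model))
heads_params = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
                for _ in range(n_heads)]           # per-head placeholder W_q, W_k, W_v
W_o = rng.normal(size=(n_heads * d_head, d_model)) # output projection back to d_model
out = multi_head_self_attention(X, heads_params, W_o)   # shape (5, d_model)
```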
Feedforward
- Feedforward layers reshape the attention vectors so that they are acceptable to the next encoder/decoder layer (see the sketch below)
- Uses a skip connection as in ResNet
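A sketch of such a position-wise feedforward sub-layer with a skip connection, assuming the common two-layer ReLU form with d_ff = 4 * d_model (the weights are random placeholders):

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """Two-layer ReLU MLP applied to each position independently,
    with a ResNet-style skip connection around it."""
    hidden = np.maximum(0.0, X @ W1 + b1)          # (seq_len, d_ff)
    return X + hidden @ W2 + b2                    # project back to d_model, add residual

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64                             # d_ff = 4 * d_model, as in the original paper
X = rng.normal(size=(5, d_model))                  # e.g. the multi-head attention output
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
out = position_wise_ffn(X, W1, b1, W2, b2)         # same shape as X
```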
Masking
This is used in the decoder during training to prevent attending to future tokens in the target sequence, ensuring the model only uses preceding tokens to predict the next one.
An upper-triangular mask is commonly used.
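A sketch of applying an upper-triangular (causal) mask before the softmax, in the same NumPy style as above; masked positions receive -inf so their softmax weight becomes exactly zero:

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Upper-triangular mask: position i may only attend to positions <= i."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def masked_attention_coefficients(Q, K):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores + causal_mask(Q.shape[0])      # future positions get -inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)       # masked weights come out as 0

rng = np.random.default_rng(0)
Q = K = rng.normal(size=(4, 8))
alpha = masked_attention_coefficients(Q, K)
print(np.triu(alpha, k=1))                         # all zeros: no attention to future tokens
```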
Add & Norm
Residual connection followed by Layer Normalisation applied after both the attention and feedforward layers
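A sketch of the Add & Norm step, with a simplified layer normalisation (the learned scale and shift parameters are omitted for brevity):

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    """Normalise each position's vector to zero mean and unit variance."""
    mean = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mean) / np.sqrt(var + eps)

def add_and_norm(X, sublayer_output):
    """Residual connection followed by layer normalisation."""
    return layer_norm(X + sublayer_output)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                   # sub-layer input
sublayer_out = rng.normal(size=(5, 16))        # e.g. attention or feedforward output
out = add_and_norm(X, sublayer_out)            # applied after both sub-layers
```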
Rationale
Thought process for selecting a context-vector model:
- Averaging word vectors, as in Continuous Bag of Words, is simple, but words might cancel each other out: an averaging effect given a large number of words
- Concatenating the hidden states of a bi-RNN (as in ELMo) gives the surrounding context of a word and preserves the meaning of surrounding words, but there is a vanishing effect of memory between successive hidden states when the sentence contains a large number of words
- Words that are far apart will have their context lost in the hidden state of the Long Short-Term Memory (LSTM) RNN