Attention Layer

Enhances the important parts of the input data and fades out the rest, so that the network devotes more computing power to the important data. The technique mimics cognitive attention.

Used extensively in Transformers

Attention Coefficients

They identify which elements of the input sequence $x = (x_1, \dots, x_N)$ are relevant for computing an element $y_k$ of the output sequence $y = (y_1, \dots, y_M)$.

$a_{i,j}$ = importance of $x_i$ w.r.t. $y_j$

We want a matrix $A$ containing coefficients $a_{i,j} \in [0,1]$, which describe the proportion of the meaning of word $v_j$ that needs to be transferred to word $v_i$ to produce a more relevant embedding $v_i'$:

$$v_i' = \sum_j a_{i,j} \, v_j$$
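
For instance, a tiny PyTorch sketch of this weighted sum (the attention matrix here is made up by hand purely for illustration):

```python
import torch

# three word embeddings of dimension 4 (values are arbitrary)
V = torch.randn(3, 4)

# hand-made attention coefficients a_{i,j}; each row sums to 1
A = torch.tensor([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])

V_new = A @ V   # row i is the new embedding v_i' = sum_j a_{i,j} * v_j
```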

Multi-Head Self-Attention Layer

First, obtain an embedding for each word $w_k$, denoted $h_k$.

A bidirectional RNN can be used for the input embedding, to propagate context between the words of an input sequence.
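
A minimal sketch of such a contextual embedding step, assuming a PyTorch bidirectional GRU (the sizes and the choice of GRU are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, d_emb, d_hidden = 10_000, 128, 64     # illustrative sizes
embed = nn.Embedding(vocab_size, d_emb)
birnn = nn.GRU(d_emb, d_hidden, bidirectional=True, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, 7))     # one sentence of 7 word ids
h, _ = birnn(embed(tokens))                       # h: (1, 7, 2 * d_hidden)
# h[0, k] concatenates forward and backward states: a context-aware h_k for word w_k
```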

Scaled Dot-Product Attention

Each input vector $v_i$ is transformed into query, key and value vectors $Q, K, V$ via three different learned projections:

$$Q = v_i W^Q, \quad K = v_i W^K, \quad V = v_i W^V$$

where $W^Q, W^K, W^V$ are learned weight matrices.

We compute the similarity scores between queries and keys, dividing by the square root of the key dimension $d_k$ to prevent large dot products from dominating the softmax:

$$S = \frac{QK^T}{\sqrt{d_k}}$$

Then, apply a softmax function to give a distribution over the similarities:

$$A = \mathrm{softmax}(S) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)$$

These are precisely the $a_{i,j}$ coefficients. We multiply the attention weights by the value vectors $V$ and sum them up, which encapsulates the inter-dependencies between the words of the sentence:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
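
A minimal PyTorch sketch of this computation (the function name and the optional mask argument are illustrative choices, not from the source):

```python
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (N, d_k), K: (M, d_k), V: (M, d_v) -> output (N, d_v), weights (N, M)."""
    d_k = K.shape[-1]
    S = Q @ K.transpose(-2, -1) / d_k ** 0.5      # scaled similarity scores
    if mask is not None:
        S = S.masked_fill(~mask, float("-inf"))   # masked positions get zero weight
    A = torch.softmax(S, dim=-1)                  # attention coefficients a_{i,j}
    return A @ V, A                               # weighted sum of value vectors
```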

Self Attention

When $Q, K, V$ are derived from the same input sequence, words in a sentence can attend to the other words in the same sentence to compute their new representations.

The output of a single self-attention "head" for a word is a weighted sum of the V vectors of all words, where the weights are determined by the similarity between the current word's Q and all other words' Ks. This gives you a new vector for each word that incorporates information from the entire sentence, weighted by relevance.
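
Continuing the sketch above, a single self-attention head derives $Q$, $K$ and $V$ from the same input $X$; random tensors stand in for the learned weight matrices:

```python
N, d_model, d_k, d_v = 5, 8, 4, 4                 # illustrative sizes
X = torch.randn(N, d_model)                       # one embedding per word
W_Q, W_K = torch.randn(d_model, d_k), torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_v)

Z, A = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
# Z[i] is the new representation of word i; A[i] holds its weights over all words
```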

Multi-Head

The self-attention calculation of Q,K,V is done multiple times in parallel using different learned linear transformations. Multiple heads allow the model to simultaneously learn and attend to different kinds of relationships, or information from different representation subspaces.

These outputs are then concatenated together and linearly transformed back to the original dimension of the input vectors.
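
A sketch of this concatenate-and-project step, reusing the single-head function and the tensors from above (the number of heads and the output projection W_O are illustrative):

```python
def multi_head_self_attention(X, heads, W_O):
    """heads: list of (W_Q, W_K, W_V) triples; W_O: (num_heads * d_v, d_model)."""
    outputs = [scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)[0]
               for W_Q, W_K, W_V in heads]
    concat = torch.cat(outputs, dim=-1)           # concatenate the head outputs
    return concat @ W_O                           # project back to d_model

heads = [(torch.randn(d_model, d_k), torch.randn(d_model, d_k),
          torch.randn(d_model, d_v)) for _ in range(2)]
W_O = torch.randn(2 * d_v, d_model)
Y = multi_head_self_attention(X, heads, W_O)      # (N, d_model)
```

In practice a framework module such as torch.nn.MultiheadAttention packages these projections, the concatenation and the output transform.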

Feedforward

Masking

This is used in the decoder during training to prevent attending to future tokens in the target sequence, ensuring the model only uses preceding tokens to predict the next one.

An upper-triangular mask (masking out everything above the diagonal, i.e. the future positions) is common.
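
Using the mask argument from the sketch above, a causal mask keeps the lower triangle and masks out everything above the diagonal, so each word puts zero weight on future words:

```python
causal_mask = torch.tril(torch.ones(N, N, dtype=torch.bool))   # True = may attend
Z_masked, A_masked = scaled_dot_product_attention(
    X @ W_Q, X @ W_K, X @ W_V, mask=causal_mask)
# A_masked is lower-triangular: word i ignores every word j > i
```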

Add & Norm

A residual connection followed by Layer Normalisation, applied after both the attention and feedforward sub-layers.
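
A minimal sketch of the Add & Norm step, using torch.nn.LayerNorm and the tensors from the sketches above:

```python
import torch.nn as nn

norm = nn.LayerNorm(d_model)              # normalises over the feature dimension

def add_and_norm(x, sublayer_output):
    # residual connection followed by layer normalisation
    return norm(x + sublayer_output)

out = add_and_norm(X, Y)                  # e.g. X plus the multi-head output Y
```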


Rationale

Thought process for selecting a context vector model: