Attention Layer

Enhances the important parts of the input data and fades out the rest, so that the network devotes more computing power to the important data. The technique mimics cognitive attention.

Used extensively in Transformers

Attention Coefficients

They identify which elements of the input sequence $x = (x_1, \dots, x_N)$ are relevant for computing an element $y_k$ of the output sequence $y = (y_1, \dots, y_M)$.

$a_{i,j}$ = importance of $x_i$ w.r.t. $y_j$

We want a matrix $A$ containing coefficients $a_{i,j} \in [0,1]$, which describe the proportion of the meaning of word $v_j$ that needs to be transferred to word $v_i$ to produce a more relevant embedding $v_i'$:

$$v_i' = \sum_j a_{i,j} \, v_j$$
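
For instance, a tiny PyTorch sketch of this weighted sum (the attention matrix here is made up by hand purely for illustration):

```python
import torch

# three word embeddings of dimension 4 (values are arbitrary)
V = torch.randn(3, 4)

# hand-made attention coefficients a_{i,j}; each row sums to 1
A = torch.tensor([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])

V_new = A @ V   # row i is the new embedding v_i' = sum_j a_{i,j} * v_j
```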

Multi-Head Self-Attention Layer

First, obtain an embedding for each word $w_k$, denoted $h_k$.

A bidirectional RNN can be used for the input embedding, to propagate context between the words of an input sequence.
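
A minimal sketch of such a contextual embedding step, assuming a PyTorch bidirectional GRU (the sizes and the choice of GRU are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, d_emb, d_hidden = 10_000, 128, 64     # illustrative sizes
embed = nn.Embedding(vocab_size, d_emb)
birnn = nn.GRU(d_emb, d_hidden, bidirectional=True, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, 7))     # one sentence of 7 word ids
h, _ = birnn(embed(tokens))                       # h: (1, 7, 2 * d_hidden)
# h[0, k] concatenates forward and backward states: a context-aware h_k for word w_k
```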

Scaled Dot-Product Attention

Each input vector $v_i$ is transformed into query, key and value vectors $Q, K, V$ via three different learned projections:

$$Q = v_i W^Q, \quad K = v_i W^K, \quad V = v_i W^V$$

where $W^Q, W^K, W^V$ are learned weight matrices.

We compute the similarity scores between queries and keys, dividing by the square root of the key dimension $d_k$ to prevent large dot products from dominating the softmax:

$$S = \frac{QK^T}{\sqrt{d_k}}$$

Then, apply a softmax function to give a distribution over the similarities:

$$A = \mathrm{softmax}(S) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)$$

These are precisely the $a_{i,j}$ coefficients. We multiply the attention weights by the value vectors $V$ and sum them up, which encapsulates the inter-dependencies between the words of the sentence:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
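
A minimal PyTorch sketch of this computation (the function name and the optional mask argument are illustrative choices, not from the source):

```python
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (N, d_k), K: (M, d_k), V: (M, d_v) -> output (N, d_v), weights (N, M)."""
    d_k = K.shape[-1]
    S = Q @ K.transpose(-2, -1) / d_k ** 0.5      # scaled similarity scores
    if mask is not None:
        S = S.masked_fill(~mask, float("-inf"))   # masked positions get zero weight
    A = torch.softmax(S, dim=-1)                  # attention coefficients a_{i,j}
    return A @ V, A                               # weighted sum of value vectors
```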

Self Attention

When $Q, K, V$ are derived from the same input sequence, words in a sentence can attend to the other words in the same sentence to compute their new representations.

The output of a single self-attention "head" for a word is a weighted sum of the V vectors of all words, where the weights are determined by the similarity between the current word's Q and all other words' Ks. This gives you a new vector for each word that incorporates information from the entire sentence, weighted by relevance.
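
Continuing the sketch above, a single self-attention head derives $Q$, $K$ and $V$ from the same input $X$; random tensors stand in for the learned weight matrices:

```python
N, d_model, d_k, d_v = 5, 8, 4, 4                 # illustrative sizes
X = torch.randn(N, d_model)                       # one embedding per word
W_Q, W_K = torch.randn(d_model, d_k), torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_v)

Z, A = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
# Z[i] is the new representation of word i; A[i] holds its weights over all words
```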

Multi-Head

The self-attention calculation of Q,K,V is done multiple times in parallel using different learned linear transformations. Multiple heads allow the model to simultaneously learn and attend to different kinds of relationships, or information from different representation subspaces.

These outputs are then concatenated together and linearly transformed back to the original dimension of the input vectors.
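
A sketch of this concatenate-and-project step, reusing the single-head function and the tensors from above (the number of heads and the output projection W_O are illustrative):

```python
def multi_head_self_attention(X, heads, W_O):
    """heads: list of (W_Q, W_K, W_V) triples; W_O: (num_heads * d_v, d_model)."""
    outputs = [scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)[0]
               for W_Q, W_K, W_V in heads]
    concat = torch.cat(outputs, dim=-1)           # concatenate the head outputs
    return concat @ W_O                           # project back to d_model

heads = [(torch.randn(d_model, d_k), torch.randn(d_model, d_k),
          torch.randn(d_model, d_v)) for _ in range(2)]
W_O = torch.randn(2 * d_v, d_model)
Y = multi_head_self_attention(X, heads, W_O)      # (N, d_model)
```

In practice a framework module such as torch.nn.MultiheadAttention packages these projections, the concatenation and the output transform.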

Feedforward

Masking

This is used in the decoder during training to prevent attending to future tokens in the target sequence, ensuring the model only uses preceding tokens to predict the next one.

An upper-triangular mask (masking out everything above the diagonal, i.e. the future positions) is common.
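
Using the mask argument from the sketch above, a causal mask keeps the lower triangle and masks out everything above the diagonal, so each word puts zero weight on future words:

```python
causal_mask = torch.tril(torch.ones(N, N, dtype=torch.bool))   # True = may attend
Z_masked, A_masked = scaled_dot_product_attention(
    X @ W_Q, X @ W_K, X @ W_V, mask=causal_mask)
# A_masked is lower-triangular: word i ignores every word j > i
```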

Add & Norm

A residual connection followed by Layer Normalisation, applied after both the attention and feedforward sub-layers.
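
A minimal sketch of the Add & Norm step, using torch.nn.LayerNorm and the tensors from the sketches above:

```python
import torch.nn as nn

norm = nn.LayerNorm(d_model)              # normalises over the feature dimension

def add_and_norm(x, sublayer_output):
    # residual connection followed by layer normalisation
    return norm(x + sublayer_output)

out = add_and_norm(X, Y)                  # e.g. X plus the multi-head output Y
```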


Rationale

Thought process for selecting a context vector model: