Embeddings

A function that transforms an object $x$ into another object $\hat{x} = f(x)$, which should be mathematically more tractable and useful than $x$.

They're typically used in machine learning to encode non-mathematical data as meaningful numerical data that can later be fed into a mathematical model.

A problem arises when input data cannot simply be formatted as a tensor, since neural networks are designed to operate on tensors of numerical values.
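As a tiny illustration (the vocabulary and tokens below are made up), raw tokens can be turned into a numerical tensor with a one-hot encoding:

```python
import numpy as np

# Hypothetical toy vocabulary: each word is assigned an integer index.
vocab = {"cat": 0, "dog": 1, "movie": 2, "popcorn": 3}

def one_hot(word: str) -> np.ndarray:
    """Encode a word as a one-hot vector of length |V|."""
    vec = np.zeros(len(vocab))
    vec[vocab[word]] = 1.0
    return vec

# A sentence becomes a (sequence_length, |V|) tensor a network can consume.
sentence = ["dog", "movie", "popcorn"]
X = np.stack([one_hot(w) for w in sentence])
print(X.shape)  # (3, 4)
```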

Embedding functions are preferably injective, or ideally bijective.

Injective functions map distinct elements of the domain $X$ to distinct elements of the codomain $\hat{X}$, i.e. $\forall x_1, x_2 \in X$,

$$x_1 \neq x_2 \Rightarrow f(x_1) \neq f(x_2), \quad \text{equivalently} \quad f(x_1) = f(x_2) \Rightarrow x_1 = x_2$$

Surjective functions are such that every element of the codomain $\hat{X}$ is the image of at least one element of the domain $X$, i.e. $\forall \hat{x} \in \hat{X}$,

$$\exists x \in X \ \text{s.t.}\ f(x) = \hat{x}$$

Bijective functions are both injective and surjective, and define a one-to-one correspondence between all elements of the domain $X$ and the codomain $\hat{X}$.

We ideally want embedding functions to be bijective: injectivity ensures distinct objects get distinct embeddings (no information is lost in the mapping), and bijectivity additionally means every point in the embedding space corresponds to some original object, so the mapping can in principle be inverted.
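As a toy illustration (the embedding table below is invented), injectivity of a finite embedding table can be checked directly by looking for duplicate output vectors:

```python
import numpy as np

# Invented embedding table over a tiny vocabulary.
embeddings = {
    "kids":        np.array([-1.0,  0.5]),
    "adults":      np.array([ 1.0,  0.5]),
    "blockbuster": np.array([ 0.2,  1.0]),
}

def is_injective(table: dict) -> bool:
    """Injective iff no two distinct keys share the same output vector."""
    vectors = [tuple(v) for v in table.values()]
    return len(vectors) == len(set(vectors))

print(is_injective(embeddings))  # True: all vectors are distinct
```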

Manual Embeddings

Refers to manually extracting meaningful data $y$ from an object $x$ whose format cannot be fed into a neural network, i.e. manually defining features that represent the object numerically.

For example, movies can be represented by two scores in $[-1, 1]$, e.g. $x_1$ for target audience (kids vs. adults) and $x_2$ for type (arthouse vs. blockbuster). These manual embeddings can then be used to train a classifier or to rank preferences (by distance to a decision boundary).
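A minimal sketch of this idea; the movie names, hand-assigned scores, and the decision boundary below are all invented for illustration:

```python
import numpy as np

# Manual embeddings: each movie is hand-scored on two axes in [-1, 1]:
#   x1: target audience (-1 = kids, +1 = adults)
#   x2: type            (-1 = arthouse, +1 = blockbuster)
movies = {
    "Finding Nemo":  np.array([-0.9,  0.8]),
    "The Godfather": np.array([ 0.9, -0.2]),
    "Moonlight":     np.array([ 0.7, -0.8]),
    "Avengers":      np.array([ 0.3,  0.9]),
}

# A made-up linear decision boundary w.x + b = 0 for some viewer's taste.
w, b = np.array([0.8, -0.5]), 0.1

# The signed distance to the boundary classifies and ranks at the same time.
scores = {name: (w @ x + b) / np.linalg.norm(w) for name, x in movies.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:15s} score={s:+.2f} -> {'enjoy' if s > 0 else 'skip'}")
```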

Language embeddings

Problems that arise when embedding language:

  1. Identical words can have multiple (very different) meanings
  2. Two different words can have very close meanings
  3. Two different words can have very close meanings, yet their embeddings may not need to be close, since that decision can be task-specific
  4. The vocabulary size $|V|$ can become very large

These problems can be addressed with frequency-based embeddings (TF-IDF and similar methods) and prediction-based embeddings (the continuous bag-of-words and skip-gram models).
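As a rough sketch of the frequency-based approach, here is a from-scratch TF-IDF on an invented toy corpus (one of several common TF-IDF weighting variants):

```python
import math
from collections import Counter

# Invented toy corpus.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Document frequency: in how many documents does each word appear?
df = Counter(w for doc in tokenized for w in set(doc))

def tfidf(doc):
    """Represent one document as a |V|-dimensional TF-IDF vector."""
    counts = Counter(doc)
    vec = []
    for w in vocab:
        tf = counts[w] / len(doc)           # term frequency in this document
        idf = math.log(len(docs) / df[w])   # inverse document frequency
        vec.append(tf * idf)
    return vec

for text, doc in zip(docs, tokenized):
    print(text, "->", [round(v, 2) for v in tfidf(doc)])
```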

The prediction-based recipe, in outline (a toy sketch follows this list):

  1. Train a neural network on a language-related task (e.g., predicting words from their context).
  2. Use one of the hidden layers of the trained network as the feature representation (embedding) for words.
  3. Start with a basic embedding (such as one-hot).
  4. Feed it into the network; the hidden-layer activation becomes the "better", learned embedding.
  5. These learned embeddings are usually dense (not sparse) and capture similarities (similar words have positive inner products).
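Below is a self-contained toy sketch of this recipe: a skip-gram-style model with a full softmax, written from scratch on an invented corpus (word2vec's actual training uses tricks such as negative sampling that are omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented toy corpus and vocabulary.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, H = len(vocab), 8                         # vocabulary size, embedding size

# (center, context) training pairs from a window of size 1.
pairs = [(idx[corpus[i]], idx[corpus[j]])
         for i in range(len(corpus))
         for j in (i - 1, i + 1) if 0 <= j < len(corpus)]

W_in = rng.normal(scale=0.1, size=(V, H))    # input->hidden weights: the embeddings
W_out = rng.normal(scale=0.1, size=(H, V))   # hidden->output weights

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Train by SGD to predict a context word from the one-hot center word;
# the hidden activation for a one-hot input is simply the row W_in[center].
lr = 0.05
for epoch in range(200):
    for center, context in pairs:
        h = W_in[center]                             # hidden layer = embedding
        p = softmax(W_out.T @ h)                     # predicted context distribution
        grad_out = np.outer(h, p)                    # cross-entropy gradient w.r.t. W_out
        grad_out[:, context] -= h
        grad_in = W_out @ p - W_out[:, context]      # gradient w.r.t. the embedding
        W_out -= lr * grad_out
        W_in[center] -= lr * grad_in

# Words used in similar contexts should end up with larger inner products.
def sim(a, b):
    return float(W_in[idx[a]] @ W_in[idx[b]])

print("sim(cat, dog) =", round(sim("cat", "dog"), 3))
print("sim(cat, mat) =", round(sim("cat", "mat"), 3))
```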