Embeddings

A function that transforms an object $x$ into another object $\hat{x} = f(x)$, which should be mathematically more tractable and useful than $x$.

They're typically used in machine learning to encode non-mathematical data as meaningful numerical data that can later be fed into a mathematical model.

A problem arises when input data cannot simply be formatted as a tensor, since neural networks are designed to operate on tensors of numerical values.
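As a tiny illustration (the vocabulary and tokens below are made up), raw tokens can be turned into a numerical tensor with a one-hot encoding:

```python
import numpy as np

# Hypothetical toy vocabulary: each word is assigned an integer index.
vocab = {"cat": 0, "dog": 1, "movie": 2, "popcorn": 3}

def one_hot(word: str) -> np.ndarray:
    """Encode a word as a one-hot vector of length |V|."""
    vec = np.zeros(len(vocab))
    vec[vocab[word]] = 1.0
    return vec

# A sentence becomes a (sequence_length, |V|) tensor a network can consume.
sentence = ["dog", "movie", "popcorn"]
X = np.stack([one_hot(w) for w in sentence])
print(X.shape)  # (3, 4)
```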

Embedding functions are preferably injective, or ideally bijective.

Injective functions map distinct elements of the domain $X$ to distinct elements of the codomain $\hat{X}$, i.e. $\forall x_1, x_2 \in X$,

$$x_1 \neq x_2 \Rightarrow f(x_1) \neq f(x_2), \quad \text{equivalently} \quad f(x_1) = f(x_2) \Rightarrow x_1 = x_2$$

Surjective functions are such that every element of the codomain $\hat{X}$ is the image of at least one element of the domain $X$, i.e. $\forall \hat{x} \in \hat{X}$,

$$\exists x \in X \ \text{s.t.}\ f(x) = \hat{x}$$

Bijective functions are both injective and surjective, and define a one-to-one correspondence between all elements of the domain $X$ and the codomain $\hat{X}$.

We ideally want embedding functions to be bijective: injectivity ensures distinct objects get distinct embeddings (no information is lost in the mapping), and bijectivity additionally means every point in the embedding space corresponds to some original object, so the mapping can in principle be inverted.
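As a toy illustration (the embedding table below is invented), injectivity of a finite embedding table can be checked directly by looking for duplicate output vectors:

```python
import numpy as np

# Invented embedding table over a tiny vocabulary.
embeddings = {
    "kids":        np.array([-1.0,  0.5]),
    "adults":      np.array([ 1.0,  0.5]),
    "blockbuster": np.array([ 0.2,  1.0]),
}

def is_injective(table: dict) -> bool:
    """Injective iff no two distinct keys share the same output vector."""
    vectors = [tuple(v) for v in table.values()]
    return len(vectors) == len(set(vectors))

print(is_injective(embeddings))  # True: all vectors are distinct
```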

Manual Embeddings

Refers to manually extracting meaningful data $y$ from an object $x$ whose format cannot be fed into a neural network, i.e. manually defining features that represent the object numerically.

For example, movies can be represented by two scores in $[-1, 1]$, e.g. $x_1$ for target audience (kids vs. adults) and $x_2$ for type (arthouse vs. blockbuster). These manual embeddings can then be used to train a classifier or to rank preferences (by distance to a decision boundary).
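A minimal sketch of this idea; the movie names, hand-assigned scores, and the decision boundary below are all invented for illustration:

```python
import numpy as np

# Manual embeddings: each movie is hand-scored on two axes in [-1, 1]:
#   x1: target audience (-1 = kids, +1 = adults)
#   x2: type            (-1 = arthouse, +1 = blockbuster)
movies = {
    "Finding Nemo":  np.array([-0.9,  0.8]),
    "The Godfather": np.array([ 0.9, -0.2]),
    "Moonlight":     np.array([ 0.7, -0.8]),
    "Avengers":      np.array([ 0.3,  0.9]),
}

# A made-up linear decision boundary w.x + b = 0 for some viewer's taste.
w, b = np.array([0.8, -0.5]), 0.1

# The signed distance to the boundary classifies and ranks at the same time.
scores = {name: (w @ x + b) / np.linalg.norm(w) for name, x in movies.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:15s} score={s:+.2f} -> {'enjoy' if s > 0 else 'skip'}")
```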

Language embeddings

Problems that arise when embedding language:

  1. Identical words can have multiple (very different) meanings
  2. Two different words can have very close meanings
  3. Two different words can have very close meanings, yet their embeddings may not need to be close, since that decision can be task-specific
  4. The vocabulary size $|V|$ can become very large

These problems can be addressed with frequency-based embeddings (TF-IDF and similar methods) and prediction-based embeddings (the continuous bag-of-words and skip-gram models).
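As a rough sketch of the frequency-based approach, here is a from-scratch TF-IDF on an invented toy corpus (one of several common TF-IDF weighting variants):

```python
import math
from collections import Counter

# Invented toy corpus.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Document frequency: in how many documents does each word appear?
df = Counter(w for doc in tokenized for w in set(doc))

def tfidf(doc):
    """Represent one document as a |V|-dimensional TF-IDF vector."""
    counts = Counter(doc)
    vec = []
    for w in vocab:
        tf = counts[w] / len(doc)           # term frequency in this document
        idf = math.log(len(docs) / df[w])   # inverse document frequency
        vec.append(tf * idf)
    return vec

for text, doc in zip(docs, tokenized):
    print(text, "->", [round(v, 2) for v in tfidf(doc)])
```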

The prediction-based recipe, in outline (a toy sketch follows this list):

  1. Train a neural network on a language-related task (e.g., predicting words from their context).
  2. Use one of the hidden layers of the trained network as the feature representation (embedding) for words.
  3. Start with a basic embedding (such as one-hot).
  4. Feed it into the network; the hidden-layer activation becomes the "better", learned embedding.
  5. These learned embeddings are usually dense (not sparse) and capture similarities (similar words have positive inner products).
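Below is a self-contained toy sketch of this recipe: a skip-gram-style model with a full softmax, written from scratch on an invented corpus (word2vec's actual training uses tricks such as negative sampling that are omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented toy corpus and vocabulary.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, H = len(vocab), 8                         # vocabulary size, embedding size

# (center, context) training pairs from a window of size 1.
pairs = [(idx[corpus[i]], idx[corpus[j]])
         for i in range(len(corpus))
         for j in (i - 1, i + 1) if 0 <= j < len(corpus)]

W_in = rng.normal(scale=0.1, size=(V, H))    # input->hidden weights: the embeddings
W_out = rng.normal(scale=0.1, size=(H, V))   # hidden->output weights

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Train by SGD to predict a context word from the one-hot center word;
# the hidden activation for a one-hot input is simply the row W_in[center].
lr = 0.05
for epoch in range(200):
    for center, context in pairs:
        h = W_in[center]                             # hidden layer = embedding
        p = softmax(W_out.T @ h)                     # predicted context distribution
        grad_out = np.outer(h, p)                    # cross-entropy gradient w.r.t. W_out
        grad_out[:, context] -= h
        grad_in = W_out @ p - W_out[:, context]      # gradient w.r.t. the embedding
        W_out -= lr * grad_out
        W_in[center] -= lr * grad_in

# Words used in similar contexts should end up with larger inner products.
def sim(a, b):
    return float(W_in[idx[a]] @ W_in[idx[b]])

print("sim(cat, dog) =", round(sim("cat", "dog"), 3))
print("sim(cat, mat) =", round(sim("cat", "mat"), 3))
```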