Embeddings
A function $f: X \to Y$ that transforms an object $x \in X$ into another object $y = f(x) \in Y$, which should be mathematically more tractable and useful than $x$.
They're typically used in machine learning to encode non-mathematical data into meaningful numerical data that can later be fed into a mathematical model.
A problem arises when input data cannot simply be formatted as a tensor (since neural networks are designed to operate on tensors of numerical values).
Embedding functions should preferably be injective, and ideally bijective.
Injective functions map distinct elements of the domain $X$ to distinct elements of the codomain $Y$.
Surjective functions cover the whole codomain: every element of $Y$ is the image of at least one element of $X$.
Bijective functions are both injective and surjective, and define a one-to-one mapping between all elements of the domain $X$ and the codomain $Y$.
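Using the notation above, these properties can be written compactly as:

$$
\begin{aligned}
\text{Injective:} \quad & \forall\, x_1, x_2 \in X,\ f(x_1) = f(x_2) \implies x_1 = x_2 \\
\text{Surjective:} \quad & \forall\, y \in Y,\ \exists\, x \in X \text{ such that } f(x) = y \\
\text{Bijective:} \quad & f \text{ is both injective and surjective, hence invertible with inverse } f^{-1}
\end{aligned}
$$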
We ideally want embedding functions to be bijective because:
- Non-surjective functions could predict vectors with no meaning (an element in $Y$ which cannot be mapped back to an element in $X$)
- Non-injective functions could predict vectors with multiple meanings
- They allow any element $x \in X$ to be encoded as a unique $y \in Y$ with $y = f(x)$, and any element $y \in Y$ to be decoded as a unique $x \in X$ with $x = f^{-1}(y)$
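As a minimal sketch of a bijective embedding, consider an assumed toy vocabulary mapped to unique indices; each word encodes to exactly one index and each index decodes back to exactly one word:

```python
# Bijective embedding sketch with an assumed toy vocabulary: the mapping
# word -> index is injective (distinct words get distinct indices) and
# surjective onto {0, 1, 2}, so an inverse mapping exists.
vocab = ["cat", "dog", "mat"]
encode = {w: i for i, w in enumerate(vocab)}   # f : X -> Y
decode = {i: w for w, i in encode.items()}     # f^{-1} : Y -> X

y = encode["dog"]   # unique y = f(x)
x = decode[y]       # unique x = f^{-1}(y)
assert x == "dog"
```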
Manual Embeddings
Refers to manually extracting meaningful data $y$ from an object $x$ whose format cannot be fed into a neural network. This means manually defining features to represent an object numerically.
For example, representing movies using two hand-picked scores, so that each movie becomes a point in $\mathbb{R}^2$.
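A minimal sketch of such a manual embedding, with hypothetical titles and hand-picked scores (the two features used here, an action score and a romance score, are only an illustration):

```python
# Manual embedding: hand-defined feature scores (made-up values) mapping each
# movie title (a non-numerical object) to a point in R^2.
manual_embedding = {
    "Movie A": (0.9, 0.1),   # (action score, romance score)
    "Movie B": (0.2, 0.8),
    "Movie C": (0.5, 0.5),
}

def embed(title):
    """Return the hand-defined feature vector for a movie title."""
    return manual_embedding[title]

print(embed("Movie A"))   # (0.9, 0.1)
```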
Language embeddings
Problems in embeddings for languages:
- Identical words can have multiple (very different) meanings
- Two different words can have very close meanings
- Even when two different words have very close meanings, their embeddings may not need to be close, as that decision can be task-specific
- The size of the vocabulary can scale to become very large
This can be addressed with frequency-based embeddings (TF-IDF and related methods) and prediction-based embeddings (the Continuous Bag of Words and Skip-gram models).
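For the frequency-based side, here is a minimal sketch of one common TF-IDF variant (a toy corpus and the simple $\mathrm{tf} \times \log(N/\mathrm{df})$ weighting are assumed; real implementations typically add smoothing):

```python
# TF-IDF sketch on an assumed toy corpus: each document becomes a vector
# indexed by the vocabulary, with terms that are frequent in a document but
# rare across the corpus weighted highest.
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "dogs and cats are pets".split(),
]

vocab = sorted({w for doc in docs for w in doc})
n_docs = len(docs)
df = {t: sum(t in doc for doc in docs) for t in vocab}   # document frequency

def tfidf_vector(doc):
    counts = Counter(doc)
    return [
        (counts[t] / len(doc)) * math.log(n_docs / df[t])  # tf * idf
        for t in vocab
    ]

for doc in docs:
    print(tfidf_vector(doc))
```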
The prediction-based approach learns embeddings as a by-product of training a neural network:
- Train an NN on a language-related task (e.g., predicting words).
- Use one of the hidden layers of the trained NN as the feature representation (embedding) for words.
- Start with a basic embedding (like one-hot).
- Input this into the NN; the hidden layer activation becomes the "better" learned embedding.
- These learned embeddings are usually dense (not sparse) and capture similarities (similar words have large inner products, i.e., high cosine similarity).
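A minimal sketch of this idea in the style of the Skip-gram model, assuming a toy corpus, a context window of 1, and arbitrary hyperparameters; PyTorch's `nn.Embedding` plays the role of the one-hot input multiplied by the first weight matrix, so its weights are the learned embeddings:

```python
# Skip-gram-style sketch: train an NN to predict a context word from a center
# word; the hidden (embedding) layer's weight matrix becomes the word embeddings.
# Corpus, window size and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
word_to_idx = {w: i for i, w in enumerate(vocab)}

# Build (center, context) pairs with a context window of 1.
pairs = [
    (word_to_idx[corpus[i]], word_to_idx[corpus[j]])
    for i in range(len(corpus))
    for j in (i - 1, i + 1)
    if 0 <= j < len(corpus)
]
centers = torch.tensor([c for c, _ in pairs])
contexts = torch.tensor([c for _, c in pairs])

embed_dim = 8
model = nn.Sequential(
    nn.Embedding(len(vocab), embed_dim),   # equivalent to one-hot input times W1
    nn.Linear(embed_dim, len(vocab)),      # scores over the vocabulary
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

for _ in range(200):                       # tiny training loop
    optimizer.zero_grad()
    loss = loss_fn(model(centers), contexts)
    loss.backward()
    optimizer.step()

# Dense learned embeddings: rows of the Embedding layer's weight matrix.
embeddings = model[0].weight.detach()
print(embeddings[word_to_idx["cat"]])
```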