Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation. They are a distributed representation for text and are perhaps one of the key breakthroughs behind the impressive performance of deep learning methods on challenging natural language processing problems.
Word embeddings are a popular and powerful way to associate a vector with a word. By contrast, the vectors obtained through one-hot encoding are binary, sparse, and very high-dimensional: they are made mostly of zeros and have the same dimensionality as the number of words in the vocabulary. Word embeddings are low-dimensional floating-point vectors, that is, dense vectors as opposed to sparse vectors. Unlike the word vectors obtained via one-hot encoding, word embeddings are learned from data. When dealing with very large vocabularies, it is common to see word embeddings that are 256-dimensional, 512-dimensional, or 1,024-dimensional. One-hot encoding, on the other hand, generally leads to vectors that are 20,000-dimensional or greater; in that case the vocabulary contains 20,000 tokens. Consequently, word embeddings pack more information into far fewer dimensions.
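The contrast between sparse one-hot vectors and dense embeddings can be sketched in a few lines of Python. The tiny vocabulary and the 4-dimensional embedding here are made up for illustration; real vocabularies hold tens of thousands of words and real embeddings use hundreds of dimensions.

```python
import random

vocab = ["the", "cat", "sat", "on", "mat"]  # toy vocabulary for illustration
embedding_dim = 4                            # real embeddings use e.g. 256 or 512

def one_hot(word):
    """Sparse binary vector: a single 1, zeros everywhere else."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

# A dense embedding is a table of floating-point vectors; here it is
# initialized randomly, as it would be before being learned from data.
random.seed(0)
embeddings = {w: [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
              for w in vocab}

print(one_hot("cat"))      # length equals vocabulary size, mostly zeros
print(embeddings["cat"])   # length equals embedding_dim, all floats
```

The one-hot vector grows with the vocabulary, while the dense vector stays at a fixed, small dimensionality regardless of vocabulary size.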
Word embeddings are actually a class of techniques in which individual words are represented as real-valued vectors in a predefined vector space. Each word is mapped to one vector, and the vector values are learned in a way that resembles the training of a neural network. Hence, the technique is often grouped into the field of deep learning.
Central to the method is the idea of using a dense distributed representation for each word. Each word is represented by a real-valued vector, often with tens or hundreds of dimensions. This contrasts with the thousands or millions of dimensions required for sparse word representations such as a one-hot encoding.
Methods to obtain Embeddings
There are two methods to obtain word embeddings:
Learn word embeddings jointly with the main task we care about, such as document classification or sentiment prediction. In this setup, we start with random word vectors and then learn the word vectors in the same way we learn the weights of a neural network.
Load into our model word embeddings that were pre-computed using a different machine-learning task than the one we are trying to solve. These are called pre-trained word embeddings.
Learning word embeddings with the embedding layer
An embedding layer, for lack of a better name, is a word embedding that is learned jointly with a neural network model on a specific natural language processing task, such as language modeling or document classification.
It requires that the document text be cleaned and prepared such that each word is one-hot encoded. The size of the vector space is specified as part of the model, for example, 50, 100, or 300 dimensions. The vectors are initialized with small random numbers. The embedding layer is used on the front end of a neural network and is fit in a supervised way using the backpropagation algorithm.
The one-hot encoded words are mapped to the word vectors. If a multilayer perceptron model is used, the word vectors are concatenated before being fed as input to the model. If a recurrent neural network is used, each word can be taken as one input in a sequence.
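The lookup-and-concatenate step described above can be sketched in plain Python. The vocabulary, embedding dimension, and sentence below are invented for illustration; in a real model the table entries would be updated by backpropagation rather than left at their random initial values.

```python
import random

random.seed(42)

vocab = ["i", "love", "this", "movie", "hate"]  # hypothetical tiny vocabulary
embedding_dim = 3

# Embedding table, initialized with small random numbers as it would be
# before training by backpropagation.
table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
         for _ in vocab]

def lookup(word):
    """Map a word (equivalently, its one-hot index) to its dense vector."""
    return table[vocab.index(word)]

# For a multilayer perceptron, the word vectors of a fixed-length input
# are concatenated into a single flat input vector.
sentence = ["i", "love", "this", "movie"]
mlp_input = [x for word in sentence for x in lookup(word)]

print(len(mlp_input))  # 4 words * 3 dimensions = 12 values
```

For a recurrent network, each `lookup(word)` vector would instead be fed to the model one timestep at a time rather than concatenated.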
This approach of learning an embedding layer requires a lot of training data and can be slow, but it learns an embedding targeted both to the specific text data and to the NLP task.
Using Pre-trained Word Embeddings
Sometimes, we have so little training data available that we cannot use our data alone to learn a suitable task-specific embedding of our vocabulary. What do we do then? Instead of learning word embeddings jointly with the problem we want to solve, we can load embedding vectors from a pre-computed embedding space that we know is highly structured and exhibits useful properties, ones that capture generic aspects of language structure. The rationale behind using pre-trained word embeddings in natural language processing is much the same as for using pre-trained convnets in image classification: we do not have enough data available to learn truly powerful features on our own, but we expect the features we need to be fairly generic, that is, common visual structures or semantic features. In such a case, it makes sense to reuse features learned on a different problem. These word embeddings are generally computed using word-occurrence statistics, using a variety of techniques, some involving neural networks, others not. The idea of a dense, low-dimensional embedding space for words, computed in an unsupervised way, was initially explored by Bengio et al. in the early 2000s. It only started to take off in research and industry applications after the release of one of the most well-known and successful word-embedding schemes: the Word2vec algorithm.
The Word2vec algorithm (https://code.google.com/archive/p/word2vec) was developed by Tomas Mikolov at Google in 2013. Word2vec dimensions capture specific semantic properties, such as gender. There are many pre-computed databases of word embeddings that we can download and use in a Keras Embedding layer; Word2vec is one of them.
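Downloadable pre-trained embeddings are commonly distributed as plain-text files with one word per line followed by its vector components. A minimal loader for that format might look like the sketch below; the three words and their vector values are invented here purely for illustration.

```python
import io

# Invented example contents in the common text format: word, then components.
pretrained_text = """king 0.5 0.1 -0.2
queen 0.4 0.2 -0.1
apple -0.3 0.8 0.6
"""

def load_embeddings(fileobj):
    """Parse a text-format embedding file into a dict of word -> vector."""
    index = {}
    for line in fileobj:
        parts = line.split()
        if not parts:
            continue
        index[parts[0]] = [float(x) for x in parts[1:]]
    return index

embeddings = load_embeddings(io.StringIO(pretrained_text))
print(len(embeddings), len(embeddings["king"]))  # 3 words, 3 dimensions each
```

In practice the resulting dictionary is used to fill the weight matrix of an embedding layer, with rows ordered to match the model's word indices.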
Global Vectors for Word Representation (GloVe)
Another widespread scheme is called Global Vectors for Word Representation (GloVe, https://nlp.stanford.edu/projects/glove), developed by Stanford researchers in 2014. This embedding technique is based on factorizing a matrix of word co-occurrence statistics. Its developers have made available pre-computed embeddings for millions of English tokens, obtained from Wikipedia data and Common Crawl data.
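GloVe starts from word co-occurrence statistics. As a rough illustration of the kind of counts behind the matrix it factorizes, the sketch below tallies how often two words appear within a small window of each other in a toy corpus; the corpus and window size are made up, and GloVe itself applies further weighting that is omitted here.

```python
from collections import Counter

corpus = ["the cat sat on the mat", "the dog sat on the log"]  # toy corpus
window = 2  # count words as co-occurring if within 2 positions of each other

cooccur = Counter()
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        # Look at neighbors within the window, excluding the word itself.
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                cooccur[(w, words[j])] += 1

print(cooccur[("the", "cat")])  # how often "the" appears near "cat"
```

A matrix built from counts like these, with one row and column per vocabulary word, is what GloVe factorizes into low-dimensional word vectors.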