Introduction
Deep-learning models can process text (understood as either sequences of words or sequences of characters), timeseries, and sequence data in general. The two most important deep-learning algorithms for sequence processing are recurrent neural networks and 1D convnets. We’ll discuss both of these approaches. Applications of these algorithms include the following:
- Document classification and timeseries classification, such as identifying the topic of a piece of writing or the author of a book
- Timeseries comparisons, such as estimating how closely related two documents or two stock tickers are
Working with text data
Text is one of the most widespread forms of sequence data. It can be understood as either a sequence of characters or a sequence of words, but it’s most common to work at the level of words. The deep-learning sequence-processing models introduced in the following sections can use text to produce a basic form of natural-language understanding, sufficient for applications such as document classification, sentiment analysis, author identification, and even question answering. Of course, keep in mind that none of these deep-learning models truly understands text in a human sense; rather, they map the statistical structure of written language, which is enough to solve many simple textual tasks. Deep learning for natural-language processing is pattern recognition applied to words, sentences, and paragraphs, in much the same way that computer vision is pattern recognition applied to pixels.
Like all other neural networks, deep-learning models don’t take raw text as input: they only work with numeric tensors. Vectorizing text is the process of transforming text into numeric tensors. This can be done in multiple ways (a short code sketch follows the list):
- Segment text into words, and transform each word into a vector.
- Segment text into characters, and transform each character into a vector.
- Extract n-grams of words or characters, and transform each n-gram into a vector.
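To make these three approaches concrete, here is a minimal sketch in plain Python that breaks a sample sentence into word tokens, character tokens, and word 2-grams; the sample sentence and variable names are purely illustrative.

```python
# A minimal sketch of the three tokenization granularities listed above
# (plain Python only; the sample sentence and names are illustrative).
sample = "The cat sat on the mat."

words = sample.split()                   # word-level tokens
characters = list(sample)                # character-level tokens
word_2grams = [" ".join(words[i:i + 2])  # overlapping word 2-grams
               for i in range(len(words) - 1)]

print(words)        # ['The', 'cat', 'sat', 'on', 'the', 'mat.']
print(characters)   # ['T', 'h', 'e', ' ', 'c', ...]
print(word_2grams)  # ['The cat', 'cat sat', 'sat on', 'on the', 'the mat.']
```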
N-grams are overlapping groups of multiple consecutive words or characters. Collectively, the different units into which you can break down text (words, characters, or n-grams) are called tokens, and breaking text into such tokens is called tokenization. All text-vectorization processes consist of applying some tokenization scheme and then associating numeric vectors with the generated tokens. These vectors, packed into sequence tensors, are fed into deep neural networks.
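As a concrete illustration of associating numeric vectors with tokens and packing them into a sequence tensor, here is a minimal word-level one-hot encoding sketch; it assumes NumPy is available, and the sample sentences and the `max_length` value are arbitrary choices for illustration.

```python
# A minimal sketch: build a word index, then pack one-hot vectors into a
# sequence tensor of shape (samples, max_length, vocabulary_size + 1).
# NumPy is assumed available; samples and max_length are illustrative.
import numpy as np

samples = ["The cat sat on the mat.", "The dog ate my homework."]

# Associate a unique integer index with each word (index 0 is left unused).
token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1

max_length = 10
results = np.zeros((len(samples), max_length, len(token_index) + 1))
for i, sample in enumerate(samples):
    for j, word in enumerate(sample.split()[:max_length]):
        results[i, j, token_index[word]] = 1.0  # one-hot vector for this token

print(results.shape)  # (2, 10, 11)
```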
Understanding n-grams and bag-of-words
Word n-grams are groups of N (or fewer) consecutive words that you can extract from a sentence. The same concept can also be applied to characters instead of words. Here’s a simple example. Consider the sentence “The cat sat on the mat.” It may be decomposed into the following set of 2-grams:
{“The”, “The cat”, “cat”, “cat sat”, “sat”,
“sat on”, “on”, “on the”, “the”, “the mat”, “mat”}
It may also be decomposed into the following set of 3-grams (a short extraction sketch follows this example):
{“The”, “The cat”, “cat”, “cat sat”, “The cat sat”,
“sat”, “sat on”, “on”, “cat sat on”, “on the”, “the”,
“sat on the”, “the mat”, “mat”, “on the mat”}
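The sets above can be reproduced with a short helper that collects every group of N or fewer consecutive words. This is only a sketch using a crude whitespace tokenizer; the function name is illustrative.

```python
# A minimal sketch that extracts a bag of n-grams (of size 1..n) from a sentence,
# using a crude whitespace tokenizer; the function name is illustrative.
def bag_of_ngrams(text, n):
    words = text.replace(".", "").split()  # drop the final period, split on spaces
    grams = set()
    for size in range(1, n + 1):
        for i in range(len(words) - size + 1):
            grams.add(" ".join(words[i:i + size]))
    return grams

print(bag_of_ngrams("The cat sat on the mat.", 2))  # the bag-of-2-grams shown above
print(bag_of_ngrams("The cat sat on the mat.", 3))  # the bag-of-3-grams shown above
```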
Such a set is called a bag-of-2-grams or bag-of-3-grams, respectively. The term bag here refers to the fact that you’re dealing with a set of tokens rather than a list or sequence: the tokens have no specific order. This family of tokenization methods is called bag-of-words. Because bag-of-words isn’t an order-preserving tokenization method, it tends to be used in shallow language-processing models rather than in deep-learning models. Extracting n-grams is a form of feature engineering, and deep learning does away with this kind of rigid, brittle approach, replacing it with hierarchical feature learning. One-dimensional convnets and recurrent neural networks are capable of learning representations for groups of words and characters without being explicitly told about the existence of such groups, by looking at continuous word or character sequences. For this reason, we won’t cover n-grams any further. But do keep in mind that they’re a powerful, unavoidable feature-engineering tool when using lightweight, shallow text-processing models such as logistic regression and random forests.
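To illustrate that last point, here is a hedged sketch of how bag-of-n-grams features might feed a shallow model such as logistic regression, using scikit-learn’s CountVectorizer; the toy texts and labels are invented purely for illustration.

```python
# A minimal sketch of bag-of-n-grams features feeding a shallow model
# (scikit-learn assumed available; the toy texts and labels are invented).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["the cat sat on the mat", "the dog ate my homework",
         "the cat ate the fish", "my dog sat on my homework"]
labels = [0, 1, 0, 1]  # toy labels: 0 = cat-related, 1 = dog-related

# Count 1-grams and 2-grams: an order-free, bag-of-words representation.
vectorizer = CountVectorizer(ngram_range=(1, 2))
features = vectorizer.fit_transform(texts)

model = LogisticRegression()
model.fit(features, labels)
print(model.predict(vectorizer.transform(["my dog sat on the mat"])))
```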