Deep learning for text and sequences

Introduction

Deep-learning models can process text (understood either as sequences of words or as sequences of characters), timeseries, and sequence data in general. The two most important deep-learning algorithms for sequence processing are recurrent neural networks and 1D convnets; we’ll discuss both of those approaches. Applications of these algorithms include the following:

  • Document classification and timeseries classification, such as identifying the topic of a piece of writing or the author of a book
  • Timeseries comparisons, such as estimating how closely related two documents or two stock tickers are

Working with text data

Text is one of the most widespread forms of sequence data. It can be understood as either a sequence of characters or a sequence of words, but it’s most common to work at the level of words. The deep-learning sequence-processing models introduced in the following sections can use text to produce a basic form of natural-language understanding, sufficient for applications such as document classification, sentiment analysis, author identification, and even question answering. Of course, keep in mind that none of these deep-learning models truly understands text in a human sense; rather, they map the statistical structure of written language, which is enough to solve many simple textual tasks. Deep learning for natural-language processing is pattern recognition applied to words, sentences, and paragraphs, in much the same way that computer vision is pattern recognition applied to pixels.

Like all other neural networks, deep-learning models don’t take raw text as input: they only work with numeric tensors. Vectorizing text is the process of transforming text into numeric tensors. This can be done in multiple ways:

  • Segment text into words, and transform each word into a vector.
  • Segment text into characters, and transform each character into a vector.
  • Extract n-grams of words or characters, and transform each n-gram into a vector.

N-grams are overlapping groups of multiple consecutive words or characters. Collectively, the different units into which you can break down text (words, characters, or n-grams) are called tokens, and breaking text into such tokens is called tokenization. All text-vectorization processes consist of applying some tokenization scheme and then associating numeric vectors with the generated tokens. These vectors, packed into sequence tensors, are fed into deep neural networks.
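To make this concrete, here is a minimal sketch of word-level tokenization and one-hot vectorization using the Keras Tokenizer utility (the two sample sentences and the 1,000-word vocabulary cap are arbitrary choices made up for illustration):

from tensorflow.keras.preprocessing.text import Tokenizer

# Two toy samples, invented purely for illustration
samples = ["The cat sat on the mat.", "The dog ate my homework."]

# Keep only the 1,000 most frequent words
tokenizer = Tokenizer(num_words=1000)
tokenizer.fit_on_texts(samples)          # tokenization: builds the word index

sequences = tokenizer.texts_to_sequences(samples)            # lists of integer tokens
one_hot = tokenizer.texts_to_matrix(samples, mode="binary")  # one-hot document vectors

print(sequences)
print(one_hot.shape)  # (2, 1000): one numeric vector per sample

The same integer sequences could instead be fed to an embedding layer; the point is simply that the text has been turned into numeric tensors that a network can consume.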

Understanding n-grams and bag-of-words

Word n-grams are groups of N (or fewer) consecutive words that you can extract from a sentence. The same concept can also be applied to characters instead of words. Here’s a simple example. Consider the sentence “The cat sat on the mat.” It may be decomposed into the following set of 2-grams:

{“The”, “The cat”, “cat”, “cat sat”, “sat”, “sat on”, “on”, “on the”, “the”, “the mat”, “mat”}

It may also be decomposed into the following set of 3-grams:

{“The”, “The cat”, “cat”, “cat sat”, “The cat sat”, “sat”, “sat on”, “on”, “cat sat on”, “on the”, “the”, “sat on the”, “the mat”, “mat”, “on the mat”}
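As a rough illustration, both decompositions above can be reproduced with a few lines of plain Python (the helper bag_of_ngrams is a made-up name for this sketch; a real tokenizer would also deal with punctuation and casing):

def bag_of_ngrams(text, max_n):
    """Return the set of all n-grams of max_n or fewer consecutive words."""
    words = text.split()
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            grams.add(" ".join(words[i:i + n]))
    return grams

sentence = "The cat sat on the mat"
print(bag_of_ngrams(sentence, 2))  # the bag-of-2-grams shown above
print(bag_of_ngrams(sentence, 3))  # the bag-of-3-grams shown above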

Such a set is called a bag-of-2-grams or bag-of-3-grams, respectively. The term bag here refers to the fact that you’re dealing with a set of tokens rather than a list or sequence: the tokens have no specific order. This family of tokenization methods is called bag-of-words. Because bag-of-words isn’t an order-preserving tokenization method, it tends to be used in shallow language-processing models rather than in deep-learning models. Extracting n-grams is a form of feature engineering, and deep learning does away with this kind of rigid, brittle approach, replacing it with hierarchical feature learning. One-dimensional convnets and recurrent neural networks are capable of learning representations for groups of words and characters without being explicitly told about the existence of such groups, by looking at continuous word or character sequences. For this reason, we won’t cover n-grams any further. But do keep in mind that they’re a powerful, unavoidable feature-engineering tool when using lightweight, shallow text-processing models such as logistic regression and random forests.
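For example, a lightweight bag-of-n-grams pipeline might look like the following sketch, built on scikit-learn’s CountVectorizer and LogisticRegression (the tiny labeled corpus is invented purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["I loved this movie", "great film, would watch again",
         "terrible acting", "I hated every minute of it"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# Extract all 1-grams and 2-grams as features (a bag-of-2-grams)
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

model = LogisticRegression().fit(X, labels)
print(model.predict(vectorizer.transform(["what a great movie"])))

Here the n-grams are hand-crafted features; the deep-learning models discussed next learn such groupings from raw word or character sequences instead.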
