Natural Language Processing Text Vectorization Approaches

Introduction

Natural Language Processing only works once natural language, whether text or audio, has been transformed into numerical form. Text vectorization approaches are a good fit for traditional machine learning algorithms because they convert text into numeric feature vectors. Bag of Words and tf-idf vectorization are two common techniques for this purpose. In this article, we look at these approaches in detail.

Description

Text has to be transformed into something a machine can understand. The process of converting words into numbers is known as vectorization. The three most widely used methods for converting text into numeric feature vectors are:

  • Bag of Words
  • tf-idf vectorization
  • Word embeddings

Important terms we need to understand before going into detail

  • Document: a single text data point, e.g. a product review
  • Corpus: the collection of all the documents
  • Feature: every unique word in the corpus is a feature

Bag of Words

Bag of Words is a Natural Language Processing method for representing text. It is a way of extracting features from text data. The approach is simple and flexible.

A bag of words is a representation of text that describes the occurrence of words within a document. We only keep track of word counts and ignore grammatical details and word order. It is called a "bag" of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document they occur.

  • The idea behind this technique is simple, yet very powerful.
  • First, we define a fixed-length vector in which each entry corresponds to a word in a pre-defined dictionary of words.
  • The size of the vector equals the size of the dictionary.
  • We count how many times each dictionary word appears in the text.
  • We put this count in the corresponding vector entry.

Example:

  • Suppose the dictionary contains the words {MonkeyLearn, is, the, not, great}.
  • We want to vectorize the text "MonkeyLearn is great".
  • The resulting vector is (1, 1, 0, 0, 1).
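
The counting steps described above can be sketched in a few lines of plain Python. This is only a minimal illustration: the helper name bag_of_words and the choice to lower-case and split on word characters are assumptions made for the example, not part of any library.

```python
import re

def bag_of_words(text, dictionary):
    """Count how often each dictionary word occurs in the text.

    Assumption for this sketch: matching is case-insensitive and
    tokens are runs of word characters.
    """
    tokens = re.findall(r"\w+", text.lower())
    return [tokens.count(word.lower()) for word in dictionary]

dictionary = ["MonkeyLearn", "is", "the", "not", "great"]
vector = bag_of_words("MonkeyLearn is great", dictionary)
print(vector)  # [1, 1, 0, 0, 1]
```

The vector always has one entry per dictionary word, so every text maps to a vector of the same length regardless of how long the text is.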

We can use more advanced methods to improve this representation. The main problem with Bag of Words, however, is that it does not capture the meaning of the text, even when n-grams are used.
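
To make the role of n-grams more concrete, here is a small sketch in plain Python. The two sentences and the helper ngram_counts are made up for illustration. Both sentences have identical single-word counts, while their bigram counts differ, which shows that n-grams recover some local word order even though they still do not model meaning.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count the n-grams (runs of n consecutive tokens) in a text."""
    tokens = text.lower().split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

a = "the dog bites the man"
b = "the man bites the dog"

# Same bag of single words, so plain Bag of Words cannot tell them apart
same_unigrams = ngram_counts(a, 1) == ngram_counts(b, 1)
# Different bigrams, e.g. "dog bites" versus "man bites"
same_bigrams = ngram_counts(a, 2) == ngram_counts(b, 2)

print(same_unigrams, same_bigrams)  # True False
```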

Text vectorization transformed by Deep Learning

  • One solution to this problem was to find vector representations of individual words.
  • This approach became very popular with word2vec.
  • Tomas Mikolov and a research team at Google introduced this model in 2013.
  • It showed that a neural network can learn good vector representations of words from huge amounts of data.
  • These vectors have some desirable properties, such as being able to do arithmetic with them.
  • They are suitable for many NLP tasks, as each dimension encodes a different property of the word.
  • The next step is to obtain a vector for a complete sentence rather than for a single word.
  • This is very valuable for tasks such as text classification.
  • This problem has not been fully solved so far.
  • There have been some important developments in the last few years, such as Skip-Thought Vectors.
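
The idea of "doing math" with word vectors can be illustrated with a toy sketch. The vectors below are hand-made for this example and are not taken from word2vec or any trained model. The three dimensions are hypothetical, and real embeddings have hundreds of dimensions that are not individually interpretable in this clean way.

```python
import math

# Hand-made toy vectors (NOT from a trained model).
# Hypothetical dimensions: (royalty, maleness, femaleness)
toy_vectors = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.1, 0.1, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", toy_vectors)
print(result)  # queen
```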

Skip-Thought Vectors

  • Skip-Thought Vectors were developed at the University of Toronto.
  • The key idea behind the algorithm is as follows.
  • Just as we can obtain good word vectors by using a neural network that tries to predict the words surrounding a given word, the authors use a neural network to predict the sentences adjacent to a given sentence.
  • This requires vast amounts of contiguous text data.
  • They found it in the BookCorpus dataset, a collection of free books written by as yet unpublished authors.
  • In their paper they show that these sentence vectors can serve as a very robust text representation.
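
The parallel between the two training setups can be sketched by looking only at how the training pairs are built. The helper context_pairs and the sample text are made up for this illustration, and the neural network that would be trained on these pairs is left out entirely.

```python
def context_pairs(items, window=1):
    """Pair each item with its neighbours within `window` positions."""
    pairs = []
    for i, target in enumerate(items):
        lo = max(0, i - window)
        hi = min(len(items), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, items[j]))
    return pairs

# Word level: each word is paired with the words around it
words = "the cat sat".split()
word_pairs = context_pairs(words)
print(word_pairs)
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]

# Sentence level: each sentence is paired with its adjacent sentences
sentences = ["I got home.", "I was tired.", "I went to bed."]
sentence_pairs = context_pairs(sentences)
print(sentence_pairs)
```

The same function serves both levels, which is why contiguous text matters: shuffled sentences would produce meaningless neighbour pairs.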


Bag of Words Model creation with Sklearn

  • We can use the CountVectorizer class from the scikit-learn library to implement the Bag of Words model in Python:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

sentence_1 = "This is a good job. I will not miss it for anything"
sentence_2 = "This is not good at all"

# ngram_range=(1, 1) uses single words; use ngram_range=(2, 2) for bigrams
# stop_words='english' removes common English words before counting
CountVec = CountVectorizer(ngram_range=(1, 1),
                           stop_words='english')
# learn the vocabulary and transform the sentences into count vectors
Count_data = CountVec.fit_transform([sentence_1, sentence_2])
# create a dataframe with one column per feature
# (get_feature_names_out replaces get_feature_names in recent scikit-learn versions)
cv_dataframe = pd.DataFrame(Count_data.toarray(), columns=CountVec.get_feature_names_out())
print(cv_dataframe)

Output

[Image in the original: Bag of Words Model creation with Sklearn]

Tf-idf Vectorization

  • The Bag of Words technique treats all words equally.
  • It cannot distinguish very common words from rare ones.
  • Tf-idf addresses this shortcoming of Bag of Words vectorization.
  • tf-idf stands for term frequency-inverse document frequency.
  • It is a measure of the importance of a word that depends on how often the word occurs in a document and in the corpus.
  • To understand tf-idf, we look at term frequency and inverse document frequency separately.
  • Term frequency measures how frequently a word occurs in a document.
  • tf(word) = number of times the word appears in the document / total number of words in the document
  • For example, in the document “Cat loves to play with a ball.” the term frequency of the word cat is: tf(cat) = 1 / 6. This count of six words leaves out the one-letter word “a”; with it, the denominator would be 7.
  • Inverse document frequency measures the importance of a word.
  • It is based on how common a specific word is across all the documents in the corpus.
  • idf(word) = log(total number of documents / number of documents containing the word)
  • The idea is to identify how common or rare a word is.
  • For instance, the words ‘is’ and ‘and’ are very common and would be present in nearly every document.
  • Suppose the word ‘is’ is present in every document of a corpus of 1000 documents.
  • Its idf would be: idf(‘is’) = log(1000 / 1000) = log 1 = 0
  • Common words therefore receive lower importance.
  • Similarly, the word cat appears in both documents of the two-document corpus used in the code further down, so idf(cat) = log(2 / 2) = 0 in our illustration.
  • The tf-idf value is the product of the tf and idf values.
  • tf-idf(cat for document 2) = tf(cat) * idf(cat) = 1 / 6 * 0 = 0
  • As a result, the tf-idf of cat for document 2 is 0.
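
The formulas above can be checked with a short plain-Python sketch. It uses the same two documents as the scikit-learn example in this article. Two assumptions are made for illustration: one-letter tokens such as "a" are dropped so that document 2 has six words, and the natural logarithm is used (the base does not matter for log 1 = 0). The helper names are hypothetical.

```python
import math

def tokenize(text):
    # Assumption: lower-case, split on whitespace, drop one-letter tokens like "a"
    return [t for t in text.lower().split() if len(t) > 1]

def tf(word, doc_tokens):
    """Term frequency: occurrences of the word / total words in the document."""
    return doc_tokens.count(word) / len(doc_tokens)

def idf(word, corpus_tokens):
    """Inverse document frequency; the word must occur in at least one document."""
    containing = sum(1 for doc in corpus_tokens if word in doc)
    return math.log(len(corpus_tokens) / containing)

corpus = [
    "Dog hates a cat It loves to go out and play",
    "Cat loves to play with a ball",
]
tokens = [tokenize(doc) for doc in corpus]

# "cat" occurs in both documents, so its idf and tf-idf are 0
tfidf_cat = tf("cat", tokens[1]) * idf("cat", tokens)
# "ball" occurs only in document 2, so it gets a positive weight
tfidf_ball = tf("ball", tokens[1]) * idf("ball", tokens)

print(tfidf_cat)   # 0.0
print(tfidf_ball)  # (1 / 6) * ln(2), roughly 0.1155
```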

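Below is a Python implementation of tf-idf vectorization using scikit-learn. Note that scikit-learn uses a smoothed idf and normalizes each document vector, so its values differ from the plain formulas above.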

from sklearn.feature_extraction.text import TfidfVectorizer

document1 = 'Dog hates a cat It loves to go out and play'
document2 = 'Cat loves to play with a ball'

# convert the sentences to lower case
document1 = document1.lower()
document2 = document2.lower()

# initialize TfidfVectorizer
tfidf_vect = TfidfVectorizer()
# fit the vectorizer to the corpus
tfidf_vect.fit([document1, document2])

# get_feature_names_out replaces get_feature_names in recent scikit-learn versions
print("feature names: ", tfidf_vect.get_feature_names_out().tolist())

# tf-idf representation of document1
tfidf1 = tfidf_vect.transform([document1])
print("Representation of document1: ", tfidf1.toarray())

# tf-idf representation of document2
tfidf2 = tfidf_vect.transform([document2])
print("Representation of document2: ", tfidf2.toarray())

# Output (values rounded):
# feature names:  ['and', 'ball', 'cat', 'dog', 'go', 'hates', 'it', 'loves', 'out', 'play', 'to', 'with']
#
# Representation of document1:
# [[0.35 0. 0.25 0.35 0.35 0.35 0.35 0.25 0.35 0.25 0.25 0.]]
#
# Representation of document2:
# [[0. 0.498 0.35 0. 0. 0. 0. 0.35 0. 0.35 0.35 0.498]]
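
In this output the weight of cat is 0.25 and 0.35 rather than 0. A plain-Python sketch can show where such numbers come from, assuming the default TfidfVectorizer behaviour: raw counts as term frequency, the smoothed idf ln((1 + n) / (1 + df)) + 1, L2 normalization of each document vector, and a tokenizer that ignores one-letter tokens. The variable and function names below are chosen for illustration only.

```python
import math
from collections import Counter

docs = [
    "dog hates a cat it loves to go out and play",
    "cat loves to play with a ball",
]
# Assumption: like the default tokenizer, ignore one-letter tokens such as "a"
tokenized = [[t for t in doc.split() if len(t) > 1] for doc in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency and smoothed idf: ln((1 + n) / (1 + df)) + 1
df = {t: sum(t in doc for doc in tokenized) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc_tokens):
    """Raw count times smoothed idf, then scaled to unit Euclidean length."""
    counts = Counter(doc_tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

rows = [tfidf_row(doc) for doc in tokenized]
for row in rows:
    print([round(x, 3) for x in row])
```

Because of the "+ 1" in the smoothed idf, a word that occurs in every document keeps a non-zero weight, which is why cat does not drop to 0 here as it does with the plain formula.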
