Text Classification with the Naive Bayes Classifier

Introduction

In this post, we are going to discuss how to classify text using a Naive Bayes classifier. Naive Bayes classifiers are a family of classification algorithms based on Bayes' Theorem. It is not a single algorithm but a group of algorithms that all share a common principle: each pair of features being classified is assumed to be independent of the others.
The Naive Bayes classifier is a simple classifier that makes its decisions based on the probabilities of events. It is most commonly applied to text classification. Despite being a simple algorithm, it works well in many text classification problems and can deliver accurate results without much training data.
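Concretely, for a document made up of the words w1, ..., wn, the classifier picks the class c that maximizes P(c) · P(w1 | c) · P(w2 | c) · ... · P(wn | c); treating each word as independent of the others given the class is the "naive" assumption that gives the algorithm its name.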

Text Classification

Text classification is the process of categorizing text into organized groups. Text classifiers can automatically analyze text and then assign a set of pre-defined tags or well-defined categories based on its content, using Natural Language Processing (NLP).

Text Classification Model

  • Machine learning requires us to build classification models based on past observations.
  • We give the machine a set of data containing texts with labels attached to them.
  • Then, we let the model learn from all of this data so that it can later give us useful insight into the categories of the text input we feed it.


General Workflow

Description

Now, we are going to take a list of sentences and classify them based on the user's sentiment. We want to tell whether each sentence carries a positive or a negative sentiment.
Downloading the Data

  • We could just open the browser and download the CSV files into a local folder.
  • Then we could load the files into DataFrames with the help of pandas.
  • However, using Python to download the files is preferable to using the browser.
  • We find a link to the compressed file and follow the steps below to get the data.
  • First, create a folder to store the downloaded data in.
  • The following code checks whether the required folder exists.
  • If it is not there, it creates it in the current working directory:
import os

data_dir = f'{os.getcwd()}/data'
if not os.path.exists(data_dir):
    os.mkdir(data_dir)
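  • Alternatively, os.makedirs(data_dir, exist_ok=True) achieves the same in one call, and also creates any missing parent folders.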
  • Then we need to install the requests library using pip, as we will use it to download the data:
pip install requests
  • Then, we download the compressed data as follows:
import requests

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00331/sentiment labelled sentences.zip'
response = requests.get(url)
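  • Before uncompressing, it is worth confirming the download succeeded; requests' raise_for_status() raises an exception for HTTP error responses such as 404:
response.raise_for_status()
print(f'Downloaded {len(response.content)} bytes')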
  • Now, we can uncompress the data and store it in the data folder we have just created.
  • We will use the zipfile module to uncompress our data.
  • The ZipFile method expects a file object to read from.
  • Thus, we use BytesIO to turn the content of the response into a file-like object.
  • Then, we extract the content of the zip file into our folder as follows:
import zipfile
from io import BytesIO

with zipfile.ZipFile(file=BytesIO(response.content), mode='r') as compressed_file:
    compressed_file.extractall(data_dir)
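  • To confirm the extraction worked, we can list the extracted folder; the archive unpacks into a 'sentiment labelled sentences' subfolder, which the next step relies on:
import os
print(os.listdir(f'{data_dir}/sentiment labelled sentences'))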
  • Now the data is written into three separate files in the data folder.
  • We will load each of the three files into a separate data frame.
  • Then, we will combine the three data frames into one data frame as follows:
import pandas as pd

df_list = []
for csv_file in ['imdb_labelled.txt', 'yelp_labelled.txt', 'amazon_cells_labelled.txt']:
    csv_file_with_path = f'{data_dir}/sentiment labelled sentences/{csv_file}'
    # The files have no header row, so we supply the column names ourselves
    temp_df = pd.read_csv(
        csv_file_with_path,
        sep="\t", header=None,
        names=['text', 'sentiment']
    )
    df_list.append(temp_df)
df = pd.concat(df_list)
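  • Note that pd.concat keeps each source frame's original row indices, so index values repeat across the three files; passing ignore_index=True to pd.concat would renumber the combined rows if a unique index is needed.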
  • We can display the distribution of the sentiment labels using the following code:
explode = [0.05, 0.05]
colors = ['#777777', '#111111']
df['sentiment'].value_counts().plot(
    kind='pie', colors=colors, explode=explode
)
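  • When running this as a plain script rather than in a Jupyter notebook, matplotlib needs to be told to render the figure:
import matplotlib.pyplot as plt
plt.show()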
  • We can also display a few sample sentences using the following code, after tweaking pandas' settings to display more characters per cell:
pd.options.display.max_colwidth = 90
df[['text', 'sentiment']].sample(5, random_state=42)

Data Preparation

Now we need to prepare the data for our classifier to use:

  • We start by splitting the DataFrame into training and testing sets.
  • We keep 40% of the data set for testing:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size=0.4, random_state=42)
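  • If the two classes were noticeably imbalanced, passing stratify=df['sentiment'] to train_test_split would keep the label proportions the same in both splits.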
  • Then, get the labels from the sentiment column as follows:
y_train = df_train['sentiment']
y_test = df_test['sentiment']
  • Because the features are textual, we need to convert them into numerical features, which we do using CountVectorizer.
  • We will include unigrams as well as bigrams and trigrams.
  • We will also ignore rare words by setting min_df to 3, excluding words that occur in fewer than three documents.
  • This is a useful practice for removing spelling mistakes and noisy tokens.
  • Finally, we will strip accents from letters and convert them to ASCII:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1, 3), min_df=3, strip_accents='ascii')
x_train = vec.fit_transform(df_train['text'])
x_test = vec.transform(df_test['text'])
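  • We can inspect how many n-gram features the vectorizer kept after applying min_df; vocabulary_ holds the fitted term-to-index mapping:
print(len(vec.vocabulary_))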
  • Finally, we use the Naive Bayes classifier to classify our data.
  • We set fit_prior=True so that the model uses the distribution of the class labels in the training data as its prior:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB(fit_prior=True)
clf.fit(x_train, y_train)
y_test_pred = clf.predict(x_test)
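  • As a first sanity check, we can compute the overall accuracy on the test set:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_test_pred))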

This time, our good old accuracy score may not be informative enough. We want to know how accurate the model is per class. Furthermore, depending on our use case, we may need to tell whether the model was able to identify all the negative sentences, even if it did so at the expense of misclassifying some positive ones. To get this information, we need to use the precision and recall scores.

Precision, Recall, and F1 Score

Out of the samples that were assigned to the positive class, the share of them that were actually positive is the precision of this class. Out of the samples that are actually positive, the share of them that the classifier correctly predicted to be positive is the recall for this class. As we can see, precision and recall are calculated per class. Here is how we formally express the precision score in terms of true positives (TP) and false positives (FP):

Precision = TP / (TP + FP)
The recall score is expressed in terms of true positives (TP) and false negatives (FN):

Recall = TP / (TP + FN)

To summarize the two previous scores into one number, the F1 score can be used. It combines the precision and recall scores using the following formula:

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
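For example, a class with a precision of 0.80 and a recall of 0.60 gets an F1 score of 2 × (0.80 × 0.60) / (0.80 + 0.60) ≈ 0.69: the F1 score is the harmonic mean of precision and recall, so it always sits between the two and is pulled towards the lower one.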

Here we calculate the three aforementioned metrics for our classifier:

from sklearn.metrics import precision_recall_fscore_support

p, r, f, s = precision_recall_fscore_support(y_test, y_test_pred)
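Alternatively, scikit-learn can print the same per-class metrics as a ready-made text table:

from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_pred))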
For clarity, we have put the resulting metrics into the following table. Always keep in mind that the support is simply the number of samples in each class:

(Table: per-class precision, recall, and F1 score, along with the support for each class.)
We get similar scores for the two classes because their sizes are almost equal. In cases where the classes are imbalanced, it is more common to see one class achieving higher precision or higher recall than the other.
