
Naive Bayes Classifier (NBC) of text dataset

serispoorthi

Updated: Apr 19, 2022

The dataset contains around 200k news headlines from 2012 to 2018, obtained from HuffPost. A model trained on this dataset could be used to identify tags for untracked news articles or to characterize the type of language used in different news articles.


Each news headline has a corresponding category. Categories and corresponding article counts are as follows:

  • POLITICS: 32739

  • WELLNESS: 17827

  • ENTERTAINMENT: 16058

  • TRAVEL: 9887

  • STYLE & BEAUTY: 9649

  • PARENTING: 8677

  • HEALTHY LIVING: 6694

  • QUEER VOICES: 6314

  • FOOD & DRINK: 6226

  • BUSINESS: 5937

  • COMEDY: 5175

  • SPORTS: 4884

  • BLACK VOICES: 4528

  • HOME & LIVING: 4195

  • PARENTS: 3955

  • THE WORLDPOST: 3664

  • WEDDINGS: 3651

  • WOMEN: 3490

  • IMPACT: 3459

  • DIVORCE: 3426

  • CRIME: 3405

  • MEDIA: 2815

  • WEIRD NEWS: 2670

  • GREEN: 2622

  • WORLDPOST: 2579

  • RELIGION: 2556

  • STYLE: 2254

  • SCIENCE: 2178

  • WORLD NEWS: 2177

  • TASTE: 2096

  • TECH: 2082

  • MONEY: 1707

  • ARTS: 1509

  • FIFTY: 1401

  • GOOD NEWS: 1398

  • ARTS & CULTURE: 1339

  • ENVIRONMENT: 1323

  • COLLEGE: 1144

  • LATINO VOICES: 1129

  • CULTURE & ARTS: 1030

  • EDUCATION: 1004

We divide the dataset into train, development, and test splits.

Build the vocabulary as a list.

  • [‘the’, ‘I’, ‘happy’, … ] (omit rare words, e.g. those occurring fewer than five times)

  • A reverse index mapping each word to its position may be handy: {“the”: 0, “I”: 1, “happy”: 2, … }
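The vocabulary-building step above can be sketched as follows (a minimal sketch; `build_vocab`, its arguments, and the default cutoff of five occurrences are illustrative assumptions, and documents are assumed to be already tokenized into lists of words):

```python
from collections import Counter

def build_vocab(tokenized_docs, min_count=5):
    """Build a vocabulary list, omitting rare words, plus a reverse index.

    tokenized_docs: iterable of token lists, e.g. [["the", "cat"], ...]
    min_count: drop words occurring fewer than this many times.
    """
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    vocab = [w for w, c in counts.items() if c >= min_count]
    # Reverse index: {"the": 0, "I": 1, "happy": 2, ...}
    word2idx = {w: i for i, w in enumerate(vocab)}
    return vocab, word2idx
```

The reverse index lets later steps map a word to its count/probability slot in constant time.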

We calculate the following probabilities:

  • Probability of occurrence

    • P[“the”] = number of documents containing “the” / number of all documents

  • Conditional probability given the class

    • P[“the” | Positive] = number of positive documents containing “the” / number of all positive documents
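These document-level probabilities can be sketched directly from their definitions (hypothetical helper names; each document is assumed to be a set or list of words):

```python
def doc_prob(word, docs):
    """P[word] = fraction of all documents containing the word."""
    return sum(word in d for d in docs) / len(docs)

def cond_prob(word, docs, labels, cls):
    """P[word | cls] = fraction of cls-labeled documents containing the word."""
    cls_docs = [d for d, y in zip(docs, labels) if y == cls]
    return sum(word in d for d in cls_docs) / len(cls_docs)
```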

We calculate accuracy using the development set.

  • Conduct five-fold cross validation
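Five-fold cross validation only needs the index splits; a minimal pure-Python sketch (the `kfold_indices` helper is an illustrative assumption, not from the original post):

```python
def kfold_indices(n, k=5):
    """Yield (train_idx, dev_idx) pairs for k-fold cross validation.

    n: number of examples; k: number of folds. The last fold absorbs
    any remainder when n is not divisible by k.
    """
    fold_size = n // k
    idx = list(range(n))
    for f in range(k):
        if f < k - 1:
            dev = idx[f * fold_size:(f + 1) * fold_size]
        else:
            dev = idx[f * fold_size:]
        dev_set = set(dev)
        train = [i for i in idx if i not in dev_set]
        yield train, dev
```

Train a classifier on each `train` split, score it on the matching `dev` split, and average the five accuracies.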

To assess and compare the effect of smoothing:

  • Derive the top 10 words that predict the positive and negative classes

    • P[Positive | word]
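Ranking words by P[Positive | word] can be sketched with Bayes' rule (the helper name and its inputs are hypothetical; the per-class word probabilities and class priors are assumed to be precomputed):

```python
def top_predictive_words(word_probs_pos, word_probs_neg, p_pos, p_neg, n=10):
    """Rank words by P[Positive | word] via Bayes' rule.

    word_probs_pos / word_probs_neg: dicts of P[word | class].
    p_pos / p_neg: class priors.
    """
    scores = {}
    for w in word_probs_pos:
        num = word_probs_pos[w] * p_pos
        den = num + word_probs_neg.get(w, 0.0) * p_neg
        scores[w] = num / den if den else 0.0
    return sorted(scores, key=scores.get, reverse=True)[:n]
```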

Finally, evaluate on the test dataset using the optimal hyperparameters found in the previous steps, and report the final accuracy.


Objective


Classification of news headlines is a text classification task. We use a dataset from the Kaggle platform containing news headlines from past years and apply a Naive Bayes classifier to predict each headline's news category. Feature extraction, feature selection, and evaluation are all part of the text classification pipeline.


Naive Bayes Classifier ---> P(A|B) = P(B|A) · P(A) / P(B)
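As a small numeric check of the formula (the numbers here are made up purely for illustration):

```python
# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)
p_a = 0.3          # prior, e.g. P(POLITICS)
p_b_given_a = 0.2  # likelihood, e.g. P("election" | POLITICS)
p_b = 0.1          # evidence, e.g. P("election")

# posterior, e.g. P(POLITICS | "election")
p_a_given_b = p_b_given_a * p_a / p_b
```

Naive Bayes applies this per word, multiplying the per-word likelihoods under the "naive" conditional-independence assumption.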


NLTK Tokenizer Package

Tokenizers divide strings into lists of substrings; they can be used to find the words and punctuation in a string:




We perform category filtering by picking the top 5 categories.
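With pandas, picking the top 5 categories might look like this (a sketch on a toy frame; the real Kaggle dataset ships as JSON lines with a "category" field):

```python
import pandas as pd

# Toy stand-in for the HuffPost data: category labels only.
cats = (["POLITICS"] * 4 + ["WELLNESS"] * 3 + ["ENTERTAINMENT"] * 3
        + ["TRAVEL"] * 2 + ["COMEDY"] * 2 + ["SPORTS"])
df = pd.DataFrame({"category": cats})

# Keep only rows whose category is among the 5 most frequent.
top5 = df["category"].value_counts().nlargest(5).index
df_top5 = df[df["category"].isin(top5)]
```

On the real dataset this keeps POLITICS, WELLNESS, ENTERTAINMENT, TRAVEL, and STYLE & BEAUTY, per the counts listed above.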









We compute category-wise probabilities (the prior probability of each category).
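The category-wise (prior) probabilities can be sketched as:

```python
from collections import Counter

def category_priors(labels):
    """P(category) = count(category) / total number of documents."""
    counts = Counter(labels)
    total = len(labels)
    return {c: n / total for c, n in counts.items()}
```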






Then, we compute word-wise probabilities on the data (the likelihood of each word given a category).
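The word-wise conditional probabilities, with Laplace (add-alpha) smoothing as discussed above, can be sketched as (hypothetical helper name; documents are assumed to be already tokenized):

```python
from collections import Counter, defaultdict

def word_likelihoods(tokenized_docs, labels, vocab, alpha=1.0):
    """P(word | category) with Laplace (add-alpha) smoothing.

    Returns {category: {word: probability}} over the given vocabulary.
    """
    word_counts = defaultdict(Counter)  # per-category word counts
    totals = Counter()                  # per-category token totals
    for doc, y in zip(tokenized_docs, labels):
        for tok in doc:
            if tok in vocab:
                word_counts[y][tok] += 1
                totals[y] += 1
    v = len(vocab)
    return {
        y: {w: (word_counts[y][w] + alpha) / (totals[y] + alpha * v)
            for w in vocab}
        for y in set(labels)
    }
```

Smoothing keeps unseen words from zeroing out the whole product of likelihoods.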









NLTK contains a module called tokenize, which provides two commonly used methods:

  • Word tokenize: We use word_tokenize() method to split a sentence into tokens or words

  • Sentence tokenize: We use sent_tokenize() method to split a paragraph into sentences



Naive Bayes Classifier - https://colab.research.google.com/drive/1vyioeWZNZOZWfc5BPCf3KGHyGzhf5k0E#scrollTo=54a84fa3


