
Naive Bayes Classifier (NBC) of text dataset

serispoorthi

Updated: Apr 19, 2022

The dataset contains around 200k news headlines from 2012 to 2018, obtained from HuffPost. A model trained on this dataset could be used to identify tags for untracked news articles or to characterize the type of language used in different news articles.


Each news headline has a corresponding category. Categories and corresponding article counts are as follows:

  • POLITICS: 32739

  • WELLNESS: 17827

  • ENTERTAINMENT: 16058

  • TRAVEL: 9887

  • STYLE & BEAUTY: 9649

  • PARENTING: 8677

  • HEALTHY LIVING: 6694

  • QUEER VOICES: 6314

  • FOOD & DRINK: 6226

  • BUSINESS: 5937

  • COMEDY: 5175

  • SPORTS: 4884

  • BLACK VOICES: 4528

  • HOME & LIVING: 4195

  • PARENTS: 3955

  • THE WORLDPOST: 3664

  • WEDDINGS: 3651

  • WOMEN: 3490

  • IMPACT: 3459

  • DIVORCE: 3426

  • CRIME: 3405

  • MEDIA: 2815

  • WEIRD NEWS: 2670

  • GREEN: 2622

  • WORLDPOST: 2579

  • RELIGION: 2556

  • STYLE: 2254

  • SCIENCE: 2178

  • WORLD NEWS: 2177

  • TASTE: 2096

  • TECH: 2082

  • MONEY: 1707

  • ARTS: 1509

  • FIFTY: 1401

  • GOOD NEWS: 1398

  • ARTS & CULTURE: 1339

  • ENVIRONMENT: 1323

  • COLLEGE: 1144

  • LATINO VOICES: 1129

  • CULTURE & ARTS: 1030

  • EDUCATION: 1004

We divide the dataset into train, development, and test splits.

Build the vocabulary as a list.

  • [‘the’, ‘I’, ‘happy’, … ] (omit rare words, e.g. those occurring fewer than five times)

  • A reverse index mapping each word to its position may be handy: {“the”: 0, “I”: 1, “happy”: 2, … }
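The vocabulary-building step above can be sketched as follows (a minimal sketch; `build_vocab`, its arguments, and the default cutoff of five occurrences are illustrative assumptions, and documents are assumed to be already tokenized into lists of words):

```python
from collections import Counter

def build_vocab(tokenized_docs, min_count=5):
    """Build a vocabulary list, omitting rare words, plus a reverse index.

    tokenized_docs: iterable of token lists, e.g. [["the", "cat"], ...]
    min_count: drop words occurring fewer than this many times.
    """
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    vocab = [w for w, c in counts.items() if c >= min_count]
    # Reverse index: {"the": 0, "I": 1, "happy": 2, ...}
    word2idx = {w: i for i, w in enumerate(vocab)}
    return vocab, word2idx
```

The reverse index lets later steps map a word to its count/probability slot in constant time.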

We calculate the following probabilities:

  • Probability of occurrence

    • P[“the”] = number of documents containing “the” / number of all documents

  • Conditional probability given the class

    • P[“the” | Positive] = number of positive documents containing “the” / number of all positive documents
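These document-level probabilities can be sketched directly from their definitions (hypothetical helper names; each document is assumed to be a set or list of words):

```python
def doc_prob(word, docs):
    """P[word] = fraction of all documents containing the word."""
    return sum(word in d for d in docs) / len(docs)

def cond_prob(word, docs, labels, cls):
    """P[word | cls] = fraction of cls-labeled documents containing the word."""
    cls_docs = [d for d, y in zip(docs, labels) if y == cls]
    return sum(word in d for d in cls_docs) / len(cls_docs)
```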

We calculate accuracy using the development set.

  • Conduct five-fold cross validation
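Five-fold cross validation only needs the index splits; a minimal pure-Python sketch (the `kfold_indices` helper is an illustrative assumption, not from the original post):

```python
def kfold_indices(n, k=5):
    """Yield (train_idx, dev_idx) pairs for k-fold cross validation.

    n: number of examples; k: number of folds. The last fold absorbs
    any remainder when n is not divisible by k.
    """
    fold_size = n // k
    idx = list(range(n))
    for f in range(k):
        if f < k - 1:
            dev = idx[f * fold_size:(f + 1) * fold_size]
        else:
            dev = idx[f * fold_size:]
        dev_set = set(dev)
        train = [i for i in idx if i not in dev_set]
        yield train, dev
```

Train a classifier on each `train` split, score it on the matching `dev` split, and average the five accuracies.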

To assess and compare the effect of smoothing:

  • Derive the top 10 words that predict the positive and negative classes

    • P[Positive | word]
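Ranking words by P[Positive | word] can be sketched with Bayes' rule (the helper name and its inputs are hypothetical; the per-class word probabilities and class priors are assumed to be precomputed):

```python
def top_predictive_words(word_probs_pos, word_probs_neg, p_pos, p_neg, n=10):
    """Rank words by P[Positive | word] via Bayes' rule.

    word_probs_pos / word_probs_neg: dicts of P[word | class].
    p_pos / p_neg: class priors.
    """
    scores = {}
    for w in word_probs_pos:
        num = word_probs_pos[w] * p_pos
        den = num + word_probs_neg.get(w, 0.0) * p_neg
        scores[w] = num / den if den else 0.0
    return sorted(scores, key=scores.get, reverse=True)[:n]
```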

Finally, evaluate on the test dataset using the optimal hyperparameters found in the previous steps, and report the final accuracy.


Objective


Classification of news headlines is a text classification task. We use a dataset from the Kaggle platform containing news headlines from past years and apply a Naive Bayes classifier to predict each headline's news category. Feature extraction, feature selection, and evaluation are all part of the text classification pipeline.


Naive Bayes Classifier ---> P(A|B) = P(B|A) · P(A) / P(B)
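As a small numeric check of the formula (the numbers here are made up purely for illustration):

```python
# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)
p_a = 0.3          # prior, e.g. P(POLITICS)
p_b_given_a = 0.2  # likelihood, e.g. P("election" | POLITICS)
p_b = 0.1          # evidence, e.g. P("election")

# posterior, e.g. P(POLITICS | "election")
p_a_given_b = p_b_given_a * p_a / p_b
```

Naive Bayes applies this per word, multiplying the per-word likelihoods under the "naive" conditional-independence assumption.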


NLTK Tokenizer Package

Tokenizers divide strings into lists of substrings; they can be used to find the words and punctuation in a string:




We perform category filtering by picking the top 5 categories.
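With pandas, picking the top 5 categories might look like this (a sketch on a toy frame; the real Kaggle dataset ships as JSON lines with a "category" field):

```python
import pandas as pd

# Toy stand-in for the HuffPost data: category labels only.
cats = (["POLITICS"] * 4 + ["WELLNESS"] * 3 + ["ENTERTAINMENT"] * 3
        + ["TRAVEL"] * 2 + ["COMEDY"] * 2 + ["SPORTS"])
df = pd.DataFrame({"category": cats})

# Keep only rows whose category is among the 5 most frequent.
top5 = df["category"].value_counts().nlargest(5).index
df_top5 = df[df["category"].isin(top5)]
```

On the real dataset this keeps POLITICS, WELLNESS, ENTERTAINMENT, TRAVEL, and STYLE & BEAUTY, per the counts listed above.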









We compute category-wise probabilities (the prior probability of each category).
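The category-wise (prior) probabilities can be sketched as:

```python
from collections import Counter

def category_priors(labels):
    """P(category) = count(category) / total number of documents."""
    counts = Counter(labels)
    total = len(labels)
    return {c: n / total for c, n in counts.items()}
```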






Then, we compute word-wise probabilities on the data (the likelihood of each word given a category).
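The word-wise conditional probabilities, with Laplace (add-alpha) smoothing as discussed above, can be sketched as (hypothetical helper name; documents are assumed to be already tokenized):

```python
from collections import Counter, defaultdict

def word_likelihoods(tokenized_docs, labels, vocab, alpha=1.0):
    """P(word | category) with Laplace (add-alpha) smoothing.

    Returns {category: {word: probability}} over the given vocabulary.
    """
    word_counts = defaultdict(Counter)  # per-category word counts
    totals = Counter()                  # per-category token totals
    for doc, y in zip(tokenized_docs, labels):
        for tok in doc:
            if tok in vocab:
                word_counts[y][tok] += 1
                totals[y] += 1
    v = len(vocab)
    return {
        y: {w: (word_counts[y][w] + alpha) / (totals[y] + alpha * v)
            for w in vocab}
        for y in set(labels)
    }
```

Smoothing keeps unseen words from zeroing out the whole product of likelihoods.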









NLTK contains a module called tokenize, which provides two commonly used methods:

  • Word tokenize: We use word_tokenize() method to split a sentence into tokens or words

  • Sentence tokenize: We use sent_tokenize() method to split a paragraph into sentences



Naive Bayes Classifier - https://colab.research.google.com/drive/1vyioeWZNZOZWfc5BPCf3KGHyGzhf5k0E#scrollTo=54a84fa3


