
JOB POSTING DETECTION USING TEXT CLASSIFICATION

serispoorthi

INTRODUCTION


A recent survey by ‘Action Fraud’ found that more than 67% of people look for jobs online but are unaware of the growing number of ‘fake jobs’ or ‘job scams’ they are at risk from.

Over 700,000 jobseekers reported losing more than £500,000 through job scams – a 300% rise over the last two years.

Many people desperately looking for jobs inadvertently register on various fake sites. Mailboxes are sometimes packed with inquiries about suspicious companies, hiring agencies, and recruiters; about websites or organizations with a questionable look or content advertising jobs; or with messages from people complaining about, or regretting, having paid for what later proved to be non-existent positions.


Scammers present users with a very lucrative job opportunity and later ask for money in return, or they demand an investment from the job seeker with the promise of a job. This is a dangerous problem that can be addressed through Machine Learning techniques and Natural Language Processing (NLP).


This project uses a dataset provided on Kaggle. The dataset contains features that characterize a job posting as fraudulent or real.


DATA VIEW:



There are 3 numerical predictor columns, and the rest are textual predictor columns.

• Data shape:

There are 17,880 observations and 18 columns.

• Checking for missing values:


  • Many missing values are present in the dataset.

  • The columns 'Department' and 'Salary Range' have to be dropped because they contain 65% and 84% missing values respectively. With that many values missing, these columns cannot provide enough information for classifying job postings.

DATA VISUALISATION FOR NUMERICAL COLUMNS:


• Target variable:

The ‘fraudulent’ column indicates whether a job posting is fake (1) or real (0).



  • The number of fraudulent job postings is significantly smaller, comprising only 4.8% of all postings. This is an imbalanced dataset.
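
A quick check of the class balance in pandas (a minimal sketch; the file name and the 'fraudulent' column are assumed to match the Kaggle dataset):

```python
import pandas as pd

# Load the Kaggle dataset (file name assumed; adjust to your local copy)
df = pd.read_csv("fake_job_postings.csv")

# Class balance of the target column
counts = df["fraudulent"].value_counts()
print(counts)
print((counts / len(df) * 100).round(1))  # roughly 95.2% real vs 4.8% fake
```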




TEXTUAL DATA PRE-PROCESSING:

Pre-processing the dataset is an essential step before model building. For textual data, pre-processing is done by removing punctuation and special characters, cleaning the text, removing stop words, and applying lemmatization, as sketched below.
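
A minimal sketch of this cleaning step, assuming NLTK for stop-word removal and lemmatization (the exact libraries and column used in the project are not shown; 'description' is just an example column):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text: str) -> str:
    # Remove punctuation and special characters, keeping letters only
    text = re.sub(r"[^a-zA-Z\s]", " ", str(text))
    # Lowercase and tokenize
    tokens = text.lower().split()
    # Drop stop words and lemmatize the remaining tokens
    tokens = [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop_words]
    return " ".join(tokens)

# Apply the same cleaning to each textual column, e.g.:
df["description"] = df["description"].apply(clean_text)
```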


The final result obtained is as shown:




MISSING VALUES:

Missing values in the textual data (here, NaN values) need to be handled separately for each column, because the meaning of a NaN value changes from column to column. This is dealt with as shown below (see the code sketch after this list):



Here

  • Companies without a description have been assigned the term ‘undescribed’.

  • Companies without any requirements have been assigned the term ‘unrequired’.

  • Companies without job title have been assigned the term ‘untitled’.

  • Companies without company profile have been assigned the term ‘unprofiled’.

  • Companies without a job location have been assigned the term ‘remote’.

  • Companies without job benefits have been assigned the term ‘unbenefited’.

  • Companies without educational requirements have been assigned the term ‘basic’.

  • Companies without work experience have been assigned the term ‘unrequired’.

  • Companies without industry domain and function have been assigned the term ‘unspecific’.

  • Companies without job employment type have been assigned the term ‘untyped’.
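
A compact way to apply these replacements with pandas (a sketch; the column names are assumed to follow the Kaggle schema):

```python
# Column-specific placeholders for missing text values
# (column names assumed from the Kaggle dataset; adjust if yours differ)
fill_values = {
    "title": "untitled",
    "location": "remote",
    "company_profile": "unprofiled",
    "description": "undescribed",
    "requirements": "unrequired",
    "benefits": "unbenefited",
    "employment_type": "untyped",
    "required_experience": "unrequired",
    "required_education": "basic",
    "industry": "unspecific",
    "function": "unspecific",
}
df = df.fillna(value=fill_values)
```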


NUMERICAL DATA TO TEXTUAL DATA:


The three numerical columns need to be converted into textual format so that they can be combined with the other text features during model fitting. This is done as follows:
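
Assuming the three numerical columns are the binary flags telecommuting, has_company_logo and has_questions from the Kaggle dataset, one possible mapping to text tokens looks like this (the chosen words are illustrative, not the project's exact ones):

```python
# Map the three binary columns to short text tokens so they can be merged
# with the other text features
binary_maps = {
    "telecommuting": {1: "telecommuting", 0: "onsite"},
    "has_company_logo": {1: "haslogo", 0: "nologo"},
    "has_questions": {1: "hasquestions", 0: "noquestions"},
}
for col, mapping in binary_maps.items():
    df[col] = df[col].map(mapping)
```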



ANALYSIS OF FREQUENT WORDS:


• Required Education


Fraudulent vs Real Jobs


Here we can see that “basic” is the most frequent word in both categories, which means that most jobs do not mention any required education. Looking at the other requirements, real jobs mention educational requirements such as a bachelor’s degree or high school diploma more often than fraudulent jobs do.
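
The frequency comparisons in this section can be reproduced with a simple per-class word count (a sketch; the project's own figures were likely produced with word clouds or bar charts):

```python
from collections import Counter

def top_words(series, n=10):
    """Return the n most frequent words in a text column."""
    counter = Counter(" ".join(series.astype(str)).split())
    return counter.most_common(n)

# Compare word frequencies between real (0) and fraudulent (1) postings
print(top_words(df.loc[df["fraudulent"] == 0, "required_education"]))
print(top_words(df.loc[df["fraudulent"] == 1, "required_education"]))
```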


• Employment Type




Real jobs vs Fraudulent jobs



Here we see that most real jobs, as well as most fraudulent jobs, advertise a full-time role. Fraudulent jobs leave the employment type unmentioned more often than real jobs do.


• Required Experience



Real jobs vs Fraudulent jobs


After ‘unrequired’, most real jobs ask for mid-senior-level or associate-level experience, while fraudulent jobs ask for entry-level experience. This might be because entry-level requirements allow many people without much experience to apply to the fraudulent postings.


• Function


Real jobs vs Fraudulent jobs


Here we see that engineering is the most frequent function in fraudulent jobs, while in real jobs the functions are far more evenly distributed.

• Mean word count for columns: Description, Profile, Requirements



Looking at the mean word counts of all three columns, it is clear that fraudulent job postings are less descriptive in their job description, company profile, and requirements. Since these are made up, they tend to be less articulate. Fraudulent jobs focus more on displaying attractive stipends, lucrative opportunities, and incentives.
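
One way to compute these mean word counts (a sketch, assuming the pre-processed DataFrame from the earlier steps):

```python
# Mean word count per posting, split by class, for the three long-text columns
for col in ["description", "company_profile", "requirements"]:
    word_counts = df[col].astype(str).str.split().str.len()
    print(col)
    print(word_counts.groupby(df["fraudulent"]).mean().round(1))
```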


DATA COMBINING:


To train our model, a new text column called ‘clean_text’ is created by combining all the individual text columns as shown:
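
A sketch of the combination step (the exact list of columns concatenated in the project is assumed here):

```python
# Combine the individual text columns into a single feature for modelling
text_cols = [
    "title", "location", "company_profile", "description", "requirements",
    "benefits", "employment_type", "required_experience", "required_education",
    "industry", "function", "telecommuting", "has_company_logo", "has_questions",
]
df["clean_text"] = df[text_cols].astype(str).agg(" ".join, axis=1)
```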



• Most frequent words combined:


Fraud vs Real job postings



Here we can see that the most common terms in fraudulent jobs are project, experience, customer, and service, while in real jobs the common terms are customer, service, senior, skill, etc.


MODEL FITTING:

Before fitting the models, the imbalanced dataset is balanced using the SMOTE and Random Over Sampling techniques, and Naïve Bayes and Random Forest models are fitted on each. Here we prefer a good F1 score over accuracy, since the F1 score gives a better representation of model performance on imbalanced data.
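
A minimal modelling sketch with scikit-learn and imbalanced-learn, assuming a TF-IDF representation of ‘clean_text’ (the vectorisation method is not stated in the post, so treat it as an assumption):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

# Vectorise the combined text column
X = TfidfVectorizer(max_features=5000).fit_transform(df["clean_text"])
y = df["fraudulent"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Balance only the training data with SMOTE
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

for model in (MultinomialNB(), RandomForestClassifier(random_state=42)):
    model.fit(X_res, y_res)
    print(model.__class__.__name__)
    print(classification_report(y_test, model.predict(X_test)))
```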


SMOTE:

Results obtained on fitting Naïve Bayes:



Results obtained on fitting Random Forest:




Here we can see that the Random Forest model performs better than Naïve Bayes. The F1 score for class 1 (i.e., the fraudulent job postings) is significantly higher for Random Forest (0.76) than for Naïve Bayes (0.44).


RANDOM OVER SAMPLING:

In this case, the fraudulent job observations are randomly replicated to match the number of real job observations. This ensures balanced data for modelling.
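
The same pipeline with Random Over Sampling instead of SMOTE (a sketch reusing the train/test split and vectorised features from the previous snippet):

```python
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# X_train, y_train, X_test, y_test come from the earlier split.
# Randomly replicate minority-class (fraudulent) rows in the training set.
X_ros, y_ros = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)

rf = RandomForestClassifier(random_state=42).fit(X_ros, y_ros)
print(classification_report(y_test, rf.predict(X_test)))
```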

Results obtained on fitting Naïve Bayes:



Results obtained on fitting Random Forest:




We see that the F1 score is 0.44 for Naïve Bayes and 0.78 for Random Forest. Random Over Sampling improved the Random Forest F1 score by 2 percentage points compared with SMOTE, so it is the better method for balancing this data.


Hence, we conclude that the Random Forest model trained on randomly over-sampled data is the best model for predicting real vs. fake job postings.


HYPERPARAMETER TUNING:


Hyperparameter tuning is the process of determining the combination of hyperparameters that maximizes model performance. Hyperparameters are adjustable parameters that let you control the model training process; for example, with neural networks, you decide the number of hidden layers and the number of nodes in each layer. Model performance depends strongly on the chosen hyperparameters.
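
For the Random Forest used here, hyperparameter tuning could look like the following grid search (the grid is illustrative; the post does not list the values actually tried):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Illustrative search space for the Random Forest
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 20, 50],
    "min_samples_split": [2, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",   # optimise F1, consistent with the evaluation above
    cv=3,
    n_jobs=-1,
)
# X_ros, y_ros: the over-sampled training data from the previous sketch
search.fit(X_ros, y_ros)
print(search.best_params_, search.best_score_)
```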


OVERFITTING:


When machine learning algorithms are constructed, they leverage a sample dataset to train the model. However, when the model trains for too long on sample data or when the model is too complex, it can start to learn the “noise,” or irrelevant information, within the dataset. When the model memorizes the noise and fits too closely to the training set, the model becomes “overfitted,” and it is unable to generalize well to new data.

