import pandas as pd
import numpy as np
from plotly import express as px
from matplotlib import pyplot as plt
import missingno as msn
import seaborn as sns
%matplotlib inline
df = pd.read_csv("fake_job_postings.csv")
df.head(2)
 | job_id | title | location | department | salary_range | company_profile | description | requirements | benefits | telecommuting | has_company_logo | has_questions | employment_type | required_experience | required_education | industry | function | fraudulent |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Marketing Intern | US, NY, New York | Marketing | NaN | We're Food52, and we've created a groundbreaki... | Food52, a fast-growing, James Beard Award-winn... | Experience with content management systems a m... | NaN | 0 | 1 | 0 | Other | Internship | NaN | NaN | Marketing | 0 |
1 | 2 | Customer Service - Cloud Video Production | NZ, , Auckland | Success | NaN | 90 Seconds, the worlds Cloud Video Production ... | Organised - Focused - Vibrant - Awesome!Do you... | What we expect from you:Your key responsibilit... | What you will get from usThrough being part of... | 0 | 1 | 0 | Full-time | Not Applicable | NaN | Marketing and Advertising | Customer Service | 0 |
print(df.columns)
Index(['job_id', 'title', 'location', 'department', 'salary_range',
'company_profile', 'description', 'requirements', 'benefits',
'telecommuting', 'has_company_logo', 'has_questions', 'employment_type',
'required_experience', 'required_education', 'industry', 'function',
'fraudulent'],
dtype='object')
Question 1 :: How many data points are present in the data?
Question 2 :: How many features are present in the data?
print("sol1:- Total Number Of DataPoints:- {}.".format(df.shape[0]))
print("sol2:- Total Number of features:- {}.".format(df.shape[1]))
sol1:- Total Number Of DataPoints:- 17880.
sol2:- Total Number of features:- 18.
Question 3 :: Are there null values in the data?
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 18 columns):
job_id 17880 non-null int64
title 17880 non-null object
location 17534 non-null object
department 6333 non-null object
salary_range 2868 non-null object
company_profile 14572 non-null object
description 17879 non-null object
requirements 15185 non-null object
benefits 10670 non-null object
telecommuting 17880 non-null int64
has_company_logo 17880 non-null int64
has_questions 17880 non-null int64
employment_type 14409 non-null object
required_experience 10830 non-null object
required_education 9775 non-null object
industry 12977 non-null object
function 11425 non-null object
fraudulent 17880 non-null int64
dtypes: int64(5), object(13)
memory usage: 2.5+ MB
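The df.info() output above gives non-null counts, so the missing counts have to be worked out by subtraction. A minimal sketch of a more direct summary follows. It runs on a small made-up frame (the `toy` data and the `null_summary` helper are assumptions for illustration, not part of the original notebook); the same function could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the job postings frame (values are illustrative only).
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "department": ["x", np.nan, np.nan, np.nan],
    "salary_range": [np.nan, "10-20", np.nan, np.nan],
})

def null_summary(frame):
    """Return per-column null counts and percentages, most missing first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "null_count": counts,
        "null_pct": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("null_count", ascending=False)

summary = null_summary(toy)
print(summary)
```

Calling `null_summary(df)` on the real data should rank salary_range and department near the top, since they have the fewest non-null entries in the listing above.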
msn.matrix(df)
<matplotlib.axes._subplots.AxesSubplot at 0x7f78397e4208>
msn.heatmap(df)
<matplotlib.axes._subplots.AxesSubplot at 0x7f7833f10780>
msn.bar(df)
<matplotlib.axes._subplots.AxesSubplot at 0x7f783a141940>
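The msn.heatmap plot shows how strongly the presence or absence of one column is related to that of another. As a rough numeric counterpart, the sketch below computes a correlation matrix over the null indicators with plain pandas. The `toy` frame is made up for illustration, and this is an approximation of the idea rather than a reproduction of missingno's internals.

```python
import numpy as np
import pandas as pd

# Illustrative data: "a" and "b" are missing on the same rows, "c" is not.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": ["x", np.nan, np.nan, "y"],
    "c": [np.nan, 1.0, 2.0, 3.0],
    "d": [1, 2, 3, 4],  # never missing
})

# 1 where a value is missing, 0 otherwise.
nullity = toy.isnull().astype(int)

# Columns with no missing values have zero variance, so drop them first.
nullity = nullity.loc[:, nullity.any()]

nullity_corr = nullity.corr()
print(nullity_corr)
```

A value near 1 means two columns tend to be missing together; a negative value means one tends to be present when the other is missing.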
for item in df.columns:
    print("{} uniques: {}".format(item, df[item].unique().size))
job_id uniques: 17880
title uniques: 11231
location uniques: 3106
department uniques: 1338
salary_range uniques: 875
company_profile uniques: 1710
description uniques: 14802
requirements uniques: 11969
benefits uniques: 6206
telecommuting uniques: 2
has_company_logo uniques: 2
has_questions uniques: 2
employment_type uniques: 6
required_experience uniques: 8
required_education uniques: 14
industry uniques: 132
function uniques: 38
fraudulent uniques: 2
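Note that unique() keeps NaN as a value, so the counts above include one extra entry for every column that has missing data. For example, employment_type reports 6 but contains nulls, which suggests five actual categories. A small sketch of the difference, using a made-up series:

```python
import numpy as np
import pandas as pd

# Illustrative column with two real categories and some missing values.
col = pd.Series(["Full-time", "Part-time", np.nan, "Full-time", np.nan])

with_nan = col.unique().size            # NaN is counted as a distinct value
without_nan = col.nunique()             # NaN is excluded by default
with_nan_again = col.nunique(dropna=False)  # opt back in to counting NaN

print(with_nan, without_nan, with_nan_again)
```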
- job_id:- Unique id of each job posting
- title:- Title of the job
- location:- Location of the job
- department:- Job department (ex:- Marketing)
- salary_range:- Range of the salary
- company_profile:- What the company actually does (ex:- food company or tech company)
- description:- Full description of the job
- requirements:- Requirements for the job
- benefits:- Extra benefits offered
- telecommuting:- binary variable
- has_company_logo:- binary variable
- has_questions:- binary variable
- employment_type:- Type of employment (ex:- full-time or part-time)
- required_experience:- Level of experience needed (ex:- internship)
- required_education:- Minimum qualification
- industry:- Industry of the company (ex:- Marketing and Advertising)
- function:- Function of the job (ex:- Customer Service)
- fraudulent:- Whether the posting is fraudulent or not
Question 4 :: How many data points are fraudulent in the given data?
print("sol :: Number of fraudulent jobs :: {}".format(df['fraudulent'].sum()))
sol :: Number of fraudulent jobs :: 866
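With 866 fraudulent postings out of 17880, only about 4.8% of the data belongs to the positive class, so the target is heavily imbalanced. A minimal sketch of how the count and the rate could be checked together, using a made-up label series in place of df['fraudulent']:

```python
import pandas as pd

# Illustrative 0/1 labels standing in for df['fraudulent'].
labels = pd.Series([0, 0, 0, 1, 0, 0, 1, 0, 0, 0], name="fraudulent")

n_fraud = int(labels.sum())      # number of positive (fraudulent) rows
fraud_rate = labels.mean()       # share of positive rows
counts = labels.value_counts()   # rows per class

print("fraudulent:", n_fraud)
print("rate: {:.1%}".format(fraud_rate))
print(counts)
```

For a 0/1 column the sum is the count of ones and the mean is their proportion, so both numbers come from the same column without extra filtering.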
msn.dendrogram(df)
<matplotlib.axes._subplots.AxesSubplot at 0x7f783a2177b8>