
NoteBook1.md


Data Visualization


import pandas as pd
import numpy as np
from plotly import express as px
from matplotlib import pyplot as plt
import missingno as msn
import seaborn as sns
%matplotlib inline

Reading File

df = pd.read_csv("fake_job_postings.csv")
df.head(2)
| | job_id | title | location | department | salary_range | company_profile | description | requirements | benefits | telecommuting | has_company_logo | has_questions | employment_type | required_experience | required_education | industry | function | fraudulent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Marketing Intern | US, NY, New York | Marketing | NaN | We're Food52, and we've created a groundbreaki... | Food52, a fast-growing, James Beard Award-winn... | Experience with content management systems a m... | NaN | 0 | 1 | 0 | Other | Internship | NaN | NaN | Marketing | 0 |
| 1 | 2 | Customer Service - Cloud Video Production | NZ, , Auckland | Success | NaN | 90 Seconds, the worlds Cloud Video Production ... | Organised - Focused - Vibrant - Awesome!Do you... | What we expect from you:Your key responsibilit... | What you will get from usThrough being part of... | 0 | 1 | 0 | Full-time | Not Applicable | NaN | Marketing and Advertising | Customer Service | 0 |
print(df.columns)
Index(['job_id', 'title', 'location', 'department', 'salary_range',
       'company_profile', 'description', 'requirements', 'benefits',
       'telecommuting', 'has_company_logo', 'has_questions', 'employment_type',
       'required_experience', 'required_education', 'industry', 'function',
       'fraudulent'],
      dtype='object')

Question 1 :: How many data points are present in the data?

Question 2 :: How many features are present in the data?

print("sol1:- Total Number Of DataPoints:- {}.".format(df.shape[0]))
print("sol2:- Total Number of features:- {}.".format(df.shape[1]))
sol1:- Total Number Of DataPoints:- 17880.
sol2:- Total Number of features:- 18.
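Since `df.shape` is a `(rows, columns)` tuple, both answers can also be read in one step by unpacking it. The sketch below uses a small hypothetical frame (`toy` is not the real dataset) so that it runs without the CSV file.

```python
import pandas as pd

# Hypothetical stand-in for df; only the column names are borrowed from the notebook.
toy = pd.DataFrame({
    "job_id": [1, 2, 3],
    "title": ["Marketing Intern", "Account Executive", "Engineer"],
})

# shape is (number of rows, number of columns)
n_rows, n_cols = toy.shape
print("data points: {}, features: {}".format(n_rows, n_cols))

# Equivalent ways to get the same two numbers
assert n_rows == len(toy)
assert n_cols == len(toy.columns)
```

On the real `df`, the same unpacking would give the 17880 rows and 18 columns reported above.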

Question 3 :: Which columns contain null values?

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 18 columns):
job_id                 17880 non-null int64
title                  17880 non-null object
location               17534 non-null object
department             6333 non-null object
salary_range           2868 non-null object
company_profile        14572 non-null object
description            17879 non-null object
requirements           15185 non-null object
benefits               10670 non-null object
telecommuting          17880 non-null int64
has_company_logo       17880 non-null int64
has_questions          17880 non-null int64
employment_type        14409 non-null object
required_experience    10830 non-null object
required_education     9775 non-null object
industry               12977 non-null object
function               11425 non-null object
fraudulent             17880 non-null int64
dtypes: int64(5), object(13)
memory usage: 2.5+ MB
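`df.info()` reports non-null counts, so the number of nulls has to be worked out by subtraction (for example, `salary_range` has 17880 - 2868 = 15012 nulls, roughly 84% of the rows). A minimal sketch of getting null counts and percentages directly is shown below. The helper name `null_summary` and the `toy` frame are hypothetical and only stand in for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; values are made up for illustration.
toy = pd.DataFrame({
    "job_id": [1, 2, 3, 4],
    "department": ["Marketing", np.nan, np.nan, "Sales"],
    "salary_range": [np.nan, np.nan, np.nan, "20000-28000"],
})

def null_summary(frame):
    """Return null count and null percentage per column, most-missing first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "null_count": counts,
        "null_pct": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("null_count", ascending=False)

summary = null_summary(toy)
print(summary)
```

Calling `null_summary(df)` on the real frame should list the same columns that `df.info()` shows with fewer than 17880 non-null entries.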
msn.matrix(df)
<matplotlib.axes._subplots.AxesSubplot at 0x7f78397e4208>

[Figure: missingno nullity matrix of df]

msn.heatmap(df)
<matplotlib.axes._subplots.AxesSubplot at 0x7f7833f10780>

[Figure: missingno nullity correlation heatmap of df]

msn.bar(df)
<matplotlib.axes._subplots.AxesSubplot at 0x7f783a141940>

[Figure: missingno bar chart of non-null counts per column of df]

for item in df.columns:
    print("{} uniques: {}".format(item,df[item].unique().size))
job_id uniques: 17880
title uniques: 11231
location uniques: 3106
department uniques: 1338
salary_range uniques: 875
company_profile uniques: 1710
description uniques: 14802
requirements uniques: 11969
benefits uniques: 6206
telecommuting uniques: 2
has_company_logo uniques: 2
has_questions uniques: 2
employment_type uniques: 6
required_experience uniques: 8
required_education uniques: 14
industry uniques: 132
function uniques: 38
fraudulent uniques: 2
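One caveat about these counts: `Series.unique()` keeps `NaN` as a value, so every column with missing entries is reported with one more "unique" than it has real categories. The sketch below, on a hypothetical toy column, shows the difference between `unique().size` and `nunique()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two real categories and some missing values.
s = pd.Series(["Full-time", "Part-time", np.nan, "Full-time", np.nan])

with_nan = s.unique().size               # NaN is counted once as a value
also_with_nan = s.nunique(dropna=False)  # same count, stated explicitly
without_nan = s.nunique()                # default dropna=True ignores NaN

print(with_nan, also_with_nan, without_nan)
```

If the goal is the number of real categories per column, `df.nunique()` gives that for all columns at once.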

Features:

  • job_id: unique identifier for each job posting
  • title: title of the job
  • location: location of the job
  • department: department the job belongs to (e.g. Marketing)
  • salary_range: salary range offered
  • company_profile: what the company does (e.g. a food company or a tech company)
  • description: full description of the job
  • requirements: requirements for the job
  • benefits: additional benefits offered
  • telecommuting: binary variable (0 or 1)
  • has_company_logo: binary variable (0 or 1)
  • has_questions: binary variable (0 or 1)
  • employment_type: type of employment (e.g. full-time or part-time)
  • required_experience: level of experience required (e.g. Internship)
  • required_education: minimum qualification required
  • industry: industry of the employer (e.g. Marketing and Advertising)
  • function: functional area of the job (e.g. Customer Service)
  • fraudulent: whether the posting is fraudulent (1) or not (0)
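The "binary variable" entries in the list can be checked against the data instead of being taken on trust. Below is a minimal sketch, assuming a binary column is one whose non-null values are all 0 or 1. The helper `binary_columns` and the `toy` frame are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; values are made up for illustration.
toy = pd.DataFrame({
    "job_id": [1, 2, 3],
    "telecommuting": [0, 1, 0],
    "has_company_logo": [1, 1, 0],
    "employment_type": ["Full-time", np.nan, "Other"],
})

def binary_columns(frame):
    """Return columns whose non-null values are a non-empty subset of {0, 1}."""
    result = []
    for col in frame.columns:
        values = set(frame[col].dropna().unique())
        if values and values <= {0, 1}:
            result.append(col)
    return result

flags = binary_columns(toy)
print(flags)
```

On the real `df` this would be expected to flag `telecommuting`, `has_company_logo`, `has_questions`, and `fraudulent`, which are the four columns reported with 2 unique values.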

Question 4 :: How many data points are fraudulent in the given data?

# fraudulent is a 0/1 column, so its sum is the number of fraudulent postings
print("sol :: Number of fraudulent jobs :: {}".format(df['fraudulent'].sum()))
sol :: Number of fraudulent jobs :: 866
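With 866 fraudulent postings out of 17880, the positive class is about 4.84% of the data, so the target is heavily imbalanced. The sketch below shows how the count, the rate, and the per-class breakdown can be read from a 0/1 column. The `labels` series is a hypothetical stand-in for `df['fraudulent']`.

```python
import pandas as pd

# Hypothetical 0/1 labels standing in for df['fraudulent'].
labels = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="fraudulent")

n_fraud = int(labels.sum())     # number of rows labelled 1
fraud_rate = labels.mean()      # fraction of rows labelled 1
counts = labels.value_counts()  # rows per class, indexed by label

print("fraudulent: {} ({:.2%})".format(n_fraud, fraud_rate))
print(counts)
```

Multiplying the mean by the column size gives the same number as the sum, only as a float.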
msn.dendrogram(df)
<matplotlib.axes._subplots.AxesSubplot at 0x7f783a2177b8>

[Figure: missingno dendrogram of column nullity for df]