D02 Piscine AI - Data Science
Author:
Table of Contents:
Historical part:
Introduction
The goal of this day is to understand practical usage of Pandas. As Pandas is intensively used in Data Science, other days of the piscine will be dedicated to it.
Not only is the Pandas library a central component of the data science toolkit but it is used in conjunction with other libraries in that collection.
Pandas is built on top of the NumPy package, meaning a lot of the structure of NumPy is used or replicated in Pandas. Data in pandas is often used to feed statistical analysis in SciPy, plotting functions from Matplotlib, and machine learning algorithms in Scikit-learn.
Most of the topics we will cover today are explained and described with examples in the first resource. The number of exercises is low on purpose: take the time to understand chapter 5 of the resource, even if it is 40 pages long.
The version of Pandas I used is '1.0.1'.
Rules
...
Resources
- If I had to give you one resource it would be this one:
https://bedford-computing.co.uk/learning/wp-content/uploads/2015/10/Python-for-Data-Analysis.pdf It contains ALL you need to know about Pandas.
- Pandas documentation:
https://pandas.pydata.org/docs/
https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html
Exercise 1
The goal of this exercise is to learn to create basic Pandas objects.

- Create the DataFrame below in two ways:

  - From a NumPy array
  - From a Pandas Series

  |   | color | list    | number |
  |---|-------|---------|--------|
  | 1 | Blue  | [1, 2]  | 1.1    |
  | 3 | Red   | [3, 4]  | 2.2    |
  | 5 | Pink  | [5, 6]  | 3.3    |
  | 7 | Grey  | [7, 8]  | 4.4    |
  | 9 | Black | [9, 10] | 5.5    |

- Print the type of every column and the type of the first value of every column
Solution

- The solution is accepted if the DataFrame created is the same as the "model" DataFrame. Check that the index is not 1, 2, 3, 4, 5 (it should be 1, 3, 5, 7, 9).
- The solution is accepted if the types you get for the columns are

  ```console
  <class 'pandas.core.series.Series'>
  <class 'pandas.core.series.Series'>
  <class 'pandas.core.series.Series'>
  ```

  and if the types of the first value of the columns are

  ```console
  <class 'str'>
  <class 'list'>
  <class 'float'>
  ```
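A minimal sketch of one accepted approach (the construction details are up to you; note that building from an object-dtype NumPy array keeps the first value of `number` a plain Python `float`):

```python
import numpy as np
import pandas as pd

index = [1, 3, 5, 7, 9]
columns = ["color", "list", "number"]

# From a NumPy array: dtype=object keeps the inner lists intact.
array = np.array(
    [["Blue", [1, 2], 1.1],
     ["Red", [3, 4], 2.2],
     ["Pink", [5, 6], 3.3],
     ["Grey", [7, 8], 4.4],
     ["Black", [9, 10], 5.5]],
    dtype=object,
)
df = pd.DataFrame(array, index=index, columns=columns)

# From Pandas Series: one Series per column, sharing the same index.
df_from_series = pd.DataFrame({
    "color": pd.Series(["Blue", "Red", "Pink", "Grey", "Black"], index=index),
    "list": pd.Series([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]], index=index),
    "number": pd.Series([1.1, 2.2, 3.3, 4.4, 5.5], index=index, dtype=object),
})

# Type of every column, then type of the first value of every column.
for col in df.columns:
    print(type(df[col]))
for col in df.columns:
    print(type(df[col].iloc[0]))
```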
Exercise 2: Electric power consumption
The goal of this exercise is to learn to manipulate real data with Pandas. The data set used is Individual household electric power consumption.
- Delete the columns `Time`, `Sub_metering_2` and `Sub_metering_3`
- Set `Date` as index
- Create a function that takes as input the DataFrame with the data set and returns a DataFrame with updated types:

  ```python
  def update_types(df):
      # TODO
      return df
  ```

- Use `describe` to have an overview of the data set
- Delete the rows with missing values
- Modify `Sub_metering_1` by multiplying it by 0.06
- Select all the rows for which the Date is greater than 2008-12-27 and `Voltage` is greater than 242
- Print the 88888th row.
- What is the date for which `Global_active_power` is maximal?
- Sort the first three columns by descending order of `Global_active_power` and ascending order of `Voltage`.
- Compute the daily average of `Global_active_power`.
Correction:

- `del` works but it is not a solution I recommend. For this exercise it is accepted. It is expected to use `drop` with `axis=1`. `inplace=True` may be useful to avoid assigning the result back to a variable.
- The preferred solution is `set_index` with `inplace=True`. As long as the DataFrame returns the output below, the solution is accepted. If the type of the index is not `dtype='datetime64[ns]'` the solution is not accepted (both steps are sketched after this item).

  ```console
  Input: df.head().index
  Output: DatetimeIndex(['2006-12-16', '2006-12-16', '2006-12-16', '2006-12-16', '2006-12-16'],
                        dtype='datetime64[ns]', name='Date', freq=None)
  ```
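A minimal sketch of these first two steps, assuming the data set is the semicolon-separated file from the UCI repository with `?` marking missing values (adjust the file name to your setup):

```python
import pandas as pd

# Hypothetical file name: adjust to wherever you stored the data set.
df = pd.read_csv("household_power_consumption.txt", sep=";", na_values="?")

# Drop the unwanted columns with drop/axis=1 rather than del.
df.drop(["Time", "Sub_metering_2", "Sub_metering_3"], axis=1, inplace=True)

# Parse the dates, then set Date as the index in place.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
df.set_index("Date", inplace=True)
```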
- The preferred solution is `pd.to_numeric` with `errors='coerce'`. The solution is accepted if all types are `float64`.

  ```console
  Input: df.dtypes
  Output: Global_active_power      float64
          Global_reactive_power    float64
          Voltage                  float64
          Global_intensity         float64
          Sub_metering_1           float64
          dtype: object
  ```
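A minimal sketch of `update_types`, assuming every remaining column should become numeric:

```python
import pandas as pd

def update_types(df):
    # Convert each column to a numeric dtype; unparseable values become NaN.
    for col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df
```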
- `df.describe()` is expected
- You should have noticed that 25979 rows contain missing values (for a total of 129895). `df.isna().sum()` allows you to check the number of missing values per column, and `df.dropna()` with `inplace=True` deletes those rows. The solution is accepted if you used `dropna` and have the same number of missing values.
- Two solutions are accepted:

  - `df.loc[:,'A'] = df['A'] * 0.06`
  - Using `apply` and `df.loc[:,'A'] = ...`

  You may wonder why `df.loc[:,'A']` is required and if `df['A'] = ...` works too. The answer is no. This is important in Pandas. Depending on the version of Pandas, it may return a warning. The reason is that you would be assigning a value to a copy of the DataFrame and not to the DataFrame itself. More details: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas
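The same two accepted forms, written against the actual column (a minimal sketch):

```python
# Vectorized assignment through .loc
df.loc[:, "Sub_metering_1"] = df["Sub_metering_1"] * 0.06

# Equivalent with apply: slower, but also accepted
df.loc[:, "Sub_metering_1"] = df["Sub_metering_1"].apply(lambda x: x * 0.06)
```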
- The solution is accepted as long as the output of `print(filtered_df.head().to_markdown())` is

  | Date                | Global_active_power | Global_reactive_power |
  |---------------------|---------------------|-----------------------|
  | 2008-12-27 00:00:00 | 0.996               | 0.066                 |
  | 2008-12-27 00:00:00 | 1.076               | 0.162                 |
  | 2008-12-27 00:00:00 | 1.064               | 0.172                 |
  | 2008-12-27 00:00:00 | 1.07                | 0.174                 |
  | 2008-12-27 00:00:00 | 0.804               | 0.184                 |

  Check that the number of rows is equal to 449667.
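A minimal sketch of the filter, assuming `Date` is the datetime index set earlier. Note that the accepted output above starts at 2008-12-27 itself, so `>=` on the date matches it:

```python
filtered_df = df[(df.index >= "2008-12-27") & (df["Voltage"] > 242)]
print(len(filtered_df))  # expected: 449667
```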
- The solution is accepted if the output is

  ```console
  Global_active_power        0.254
  Global_reactive_power      0.000
  Voltage                  238.350
  Global_intensity           1.200
  Sub_metering_1             0.000
  Name: 2007-02-16 00:00:00, dtype: float64
  ```
- The solution is accepted if the output is `Timestamp('2009-02-22 00:00:00')`
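A minimal sketch for these two questions, using positional indexing and `idxmax`:

```python
# 88888th row, with iloc counting positions from zero.
print(df.iloc[88888])

# Date (index value) for which Global_active_power is maximal.
print(df["Global_active_power"].idxmax())
```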
- The solution is accepted if the output for `print(sorted_df.tail().to_markdown())` is

  | Date                | Global_active_power | Global_reactive_power | Voltage |
  |---------------------|---------------------|-----------------------|---------|
  | 2008-08-28 00:00:00 | 0.076               | 0                     | 234.88  |
  | 2008-08-28 00:00:00 | 0.076               | 0                     | 235.18  |
  | 2008-08-28 00:00:00 | 0.076               | 0                     | 235.4   |
  | 2008-08-28 00:00:00 | 0.076               | 0                     | 235.64  |
  | 2008-12-08 00:00:00 | 0.076               | 0                     | 236.5   |
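A minimal sketch using `sort_values` on the first three columns:

```python
sorted_df = df.iloc[:, :3].sort_values(
    by=["Global_active_power", "Voltage"],
    ascending=[False, True],
)
```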
The solution is based on
groupby
which creates groups based on the indexDate
and agregates the groups using themean
. The solution is accepted if the output isDate 2006-12-16 3.053475 2006-12-17 2.354486 2006-12-18 1.530435 2006-12-19 1.157079 2006-12-20 1.545658 ... 2010-12-07 0.770538 2010-12-08 0.367846 2010-12-09 1.119508 2010-12-10 1.097008 2010-12-11 1.275571 Name: Global_active_power, Length: 1433, dtype: float64
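A minimal sketch of the daily average, grouping on the datetime index:

```python
daily_avg = df.groupby(df.index)["Global_active_power"].mean()
print(daily_avg)
```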
Exercise 3: E-commerce purchases
The goal of this exercise is to learn to manipulate real data with Pandas. This exercise is less guided since exercise 2 should have given you a nice introduction.
The data set used is E-commerce purchases.
Questions:

- How many rows and columns are there?
- What is the average Purchase Price?
- What were the highest and lowest purchase prices?
- How many people have English `'en'` as their Language of choice on the website?
- How many people have the job title of `"Lawyer"`?
- How many people made the purchase during the `AM` and how many people made the purchase during the `PM`?
- What are the 5 most common Job Titles?
- Someone made a purchase that came from Lot: `"90 WT"`, what was the Purchase Price for this transaction?
- What is the email of the person with the following Credit Card Number: `4926535242672853`?
- How many people have American Express as their Credit Card Provider and made a purchase above `$95`?
- How many people have a credit card that expires in `2025`?
- What are the top 5 most popular email providers/hosts (e.g. gmail.com, yahoo.com, etc...)?
Correction

To validate this exercise all answers should return the expected numerical value given in the correction AND use Pandas. For example, using NumPy to compute the mean doesn't respect the philosophy of the exercise, which is to use Pandas.

- How many rows and columns are there? 10000 entries.
  There are many solutions based on: `shape`, `info`, `describe`.
- What is the average Purchase Price? 50.34730200000025
  Even if `np.mean` gives the solution, `df['Purchase Price'].mean()` is preferred.
- What were the highest and lowest purchase prices?
  min: 0
  max: 99.989999999999995
- How many people have English `'en'` as their Language of choice on the website? 1098
- How many people have the job title of `"Lawyer"`? 30
- How many people made the purchase during the `AM` and how many people made the purchase during the `PM`? PM: 5068, AM: 4932.
  There are many ways to get the solution, but the goal of this question was to make you use `value_counts` (see the sketch after this item).
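A minimal sketch of these three answers. The column names `Language`, `Job` and `AM or PM` are assumptions based on the usual layout of this data set:

```python
print((df["Language"] == "en").sum())  # 1098
print((df["Job"] == "Lawyer").sum())   # 30
print(df["AM or PM"].value_counts())   # PM: 5068, AM: 4932
```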
- What are the 5 most common Job Titles?

  ```console
  Interior and spatial designer    31
  Lawyer                           30
  Social researcher                28
  Purchasing manager               27
  Designer, jewellery              27
  ```

  There are many ways to get the solution, but the goal of this question was to make you use `value_counts`.
- Someone made a purchase that came from Lot: `"90 WT"`, what was the Purchase Price for this transaction? 75.1
- What is the email of the person with the following Credit Card Number: `4926535242672853`? bondellen@williams-garza.com
- How many people have American Express as their Credit Card Provider and made a purchase above `$95`? 39
  The preferred solution is based on this pattern: `df[(df['A'] == X) & (df['B'] > Y)]` (a concrete sketch follows).
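A minimal sketch of the combined condition; the column names `CC Provider` and `Purchase Price` are assumptions based on the usual layout of this data set:

```python
mask = (df["CC Provider"] == "American Express") & (df["Purchase Price"] > 95)
print(mask.sum())  # 39
```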
- How many people have a credit card that expires in `2025`? 1033
  The preferred solution is based on the usage of `apply` with a `lambda` function that slices the string that contains the expiration date, as sketched below.
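A minimal sketch, assuming the expiration date lives in a `CC Exp Date` column formatted as `MM/YY`:

```python
# Keep the rows whose two-digit expiration year slice is '25'.
print(df["CC Exp Date"].apply(lambda date: date[3:] == "25").sum())  # 1033
```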
- What are the top 5 most popular email providers/hosts (e.g. gmail.com, yahoo.com, etc...)?

  ```console
  hotmail.com     1638
  yahoo.com       1616
  gmail.com       1605
  smith.com         42
  williams.com      37
  ```

  The preferred solution is based on the usage of `apply` with a `lambda` function that slices the string that contains the email. The `lambda` function uses `split` to split the string on `@`. Finally, `value_counts` is used to count the occurrences.
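A minimal sketch, assuming the email address lives in an `Email` column:

```python
# Extract the host after '@' and count the occurrences.
print(df["Email"].apply(lambda email: email.split("@")[1]).value_counts().head(5))
```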
Exercise 4: Handling missing values
The goal of this exercise is to learn to handle missing values. In a previous exercise we used the first technique: filtering out the missing values. We were lucky because the proportion of missing values was low. But in some cases, dropping the missing values is not possible because the filtered data set would be too small. This article explains the different types of missing data and how they should be handled: https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b
" It’s important to understand these different types of missing data from a statistics point of view. The type of missing data will influence how you deal with filling in the missing values."
- Drop the `flower` column
- Fill the missing values with a different "strategy" for each column:

  - `sepal_length` -> `mean`
  - `sepal_width` -> `median`
  - `petal_length`, `petal_width` -> `0`

- Explain why filling the missing values with 0 or the mean is a bad idea
- Fill the missing values using the median
Correction

To validate the exercise, you should have done these two steps in that order:

- Convert the numerical columns to `float`, for example: `pd.to_numeric(df.loc[:,col], errors='coerce')`
- Fill the missing values. There are many solutions for this step, here is one of them (the keys of the dictionary are the column names):
  `df.fillna({'sepal_length': df['sepal_length'].mean(), 'sepal_width': df['sepal_width'].median(), 'petal_length': 0, 'petal_width': 0})`
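A minimal sketch of both steps together; `df` is assumed to hold the iris data with the column names used above:

```python
import pandas as pd

numeric_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Step 1: force the numerical columns to float; bad entries become NaN.
for col in numeric_cols:
    df[col] = pd.to_numeric(df.loc[:, col], errors="coerce")

# Step 2: fill each column with its own strategy.
df.fillna(
    {
        "sepal_length": df["sepal_length"].mean(),
        "sepal_width": df["sepal_width"].median(),
        "petal_length": 0,
        "petal_width": 0,
    },
    inplace=True,
)
```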
- It is important to understand why filling the missing values with 0 or the mean of the column is a bad idea.

  |       | sepal_length | sepal_width | petal_length | petal_width |
  |-------|--------------|-------------|--------------|-------------|
  | count | 146          | 141         | 120          | 147         |
  | mean  | 56.9075      | 52.6255     | 15.5292      | 12.0265     |
  | std   | 572.222      | 417.127     | 127.46       | 131.873     |
  | min   | -4.4         | -3.6        | -4.8         | -2.5        |
  | 25%   | 5.1          | 2.8         | 2.725        | 0.3         |
  | 50%   | 5.75         | 3           | 4.5          | 1.3         |
  | 75%   | 6.4          | 3.3         | 5.1          | 1.8         |
  | max   | 6900         | 3809        | 1400         | 1600        |

  Once we have filled the missing values as suggested in the first question, `df.describe()` returns this interesting summary. We notice that the mean is way higher than the median, which suggests there are outliers in the data. The 75% quantile and the max confirm it: 75% of the flowers have a sepal length smaller than 6.4 cm, but the max is 6900 cm. If you check on the internet, you realise this small flower can't be that big. The outliers have a major impact on the mean, which equals 56.9. Filling the missing values with this number is not correct since it doesn't correspond to the real size of this flower. That's why, in this case, the best strategy to fill the missing values was the median. The truth is that I modified the data set! But real data sets ALWAYS contain outliers.
Bonus:

- If you noticed the negative values and the huge values, you will be a good data scientist. YOU SHOULD ALWAYS TRY TO UNDERSTAND YOUR DATA. Print the row with index 122 ;-)

This week, we will have the opportunity to focus on data pre-processing to understand how outliers are handled.
Exercises to add:

- Create a Series
- train_test_split
- Add 3 exercises on the must-know native Pandas functions
- dropna