Browse Source

fix: day2: testing and feedback

pull/2/head
b.ghazlane 3 years ago
parent
commit
92f4d29539
  1. 0
      one_md_per_day_format/piscine/Week1/data/D02/ex3/Ecommerce_purchases.txt
  2. 0
      one_md_per_day_format/piscine/Week1/data/D02/ex4/iris.csv
  3. 0
      one_md_per_day_format/piscine/Week1/data/D02/ex4/iris.data
  4. 39
      one_md_per_day_format/piscine/Week1/day2.md

0
one_md_per_day_format/piscine/Week1/data/D02/ex4/Ecommerce_purchases.txt → one_md_per_day_format/piscine/Week1/data/D02/ex3/Ecommerce_purchases.txt

0
one_md_per_day_format/piscine/Week1/data/D02/ex3/iris.csv → one_md_per_day_format/piscine/Week1/data/D02/ex4/iris.csv

1 sepal_length sepal_width petal_length petal_width flower
2 0 5.1 3.5 1.4 0.2 Iris-setosa
3 1 4.9 3.0 1.4 0.2 Iris-setosa
4 2 4.7 3.2 1.3 0.2 Iris-setosa
5 3 4.6 3.1 1.5 0.2 Iris-setosa
6 4 5.0 -3.6 -1.4 0.2 Iris-setosa
7 5 5.4 3.9 1.7 0.4 Iris-setosa
8 6 4.6 3.4 1.4 0.3 Iris-setosa
9 7 5.0 3.4 1.5 0.2 Iris-setosa
10 8 -4.4 2.9 1400.0 0.2 Iris-setosa
11 9 4.9 3.1 1.5 0.1 Iris-setosa
12 10 5.4 3.7 0.2 Iris-setosa
13 11 4.8 3.4 0.2 Iris-setosa
14 12 4.8 3.0 0.1 Iris-setosa
15 13 4.3 3.0 0.1 Iris-setosa
16 14 5.8 4.0 0.2 Iris-setosa
17 15 5.7 4.4 0.4 Iris-setosa
18 16 5.4 3.9 0.4 Iris-setosa
19 17 5.1 3.5 0.3 Iris-setosa
20 18 5.7 3.8 0.3 Iris-setosa
21 19 5.1 3.8 0.3 Iris-setosa
22 20 5.4 3.4 0.2 Iris-setosa
23 21 5.1 3.7 0.4 Iris-setosa
24 22 4.6 3.6 0.2 Iris-setosa
25 23 5.1 3.3 0.5 Iris-setosa
26 24 4.8 3.4 0.2 Iris-setosa
27 25 5.0 -3.0 0.2 Iris-setosa
28 26 5.0 3.4 0.4 Iris-setosa
29 27 5.2 3.5 0.2 Iris-setosa
30 28 5.2 3.4 0.2 Iris-setosa
31 29 4.7 3.2 0.2 Iris-setosa
32 30 4.8 3.1 1.6 0.2 Iris-setosa
33 31 5.4 3.4 1.5 0.4 Iris-setosa
34 32 5.2 4.1 1.5 0.1 Iris-setosa
35 33 5.5 4.2 1.4 0.2 Iris-setosa
36 34 4.9 3.1 1.5 0.1 Iris-setosa
37 35 5.0 3.2 1.2 0.2 Iris-setosa
38 36 5.5 3.5 1.3 0.2 Iris-setosa
39 37 4.9 1.5 0.1 Iris-setosa
40 38 4.4 3.0 1.3 0.2 Iris-setosa
41 39 5.1 3.4 1.5 0.2 Iris-setosa
42 40 5.0 3.5 1.3 0.3 Iris-setosa
43 41 4.5 2.3 1.3 0.3 Iris-setosa
44 42 4.4 3.2 1.3 0.2 Iris-setosa
45 43 5.0 3.5 1.6 0.6 Iris-setosa
46 44 5.1 3.8 1.9 0.4 Iris-setosa
47 45 4.8 3.0 1.4 0.3 Iris-setosa
48 46 5.1 3809.0 1.6 0.2 Iris-setosa
49 47 4.6 3.2 1.4 0.2 Iris-setosa
50 48 5.3 3.7 1.5 0.2 Iris-setosa
51 49 5.0 3.3 1.4 0.2 Iris-setosa
52 50 7.0 3.2 4.7 1.4 Iris-versicolor
53 51 6.4 3200.0 4.5 1.5 Iris-versicolor
54 52 6.9 3.1 4.9 1.5 Iris-versicolor
55 53 5.5 2.3 4.0 1.3 Iris-versicolor
56 54 6.5 2.8 4.6 1.5 Iris-versicolor
57 55 5.7 2.8 4.5 1.3 Iris-versicolor
58 56 6.3 3.3 4.7 1600.0 Iris-versicolor
59 57 4.9 2.4 3.3 1.0 Iris-versicolor
60 58 6.6 2.9 4.6 1.3 Iris-versicolor
61 59 5.2 2.7 3.9 Iris-versicolor
62 60 5.0 2.0 3.5 1.0 Iris-versicolor
63 61 5.9 3.0 4.2 1.5 Iris-versicolor
64 62 6.0 2.2 4.0 1.0 Iris-versicolor
65 63 6.1 2.9 4.7 1.4 Iris-versicolor
66 64 5.6 2.9 3.6 1.3 Iris-versicolor
67 65 6.7 3.1 4.4 1.4 Iris-versicolor
68 66 5.6 3.0 4.5 1.5 Iris-versicolor
69 67 5.8 2.7 4.1 1.0 Iris-versicolor
70 68 6.2 2.2 4.5 1.5 Iris-versicolor
71 69 5.6 2.5 3.9 1.1 Iris-versicolor
72 70 5.9 3.2 4.8 1.8 Iris-versicolor
73 71 6.1 2.8 4.0 1.3 Iris-versicolor
74 72 6.3 2.5 4.9 1.5 Iris-versicolor
75 73 6.1 2.8 4.7 1.2 Iris-versicolor
76 74 6.4 2.9 4.3 1.3 Iris-versicolor
77 75 6.6 3.0 4.4 1.4 Iris-versicolor
78 76 6.8 2.8 4.8 1.4 Iris-versicolor
79 77 6.7 3.0 5.0 1.7 Iris-versicolor
80 78 6.0 2.9 4.5 1.5 Iris-versicolor
81 79 5.7 2.6 3.5 1.0 Iris-versicolor
82 80 5.5 2.4 3.8 1.1 Iris-versicolor
83 81 5.5 2.4 3.7 1.0 Iris-versicolor
84 82 5.8 2.7 3.9 1.2 Iris-versicolor
85 83 6.0 2.7 5.1 1.6 Iris-versicolor
86 84 5.4 3.0 4.5 1.5 Iris-versicolor
87 85 6.0 3.4 4.5 1.6 Iris-versicolor
88 86 6.7 3.1 4.7 1.5 Iris-versicolor
89 87 6.3 2.3 4.4 1.3 Iris-versicolor
90 88 5.6 3.0 4.1 1.3 Iris-versicolor
91 89 5.5 2.5 4.0 1.3 Iris-versicolor
92 90 5.5 2.6 4.4 1.2 Iris-versicolor
93 91 6.1 3.0 4.6 1.4 Iris-versicolor
94 92 5.8 2.6 4.0 1.2 Iris-versicolor
95 93 5.0 2.3 3.3 1.0 Iris-versicolor
96 94 5.6 2.7 4.2 1.3 Iris-versicolor
97 95 5.7 3.0 4.2 1.2 Iris-versicolor
98 96 5.7 2.9 4.2 1.3 Iris-versicolor
99 97 6.2 2.9 4.3 1.3 Iris-versicolor
100 98 5.1 2.5 3.0 1.1 Iris-versicolor
101 99 5.7 2.8 1.3 Iris-versicolor
102 100 3.3 2.5 Iris-virginica
103 101 5.8 2.7 1.9 Iris-virginica
104 102 7.1 3.0 2.1 Iris-virginica
105 103 6.3 2.9 1.8 Iris-virginica
106 104 6.5 3.0 2.2 Iris-virginica
107 105 7.6 3.0 6.6 2.1 Iris-virginica
108 106 4.9 2.5 4.5 1.7 Iris-virginica
109 107 7.3 2.9 6.3 1.8 Iris-virginica
110 108 6.7 2.5 5.8 1.8 Iris-virginica
111 109 7.2 3.6 6.1 2.5 Iris-virginica
112 110 6.5 3.2 5.1 2.0 Iris-virginica
113 111 6.4 2.7 5.3 1.9 Iris-virginica
114 112 6.8 3.0 5.5 2.1 Iris-virginica
115 113 5.7 2.5 5.0 2.0 Iris-virginica
116 114 5.8 5.1 2.4 Iris-virginica
117 115 6.4 5.3 2.3 Iris-virginica
118 116 6.5 5.5 1.8 Iris-virginica
119 117 7.7 6.7 2.2 Iris-virginica
120 118 7.7 2.3 Iris-virginica
121 119 6.0 5.0 1.5 Iris-virginica
122 120 6.9 5.7 2.3 Iris-virginica
123 121 5.6 2.8 4.9 2.0 Iris-virginica
124 122 always check the data !!!!!!!!
125 123 6.3 2.7 4.9 1.8 Iris-virginica
126 124 6.7 3.3 5.7 2.1 Iris-virginica
127 125 7.2 3.2 6.0 1.8 Iris-virginica
128 126 6.2 2.8 -4.8 1.8 Iris-virginica
129 127 3.0 4.9 1.8 Iris-virginica
130 128 6.4 2.8 5.6 2.1 Iris-virginica
131 129 7.2 3.0 5.8 1.6 Iris-virginica
132 130 7.4 2.8 6.1 1.9 Iris-virginica
133 131 7.9 3.8 6.4 2.0 Iris-virginica
134 132 6.-4 2.8 5.6 2.2 Iris-virginica
135 133 6.3 2.8 1.5 Iris-virginica
136 134 6.1 2.6 5.6 1.4 Iris-virginica
137 135 7.7 3.0 6.1 2.3 Iris-virginica
138 136 6.3 3.4 5.6 2.4 Iris-virginica
139 137 6.4 3.1 5.5 1.8 Iris-virginica
140 138 6.0 3.0 4.8 1.8 Iris-virginica
141 139 6900 3.1 5.4 2.1 Iris-virginica
142 140 6.7 3.1 2.4 Iris-virginica
143 141 6.9 3.1 5.1 2.3 Iris-virginica
144 142 580 2.7 5.1 Iris-virginica
145 143 6.8 3.2 5.9 2.3 Iris-virginica
146 144 6.7 3.3 5.7 -2.5 Iris-virginica
147 145 6.7 3.0 5.2 2.3 Iris-virginica
148 146 6.3 2.5 5.0 1.9 Iris-virginica
149 147 6.5 3.0 5.2 2.0 Iris-virginica
150 148 6.2 3.4 5.4 2.3 Iris-virginica
151 149 5.9 3.0 5.1 1.8 Iris-virginica

0
one_md_per_day_format/piscine/Week1/data/D02/ex3/iris.data → one_md_per_day_format/piscine/Week1/data/D02/ex4/iris.data

39
one_md_per_day_format/piscine/Week1/day2.md

@ -97,8 +97,8 @@ The data set used is **Individual household electric power consumption**
4. Use `describe` to have an overview on the data set
5. Delete the rows with missing values
6. Modify `Sub_metering_1` by multiplying it by 0.06
7. Select all the rows for which the Date is greater than 2008-12-27 and `Voltage` is greater than 242
6. Modify `Sub_metering_1` by adding 1 to it and multiplying the total by 0.06. If x is a row the output is: (x+1)*0.06
7. Select all the rows for which the Date is greater or equal than 2008-12-27 and `Voltage` is greater or equal than 242
8. Print the 88888th row.
9. What is the date for which the `Global_active_power` is maximal ?
10. Sort the first three columns by descending order of `Global_active_power` and ascending order of `Voltage`.
@ -140,8 +140,8 @@ The data set used is **Individual household electric power consumption**
5. You should have noticed that 25979 rows contain missing values (for a total of 129895). `df.isna().sum()` allows to check the number of missing values and `df.dropna()` with `inplace=True`. The solution is accepted if you used `dropna` and have the number of missing values as 0.
6. Two solutions are accepted:
- `df.loc[:,'A'] = df['A'] * 0.06`
- Using `apply` and `df.loc[:,'A']` =
- `df.loc[:,'A'] = (df['A'] + 1) * 0.06`
- Using `apply`: `df.loc[:,'A'] = df.loc[:,'A'].apply(lambda x: (x+1)*0.06)`
You may wonder `df.loc[:,'A']` is required and if `df['A'] = ...` works too. The answer is no. This is important in Pandas. Depending on the version of Pandas, it may return a warning. The reason is that you are affecting a value to a **copy** of the DataFrame and not in the DataFrame.
More details: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas
@ -225,7 +225,7 @@ Questions:
The validate this exercise all answers should return the expected numerical value given in the correction AND uses Pandas. For example using NumPy to compute the mean doesn't respect the philosophy of the exercise which is to use Pandas.
1. How many rows and columns are there?**10000 entries**
1. How many rows and columns are there?**10000 entries** and **14 columns**
There many solutions based on: shape, info, describe
@ -296,9 +296,9 @@ https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-mi
"**It’s important to understand these different types of missing data from a statistics point of view. The type of missing data will influence how you deal with filling in the missing values.**"
1. Drop the `flower` column
- Preliminary: Drop the `flower` column
- Fill the missing values with a different "strategy" for each column:
1. Fill the missing values with a different "strategy" for each column:
`sepal_length` -> `mean`
@ -306,13 +306,17 @@ https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-mi
`petal_length`, `petal_width` -> `0`
2. Explain why filling the missing values with 0 or the mean is a bad idea
2. Fill the missing values using the median of the associated column using `fillna`.
- Bonus questions:
- Filling the missing values by 0 or the mean of the associated column is common in Data Science. In that case, explain why filling the missing values with 0 or the mean is a bad idea.
- Find a special row ;-)
3. Fill the missing values using the median
## Correction
To validate the exercise, you should have done these two steps in that order:
1. This question is validated if you have done these two steps in that order:
- Convert the numerical columns to `float`
@ -333,6 +337,11 @@ To validate the exercise, you should have done these two steps in that order:
4:0})
```
2. This question is validated if the solution is: `df.loc[:,col].fillna(df[col].median())`
**Bonus questions**:
- It is important to understand why filling the missing values with 0 or the mean of the column is a bad idea.
| | sepal_length | sepal_width | petal_length | petal_width |
@ -347,17 +356,9 @@ To validate the exercise, you should have done these two steps in that order:
| max | 6900 | 3809 | 1400 | 1600 |
Once we filled the missing values as suggested in the first question, `df.describe()` returns this interesting summary. We notice that the mean is way higher than the median. It means that there are maybe some outliers in the data. The quantile 75 and the max confirm that: 75% of the flowers have a sepal length smaller than 6.4 cm, but the max is 6900 cm. If you check on the internet you realise this small flower can't be that big. The outliers have a major impact on the mean which equals to 56.9. Filling this value for the missing value is not correct since it doesn't correspond to the real size of this flower. That is why in that case the best strategy to fill the missing values is the median. The truth is that I modified the data set ! But real data sets ALWAYS contains outliers.
Bonus:
Always think about the meaning of the data transformation ! If you fill the missing values by zero, it means that you consider that the length or width of some flowers may be 0. It doesn't make sense.
- If you noticed that there are some negative values and the huge values, you will be a good data scientist. **YOU SHOULD ALWAYS TRY TO UNDERSTAND YOUR DATA**. Print the row with index 122 ;-)
This week, we will have the opportunity to focus on the data pre-processing to understand how the outliers are handled.
EXos à ajouter:
Créer une Series
train_test_split
Ajouter 3 exos sur les fontions natives incontournable de Pandas
dropna
Loading…
Cancel
Save