
feat: add ex00 to all quests

pull/42/head
Badr Ghazlane 2 years ago
parent
commit
f58c381f8e
  1. 2
      one_exercise_per_file/week01/day01/ex00/audit/readme.md
  2. 35
      one_exercise_per_file/week01/day01/ex00/readme.md
  3. 9
      one_exercise_per_file/week01/day02/ex00/audit/readme.md
  4. 64
      one_exercise_per_file/week01/day02/ex00/readme.md
  5. 9
      one_exercise_per_file/week01/day03/ex00/audit/readme.md
  6. 62
      one_exercise_per_file/week01/day03/ex00/readme.md
  7. 9
      one_exercise_per_file/week01/day04/ex00/audit/readme.md
  8. 55
      one_exercise_per_file/week01/day04/ex00/readme.md
  9. 9
      one_exercise_per_file/week01/day05/ex00/audit/readme.md
  10. 52
      one_exercise_per_file/week01/day05/ex00/readme.md
  11. 9
      one_exercise_per_file/week02/day01/ex00/audit/readme.md
  12. 81
      one_exercise_per_file/week02/day01/ex00/readme.md
  13. 9
      one_exercise_per_file/week02/day02/ex00/audit/readme.md
  14. 77
      one_exercise_per_file/week02/day02/ex00/readme.md
  15. 9
      one_exercise_per_file/week02/day03/ex00/audit/readme.md
  16. 74
      one_exercise_per_file/week02/day03/ex00/readme.md
  17. 9
      one_exercise_per_file/week02/day04/ex00/audit/readme.md
  18. 70
      one_exercise_per_file/week02/day04/ex00/readme.md
  19. 9
      one_exercise_per_file/week02/day05/ex00/audit/readme.md
  20. 56
      one_exercise_per_file/week02/day05/ex00/readme.md
  21. 9
      one_exercise_per_file/week03/day01/ex00/audit/readme.md
  22. 49
      one_exercise_per_file/week03/day01/ex00/readme.md
  23. 9
      one_exercise_per_file/week03/day02/ex00/audit/readme.md
  24. 49
      one_exercise_per_file/week03/day02/ex00/readme.md
  25. 9
      one_exercise_per_file/week03/day03/ex00/audit/readme.md
  26. 48
      one_exercise_per_file/week03/day03/ex00/readme.md
  27. 9
      one_exercise_per_file/week03/day04/ex00/audit/readme.md
  28. 52
      one_exercise_per_file/week03/day04/ex00/readme.md
  29. 9
      one_exercise_per_file/week03/day05/ex00/audit/readme.md
  30. 47
      one_exercise_per_file/week03/day05/ex00/readme.md

2
one_exercise_per_file/week01/day01/ex00/audit/readme.md

@@ -10,8 +10,6 @@
##### Do `import jupyter` and `import numpy` run without any error?
###### Does it display the right types as above?
###### Have you used the following command: `jupyter notebook --port 8891`?
###### Is there a file named `Notebook_ex00.ipynb` in the working directory?

35
one_exercise_per_file/week01/day01/ex00/readme.md

@@ -1,7 +1,42 @@
# W1D01 Piscine AI - Data Science
## NumPy
The goal of this day is to understand practical usage of **NumPy**. **NumPy** is a commonly used Python data analysis package. By using **NumPy**, you can speed up your workflow, and interface with other packages in the Python ecosystem, like scikit-learn, that use **NumPy** under the hood. **NumPy** was originally developed in the mid 2000s, and arose from an even older package called Numeric. This longevity means that almost every data analysis or machine learning package for Python leverages **NumPy** in some way.
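As a tiny illustration of why **NumPy** speeds up a workflow, here is a hypothetical micro-example (not one of the exercises) comparing a plain Python loop with a vectorized operation:

```python
import numpy as np

# Pure Python: squares computed element by element in the interpreter
values = list(range(1_000_000))
squared_loop = [v ** 2 for v in values]

# NumPy: the same operation applied to the whole array at once
array = np.arange(1_000_000)
squared_vec = array ** 2

print(type(array), array.dtype, array.shape)
```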
## Exercises of the day
- Exercise 0 Environment and libraries
- Exercise 1 Your first NumPy array
- Exercise 2 Zeros
- Exercise 3 Slicing
- Exercise 4 Random
- Exercise 5 Split, concatenate, reshape arrays
- Exercise 6 Broadcasting and Slicing
- Exercise 7 NaN
- Exercise 8 Wine
- Exercise 9 Football tournament
## Virtual Environment
- Python 3.x
- NumPy
- Jupyter or JupyterLab
*Version of NumPy I used to do the exercises: 1.18.1*.
I suggest using the most recent one.
## Resources
- https://medium.com/fintechexplained/why-should-we-use-NumPy-c14a4fb03ee9
- https://numpy.org/doc/
- https://jakevdp.github.io/PythonDataScienceHandbook/
# Exercise 0 Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries and to learn to launch a `jupyter notebook`. Jupyter notebooks are very convenient as they allow you to write and test code within seconds. However, it is really easy to write unstable and non-reproducible code in notebooks. Keep the notebook and the underlying code clean. An article below details when notebooks should be used. Notebooks can be used for most of the exercises of the piscine, as the goal is to experiment A LOT. But no worries, you'll be asked to build a more robust structure for all the projects.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python. However, for educational purposes you will install a specific version of Python in this exercise.

9
one_exercise_per_file/week01/day02/ex00/audit/readme.md

@@ -0,0 +1,9 @@
##### The exercise is validated if all the questions of the exercise are validated
##### Activate the virtual environment. If you used `conda`, run `conda activate your_env`
##### Run `python --version`
###### Does it print `Python 3.x` with x >= 8?
##### Do `import jupyter`, `import numpy` and `import pandas` run without any error?

64
one_exercise_per_file/week01/day02/ex00/readme.md

@@ -0,0 +1,64 @@
# W1D02 Piscine AI - Data Science
## Pandas
The goal of this day is to understand practical usage of **Pandas**.
As **Pandas** is intensively used in Data Science, other days of the piscine will be dedicated to it.
Not only is the **Pandas** library a central component of the data science toolkit, but it is used in conjunction with other libraries in that collection.
**Pandas** is built on top of the NumPy package, meaning a lot of the structure of NumPy is used or replicated in **Pandas**. Data in **Pandas** is often used to feed statistical analysis in SciPy, plotting functions from Matplotlib, and machine learning algorithms in Scikit-learn.
Most of the topics we will cover today are explained and described with examples in the first resource. The number of exercises is low on purpose: take the time to understand chapter 5 of the resource, even if it is 40 pages long.
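As a first taste, a minimal DataFrame built on top of NumPy data (a toy sketch, not one of the exercises):

```python
import numpy as np
import pandas as pd

# A DataFrame is a labeled 2D structure built on top of NumPy arrays
df = pd.DataFrame(np.random.randn(4, 3), columns=["a", "b", "c"])

print(df.head())       # first rows
print(df.describe())   # basic statistics per column
print(df["a"].mean())  # each column is a Series with vectorized methods
```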
## Exercises of the day
- Exercise 0 Environment and libraries
- Exercise 1 Your first DataFrame
- Exercise 2 Electric power consumption
- Exercise 3 E-commerce purchases
- Exercise 4 Handling missing values
## Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Jupyter or JupyterLab
*Version of Pandas I used to do the exercises: 1.0.1*.
I suggest using the most recent one.
## Resources
- If I had to give you one resource it would be this one:
https://bedford-computing.co.uk/learning/wp-content/uploads/2015/10/Python-for-Data-Analysis.pdf
It contains ALL you need to know about Pandas.
- Pandas documentation:
- https://pandas.pydata.org/docs/
- https://jakevdp.github.io/PythonDataScienceHandbook/
- https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
- https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
- https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html
# Exercise 0 Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the required libraries.
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy` and `jupyter`.
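Once the environment is created and activated, a minimal sanity check could look like this (a sketch that mirrors what the audit verifies; the file name `check_env.py` is hypothetical):

```python
# check_env.py - run inside the activated ex00 environment
import sys

import jupyter  # noqa: F401  (we only check that the import succeeds)
import numpy as np
import pandas as pd

assert sys.version_info >= (3, 8), "Python >= 3.8 is required"
print(sys.version)
print("numpy:", np.__version__)
print("pandas:", pd.__version__)
```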

9
one_exercise_per_file/week01/day03/ex00/audit/readme.md

@@ -0,0 +1,9 @@
##### The exercise is validated if all the questions of the exercise are validated
##### Activate the virtual environment. If you used `conda`, run `conda activate your_env`
##### Run `python --version`
###### Does it print `Python 3.x` with x >= 8?
##### Do `import jupyter`, `import numpy`, `import pandas`, `import matplotlib` and `import plotly` run without any error?

62
one_exercise_per_file/week01/day03/ex00/readme.md

@@ -0,0 +1,62 @@
# W1D03 Piscine AI - Data Science
## Visualizations
While working on a dataset, it is important to check the distribution of the data. Obviously, for most humans it is difficult to visualize data in more than 3 dimensions.
"Viz" is important to understand the data and to show results. We'll discover three of the most used visualization libraries in Python:
- Pandas visualization module
- Matplotlib
- Plotly
The goal is to understand the basics of those libraries. You'll have time during the project to master one (or all three) of them.
You may wonder why using one library is not enough. The reason is simple: it depends on the usage.
For example, if you want to check the data quickly, you may want to use the Pandas viz module or Matplotlib.
If you want a custom, more elaborate plot, I suggest using Matplotlib or Plotly.
And if you want to create a very nice and interactive plot, I suggest using Plotly.
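A quick sketch of the first two options on the same toy data (hypothetical data, just to show the entry points):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(100), "y": np.random.randn(100).cumsum()})

# Quick look with the Pandas viz module (a thin wrapper around Matplotlib)
df.plot(x="x", y="y", title="Quick look")

# More control with Matplotlib directly
fig, ax = plt.subplots()
ax.plot(df["x"], df["y"], label="random walk")
ax.set_xlabel("x")
ax.legend()
plt.show()
```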
## Exercises of the day
- Exercise 0 Environment and libraries
- Exercise 1 Pandas plot 1
- Exercise 2 Pandas plot 2
- Exercise 3 Matplotlib 1
- Exercise 4 Matplotlib 2
- Exercise 5 Matplotlib subplots
- Exercise 6 Plotly 1
- Exercise 7 Plotly Box plots
## Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Matplotlib
- Plotly
- Jupyter or JupyterLab
I suggest using the most recent versions of the packages.
## Resources
- https://matplotlib.org/3.3.3/tutorials/index.html
- https://towardsdatascience.com/matplotlib-tutorial-learn-basics-of-pythons-powerful-plotting-library-b5d1b8f67596
- https://github.com/rougier/matplotlib-tutorial
- https://jakevdp.github.io/PythonDataScienceHandbook/05.13-kernel-density-estimation.html
# Exercise 0 Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the required libraries.
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `plotly`.

9
one_exercise_per_file/week01/day04/ex00/audit/readme.md

@@ -0,0 +1,9 @@
##### The exercise is validated if all the questions of the exercise are validated
##### Activate the virtual environment. If you used `conda`, run `conda activate your_env`
##### Run `python --version`
###### Does it print `Python 3.x` with x >= 8?
##### Do `import jupyter`, `import numpy` and `import pandas` run without any error?

55
one_exercise_per_file/week01/day04/ex00/readme.md

@@ -0,0 +1,55 @@
# W1D04 Piscine AI - Data Science
## Data wrangling with Pandas
Data wrangling is one of the crucial tasks in data science and analysis. It includes operations like:
- Data Sorting: To rearrange values in ascending or descending order.
- Data Filtration: To create a subset of available data.
- Data Reduction: To eliminate or replace unwanted values.
- Data Access: To read or write data files.
- Data Processing: To perform aggregation, statistical, and similar operations on specific values.
As explained before, Pandas is an open source library specifically developed for data science and analysis. It is built upon the NumPy package (to handle numeric data in tabular form) and has built-in data structures to ease the process of data manipulation, aka data munging/wrangling.
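A toy sketch of two of those operations, merging and aggregating (hypothetical data, for illustration only):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "city": ["Paris", "Lyon", "Nice"]})
right = pd.DataFrame({"id": [1, 2, 4], "sales": [10, 20, 30]})

# Merge two DataFrames on a shared key (an SQL-style join)
merged = left.merge(right, on="id", how="inner")

# Group and aggregate: total sales per city
totals = merged.groupby("city")["sales"].sum()
print(totals)
```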
## Exercises of the day
- Exercise 0 Environment and libraries
- Exercise 1 Concatenate
- Exercise 2 Merge
- Exercise 3 Merge MultiIndex
- Exercise 4 Groupby Apply
- Exercise 5 Groupby Agg
- Exercise 6 Unstack
## Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Jupyter or JupyterLab
*Version of Pandas I used to do the exercises: 1.0.1*.
I suggest using the most recent one.
## Resources
- https://jakevdp.github.io/PythonDataScienceHandbook/
- https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
- https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
- https://towardsdatascience.com/different-ways-to-iterate-over-rows-in-a-pandas-dataframe-performance-comparison-dc0d5dcef8fe
# Exercise 0 Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the required libraries.
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy` and `jupyter`.

9
one_exercise_per_file/week01/day05/ex00/audit/readme.md

@@ -0,0 +1,9 @@
##### The exercise is validated if all the questions of the exercise are validated
##### Activate the virtual environment. If you used `conda`, run `conda activate your_env`
##### Run `python --version`
###### Does it print `Python 3.x` with x >= 8?
##### Do `import jupyter`, `import numpy` and `import pandas` run without any error?

52
one_exercise_per_file/week01/day05/ex00/readme.md

@@ -0,0 +1,52 @@
# W1D05 Piscine AI - Data Science
## Time Series with Pandas
Time series data are data that are indexed by a sequence of dates or times. Today, you'll learn how to use methods built into Pandas to work with this index. You'll also learn, for instance:
- to resample time series to change the frequency
- to calculate rolling and cumulative values for time series
- to build a backtest
Time series are used A LOT in finance. You'll learn to evaluate financial strategies using Pandas. It is important to keep in mind that Pandas is vectorized. That's why some questions constrain you not to use a for loop ;-).
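A minimal sketch of a date-indexed Series and the kind of vectorized operations you'll use (hypothetical data):

```python
import numpy as np
import pandas as pd

# A daily time series indexed by dates
index = pd.date_range("2021-01-01", periods=90, freq="D")
prices = pd.Series(100 + np.random.randn(90).cumsum(), index=index)

weekly = prices.resample("W").mean()        # change the frequency to weekly
rolling = prices.rolling(window=7).mean()   # 7-day rolling mean
returns = prices.pct_change()               # daily returns, no for loop needed

print(weekly.head(), rolling.tail(3), returns.head(3), sep="\n")
```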
## Exercises of the day
- Exercise 0 Environment and libraries
- Exercise 1 Series
- Exercise 2 Financial data
- Exercise 3 Multi asset returns
- Exercise 4 Backtest
## Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Jupyter or JupyterLab
*Version of Pandas I used to do the exercises: 1.0.1*.
I suggest using the most recent one.
## Resources
- https://jakevdp.github.io/PythonDataScienceHandbook/
- https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
- https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
- https://towardsdatascience.com/different-ways-to-iterate-over-rows-in-a-pandas-dataframe-performance-comparison-dc0d5dcef8fe
# Exercise 0 Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the required libraries.
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy` and `jupyter`.

9
one_exercise_per_file/week02/day01/ex00/audit/readme.md

@@ -0,0 +1,9 @@
##### The exercise is validated if all the questions of the exercise are validated
##### Activate the virtual environment. If you used `conda`, run `conda activate your_env`
##### Run `python --version`
###### Does it print `Python 3.x` with x >= 8?
##### Do `import jupyter`, `import numpy`, `import pandas`, `import matplotlib` and `import sklearn` run without any error?

81
one_exercise_per_file/week02/day01/ex00/readme.md

@@ -0,0 +1,81 @@
# W2D01 Piscine AI - Data Science
![Linear regression animation](w2_day01_linear_regression_video.gif)
## Linear regression with Scikit Learn
The goal of this day is to understand practical Linear regression and supervised learning.
The word "regression" was introduced by Sir Francis Galton (a cousin of C. Darwin) when he
studied the size of individuals within a progeny. He was trying to understand why
large individuals in a population appeared to have smaller children, more
close to the average population size; hence the introduction of the term "regression".
Today we will learn a basic algorithm used in **supervised learning**: **linear regression**. We will be using **Scikit-learn**, which is a machine learning library designed to interoperate with the Python libraries NumPy and Pandas.
We will also progressively learn the Machine Learning methodology for supervised learning - today we will focus on evaluating a machine learning model by splitting the data set into a train set and a test set.
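A minimal sketch of that workflow (toy generated data; the noise level and split ratio are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Toy 1D data: y = 2x + noise
X = np.random.rand(200, 1)
y = 2 * X[:, 0] + 0.1 * np.random.randn(200)

# Hold out a test set so the model is evaluated on data it has never seen
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)

model = LinearRegression()
model.fit(X_train, y_train)
print("coefficient:", model.coef_)
print("R2 on the test set:", model.score(X_test, y_test))
```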
## Exercises of the day
- Exercise 0 Environment and libraries
- Exercise 1 Scikit-learn estimator
- Exercise 2 Linear regression in 1D
- Exercise 3 Train test split
- Exercise 4 Forecast diabetes progression
- Bonus: Exercise 5 Gradient Descent - **Optional**
## Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Matplotlib
- Scikit Learn
- Jupyter or JupyterLab
*Version of Scikit-learn I used to do the exercises: 0.22*. I suggest using the most recent one. Scikit-learn 1.0 is finally available after ... 14 years.
## Resources
### To start with Scikit-learn
- https://scikit-learn.org/stable/tutorial/basic/tutorial.html
- https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html
- https://scikit-learn.org/stable/modules/linear_model.html
### Machine learning methodology and algorithms
- This course provides a broad introduction to machine learning, data mining, and statistical pattern recognition. Andrew Ng is a star in the Machine Learning community. I recommend spending some time during the projects focusing on some algorithms. However, Python is not the language used for the course. https://www.coursera.org/learn/machine-learning
- https://docs.microsoft.com/en-us/azure/machine-learning/algorithm-cheat-sheet
- https://scikit-learn.org/stable/tutorial/index.html
### Linear Regression
- https://towardsdatascience.com/laymans-introduction-to-linear-regression-8b334a3dab09
- https://towardsdatascience.com/linear-regression-the-actually-complete-introduction-67152323fcf2
### Train test split
- https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/
- https://developers.google.com/machine-learning/crash-course/training-and-test-sets/video-lecture?hl=en
# Exercise 0 Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the required libraries.
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`.

9
one_exercise_per_file/week02/day02/ex00/audit/readme.md

@@ -0,0 +1,9 @@
##### The exercise is validated if all the questions of the exercise are validated
##### Activate the virtual environment. If you used `conda`, run `conda activate your_env`
##### Run `python --version`
###### Does it print `Python 3.x` with x >= 8?
##### Do `import jupyter`, `import numpy`, `import pandas`, `import matplotlib` and `import sklearn` run without any error?

77
one_exercise_per_file/week02/day02/ex00/readme.md

@@ -0,0 +1,77 @@
# W2D02 Piscine AI - Data Science
## Classification with Scikit Learn
The goal of this day is to understand practical classification.
Today we will learn a different approach in Machine Learning: classification, which is a large domain in the field of statistics and machine learning. Generally, it can be broken down into two areas:
- **Binary classification**, where we wish to group an outcome into one of two groups.
- **Multi-class classification**, where we wish to group an outcome into one of multiple (more than two) groups.
You may wonder why the approach is different from regression, and why we don't simply use linear regression and define a threshold above which the class would be 1, else 0 - in binary classification.
The main reason is that linear regression is sensitive to outliers, hence the threshold would vary depending on the outliers in the data. The article mentioned explains this reason with plots. To keep things simple, we can say that the output needed in classification is a probability of belonging to one of the classes. So, by definition, the value output by the classification model has to be between 0 and 1. Linear regression can't satisfy this constraint.
In mathematics, there are functions with nice properties that take as input a real number in (-inf, inf) and output a value between 0 and 1; the most popular of them is the **sigmoid**, which is the inverse function of the logit, hence the name logistic regression.
Let's take a small example to get a better understanding of the steps needed to perform a logistic regression on binary data. Let's assume that we want to predict gender given a person's height.
Logistic regression steps:
- Fit a sigmoid on the training data
- Compute sigmoid(size)=0.7 because the sigmoid returns values between 0 and 1
- Return the class: 0.7 > 0.5 => class 1. Thus, the gender is male
For the linear regression exercises, the loss (Mean Square Error - MSE) is minimized with an algorithm called **gradient descent**. In classification, the MSE loss can't be used because the expected output of the model is 0 or 1 (for binary classification).
The **logloss**, or **cross entropy**, is the loss used for classification. Similarly, it has some nice mathematical properties. The minimization of the **logloss** is not covered in the exercises. However, since it is used in most machine learning models for classification, I recommend spending some time reading the related article.
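A numeric sketch of these two building blocks, the sigmoid and the log loss, on a single toy prediction (the values are assumed, for illustration only):

```python
import numpy as np

def sigmoid(z):
    # Maps any real number to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Toy logistic regression output for one person of a given height
z = 0.85                     # assumed value of the linear combination w * height + b
proba = sigmoid(z)           # ~0.70: probability of belonging to class 1
predicted_class = int(proba > 0.5)

# Log loss for a true label y and a predicted probability
y = 1
logloss = -(y * np.log(proba) + (1 - y) * np.log(1 - proba))
print(proba, predicted_class, logloss)
```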
## Exercises of the day
- Exercise 0 Environment and libraries
- Exercise 1 Logistic regression with Scikit-learn
- Exercise 2 Sigmoid
- Exercise 3 Decision boundary
- Exercise 4 Train test split
- Exercise 5 Breast Cancer prediction
- Bonus: Exercise 6 Multi-class - **Optional**
## Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Matplotlib
- Scikit Learn
- Jupyter or JupyterLab
*Version of Scikit-learn I used to do the exercises: 0.22*. I suggest using the most recent one. Scikit-learn 1.0 is finally available after ... 14 years.
## Resources
### Logistic regression
- https://towardsdatascience.com/understanding-logistic-regression-9b02c2aec102
### Logloss
- https://towardsdatascience.com/cross-entropy-for-classification-d98e7f974451
- https://medium.com/swlh/what-is-logistic-regression-62807de62efa
# Exercise 0 Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the required libraries.
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`.

9
one_exercise_per_file/week02/day03/ex00/audit/readme.md

@@ -0,0 +1,9 @@
##### The exercise is validated if all the questions of the exercise are validated
##### Activate the virtual environment. If you used `conda`, run `conda activate your_env`
##### Run `python --version`
###### Does it print `Python 3.x` with x >= 8?
##### Do `import jupyter`, `import numpy`, `import pandas`, `import matplotlib` and `import sklearn` run without any error?

74
one_exercise_per_file/week02/day03/ex00/readme.md

@@ -0,0 +1,74 @@
# W2D03 Piscine AI - Data Science
## Machine Learning Pipeline
Today we will focus on data preprocessing and discover the Pipeline object from Scikit-learn. Preprocessing typically involves four steps:
1. Manage categorical variables with Integer encoding and One Hot Encoding
2. Impute the missing values
3. Reduce the dimension of the data
4. Scale the data
- **Step 1** is always necessary: models use numbers, so string data, for instance, can't be processed raw.
- **Step 2** is always necessary: missing values have no mathematical representation, which is why they have to be imputed.
- **Step 3** is required when the dimension of the data set is high. Dimension reduction algorithms reduce the dimensionality of the data either by selecting the variables that contain most of the information (SelectKBest) or by transforming the data. Depending on the signal in the data and the data set size, dimension reduction is not always required. This step is not covered because of its complexity; understanding the theory behind it is important, and I suggest giving it a try during the projects.
- **Step 4** is required when using some types of Machine Learning algorithms. The algorithms that require feature scaling are mostly KNN (K-Nearest Neighbors), Neural Networks, Linear Regression, and Logistic Regression. The reason some algorithms work better with feature scaling is that minimizing the loss function may be more difficult if each feature's range is completely different.
These steps are sequential. The output of step 1 is used as input for step 2, and so on; the output of step 4 is used as input for the Machine Learning model.
Scikit-learn provides an object for this: the Pipeline.
As we know, the model evaluation methodology requires splitting the data set into a train set and a test set. **The preprocessing is learned/fitted on the training set and applied to the test set**.
The Pipeline takes as input the preprocessing transforms and a Machine Learning model, and can then be called the same way a Machine Learning model is called. This is pretty practical because we no longer need to carry many objects around.
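A minimal Pipeline sketch chaining an imputer (step 2), a scaler (step 4) and a model (toy data with a missing value; all names are illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with a missing value
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [4.0, 8.0]] * 10)
y = np.array([0, 0, 1, 1] * 10)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=43)

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),  # step 2: impute missing values
    ("scaler", StandardScaler()),                 # step 4: scale the data
    ("model", LogisticRegression()),
])

# The preprocessing is fitted on the train set only, then applied to the test set
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```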
## Exercises of the day
- Exercise 0 Environment and libraries
- Exercise 1 Imputer 1
- Exercise 2 Scaler
- Exercise 3 One hot Encoder
- Exercise 4 Ordinal Encoder
- Exercise 5 Categorical variables
- Exercise 6 Pipeline
## Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Matplotlib
- Scikit Learn
- Jupyter or JupyterLab
*Version of Scikit-learn I used to do the exercises: 0.22*. I suggest using the most recent one. Scikit-learn 1.0 is finally available after ... 14 years.
## Resources
### Step 3
- https://towardsdatascience.com/dimensionality-reduction-for-machine-learning-80a46c2ebb7e
### Step 4
- https://medium.com/@societyofai/simplest-way-for-feature-scaling-in-gradient-descent-ae0aaa383039#:~:text=Feature%20scaling%20is%20an%20idea,of%20convergence%20of%20gradient%20descent.
### Pipeline
- https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
# Exercise 0 Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the required libraries.
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`.

9
one_exercise_per_file/week02/day04/ex00/audit/readme.md

@@ -0,0 +1,9 @@
##### The exercise is validated if all the questions of the exercise are validated
##### Activate the virtual environment. If you used `conda`, run `conda activate your_env`
##### Run `python --version`
###### Does it print `Python 3.x` with x >= 8?
##### Do `import jupyter`, `import numpy`, `import pandas`, `import matplotlib` and `import sklearn` run without any error?

70
one_exercise_per_file/week02/day04/ex00/readme.md

@@ -0,0 +1,70 @@
# W2D04 Piscine AI - Data Science
## Train and evaluate Machine Learning models
Today we will learn how to train and evaluate a machine learning model. You'll learn how to choose the right Machine Learning metric depending on the problem you are solving, and how to compute it. A metric gives an idea of how well the model performs. Depending on whether you are working on a classification or a regression problem, the metrics considered are different. It is important to understand that all metrics are just metrics, not the truth.
We will focus on the most important metrics:
- Regression:
- **R2**, **Mean Square Error**, **Mean Absolute Error**
- Classification:
- **F1 score**, **accuracy**, **precision**, **recall** and **AUC scores**. Even if it is not considered a metric, the **confusion matrix** is always useful to understand the model performance.
Warning: **Imbalanced data set**
Let us assume we are predicting a rare event that occurs less than 2% of the time. Getting a model with good accuracy is easy: it doesn't have to be "smart"; all it has to do is always predict the majority class. Depending on the problem, this can be disastrous. For example, working with real-life data, breast cancer prediction is an imbalanced problem where predicting the majority class leads to disastrous consequences. That is why metrics such as AUC are useful. Before computing the metrics, read this article carefully to understand the role of these metrics.
You'll learn to train other types of Machine Learning models than linear regression and logistic regression. You're not supposed to spend time understanding the theory; I recommend doing that during the projects. Today, read the Scikit-learn documentation to get a basic understanding of the models you use. Focus on how to use those Machine Learning models correctly with Scikit-learn.
You'll also learn what a grid search is and how to use it to train your machine learning models.
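A compact sketch combining a grid search with several metrics on a toy imbalanced problem (generated data; the parameter grid is an assumption for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy imbalanced problem: ~10% positive class
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=43)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=43)

# The grid search tries every parameter combination and keeps the best one
grid = GridSearchCV(RandomForestClassifier(random_state=43),
                    param_grid={"n_estimators": [10, 50], "max_depth": [3, 5]})
grid.fit(X_train, y_train)

y_pred = grid.predict(X_test)
y_proba = grid.predict_proba(X_test)[:, 1]
print("accuracy:", accuracy_score(y_test, y_pred))  # misleading when imbalanced
print("f1:", f1_score(y_test, y_pred))
print("auc:", roc_auc_score(y_test, y_proba))
```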
## Exercises of the day
- Exercise 0 Environment and libraries
- Exercise 1 MSE Scikit-learn
- Exercise 2 Accuracy Scikit-learn
- Exercise 3 Regression
- Exercise 4 Classification
- Exercise 5 Machine Learning models
- Exercise 6 Grid Search
## Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Matplotlib
- Scikit Learn
- Jupyter or JupyterLab
*Version of Scikit-learn I used to do the exercises: 0.22*. I suggest using the most recent one. Scikit-learn 1.0 is finally available after ... 14 years.
## Resources
### Metrics
- https://www.kdnuggets.com/2018/06/right-metric-evaluating-machine-learning-models-2.html
### Imbalanced datasets
- https://stats.stackexchange.com/questions/260164/auc-and-class-imbalance-in-training-test-dataset
### Gridsearch
- https://medium.com/fintechexplained/what-is-grid-search-c01fe886ef0a
# Exercise 0 Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the required libraries.
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`.

9
one_exercise_per_file/week02/day05/ex00/audit/readme.md

@@ -0,0 +1,9 @@
##### The exercise is validated if all the questions of the exercise are validated
##### Activate the virtual environment. If you used `conda`, run `conda activate your_env`
##### Run `python --version`
###### Does it print `Python 3.x` with x >= 8?
##### Do `import jupyter`, `import numpy`, `import pandas`, `import matplotlib` and `import sklearn` run without any error?

56
one_exercise_per_file/week02/day05/ex00/readme.md

@@ -0,0 +1,56 @@
# W2D05 Piscine AI - Data Science
## Model selection methodology
If you finished yesterday's exercises, you should be able to train several Machine Learning algorithms and to choose the one returned by GridSearchCV.
GridSearchCV returns the model that gives the best score on the test set. Yesterday, as I told you, I changed the **cv** parameter so that the GridSearch is computed with a train set and a test set.
It means that the selected model is based on one single measure. What if, by luck, we predict correctly on that split? What if the best model is actually bad? What if I could have selected a better model?
We will answer these questions today! The topics we will cover are among the most important in Machine Learning.
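A minimal k-fold cross-validation sketch with scikit-learn (toy generated data; the model and `cv=5` are assumptions for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Toy data; k-fold gives k scores instead of a single one
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=43)

scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores)                       # one score per fold
print(scores.mean(), scores.std())  # a more robust estimate than a single split
```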
## Exercises of the day
- Exercise 0 Environment and libraries
- Exercise 1 K-Fold
- Exercise 2 Cross validation (k-fold)
- Exercise 3 GridsearchCV
- Exercise 4 Validation curve and Learning curve
## Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Jupyter or JupyterLab
- Scikit-learn
- Matplotlib
*Version of Pandas I used to do the exercises: 1.0.1*.
I suggest using the most recent one.
## Resources
**Must read before starting the exercises**
### Bias-Variance trade-off, aka Underfitting/Overfitting:
- https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/
- https://jakevdp.github.io/PythonDataScienceHandbook/05.03-hyperparameters-and-model-validation.html
### Cross-validation
- https://algotrading101.com/learn/train-test-split/
# Exercise 0 Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the required libraries.
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`.

9
one_exercise_per_file/week03/day01/ex00/audit/readme.md

@@ -0,0 +1,9 @@
##### The exercise is validated if all the questions of the exercise are validated
##### Activate the virtual environment. If you used `conda`, run `conda activate your_env`
##### Run `python --version`
###### Does it print `Python 3.x` with x >= 8?
##### Do `import jupyter` and `import numpy` run without any error?

49
one_exercise_per_file/week03/day01/ex00/readme.md

@@ -0,0 +1,49 @@
# W3D01 Piscine AI - Data Science
## Neural Networks
Last week you learnt about some Machine Learning algorithms such as Random Forest or Gradient Boosting. Neural Networks are another type of Machine Learning algorithm that is intensively used because of its efficiency. Neural networks are a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns. They interpret sensory data through a kind of machine perception, labeling or clustering raw input. The patterns they recognize are numerical, contained in vectors, into which all real-world data, be it images, sound, text or time series, must be translated. Different types of neural networks exist and are specific to certain use cases: for example, CNNs for images, RNNs or LSTMs for time series or text, etc.
Today we will focus on Artificial Neural Networks. The goal is to understand how neural networks work, to train them on data, and to understand the challenges of training a neural network. The resources below explain very well the mechanisms behind neural networks, step by step.
However, the exercises won't cover architectures such as RNNs and LSTMs (used on sequences like time series or text) or CNNs (used a lot in image processing). One of the projects will require knowing how to use these special architectures. To do so, I suggest going through this course: https://fr.coursera.org/specializations/deep-learning.
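As a first intuition, a single artificial neuron is just a weighted sum of its inputs passed through an activation function. A minimal NumPy sketch (input values and weights are assumed):

```python
import numpy as np

def sigmoid(z):
    # Activation function: maps any real number to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])  # input vector
w = np.array([0.1, 0.4, -0.2])  # weights
b = 0.05                        # bias

# The neuron's output: activation(w . x + b)
output = sigmoid(np.dot(w, x) + b)
print(output)
```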
## Exercises of the day
- Exercise 0 Environment and libraries
- Exercise 1 The neuron
- Exercise 2 Neural network
- Exercise 3 Log loss
- Exercise 4 Forward propagation
- Exercise 5 Regression
## Virtual Environment
- Python 3.x
- NumPy
- Jupyter or JupyterLab
*Version of NumPy I used to do the exercises: 1.18.1*.
I suggest using the most recent one.
## Resources
- https://victorzhou.com/blog/intro-to-neural-networks/
- https://srnghn.medium.com/deep-learning-overview-of-neurons-and-activation-functions-1d98286cf1e4
- https://towardsdatascience.com/machine-learning-for-beginners-an-introduction-to-neural-networks-d49f22d238f9
# Exercise 0 Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the required libraries.
1. Create a virtual environment with a version of Python >= `3.8`, with the following libraries: `numpy` and `jupyter`.

9
one_exercise_per_file/week03/day02/ex00/audit/readme.md

@@ -0,0 +1,9 @@
##### The exercise is validated if all the questions of the exercise are validated
##### Activate the virtual environment. If you used `conda`, run `conda activate your_env`
##### Run `python --version`
###### Does it print `Python 3.x` with x >= 8?
##### Do `import jupyter`, `import numpy`, `import pandas` and `import keras` run without any error?

49
one_exercise_per_file/week03/day02/ex00/readme.md

@@ -0,0 +1,49 @@
# W3D02 Piscine AI - Data Science
## Keras
The goal of this day is to learn to use Keras to build Neural Networks. As explained on the Keras website, Keras is a deep learning API written in Python, running on top of the machine learning platform TensorFlow. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result as fast as possible is key to doing good research.
TensorFlow, created by the Google Brain team, is an open source library for numerical computation and large-scale machine learning. TensorFlow bundles together a slew of machine learning and deep learning (aka neural networking) models and algorithms and makes them useful by way of a common metaphor. It uses Python to provide a convenient front-end API for building applications with the framework, while executing those applications in high-performance C++.
There are two ways to build Keras models: sequential and functional. The sequential API allows you to create models layer by layer for most problems. It is limited in that it does not allow you to create models that share layers or have multiple inputs or outputs. The exercises focus on the usage of the sequential API.
Note:
The audit will provide the code and output because it is not straightforward to reproduce results using Keras. There are many sources of randomness. Even if all the seeds are fixed to a constant, there may be other sources of randomness. https://machinelearningmastery.com/reproducible-results-neural-networks-keras/
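A minimal sketch of the sequential API (assuming TensorFlow is installed; depending on your setup the import may be `import keras` directly):

```python
from tensorflow import keras
from tensorflow.keras import layers

# A small fully-connected network built layer by layer
model = keras.Sequential([
    keras.Input(shape=(4,)),
    layers.Dense(8, activation="relu"),
    layers.Dense(4, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # one unit in (0, 1) for binary classification
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```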
## Exercises of the day
- Exercise 0 Environment and libraries
- Exercise 1 Sequential
- Exercise 2 Dense
- Exercise 3 Architecture
- Exercise 4 Optimize
## Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Jupyter or JupyterLab
- Keras
*Version of Keras I used to do the exercises: 2.4.3*.
I suggest using the most recent one.
## Resources
- https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
# Exercise 0 Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the required libraries.
1. Create a virtual environment, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter`, and `keras`.

9
one_exercise_per_file/week03/day03/ex00/audit/readme.md

@@ -0,0 +1,9 @@
##### The exercise is validated if all the questions of the exercise are validated
##### Activate the virtual environment. If you used `conda`, run `conda activate your_env`
##### Run `python --version`
###### Does it print `Python 3.x` with x >= 8?
##### Do `import jupyter`, `import numpy`, `import pandas` and `import keras` run without any error?

48
one_exercise_per_file/week03/day03/ex00/readme.md

@@ -0,0 +1,48 @@
# W3D03 Piscine AI - Data Science
## Keras 2
The goal of this day is to learn to use Keras to build Neural Networks and train them on small data sets. This helps to understand the specifics of networks for classification and regression.
Note:
The audit will provide the code and output because it is not straightforward to reproduce results using Keras. There are many sources of randomness. Even if all the seeds are fixed to a constant, there may be other sources of randomness. https://machinelearningmastery.com/reproducible-results-neural-networks-keras/
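For instance, a multi-class network typically ends with a softmax layer, which outputs one probability per class (a sketch under the same assumptions as the previous day; layer sizes are illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 3
model = keras.Sequential([
    keras.Input(shape=(8,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),  # probabilities summing to 1
])
# sparse_categorical_crossentropy expects integer labels;
# use categorical_crossentropy for one-hot encoded labels
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```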
## Exercises of the day
- Exercise 0 Environment and libraries
- Exercise 1 Regression - Optimize
- Exercise 2 Regression example
- Exercise 3 Multi classification - Softmax
- Exercise 4 Multi classification - Optimize
- Exercise 5 Multi classification example
## Virtual Environment
- Python 3.x
- NumPy
- Pandas
- Jupyter or JupyterLab
- Keras
*Version of Keras I used to do the exercises: 2.4.3*.
I suggest using the most recent one.
## Resources
- https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
# Exercise 0 Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the required libraries.
1. Create a virtual environment, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter` and `keras`.

9
one_exercise_per_file/week03/day04/ex00/audit/readme.md

@@ -0,0 +1,9 @@
##### The exercise is validated if all the questions of the exercise are validated
##### Activate the virtual environment. If you used `conda`, run `conda activate your_env`
##### Run `python --version`
###### Does it print `Python 3.x` with x >= 8?
##### Do `import jupyter`, `import pandas`, `import nltk` and `import sklearn` run without any error?

52
one_exercise_per_file/week03/day04/ex00/readme.md

@@ -0,0 +1,52 @@
# W3D04 Piscine AI - Data Science
## Natural Language processing
“NLP makes it possible for humans to talk to machines.” This branch of AI enables computers to understand, interpret, and manipulate human language. This technology is one of the most broadly applied areas of machine learning and is critical in effectively analyzing massive quantities of unstructured, text-heavy data.
Machine learning algorithms cannot work with raw text directly. Rather, the text must be converted into vectors of numbers. In natural language processing, a common technique for extracting features from text is to place all of the words that occur in the text in an unordered bucket. This approach is called a bag-of-words model, or BoW for short. It's referred to as a “bag” of words because any information about the structure of the sentence is lost. This is useful to train usual machine learning models on text data. Other types of models, such as RNNs or LSTMs, take as input a complete and ordered sequence.
Almost every Natural Language Processing (NLP) task requires text to be preprocessed before training a model. The article **Your Guide to Natural Language Processing (NLP)** gives a very good introduction to NLP.
Today, we will learn to preprocess text data and to create a bag-of-words representation, using the NLTK and spaCy packages for the preprocessing.
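A minimal bag-of-words sketch with scikit-learn's CountVectorizer (toy sentences; `get_feature_names_out` assumes scikit-learn >= 1.0):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Each document becomes a vector of word counts; word order is lost
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the vocabulary
print(bow.toarray())                       # one row of counts per document
```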
## Exercises of the day
- Exercise 0 Environment and libraries
- Exercise 1 Lower case
- Exercise 2 Punctuation
- Exercise 3 Tokenization
- Exercise 4 Stop words
- Exercise 5 Stemming
- Exercise 6 Text preprocessing
- Exercise 7 Bag of Words representation
## Virtual Environment
- Python 3.x
- Jupyter or JupyterLab
- Pandas
- Scikit Learn
- NLTK
I suggest using the most recent versions of the libraries.
## Resources
- https://towardsdatascience.com/your-guide-to-natural-language-processing-nlp-48ea2511f6e1
- https://towardsdatascience.com/word-embeddings-for-nlp-5b72991e01d4
# Exercise 0 Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the required libraries.
1. Create a virtual environment, with a version of Python >= `3.8`, with the following libraries: `pandas`, `jupyter`, `nltk` and `scikit-learn`.

9
one_exercise_per_file/week03/day05/ex00/audit/readme.md

@@ -0,0 +1,9 @@
##### The exercise is validated if all the questions of the exercise are validated
##### Activate the virtual environment. If you used `conda`, run `conda activate your_env`
##### Run `python --version`
###### Does it print `Python 3.x` with x >= 8?
##### Do `import jupyter`, `import pandas` and `import spacy` run without any error?

47
one_exercise_per_file/week03/day05/ex00/readme.md

@@ -0,0 +1,47 @@
# W3D05 Piscine AI - Data Science
## Natural Language processing with Spacy
spaCy is a natural language processing (NLP) library for Python designed to have fast performance, and with word embedding models built in, it's perfect for a quick and easy start. I don't need to detail what spaCy does; it is perfectly summarized by spaCy in this article: **spaCy 101: Everything you need to know**.
Today, we will learn to use a pre-trained embedding to convert a text into a vector and to compute similarity between words or sentences. Remember, embeddings translate large sparse vectors into a lower-dimensional space that preserves semantic relationships.
Word embedding is a technique where individual words of a domain or language are represented as real-valued vectors in a lower-dimensional space. The dimension of the BoW representation depends on the size of the vocabulary, which can easily reach 10k words. We will also learn to use NER and part-of-speech tagging. NER allows you to identify and segment named entities and classify or categorize them under various predefined classes. A part-of-speech tag is a special label assigned to each token (word) in a text corpus to indicate the part of speech, and often also other grammatical categories such as tense, number (plural/singular), case, etc.
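A short sketch of similarity, NER and part-of-speech tags, assuming a model with word vectors has been downloaded beforehand (e.g. `python -m spacy download en_core_web_md`):

```python
import spacy

# Requires a model with vectors, e.g.: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

doc1 = nlp("I like fast food.")
doc2 = nlp("I like pizza.")
print(doc1.similarity(doc2))  # similarity computed from the embeddings

doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")
for ent in doc.ents:          # named entities
    print(ent.text, ent.label_)
for token in doc[:4]:         # part-of-speech tags
    print(token.text, token.pos_)
```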
## Exercises of the day
- Exercise 0 Environment and libraries
- Exercise 1 Embedding 1
- Exercise 2 Tokenization
- Exercise 3 Embeddings 2
- Exercise 4 Sentences' similarity
- Exercise 5 NER
- Exercise 6 Part-of-speech tags
## Virtual Environment
- Python 3.x
- Jupyter or JupyterLab
- Pandas
- Spacy
I suggest using the most recent versions of the libraries.
## Resources
- https://spacy.io/usage/spacy-101
- https://spacy.io/api/doc
- https://www.analyticsvidhya.com/blog/2021/06/nlp-application-named-entity-recognition-ner-in-python-with-spacy/
- https://medium.com/mlearning-ai/nlp-04-part-of-speech-tagging-in-spacy-dc3e239c2726
# Exercise 0 Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the required libraries.
1. Create a virtual environment, with a version of Python >= `3.8`, with the following libraries: `pandas`, `jupyter` and `spacy`.