
fix: format audits and clean exercises

pull/42/head
Badr Ghazlane 2 years ago
commit 2924d82b16
1. one_exercise_per_file/week01/raid01/audit/readme.md (2 changes)
2. one_exercise_per_file/week02/raid02/audit/readme.md (39 changes)
3. one_exercise_per_file/week03/day01/readme.md (33 changes)
4. one_exercise_per_file/week03/day02/readme.md (36 changes)
5. one_exercise_per_file/week03/day03/readme.md (33 changes)
6. one_exercise_per_file/week03/day04/readme.md (25 changes)

one_exercise_per_file/week01/raid01/audit/readme.md (2 changes)

@@ -1,4 +1,4 @@
# RAID01 - Backtesting on the SP500 - correction
# RAID01 - Backtesting on the SP500 - audit
### Preliminary

one_exercise_per_file/week02/raid02/audit/readme.md (39 changes)

@@ -1,11 +1,10 @@
# Forest Cover Type Prediction - Correction
# RAID02 - Forest Cover Type Prediction - Audit
The goal of this project is to use cartographic variables to classify forest categories. You will have to analyse the data, create features, and train a machine learning model on the cartographic data to make it as accurate as possible.
### Preliminary
## Problem
###### Is the structure of the project as shown below?
The expected structure of the project is:
@@ -36,11 +35,11 @@ project
```
- The readme file contains a description of the project and explains how to run the code from an empty environment. It also gives a summary of the implementation of each python file. The preprocessing, which is a key part, should be described precisely. Finally, it should contain a conclusion that gives the performance of the strategy.
###### Does the readme file contain a description of the project, explain how to run the code from an empty environment, and give a summary of the implementation of each python file, especially details on the feature engineering, which is a key step?
- The environment has to contain all the libraries used, and their versions, that are necessary to run the code.
###### Does the environment contain all the libraries used, and their versions, that are necessary to run the code?
- The notebook is not evaluated.
## 1. Preprocessing and feature engineering:
@@ -51,7 +50,7 @@ project
### Data splitting
The data splitting structure is:
###### Is the data splitting (cross-validation) structured as follows?
```
DATA
@@ -71,13 +70,12 @@ DATA
```
- The train set (0) is divided into a train set (1) and a test set (1). The ratio is less than 33%.
- The cross validation splits the train set (1) into at least 5 folds. If the cross validation is stratified, that's a good point, but it is not a requirement.
##### The train set (0) is divided into a train set (1) and a test set (1). The ratio is less than 33%.
##### The cross validation splits the train set (1) into at least 5 folds. If the cross validation is stratified, that's a good point, but it is not a requirement.
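For reference during the audit, here is a minimal sketch of this splitting structure using scikit-learn. The file path and the target column name (`Cover_Type`) are illustrative assumptions, not requirements.

```python
# Minimal sketch, not the required implementation. Assumes the data sits in
# data/train.csv (hypothetical path) with the target in a "Cover_Type" column.
import pandas as pd
from sklearn.model_selection import StratifiedKFold, train_test_split

df = pd.read_csv("data/train.csv")
X = df.drop(columns=["Cover_Type"])
y = df["Cover_Type"]

# Train set (0) -> train set (1) + test set (1); the test ratio stays below 33%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# At least 5 folds on the train set (1); stratification is a plus, not required.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in cv.split(X_train, y_train):
    # fit on X_train.iloc[train_idx], validate on X_train.iloc[val_idx]
    pass
```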
### Gridsearch
- It contains at least these 5 different models:
- Gradient Boosting, KNN, Random Forest, SVM, Logistic Regression.
##### It contains at least these 5 different models: Gradient Boosting, KNN, Random Forest, SVM, Logistic Regression.
There are many options:
- 5 grid searches on 1 model
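As one possible reading of the "5 grid searches on 1 model" option, here is a minimal sketch running one grid search per required model family, reusing the split from the sketch above. The hyperparameter grids are illustrative, not required values.

```python
# Minimal sketch: one GridSearchCV per required model family.
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

searches = [
    (GradientBoostingClassifier(), {"n_estimators": [100, 300]}),
    (KNeighborsClassifier(), {"n_neighbors": [3, 5, 11]}),
    (RandomForestClassifier(), {"n_estimators": [100, 300]}),
    (SVC(), {"C": [0.1, 1, 10]}),
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
]

results = []
for model, grid in searches:
    gs = GridSearchCV(model, grid, cv=cv, scoring="accuracy")
    gs.fit(X_train, y_train)
    results.append((gs.best_estimator_, gs.best_score_))

# Keep the estimator with the best cross-validated accuracy.
best_model, best_score = max(results, key=lambda r: r[1])
```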
@@ -87,20 +85,21 @@ There are many options:
### Training
- Check that the **target is removed from the X** matrix
###### Is the **target removed from the X** matrix?
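A quick way to verify this point, assuming the target column is named `Cover_Type` as in the sketches above:

```python
# The target must not leak into the feature matrix.
assert "Cover_Type" not in X_train.columns
```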
### Results
Run predict.py on the test set, check that:
- Test (last day) accuracy > **0.65**.
Then, check:
- Train accuracy score < **0.98**. It can be checked on the learning curve. If you are not sure, load the model, load the training set (0), score on the training set (0).
- The confusion matrix is represented as a DataFrame. Example:
##### Run predict.py on the test set, check that: Test (last day) accuracy > **0.65**.
##### Train accuracy score < **0.98**.
It can be checked on the learning curve. If you are not sure, load the model, load the training set (0), score on the training set (0).
##### The confusion matrix is represented as a DataFrame. Example:
![alt text][confusion_matrix]
[confusion_matrix]: ../images/w2_weekend_confusion_matrix.png "Confusion matrix "
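A minimal sketch of building such a DataFrame, assuming `best_model`, `X_test` and `y_test` from the sketches above:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

y_pred = best_model.predict(X_test)
labels = sorted(y_test.unique())
cm = pd.DataFrame(
    confusion_matrix(y_test, y_pred, labels=labels),
    index=[f"true_{label}" for label in labels],    # rows: actual classes
    columns=[f"pred_{label}" for label in labels],  # columns: predicted classes
)
print(cm)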
- The learning curve for the best model is plotted. Example:
##### The learning curve for the best model is plotted. Example:
![alt text][logo_learning_curve]
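A minimal sketch of plotting the learning curve with scikit-learn, assuming `best_model`, `X_train` and `y_train` from the sketches above; the plotting details are illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import learning_curve

sizes, train_scores, val_scores = learning_curve(
    best_model, X_train, y_train,
    cv=5, scoring="accuracy", train_sizes=np.linspace(0.1, 1.0, 5),
)
plt.plot(sizes, train_scores.mean(axis=1), label="train accuracy")
plt.plot(sizes, val_scores.mean(axis=1), "g", label="validation accuracy")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```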
@@ -108,4 +107,4 @@ Then, check:
Note: The green line on the plot shows the accuracy on the validation set, not on the test set (1) and not on the test set (0).
- The trained model is saved as a pickle file
###### Is the trained model saved as a pickle file?
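A minimal sketch of saving and re-loading the trained model with pickle; the file name is illustrative.

```python
import pickle

with open("best_model.pkl", "wb") as f:
    pickle.dump(best_model, f)

# Re-load it to check the file is valid.
with open("best_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)
```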

one_exercise_per_file/week03/day01/readme.md (33 changes)

@@ -1,22 +1,33 @@
# W3D1 Piscine AI - Data Science
# W3D01 Piscine AI - Data Science
## Neural Networks
# Table of Contents:
Last week you learnt about some Machine Learning algorithms such as Random Forest or Gradient Boosting. Neural Networks are another type of Machine Learning algorithm that is intensively used because of its efficiency. Neural networks are a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns. They interpret sensory data through a kind of machine perception, labeling or clustering raw input. The patterns they recognize are numerical, contained in vectors, into which all real-world data, be it images, sound, text or time series, must be translated. Different types of neural networks exist and are specific to some use cases: for example, CNNs for images, RNNs or LSTMs for time series or text, etc.
Today we will focus on Artificial Neural Networks. The goal is to understand how neural networks work, to train them on data, and to understand the challenges of training a neural network. The resources below explain the mechanisms behind neural networks very well, step by step.
# Introduction
However, the exercises won't cover architectures such as RNN and LSTM (used on sequences such as time series or text) or CNN (used a lot in image processing). One of the projects will require knowing how to use these special architectures. To do so, I suggest that you go through this lesson: https://fr.coursera.org/specializations/deep-learning.
Deep learning is a huge domain. We will focus on Artificial Neural Networks. The goal is to understand how neural networks train, to train them on data, and to understand the challenges of training a neural network.
Architectures such as RNN and LSTM (which learn sequences, used in TS and NLP) or CNN (used a lot in image processing) are well-known algorithms in deep learning but won't be covered by the AI branch. Once you have a good understanding of ANNs, feel free to extend your knowledge to new architectures.
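To make this concrete, here is a minimal NumPy sketch of the single neuron that Exercise 1 builds towards: a weighted sum of the inputs followed by a sigmoid activation. The weights and input values are illustrative.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1).
    return 1 / (1 + np.exp(-z))

def neuron(x, w, b):
    # A neuron: weighted sum of the inputs plus a bias, then an activation.
    return sigmoid(np.dot(w, x) + b)

x = np.array([2.0, 3.0])   # inputs
w = np.array([0.5, -1.0])  # weights
b = 1.0                    # bias
print(neuron(x, w, b))     # output in (0, 1)
```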
## Exercises of the day
- Exercise 1 The neuron
- Exercise 2 Neural network
- Exercise 3 Log loss
- Exercise 4 Forward propagation
- Exercise 5 Regression
## Virtual Environment
- Python 3.x
- NumPy
- Jupyter or JupyterLab
## Rules
*Version of NumPy I used to do the exercises: 1.18.1*.
I suggest using the most recent one.
## Resources
https://victorzhou.com/blog/intro-to-neural-networks/
## Resources
- https://victorzhou.com/blog/intro-to-neural-networks/
https://srnghn.medium.com/deep-learning-overview-of-neurons-and-activation-functions-1d98286cf1e4#:~:text=What%20is%20a%20neuron%3F,to%20become%20the%20neuron's%20output.
Reproduce this article without backprop
https://towardsdatascience.com/machine-learning-for-beginners-an-introduction-to-neural-networks-d49f22d238f9
- https://srnghn.medium.com/deep-learning-overview-of-neurons-and-activation-functions-1d98286cf1e4
- https://towardsdatascience.com/machine-learning-for-beginners-an-introduction-to-neural-networks-d49f22d238f9

one_exercise_per_file/week03/day02/readme.md (36 changes)

@@ -1,26 +1,32 @@
# W3D2 Piscine AI - Data Science
# W3D02 Piscine AI - Data Science
## Keras
# Table of Contents:
The goal of this day is to learn to use Keras to build Neural Networks. As explained on the Keras website, Keras is a deep learning API written in Python, running on top of the machine learning platform TensorFlow. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result as fast as possible is key to doing good research.
TensorFlow was created by the Google Brain team; it is an open source library for numerical computation and large-scale machine learning. TensorFlow bundles together a slew of machine learning and deep learning (aka neural networking) models and algorithms and makes them useful by way of a common metaphor. It uses Python to provide a convenient front-end API for building applications with the framework, while executing those applications in high-performance C++.
There are two ways to build Keras models: sequential and functional. The sequential API allows you to create models layer-by-layer for most problems. It is limited in that it does not allow you to create models that share layers or have multiple inputs or outputs. The exercises focus on the usage of the sequential API.
# Introduction
Keras backend TF
The goal of this day is to learn to use Keras to build Neural Networks.
Note:
There are two ways to build Keras models: sequential and functional.
The audit will provide the code and output because it is not straightforward to reproduce results using Keras. There are many sources of randomness. Even if all the seeds are fixed to a constant, there may be other sources of randomness. https://machinelearningmastery.com/reproducible-results-neural-networks-keras/
The sequential API allows you to create models layer-by-layer for most problems. It is limited in that it does not allow you to create models that share layers or have multiple inputs or outputs. The exercises focus on the usage of the sequential API.
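As a minimal sketch of the sequential API (layer sizes and activations are illustrative, not those required by the exercises):

```python
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

# Layers are stacked one after the other; each layer has exactly one
# input tensor and one output tensor.
model = Sequential([
    Dense(8, activation="relu", input_shape=(4,)),  # hidden layer 1
    Dense(4, activation="relu"),                    # hidden layer 2
    Dense(1, activation="sigmoid"),                 # binary output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```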
## Exercises of the day
'2.4.3'
- Exercise 1 Sequential
- Exercise 2 Dense
- Exercise 3 Architecture
- Exercise 4 Optimize
## Historical
## Virtual Environment
- Python 3.x
- NumPy
- Jupyter or JupyterLab
- Keras
*Version of Keras I used to do the exercises: 2.4.3*.
I suggest using the most recent one.
## Resources
## Rules
The correction will provide the code and output because it is not straightforward to reproduce results using Keras. There are many sources of randomness. Even if all the seeds are fixed to a constant, there may be other sources of randomness. https://machinelearningmastery.com/reproducible-results-neural-networks-keras/
A developer
## Resources
https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
- https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/

one_exercise_per_file/week03/day03/readme.md (33 changes)

@@ -1,24 +1,31 @@
# W3D2 Piscine AI - Data Science
# W3D03 Piscine AI - Data Science
## Keras 2
# Table of Contents:
The goal of this day is to learn to use Keras to build Neural Networks and train them on small data sets. This helps to understand the specifics of networks for classification and regression.
Note:
# Introduction
Keras backend TF
The goal of this day is to learn to use Keras to build Neural Networks and train them on small data sets.
The audit will provide the code and output because it is not straightforward to reproduce results using Keras. There are many sources of randomness. Even if all the seeds are fixed to a constant, there may be other sources of randomness. https://machinelearningmastery.com/reproducible-results-neural-networks-keras/
classification & regression
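As a minimal sketch of how the output layer and loss differ between regression and multi-class classification in Keras (layer sizes and input shapes are illustrative):

```python
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

# Regression: a single linear output unit and a mean squared error loss.
regressor = Sequential([
    Dense(16, activation="relu", input_shape=(5,)),
    Dense(1, activation="linear"),
])
regressor.compile(optimizer="adam", loss="mse", metrics=["mae"])

# Multi-class classification: one softmax unit per class and a
# cross-entropy loss (here with one-hot encoded labels).
classifier = Sequential([
    Dense(16, activation="relu", input_shape=(5,)),
    Dense(3, activation="softmax"),
])
classifier.compile(optimizer="adam", loss="categorical_crossentropy",
                   metrics=["accuracy"])
```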
## Exercises of the day
'2.4.3'
- Exercise 1 Regression - Optimize
- Exercise 2 Regression example
- Exercise 3 Multi classification - Softmax
- Exercise 4 Multi classification - Optimize
- Exercise 5 Multi classification example
## Historical
## Virtual Environment
- Python 3.x
- NumPy
- Jupyter or JupyterLab
- Keras
*Version of Keras I used to do the exercises: 2.4.3*.
I suggest using the most recent one.
## Rules
## Resources
The correction will provide the code and output because it is not straightforward to reproduce results using Keras. There are many sources of randomness. Even if all the seeds are fixed to a constant, there may be other sources of randomness. https://machinelearningmastery.com/reproducible-results-neural-networks-keras/
A developer
## Resources
https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
- https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/

one_exercise_per_file/week03/day04/readme.md (25 changes)

@@ -0,0 +1,25 @@
# W3D04 Piscine AI - Data Science
# Table of Contents:
# Introduction
Machine learning algorithms cannot work with raw text directly. Rather, the text must be converted into vectors of numbers. In natural language processing, a common technique for extracting features from text is to place all of the words that occur in the text in a bucket. This approach is called a bag of words model, or BoW for short. It’s referred to as a “bag” of words because any information about the structure of the sentence is lost.
Almost every Natural Language Processing (NLP) task requires text to be preprocessed before training a model. Deep learning models cannot use raw text directly, so it is up to us researchers to clean the text ourselves. Depending on the nature of the task, the preprocessing methods can be different.
https://towardsdatascience.com/your-guide-to-natural-language-processing-nlp-48ea2511f6e1
The algorithms do not understand words. They need a mathematical representation of them.
Today we will learn two important mathematical representations:
- Bag of Words
- Embedding
Each approach has its limits, in particular in how much of the context it captures.
The packages NLTK and Spacy can be used to do the preprocessing.
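To make the bag of words idea concrete, here is a minimal sketch with scikit-learn's CountVectorizer; the sentences are illustrative, and in practice NLTK or Spacy would handle the preprocessing upstream (tokenization, stop words, lemmatization).

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the vocabulary (the "bag")
print(bow.toarray())  # one word-count vector per sentence; word order is lost
```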