
fix: w3 and project 1 audit

pull/42/head
Badr Ghazlane 2 years ago
parent
commit
b222aca668
1. one_exercise_per_file/week01/day01/ex03/audit/readme.md (2 changed lines)
2. one_exercise_per_file/week01/day01/ex07/audit/readme.md (36 changed lines)
3. one_exercise_per_file/week01/day01/readme.md (8 changed lines)
4. one_exercise_per_file/week02/day05/ex03/audit/readme.md (2 changed lines)
5. one_exercise_per_file/week03/day04/readme.md (37 changed lines)
6. one_exercise_per_file/week03/day05/readme.md (33 changed lines)
7. one_md_per_day_format/projects/project1/audit/readme.md (60 changed lines)
8. one_md_per_day_format/projects/project1/readme.md (116 changed lines)

one_exercise_per_file/week01/day01/ex03/audit/readme.md (2 changed lines)

@@ -1,4 +1,4 @@
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the solution doesn't involve a for loop or writing all integers from 1 to 100 and if the array is: `np.array([1,...,100])`. The list from 1 to 100 can be generated with an iterator: `range`.
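For reference, a minimal sketch of an answer that satisfies this criterion (assuming the usual `import numpy as np`; the prints are only there to check the result):
```python
import numpy as np

# Build the array without a for loop and without writing the integers manually.
integers = np.array(range(1, 101))
print(integers[:5], integers[-1])  # [1 2 3 4 5] 100
```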

one_exercise_per_file/week01/day01/ex07/audit/readme.md (36 changed lines)

@@ -1,23 +1,6 @@
##### The exercise is validated if all questions of the exercise are validated
##### This question is validated if, without having used a for loop or having filled the array manually, the output is:
```console
[[ 7. 1. 7.]
@@ -32,3 +15,18 @@ This question is validated if, without having used a for loop or having filled t
[ 8. 5. 8.]]
```
There are two steps in this exercise:
- Create the vector that contains the grade of the first exam if available, otherwise the second. This can be done using `np.where`:
```python
np.where(np.isnan(grades[:, 0]), grades[:, 1], grades[:, 0])
```
- Add this vector as the third column of the array. Here are two ways:
```python
np.insert(arr = grades, values = new_vector, axis = 1, obj = 2)
np.hstack((grades, new_vector[:, None]))
```
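To make the expected manipulation concrete, here is a small self-contained sketch (the `grades` values below are illustrative, not the exercise data):
```python
import numpy as np

# Toy array: column 0 = first exam grade (may be NaN), column 1 = second exam grade.
grades = np.array([[7.0, 1.0],
                   [np.nan, 2.0],
                   [np.nan, 8.0]])

# First exam grade when available, otherwise the second one.
new_vector = np.where(np.isnan(grades[:, 0]), grades[:, 1], grades[:, 0])

# Two equivalent ways to append it as a third column.
print(np.insert(arr=grades, values=new_vector, axis=1, obj=2))
print(np.hstack((grades, new_vector[:, None])))
```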

one_exercise_per_file/week01/day01/readme.md (8 changed lines)

@@ -9,10 +9,10 @@ The goal of this day is to understand practical usage of **NumPy**. **NumPy** is
- Exercise 2 Zeros
- Exercise 3 Slicing
- Exercise 4 Random
- Exercise 5 Split, concatenate, reshape arrays
- Exercise 6 Broadcasting and Slicing
- Exercise 7 NaN
- Exercise 8 Wine
- Exercise 9 Football tournament
## Virtual Environment

one_exercise_per_file/week02/day05/ex03/audit/readme.md (2 changed lines)

@@ -18,7 +1,7 @@ gridsearch.fit(X_train, y_train)
Answers that use another list of parameters are accepted too!
##### The question 2 is validated if you called these attributes:
```python
print(gridsearch.best_score_)
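# Editor's note (a hedged sketch, not part of the original audit): alongside
# best_score_, these standard attributes of a fitted GridSearchCV are commonly
# inspected as well; `gridsearch` is assumed to be the fitted object from above.
print(gridsearch.best_params_)
print(gridsearch.best_estimator_)
```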

one_exercise_per_file/week03/day04/readme.md (37 changed lines)

@@ -1,25 +1,36 @@
# W3D04 Piscine AI - Data Science
## Natural Language processing
“NLP makes it possible for humans to talk to machines:” This branch of AI enables computers to understand, interpret, and manipulate human language. This technology is one of the most broadly applied areas of machine learning and is critical in effectively analyzing massive quantities of unstructured, text-heavy data.
Machine learning algorithms cannot work with raw text directly. Rather, the text must be converted into vectors of numbers. In natural language processing, a common technique for extracting features from text is to place all of the words that occur in the text in an unordered bucket. This approach is called a bag of words model, or BoW for short. It’s referred to as a “bag” of words because any information about the structure of the sentence is lost. This is useful to train usual machine learning models on text data. Other types of models, such as RNNs or LSTMs, take as input a complete and ordered sequence.
Almost every Natural Language Processing (NLP) task requires text to be preprocessed before training a model. Deep learning models cannot use raw text directly, so it is up to us researchers to clean the text ourselves; depending on the nature of the task, the preprocessing methods can be different (a short end-to-end sketch is given after the exercise list below). The article **Your Guide to Natural Language Processing (NLP)** gives a very good introduction to NLP.
Today, we will learn to preprocess text data and to create a bag of words representation. The NLTK and Spacy packages will be used for the preprocessing.
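As an illustration of the bag of words idea, here is a hedged sketch using scikit-learn's `CountVectorizer`, one common way to build such a representation (the two sentences are made up):
```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat", "the dog sat on the log"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())  # the vocabulary (word order is lost)
print(bow.toarray())                       # word counts for each sentence
```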
## Exercises of the day
- Exercise 1 Lower case
- Exercise 2 Punctuation
- Exercise 3 Tokenization
- Exercise 4 Stop words
- Exercise 5 Stemming
- Exercise 6 Text preprocessing
- Exercise 7 Bag of Word representation
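To give an idea of how the steps covered by these exercises chain together, here is a hedged sketch with NLTK (the sample sentence is made up; the `nltk.download` calls fetch the resources the tokenizer and the stop word list need, and the resource names may differ slightly across NLTK versions):
```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")

text = "The passengers were waiting, quietly, on the deck!"

text = text.lower()                                               # lower case
text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
tokens = word_tokenize(text)                                      # tokenization
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]               # stop words
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])                          # stemming
```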
## Virtual Environment
- Python 3.x
- Jupyter or JupyterLab
- Scikit Learn
- NLTK
The algorithms do not understand words. They need a mathematical representation of them.
Today we will learn two important mathematical representations:
- Bag of Words
- Embedding
Each approach has its limits. Context ..
I suggest using the most recent libraries.
## Resources
- https://towardsdatascience.com/your-guide-to-natural-language-processing-nlp-48ea2511f6e1
- https://towardsdatascience.com/word-embeddings-for-nlp-5b72991e01d4

one_exercise_per_file/week03/day05/readme.md (33 changed lines)

@@ -1,20 +1,31 @@
# W3D05 Piscine AI - Data Science
## Natural Language processing with Spacy
Spacy is a natural language processing (NLP) library for Python designed to have fast performance, and with word embedding models built in, it’s perfect for a quick and easy start. I don't need to detail what spaCy does, it is perfectly summarized by spaCy in this article: **spaCy 101: Everything you need to know**.
Today, we will learn to use a pre-trained embedding to convert a text into a vector to compute similarity between words or sentences. Remember, embeddings translate large sparse vectors into a lower-dimensional space that preserves semantic relationships.
Word embeddings are a technique where individual words of a domain or language are represented as real-valued vectors in a lower dimensional space. The BoW representation's dimension depends on the size of the vocabulary, which can easily reach 10k words. We will also learn to use NER and Part-of-speech. NER makes it possible to identify and segment named entities and classify or categorize them under various predefined classes. Part-of-speech is a special label assigned to each token (word) in a text corpus to indicate the part of speech and often also other grammatical categories such as tense, number (plural/singular), case, etc.
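A hedged sketch of these three ideas with spaCy (it assumes the `en_core_web_md` model, which ships with word vectors, has been downloaded with `python -m spacy download en_core_web_md`; the sentences are made up):
```python
import spacy

nlp = spacy.load("en_core_web_md")

doc1 = nlp("The Titanic sank in the Atlantic Ocean in April 1912.")
doc2 = nlp("A large ship was lost at sea.")

# Embedding-based similarity between the two sentences.
print(doc1.similarity(doc2))

# NER: named entities and their predefined classes.
for ent in doc1.ents:
    print(ent.text, ent.label_)

# Part-of-speech tag for each token.
for token in doc1:
    print(token.text, token.pos_)
```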
## Exercises of the day
- Exercise 1 Embedding 1
- Exercise 2 Tokenization
- Exercise 3 Embeddings 2
- Exercise 4 Sentences' similarity
- Exercise 5 NER
- Exercise 6 Part-of-speech tags
## Virtual Environment
- Python 3.x
- Jupyter or JupyterLab
- Spacy
There are many types of pre-trained language models in Spacy. Each has its specificities depending on the hypotheses made.
I suggest using the most recent libraries.
## Resources
- https://spacy.io/usage/spacy-101
- https://spacy.io/api/doc
- https://www.analyticsvidhya.com/blog/2021/06/nlp-application-named-entity-recognition-ner-in-python-with-spacy/
- https://medium.com/mlearning-ai/nlp-04-part-of-speech-tagging-in-spacy-dc3e239c2726

one_md_per_day_format/projects/project1/audit/readme.md (60 changed lines)

@@ -0,0 +1,60 @@
# Project01 - First Kaggle: Titanic - audit
### Preliminary
```
project
│ README.md
│ environment.yml
│ username.txt
└───data
│ │ train.csv
│ │ test.csv
│ │ gender_submission.csv
└───notebook
│ │ EDA.ipynb
│
└───scripts
```
###### Is the structure of the project as shown above?
###### Does the readme file give an introduction to the project, show the username, describe the feature engineering and show the best score on the leaderboard?
###### Does the environment file contain all the libraries used, with the versions necessary to run the code?
### Feature engineering
###### Can the notebook be executed without any error?
###### Does the notebook explain the feature engineering that contributed to improving the accuracy?
### Scripts
###### Can you train the best model on the train data with the feature engineering without any error?
###### Can you predict on the test set using the best model without any error?
###### Is the score you get **on the test set** with the best model close to what is expected?
### Final score
###### Is the accuracy associated with the username in `username.txt` higher than 79%? The best submission score can be accessed from the user profile.
### Examples
Here are two very good submissions explained and detailed:
- https://www.kaggle.com/konstantinmasich/titanic-0-82-0-83
- https://www.kaggle.com/sreevishnudamodaran/ultimate-eda-fe-neural-network-model-top-2

one_md_per_day_format/projects/project1/readme.md (116 changed lines)

@@ -0,0 +1,116 @@
# First Kaggle
## Introduction
The goal of this **1 week** project is to get the highest possible score on a Data Science competition. More precisely you will have to predict who survived the Titanic crash.
![alt text][titanic]
[titanic]: titanic.jpg "Titanic"
### Kaggle
Kaggle is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges. It’s a crowd-sourced platform to attract, nurture, train and challenge data scientists from all around the world to solve data science, machine learning and predictive analytics problems.
### Titanic - Machine Learning from Disaster
One of the first Kaggle competitions I did was Titanic - Machine Learning from Disaster. This is a not-to-be-missed Kaggle competition.
You can see more [here](https://www.kaggle.com/c/titanic)
The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there were not enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, you have to build a predictive model that answers the question: **“what sorts of people were more likely to survive?”** using passenger data (i.e. name, age, gender, socio-economic class, etc.). **You will have to submit your prediction on Kaggle**.
## Preliminary
The way the Kaggle platform works is explained in the challenge overview page. If you need more details, I suggest this [resource](https://towardsdatascience.com/getting-started-with-kaggle-f9138b35ae18) that gives detailed explanations.
- Create a username following this structure: username_01EDU_location_MM_YYYY. Submit the description profile and push it to GitHub on the first day of the week. Do not touch this file anymore.
- It is possible to have different personal accounts merged in a team for one single competition.
## Deliverables
```console
project
│ README.md
│ environment.yml
│ username.txt
└───data
│ │ train.csv
│ │ test.csv
│ │ gender_submission.csv
└───notebook
│ │ EDA.ipynb
│
└───scripts
```
- `README.md` introduces the project, shows the username, describes the feature engineering and the best score on the **leaderboard**. Note the score on the test set using the exact same pipeline that led to the best score on the leaderboard.
- `environment.yml` contains all libraries required to run the code.
- `username.txt` contains the username, the last modified date of the file **has to correspond to the first day of the project**.
- `EDA.ipynb` contains the exploratory data analysis. This file should contain all steps of data analysis that contributed or not to improve the accuracy. It has to be commented so that the reviewer can understand the analysis and run it without any problem.
- `scripts` contains python file(s) that perform(s) the feature engineering, the model's training and prediction on the test set. It could also be one single Jupyter Notebook. It has to be commented to help the reviewers understand the approach and run the code without any bugs.
- **Submit your predictions on Kaggle's competition platform**. Check your ranking and score on the leaderboard.
### Scores
In order to validate the project you will have to score at least **79% accuracy on the Leaderboard**:
- 79% accuracy is the minimum score to validate the project.
Scores indication:
- 79% difficult - minimum required
- 81% very difficult: smart feature engineering needed
- More than 83%: excellent, corresponds to the top 2% on Kaggle
- More than 85%: cheating
#### Cheating
It is impossible to get 100%. Who would have predicted that Rose wouldn't let [Jack on the door](https://www.insider.com/jack-and-rose-werent-on-a-door-in-titanic-2019-7) ?
Everyone with 100% accuracy on the Leaderboard cheated; there's no point in comparing with them or in cheating. The Kaggle community considers that submissions above 85% are almost certainly cheated, as there was an element of luck involved in surviving.
**You can't use external data sets other than the ones provided in that competition.**
## The key points
- **Feature engineering**:
Put yourself in the shoes of an investigator trying to understand what exactly happened on that boat during the crash. Do not hesitate to watch the movie to try to find as many insights as possible. Without smart feature engineering there's no way to validate the project ;-)
- The leaderboard evaluates on test data for which you don't have the labels. It means that there's no point in overfitting the train set. Check for overfitting on the train set by splitting the data and cross-validating the accuracy.
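One simple way to check this, sketched below under the assumption that `data/train.csv` follows the competition's column names (`Sex`, `Pclass`, `Survived`) and that scikit-learn is available:
```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

train = pd.read_csv("data/train.csv")

# Minimal, illustrative feature set: sex and passenger class.
X = pd.get_dummies(train[["Sex", "Pclass"]], drop_first=True)
y = train["Survived"]

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```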
## Advice
Don't try to build the perfect model the first day. Iterate a lot and test your assumptions:
Iteration 1:
- Predict that all passengers die (a minimal baseline sketch is given after this list)
Iteration 2:
- Fit a logistic regression with basic feature engineering
Iteration 3:
- Perform an EDA. Make assumptions and check them. Example: what if first-class passengers survived more? Check the assumption through EDA and create relevant features to help the model capture the information.
- Run a gridsearch
Iteration 4:
- Good luck !
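As a concrete starting point for iteration 1, here is a minimal, hedged baseline sketch (it assumes `data/test.csv` is the competition file with a `PassengerId` column, and writes the two-column file Kaggle expects):
```python
import pandas as pd

test = pd.read_csv("data/test.csv")

# Iteration 1 baseline: predict that nobody survives.
submission = pd.DataFrame({"PassengerId": test["PassengerId"], "Survived": 0})
submission.to_csv("submission.csv", index=False)
```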