##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the outputted plot looks like this:
![alt text][ex3q1]
[ex3q1]: ../w2_day2_ex3_q1.png "Scatter plot"
##### The question 2 is validated if the coefficient and the intercept of the Logistic Regression are:
```console
Intercept: [-0.98385574]
Coefficient: [[1.18866075]]
```
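For reference, here is a minimal sketch of how such values are obtained (the toy data below is an assumption standing in for the exercise's dataset, so the exact numbers will differ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 1-D binary classification data (an assumption, not the exercise's data)
rng = np.random.default_rng(43)
X = rng.normal(size=(100, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

clf = LogisticRegression()
clf.fit(X, y)
print("Intercept:", clf.intercept_)
print("Coefficient:", clf.coef_)
```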
##### The question 3 is validated if the plot looks like this:
![alt text][ex3q2]
[ex3q2]: ../w2_day2_ex3_q3.png "Scatter plot"
##### The question 4 is validated if `predict_probability` outputs the same probabilities as `predict_proba`. Note that the values have to match one of the class probabilities, not both. To do so, compare your output with: `clf.predict_proba(X)[:,1]`. The shape of the arrays is not important.
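One way to implement `predict_probability` by hand is to apply the sigmoid to the linear combination of the fitted intercept and coefficient. A sketch (the function signature and the toy data are assumptions; only the sigmoid formula is fixed):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def predict_probability(coefs, X):
    # P(y=1|X) = sigmoid(intercept + X @ slope)
    intercept, slope = coefs
    z = intercept + X @ slope
    return 1 / (1 + np.exp(-z))

# Compare with scikit-learn on toy data (toy data is an assumption)
rng = np.random.default_rng(43)
X = rng.normal(size=(100, 1))
y = (X[:, 0] > 0).astype(int)
clf = LogisticRegression().fit(X, y)

ours = predict_probability((clf.intercept_, clf.coef_[0]), X)
theirs = clf.predict_proba(X)[:, 1]
assert np.allclose(ours, theirs)
```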
##### The question 5 is validated if `predict_class` outputs the same classes as `clf.predict(X)`. The shape of the arrays is not important.
##### The question 6 is validated if the plot looks like the plot below. As mentioned, it is not required to shift the class prediction to make the plot easier to understand.
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if X_train, y_train, X_test, y_test match the output below. The proportion of class `1` is **0.125** in the train set and **1.** in the test set.
```console
X_train:
...
y_test:
...
[1. 1.]
```
##### The question 2 is validated if the proportion of class `1` is **0.3** for both sets.
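The standard way to preserve class proportions in both splits is the `stratify` parameter of `train_test_split`. A sketch on toy labels (the data and split size are assumptions; your variable names may differ):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with 30% of class 1 (an assumption standing in for the exercise data)
X = np.arange(20).reshape(-1, 1)
y = np.array([1] * 6 + [0] * 14)

# stratify=y keeps the class proportions identical in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=43)

print(y_train.mean(), y_test.mean())  # 0.3 in both sets
```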
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the proportion of class `Benign` is 0.6552217453505007. It means that if you always predict `Benign` your accuracy would be 66%.
##### The question 2 is validated if the proportion of one of the classes is approximately the same on the train and test sets: ~0.65. In my case:
- test: 0.6571428571428571
- train: 0.6547406082289803
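The proportion can be computed directly from the labels. A sketch (the counts 458/241 are the class counts of the classic Wisconsin Breast Cancer dataset and reproduce the quoted proportion; adapt the computation to your own target column):

```python
import numpy as np

# Labels standing in for the dataset's target column
y = np.array(['Benign'] * 458 + ['Malignant'] * 241)

proportion = np.mean(y == 'Benign')
print(proportion)  # 0.6552217453505007
```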
##### The question 3 is validated if the output is:
```console
# Train
...
Score on test set:
...
```
Only the first 10 predictions are shown. The score is computed on all the data in the folds.
For various reasons, you may get a different data split than mine. The requirement for this question is a score on the test set greater than 92%.
If the score is 1, congratulations, you've leaked your first target. Drop the target from the X_train or X_test ;)!
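A simple guard against that kind of leakage is to check that the target column is absent from the feature matrix. A sketch (the dataframe and the column name `Class` are hypothetical):

```python
import pandas as pd

# Hypothetical dataframe where the target could accidentally remain in the features
df = pd.DataFrame({'feat': [1, 2, 3], 'Class': [0, 1, 0]})

X = df.drop(columns=['Class'])  # features without the target
y = df['Class']

assert 'Class' not in X.columns  # no leakage possible from this column
```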
##### The question 4 is validated if the confusion matrix on the train set is similar to:
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the scaled train set is as below. And by definition, the mean on the axis 0 should be `array([0., 0., 0.])` and the standard deviation on the axis 0 should be `array([1., 1., 1.])`.
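A sketch of the scaling with `StandardScaler` (the 3-feature toy train set is an assumption; the mean/std properties hold by definition of standard scaling):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy 3-feature train set (an assumption, not the exercise's data)
X_train = np.array([[1., 2., 3.],
                    [4., 5., 6.],
                    [7., 8., 9.],
                    [1., 0., 1.]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# By definition of standard scaling, per-column mean is 0 and std is 1
print(X_train_scaled.mean(axis=0))  # ~array([0., 0., 0.])
print(X_train_scaled.std(axis=0))   # ~array([1., 1., 1.])
```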
##### The question is validated if the scores you output are close to the scores below. Some of the algorithms use random steps (e.g. the random sampling used by the `RandomForest`). I used `random_state = 43` for the Random Forest, the Decision Tree and the Gradient Boosting.
```console
# Linear regression
...
```
It is important to notice that the Decision Tree overfits very easily. It learns the training data easily but is not able to generalize to the test set. This algorithm is not used much in practice because of its tendency to overfit.
However, Random Forest and Gradient Boosting offer a solid approach to correct the overfitting (in this case the parameter `max_depth` is set to `None`, which is why the Random Forest overfits the data). These two algorithms are used intensively in Machine Learning projects.
In my case, the `gridsearch` parameters are not interesting. Even though I reduced the overfitting of the Random Forest, the score on the test set is lower than the score returned by the Gradient Boosting in the previous exercise without an optimal parameter search.
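The model comparison above can be reproduced along these lines (a sketch on a toy regression problem, not the exercise's dataset, so the exact scores will differ):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (an assumption standing in for the exercise's dataset)
X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=43)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=43)

models = {
    'Linear regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=43),
    'Random Forest': RandomForestRegressor(random_state=43),
    'Gradient Boosting': GradientBoostingRegressor(random_state=43),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: train R2 = {model.score(X_train, y_train):.3f}, "
          f"test R2 = {model.score(X_test, y_test):.3f}")
```

Note the Decision Tree's train R2 of 1.0: it memorizes the training data perfectly, which is exactly the overfitting discussed above.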
##### The question 3 is validated if the code used is:
##### The question 1 is validated if the output is:
```console
Scores on validation sets:
...
Standard deviation of scores on validation sets:
...
```
The model is consistent across folds: it is stable. That's a first sign that the model is not overfitted. The average R2 is 60%, which is a good start! To be improved...
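Such per-fold scores are typically produced with `cross_val_score`. A sketch (toy data and the estimator are assumptions; the exercise's numbers will differ):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Toy regression data (an assumption standing in for the exercise's dataset)
X, y = make_regression(n_samples=300, n_features=4, noise=20, random_state=43)

# One R2 score per fold; stability across folds is a sign of no overfitting
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print("Scores on validation sets:", scores)
print("Mean of scores on validation sets:", scores.mean())
print("Standard deviation of scores on validation sets:", scores.std())
```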
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the code that runs the grid search is similar to:
```python
parameters = {'n_estimators':[10, 50, 75],
              ...
gridsearch.fit(X_train, y_train)
```
Answers that use another list of parameters are accepted too!
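A self-contained version of such a grid search could look like this (the toy data and the `max_depth` values are assumptions; any comparable parameter grid is acceptable):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Toy regression data (an assumption standing in for the exercise's dataset)
X, y = make_regression(n_samples=200, n_features=4, noise=10, random_state=43)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=43)

parameters = {'n_estimators': [10, 50, 75],
              'max_depth': [3, 5, 10]}
rf = RandomForestRegressor(random_state=43)

# Exhaustive search over the parameter grid with 3-fold cross-validation
gridsearch = GridSearchCV(rf, parameters, cv=3)
gridsearch.fit(X_train, y_train)
print(gridsearch.best_params_)
```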
##### The question 2 is validated if you called this attributes:
```python
print(gridsearch.best_score_)
...
```
The best model's params are `{'max_depth': 10, 'n_estimators': 75}`.
As you may have a different parameter list than this one, you may get different results.
##### The question 3 is validated if you used the fitted estimator to compute the score on the test set: `gridsearch.score(X_test, y_test)`. The MSE score is ~0.27. The score I got on the test set is close to the score I got on the validation sets. It means the model is not overfitted.
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the output of the first ten train labels is:
```
array([[0, 1, 0],
...
[0, 0, 1]])
```
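One way to obtain such one-hot labels is `LabelBinarizer` (a sketch; the exercise may equally expect Keras' `to_categorical`, and the toy integer labels are an assumption):

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Toy 3-class integer labels (an assumption, not the exercise's data)
y_train = np.array([1, 0, 2, 1, 2])

# One column per class, a single 1 per row
y_train_multi_class = LabelBinarizer().fit_transform(y_train)
print(y_train_multi_class)  # first row: [0, 1, 0]
```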
##### The question 2 is validated if the accuracy on the test set is bigger than 90%. To evaluate the accuracy on the test set you can use: `model.evaluate(X_test_sc, y_test_multi_class)`.
Here is an implementation that gives 96% accuracy on the test set.
##### The exercise is validated if all questions of the exercise are validated
##### The question 1 is validated if the output of the NER is
```
Apple Inc. ORG
...
Apple ORG
Apple II ORG
```
##### The question 2 is validated if the output shows that the first occurrence of apple is not a named entity. In my case, here is what the NER returns: