feat(nlp-scraper): improve audit and subject

- add details for question about checking "overfitting"
- remove not so clear suggestion
- move creation of `topic_classifier.pkl` to audit phase
3 months ago · 700efcb57b
2 changed files with 8 additions and 8 deletions
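
For reviewers, a minimal sketch of the two checks this commit touches in the audit: storing the trained topic classifier as `topic_classifier.pkl`, and reading the learning curves for signs of overfitting. The helper names (`save_model`, `load_model`, `overfitting_gap`, `looks_overfit`) and the `max_gap` threshold are hypothetical and not part of the subject. The sketch assumes the model is any picklable Python object and that the learning-curve scores are plain lists ordered by increasing training-set size.

```python
import pickle


def save_model(model, path="topic_classifier.pkl"):
    """Serialize a trained model object to disk with pickle."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path="topic_classifier.pkl"):
    """Load a previously pickled model. Only unpickle files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


def overfitting_gap(train_scores, validation_scores):
    """Return the train/validation score gap at the largest training size."""
    return train_scores[-1] - validation_scores[-1]


def looks_overfit(train_scores, validation_scores, max_gap=0.05):
    """Flag a large final gap between training and validation scores.

    max_gap is an arbitrary illustrative threshold (an assumption),
    not a value taken from the subject or the audit.
    """
    return overfitting_gap(train_scores, validation_scores) > max_gap
```

A training score that stays well above the validation score as the training set grows is the usual visual symptom of overfitting. The threshold only turns that reading into a yes/no hint and does not replace looking at `learning_curves.png`.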
--- a/subjects/ai/nlp-scraper/README.md
+++ b/subjects/ai/nlp-scraper/README.md
@@ -143,7 +143,6 @@ project
 ├── nlp_enriched_news.py
 ├── README.md
 ├── results
-│   ├── topic_classifier.pkl
 │   ├── enhanced_news.csv
 │   └── learning_curves.png
 └── scraper_news.py
@@ -169,7 +168,8 @@ python scraper_news.py
 2. Run on these 300 articles the NLP engine. The script `nlp_enriched_news.py`
   should:

-   - Save a `DataFrame` with the following struct:
+   - Save a `DataFrame` with the following struct and store the result in a
+     `csv` file, `enhanced_news.csv`:

   ```
   Unique ID (`uuid` or `int`)
@@ -215,10 +215,6 @@ python scraper_news.py
   Environmental scandal detected for <entity>
   ```

-> I strongly suggest creating a data structure (dictionary for example) to save
-> all the intermediate result. Then, a boolean argument `cache` fetched the
-> intermediate results when they are already computed.
-
 ### Notions

 - [Web Scraping](https://www.youtube.com/watch?v=XVv6mJpFOb0)
--- a/subjects/ai/nlp-scraper/audit/README.md
+++ b/subjects/ai/nlp-scraper/audit/README.md
@@ -18,11 +18,15 @@

 ###### Are the learning curves provided?

-###### Do the learning curves prove the topics classifier is trained correctly - without overfitting?
+###### Do the learning curves prove the topics classifier is trained correctly - without overfitting? Ask the student to explain what the term "overfitting" means and how he avoided this phenomenon.
+
+> Additionally, you can look for external resources. For example, Wikipedia has a good page on "overfitting".
+
+##### Ask the student to train and store the topic classifier model in a file named `topic_classifier.pkl`.

 ###### Can you run the topic classifier model on the test set without any error?

-###### Does the topic classifier score an accuracy higher than 95%?
+###### Does the topic classifier score an accuracy higher than 95% on the given datasets?

 ##### Scandal detection