feat(nlp-scraper): improve audit and subject

- add details for question about checking "overfitting"
- remove not so clear suggestion
- move creation of `topic_classifier.pkl` to audit phase
pull/2468/head
nprimo 3 months ago committed by Niccolò Primo
parent
commit
700efcb57b
  1. subjects/ai/nlp-scraper/README.md (8 changed lines)
  2. subjects/ai/nlp-scraper/audit/README.md (8 changed lines)

8
subjects/ai/nlp-scraper/README.md

@@ -143,7 +143,6 @@ project
 ├── nlp_enriched_news.py
 ├── README.md
 ├── results
-   ├── topic_classifier.pkl
    ├── enhanced_news.csv
    └── learning_curves.png
 └── scraper_news.py
@@ -169,7 +168,8 @@ python scraper_news.py
 2. Run on these 300 articles the NLP engine. The script `nlp_eneriched_news.py`
 should:
-- Save a `DataFrame` with the following struct:
+- Save a `DataFrame` with the following struct and store the result in a
+`csv` file, `enhancend_news.csv`:
 ```
 Unique ID (`uuid` or `int`)
@@ -215,10 +215,6 @@ python scraper_news.py
 Environmental scandal detected for <entity>
 ```
-> I strongly suggest creating a data structure (dictionary for example) to save
-> all the intermediate result. Then, a boolean argument `cache` fetched the
-> intermediate results when they are already computed.
 ### Notions
 - [Web Scraping](https://www.youtube.com/watch?v=XVv6mJpFOb0)
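
Note on the second hunk above: the new wording asks the script to store the `DataFrame` in a `csv` file. A minimal sketch of that step, assuming pandas is used. Only the unique ID column comes from the subject; the `headline` column and the toy rows are hypothetical. The file name follows the project tree (`enhanced_news.csv`); the hunk text spells it `enhancend_news.csv`.

```python
import tempfile
import uuid
from pathlib import Path

import pandas as pd

# Toy rows standing in for the NLP engine output (hypothetical data).
# Only the unique ID column is taken from the subject; `headline` is illustrative.
rows = [
    {"uuid": str(uuid.uuid4()), "headline": "Example headline A"},
    {"uuid": str(uuid.uuid4()), "headline": "Example headline B"},
]
enriched = pd.DataFrame(rows)

# The project tree keeps outputs under `results/`; a temporary directory is
# used here so the sketch does not write into a real project.
results_dir = Path(tempfile.mkdtemp()) / "results"
results_dir.mkdir(parents=True, exist_ok=True)
csv_path = results_dir / "enhanced_news.csv"

# index=False avoids writing the DataFrame's positional index as an extra column.
enriched.to_csv(csv_path, index=False)

# Reading the file back is a quick check that the stored structure is intact.
reloaded = pd.read_csv(csv_path)
```

Passing `index=False` keeps the `csv` limited to the documented columns, which makes the file easier to audit.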

8
subjects/ai/nlp-scraper/audit/README.md

@@ -18,11 +18,15 @@
 ###### Are the learning curves provided?
-###### Do the learning curves prove the topics classifier is trained correctly - without overfitting?
+###### Do the learning curves prove the topics classifier is trained correctly - without overfitting? Ask the student to explain what the term "overfitting" means and how he avoided this phenomenon.
+> Additionally, you can look for external resources. For example, Wikipedia has a good page on "overfitting".
+##### Ask the student to train and store the topic classifier model in a file named `topic_classifier.pkl`.
 ###### Can you run the topic classifier model on the test set without any error?
-###### Does the topic classifier score an accuracy higher than 95%?
+###### Does the topic classifier score an accuracy higher than 95% on the given datasets?
 ##### Scandal detection
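
Note on the topic classifier steps in the hunk above: the audit now asks the student to train the model, store it as `topic_classifier.pkl`, and explain how overfitting was avoided. A minimal sketch of those steps, assuming scikit-learn and the standard `pickle` module. The TF-IDF plus logistic regression pipeline, the tiny corpus, and the labels are all hypothetical; the subject does not prescribe a model.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus; the real project uses its own labelled dataset.
train_texts = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the players after the match",
    "the new phone has a faster processor",
    "the software update fixes a security bug",
    "the laptop ships with a new processor",
]
train_labels = ["sport", "sport", "sport", "tech", "tech", "tech"]
test_texts = ["the players scored a goal", "the phone got a software update"]
test_labels = ["sport", "tech"]

# Assumed model choice: TF-IDF features followed by logistic regression.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

# Store the fitted model under the file name the audit asks for.
# A temporary directory keeps the sketch from touching a real project.
model_path = Path(tempfile.mkdtemp()) / "topic_classifier.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Reload the stored model and run it on the test set, as the audit does.
with open(model_path, "rb") as f:
    loaded = pickle.load(f)

train_acc = loaded.score(train_texts, train_labels)
test_acc = loaded.score(test_texts, test_labels)

# A large gap between train and test accuracy is one common sign of overfitting.
gap = train_acc - test_acc
print(f"train={train_acc:.2f} test={test_acc:.2f} gap={gap:.2f}")
```

Comparing the two accuracies is only a single-point version of what the learning curves show across training set sizes; a train score that stays well above the validation score as data grows is the pattern the audit question is looking for.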
