
CON-2393 clarify output of `nlp_enriched_news.py` script (#2419)

* chore(nlp-scraper): fix small grammar mistakes and improve readability

* feat(nlp-scraper): add link to datasets provided

* feat(nlp-scraper): add clarification about sentiment analysis

* feat(nlp-scraper): define how many articles are expected to be scraped

* chore(nlp-scraper): improve grammar and readability

* chore(nlp-scraper): fix typos

* feat(nlp-scraper): add label to link

* feat(nlp-scraper): remove audit question not related to the project

* refactor(nlp-scraper): refactor question

* chore(nlp-scraper): fix small typos

* feat(nlp-scraper): add information on how to calculate scandal

* feat(nlp-scraper): add details to the deliverable section

* feat(nlp-scraper): add reference to subject in audit

* feat(nlp-scraper): update project structure

- run prettier

* feat(nlp-scraper): complete sentence in subject intro

- make formatting consistent with 01 subject
Niccolò Primo committed via GitHub
commit c6d8ca334a
Changed files:
1. subjects/ai/nlp-scraper/README.md (134 changes)
2. subjects/ai/nlp-scraper/audit/README.md (69 changes)

subjects/ai/nlp-scraper/README.md (134 changes)

@@ -1,4 +1,4 @@
-# NLP-enriched News Intelligence platform
+## NLP-enriched News Intelligence platform
 The goal of this project is to build an NLP-enriched News Intelligence
 platform. News analysis is a trending and important topic. The analysts get
@@ -7,7 +7,8 @@ limitless. Having a platform that helps to detect the relevant information is
 definitely valuable.
 The platform connects to a news data source, detects the entities, detects the
-topic of the article, analyse the sentiment and ...
+topic of the article, analyses the sentiment and performs a scandal detection
+analysis.
 ### Scraper
@@ -40,7 +41,7 @@ the stored data.
 Here is how the NLP engine should process the news:
-### **1. Entities detection:**
+#### **1. Entities detection:**
 The goal is to detect all the entities in the document (headline and body). The
 type of entity we focus on is `ORG`. This corresponds to companies and
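For illustration, step 1 could look like the following minimal spaCy sketch; the `en_core_web_sm` model and the helper name are assumptions for the example, not requirements of the subject:

```python
# Hypothetical sketch of step 1: extract ORG entities with spaCy.
# Assumes the small English model: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def detect_orgs(text: str) -> list[str]:
    """Return the ORG entities found in a headline or article body."""
    doc = nlp(text)
    return [ent.text for ent in doc.ents if ent.label_ == "ORG"]

print(detect_orgs("Apple and Total reported record profits this quarter."))
# e.g. ['Apple', 'Total']
```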
@@ -51,7 +52,7 @@ organizations. This information should be stored.
 [Named Entity Recognition with NLTK and
 SpaCy](https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da)
-### **2. Topic detection:**
+#### **2. Topic detection:**
 The goal is to detect what the article is dealing with: Tech, Sport, Business,
 Entertainment or Politics. To do so, a labelled dataset is provided: [training
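As a hedged sketch of step 2, one possible topic classifier is TF-IDF features plus logistic regression with scikit-learn; the CSV path and the `text`/`category` column names are assumptions about the provided dataset:

```python
# Hedged sketch of step 2: a TF-IDF + logistic regression topic classifier.
import pickle

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("data/topic_classification_data.csv")  # assumed path and columns
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["category"], test_size=0.2, random_state=42
)

model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))  # check for overfitting

with open("results/topic_classifier.pkl", "wb") as f:
    pickle.dump(model, f)  # the pickled classifier expected in results/
```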
@@ -71,7 +72,7 @@ that the model is trained correctly and not overfitted.
 [following](https://www.kaggle.com/rmisra/news-category-dataset) which is
 based on 200k news headlines.
-### **3. Sentiment analysis:**
+#### **3. Sentiment analysis:**
 The goal is to detect the sentiment (positive, negative or neutral) of the news
 articles. To do so, use a pre-trained sentiment model. I suggest using:
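One pre-trained option for step 3 (an assumption for this sketch, since any pre-trained model is acceptable) is NLTK's VADER analyzer:

```python
# Hedged sketch of step 3: pre-trained sentiment scoring with NLTK's VADER.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time lexicon download

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("The company beat every earnings estimate.")
print(scores)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}

# Thresholding the compound score is a common convention, not a requirement.
compound = scores["compound"]
label = "positive" if compound > 0.05 else "negative" if compound < -0.05 else "neutral"
print(label)
```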
@@ -85,29 +86,32 @@ articles. To do so, use a pre-trained sentiment model. I suggest using:
 - [Sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis)
-### **4. Scandal detection **
+#### **4. Scandal detection**
 The goal is to detect environmental disasters for the detected companies. Here
 is the methodology that should be used:
-- Define keywords that correspond to environmental disaster that may be caused
-  by companies: pollution, deforestation etc ... Here is an example of disaster
-  we want to detect: https://en.wikipedia.org/wiki/MV_Erika. Pay attention to
-  not use ambiguous words that make sense in the context of an environmental
-  disaster but also in another context. This would lead to detect a false
-  positive natural disaster.
+- Define keywords that correspond to environmental disasters that may be caused
+  by companies: pollution, deforestation, etc. Here is [an example of a
+  disaster we want to detect](https://en.wikipedia.org/wiki/MV_Erika). Pay
+  attention not to use ambiguous words that make sense in the context of an
+  environmental disaster but also in other contexts; these would lead to
+  false positives.
-- Compute the embeddings of the keywords.
+- Compute the [embeddings of the
+  keywords](https://en.wikipedia.org/wiki/Word_embedding#Software).
-- Compute the distance between the embeddings of the keywords and all sentences
-  that contain an entity. Explain in the `README.md` the embeddings chosen and
-  why. Similarly explain the distance or similarity chosen and why.
+- Compute the distance ([here are some
+  examples](https://www.nltk.org/api/nltk.metrics.distance.html#module-nltk.metrics.distance))
+  between the embeddings of the keywords and all sentences that contain an
+  entity. Explain in the `README.md` the embeddings chosen and why. Similarly,
+  explain the distance or similarity chosen and why.
-- Save the distance
+- Save a metric to unify all the distances calculated per article.
 - Flag the top 10 articles.
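To make the methodology above concrete, here is a hedged sketch using spaCy word vectors and cosine distance; the model, keyword list, helper names, and the choice of "minimum distance" as the unifying metric are all illustrative assumptions (other embeddings and distances are equally valid, as long as you justify them in the `README.md`):

```python
# Hedged sketch of step 4: distance between keyword embeddings and the
# sentences that contain an entity (spaCy vectors assumed).
# Assumes: python -m spacy download en_core_web_md
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")  # medium model ships with word vectors

KEYWORDS = ["pollution", "deforestation", "oil spill"]  # illustrative keywords
keyword_vecs = [nlp(kw).vector for kw in KEYWORDS]

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    """1 - cosine similarity; 0 means identical direction."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def scandal_distance(entity_sentences: list[str]) -> float:
    """Unifying metric assumed here: minimum distance over all
    keyword/sentence pairs (expects a non-empty sentence list)."""
    return min(
        cosine_distance(nlp(sent).vector, kv)
        for sent in entity_sentences
        for kv in keyword_vecs
    )
```

Flagging the top 10 articles is then simply a sort on this per-article metric.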
-### 5. **Source analysis (optional)**
+#### 5. **Source analysis (optional)**
 The goal is to show insights about the news sources you scraped.
 This requires scraping data over at least 5 days (a week ideally). Save the plots
@@ -129,24 +133,20 @@ Here are examples of insights:
 ### Deliverables
-The structure of the project is:
+The expected structure of the project is:
 ```
-project
-│   README.md
-│   environment.yml
-└───data
-│   │   topic_classification_data.csv
-└───results
-│   │   topic_classifier.pkl
-│   │   learning_curves.png
-│   │   enhanced_news.csv
-|
-|───nlp_engine
+.
+├── data
+│   └── date_scrape_data.csv
+├── nlp_enriched_news.py
+├── README.md
+├── results
+│   ├── topic_classifier.pkl
+│   ├── enhanced_news.csv
+│   └── learning_curves.png
+└── scraper_news.py
 ```
 1. Run the scraper until it fetches at least 300 articles
@@ -166,52 +166,60 @@ python scraper_news.py
 ```
-2. Run on these 300 articles the NLP engine.
-Save a `DataFrame`:
+2. Run on these 300 articles the NLP engine. The script `nlp_enriched_news.py`
+   should:
+- Save a `DataFrame` with the following structure:
 ```
-Date scraped (date)
-Title (`str`)
-URL (`str`)
-Body (`str`)
-Org (`str`)
-Topics (`list str`)
-Sentiment (`list float` or `float`)
-Scandal_distance (`float`)
-Top_10 (`bool`)
+Unique ID (`uuid` or `int`)
+URL (`str`)
+Date scraped (`date`)
+Headline (`str`)
+Body (`str`)
+Org (`list str`)
+Topics (`list str`)
+Sentiment (`list float` or `float`)
+Scandal_distance (`float`)
+Top_10 (`bool`)
 ```
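For illustration, the enriched `DataFrame` above could be assembled like this with pandas; all values are placeholders standing in for the outputs of the earlier steps, not real results:

```python
# Hedged sketch: one row of the enriched DataFrame with the columns above.
import uuid
from datetime import date

import pandas as pd

rows = [
    {
        "Unique ID": str(uuid.uuid4()),
        "URL": "https://example.com/article",  # placeholder URL
        "Date scraped": date.today(),
        "Headline": "Example headline",
        "Body": "Example body ...",
        "Org": ["ExampleCorp"],                # entities from step 1
        "Topics": ["Business"],                # topic(s) from step 2
        "Sentiment": 0.4,                      # score from step 3
        "Scandal_distance": 0.73,              # unified metric from step 4
        "Top_10": False,
    }
]

df = pd.DataFrame(rows)
df.to_csv("results/enhanced_news.csv", index=False)
```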
-```prompt
-python nlp_enriched_news.py
-Enriching <URL>:
-Cleaning document ... (optional)
----------- Detect entities ----------
-Detected <X> companies which are <company_1> and <company_2>
----------- Topic detection ----------
-Text preprocessing ...
-The topic of the article is: <topic>
----------- Sentiment analysis ----------
-Text preprocessing ... (optional)
-The title which is <title> is <sentiment>
-The body of the article is <sentiment>
----------- Scandal detection ----------
-Computing embeddings and distance ...
-Environmental scandal detected for <entity>
-```
-I strongly suggest creating a data structure (dictionary for example) to save all the intermediate result. Then, a boolean argument `cache` fetched the intermediate results when they are already computed.
-Resources:
-- https://www.youtube.com/watch?v=XVv6mJpFOb0
+- Have a similar output while it processes the articles:
+```prompt
+python nlp_enriched_news.py
+Enriching <URL>:
+Cleaning document ... (optional)
+---------- Detect entities ----------
+Detected <X> companies which are <company_1> and <company_2>
+---------- Topic detection ----------
+Text preprocessing ...
+The topic of the article is: <topic>
+---------- Sentiment analysis ----------
+Text preprocessing ... (optional)
+The article <title> has a <sentiment> sentiment
+---------- Scandal detection ----------
+Computing embeddings and distance ...
+Environmental scandal detected for <entity>
+```
+> I strongly suggest creating a data structure (dictionary for example) to save
+> all the intermediate results. Then, a boolean argument `cache` fetches the
+> intermediate results when they are already computed.
+### Notions
+- [Web Scraping](https://www.youtube.com/watch?v=XVv6mJpFOb0)
+- [Sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis)
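The caching suggestion quoted above could be implemented along these lines; the JSON file location, the URL key, and the function names are assumptions for the sketch:

```python
# Hedged sketch of the suggested cache: intermediate results keyed by URL,
# reused when cache=True instead of being recomputed.
import json
import os

CACHE_PATH = "results/cache.json"  # assumed location

def load_cache() -> dict:
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            return json.load(f)
    return {}

def enrich_article(url: str, cache: bool = True) -> dict:
    store = load_cache()
    if cache and url in store:
        return store[url]  # intermediate results already computed
    result = {"url": url}  # ... run entity/topic/sentiment/scandal steps here
    store[url] = result
    with open(CACHE_PATH, "w") as f:
        json.dump(store, f)
    return result
```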

subjects/ai/nlp-scraper/audit/README.md (69 changes)

@@ -2,25 +2,7 @@
 ##### Preliminary
-```
-project
-│   README.md
-│   environment.yml
-└───data
-│   │   topic_classification_data.csv
-└───results
-│   │   topic_classifier.pkl
-│   │   learning_curves.png
-│   │   enhanced_news.csv
-|
-|───nlp_engine
-```
-###### Does the structure of the project look like the above?
+###### Does the structure of the project look like the one described in the subject?
 ###### Does the environment contain all libraries used and their versions that are necessary to run the code?
@@ -28,7 +10,7 @@ project
 ##### There are at least 300 news articles stored in the file system or the database.
-##### Run the scraper with `python scraper_news.py` and fetch 3 documents. The scraper is not expected to fetch 3 documents and stop by itself, you can stop it manually.
+##### Run the scraper with `python scraper_news.py` and fetch 3 documents. The scraper is not expected to fetch 3 documents and stop by itself; you can stop it manually.
 ###### Does it run without any error and store the 3 files as expected?
@@ -54,20 +36,7 @@ project
 ###### Does the DataFrame contain 300 different rows?
-###### Are the columns of the DataFrame as expected?
-```
-Date scraped (date)
-Title (str)
-URL (str)
-Body (str)
-Org (str)
-Topics (list str)
-Sentiment (list float or float)
-Scandal_distance (float)
-Top_10 (bool)
-```
+###### Are the columns of the DataFrame as defined in the subject `Deliverable` section?
 ##### Analyse the DataFrame with 300 articles: relevance of the topics matched, relevance of the sentiment, relevance of the scandal detected and relevance of the companies matched. The algorithms are not 100% accurate, so you should expect a few issues in the results.
@@ -75,36 +44,6 @@ Top_10 (bool)
 ###### Can you run `python nlp_enriched_news.py` without any error?
-###### Does the output of the NLP engine correspond to the output below?
-```prompt
-python nlp_enriched_news.py
-Enriching <URL>:
-Cleaning document ... (optional)
----------- Detect entities ----------
-Detected <X> companies which are <company_1> and <company_2>
----------- Topic detection ----------
-Text preprocessing ...
-The topic of the article is: <topic>
----------- Sentiment analysis ----------
-Text preprocessing ... (optional)
-The title which is <title> is <sentiment>
-The body of the article is <sentiment>
----------- Scandal detection ----------
-Computing embeddings and distance ...
-Environmental scandal detected for <entity>
-```
+###### Does the output of the NLP engine correspond to the output defined in the subject `Deliverable` section?
 ##### Analyse the output: relevance of the topic(s) matched, relevance of the sentiment, relevance of the scandal detected (if detected on the three articles) and relevance of the company(ies) matched.
