
CON-2393 clarify output of `nlp_enriched_news.py` script (#2419)

* chore(nlp-scraper): fix small grammar mistakes and improve readability

* feat(nlp-scraper): add link to datasets provided

* feat(nlp-scraper): add clarification about sentiment analysis

* feat(nlp-scraper): define how many articles are expected to be scraped

* chore(nlp-scraper): improve grammar and readability

* chore(nlp-scraper): fix typos

* feat(nlp-scraper): add label to link

* feat(nlp-scraper): remove audit question not related to the project

* refactor(nlp-scraper): refactor question

* chore(nlp-scraper): fix small typos

* feat(nlp-scraper): add information on how to calculate scandal

* feat(nlp-scraper): add details to the deliverable section

* feat(nlp-scraper): add reference to subject in audit

* feat(nlp-scraper): update project structure

- run prettier

* feat(nlp-scraper): complete sentence in subject intro

- make formatting consistent with 01 subject
Niccolò Primo committed via GitHub
commit c6d8ca334a
Changed files:
1. subjects/ai/nlp-scraper/README.md (134 changes)
2. subjects/ai/nlp-scraper/audit/README.md (69 changes)

subjects/ai/nlp-scraper/README.md (134 changes)

@@ -1,4 +1,4 @@
-# NLP-enriched News Intelligence platform
+## NLP-enriched News Intelligence platform
 The goal of this project is to build an NLP-enriched News Intelligence
 platform. News analysis is a trending and important topic. The analysts get
@@ -7,7 +7,8 @@ limitless. Having a platform that helps to detect the relevant information is
 definitely valuable.
 The platform connects to a news data source, detects the entities, detects the
-topic of the article, analyse the sentiment and ...
+topic of the article, analyses the sentiment and performs a scandal detection
+analysis.
 ### Scraper
@@ -40,7 +41,7 @@ the stored data.
 Here is how the NLP engine should process the news:
-### **1. Entities detection:**
+#### **1. Entities detection:**
 The goal is to detect all the entities in the document (headline and body). The
 type of entity we focus on is `ORG`. This corresponds to companies and
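For illustration, step 1 could look like the following minimal spaCy sketch; the `en_core_web_sm` model and the helper name are assumptions for the example, not requirements of the subject:

```python
# Hypothetical sketch of step 1: extract ORG entities with spaCy.
# Assumes the small English model: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def detect_orgs(text: str) -> list[str]:
    """Return the ORG entities found in a headline or article body."""
    doc = nlp(text)
    return [ent.text for ent in doc.ents if ent.label_ == "ORG"]

print(detect_orgs("Apple and Total reported record profits this quarter."))
# e.g. ['Apple', 'Total']
```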
@@ -51,7 +52,7 @@ organizations. This information should be stored.
 [Named Entity Recognition with NLTK and
 SpaCy](https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da)
-### **2. Topic detection:**
+#### **2. Topic detection:**
 The goal is to detect what the article is dealing with: Tech, Sport, Business,
 Entertainment or Politics. To do so, a labelled dataset is provided: [training
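As a hedged sketch of step 2, one possible topic classifier is TF-IDF features plus logistic regression with scikit-learn; the CSV path and the `text`/`category` column names are assumptions about the provided dataset:

```python
# Hedged sketch of step 2: a TF-IDF + logistic regression topic classifier.
import pickle

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("data/topic_classification_data.csv")  # assumed path and columns
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["category"], test_size=0.2, random_state=42
)

model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))  # check for overfitting

with open("results/topic_classifier.pkl", "wb") as f:
    pickle.dump(model, f)  # the pickled classifier expected in results/
```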
@@ -71,7 +72,7 @@ that the model is trained correctly and not overfitted.
 [following](https://www.kaggle.com/rmisra/news-category-dataset) which is
 based on 200k news headlines.
-### **3. Sentiment analysis:**
+#### **3. Sentiment analysis:**
 The goal is to detect the sentiment (positive, negative or neutral) of the news
 articles. To do so, use a pre-trained sentiment model. I suggest using:
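One pre-trained option for step 3 (an assumption for this sketch, since any pre-trained model is acceptable) is NLTK's VADER analyzer:

```python
# Hedged sketch of step 3: pre-trained sentiment scoring with NLTK's VADER.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time lexicon download

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("The company beat every earnings estimate.")
print(scores)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}

# Thresholding the compound score is a common convention, not a requirement.
compound = scores["compound"]
label = "positive" if compound > 0.05 else "negative" if compound < -0.05 else "neutral"
print(label)
```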
@@ -85,29 +86,32 @@ articles. To do so, use a pre-trained sentiment model. I suggest using:
 - [Sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis)
-### **4. Scandal detection **
+#### **4. Scandal detection**
 The goal is to detect environmental disasters for the detected companies. Here
 is the methodology that should be used:
-- Define keywords that correspond to environmental disaster that may be caused
-  by companies: pollution, deforestation etc ... Here is an example of disaster
-  we want to detect: https://en.wikipedia.org/wiki/MV_Erika. Pay attention to
-  not use ambiguous words that make sense in the context of an environmental
-  disaster but also in another context. This would lead to detect a false
-  positive natural disaster.
+- Define keywords that correspond to environmental disasters that may be caused
+  by companies: pollution, deforestation, etc. Here is [an example of a
+  disaster we want to detect](https://en.wikipedia.org/wiki/MV_Erika). Pay
+  attention not to use ambiguous words that make sense in the context of an
+  environmental disaster but also in other contexts; these would lead to
+  false positives.
-- Compute the embeddings of the keywords.
+- Compute the [embeddings of the
+  keywords](https://en.wikipedia.org/wiki/Word_embedding#Software).
-- Compute the distance between the embeddings of the keywords and all sentences
-  that contain an entity. Explain in the `README.md` the embeddings chosen and
-  why. Similarly explain the distance or similarity chosen and why.
+- Compute the distance ([here are some
+  examples](https://www.nltk.org/api/nltk.metrics.distance.html#module-nltk.metrics.distance))
+  between the embeddings of the keywords and all sentences that contain an
+  entity. Explain in the `README.md` the embeddings chosen and why. Similarly,
+  explain the distance or similarity chosen and why.
-- Save the distance
+- Save a metric to unify all the distances calculated per article.
 - Flag the top 10 articles.
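To make the methodology above concrete, here is a hedged sketch using spaCy word vectors and cosine distance; the model, keyword list, helper names, and the choice of "minimum distance" as the unifying metric are all illustrative assumptions (other embeddings and distances are equally valid, as long as you justify them in the `README.md`):

```python
# Hedged sketch of step 4: distance between keyword embeddings and the
# sentences that contain an entity (spaCy vectors assumed).
# Assumes: python -m spacy download en_core_web_md
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")  # medium model ships with word vectors

KEYWORDS = ["pollution", "deforestation", "oil spill"]  # illustrative keywords
keyword_vecs = [nlp(kw).vector for kw in KEYWORDS]

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    """1 - cosine similarity; 0 means identical direction."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def scandal_distance(entity_sentences: list[str]) -> float:
    """Unifying metric assumed here: minimum distance over all
    keyword/sentence pairs (expects a non-empty sentence list)."""
    return min(
        cosine_distance(nlp(sent).vector, kv)
        for sent in entity_sentences
        for kv in keyword_vecs
    )
```

Flagging the top 10 articles is then simply a sort on this per-article metric.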
-### 5. **Source analysis (optional)**
+#### 5. **Source analysis (optional)**
 The goal is to show insights about the news sources you scraped.
 This requires scraping data over at least 5 days (a week ideally). Save the plots
@@ -129,24 +133,20 @@ Here are examples of insights:
 ### Deliverables
-The structure of the project is:
+The expected structure of the project is:
 ```
-project
-│   README.md
-│   environment.yml
-└───data
-│   │   topic_classification_data.csv
-└───results
-│   │   topic_classifier.pkl
-│   │   learning_curves.png
-│   │   enhanced_news.csv
-|
-|───nlp_engine
+.
+├── data
+│   └── date_scrape_data.csv
+├── nlp_enriched_news.py
+├── README.md
+├── results
+│   ├── topic_classifier.pkl
+│   ├── enhanced_news.csv
+│   └── learning_curves.png
+└── scraper_news.py
 ```
 1. Run the scraper until it fetches at least 300 articles
@@ -166,52 +166,60 @@ python scraper_news.py
 ```
-2. Run on these 300 articles the NLP engine.
-Save a `DataFrame`:
+2. Run on these 300 articles the NLP engine. The script `nlp_enriched_news.py`
+   should:
+- Save a `DataFrame` with the following structure:
 ```
-Date scraped (date)
-Title (`str`)
-URL (`str`)
-Body (`str`)
-Org (`str`)
-Topics (`list str`)
-Sentiment (`list float` or `float`)
-Scandal_distance (`float`)
-Top_10 (`bool`)
+Unique ID (`uuid` or `int`)
+URL (`str`)
+Date scraped (`date`)
+Headline (`str`)
+Body (`str`)
+Org (`list str`)
+Topics (`list str`)
+Sentiment (`list float` or `float`)
+Scandal_distance (`float`)
+Top_10 (`bool`)
 ```
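For illustration, the enriched `DataFrame` above could be assembled like this with pandas; all values are placeholders standing in for the outputs of the earlier steps, not real results:

```python
# Hedged sketch: one row of the enriched DataFrame with the columns above.
import uuid
from datetime import date

import pandas as pd

rows = [
    {
        "Unique ID": str(uuid.uuid4()),
        "URL": "https://example.com/article",  # placeholder URL
        "Date scraped": date.today(),
        "Headline": "Example headline",
        "Body": "Example body ...",
        "Org": ["ExampleCorp"],                # entities from step 1
        "Topics": ["Business"],                # topic(s) from step 2
        "Sentiment": 0.4,                      # score from step 3
        "Scandal_distance": 0.73,              # unified metric from step 4
        "Top_10": False,
    }
]

df = pd.DataFrame(rows)
df.to_csv("results/enhanced_news.csv", index=False)
```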
-```prompt
-python nlp_enriched_news.py
-Enriching <URL>:
-Cleaning document ... (optional)
----------- Detect entities ----------
-Detected <X> companies which are <company_1> and <company_2>
----------- Topic detection ----------
-Text preprocessing ...
-The topic of the article is: <topic>
----------- Sentiment analysis ----------
-Text preprocessing ... (optional)
-The title which is <title> is <sentiment>
-The body of the article is <sentiment>
----------- Scandal detection ----------
-Computing embeddings and distance ...
-Environmental scandal detected for <entity>
-```
-I strongly suggest creating a data structure (dictionary for example) to save all the intermediate result. Then, a boolean argument `cache` fetched the intermediate results when they are already computed.
-Resources:
-- https://www.youtube.com/watch?v=XVv6mJpFOb0
+- Have a similar output while it processes the articles:
+```prompt
+python nlp_enriched_news.py
+Enriching <URL>:
+Cleaning document ... (optional)
+---------- Detect entities ----------
+Detected <X> companies which are <company_1> and <company_2>
+---------- Topic detection ----------
+Text preprocessing ...
+The topic of the article is: <topic>
+---------- Sentiment analysis ----------
+Text preprocessing ... (optional)
+The article <title> has a <sentiment> sentiment
+---------- Scandal detection ----------
+Computing embeddings and distance ...
+Environmental scandal detected for <entity>
+```
+> I strongly suggest creating a data structure (dictionary for example) to save
+> all the intermediate results. Then, a boolean argument `cache` fetches the
+> intermediate results when they are already computed.
+### Notions
+- [Web Scraping](https://www.youtube.com/watch?v=XVv6mJpFOb0)
+- [Sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis)
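The caching suggestion quoted above could be implemented along these lines; the JSON file location, the URL key, and the function names are assumptions for the sketch:

```python
# Hedged sketch of the suggested cache: intermediate results keyed by URL,
# reused when cache=True instead of being recomputed.
import json
import os

CACHE_PATH = "results/cache.json"  # assumed location

def load_cache() -> dict:
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            return json.load(f)
    return {}

def enrich_article(url: str, cache: bool = True) -> dict:
    store = load_cache()
    if cache and url in store:
        return store[url]  # intermediate results already computed
    result = {"url": url}  # ... run entity/topic/sentiment/scandal steps here
    store[url] = result
    with open(CACHE_PATH, "w") as f:
        json.dump(store, f)
    return result
```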

subjects/ai/nlp-scraper/audit/README.md (69 changes)

@@ -2,25 +2,7 @@
 ##### Preliminary
-```
-project
-│   README.md
-│   environment.yml
-└───data
-│   │   topic_classification_data.csv
-└───results
-│   │   topic_classifier.pkl
-│   │   learning_curves.png
-│   │   enhanced_news.csv
-|
-|───nlp_engine
-```
-###### Does the structure of the project look like the above?
+###### Does the structure of the project look like the one described in the subject?
 ###### Does the environment contain all libraries used and their versions that are necessary to run the code?
@@ -28,7 +10,7 @@ project
 ##### There are at least 300 news articles stored in the file system or the database.
-##### Run the scraper with `python scraper_news.py` and fetch 3 documents. The scraper is not expected to fetch 3 documents and stop by itself, you can stop it manually.
+##### Run the scraper with `python scraper_news.py` and fetch 3 documents. The scraper is not expected to fetch 3 documents and stop by itself; you can stop it manually.
 ###### Does it run without any error and store the 3 files as expected?
@@ -54,20 +36,7 @@ project
 ###### Does the DataFrame contain 300 different rows?
-###### Are the columns of the DataFrame as expected?
-```
-Date scraped (date)
-Title (str)
-URL (str)
-Body (str)
-Org (str)
-Topics (list str)
-Sentiment (list float or float)
-Scandal_distance (float)
-Top_10 (bool)
-```
+###### Are the columns of the DataFrame as defined in the subject `Deliverable` section?
 ##### Analyse the DataFrame with 300 articles: relevance of the topics matched, relevance of the sentiment, relevance of the scandal detected and relevance of the companies matched. The algorithms are not 100% accurate, so you should expect a few issues in the results.
@@ -75,36 +44,6 @@ Top_10 (bool)
 ###### Can you run `python nlp_enriched_news.py` without any error?
-###### Does the output of the NLP engine correspond to the output below?
-```prompt
-python nlp_enriched_news.py
-Enriching <URL>:
-Cleaning document ... (optional)
----------- Detect entities ----------
-Detected <X> companies which are <company_1> and <company_2>
----------- Topic detection ----------
-Text preprocessing ...
-The topic of the article is: <topic>
----------- Sentiment analysis ----------
-Text preprocessing ... (optional)
-The title which is <title> is <sentiment>
-The body of the article is <sentiment>
----------- Scandal detection ----------
-Computing embeddings and distance ...
-Environmental scandal detected for <entity>
-```
+###### Does the output of the NLP engine correspond to the output defined in the subject `Deliverable` section?
 ##### Analyse the output: relevance of the topic(s) matched, relevance of the sentiment, relevance of the scandal detected (if detected on the three articles) and relevance of the company(ies) matched.
