
feat(nlp): update exercise 7 subject and audit

pull/2364/head
nprimo 5 months ago committed by Niccolò Primo
parent
commit
c2e60afc28
  1. subjects/ai/nlp/README.md (21 lines changed)
  2. subjects/ai/nlp/audit/README.md (78 lines changed)

subjects/ai/nlp/README.md (21 lines changed)

@@ -196,7 +196,7 @@ Steps:
> Note: A data set is often described as an m x n matrix, where m is the number of rows and n is the number of columns (features). It is strongly recommended to work with m >> n. The appropriate ratio depends on the signal present in the data set and on the model complexity.
-2. Using `from_spmatrix` from Pandas, create a DataFrame with documents in rows and the dictionary in columns.
+2. Using `from_spmatrix` from Pandas, create a DataFrame `count_vectorized_df` using the output feature names as column names. The final result should be similar to the one below.
| | and | boat | compute |
| --: | --: | ---: | ------: |
@@ -206,16 +206,23 @@ Steps:
> Note: The sample 3x3 table mentioned is a small representation of the expected output for demonstration purposes. It's not necessary to drop columns in this context.
-3. Create a DataFrame with labels where:
+3. Show the token counts (obtained with the steps above) of the fourth tweet.
+4. Using the word counter, show the 15 most used tokenized words in the dataset's tweets.
+5. Add a `label` column to your `count_vectorized_df`, encoded as follows:
 - 1: Positive
 - 0: Neutral
 - -1: Negative
-| | Label |
-| --: | ----: |
-| 0 | -1 |
-| 1 | 0 |
-| 2 | 1 |
+The final DataFrame should be similar to the one below:
+| | ... | label |
+|---:|-------:|--------:|
+| 0 | ... | 1 |
+| 1 | ... | -1 |
+| 2 | ... | -1 |
+| 3 | ... | -1 |
_Resources: [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)_

subjects/ai/nlp/audit/README.md (78 lines changed)

@@ -183,26 +183,66 @@ Remove this from the sentence
##### The exercise is validated if all questions of the exercise are validated
-###### For question 1, is the output of the CountVectorizer the following?
+###### For question 1, is the output of the `CountVectorizer` the following?
```
<6588x500 sparse matrix of type '<class 'numpy.int64'>'
-with 79709 stored elements in Compressed Sparse Row format>
+with 37334 stored elements in Compressed Sparse Row format>
```
+###### For question 2, is the output of `print(count_vectorized_df.iloc[:3,400:403].to_markdown())` the following?
+```python
+| | someth | son | song |
+|---:|---------:|------:|-------:|
+| 0 | 0 | 0 | 0 |
+| 1 | 0 | 0 | 0 |
+| 2 | 0 | 0 | 0 |
+```
+###### For question 3, does the output match the following?
+```python
+cant 1
+deal 1
+end 1
+find 1
+keep 1
+like 1
+may 1
+say 1
+talk 1
+Name: 3, dtype: Sparse[int64, 0]
+```
+###### For question 4, does the output match the following?
+```python
+tomorrow 1126
+go 733
+day 667
+night 641
+may 533
+tonight 501
+see 439
+time 429
+im 422
+get 398
+today 389
+game 382
+saturday 379
+friday 375
+sunday 368
+dtype: int64
+```
+###### For question 5, is the output of `print(count_vectorized_df.iloc[350:354,499:501].to_markdown())` the following?
+```python
+| | your | label |
+|----:|-------:|--------:|
+| 350 | 0 | 1 |
+| 351 | 1 | -1 |
+| 352 | 0 | 1 |
+| 353 | 0 | 0 |
+```
-###### For question 2, is the output of `print(df.iloc[:3,400:403].to_markdown())` the following?
-| | talk | team | tell |
-|---:|-------:|-------:|-------:|
-| 0 | 0 | 0 | 0 |
-| 1 | 0 | 0 | 0 |
-| 2 | 0 | 0 | 0 |
-###### For question 3, is the shape of the wordcount DataFrame `(6588, 501)` and the output of `print(df.iloc[300:304,499:501].to_markdown())` the following?
-| | youtube | label |
-|----:|----------:|--------:|
-| 300 | 0 | 0 |
-| 301 | 0 | -1 |
-| 302 | 1 | 0 |
-| 303 | 0 | 1 |
