feat(nlp-scraper): restructure subject and audit to avoid storing big files in solution

nprimo 2024-02-08 11:59:33 +01:00 committed by Niccolò Primo
parent 700efcb57b
commit d40ec29cf3
2 changed files with 22 additions and 24 deletions

@ -58,9 +58,10 @@ The goal is to detect what the article is dealing with: Tech, Sport, Business,
Entertainment or Politics. To do so, a labelled dataset is provided: [training
data](bbc_news_train.csv) and [test data](bbc_news_test.csv). From this
dataset, build a classifier that learns to detect the right topic in the
article. The trained model should be stored as `topic_classifier.pkl`. Make
sure the model can be used easily (with the preprocessing pipeline built for
instance) because the audit requires the auditor to test the model.
article. Save the training process to a Python file because the audit requires
the auditor to test the model.
To proceed with the following instructions, save the model as
`topic_classifier.pkl`.
Save the plot of learning curves (`learning_curves.png`) in `results` to prove
that the model is trained correctly and not overfitted.
@ -139,10 +140,11 @@ The expected structure of the project is:
project
.
├── data
│   └── date_scrape_data.csv
│   └── ...
├── nlp_enriched_news.py
├── README.md
├── results
│   ├── training_model.py
│   ├── enhanced_news.csv
│   └── learning_curves.png
└── scraper_news.py

@ -8,11 +8,9 @@
##### Scraper
##### There are at least 300 news articles stored in the file system or the database.
##### Run the scraper with `python scraper_news.py` and fetch 300 articles. If needed, stop the program manually when enough data has been retrieved.
##### Run the scraper with `python scraper_news.py` and fetch 3 documents. The scraper is not expected to fetch 3 documents and stop by itself; you can stop it manually.
###### Does it run without any error and store the 3 files as expected?
###### Does it run without any error and store the articles as described in the subject?
##### Topic classifier
@ -28,6 +26,20 @@
###### Does the topic classifier score an accuracy higher than 95% on the given datasets?
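
One way an auditor might check the 95% threshold is sketched below. It assumes the pickled object exposes a scikit-learn-style `predict` method that accepts raw article texts (that is, the preprocessing is bundled with the model). The helper names `load_model` and `accuracy` are hypothetical and are not defined by the subject.

```python
import pickle


def load_model(path="topic_classifier.pkl"):
    """Load the pickled classifier (assumed to bundle its own preprocessing)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def accuracy(model, texts, labels):
    """Return the fraction of texts whose predicted topic matches the label."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    if len(texts) == 0:
        raise ValueError("cannot compute accuracy on an empty dataset")
    # Assumption: the model follows the scikit-learn convention of predict(X).
    predictions = model.predict(texts)
    correct = sum(
        1 for predicted, expected in zip(predictions, labels) if predicted == expected
    )
    return correct / len(labels)
```

With the article texts and topic labels read from the provided test data, `accuracy(load_model(), texts, labels) > 0.95` would answer the question above. Only unpickle files from a source you trust, since `pickle.load` can execute arbitrary code.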
##### NLP engine output on 300 articles
###### Can you run `python nlp_enriched_news.py` without any error?
###### Does the DataFrame saved in the `csv` file contain 300 different rows?
###### Are the columns of the DataFrame as defined in the subject `Deliverable` section?
###### Does the output of the NLP engine correspond to the output defined in the subject `Deliverable` section?
##### Analyse the output: relevance of the topic(s) matched, relevance of the sentiment, relevance of the scandal detected (if detected on the three articles) and relevance of the company(ies) matched.
###### Is the information presented consistent and accurate?
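
The row-count question above can be checked with a few lines of standard-library Python. This is a minimal sketch: the function name `count_distinct_rows` is hypothetical, and it assumes the CSV has a single header row and is UTF-8 encoded.

```python
import csv


def count_distinct_rows(csv_path):
    """Return (total_rows, distinct_rows) for a CSV file, header excluded."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row (assumed to be present)
        rows = [tuple(row) for row in reader]
    return len(rows), len(set(rows))
```

For example, `count_distinct_rows("results/enhanced_news.csv")` should report 300 distinct rows if the output matches the expected project structure.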
##### Scandal detection
###### Does the `README.md` explain the choice of embeddings and distance?
@ -35,19 +47,3 @@
###### Does the DataFrame flag the top 10 articles with the highest likelihood of environmental scandal?
###### Is the distance or similarity saved in the DataFrame?
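
The top-10 flagging checked above could be sketched as follows. The field names `scandal_similarity` and `top_scandal` are hypothetical (the subject defines the actual columns), and the sketch assumes a similarity score where higher means more likely. For a distance, where lower means more likely, `heapq.nsmallest` would be used instead. Rows are plain dictionaries here to keep the example free of dependencies; the same logic applies to a DataFrame column.

```python
import heapq


def flag_top_scandals(rows, score_key="scandal_similarity", top_n=10):
    """Return copies of rows with a boolean 'top_scandal' field.

    The top_n rows with the highest score are flagged True, the rest False.
    """
    # Select the indices of the top_n highest-scoring rows.
    top_indices = set(
        heapq.nlargest(top_n, range(len(rows)), key=lambda i: rows[i][score_key])
    )
    return [
        dict(row, top_scandal=(index in top_indices))
        for index, row in enumerate(rows)
    ]
```

Keeping the score alongside the flag lets the auditor verify both that exactly 10 articles are flagged and that they are the highest-scoring ones.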
##### NLP engine output on 300 articles
###### Does the DataFrame contain 300 different rows?
###### Are the columns of the DataFrame as defined in the subject `Deliverable` section?
##### Analyse the DataFrame with 300 articles: relevance of the topics matched, relevance of the sentiment, relevance of the scandal detected and relevance of the companies matched. The algorithms are not 100% accurate, so you should expect a few issues in the results.
##### NLP engine on 3 articles
###### Can you run `python nlp_enriched_news.py` without any error?
###### Does the output of the NLP engine correspond to the output defined in the subject `Deliverable` section?
##### Analyse the output: relevance of the topic(s) matched, relevance of the sentiment, relevance of the scandal detected (if detected on the three articles) and relevance of the company(ies) matched.