From d40ec29cf3126e497aae2256b5e5222607376dba Mon Sep 17 00:00:00 2001
From: nprimo <primo.niccolo@gmail.com>
Date: Thu, 8 Feb 2024 11:59:33 +0100
Subject: [PATCH] feat(nlp-scraper): restructure subject and audit to avoid
 storing big files in solution

---
 subjects/ai/nlp-scraper/README.md       | 10 ++++---
 subjects/ai/nlp-scraper/audit/README.md | 36 +++++++++++--------------
 2 files changed, 22 insertions(+), 24 deletions(-)

diff --git a/subjects/ai/nlp-scraper/README.md b/subjects/ai/nlp-scraper/README.md
index 9b210ce8f..1e73c95bc 100644
--- a/subjects/ai/nlp-scraper/README.md
+++ b/subjects/ai/nlp-scraper/README.md
@@ -58,9 +58,10 @@ The goal is to detect what the article is dealing with: Tech, Sport, Business,
 Entertainment or Politics. To do so, a labelled dataset is provided: [training
 data](bbc_news_train.csv) and [test data](bbc_news_test.csv). From this
 dataset, build a classifier that learns to detect the right topic in the
-article. The trained model should be stored as `topic_classifier.pkl`. Make
-sure the model can be used easily (with the preprocessing pipeline built for
-instance) because the audit requires the auditor to test the model.
+article. Save the training process to a Python file because the audit requires
+the auditor to test the model.
+To proceed with the following instructions, save the model as
+`topic_classifier.pkl`.
 
 Save the plot of learning curves (`learning_curves.png`) in `results` to prove
 that the model is trained correctly and not overfitted.
@@ -139,10 +140,11 @@ The expected structure of the project is:
 project
 .
 ├── data
-│   └── date_scrape_data.csv
+│   └── ...
 ├── nlp_enriched_news.py
 ├── README.md
 ├── results
+│   ├── training_model.py
 │   ├── enhanced_news.csv
 │   └── learning_curves.png
 └── scraper_news.py
diff --git a/subjects/ai/nlp-scraper/audit/README.md b/subjects/ai/nlp-scraper/audit/README.md
index 6dd031407..e38aa8144 100644
--- a/subjects/ai/nlp-scraper/audit/README.md
+++ b/subjects/ai/nlp-scraper/audit/README.md
@@ -8,11 +8,9 @@
 
 ##### Scraper
 
-##### There are at least 300 news articles stored in the file system or the database.
+##### Run the scraper with `python scraper_news.py` and fetch 300 articles. If needed, stop the program manually when enough data has been retrieved.
 
-##### Run the scraper with `python scraper_news.py` and fetch 3 documents. The scraper is not expected to fetch 3 documents and stop by itself, you can stop it manually.
-
-###### Does it run without any error and store the 3 files as expected?
+###### Does it run without any error and store the articles as described in the subject?
 
 ##### Topic classifier
 
@@ -28,6 +26,20 @@
 
 ###### Does the topic classifier score an accuracy higher than 95% on the given datasets?
 
+##### NLP engine output on 300 articles
+
+###### Can you run `python nlp_enriched_news.py` without any error?
+
+###### Does the DataFrame saved in the `csv` file contain 300 different rows?
+
+###### Are the columns of the DataFrame as defined in the subject `Deliverable` section?
+
+###### Does the output of the NLP engine correspond to the output defined in the subject `Deliverable` section?
+
+##### Analyse the output: relevance of the topic(s) matched, relevance of the sentiment, relevance of the scandal detected (if any) and relevance of the company(ies) matched.
+
+###### Is the information presented consistent and accurate?
+
 ##### Scandal detection
 
 ###### Does the `README.md` explain the choice of embeddings and distance?
@@ -35,19 +47,3 @@
 ###### Does the DataFrame flag the top 10 articles with the highest likelihood of environmental scandal?
 
 ###### Is the distance or similarity saved in the DataFrame?
-
-##### NLP engine output on 300 articles
-
-###### Does the DataFrame contain 300 different rows?
-
-###### Are the columns of the DataFrame as defined in the subject `Deliverable` section?
-
-##### Analyse the DataFrame with 300 articles: relevance of the topics matched, relevance of the sentiment, relevance of the scandal detected and relevance of the companies matched. The algorithms are not 100% accurate, so you should expect a few issues in the results.
-
-##### NLP engine on 3 articles
-
-###### Can you run `python nlp_enriched_news.py` without any error?
-
-###### Does the output of the NLP engine correspond to the output defined in the subject `Deliverable` section?
-
-##### Analyse the output: relevance of the topic(s) matched, relevance of the sentiment, relevance of the scandal detected (if detected on the three articles) and relevance of the company(ies) matched.