From d40ec29cf3126e497aae2256b5e5222607376dba Mon Sep 17 00:00:00 2001
From: nprimo
Date: Thu, 8 Feb 2024 11:59:33 +0100
Subject: [PATCH] feat(nlp-scraper): restructure subject and audit to avoid storing big files in solution

---
 subjects/ai/nlp-scraper/README.md       | 10 ++++---
 subjects/ai/nlp-scraper/audit/README.md | 36 +++++++++++--------------
 2 files changed, 22 insertions(+), 24 deletions(-)

diff --git a/subjects/ai/nlp-scraper/README.md b/subjects/ai/nlp-scraper/README.md
index 9b210ce8f..1e73c95bc 100644
--- a/subjects/ai/nlp-scraper/README.md
+++ b/subjects/ai/nlp-scraper/README.md
@@ -58,9 +58,10 @@ The goal is to detect what the article is dealing with: Tech, Sport, Business,
 Entertainment or Politics. To do so, a labelled dataset is provided: [training
 data](bbc_news_train.csv) and [test data](bbc_news_test.csv). From this
 dataset, build a classifier that learns to detect the right topic in the
-article. The trained model should be stored as `topic_classifier.pkl`. Make
-sure the model can be used easily (with the preprocessing pipeline built for
-instance) because the audit requires the auditor to test the model.
+article. Save the training process to a python file because the audit requires
+the auditor to test the model.
+To proceed with the following instructions, save the model as
+`topic_classifier.pkl`.
 
 Save the plot of learning curves (`learning_curves.png`) in `results` to prove
 that the model is trained correctly and not overfitted.
@@ -139,10 +140,11 @@ The expected structure of the project is:
 project
 .
 ├── data
-│   └── date_scrape_data.csv
+│   └── ...
 ├── nlp_enriched_news.py
 ├── README.md
 ├── results
+│   ├── training_model.py
 │   ├── enhanced_news.csv
 │   └── learning_curves.png
 └── scraper_news.py
diff --git a/subjects/ai/nlp-scraper/audit/README.md b/subjects/ai/nlp-scraper/audit/README.md
index 6dd031407..e38aa8144 100644
--- a/subjects/ai/nlp-scraper/audit/README.md
+++ b/subjects/ai/nlp-scraper/audit/README.md
@@ -8,11 +8,9 @@
 
 ##### Scraper
 
-##### There are at least 300 news articles stored in the file system or the database.
+##### Run the scraper with `python scraper_news.py` and fetch 300 articles. If needed, stop the program manually when enough data has been retrieved.
 
-##### Run the scraper with `python scraper_news.py` and fetch 3 documents. The scraper is not expected to fetch 3 documents and stop by itself, you can stop it manually.
-
-###### Does it run without any error and store the 3 files as expected?
+###### Does it run without any error and store the articles as described in the subject?
 
 ##### Topic classifier
 
@@ -28,6 +26,20 @@
 
 ###### Does the topic classifier score an accuracy higher than 95% on the given datasets?
 
+##### NLP engine output on 300 articles
+
+###### Can you run `python nlp_enriched_news.py` without any error?
+
+###### Does the DataFrame saved in the `csv` file contain 300 different rows?
+
+###### Are the columns of the DataFrame as defined in the subject `Deliverable` section?
+
+###### Does the output of the NLP engine correspond to the output defined in the subject `Deliverable` section?
+
+##### Analyse the output: relevance of the topic(s) matched, relevance of the sentiment, relevance of the scandal detected (if detected on the three articles) and relevance of the company(ies) matched.
+
+###### Is the information presented consistent and accurate?
+
 ##### Scandal detection
 
 ###### Does the `README.md` explain the choice of embeddings and distance?
@@ -35,19 +47,3 @@
 ###### Does the DataFrame flag the top 10 articles with the highest likelihood of environmental scandal?
 
 ###### Is the distance or similarity saved in the DataFrame?
-
-##### NLP engine output on 300 articles
-
-###### Does the DataFrame contain 300 different rows?
-
-###### Are the columns of the DataFrame as defined in the subject `Deliverable` section?
-
-##### Analyse the DataFrame with 300 articles: relevance of the topics matched, relevance of the sentiment, relevance of the scandal detected and relevance of the companies matched. The algorithms are not 100% accurate, so you should expect a few issues in the results.
-
-##### NLP engine on 3 articles
-
-###### Can you run `python nlp_enriched_news.py` without any error?
-
-###### Does the output of the NLP engine correspond to the output defined in the subject `Deliverable` section?
-
-##### Analyse the output: relevance of the topic(s) matched, relevance of the sentiment, relevance of the scandal detected (if detected on the three articles) and relevance of the company(ies) matched.
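
Review note: the audit question about the `csv` file containing 300 different rows could be checked with something like the sketch below. It is a minimal illustration and not part of the patch. The helper name `count_rows` is hypothetical, and it assumes the output is a UTF-8 CSV with a single header row, such as the `results/enhanced_news.csv` file listed in the project structure.

```python
import csv


def count_rows(csv_path):
    """Return (total_rows, distinct_rows) for a CSV file, excluding the header.

    Hypothetical helper for the audit check; assumes one header row.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row, if the file has one
        # Ignore completely empty lines; compare rows as tuples so they are hashable.
        rows = [tuple(row) for row in reader if row]
    return len(rows), len(set(rows))
```

An auditor might call `count_rows("results/enhanced_news.csv")` after running `python nlp_enriched_news.py` and compare both numbers against 300. Rows are treated as different only when at least one field differs, which is one possible reading of "different rows".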