mirror of https://github.com/01-edu/public.git

feat(nlp-scraper): restructure subject and audit to avoid storing big files in solution

parent 700efcb57b
commit d40ec29cf3
@@ -58,9 +58,10 @@ The goal is to detect what the article is dealing with: Tech, Sport, Business,
 Entertainment or Politics. To do so, a labelled dataset is provided: [training
 data](bbc_news_train.csv) and [test data](bbc_news_test.csv). From this
 dataset, build a classifier that learns to detect the right topic in the
-article. The trained model should be stored as `topic_classifier.pkl`. Make
-sure the model can be used easily (with the preprocessing pipeline built for
-instance) because the audit requires the auditor to test the model.
+article. Save the training process to a python file because the audit requires
+the auditor to test the model.
+To proceed with the following instructions, save the model as
+`topic_classifier.pkl`.
 
 Save the plot of learning curves (`learning_curves.png`) in `results` to prove
 that the model is trained correctly and not overfitted.
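The pipeline-plus-pickle requirement in the hunk above could be sketched as follows. This is an illustrative outline only, not the repository's solution: the inline texts and labels stand in for `bbc_news_train.csv`, and the choice of TF-IDF with logistic regression is an assumption.

```python
# Illustrative sketch only: bundle preprocessing and model in one Pipeline
# and pickle it as `topic_classifier.pkl`, so the auditor can call
# predict() on raw article text after unpickling.
# The inline texts/labels stand in for bbc_news_train.csv.
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = [
    "The striker scored twice in the championship final",
    "Shares fell sharply after the quarterly earnings report",
    "The new smartphone chip doubles battery performance",
]
labels = ["Sport", "Business", "Tech"]

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(texts, labels)

with open("topic_classifier.pkl", "wb") as f:
    pickle.dump(model, f)
```

Keeping the vectorizer inside the pickled `Pipeline` is what makes the model "easy to use": the auditor never has to rebuild the preprocessing by hand.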
@@ -139,10 +140,11 @@ The expected structure of the project is:
 project
 .
 ├── data
-│   └── date_scrape_data.csv
+│   └── ...
 ├── nlp_enriched_news.py
 ├── README.md
 ├── results
+│   ├── training_model.py
 │   ├── enhanced_news.csv
 │   └── learning_curves.png
 └── scraper_news.py
@@ -8,11 +8,9 @@
 
 ##### Scraper
 
-##### There are at least 300 news articles stored in the file system or the database.
-
-##### Run the scraper with `python scraper_news.py` and fetch 3 documents. The scraper is not expected to fetch 3 documents and stop by itself, you can stop it manually.
-
-###### Does it run without any error and store the 3 files as expected?
+##### Run the scraper with `python scraper_news.py` and fetch 300 articles. If needed, stop the program manually when enough data has been retrieved.
+
+###### Does it run without any error and store the articles as described in the subject?
 
 ##### Topic classifier
 
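A scraper that can be "stopped manually when enough data has been retrieved" might look like the sketch below. This is an assumed design, not the subject's `scraper_news.py`: `fetch_next_article` is a placeholder for real network code, and the output file name is illustrative.

```python
# Assumed design, not the official scraper_news.py: write articles to a csv
# as they arrive, so a manual Ctrl+C keeps everything fetched so far.
import csv
import pathlib

def fetch_next_article(n):
    # Placeholder: a real scraper would request and parse a news page here.
    return {"title": f"article {n}", "body": f"body of article {n}"}

def scrape(out_path="data/scrape_data.csv", target=300):
    pathlib.Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "body"])
        writer.writeheader()
        try:
            for n in range(target):
                writer.writerow(fetch_next_article(n))
        except KeyboardInterrupt:
            # Manual stop: rows already written are kept on disk.
            pass
    return out_path
```

Writing row by row (rather than collecting everything in memory) is what makes the manual stop safe.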
@@ -28,6 +26,20 @@
 
 ###### Does the topic classifier score an accuracy higher than 95% on the given datasets?
 
+##### NLP engine output on 300 articles
+
+###### Can you run `python nlp_enriched_news.py` without any error?
+
+###### Does the DataFrame saved in the `csv` file contain 300 different rows?
+
+###### Are the columns of the DataFrame as defined in the subject `Deliverable` section?
+
+###### Does the output of the NLP engine correspond to the output defined in the subject `Deliverable` section?
+
+##### Analyse the output: relevance of the topic(s) matched, relevance of the sentiment, relevance of the scandal detected (if detected on the three articles) and relevance of the company(ies) matched.
+
+###### Is the information presented consistent and accurate?
+
 ##### Scandal detection
 
 ###### Does the `README.md` explain the choice of embeddings and distance?
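The "300 different rows" question from the hunk above could be verified with a one-off pandas check; the file name and column layout here are assumptions for illustration.

```python
# Sketch of one audit check (assumed csv name/layout): the enriched
# DataFrame should contain 300 distinct rows.
import pandas as pd

def has_expected_rows(csv_path, expected=300):
    df = pd.read_csv(csv_path)
    # drop_duplicates guards against the same article being counted twice.
    return len(df.drop_duplicates()) == expected
```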
@@ -35,19 +47,3 @@
 ###### Does the DataFrame flag the top 10 articles with the highest likelihood of environmental scandal?
 
 ###### Is the distance or similarity saved in the DataFrame?
-
-##### NLP engine output on 300 articles
-
-###### Does the DataFrame contain 300 different rows?
-
-###### Are the columns of the DataFrame as defined in the subject `Deliverable` section?
-
-##### Analyse the DataFrame with 300 articles: relevance of the topics matched, relevance of the sentiment, relevance of the scandal detected and relevance of the companies matched. The algorithms are not 100% accurate, so you should expect a few issues in the results.
-
-##### NLP engine on 3 articles
-
-###### Can you run `python nlp_enriched_news.py` without any error?
-
-###### Does the output of the NLP engine correspond to the output defined in the subject `Deliverable` section?
-
-##### Analyse the output: relevance of the topic(s) matched, relevance of the sentiment, relevance of the scandal detected (if detected on the three articles) and relevance of the company(ies) matched.
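The "top 10 articles" and "distance or similarity saved in the DataFrame" checks kept by the hunk above amount to ranking article embeddings against a scandal query. A minimal sketch, assuming cosine similarity over arbitrary embedding vectors (the subject leaves the embedding choice to the student):

```python
# Illustrative helper for the scandal-detection checks: rank articles by
# cosine similarity to a "scandal" embedding and keep the top k.
# Plain numpy vectors stand in for real sentence embeddings.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_indices(article_vecs, query_vec, k=10):
    scores = [cosine_similarity(v, query_vec) for v in article_vecs]
    # Highest similarity first; the scores list can be saved as a
    # DataFrame column to answer the "is the similarity saved?" question.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k], scores
```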