Merge branch '01-edu:master' into elementary.js-patch

Natheer Radhi 2024-03-24 11:58:32 +03:00 committed by GitHub
commit 7e604fdcd8
20 changed files with 1293 additions and 491 deletions


@ -1,40 +1,35 @@
# Credit scoring
## Credit scoring
The goal of this project is to implement a scoring model based on various source of data (check data documentation) that returns the probability of default. In a nutshell, credit scoring represents an evaluation of how well the bank's customer can pay and is willing to pay off debt. It is also required that you provide an explanation of the score. For example, your model returns that the probability that one client doesn't pay back the loan is very high (90%). The reason behind is that variable_xxx which represents the ability to pay back the past loan is low. The output interpretability will appear in a visualization.
The goal of this project is to implement a scoring model based on various sources of data ([check data documentation](./readme_data.md)) that returns the probability of default. In a nutshell, credit scoring represents an evaluation of how well the bank's customer can pay and is willing to pay off debt. It is also required that you provide an explanation of the score. For example, your model returns that the probability that one client doesn't pay back the loan is very high (90%). The reason behind it is that variable_xxx, which represents the ability to pay back the past loan, is low. The output interpretability will appear in a visualization.
The ability to understand the underlying factors of credit scoring is important. Credit scoring is subject to more and more regulation, so transparency is key. And more generaly, more and more companies prefer transparency to black box models.
The ability to understand the underlying factors of credit scoring is important. Credit scoring is subject to more and more regulation, so transparency is key. And more generally, more and more companies prefer transparency to black box models.
### Resources
Historical timeline of machine learning techniques applied to credit scoring
- https://hal.archives-ouvertes.fr/hal-02507499v3/document
- https://www.kaggle.com/c/home-credit-default-risk/data
# Deliverables
- [Machine Learning or Econometrics for Credit Scoring: Lets Get the Best of Both Worlds](https://hal.archives-ouvertes.fr/hal-02507499v3/document)
### Scoring model
The are 3 expected deliverables associated with the scoring model:
There are 3 expected deliverables associated with the scoring model:
- An exploratory data analysis notebook that describes the insights you find out in the data set.
- The trained machine learning model with the feature engineering pipeline:
- Do not forget: **Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering.**
- The model is validated if the **AUC on the test set is higher than 75%**.
- The labelled test data is not publicly available. However a Kaggle competition uses the same data. The procedure to evaluate test set submission is the same as the one used for the project 1.
- The labelled test data is not publicly available. However, a Kaggle competition uses the same data. The procedure to evaluate test set submissions is the same as the one used for project 1.
### Kaggle submission
The way the Kaggle platform works is explained in the challenge overview page. If you need more details, I suggest this resource that gives detailed explanations.
- https://towardsdatascience.com/getting-started-with-kaggle-f9138b35ae18
The way the Kaggle platform works is explained in the challenge overview page. If you need more details, I suggest [this resource](https://towardsdatascience.com/getting-started-with-kaggle-f9138b35ae18) that gives detailed explanations.
- Create a username following that structure: username*01EDU* location_MM_YYYY. Submit the description profile and push it on the Git platform the first day of the week. Do not touch this file anymore.
- A text document that describes the methodology used to train the machine learning model:
- Algorithm
- Why the accuracy shouldn't be used in that case ?
- Why shouldn't accuracy be used in that case?
- Limit and possible improvements
### Model interpretability
@ -50,7 +45,7 @@ There are 2 tools you can use to analyse your model and its predictions: - Featu
Implement a program that takes as input the trained model, the customer id ... and returns:
- the score and the SHAP force plot associated with it
- Plotly visualisations that show:
- Plotly visualizations that show:
- key variables describing the client and its loan(s)
- comparison between this client and other clients
@ -62,9 +57,7 @@ Choose the 3 clients of your choice, compute the score, run the visualizations o
### Optional
Implement a dashboard (using Dash) that takes as input the customer id and that returns the score and the required visualizations.
- https://stackoverflow.com/questions/54292226/putting-html-output-from-shap-into-the-dash-output-layout-callback
Implement a dashboard (using [Dash](https://dash.plotly.com/)) that takes as input the customer id and that returns the score and the required visualizations.
### Deliverables
@ -103,15 +96,14 @@ project
- `README.md` introduces the project and shows the username.
- `environment.yml` contains all libraries required to run the code.
- `username.txt` contains the username, the last modified date of the file **has to correspond to the first day of the project**.
- `EDA.ipynb` contains the exploratory data analysis. This file is should contain all steps of data analysis that contributed or not to improve the score of the model. It has to be commented so that the reviewer can understand the analysis and run it without any problem.
- `EDA.ipynb` contains the exploratory data analysis. This file should contain all steps of data analysis that contributed or not to improve the score of the model. It has to be commented so that the reviewer can understand the analysis and run it without any problem.
- `scripts` contains python file(s) that perform(s) the feature engineering, the model's training and prediction on the test set. It could also be one single Jupyter Notebook. It has to be commented to help the reviewers understand the approach and run the code without any bugs.
### Useful resources
- https://towardsdatascience.com/interpretability-in-machine-learning-70c30694a05f
- [Interpreting machine learning models](https://towardsdatascience.com/interpretability-in-machine-learning-70c30694a05f)
### Files needed for this project
[File 1](https://assets.01-edu.org/ai-branch/project5/project05-20221024T130417Z-001.zip)
[File 2](https://assets.01-edu.org/ai-branch/project5/project05-20221024T130417Z-002.zip)
[Files](https://assets.01-edu.org/ai-branch/project5/home-credit-default-risk.zip)


@ -11,8 +11,8 @@ focus on two tasks:
With computing power increasing exponentially, the computer vision field has been developing rapidly. This is a key element because the computing power allows using more easily a type of neural network that is very powerful on images:
CNNs (Convolutional Neural Networks). Before CNNs were democratized, the algorithms used relied a lot on human analysis to extract features, which was obviously time-consuming and not reliable. If you're interested in the "old
school methodology" [this article](towardsdatascience.com/classifying-facial-emotions-via-machine-learning-5aac111932d3)
explains it. The history behind this field is fascinating! [Here](https://kapernikov.com/basic-introduction-to-computer-vision/) is a short summary of its history.
school methodology" [this article](https://towardsdatascience.com/classifying-facial-emotions-via-machine-learning-5aac111932d3) explains it.
The history behind this field is fascinating! [Here](https://kapernikov.com/basic-introduction-to-computer-vision/) is a short summary of its history.
### Project goal and suggested timeline


@ -1,4 +1,4 @@
# Forest Prediction
## Forest Prediction
The goal of this project is to use cartographic variables to classify forest categories. You will have to analyse the data, create features, and train a machine learning model on the cartographic data to make it as accurate as possible.
@ -47,7 +47,7 @@ project
- Create a Jupyter Notebook to analyse the data sets and perform EDA (Exploratory Data Analysis). This notebook will not be evaluated.
- *Hint: Examples of interesting features*
- _Hint: Examples of interesting features_
- `Distance to hydrology = sqrt((Horizontal_Distance_To_Hydrology)^2 + (Vertical_Distance_To_Hydrology)^2)`
- `Horizontal_Distance_To_Fire_Points - Horizontal_Distance_To_Roadways`
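As an illustration, here is a minimal sketch of how these two hinted features could be computed, assuming the cartographic columns are loaded into a pandas DataFrame (the `train.csv` file name is only an assumption):

```python
import numpy as np
import pandas as pd

# Hypothetical file name; adjust to the actual data file.
df = pd.read_csv("train.csv")

# Euclidean distance to the nearest surface water feature.
df["Distance_To_Hydrology"] = np.sqrt(
    df["Horizontal_Distance_To_Hydrology"] ** 2
    + df["Vertical_Distance_To_Hydrology"] ** 2
)

# Difference between the fire-point and roadway distances.
df["Fire_Minus_Road"] = (
    df["Horizontal_Distance_To_Fire_Points"] - df["Horizontal_Distance_To_Roadways"]
)
```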
@ -79,15 +79,14 @@ DATA
- Split train test
- Cross validation: at least 5 folds
- Grid search on at least 5 different models:
- Gradient Boosting, KNN, Random Forest, SVM, Logistic Regression. *Remember that for some model scaling the data is important and for others it doesn't matter.*
- Gradient Boosting, KNN, Random Forest, SVM, Logistic Regression. _Remember that for some models scaling the data is important and for others it doesn't matter._
- Train accuracy score < **0.98**. Train set (0). Write the result in the `README.md`
- Test (last day) accuracy > **0.65**. Test set (0). Write the result in the `README.md`
- Display the confusion matrix for the best model in a DataFrame. Specify the index and column names (True label and Predicted label)
- Plot the learning curve for the best model
- Save the trained model as a [pickle](https://www.datacamp.com/community/tutorials/pickle-python-tutorial) file
- Save the trained model as a [pickle](https://docs.python.org/3/library/pickle.html) file
> Advice: As the grid search takes time, I suggest to prepare and test the code. Once you are confident it works, run the gridsearch at night and analyse the results
> Advice: As the grid search takes time, I suggest preparing and testing the code. Once you are confident it works, run the grid search overnight and analyse the results
**Hint**: The confusion matrix shows the misclassifications class per class. Try to detect if the model badly misclassifies one class as another. Then, do some research on the internet on the two forest cover types, find the differences and create some new features that underline these differences. More generally, the methodology of a model's learning is a cycle with several iterations. More details [here](https://serokell.io/blog/machine-learning-testing)
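Below is a minimal sketch of the grid search step for one of the five required models; the synthetic data stands in for the real feature matrix, and the parameter grid is only an example:

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared train set.
X_train, y_train = make_classification(n_samples=500, n_features=10, random_state=0)

pipeline = Pipeline([
    ("scaler", StandardScaler()),              # matters for SVM/KNN, harmless for trees
    ("model", GradientBoostingClassifier()),
])
param_grid = {"model__n_estimators": [100, 300], "model__max_depth": [3, 5]}

search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)

# Save the best pipeline as a pickle file, as required.
with open("best_model.pkl", "wb") as f:
    pickle.dump(search.best_estimator_, f)
```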


@ -56,7 +56,7 @@ SpaCy](https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spa
The goal is to detect what the article is dealing with: Tech, Sport, Business,
Entertainment or Politics. To do so, a labelled dataset is provided: [training
data](bbc_news_train.csv) and [test data](bbc_news_test.csv). From this
data](bbc_news_train.csv) and [test data](bbc_news_tests.csv). From this
dataset, build a classifier that learns to detect the right topic in the
article. Save the training process to a python file because the audit requires
the auditor to test the model.
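As an illustration, a minimal baseline sketch; the `Text` and `Category` column names are assumptions about the CSV layout and should be adjusted to the real file:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Column names below are assumptions; adapt them to the actual CSV header.
df = pd.read_csv("bbc_news_train.csv")
X_train, X_val, y_train, y_val = train_test_split(
    df["Text"], df["Category"], test_size=0.2, random_state=42, stratify=df["Category"]
)

clf = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))
```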
@ -68,11 +68,6 @@ that the model is trained correctly and not overfitted.
- Learning constraints: **Score on test: > 95%**
- **Optional**: If you want to train a news' topic classifier based on a more
challenging dataset, you can use the
[following](https://www.kaggle.com/rmisra/news-category-dataset) which is
based on 200k news headlines.
#### **3. Sentiment analysis:**
The goal is to detect the sentiment (positive, negative or neutral) of the news


@ -95,7 +95,7 @@ The goal of this exercise is to learn to deal with punctuation. In Natural Langu
# Exercise 3: Tokenization
The goal of this exercise is to learn to tokenize as text. This step is important because it splits the text into token. A token could be a sentence or a word.
The goal of this exercise is to learn [to tokenize](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) a text. This step is important because it splits the text into tokens. A token could be a sentence or a word.
```
text = """Bitcoin is a cryptocurrency invented in 2008 by an unknown person or group of people using the name Satoshi Nakamoto. The currency began use in 2009 when its implementation was released as open-source software."""
@ -106,8 +106,6 @@ text = """Bitcoin is a cryptocurrency invented in 2008 by an unknown person or g
2. Tokenize this text using `word_tokenize` from NLTK.
_Resources: [How to Get Started with NLP 6](https://www.analyticsvidhya.com/blog/2019/07how-get-started-nlp-6-unique-ways-perform-tokenization/)_
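A short sketch of both tokenization levels with NLTK (the `punkt` models need to be downloaded once):

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt")  # tokenizer models; recent NLTK versions may also ask for "punkt_tab"

text = """Bitcoin is a cryptocurrency invented in 2008 by an unknown person or group of people using the name Satoshi Nakamoto. The currency began use in 2009 when its implementation was released as open-source software."""

print(sent_tokenize(text))  # sentence-level tokens
print(word_tokenize(text))  # word-level tokens
```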
---
---
@ -206,23 +204,23 @@ Steps:
> Note: The sample 3x3 table mentioned is a small representation of the expected output for demonstration purposes. It's not necessary to drop columns in this context.
3. Show the token counts (obtained with the above-mentioned steps) of the fourth tweet.
3. Show the token counts (obtained with the above-mentioned steps) of the fourth tweet.
4. Using the word counter, show the 15 most used tokenized words in the datasets' tweets
4. Using the word counter, show the 15 most used tokenized words in the datasets' tweets
5. Add to your `count_vectorized_df` a `label` column considering the following:
- 1: Positive
- 0: Neutral
- -1: Negative
The final DataFrame should be similar to the below:
| | ... | label |
|---:|-------:|--------:|
| 0 | ... | 1 |
| 1 | ... | -1 |
| 2 | ... | -1 |
| 3 | ... | -1 |
| | ... | label |
| --: | --: | ----: |
| 0 | ... | 1 |
| 1 | ... | -1 |
| 2 | ... | -1 |
| 3 | ... | -1 |
_Resources: [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)_
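A small sketch of the counting step on toy tweets (the texts and labels below are only placeholders for the real dataset):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy tweets standing in for the real dataset.
tweets = [
    "great earnings report today",
    "the market is crashing badly",
    "nothing new to report",
    "another disappointing quarter",
]
labels = [1, -1, 0, -1]  # 1: positive, 0: neutral, -1: negative

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(tweets)

count_vectorized_df = pd.DataFrame(
    counts.toarray(), columns=vectorizer.get_feature_names_out()
)
count_vectorized_df["label"] = labels
print(count_vectorized_df)
```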


@ -1,20 +1,7 @@
# NumPy
## NumPy
The goal of this day is to understand practical usage of **NumPy**. **NumPy** is a commonly used Python data analysis package. By using **NumPy**, you can speed up your workflow, and interface with other packages in the Python ecosystem, like scikit-learn, that use **NumPy** under the hood. **NumPy** was originally developed in the mid 2000s, and arose from an even older package called Numeric. This longevity means that almost every data analysis or machine learning package for Python leverages **NumPy** in some way.
### Exercises of the day
- Exercise 0: Environment and libraries
- Exercise 1: Your first NumPy array
- Exercise 2: Zeros
- Exercise 3: Slicing
- Exercise 4: Random
- Exercise 5: Split, concatenate, reshape arrays
- Exercise 6: Broadcasting and Slicing
- Exercise 7: NaN
- Exercise 8: Wine
- Exercise 9: Football tournament
### Virtual Environment
- Python 3.x
@ -26,53 +13,52 @@ I suggest to use the most recent one.
### Resources
- https://medium.com/fintechexplained/why-should-we-use-NumPy-c14a4fb03ee9
- https://numpy.org/doc/
- https://jakevdp.github.io/PythonDataScienceHandbook/
- [Why Should We Use NumPy](https://medium.com/fintechexplained/why-should-we-use-NumPy-c14a4fb03ee9)
- [NumPy Documentation](https://numpy.org/doc/)
- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
---
---
# Exercise 0: Environment and libraries
## Exercise 0: Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries and to learn to launch a `jupyter notebook`. Jupyter notebooks are very convenient as they allow to write and test code within seconds. However, it really easy to implement instable and not reproducible code using notebooks. Keep the notebook and the underlying code clean. An article below detail when the Notebook should be used. Notebook can be used for most of the exercises of the piscine as the goal is to experiment A LOT. But no worries, you'll be asked to build a more robust structure for all the projects.
The goal of this exercise is to set up the Python work environment with the required libraries and to learn to launch a `jupyter notebook`. Jupyter notebooks are very convenient as they allow you to write and test code within seconds. However, it is really easy to write unstable and non-reproducible code using notebooks. Keep the notebook and the underlying code clean. Notebooks can be used for most of the exercises of the piscine as the goal is to experiment a lot. But no worries, you'll be asked to build a more robust structure for all the projects.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend to use:
I suggest utilizing:
- the **last stable versions** of Python. However, for educational purpose you will install a specific version of Python in this exercise.
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the libraries required
- The **latest stable version** of Python for your work. However, in this exercise, you'll install and use a specific Python version for educational purposes.
- Choose a virtual environment that aligns with your familiarity. Common choices among Data Science practitioners are `virtualenv` and `conda`.
- Install the most recent versions of the required libraries to ensure compatibility and access to the latest features
1. Create a virtual environment named `ex00`, with Python `3.8`, with the following libraries: `numpy`, `jupyter`. Save the installed packages in `requirements.txt` in the current directory.
1. Begin by creating a virtual environment named `ex00` that utilizes Python version `3.8`. Install the required libraries `numpy` and `jupyter`. Save the installed packages to a file named `requirements.txt`, located in the current directory.
2. Launch a `jupyter notebook` on port `8891` and create a notebook named `Notebook_ex00`. `JupyterLab` can be used instead of Jupyter Notebook here.
2. Launch a `jupyter` notebook or `JupyterLab` on port `8891`. Create a new notebook named `Notebook_ex00`.
3. Put the text `H1 TITLE` as **heading level 1** and `H2 TITLE` as **heading level 2** in the first cell.
3. In the first cell of the notebook, set `H1 TITLE` as a **heading level 1** and `H2 TITLE` as a **heading level 2**.
4. Run `print("Buy the dip ?")` in the second cell
4. Execute `print("Buy the dip ?")` in the second cell to display the message.
### Resources:
- https://www.python.org/
- https://docs.conda.io/
- https://jupyter.org/
- https://numpy.org/
- https://towardsdatascience.com/jypyter-notebook-shortcuts-bf0101a98330
- https://odsc.medium.com/why-you-should-be-using-jupyter-notebooks-ea2e568c59f2
- https://stackoverflow.com/questions/50777849/from-conda-create-requirements-txt-for-pip3
- [python](https://www.python.org/)
- [Conda Documentation](https://docs.conda.io/)
- [jupyter](https://jupyter.org/)
- [numpy](https://numpy.org/)
- [Jupyter Notebook Shortcuts](https://towardsdatascience.com/jypyter-notebook-shortcuts-bf0101a98330)
- [Why You Should be Using Jupyter Notebooks](https://odsc.medium.com/why-you-should-be-using-jupyter-notebooks-ea2e568c59f2)
---
---
# Exercise 1: Your first NumPy array
## Exercise 1: Your first NumPy array
The goal of this exercise is to use many Python data types in **NumPy** arrays. **NumPy** arrays are intensively used in **NumPy** and **Pandas**. They are flexible and allow to use optimized **NumPy** underlying functions.
The objective of this exercise is to familiarize yourself with incorporating various Python data types into **NumPy** arrays. **NumPy** arrays play a vital role in both **NumPy** and **Pandas**, offering flexibility and optimized functionalities.
1. Create a NumPy array that contains: an integer, a float, a string, a dictionary, a list, a tuple, a set and a boolean. Add the following code at the end of your python file or in a cell of the jupyter notebook:
1. Create a NumPy array that contains: an `integer`, a `float`, a `string`, a `dictionary`, a `list`, a `tuple`, a `set` and a `boolean`. Add the following code at the end of your python file or in a cell of the jupyter notebook:
```python
for i in your_np_array:
@ -83,7 +69,7 @@ for i in your_np_array:
---
# Exercise 2: Zeros
## Exercise 2: Zeros
The goal of this exercise is to learn to create a NumPy array with 0s.
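The questions themselves are not visible in this hunk, but the audit criteria further down expect a `(300,)` array of zeros reshaped to `(3, 100)`; a minimal sketch:

```python
import numpy as np

zeros = np.zeros(300)             # 1-D array of 300 zeros, shape (300,)
reshaped = zeros.reshape(3, 100)  # same values, shape (3, 100)
print(zeros.shape, reshaped.shape)
```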
@ -94,20 +80,44 @@ The goal of this exercise is to learn to create a NumPy array with 0s.
---
# Exercise 3: Slicing
## Exercise 3: Slicing
The goal of this exercise is to learn NumPy indexing/slicing. It allows you to access values of a NumPy array efficiently and without a for loop.
1. Create a NumPy array of dimension 1 that contains all integers from 1 to 100 ordered.
2. Without using a for loop and using the array created in Q1, create an array that contain all odd integers. The expected output is: `np.array([1,3,...,99])`. _Hint_: it takes one line
3. Without using a for loop and using the array created in Q1, create an array that contain all even integers reversed. The expected output is: `np.array([100,98,...,2])`. _Hint_: it takes one line
4. Using array of Q1, set the value of every 3 elements of the list (starting with the second) to 0. The expected output is: `np.array([[1,0,3,4,0,...,0,99,100]])`
2. Without using a for loop and using the array created in Q1, create an array that contains all odd integers. The expected output is:
```console
[ 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47
49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95
97 99]
```
3. Without using a for loop and using the array created in Q1, create an array that contains all even integers reversed. The expected output is:
```console
[100 98 96 94 92 90 88 86 84 82 80 78 76 74 72 70 68 66
64 62 60 58 56 54 52 50 48 46 44 42 40 38 36 34 32 30
28 26 24 22 20 18 16 14 12 10 8 6 4 2]
```
4. Using the array from Q1, set every third element (starting with the second) to 0. The expected output is:
```console
[ 1 0 3 4 0 6 7 0 9 10 0 12 13 0 15 16 0 18
19 0 21 22 0 24 25 0 27 28 0 30 31 0 33 34 0 36
37 0 39 40 0 42 43 0 45 46 0 48 49 0 51 52 0 54
55 0 57 58 0 60 61 0 63 64 0 66 67 0 69 70 0 72
73 0 75 76 0 78 79 0 81 82 0 84 85 0 87 88 0 90
91 0 93 94 0 96 97 0 99 100]
```
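A minimal sketch consistent with the expected outputs above (one line per question, no for loop):

```python
import numpy as np

integers = np.arange(1, 101)     # question 1: integers 1..100
odds = integers[::2]             # question 2: 1, 3, ..., 99
evens_reversed = integers[::-2]  # question 3: 100, 98, ..., 2
modified = integers.copy()
modified[1::3] = 0               # question 4: every third element, starting with the second
print(odds, evens_reversed, modified, sep="\n")
```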
---
---
# Exercise 4: Random
## Exercise 4: Random
The goal of this exercise is to learn to generate random data.
In Data Science it is extremely useful to generate random data for many reasons:
@ -118,7 +128,7 @@ NumPy proposes a lot of options to generate random data. In statistics, assumpti
- Normal: The normal distribution is the most important probability distribution in statistics because it fits many natural phenomena. For example, if you need to generate a data sample that represents **Heights of 14 Year Old Girls** it can be done using the normal distribution. In that case, we need two parameters: the mean (1m51) and the standard deviation (0.0741m). NumPy provides `randn` to generate normal distributions (among others)
https://numpy.org/doc/stable/reference/random/generator.html
[Random Generator](https://numpy.org/doc/stable/reference/random/generator.html)
1. Set the seed to 888
2. Generate a **one-dimensional** array of size 100 with a normal distribution
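A minimal sketch of these first two questions:

```python
import numpy as np

np.random.seed(888)                   # question 1: fix the seed for reproducibility
normal_sample = np.random.randn(100)  # question 2: 100 draws from a standard normal distribution
print(normal_sample[:5])
```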
@ -129,7 +139,7 @@ https://numpy.org/doc/stable/reference/random/generator.html
---
# Exercise 5: Split, concatenate, reshape arrays
## Exercise 5: Split, concatenate, reshape arrays
The goal of this exercise is to learn to concatenate and reshape arrays.
@ -142,21 +152,27 @@ The goal of this exercise is to learn to concatenate and reshape arrays.
4. Reshape the previous array into:
```console
array([[ 1, ... , 10],
...
[ 91, ... , 100]])
[[ 1 2 3 4 5 6 7 8 9 10]
[ 11 12 13 14 15 16 17 18 19 20]
...
[ 81 82 83 84 85 86 87 88 89 90]
[ 91 92 93 94 95 96 97 98 99 100]]
```
Print what you've created in the previous steps.
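A minimal sketch that reproduces the arrays printed above (the first questions, not visible in this hunk, ask for the two halves and their concatenation):

```python
import numpy as np

first = np.arange(1, 51)                    # integers 1..50
second = np.arange(51, 101)                 # integers 51..100
combined = np.concatenate((first, second))  # single array 1..100
grid = combined.reshape(10, 10)             # question 4: the 10x10 layout shown above
print(first, second, combined, grid, sep="\n")
```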
---
---
# Exercise 6: Broadcasting and Slicing
## Exercise 6: Broadcasting and Slicing
The goal of this exercise is to learn to access values of n-dimensional arrays efficiently.
1. Create an 2-dimensional array size 9,9 of 1s. Each value has to be an `int8`.
2. Using **slicing**, output this array:
**Using a for loop is not allowed in this exercise.**
1. Generate a 2-dimensional array of size 9x9, with all elements initialized to 1 and of type `int8`.
2. Using **slicing**, create the following array:
```python
array([[1, 1, 1, 1, 1, 1, 1, 1, 1],
@ -167,38 +183,61 @@ The goal of this exercise is to learn to access values of n-dimensional arrays e
[1, 0, 1, 0, 0, 0, 1, 0, 1],
[1, 0, 1, 1, 1, 1, 1, 0, 1],
[1, 0, 0, 0, 0, 0, 0, 0, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int8)
[1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=np.int8)
```
3. Using **broadcasting** create the ouptu matrix starting from these two arrays:
3. Using **broadcasting** create an output matrix based on the following two arrays:
```python
array_1 = np.array([1,2,3,4,5], dtype=int8)
array_2 = np.array([1,2,3], dtype=int8)
...
# output matrix
array([[ 1, 2, 3],
[ 2, 4, 6],
[ 3, 6, 9],
[ 4, 8, 12],
[ 5, 10, 15]], dtype=int8)
array_1 = np.array([1,2,3,4,5], dtype=np.int8)
array_2 = np.array([1,2,3], dtype=np.int8)
```
https://jakevdp.github.io/PythonDataScienceHandbook/ (section: Computation on Arrays: Broadcasting)
Expected output:
```console
[[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]]
[[1 1 1 1 1 1 1 1 1]
[1 0 0 0 0 0 0 0 1]
[1 0 1 1 1 1 1 0 1]
[1 0 1 0 0 0 1 0 1]
[1 0 1 0 1 0 1 0 1]
[1 0 1 0 0 0 1 0 1]
[1 0 1 1 1 1 1 0 1]
[1 0 0 0 0 0 0 0 1]
[1 1 1 1 1 1 1 1 1]]
[[ 1 2 3]
[ 2 4 6]
[ 3 6 9]
[ 4 8 12]
[ 5 10 15]]
```
### Resources
[Computation on Arrays: Broadcasting](https://jakevdp.github.io/PythonDataScienceHandbook/)
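One possible solution consistent with the expected output above (slicing only for question 2, broadcasting for question 3):

```python
import numpy as np

# Question 1: a 9x9 array of ones, typed int8.
x = np.ones((9, 9), dtype=np.int8)

# Question 2: carve the nested-squares pattern with slicing only.
y = x.copy()
y[1:8, 1:8] = 0
y[2:7, 2:7] = 1
y[3:6, 3:6] = 0
y[4, 4] = 1

# Question 3: outer product via broadcasting.
array_1 = np.array([1, 2, 3, 4, 5], dtype=np.int8)
array_2 = np.array([1, 2, 3], dtype=np.int8)
product = array_1.reshape(5, 1) * array_2

print(x, y, product, sep="\n")
```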
---
---
# Exercise 7: NaN
## Exercise 7: NaN
The goal of this exercise is to learn to deal with missing data in NumPy and to manipulate NumPy arrays.
The goal of this exercise is to handle missing data in NumPy and manipulate arrays effectively.
Let us consider a 2-dimensional array that contains the grades at the past two exams. Some of the students missed the first exam. As the grade is missing it has been replaced with a `NaN`.
Let's consider a 2-dimensional array containing grades from the last two exams. Some students missed the first exam, so their grades are replaced with `NaN`.
1. Using `np.where` create a third column that is equal to the grade of the first exam if it exists and the second else. Add the column as the third column of the array.
**Using a for loop or if/else statement is not allowed in this exercise.**
To simulate this scenario, we'll create a mock dataset using NumPy. Here's a snippet of code to generate this dataset:
```python
import numpy as np
@ -209,46 +248,83 @@ grades[[1,2,5,7], [0,0,0,0]] = np.nan
print(grades)
```
---
This code returns:
---
```console
[[ 7. 1.]
[nan 2.]
[nan 8.]
[ 9. 3.]
[ 8. 9.]
[nan 2.]
[ 8. 2.]
[nan 6.]
[ 9. 2.]
[ 8. 5.]]
```
# Exercise 8: Wine
1. Using `np.where`, create a third column that takes the grade of the first exam if available; otherwise, it uses the grade from the second exam. Add this column as the third column of the array.
The goal of this exercise is to learn to perform a basic data analysis on real data using NumPy.
**Using a for loop or if/else statement is not allowed in this exercise.**
The data set that will be used for this exercise is the red wine data set.
Expected output:
https://archive.ics.uci.edu/ml/datasets/wine+quality
How to tell if a given 2D array has null columns?
1. Using `genfromtxt` load the data and reduce the size of the numpy array by optimizing the types. The sum of absolute differences between the original data set and the "memory" optimized one has to be smaller than 1.10**-3. I suggest to use `np.float32`. Check that the numpy array weights **76800 bytes\*\*.
2. Print 2nd, 7th and 12th rows as a two dimensional array
3. Is there any wine with a percentage of alcohol greater than 20% ? Return True or False
4. What is the average % of alcohol on all wines in the data set ? If needed, drop `np.nan` values
5. Compute the minimum, the maximum, the 25th percentile, the 50th percentile, the 75th percentile, the mean of the pH
6. Compute the average quality of the wines having the 20% least sulphates
7. Compute the mean of all variables for wines having the best quality. Same question for the wines having the worst quality
```console
[[ 7. 1. 7.]
[nan 2. 2.]
[nan 8. 8.]
[ 9. 3. 9.]
[ 8. 9. 8.]
[nan 2. 2.]
[ 8. 2. 8.]
[nan 6. 6.]
[ 9. 2. 9.]
[ 8. 5. 8.]]
```
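A minimal sketch of question 1, using only the first four rows of the grades array shown above:

```python
import numpy as np

# First four rows of the grades array shown above.
grades = np.array([[7., 1.], [np.nan, 2.], [np.nan, 8.], [9., 3.]])

# Grade of the first exam if present, otherwise the grade of the second exam.
fallback = np.where(np.isnan(grades[:, 0]), grades[:, 1], grades[:, 0])

# Append it as a third column (no loop, no if/else).
print(np.hstack((grades, fallback[:, None])))
```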
---
---
# Exercise 9: Football tournament
## Exercise 8: Wine
The goal of this exercise is to learn to use permutations, complex
The goal of this exercise is to perform fundamental data analysis on real data using NumPy.
A Football tournament is organized in your city. There are 10 teams and the director of the tournaments wants you to create a first round as exciting as possible. To do so, you are allowed to choose the pairs. As a former data scientist, you implemented a model based on teams' current season performance. This models predicts the score difference between two teams. You used this algorithm to predict the score difference for every possible pair.
The matrix returned is a 2-dimensional array that contains in (i,j) the score difference between team i and j. The matrix is in [model_forecasts.txt](data/model_forecasts.txt).
The dataset chosen for this task was the [red wine dataset](./data/winequality-red.csv). You can find more info [HERE](./data/)
Using this output, what are the pairs that will give the most interesting matches ?
1. Load the data using `genfromtxt`, specifying the delimiter as ';', and optimize the numpy array size by reducing the data types. Use `np.float32` and verify that the resulting numpy array weighs **76800 bytes**.
2. Display the 2nd, 7th, and 12th rows as a two-dimensional array. Exclude `np.nan` values if present.
3. Determine if there is any wine in the dataset with an alcohol percentage greater than 20%. Return True or False.
4. Calculate the average alcohol percentage across all wines in the dataset. Exclude `np.nan` values if present.
5. Compute various statistical measures (minimum, maximum, 25th percentile, 50th percentile, 75th percentile and the mean for the pH values).
> _Note: Using `percentile` or `median` may give different results depending on the duplicate values in the column. If you do not have my results please use `percentile`._
6. Find the average quality score of wines with the 20% least sulphate content.
**Tip:** The first step is to get the 20th percentile of the `sulphates` column, then create a boolean array that contains `True` if the value is smaller than that percentile, then select those rows from the `quality` column and compute the `mean`.
7. Compute the mean of all variables for wines with the best quality. Also, do the same for wines with the worst quality.
**Tip:** This can be done in three steps: get the max, create a boolean mask that indicates rows with max quality, use this mask to subset the rows with the best quality and compute the mean along axis 0.
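A minimal sketch of the first questions, assuming the column order given in the data documentation (pH is the 9th attribute, alcohol the 11th):

```python
import numpy as np

# Question 1: load with ';' as delimiter and downcast to float32.
data = np.genfromtxt("winequality-red.csv", delimiter=";", dtype=np.float32)
print(data.nbytes)  # expected to be 76800 bytes

# Question 2: 2nd, 7th and 12th rows as a two-dimensional array.
print(data[[1, 6, 11], :])

# Question 3: any wine with more than 20% alcohol?
print(bool(np.nanmax(data[:, 10]) > 20))

# Question 4: average alcohol percentage, ignoring NaN values.
print(np.nanmean(data[:, 10]))

# Question 5: pH statistics.
ph = data[:, 8]
print(np.nanmin(ph), np.nanpercentile(ph, [25, 50, 75]), np.nanmean(ph), np.nanmax(ph))
```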
---
## Exercise 9: Football tournament
This exercise focuses on utilizing permutations and complex computations.
A Football tournament is underway in your city involving 10 teams. The tournament director seeks an engaging first round and has delegated the pairing decisions to you.
Leveraging your expertise as a former data scientist, you've developed a predictive model based on teams' current season performance. This model forecasts the score difference between any two teams.
The model generates a 2-dimensional array stored in [model_forecasts.txt](data/model_forecasts.txt). Each (i, j) entry in this matrix signifies the predicted score difference between Team i and Team j.
The objective is to determine the pairs that will result in the most interesting matches.
If a team wins 7-1 the match is obviously less exciting than a match where the winner wins 2-1.
The criterion that corresponds to **the pairs that will give the most interesting matches** is **the pairing that minimizes the sum of squared differences**
@ -256,13 +332,11 @@ The criteria that corresponds to **the pairs that will give the most interesting
The expected output is:
```console
[[m1_t1 m2_t1 m3_t1 m4_t1 m5_t1]
[m1_t2 m2_t2 m3_t2 m4_t2 m5_t2]]
[[m1_t1 m2_t1 m3_t1 m4_t1 m5_t1]
[m1_t2 m2_t2 m3_t2 m4_t2 m5_t2]]
```
- m1_t1 stands for match1_team1
- m1_t1 plays against m1_t2 ...
**Usage of for loop is not allowed, you may need to use the library** `itertools` **to create permutations**
https://docs.python.org/3.9/library/itertools.html
**Usage of for loop is not allowed, you may need to use the library [itertools](https://docs.python.org/3.9/library/itertools.html) to create permutations.**
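A brute-force sketch using `itertools.permutations` on a random stand-in matrix (the real forecasts come from `model_forecasts.txt`): each ordering of the 10 teams is read two by two as 5 matches, and the pairing with the smallest sum of squared predicted differences is kept.

```python
import itertools
import numpy as np

# Random stand-in for the forecast matrix; entry (i, j) is the predicted
# score difference between team i and team j.
rng = np.random.default_rng(0)
scores = rng.normal(size=(10, 10)).tolist()  # plain lists index faster than NumPy scalars here

def pairing_cost(perm):
    """Sum of squared predicted differences for the 5 matches read from one permutation."""
    pairs = list(zip(perm[::2], perm[1::2]))
    return sum(scores[i][j] ** 2 for i, j in pairs), pairs

# Brute force over all 10! orderings (takes on the order of tens of seconds;
# equivalent orderings of the same pairing are simply re-evaluated).
best_cost, best_pairs = min(pairing_cost(p) for p in itertools.permutations(range(10)))
print(best_cost, best_pairs)
```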


@ -1,7 +1,5 @@
#### Exercise 0: Environment and libraries
##### The exercise is validated if all questions of the exercise are validated
##### Install the virtual environment with `requirements.txt`
##### Activate the virtual environment. If you used `conda`, run `conda activate ex00`
@ -33,13 +31,13 @@
#### Exercise 1: Your first NumPy array
##### Add cell and run `type(your_numpy_array)`
##### Add a cell and execute `type(your_numpy_array)`.
###### Is the your_numpy_array an NumPy array? It can be checked with that should be equal to `numpy.ndarray`.
###### Is `your_numpy_array` identified as a NumPy array? It should display as `numpy.ndarray`.
##### Run all the cells of the notebook or `python main.py`
##### Execute all the cells within the notebook or use `python main.py`.
###### Are the types printed are as follows?
###### Can you confirm that the types printed match the following:
```
<class 'int'>
@ -60,11 +58,43 @@
#### Exercise 2: Zeros
##### The exercise is validated if all questions of the exercise are validated
###### For question 1, does the solution use `np.zeros` and is the shape of the array `(300,)`like bellow?
###### For question 1, does the solution use `np.zeros` and is the shape of the array `(300,)`?
```console
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
```
###### For question 2, does the solution use `reshape` and is the shape of the array `(3, 100)`?
###### For question 2, does the solution use `reshape` and is the shape of the array `(3, 100)` like below?
```console
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0.]]
```
---
@ -72,19 +102,44 @@
#### Exercise 3: Slicing
##### The exercise is validated if all questions of the exercise are validated
###### The exercise is validated if the solution doesn't involve a for loop or writing all integers from 1 to 100 and if the array is: `np.array([1,...,100])`. The list from 1 to 100 can be generated with an iterator: `range`. Are the previous requirements fulfilled?
###### For question 1, the solution is validated if it doesn't involve a for loop or writing all integers from 1 to 100, and if the array is `np.array([1,...,100])`. The list from 1 to 100 can be generated with an iterator: `range`. Were the previous requirements fulfilled?
###### For question 1, does the output look like below?
###### For question 2, is the solution `integers[::2]`?
```console
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
91 92 93 94 95 96 97 98 99 100]
```
###### For question 3, is the solution `integers[::-2]`?
###### For question 2, does the output look like below?
###### For question 4, is the array `np.array([1,0,3,4,0,...,0,99,100])`? There are at least two ways to get this results without for loop. The first one uses `integers[1::3] = 0` and the second involves creating a boolean array that indexes the array:
```console
[ 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47
49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95
97 99]
```
```python
mask = (integers+1)%3 == 0
integers[mask] = 0
###### For question 3, does the output look like below?
```console
[100 98 96 94 92 90 88 86 84 82 80 78 76 74 72 70 68 66
64 62 60 58 56 54 52 50 48 46 44 42 40 38 36 34 32 30
28 26 24 22 20 18 16 14 12 10 8 6 4 2]
```
###### For question 4, does the output look like below?
```console
[ 1 0 3 4 0 6 7 0 9 10 0 12 13 0 15 16 0 18
19 0 21 22 0 24 25 0 27 28 0 30 31 0 33 34 0 36
37 0 39 40 0 42 43 0 45 46 0 48 49 0 51 52 0 54
55 0 57 58 0 60 61 0 63 64 0 66 67 0 69 70 0 72
73 0 75 76 0 78 79 0 81 82 0 84 85 0 87 88 0 90
91 0 93 94 0 96 97 0 99 100]
```
---
@ -93,18 +148,16 @@ integers[mask] = 0
#### Exercise 4: Random
##### The exercise is validated if all questions of the exercise are validated
> Note: For this exercise, as the results may change depending on the version of the package or the OS, I give the code to correct the exercise. If the code is correct and the output is not the same as mine, it is accepted.
##### For this exercise, as the results may change depending on the version of the package or the OS, I give the code to correct the exercise. If the code is correct and the output is not the same as mine, it is accepted.
###### For question 1, does the solution contain `np.random.seed(888)`?
###### For question 1, is the solution `np.random.seed(888)`?
###### For question 2, does the solution contain `np.random.randn(100)`?
###### For question 2, is the output of the solution the same as `np.random.randn(100)`? The value of the first element is `0.17620087373662233`.
###### For question 3, is the solution `np.random.randint(1,11,(8,8))`?
###### For question 3, does the solution contain `np.random.randint(1,11,(8,8))`?
```console
Given the NumPy version and the seed, you should have this output:
Given the NumPy version and the seed, this is my output:
array([[ 7, 4, 8, 10, 2, 1, 1, 10],
[ 4, 1, 7, 4, 3, 5, 2, 8],
@ -116,10 +169,10 @@ integers[mask] = 0
[ 4, 4, 9, 2, 8, 5, 9, 5]])
```
###### For question 4, is the solution `np.random.randint(1,18,(4,2,5))`?
###### For question 4, does the solution contain `np.random.randint(1,18,(4,2,5))`?
```console
Given the NumPy version and the seed, you should have this output:
Given the NumPy version and the seed, this is my output:
array([[[14, 16, 8, 15, 14],
[17, 13, 1, 4, 17]],
@ -140,25 +193,34 @@ integers[mask] = 0
#### Exercise 5: Split, concatenate, reshape arrays
##### The exercise is validated if all questions of the exercise are validated
###### For question 1, is the generated array based on an iterator such as `range` or `np.arange`? Check that 50 is part of the array.
###### For question 2, is the generated array based on an iterator such as `range` or `np.arange`? Check that 100 is part of the array.
###### For question 3, is the array concatenated this way `np.concatenate((array1,array2))`?
###### For question 4, is the result the following?
###### Run the exercise and check if the output is the same as below:
```console
array([[ 1, ... , 10],
...
[ 91, ... , 100]])
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
49 50]
[ 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86
87 88 89 90 91 92 93 94 95 96 97 98 99 100]
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
91 92 93 94 95 96 97 98 99 100]
[[ 1 2 3 4 5 6 7 8 9 10]
[ 11 12 13 14 15 16 17 18 19 20]
[ 21 22 23 24 25 26 27 28 29 30]
[ 31 32 33 34 35 36 37 38 39 40]
[ 41 42 43 44 45 46 47 48 49 50]
[ 51 52 53 54 55 56 57 58 59 60]
[ 61 62 63 64 65 66 67 68 69 70]
[ 71 72 73 74 75 76 77 78 79 80]
[ 81 82 83 84 85 86 87 88 89 90]
[ 91 92 93 94 95 96 97 98 99 100]]
```
The easiest way is to use `array.reshape(10,10)`.
https://jakevdp.github.io/PythonDataScienceHandbook/ (section: The Basics of NumPy Arrays)
###### Can you confirm that the student didn't just print the expected output?
---
@ -166,54 +228,44 @@ https://jakevdp.github.io/PythonDataScienceHandbook/ (section: The Basics of Num
#### Exercise 6: Broadcasting and Slicing
##### The exercise is validated if all questions of the exercise are validated
###### For question 1, is the output the same as the following?
`np.ones([9,9], dtype=np.int8)`
###### For question 2, is the output the following?
###### Run the exercise and check if the output is the same as below:
```console
array([[1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 0, 0, 0, 0, 0, 0, 0, 1],
[1, 0, 1, 1, 1, 1, 1, 0, 1],
[1, 0, 1, 0, 0, 0, 1, 0, 1],
[1, 0, 1, 0, 1, 0, 1, 0, 1],
[1, 0, 1, 0, 0, 0, 1, 0, 1],
[1, 0, 1, 1, 1, 1, 1, 0, 1],
[1, 0, 0, 0, 0, 0, 0, 0, 1],
[1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int8)
[[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]]
[[1 1 1 1 1 1 1 1 1]
[1 0 0 0 0 0 0 0 1]
[1 0 1 1 1 1 1 0 1]
[1 0 1 0 0 0 1 0 1]
[1 0 1 0 1 0 1 0 1]
[1 0 1 0 0 0 1 0 1]
[1 0 1 1 1 1 1 0 1]
[1 0 0 0 0 0 0 0 1]
[1 1 1 1 1 1 1 1 1]]
[[ 1 2 3]
[ 2 4 6]
[ 3 6 9]
[ 4 8 12]
[ 5 10 15]]
```
##### The solution of question 2 is not accepted if the values of the array have been changed one by one manually. The usage of the for loop is not allowed neither.
##### Check the solution for signs of cheating, such as:
Here is an example of a possible solution:
- The values of the array have been changed one by one manually.
- Using a for loop, which is not allowed.
- Printing the full output given in the readme.
```python
x[1:8,1:8] = 0
x[2:7,2:7] = 1
x[3:6,3:6] = 0
x[4,4] = 1
```
###### For question 3, is the output the following?
```console
array([[ 1, 2, 3],
[ 2, 4, 6],
[ 3, 6, 9],
[ 4, 8, 12],
[ 5, 10, 15]], dtype=int8)
```
##### The solution of question 3 is not accepted if the values of the array have been changed one by one manually. The usage of the for loop is not allowed neither.
Here is an example of a possible solution:
```python
np.reshape(arr_1, (5, 1)) * arr_2
```
###### Can you confirm that there was no cheating in the solution?
---
@ -221,37 +273,19 @@ Here is an example of a possible solution:
#### Exercise 7: NaN
##### The exercise is validated if all questions of the exercise are validated
###### Without having used a for loop or having filled the array manually, is the output the following?
```console
[[ 7. 1. 7.]
[nan 2. 2.]
[nan 8. 8.]
[ 9. 3. 9.]
[ 8. 9. 8.]
[nan 2. 2.]
[ 8. 2. 8.]
[nan 6. 6.]
[ 9. 2. 9.]
[ 8. 5. 8.]]
```
There are two steps in this exercise:
- Create the vector that contains the grade of the first exam if available or the second. This can be done using `np.where`:
```python
np.where(np.isnan(grades[:, 0]), grades[:, 1], grades[:, 0])
```
- Add this vector as third column of the array. Here are two ways:
```python
np.insert(arr = grades, values = new_vector, axis = 1, obj = 2)
np.hstack((grades, new_vector[:, None]))
[nan 2. 2.]
[nan 8. 8.]
[ 9. 3. 9.]
[ 8. 9. 8.]
[nan 2. 2.]
[ 8. 2. 8.]
[nan 6. 6.]
[ 9. 2. 9.]
[ 8. 5. 8.]]
```
---
@ -260,60 +294,81 @@ There are two steps in this exercise:
#### Exercise 8: Wine
##### The exercise is validated if all questions of the exercise are validated
###### Was the text file successfully loaded into a NumPy array using `genfromtxt('winequality-red.csv', delimiter=';')` and optimized for memory usage, weighing `76800` bytes or less?
###### Has the text file successfully been loaded in a NumPy array with `genfromtxt('winequality-red.csv', delimiter=';')` and the reduced arrays weights **76800 bytes**?
Use this in the solution to confirm:
###### Is the output the following?
```Python
```python
array([[ 7.4 , 0.7 , 0. , 1.9 , 0.076 , 11. , 34. ,
0.9978, 3.51 , 0.56 , 9.4 , 5. ],
[ 7.4 , 0.66 , 0. , 1.8 , 0.075 , 13. , 40. ,
0.9978, 3.51 , 0.56 , 9.4 , 5. ],
[ 6.7 , 0.58 , 0.08 , 1.8 , 0.097 , 15. , 65. ,
0.9959, 3.28 , 0.54 , 9.2 , 5. ]])
# Check the optimized data size
optimized_size = optimized_data.nbytes
# Verify if the dataset size criterion is met
if optimized_size <= 76800:
print("Data optimized successfully.")
else:
print("Optimization criteria not met.")
```
This slicing gives the answer `my_data[[1,6,11],:]`.
##### For question 2:
###### Is the answer False? There are many ways to get the answer: find the maximum or check values greater than 20.
"Display the 2nd, 7th, and 12th rows as a two-dimensional array. Exclude `np.nan` values if present."
###### Is the answer 10.422983114446529?
###### Is the output in line with the data present in the provided dataset in the subject?
###### Is the answer the following?
##### For question 3:
"Determine if there is any wine in the dataset with an alcohol percentage greater than 20%. Return True or False."
###### Is the answer `False`?
##### For question 4:
"Calculate the average alcohol percentage across all wines in the dataset. Exclude `np.nan` values if present."
###### Is the answer `10.422984`?
##### For question 5:
"Compute various statistical measures (minimum, maximum, 25th percentile, 50th percentile, 75th percentile and the mean for the pH values)."
###### Check if you have the correct results as below:
```console
pH stats
25 percentile: 3.21
50 percentile: 3.31
75 percentile: 3.4
mean: 3.3111131957473416
75 percentile: 3.40
mean: 3.31
min: 2.74
max: 4.01
```
> *Note: Using `percentile` or `median` may give different results depending on the duplicate values in the column. If you do not have my results please use `percentile`.*
> _Note: Using `percentile` or `median` may give different results depending on the duplicate values in the column. If you do not have my results please use `percentile`._
###### Is the answer ~`5.2`? The first step is to get the percentile 20% of the column `sulphates`, then create a boolean array that contains `True` of the value is smaller than the percentile 20%, then select this rows with the column quality and compute the `mean`.
##### For question 6:
"Find the average quality score of wines with the 20% least sulphate content."
###### Is the answer ~`5.2`?
##### For question 7:
Compute the mean of all variables for wines with the best quality. Also, do the same for wines with the worst quality.
###### Is the output for the best wines the following?
```python
array([ 8.56666667, 0.42333333, 0.39111111, 2.57777778, 0.06844444,
13.27777778, 33.44444444, 0.99521222, 3.26722222, 0.76777778,
12.09444444, 8. ])
```console
[ 8.566666 0.4233333 0.39111114 2.5777776 0.06844445 13.277778
33.444443 0.99521226 3.2672222 0.76777774 12.094444 8. ]
```
###### Is the output for the bad wines the following?
```python
array([ 8.36 , 0.8845 , 0.171 , 2.635 , 0.1225 , 11. ,
24.9 , 0.997464, 3.398 , 0.57 , 9.955 , 3. ])
```console
[ 8.359999 0.8845 0.17099999 2.6350002 0.12249999 11.
24.9 0.997464 3.398 0.57000005 9.955 3. ]
```
This can be done in three steps: Get the max, create a boolean mask that indicates rows with max quality, use this mask to subset the rows with the best quality and compute the mean on the axis 0.
---
---


@ -1,8 +1,8 @@
Citation Request:
This dataset is public available for research. The details are described in [Cortez et al., 2009].
This dataset is public available for research. The details are described in [Cortez et al., 2009].
Please include this citation if you plan to use this database:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
@ -10,43 +10,43 @@ Citation Request:
[Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
[bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
1. Title: Wine Quality
1. Title: Wine Quality
2. Sources
Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009
3. Past Usage:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
In the above reference, two datasets were created, using red and white wine samples.
The inputs include objective tests (e.g. PH values) and the output is based on sensory data
(median of at least 3 evaluations made by wine experts). Each expert graded the wine quality
(median of at least 3 evaluations made by wine experts). Each expert graded the wine quality
between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model
these datasets under a regression approach. The support vector machine model achieved the
best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T),
etc. Also, we plot the relative importances of the input variables (as measured by a sensitivity
analysis procedure).
4. Relevant Information:
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine.
For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009].
Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables
Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables
are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
These datasets can be viewed as classification or regression tasks.
The classes are ordered and not balanced (e.g. there are munch more normal wines than
excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent
or poor wines. Also, we are not sure if all input variables are relevant. So
it could be interesting to test feature selection methods.
it could be interesting to test feature selection methods.
5. Number of Instances: red wine - 1599; white wine - 4898.
5. Number of Instances: red wine - 1599; white wine - 4898.
6. Number of Attributes: 11 + output attribute
Note: several of the attributes may be correlated, thus it makes sense to apply some sort of
feature selection.
@ -66,7 +66,7 @@ Citation Request:
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
8. Missing Attribute Values: None


@ -1,32 +1,28 @@
# Financial strategies on the SP500
## Financial strategies on the SP500
In this project we will apply machine to finance. You are a Quant/Data Scientist and your goal is to create a financial strategy based on a signal outputted by a machine learning model that overperforms the [SP500](https://en.wikipedia.org/wiki/S%26P_500).
In this project we will apply machine learning to finance. You are a Quant/Data Scientist and your goal is to create a financial strategy based on a signal outputted by a machine learning model that over-performs the [SP500](https://en.wikipedia.org/wiki/S%26P_500).
The Standard & Poors 500 Index is a collection of stocks intended to reflect the overall return characteristics of the stock market as a whole. The stocks that make up the S&P 500 are selected by market capitalization, liquidity, and industry. Companies to be included in the S&P are selected by the S&P 500 Index Committee, which consists of a group of analysts employed by Standard & Poor's.
The S&P 500 Index originally began in 1926 as the "composite index" comprised of only 90 stocks. According to historical records, the average annual return since its inception in 1926 through 2018 is approximately 10%11%.The average annual return since adopting 500 stocks into the index in 1957 through 2018 is roughly 8%.
The Standard & Poor's 500 Index is a collection of stocks intended to reflect the overall return characteristics of the stock market as a whole. The stocks that make up the S&P 500 are selected by market capitalization, liquidity, and industry. Companies to be included in the S&P are selected by the S&P 500 Index Committee, which consists of a group of analysts employed by Standard & Poor's.
The S&P 500 Index originally began in 1926 as the "composite index" comprised of only 90 stocks. According to historical records, the average annual return since its inception in 1926 through 2018 is approximately 10% to 11%. The average annual return since adopting 500 stocks into the index in 1957 through 2018 is roughly 8%.
As a Quant Researcher, you may beat the SP500 one year or few years. The real challenge though is to beat the SP500 consistently over decades. That's what most hedge funds in the world are trying to do.
The project is divided in parts:
- **Data processing and feature engineering**: Build a dataset: insightful features and the target
- **Machine Learning pipeline**: Train machine learning models on the dataset, select the best model and generate the machine learning signal.
- **Strategy backtesting**: Generate a strategy from the Machine Learning model output and backtest the strategy. As a reminder, the idea here is to see what would have performed the strategy if you would have invested.
### Deliverables
Do not forget to check the ressources of W1D5 and espcially W1D5E4.
- **Strategy backtesting**: Generate a strategy from the Machine Learning model output and backtest the strategy. As a reminder, the idea here is to see what would have performed the strategy if you had invested.
### Data processing and features engineering
The first file contains SP500 index data (OHLC: 4 time-series) and the other file contains the OHLCV data on the SP500 contituents.
The file `HistoricalData.csv` contains the open-high-low-close (OHLC) SP500 index data and the other file, `all_stocks_5yr.csv`, contains the open-high-low-close-volume (OHLCV) data on the SP500 constituents.
- Split the data in train and test. The test set should set from **2017** .
- Your first priority is to build a dataset without leakage !!! NO LEAKAGE !!!
- Split the data in train and test. The test set should start from **2017**.
- Your first priority is to build a dataset without leakage.
Note: Financial data can be complex and tricky to analyse for a lot of reasons. In order to focus on Time Series forecasting, the project gives access to a "simplified" financial dataset. For instance, we consider the composition of the SP500 remains similar over time which is not true and which introduces a "survivor bias". Plus, the data during covid-19 was removed because it may have a significant impact on the backtesting.
Note: Financial data can be complex and tricky to analyse for a lot of reasons. In order to focus on Time Series forecasting, the project gives access to a "simplified" financial dataset. For instance, we consider the composition of the SP500 remains similar over time which is not true and which introduces a "survivor bias". Plus, the data during COVID-19 was removed because it may have a significant impact on the backtesting.
**"No leakage" [intro](<https://en.wikipedia.org/wiki/Leakage_(machine_learning)>) and small guide:**
We assume it is day D and we want to take a position on the next h days on the next day. The position starts on day D+1 (included). To decide wether we take a short or long position the return between day D+1 and D+2 is computed and used as a target. Finally, as features on day contain information until day D 11:59pm, target need to be shifted. As a result, the final dataframe schema is:
**"No leakage" [intro](<https://en.wikipedia.org/wiki/Leakage_(machine_learning)>).**
We assume it is day `D`, and we want to take a position on the next n days. The position starts on day D+1 (included). To decide whether we take a short or long position, the return between day D+1 and D+2 is computed and used as a target. Finally, as the features on day D contain information up to day D 11:59 pm, the target needs to be shifted. As a result, the final DataFrame schema is:
| Index | Features | Target |
| ------- | :------------------------: | ---------------: |
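For illustration only, here is a minimal sketch of building such a shifted target with pandas. The index names (`date`, `ticker`) and the `close` column are assumptions for the example, not the dataset's actual schema:

```python
import pandas as pd

def add_target(prices: pd.DataFrame) -> pd.DataFrame:
    """prices: DataFrame indexed by (date, ticker) with a 'close' column (assumed names)."""
    df = prices.sort_index().copy()
    # Daily return per ticker: the value stored on day D is the return between D-1 and D.
    daily_return = df.groupby(level="ticker")["close"].pct_change()
    # Shift by -2 within each ticker so the value stored on day D becomes the
    # return realised between D+1 and D+2: the features of day D never see the target period.
    df["target"] = daily_return.groupby(level="ticker").shift(-2)
    return df.dropna(subset=["target"])
```

Any equivalent construction is fine, as long as the features of day D only use information available up to day D.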
@ -53,7 +49,7 @@ We assume it is day D and we want to take a position on the next h days on the n
- Time Series split (plot below)
- Make sure the last fold of the train set does not overlap on the test set.
- Make sure the folds do not contain data from the same day. The data should be split on the dates.
- Plot your cross validation as follow:
- Plot your cross validation as follows:
![alt text][blocking]
@ -63,12 +59,12 @@ We assume it is day D and we want to take a position on the next h days on the n
[timeseries]: Time_series_split.png "Time Series split"
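If you go for the Time Series split, one possible way (index names assumed, as above) to produce folds that are split on dates rather than on rows is to run scikit-learn's `TimeSeriesSplit` on the unique dates and map the selected dates back to row positions:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

def date_based_folds(df, n_splits=5):
    """Yield (train_idx, val_idx) positional indices; no day is split across folds."""
    all_dates = df.index.get_level_values("date")
    unique_dates = all_dates.unique().sort_values()
    for train_d, val_d in TimeSeriesSplit(n_splits=n_splits).split(unique_dates):
        train_idx = np.where(all_dates.isin(unique_dates[train_d]))[0]
        val_idx = np.where(all_dates.isin(unique_dates[val_d]))[0]
        yield train_idx, val_idx
```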
Once you'll have run the gridsearch on the cross validation (choose either Blocking or Time Series split), you'll select the best pipeline on the train set and save it as `selected_model.pkl` and `selected_model.txt` (pipeline hyper-parameters).
Once you have run the grid search on the cross-validation (choose either Blocking or Time Series split), select the best pipeline on the train set and save it as `selected_model.pkl` and `selected_model.txt` (pipeline hyperparameters).
**Note: You may observe that the selected model is not good after analyzing the ml metrics (ON THE TRAIN SET) and select another one. **
**Note: After analyzing the ML metrics (ON THE TRAIN SET), you may observe that the selected model is not good enough and select another one.**
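A minimal sketch of saving the selected pipeline, assuming `grid_search` is a fitted scikit-learn `GridSearchCV` (or similar) wrapping your pipeline:

```python
import joblib

def save_selected_model(grid_search, pkl_path="selected_model.pkl", txt_path="selected_model.txt"):
    # Persist the best pipeline and a human-readable dump of its hyperparameters.
    joblib.dump(grid_search.best_estimator_, pkl_path)
    with open(txt_path, "w") as f:
        f.write(str(grid_search.best_params_))
```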
- ML metrics and feature importances on the selected pipeline on the train set only.
- DataFrame with a Machine learning metrics on train et validation sets on all folds of the train set. Suggested format: columns: ML metrics (AUC, Accuracy, LogLoss), rows: folds, train set and validation set (double index). Save it as `ml_metrics_train.csv`
- ML metrics and feature importance on the selected pipeline on the train set only.
- DataFrame with the machine learning metrics on the train and validation sets for all folds of the train set. Suggested format: columns: ML metrics (AUC, Accuracy, `LogLoss`), rows: folds, train set and validation set (double index). Save it as `ml_metrics_train.csv`
- Plot. Choose the metric you want. Suggested: AUC. Save it as `metric_train.png`. The plot below shows what the plot should look like.
- DataFrame with top 10 important features for each fold. Save it as `top_10_feature_importance.csv`
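One possible way to collect these per-fold metrics into the suggested double-index DataFrame; the `pipeline`, `X`, `y` and `folds` names are assumptions (e.g. the folds produced by your cross-validation):

```python
import pandas as pd
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score

def fold_metrics(pipeline, X, y, folds):
    rows = {}
    for fold, (train_idx, val_idx) in enumerate(folds):
        pipeline.fit(X.iloc[train_idx], y.iloc[train_idx])
        for split, idx in (("train", train_idx), ("validation", val_idx)):
            proba = pipeline.predict_proba(X.iloc[idx])[:, 1]
            rows[(fold, split)] = {
                "AUC": roc_auc_score(y.iloc[idx], proba),
                "Accuracy": accuracy_score(y.iloc[idx], proba > 0.5),
                "LogLoss": log_loss(y.iloc[idx], proba),
            }
    return pd.DataFrame(rows).T  # double index: (fold, train/validation)

# fold_metrics(pipeline, X_train, y_train, folds).to_csv("ml_metrics_train.csv")
```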
@ -76,29 +72,29 @@ Once you'll have run the gridsearch on the cross validation (choose either Block
[barplot]: metric_plot.png "Metric plot"
- The signal has to be generated with the chosen cross validation: train the model on the train set of the first fold, then predict on its validation set; train the model on the train set of the second fold, then predict on its validation set, etc ... Then, concatenate the predictions on the validation sets to build the machine learning signal. **The pipeline shouldn't be trained once and predict on all data points !**
- The signal has to be generated with the chosen cross-validation: train the model on the train set of the first fold, then predict on its validation set; train the model on the train set of the second fold, then predict on its validation set, and so on. Then, concatenate the predictions on the validation sets to build the machine learning signal. **The pipeline shouldn't be trained once and then used to predict on all data points!**
**The output is a DataFrame or Series with a double index ordered with the probability the stock price for asset i increases between d+1 and d+2.**
**The output is a DataFrame or Series with an ordered double index containing the probability that the stock price of asset `i` increases between d+1 and d+2.**
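A hedged sketch of that fold-by-fold signal generation (same assumed names as above):

```python
import pandas as pd

def make_ml_signal(pipeline, X, y, folds):
    """Out-of-fold probabilities: fit on each fold's train set, predict on its validation set."""
    parts = []
    for train_idx, val_idx in folds:
        pipeline.fit(X.iloc[train_idx], y.iloc[train_idx])
        proba = pipeline.predict_proba(X.iloc[val_idx])[:, 1]
        parts.append(pd.Series(proba, index=X.index[val_idx]))
    # Concatenating the validation-set predictions gives the machine learning signal.
    return pd.concat(parts).sort_index()
```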
- (optional): [Train a RNN/LSTM](https://towardsdatascience.com/predicting-stock-price-with-lstm-13af86a74944). This a nice way to discover and learn about recurrent neural networks. But keep in mind that there are some new neural network architectures that seem to outperform recurrent neural networks: https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0.
- (optional): [Train an RNN/LSTM](https://towardsdatascience.com/predicting-stock-price-with-lstm-13af86a74944). This is a nice way to discover and learn about recurrent neural networks. But keep in mind that there are some new neural network architectures that seem to outperform recurrent neural networks. Here is an [interesting article](https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0) about the topic.
### Strategy backtesting
- Backtesting module deliverables. The module takes as input a machine learning signal, convert it into a financial strategy. A financial strategy DataFrame gives the amount invested at time t on asset i. The module returns the following metrics on the train set and the test set.
- PnL plot: save it as `strategy.png`
- Backtesting module deliverables. The module takes a machine learning signal as input and converts it into a financial strategy. A financial strategy DataFrame gives the amount invested at time `t` on asset `i`. The module returns the following metrics on the train set and the test set.
- Profit and Loss (PnL) plot: save it as `strategy.png`
- x axis: date
- y axis1: PnL of the strategy at time t
- y axis2: PnL of the SP500 at time t
- y axis1: PnL of the strategy at time `t`
- y axis2: PnL of the SP500 at time `t`
- Use the same scale for y axis1 and y axis2
- add a line that shows the separation between train set and test set
- PnL
- Max drawdown. https://www.investopedia.com/terms/d/drawdown.asp
- (Optional): add other metrics as sharpe ratio, volatility, etc ...
- [Max drawdown](https://www.investopedia.com/terms/d/drawdown.asp)
- (Optional): add other metrics such as the Sharpe ratio, volatility, etc.
- Create a markdown report, saved as `report.md`, that explains:
- the features used
- the pipeline used
- imputer
- scaler
- `Imputer`
- `Scaler`
- dimension reduction
- model
- the cross-validation used
@ -113,27 +109,27 @@ Once you'll have run the gridsearch on the cross validation (choose either Block
- Long only:
- Binary signal:
0: do nothing for one day on asset i
1: take a long position on asset i for 1 day
0: do nothing for one day on asset `i`
1: take a long position on asset `i` for 1 day
- Weights proportional to the machine learning signals
- invest x on asset i for on day
- invest x on asset `i` for on day
- Long and short: For those who search for "long short strategy" on Google, don't get it wrong, this has nothing to do with pair trading.
- Binary signal:
- -1: take a short position on asset i for 1 day
- 1: take a long position on asset i for 1 day
- -1: take a short position on asset `i` for 1 day
- 1: take a long position on asset `i` for 1 day
- Ternary signal:
- -1: take a short position on asset i for 1 day
- 0: do nothing for one day on asset i
- 1: take a long position on asset i for 1 day
- -1: take a short position on asset `i` for 1 day
- 0: do nothing for one day on asset `i`
- 1: take a long position on asset `i` for 1 day
Notes:
- Warning! When you don't invest on all stock as in the binary signal or the ternary signal, make sure that you are still investing 1$ per day!
- Warning! When you don't invest on all stock as in the binary signal or the ternary signal, make sure that you are still investing $1 per day!
- In order to simplify the **short position** we consider that this is the opposite of a long position. Example: I take a short one AAPL stock and the price decreases by 20$ on one day. I earn 20$.
- In order to simplify the **short position** we consider that it is the opposite of a long position. Example: I short one AAPL stock and the price decreases by $20 in one day. I earn $20.
- Stock picking: Take a long position on the k best assets (from the machine learning signal) and short the k worst assets regarding the machine learning signal.
- Stock picking: Take a long position on the `k` best assets (from the machine learning signal) and short the `k` worst assets regarding the machine learning signal.
Here's an example of how to convert a machine learning signal into a financial strategy:
@ -164,44 +160,38 @@ Here's an example on how to convert a machine learning signal into a financial s
- Multiply it with the associated return.
Don't forget the meaning of the signal on day d: it gives the return between d+1 and d+2. You should multiply the binary signal of day by the return computed between d+1 and d+2. Otherwise it's wrong because you use your signal that gives you information on d+1 and d+2 on the past or present. The strategy is leaked !
Don't forget the meaning of the signal on day d: it gives the return between d+1 and d+2. You should multiply the binary signal of day d by the return computed between d+1 and d+2. Otherwise, it's wrong, because you would be using a signal that carries information about d+1 and d+2 on the past or present. The strategy is leaked!
**Assumption**: you have 1$ per day to invest in your strategy.
**Assumption**: you have $1 per day to invest in your strategy.
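To make the computation concrete, here is a minimal backtesting sketch under these assumptions (wide DataFrames indexed by date with one column per asset; all names are hypothetical):

```python
import pandas as pd

def backtest(strategy: pd.DataFrame, future_return: pd.DataFrame) -> dict:
    """strategy.loc[d, i]: dollars invested on day d in asset i (decided with data up to day d).
    future_return.loc[d, i]: return of asset i between d+1 and d+2 (the shifted target)."""
    daily_pnl = (strategy * future_return).sum(axis=1)
    cumulative_pnl = daily_pnl.cumsum()
    drawdown = cumulative_pnl - cumulative_pnl.cummax()
    return {"pnl": cumulative_pnl, "max_drawdown": drawdown.min()}
```

The same function can be run on the train period and the test period to produce the metrics above.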
### Project repository structure:
```
project
│ README.md
│ environment.yml
└───data
│ │ sp500.csv
└───results
│ │
| |───cross-validation
│ │ │ ml_metrics_train.csv
│ │ │ metric_train.csv
│ │ │ top_10_feature_importance.csv
│ │ │ metric_train.png
│ │
| |───selected model
│ │ │ selected_model.pkl
│ │ │ selected_model.txt
│ │ │ ml_signal.csv
│ │
| |───strategy
| | | strategy.png
│ │ │ results.csv
│ │ │ report.md
|
|───scripts (free format)
│ │ features_engineering.py
│ │ gridsearch.py
│ │ model_selection.py
│ │ create_signal.py
│ │ strategy.py
├── data
│   └── sp500.csv
├── environment.yml
├── README.md
├── results
│   ├── cross-validation
│   │   ├── metric_train.csv
│   │   ├── metric_train.png
│   │   ├── ml_metrics_train.csv
│   │   └── top_10_feature_importance.csv
│   ├── selected-model
│   │   ├── ml_signal.csv
│   │   ├── selected_model.pkl
│   │   └── selected_model.txt
│   └── strategy
│   ├── report.md
│   ├── results.csv
│   └── strategy.png
└── scripts
├── create_signal.py
├── features_engineering.py
├── gridsearch.py
├── model_selection.py
    └── strategy.py
```
@ -209,4 +199,4 @@ Note: `features_engineering.py` can be used in `gridsearch.py`
### Files for this project
You can find the data required for this project in this [link]:(https://assets.01-edu.org/ai-branch/project4/project04-20221031T173034Z-001.zip)
You can find the data required for this project in this [link](https://assets.01-edu.org/ai-branch/project4/project04-20221031T173034Z-001.zip)
View File
@ -1,45 +1,8 @@
#### Financial strategies on the SP500
This document is the correction of project 4. Some steps are detailed in W1D5E4.
###### Is the structure of the project like the one presented in the `Project repository structure` in the subject?
```
project
│ README.md
│ environment.yml
└───data
│ │ sp500.csv
└───results
│ │
| |───cross-validation
│ │ │ ml_metrics_train.csv
│ │ │ metric_train.csv
│ │ │ top_10_feature_importance.csv
│ │ │ metric_train.png
│ │
| |───selected model
│ │ │ selected_model.pkl
│ │ │ selected_model.txt
│ │ │ ml_signal.csv
│ │
| |───strategy
| | | strategy.png
│ │ │ results.csv
│ │ │ report.md
|
|───scripts (free format)
│ │ features_engineering.py
│ │ gridsearch.py
│ │ model_selection.py
│ │ create_signal.py
│ │ strategy.py
```
###### Is the structure of the project like above?
###### Does the readme file summarize how to run the code and explain the global approach?
###### Does the README file summarize how to run the code and explain the global approach?
###### Does the environment contain all libraries used and their versions that are necessary to run the code?
@ -47,11 +10,11 @@ project
##### **Data processing and feature engineering**
###### Is the data splitted in a train set and test set?
###### Is the data split into a train set and a test set?
###### Is the last day of the train set D and the first day of the test set D+n with n>0? Splitting without considering the time series structure is wrong.
###### Is there no leakage? unfortunately there's no automated way to check if the dataset is leaked. This step is validated if the features of date d are built as follow:
###### Is there no leakage? Unfortunately, there's no automated way to check if the dataset is leaked. This step is validated if the features of date d are built as follows:
| Index | Features | Target |
| ------- | :------------------------: | ---------------: |
@ -71,9 +34,9 @@ project
###### Do all train folds have more than 2y history? If you use time series split, checking that the first fold has more than 2y history is enough.
###### Does the last validation set of the train set not overlap on the test set?
###### Can you confirm that the last validation set of the train data is not overlapping with the test data?
###### Do all of the folds not contain data from the same day? The split should be done on the dates.
###### Are all the data folds split by date? A fold should not contain repeated data from the same date and ticker.
###### Is there a plot showing your cross-validation? As usual, all plots should have named axes and a title. If you chose a Time Series Split the plot should look like this:
@ -85,13 +48,13 @@ project
###### Has the test set not been used to train the model and select the model?
###### Is the selected model saved in the pkl file and described in a txt file?
###### Is the selected model saved in a `pkl` file and described in a `txt` file?
##### Selected model
###### Are the ml metrics computed on the train set agregated? sum or median.
###### Are the ML metrics computed on the train set aggregated (sum or median)?
###### Are the ml metrics saved in a csv file?
###### Are the ML metrics saved in a `csv` file?
###### Are the top 10 important features per fold saved in `top_10_feature_importance.csv`?
@ -119,7 +82,7 @@ _Note that, this can be done also on the test set **IF** this hasn't helped to s
###### Is the PnL computed as: strategy \* futur_return?
###### Does the strategy give the amount invested at time t on asset i?
###### Does the strategy give the amount invested at time `t` on asset `i`?
###### Does the plot `strategy.png` contain an x axis: date?
@ -135,7 +98,7 @@ _Note that, this can be done also on the test set **IF** this hasn't helped to s
###### Does the report detail the features used?
###### Does the report detail the pipeline used (imputer, scaler, dimension reduction and model)?
###### Does the report detail the pipeline used (`Imputer`, `Scaler`, dimension reduction and model)?
###### Does the report detail the cross-validation used (length of train sets and validation sets and if possible the cross-validation plot)?
View File
@ -179,7 +179,7 @@ classifier.fit(X_train_scaled, y_train)
![alt text][logo_ex4]
[logo_ex4]: ./w2_day4_ex4_q3.png 'ROC AUC '
[logo_ex4]: ./w2_day4_ex4_q3.png "ROC AUC "
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.plot_roc_curve.html
View File
@ -115,7 +115,7 @@ array([[37, 2],
![alt text][logo_ex4]
[logo_ex4]: ../w2_day4_ex4_q3.png 'ROC AUC '
[logo_ex4]: ../w2_day4_ex4_q3.png "ROC AUC "
Having a 99% ROC AUC is not usual. The data set we used is easy to classify. On real data sets, always check for leakage when you get such a high ROC AUC score.
@ -128,70 +128,66 @@ Having a 99% ROC AUC is not usual. The data set we used is easy to classify. On
###### For question 1, are the scores outputted close to the scores below? Some of the algorithms use random steps (random sampling used by the `RandomForest`). I used `random_state = 43` for the Random Forest, the Decision Tree and the Gradient Boosting.
```console
# Linear regression
~~~
Linear Regression
TRAIN
r2 on the train set: 0.34823544284172625
MAE on the train set: 0.533092001261455
MSE on the train set: 0.5273648371379568
r2 score: 0.6054131599242079
MAE: 0.5330920012614552
MSE: 0.5273648371379568
TEST
r2 on the test set: 0.3551785428138914
MAE on the test set: 0.5196420310323713
MSE on the test set: 0.49761195027083804
# SVM
r2 score: 0.6128959462132963
MAE: 0.5196420310323714
MSE: 0.49761195027083804
~~~
SVM
TRAIN
r2 on the train set: 0.6462366150965996
MAE on the train set: 0.38356451633259875
MSE on the train set: 0.33464478671339165
r2 score: 0.749610858293664
MAE: 0.3835645163325988
MSE: 0.3346447867133917
TEST
r2 on the test set: 0.6162644671183826
MAE on the test set: 0.3897680598426786
MSE on the test set: 0.3477101776543003
# Decision Tree
r2 score: 0.7295080649899683
MAE: 0.38976805984267887
MSE: 0.3477101776543005
~~~
Decision Tree
TRAIN
r2 on the train set: 0.9999999999999488
MAE on the train set: 1.3685733933909677e-08
MSE on the train set: 6.842866883530944e-14
r2 score: 1.0
MAE: 4.221907539810565e-17
MSE: 9.24499456646287e-32
TEST
r2 on the test set: 0.6263651902480918
MAE on the test set: 0.4383758696244002
MSE on the test set: 0.4727017198871596
# Random Forest
r2 score: 0.6228217144931267
MAE: 0.4403051356589147
MSE: 0.4848526395290697
~~~
Random Forest
TRAIN
r2 on the train set: 0.9705418471542886
MAE on the train set: 0.11983836612191189
MSE on the train set: 0.034538356420577995
r2 score: 0.9741263135396302
MAE: 0.12000198560508221
MSE: 0.03458015083247723
TEST
r2 on the test set: 0.7504673649554309
MAE on the test set: 0.31889891600404635
MSE on the test set: 0.24096164834441108
# Gradient Boosting
r2 score: 0.8119778189909694
MAE: 0.3194169859011629
MSE: 0.24169750554364758
~~~
Gradient Boosting
TRAIN
r2 on the train set: 0.7395782392433273
MAE on the train set: 0.35656543036682264
MSE on the train set: 0.26167490389525294
r2 score: 0.8042086499063386
MAE: 0.35656543036682264
MSE: 0.26167490389525294
TEST
r2 on the test set: 0.7157456298013534
MAE on the test set: 0.36455447680396397
MSE on the test set: 0.27058170064218096
r2 score: 0.7895081234643192
MAE: 0.36455447680396397
MSE: 0.27058170064218096
```
It is important to notice that the Decision Tree overfits very easily. It easily learns the training data but is not able to generalize to the test set. This algorithm is rarely used on its own because of its tendency to overfit.
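For reference, a minimal sketch of how such train/test scores could be produced; the variable names (`X_train_scaled`, `X_test_scaled`, `y_train`, `y_test`) are assumed to come from the exercise setup:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def report_scores(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    for name, X, y in (("TRAIN", X_train, y_train), ("TEST", X_test, y_test)):
        pred = model.predict(X)
        print(name)
        print("r2 score:", r2_score(y, pred))
        print("MAE:", mean_absolute_error(y, pred))
        print("MSE:", mean_squared_error(y, pred))

# report_scores(RandomForestRegressor(random_state=43), X_train_scaled, y_train, X_test_scaled, y_test)
```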
View File
@ -29,4 +29,4 @@ $
### References
[string concatenation](https://www.w3schools.com/python/ref_string_concatenation.asp)
[string concatenation](https://docs.python.org/3/tutorial/introduction.html#text)
View File
@ -24,10 +24,9 @@ A `README.md` file and all files used to create, delete and manage the student i
###### Does the `README.md` file contain all the required information to run and manage the solution (prerequisites, configuration, setup, usage, etc)?
#### Check the student infrastructure:
##### Check the student infrastructure.
The student must implement this architecture:
![architecture](../pictures/architecture.png)
###### Does the student architecture reflect the infrastructure enforced by the subject?
##### Run the student infrastructure:
@ -44,8 +43,6 @@ api-gateway-app ... done
user:~$
```
###### Does the student architecture reflect the infrastructure enforced by the subject?
###### Does the infrastructure start correctly?
##### Ask the following questions to the group or student
View File
@ -4,32 +4,32 @@ In this exercise, you will learn to create a complex player movement from scratc
### Objectives
For this project you will implement a fully playable character using what we call Animation Blueprint, Aim Offset and PlayerCharacter Blueprint.
For this project you will implement a fully playable character using what we call Animation Blueprint, Aim Offset and Player Character Blueprint.
### Instructions
Starting from an empty project, after creating a level with a floor (nothing else is really required), you should:
- create a Third Player Character Blueprint and apply the Countess mesh to it.
- Create a Third Player Character Blueprint and apply the Countess mesh to it.
- give the character the ability to:
- Give the character the ability to:
- move forward, backward, left and right using the WASD keys.
- look around and change direction using the mouse.
- jump using Space.
- attack using the left mouse click.
- Move forward, backward, left and right using the WASD keys.
- Look around and change direction using the mouse.
- Jump using Space.
- Attack using the left mouse click.
- create an animation blueprint file to animate the character so she can be animated while running, jumping, attacking, etc...
- Create an animation blueprint file to animate the character, so she can be animated while running, jumping, attacking, etc...
- implement the Aim Offset, so that when the player moves the mouse, the head of Countess follow the direction were looking at, in a 180-degree angle.
- Implement the Aim Offset or a Control Rig, so that when the player moves the mouse, the head of Countess follows the direction the player is looking at, within a 180-degree angle.
- separate the upper and lower body part, so that the character is able to walk and attack at the same time without any animation problems.
- Separate the upper and lower body parts, so that the character is able to walk and attack at the same time without any animation problems.
- make Countess lean according to the mouse direction, while running forward.
- Make Countess lean according to the mouse direction, while running forward.
- use Animation Blendspace to organize your movements.
- Use Animation Blend Spaces to organize your movements.
- use Animation Blueprint variables to handle different animation states.
- Use Animation Blueprint variables to handle different animation states.
After downloading and unzipping this [file](https://assets.01-edu.org/Unreal-Engine-Piscine/ArmyOfOne.zip), you can copy its content to your project Content folder.
View File
@ -8,17 +8,17 @@
###### While running, does moving the mouse left and right change the player's leaning angle and direction?
###### Is the animation used for the countess in the animation blueprint stored in a BlendSpace file?
###### Is the animation used for the countess in the animation blueprint stored in a Blend Space file?
###### Does the Countess head follow the mouse orientation?
###### Is an Aim Offset being used to move the Countess head according to the mouse movement?
###### Is an Aim Offset or Control Rig being used to move the Countess head according to the mouse movement?
###### Can the Countess character attack using the blades when clicking on the left mouse button?
###### Are the body transitions smooth when starting an attack or jumping, etc…?
###### Can you attack and move around at the same time without damaging the animations performance?
###### Can you attack and move around at the same time without damaging the animations' performance?
###### Does the Countess body blend between two animations (are blend nodes being used)?
@ -28,6 +28,6 @@
#### Bonus
###### Can the Countess character execute more than 3 attacks?
###### +Can the Countess character execute more than 3 attacks?
###### Are they at least two different Countess skins being used?
###### +Are there at least two different Countess skins being used?
335
subjects/git/README.md Normal file
View File
@ -0,0 +1,335 @@
## Git Ready
### Introduction
This project is designed to introduce you to the world of version control and collaboration using **Git**. Git is a powerful and widely used tool for tracking changes in your projects, collaborating with others, and ensuring the integrity of your code.
Throughout this project, you will embark on a journey of progressively building your Git skills. Starting from the basics, you'll gradually explore more advanced topics, equipping yourself with the essential knowledge and practices for effective version control and collaboration.
Let's Git ready for it!
### Instructions
To begin, create a `work` directory and organize all your tasks within it. Each exercise should be encapsulated in its own file, named after the corresponding task for clarity and ease of reference.
Accompanying your work, provide documentation or a report detailing the process followed for each exercise. This documentation should include any challenges faced, solutions implemented, and lessons learned. It could be in the form of a README file or a separate document. Make sure to show it to the auditor during evaluation.
> ⚠️ Your completion of tasks will be evaluated based on the commit history reflecting the changes made throughout the exercises and the presence of accompanying documentation detailing the process followed.
Here is an example of a file that you can deliver to your auditor to help with the review process:
```md
#### Conflicts, merging and rebasing
# Merge Main into Greet Branch
<Write the command here>
# Switch to main branch and make changes to hello.sh file
<Write the command here>
# Merging Main into Greet Branch (Conflict)
<Write the command here>
# Resolve the conflict (manually or using merge tools)
<Write the command here>
# After resolving, stage the changes and commit
<Write the command here>
# Rebasing Greet Branch
<Write the command here>
# Merging Greet into Main
<Write the command here>
```
#### Setting Up Git
- Install Git on your local machine by following the instructions for your operating system on the official [Git website](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git).
- Configure Git with your username and email address.
#### Git commits to commit
- Within the `work` directory, establish a subdirectory named `hello`. Inside this directory, generate a file titled `hello.sh` and input the following content:
```sh
echo "Hello, World"
```
- Initialize the git repository in the `hello` directory.
- Check the status and act according to the output of the executed command.
- Change the `hello.sh` content to the following:
```sh
#!/bin/bash
echo "Hello, $1"
```
- Stage the changed file and commit the changes; the working tree should then be clean.
- Modify the `hello.sh` file to include comments and stage it.
```sh
#!/bin/bash
# Default is "World"
name=${1:-"World"}
echo "Hello, $name"
```
- Make two separate commits:
- The first commit should be for the comment in line 1.
- The second commit should include changes made to lines 3 and 4.
#### History
- Show the history of the working directory.
- Show One-Line History for a condensed view showing only commit hashes and messages.
- **Controlled Entries**:
- You need to customize the log output by specifying the number of entries or a time range. Customize it to display the last `2 entries` and to view the `commits made within the last 5 minutes`.
- **Personalized Format**:
- Show logs in a personalized format, including the commit hash, date, message, branch information, and author name, resembling `* e4e3645 2023-06-10 | Added a comment (HEAD -> main) [John Doe]`
#### Check it out
- **Restore First Snapshot**:
- Revert the working tree to its initial state, as captured in the first snapshot, and then print the content of the `hello.sh` file.
- **Restore Second Recent Snapshot**:
- Revert the working tree to the second most recent snapshot and print the content of the `hello.sh` file.
- **Return to Latest Version**:
- Ensure that the working directory reflects the latest version of the `hello.sh` file present in the main branch, without referring to specific commit hashes.
#### TAG me
- **Referencing Current Version**:
- Tag the current version of the repository as `v1`.
- **Tagging Previous Version**:
- Tag the version immediately prior to the current version as `v1-beta`, without relying on commit hashes to navigate through the history.
- **Navigating Tagged Versions**:
- Move back and forth between the two tagged versions, `v1` and `v1-beta`.
- **Listing Tags**:
- Display a list of all tags present in the repository to verify successful tagging.
#### Changed your mind?
- **Reverting Changes**:
- Modify the latest version of the file with unwanted comments, then revert it back to its original state before staging using a `Git` command.
```sh
#!/bin/bash
# This is a bad comment. We want to revert it.
name=${1:-"World"}
echo "Hello, $name"
```
- **Staging and Cleaning**:
- Introduce unwanted changes to the file, stage them, then clean the staging area to discard the changes.
```sh
#!/bin/bash
# This is an unwanted but staged comment
name=${1:-"World"}
echo "Hello, $name"
```
- **Committing and Reverting**:
- Add the following unwanted changes again, stage the file, commit the changes, then revert them back to their original state.
```sh
#!/bin/bash
# This is an unwanted but committed change
name=${1:-"World"}
echo "Hello, $name"
```
- **Tagging and Removing Commits**:
- Tag the latest commit with `oops`, then remove commits made after the `v1` version. Ensure that the `HEAD` points to `v1`.
- **Displaying Logs with Deleted Commits**:
- Show the logs with the deleted commits displayed, particularly focusing on the commit tagged `oops`.
- **Cleaning Unreferenced Commits**:
- Ensure that unreferenced commits are deleted from the history, meaning there should be no logs for these deleted commits.
- **Author Information**:
- Add an author comment to the file and commit the changes.
```sh
#!/bin/bash
# Default is World
# Author: Jim Weirich
name=${1:-"World"}
echo "Hello, $name"
```
- Oops, the author email was forgotten. Update the file to include the email without making a new commit, but include the change in the last commit.
#### Move it
- **Moving hello.sh**:
- Using Git commands, move the program `hello.sh` into a `lib/` directory, and then commit the move.
- Create a `Makefile` in the root directory of the repository with the provided content and commit it to the repository.
```sh
TARGET="lib/hello.sh"
run:
bash ${TARGET}
```
#### blobs, trees and commits
- **Exploring `.git/` Directory**:
- Navigate to the `.git/` directory in your project and examine its contents. You will have to explain the purpose of each subdirectory, including `objects/`, `config`, `refs`, and `HEAD`, in the audit.
- **Latest Object Hash**:
- Find the latest object hash within the `.git/objects/` directory using Git commands and print the type and content of this object using Git commands.
- **Dumping Directory Tree**:
- Use Git commands to dump the directory tree referenced by this commit.
- Dump the contents of the `lib/` directory and the `hello.sh` file using Git commands.
#### Branching
It's time to do a major rewrite of the hello world functionality. Since this might take a while, you'll want to put these changes into a separate branch to isolate them from changes in the main branch.
- **Create and Switch to New Branch**:
- Create a local branch named `greet` and switch to it.
- In the `lib` directory, create a new file named `greeter.sh` and add the provided code to it. Commit these changes.
```sh
#!/bin/bash
Greeter() {
who="$1"
echo "Hello, $who"
}
```
- Update the `lib/hello.sh` file by adding the content below, stage and commit the changes.
```sh
#!/bin/bash
source lib/greeter.sh
name="$1"
if [ -z "$name" ]; then
name="World"
fi
Greeter "$name"
```
- Update the `Makefile` with the following comment and commit the changes.
```sh
# Ensure it runs the updated lib/hello.sh file
TARGET="lib/hello.sh"
run:
bash ${TARGET}
```
- Switch back to the `main` branch, compare and show the differences between the `main` and `greet` branches for `Makefile`, `hello.sh`, and `greeter.sh` files.
- Generate a `README.md` file for the project with the provided content. Commit this file.
```console
This is the Hello World example from the git project.
```
- Draw a commit tree diagram illustrating the diverging changes between all branches to demonstrate the branch history.
#### Conflicts, merging and rebasing
- **Merge Main into Greet Branch**:
- Start by merging the changes from the `main` branch into the `greet` branch.
- Switch to `main` branch and make the changes below to the `hello.sh` file, save and commit the changes.
```sh
#!/bin/bash
echo "What's your name"
read my_name
echo "Hello, $my_name"
```
- **Merging Main into Greet Branch (Conflict)**:
- Attempt to merge the `main` branch into `greet`. Bingooo! There you have it, a `conflict`.
- Resolve the conflict (manually or using graphical merge tools), accept changes from `main` branch, then commit the conflict resolution.
- **Rebasing Greet Branch**:
- Go back to the point before the initial merge between `main` and `greet`.
- Rebase the `greet` branch on top of the latest changes in the `main` branch.
- **Merging Greet into Main**:
- Merge the changes from the `greet` branch into the `main` branch.
- **Understanding Fast-Forwarding and Differences**:
- Explain fast-forwarding and the difference between merging and rebasing.
#### Local and remote repositories
- In the `work/` directory, make a clone of the repository `hello` as `cloned_hello`. (Do not use `copy` command)
- Show the logs for the cloned repository.
- Display the name of the remote repository and provide more information about it.
- List all remote and local branches in the `cloned_hello` repository.
- Make changes to the original repository, update the `README.md` file with the provided content, and commit the changes.
```
This is the Hello World example from the git project.
(changed in the original)
```
- Inside the cloned repository (`cloned_hello`), fetch the changes from the remote repository and display the logs. Ensure that commits from the `hello` repository are included in the logs.
- Merge the changes from the remote `main` branch into the local `main` branch.
- Add a local branch named `greet` tracking the remote `origin/greet` branch.
- Add a `remote` to your Git repository and push the `main` and `greet` branches to the `remote`.
- Be ready for this question in the audit!
**"What is the single git command equivalent to what you did before to bring changes from remote to local `main` branch?"**
#### Bare repositories
- What is a bare repository and why is it needed?
- Create a bare repository named `hello.git` from the existing `hello` repository.
- Add the bare `hello.git` repository as a remote to the original repository `hello`.
- Change the `README.md` file in the original repository with the provided content:
```
This is the Hello World example from the git project.
(Changed in the original and pushed to shared)
```
- Commit the changes and push them to the shared repository.
- Switch to the cloned repository `cloned_hello` and pull down the changes just pushed to the shared repository.
### Submission and Evaluation
Your work must be submitted at the `gitea` link provided. The evaluation will be carried out based on your submission and according to the following criteria:
- Correctness of the git commands you are using.
- Clear understanding of the git commands and concepts.
### Notions
- [Git Branching](https://learngitbranching.js.org/)
- [Git Gud](https://github.com/benthayer/git-gud)
View File
@ -0,0 +1,187 @@
### Git
> ⚠️ The student must provide you with a file containing the solutions for each task. Furthermore, they should showcase their commit history on GitHub, facilitating your review of the evolution of their work and the strategies employed to complete each task. This commit history is crucial to the evaluation process. Please ensure that the submission includes both the solution file and the link to the GitHub repository containing the commit history. In the absence of the link, kindly request the student to provide it.
#### Setup and Installation
###### Did the student successfully install Git on their local machine?
###### Did the student configure Git with a valid username and email address?
#### Git commits to commit
###### Did the student navigate to the `work` directory and create a subdirectory named `hello`?
###### Did the student generate a file named `hello.sh` with the content `echo "Hello, World"` inside the `hello` directory?
###### Did the student initialize a Git repository in the `hello` directory?
###### Did the student use the `git status` command to check the status of the repository?
###### Did the student modify the `hello.sh` file content with the provided `echo "Hello, $1"`?
###### Did the student stage the modified `hello.sh` file, commit the changes to the repository, and ensure that the working tree is clean afterward?
###### Did the student further modify the `hello.sh` file to include comments, and then make two separate commits as instructed?
###### Did the student make two separate commits, with the first commit for the comment in line 1 and the second commit for the changes made to lines 3 and 4, as instructed?
#### History
###### Did the student display the Git history of the working directory with the `git log` command?
###### Did the student successfully display a condensed view of the Git history, showing only commit hashes and messages using the "One-Line History" format?
###### Was the student able to customize the log output to display the last 2 entries?
###### Did the student successfully demonstrate viewing commits made within the last 5 minutes?
###### Did the student successfully customize the format of Git logs and display them according to this example `* e4e3645 2023-06-10 | Added a comment (HEAD -> main) [John Doe]`?
#### Check it out
###### Did the student successfully restore the first snapshot of the working tree and print the content of `hello.sh`?
###### Did the student successfully restore the second recent snapshot and print the content of `hello.sh`?
###### Did the student ensure that the working directory reflects the latest version of `hello.sh` from the main branch without using commit hashes?
#### TAG me
###### Did the student successfully tag the current version of the repository as `v1`?
###### Did the student successfully tag the version immediately prior to the current version as `v1-beta`, without relying on commit hashes?
###### Did the student navigate back and forth between the two tagged versions, `v1` and `v1-beta`?
###### Did the student display a list of all tags present in the repository to verify successful tagging?
#### Changed your mind?
###### Did the student successfully revert the modifications made to the latest version of the file, restoring it to its original state before staging using a `Git` command?
###### Did the student introduce unwanted changes to the file, stage them, and then successfully clean the staging area to discard the changes?
###### Did the student add unwanted changes again, stage the file, commit the changes, and then revert them back to their original state?
###### Did the student tag the latest commit with oops and remove commits made after the v1 version, ensuring that the HEAD points to v1?
###### Did the student display the logs with the deleted commits, particularly focusing on the commit tagged `oops`?
###### Did the student ensure that unreferenced commits were deleted from the history, with no logs remaining for these deleted commits?
###### Did the student add author information to the file and commit the changes?
###### Did the student update the file to include the author email without making a new commit, but included the change in the last commit?
#### Move it
###### Did the student successfully move the `hello.sh` program into a `lib/` directory using Git commands?
###### Did the student commit the move of `hello.sh`?
###### Did the student create and commit a `Makefile` in the root directory of the repository with the provided content?
#### blobs, trees and commits
##### Ask the student to navigate to the `.git/` directory and explain to you the purpose of each subdirectory, including `objects/`, `config`, `refs`, and `HEAD`.
###### Was the student able to explain the purpose of each subdirectory, including `objects/`, `config`, `refs`, and `HEAD`?
###### Did the student successfully find the latest object hash within the `.git/objects/` directory using Git commands?
###### Was the student able to print the type and content of this object using Git commands?
###### Did the student use Git commands to dump the directory tree referenced by a specific commit?
###### Were they able to dump the contents of the `lib/` directory and the `hello.sh` file using Git commands?
#### Branching, Merging & Rebasing
###### Did the student successfully create and switch to a new branch named `greet`?
###### Did the student create and commit a new file named `greeter.sh` in the `lib` directory with the provided code in it?
###### Did the student update the `lib/hello.sh` file with the provided content, stage, and commit the changes?
###### Did the student update the `Makefile` with the comment, stage, and commit the changes?
###### Was the student able to compare and show the differences between the `main` and `greet` branches for the `Makefile`, `hello.sh`, and `greeter.sh` files?
###### Did the student generate a `README.md` file with the provided content and commit it?
###### Did the student draw a commit tree diagram illustrating the diverging changes between all branches to demonstrate the branch history?
#### Conflicts, merging and rebasing
###### Did the student successfully merge the changes from the `main` branch into the `greet` branch?
###### Did the student make the specified changes to the `hello.sh` file in the `main` branch and commit them?
###### Did the student attempt to merge the `main` branch into the `greet` branch creating a conflict during the merge?
###### Did the student successfully resolve the conflict, accepting changes from the `main` branch?
###### Did the student commit the conflict resolution changes?
###### Did the student return to the point before the initial merge between `main` and `greet`?
###### Did the student rebase the `greet` branch on top of the latest changes in the `main` branch?
###### Did the student successfully merge the changes from the `greet` branch into the `main` branch?
##### Ask the student to explain the difference between merging and rebasing and whether they understand fast-forwarding.
###### Did the student demonstrate an understanding of fast-forwarding?
###### Was the student able to explain the difference between merging and rebasing?
#### Local & Remote Repositories
###### Did the student complete the cloning process of the `hello` repository to `cloned_hello`?
###### Did the student fetch and merge changes from the remote repository into the `main` branch?
###### Did the student list both remote and local branches, make changes to the original repository, and synchronize the cloned repository with remote changes?
###### Did the student successfully clone the `hello` repository into the `work/` directory as `cloned_hello`, without using the `copy` command?
###### Did the student show the logs for the `cloned_hello` repository?
###### Did the student display the name of the remote repository (`origin`) and provide more information about it?
###### Did the student list all remote and local branches in the `cloned_hello` repository?
###### Did the student make changes to the original repository, update the `README.md` file with the provided content, and commit the changes?
###### Inside the cloned repository (`cloned_hello`), did the student fetch the changes from the remote repository and display the logs, ensuring commits from the `hello` repository are included?
###### Did the student merge the changes from the remote `main` branch into the local `main` branch?
###### Did the student add a local branch named `greet` tracking the remote `origin/greet` branch?
###### Did the student add a `remote` reference to their Git repository?
###### Did the student push the `main` and `greet` branches to the `remote` repository?
##### Ask the following question to the student:
**What is the single git command equivalent to what you did before to bring changes from remote to local `main` branch?**
###### Did the student provide an accurate response?
#### Bare Repositories
##### Ask the following question to the student:
**What is a bare repository and why is it needed?**
###### Did the student correctly explain what a bare repository is and why it is needed?
###### Did the student successfully create a bare repository named `hello.git` from the existing `hello` repository?
###### Did the student add the bare `hello.git` repository as a remote to the original repository `hello`?
###### Did the student change the `README.md` file in the original repository, commit the change, and push it to the shared repository?
###### Did the student switch to the cloned repository `cloned_hello` and successfully pull down the changes just pushed to the shared repository?
View File
@ -0,0 +1,60 @@
## Let's Travel
### Objectives
This phase aims to enhance the Travel Management System by integrating key features that focus on engaging travelers, offering personalized travel recommendations, and ensuring secure transactions. It highlights specific functionalities for Admins, Travel Managers, and Travelers, providing a tailored experience for each role.
### Instructions
Expand the Travel Management System by incorporating essential features and defining clear roles and responsibilities. Ensure each role is granted access to functionalities relevant to their needs, with Admins and Travel Managers having additional privileges for comprehensive system management.
#### 1. Feature Development and Integration by Roles
##### Admin:
- View top-ranking managers and travels, including reports on income for the last months and the number of organized travels.
- Access a detailed travel history list and feedback to assess user satisfaction.
- See a list of managers ordered based on their performance score, taking into account travel feedback, income, and other relevant metrics.
- Include a section for admins to review reports filed by travelers against travels or managers.
- Have the ability to perform all actions available to Travel Managers and Travelers, ensuring full oversight of the system.
##### Travel Manager:
- Create and manage personal travel offerings, with the capability to view feedback specific to their organized travels.
- Access a dashboard that displays key statistics linked to their travels, such as income, number of trips, and number of travelers.
- Manage subscriber lists for each travel, with options to view profiles or unsubscribe travelers from the travel.
- Have access to all functionalities available to a Traveler, enhancing their understanding of the user experience.
- Utilize detailed analytics and feedback to inform future travel planning and management strategies.
##### Traveler:
- Integrate an Elasticsearch-based travel search with autocomplete for smooth, dynamic querying across all travel details, ensuring swift and accurate results (a minimal query sketch follows this list).
- Browse available travels and receive personalized suggestions based on previous feedback and participation (use at least 3 fields of the travel), utilizing Neo4j for customization.
- Subscribe and unsubscribe from travels with a cutoff period of 3 days before the travel start date for flexibility.
- Execute payments for subscriptions using various methods, catering to user convenience and security.
- Provide feedback on the travels they participated in, contributing to the community's overall quality and trustworthiness.
- Access a Travel Manager page to view statistics, past travel ratings, and the number of reports, fostering transparency and accountability.
- Report Travel Managers or other travelers, ensuring a safe and respectful community environment.
- View personal statistics, including past travel participation, report counts, subscription cancellations, and preferred payment methods, for a personalized experience.
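As an illustration of the autocomplete requirement above, here is a hedged sketch using a recent (8.x) Elasticsearch Python client; the index name `travels` and the field `destination` are placeholders, not the project's required schema:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

def autocomplete(prefix: str, size: int = 5) -> list[str]:
    # Prefix matching on a single field as a stand-in for a full autocomplete setup.
    resp = es.search(
        index="travels",
        query={"match_phrase_prefix": {"destination": {"query": prefix}}},
        size=size,
    )
    return [hit["_source"]["destination"] for hit in resp["hits"]["hits"]]
```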
#### 2. Responsive and Intuitive UI
Design a user interface that is responsive, intuitive, and accessible across various devices and browsers.
Ensure a seamless user experience from search to booking, with efficient navigation and relevant information presentation.
#### 3. Testing and Quality Assurance
- Develop comprehensive unit, integration, and end-to-end tests for all new features.
- Employ continuous integration practices to automate testing and ensure code quality throughout the development process.
#### 4. Security and Compliance
- Implement robust security measures to protect traveler data and transaction details, with a focus on authentication, payment processing, and privacy.
- Adhere to legal standards and industry best practices for data protection and online transactions, ensuring compliance and user trust.
- Ensure secure data transmission with SSL/TLS protocols.
### Bonus Features
- Explore Progressive Web App (PWA) technologies for an improved mobile user experience.
- Introduce multilingual support to accommodate a global user base.
- Develop any innovative feature that significantly boosts user engagement, platform functionality, or overall value.
View File
@ -0,0 +1,161 @@
#### Comprehension
##### Ask the students to elaborate on how Elasticsearch contributes to the system's search and autocomplete features.
###### Are the students able to explain how Elasticsearch contributes to the system's search and autocomplete features?
##### Inquire about the students' understanding of Neo4j's role in delivering personalized travel suggestions.
###### Can the students detail Neo4j's role in personalizing travel suggestions?
##### Question the students on their knowledge of the scalability and operational independence of each service (Elasticsearch, Neo4j).
###### Do the students comprehend how each service is scaled and operates independently within the system?
##### Ask the students about the methods used to ensure data consistency between PostgreSQL, Neo4j, and Elasticsearch.
###### Are students aware of the techniques employed to maintain data consistency across databases?
##### Ask the students to describe the specific functionalities and permissions assigned to Admins, Travel Managers, and Travelers.
###### Do the students understand the distinct functionalities and permissions for each user role?
#### Functional
##### Verify if the Elasticsearch search functionality accurately returns results based on user queries.
###### Does the Elasticsearch functionality effectively and accurately return search results?
##### Check the relevance and speed of autocomplete suggestions provided to the user.
###### Are autocomplete suggestions both relevant and promptly provided?
##### Sign in with a user account. Verify the precision of travel recommendations from the Neo4j database by evaluating user feedback and past participation. Afterward, switch to a different account with varying feedback and participation levels to conduct another evaluation.
###### Does Neo4j deliver precise travel suggestions tailored to user preferences?
##### Confirm the presence of a comprehensive overview on the Admin dashboard.
###### Is the dashboard complete and showing all the information as defined in the subject [Admin section](../README.md#admin)?
##### Assess the details of the travel management statistics available on the Travel Manager dashboard.
###### Are travel management statistics detailed and helpful for Travel Managers?
##### Ensure that personalized recommendations and travel history are easily accessible on the Traveler dashboard.
###### Can Travelers easily access personalized recommendations and their travel history?
##### Test the ease of navigation and subscription to available travels for Travelers.
###### Can Travelers easily subscribe to and navigate available travel options?
##### Check how subscription cancellations are handled, especially regarding the 3-day cutoff period.
###### Are subscription cancellations processed correctly, adhering to the specified cutoff period?
##### Evaluate the security and user-friendliness of the payment process.
###### Is the payment process secure and accommodating of various payment methods?
##### Test if Travelers can submit feedback on their travel experiences without issues.
###### Can Travelers submit feedback easily?
##### Verify if the feedback is visible to relevant parties such as Travel Managers and Admins for quality assurance.
###### Is the feedback accessible for quality control by Travel Managers and Admins?
##### Assess the ability of Travel Managers to create and manage travel listings effectively.
###### Do Travel Managers have the ability to manage travel listings effectively?
##### Verify if Travel Managers can view and interact with subscriber lists for their travels.
###### Can Travel Managers interact with subscriber lists effectively?
##### Verify if Travel Managers can access analytics about their travel listings and subscriber feedback.
###### Do Travel Managers have access to analytics on their listings?
##### Ensure that Traveler profiles are comprehensive, displaying past participations, feedback given, and reports made.
###### Are Traveler profiles detailed and informative?
##### Check the security and straightforwardness of the login process for all user roles.
###### Is the login process secure and straightforward for all user roles?
##### Confirm that role-based access controls are correctly enforced, preventing unauthorized actions across the system.
###### Are role-based access controls properly enforced?
##### Verify if data is transmitted securely using SSL/TLS encryption.
###### Is data transmission secured with SSL/TLS encryption?
##### Check if sensitive data and credentials are managed securely, adhering to best practices.
###### Are sensitive data and credentials handled securely?
##### Assess if the system handles high traffic volumes effectively without significant performance degradation.
###### Can the system effectively handle high traffic volumes while ensuring actions within the app can be completed in under 5 seconds without performance degradation?
##### Check if there is a fallback mechanism to ensure continuity of core functionalities in case of a service failure.
###### Is there a fallback mechanism for service failures?
##### Verify if the user interface is responsive across different devices and screen sizes.
###### Is the UI responsive on various devices?
##### Ensure the UI facilitates easy navigation and access to features for users of all roles.
###### Does the UI support easy navigation for all user roles?
##### Check if the system adheres to data protection regulations and privacy laws.
###### Does the system comply with data protection regulations and privacy laws?
##### Read random parts from the code base.
###### Is the code readable and simple to understand?
###### Is the code well separated? Did the students convince you about that?
##### Verify if the platform is protected against SQL injection and XSS.
###### Can the student prove that the platform is protected against SQL injection?
###### Can the student prove that the platform is protected against XSS?
##### Verify if Passwords are encrypted.
###### Are passwords encrypted?
#### Bonus
##### Review if the system has been implemented as a Progressive Web App (PWA) to enhance the mobile user experience.
###### +Has the system been implemented as a Progressive Web App (PWA) to enhance the mobile user experience?
###### +Do PWA features such as offline functionality, background sync, and push notifications work correctly across various mobile devices?
###### +Is the application's load time optimized for mobile use, following PWA best practices?
##### Verify if multilingual support is integrated into the system to cater to a global user base effectively.
###### +Is multilingual support integrated into the system to cater to a global user base effectively?
###### +Can users seamlessly switch between languages, and is the language preference persistently stored for future sessions?
###### +Does multilingual support cover all aspects of the platform, including user interface elements, notifications, and user-generated content?
##### Review if any innovative features have been introduced that significantly enhance user engagement or the value of the platform.
###### +Has the project introduced any innovative features that significantly enhance user engagement or the value of the platform?
###### +Are these innovative features functioning as intended without causing any errors or issues within the system?
###### +Do these features demonstrate a clear understanding and application of current technologies or methodologies to solve user needs or improve the platform?