Fix(Pipeline): fix mismatch between datafile info and examples

oumaimafisaoui 2024-08-26 18:51:49 +01:00 committed by Oumaima Fisaoui
parent f26da6368e
commit fe5f82edcf
2 changed files with 36 additions and 30 deletions

View File

@@ -1,19 +1,25 @@
# Pipeline
Today we will focus on the data preprocessing and discover the Pipeline object from scikit learn.
## Learning goals:
Today we will focus on data preprocessing and discover the `Pipeline` object from `scikit-learn`.
1. Manage categorical variables with Integer encoding and One Hot Encoding
2. Impute the missing values
3. Reduce the dimension of the data
4. Scale the data
## Context:
- **Step 1** is always necessary: models work with numbers, so string data, for instance, can't be processed raw.
- **Step 2** is always necessary: machine learning models work with numbers, and missing values have no numerical representation, which is why they have to be imputed.
- **Step 3** is required when the dimensionality of the data set is high. Dimension reduction algorithms reduce the dimensionality of the data either by selecting the variables that carry most of the information (`SelectKBest`) or by transforming the data. Depending on the signal in the data and the size of the data set, dimension reduction is not always required. This step is not covered here because of its complexity; understanding the underlying theory is important, however, so I suggest giving it a try during the projects.
- **Step 4** is required by some types of machine learning algorithms, mostly KNN (K-Nearest Neighbors), neural networks, linear regression, and logistic regression. These algorithms work better with feature scaling because minimizing the loss function is harder when each feature's range is completely different.
These steps are sequential. The output of step 1 is used as input for step 2 and so on; and, the output of step 4 is used as input for the Machine Learning model.
Scikitlearn proposes an object: Pipeline.
> These steps are sequential. The output of step 1 is used as input for step 2 and so on, and the output of step 4 is used as input for the machine learning model. Scikit-learn provides an object for this: `Pipeline`.
As we know, the model evaluation methodology requires splitting the data set into a train set and a test set. **The preprocessing is learned/fitted on the train set and applied to the test set**, as sketched below.
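A minimal sketch of that rule, with toy data and illustrative steps (an imputer and a scaler; the columns and values below are made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# toy numerical data with a couple of missing values
X = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 400.0],
              [4.0, 500.0], [5.0, np.nan], [6.0, 700.0]])

X_train, X_test = train_test_split(X, test_size=0.33, random_state=43)

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # step 2: impute missing values
    ("scaler", StandardScaler()),                   # step 4: scale the data
])

# the imputation values and scaling statistics are learned on the train set only...
pipe.fit(X_train)
# ...and the same fitted transformation is applied to the test set
print(pipe.transform(X_test))
```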
@@ -259,16 +265,16 @@ input: ohe.transform(X_test[ohe_cols])[:10]
output:
array([[1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0.],
[1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0.],
[0., 1., 1., 0., 0., 1., 0., 0., 0., 0., 1.],
[0., 1., 1., 0., 0., 1., 0., 0., 0., 0., 1.],
[1., 0., 1., 0., 0., 0., 1., 0., 0., 1., 0.],
[1., 0., 1., 0., 0., 0., 0., 1., 0., 1., 0.],
[1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 0.],
[1., 0., 0., 1., 0., 1., 0., 0., 0., 1., 0.],
[1., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0.],
[1., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0.],
[1., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0.],
[1., 0., 0., 1., 0., 1., 0., 0., 0., 1., 0.],
[1., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0.],
[0., 1., 1., 0., 0., 0., 1., 0., 0., 0., 1.]])
[1., 0., 1., 0., 0., 0., 0., 1., 0., 0., 1.],
[1., 0., 0., 1., 0., 1., 0., 0., 0., 1., 0.]])
input: ohe.get_feature_names(ohe_cols)
input: ohe.get_feature_names_out(ohe_cols)
output:
array(['node-caps_no', 'node-caps_yes', 'breast_left', 'breast_right',
'breast-quad_central', 'breast-quad_left_low',
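The renamed call tracks the scikit-learn API: `get_feature_names` was deprecated and later removed in favour of `get_feature_names_out`. A minimal sketch with made-up values for two of the columns above:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# illustrative values for two of the dataset's columns
df = pd.DataFrame({"node-caps": ["no", "yes", "no"],
                   "breast": ["left", "right", "left"]})

# `sparse_output` requires scikit-learn >= 1.2; older versions use `sparse=False`
ohe = OneHotEncoder(sparse_output=False)
ohe.fit(df)
print(ohe.transform(df))
# get_feature_names_out replaces the removed get_feature_names
print(ohe.get_feature_names_out(["node-caps", "breast"]))
# -> ['node-caps_no' 'node-caps_yes' 'breast_left' 'breast_right']
```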
@@ -351,7 +357,7 @@ Preliminary:
X[[40,135], 3] = np.nan
```
- Split the data set in a train set and test set (33%), fit the Pipeline on the train set and predict on the test set. Use `random_state=43`.
- Split the data set in a train set and test set (33%), fit the Pipeline on the train set and predict on the test set. Use ``random_state=43``.
The pipeline you will implement has to contain 3 steps:
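The three steps themselves are defined further down in the subject, outside this hunk. Purely as a hedged sketch of the workflow, assuming the iris data from the Preliminary and an illustrative imputer/scaler/classifier triple:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X[[40, 135], 3] = np.nan  # the Preliminary injects missing values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=43)

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(X_train, y_train)     # the whole pipeline is fitted on the train set
y_pred = pipe.predict(X_test)  # and predicts on the test set
```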

View File

@@ -4,7 +4,7 @@
##### Activate the virtual environment. If you used `conda` run `conda activate your_env`.
##### Run `python --version`.
##### Run ``python --version``.
###### Does it print `Python 3.x`? x >= 8
@@ -146,14 +146,14 @@ dtype: int64
array([[1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0.],
[1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0.],
[0., 1., 1., 0., 0., 1., 0., 0., 0., 0., 1.],
[0., 1., 1., 0., 0., 1., 0., 0., 0., 0., 1.],
[1., 0., 1., 0., 0., 0., 1., 0., 0., 1., 0.],
[1., 0., 1., 0., 0., 0., 0., 1., 0., 1., 0.],
[1., 0., 0., 1., 0., 0., 0., 0., 1., 1., 0.],
[1., 0., 0., 1., 0., 1., 0., 0., 0., 1., 0.],
[1., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0.],
[1., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0.],
[1., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0.],
[1., 0., 0., 1., 0., 1., 0., 0., 0., 1., 0.],
[1., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0.],
[0., 1., 1., 0., 0., 0., 1., 0., 0., 0., 1.]])
[1., 0., 1., 0., 0., 0., 0., 1., 0., 0., 1.],
[1., 0., 0., 1., 0., 1., 0., 0., 0., 1., 0.]])
```
@@ -162,16 +162,16 @@ array([[1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0.],
```console
#First 10 rows:
array([[1., 2., 5., 0., 1.],
[1., 3., 4., 0., 1.],
[1., 2., 4., 0., 1.],
[1., 3., 2., 0., 1.],
[1., 4., 3., 0., 1.],
[1., 4., 5., 0., 0.],
[2., 5., 4., 0., 1.],
[2., 5., 8., 0., 1.],
[0., 2., 3., 0., 2.],
[1., 3., 6., 4., 2.]])
array([[2., 5., 2., 0., 1.],
[2., 5., 2., 0., 0.],
[2., 5., 4., 5., 2.],
[1., 4., 5., 1., 1.],
[2., 5., 5., 0., 2.],
[1., 2., 1., 0., 1.],
[1., 2., 8., 0., 1.],
[2., 5., 2., 0., 0.],
[2., 5., 5., 0., 2.],
[1., 2., 3., 0., 0.]])
```
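The integer codes above come from ordinal (integer) encoding; a small sketch with `OrdinalEncoder` on made-up categories (whether the subject uses this exact class is an assumption):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# hypothetical categorical columns
df = pd.DataFrame({"age": ["30-39", "50-59", "40-49"],
                   "tumor-size": ["30-34", "20-24", "30-34"]})

# each category is mapped to its index in the sorted categories seen at fit time
enc = OrdinalEncoder()
print(enc.fit_transform(df))
# -> [[0. 1.]
#     [2. 0.]
#     [1. 1.]]
```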
@@ -180,8 +180,8 @@ array([[1., 2., 5., 0., 1.],
```console
# First 2 rows:
array([[1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 1., 2., 5., 0., 1.],
[1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 1., 3., 4., 0., 1.]])
array([[1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 2., 5., 2., 0., 1.],
[1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 2., 5., 2., 0., 0.]])
```
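The 16-column rows above are simply the 11 one-hot columns followed by the 5 integer-encoded columns; a sketch of that concatenation (the variable names are placeholders):

```python
import numpy as np

# placeholder outputs of the two encoders, shapes (n, 11) and (n, 5)
ohe_matrix = np.array([[1., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0.]])
ord_matrix = np.array([[2., 5., 2., 0., 1.]])

combined = np.hstack([ohe_matrix, ord_matrix])  # shape (n, 16)
print(combined)
```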
---