Fix(Pipeline): Fix irradiat attribute values

oumaimafisaoui 2024-08-27 09:47:58 +01:00 committed by Oumaima Fisaoui
parent 00813d29e9
commit 9c9adb1c88
1 changed file with 6 additions and 10 deletions

@@ -1,25 +1,19 @@
# Pipeline
## Learning goals:
-Today we will focus on the data preprocessing and discover the Pipeline object from `scikit learn`.
+Today we will focus on the data preprocessing and discover the Pipeline object from scikit learn.
1. Manage categorical variables with Integer encoding and One Hot Encoding
2. Impute the missing values
3. Reduce the dimension of the data
4. Scale the data
## Context:
- **Step 1** is always necessary. Models operate on numbers; string data, for instance, cannot be processed raw.
- **Step 2** is always necessary. Machine learning models operate on numbers, and missing values have no mathematical representation, which is why they have to be imputed.
- **Step 3** is required when the dimension of the data set is high. Dimension reduction algorithms reduce the dimensionality of the data either by selecting the variables that contain most of the information (SelectKBest) or by transforming the data. Depending on the signal in the data and the data set size, dimension reduction is not always required. This step is not covered because of its complexity: understanding the theory behind it is important. However, I suggest giving it a try during the projects.
- **Step 4** is required when using some types of Machine Learning algorithms. The algorithms that require feature scaling are mostly KNN (K-Nearest Neighbors), Neural Networks, Linear Regression, and Logistic Regression. Some algorithms work better with feature scaling because minimizing the loss function may be more difficult when the features' ranges are completely different.
-> These steps are sequential. The output of step 1 is used as input for step 2 and so on; and, the output of step 4 is used as input for the Machine Learning model. Scikitlearn proposes an object: Pipeline.
+These steps are sequential. The output of step 1 is used as input for step 2 and so on; and, the output of step 4 is used as input for the Machine Learning model.
+Scikitlearn proposes an object: Pipeline.
As we know, the model evaluation methodology requires splitting the data set into a train set and a test set. **The preprocessing is learned/fitted on the training set and applied on the test set**.
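
The fit-on-train, apply-on-test rule above can be sketched with a small `Pipeline`. This is a minimal illustration rather than the exercise's solution: the toy data, the step names (`imputer`, `scaler`, `model`), and the choice of mean imputation and `LogisticRegression` are all assumptions made for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two numeric features and a binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[::10, 0] = np.nan  # inject some missing values into the first feature

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Each step's output feeds the next step; the last step is the model.
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),  # step 2: impute
    ("scaler", StandardScaler()),                 # step 4: scale
    ("model", LogisticRegression()),
])

# fit() learns the imputation means and scaling parameters on the train set only.
pipe.fit(X_train, y_train)

# predict() re-applies the already fitted preprocessing to the test set.
predictions = pipe.predict(X_test)

# The means used for imputation come from the training data.
train_means = pipe.named_steps["imputer"].statistics_
```

Because the whole chain is fitted through a single `fit` call on the training data, no statistic computed from the test set reaches the preprocessing steps.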
@@ -247,8 +241,10 @@ breast: One Hot
breast-quad: One Hot
['right_low' 'left_low' 'left_up' 'central' 'right_up']
irradiat: One Hot
['yes' 'no']
Class: Target (One Hot)
['recurrence-events' 'no-recurrence-events']
```
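
To make the "One Hot" label on `irradiat` concrete, here is a hand-written sketch of what the encoding does to the two values listed above. It is plain Python for illustration only, not scikit-learn's implementation; the helper names `fit_categories` and `one_hot` and the sample lists are hypothetical.

```python
def fit_categories(values):
    """Learn the sorted list of distinct categories from the training values."""
    return sorted(set(values))


def one_hot(values, categories):
    """Encode each value as a 0/1 row with a single 1 at its category's position."""
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # raises KeyError for a category unseen during fitting
        rows.append(row)
    return rows


# Hypothetical samples of the irradiat attribute.
train_irradiat = ["yes", "no", "no", "yes"]
test_irradiat = ["no", "yes"]

# Categories are learned on the train set, then reused on the test set.
categories = fit_categories(train_irradiat)        # ['no', 'yes']
encoded_test = one_hot(test_irradiat, categories)  # [[1, 0], [0, 1]]
```

As with the other preprocessing steps, the category list is learned from the training values and then applied unchanged to the test values.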