03 > Datasets > Continuous Attributes - Lesson Error (data leakage)?
In the lesson we are shown this example with this message:
--code
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(train_features.values)
--end
"Once it has statistics (computed on the training set only), you can apply it to all your datasets:"
--code
standardized_data = {}
standardized_data['train'] = scaler.transform(train_features)
standardized_data['val'] = scaler.transform(val_features)
standardized_data['test'] = scaler.transform(test_features)
--end
My Understanding
I have highlighted in bold the wording of concern. My understanding is that we call fit on the training data, which stores the computed statistics in the StandardScaler instance, and then apply those statistics to the other datasets with transform.
The Issue
What the function given as an example appears to do is refit the scaler every time we pass it a new dataset.
See code below, check bolded part:
--code

--end
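Since the lesson's function did not carry over into this post, here is a minimal sketch of the behavior being described. The helper name `standardize_refit`, its signature, and the synthetic data are assumptions made for illustration, not the lesson's actual code. It shows that when fit runs on every call, the stored statistics are overwritten by whichever dataset was passed last:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): validation is deliberately shifted away from train.
rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
val = rng.normal(loc=5.0, scale=2.0, size=(50, 2))

def standardize_refit(features, scaler=None):
    """Hypothetical helper reproducing the misplaced fit() call."""
    if scaler is None:
        scaler = StandardScaler()
    # Problem: this line sits outside the if, so it runs on every call.
    scaler.fit(features)
    return scaler.transform(features), scaler

_, scaler = standardize_refit(train)
train_mean = scaler.mean_.copy()

val_std, scaler = standardize_refit(val, scaler)
mean_after_val = scaler.mean_.copy()

# The scaler no longer holds the training statistics.
stats_changed = not np.allclose(train_mean, mean_after_val)
print("statistics changed after the validation pass:", stats_changed)
```

In this sketch the validation set ends up standardized with its own mean and standard deviation rather than the training set's, which is the leak in question.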
Solution?
I think a solution is just to indent the call to fit so that it only runs on the first data passed (the training data, as desired), when no scaler instance exists yet, and then pass that scaler instance in when standardizing the validation & test data.
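A minimal sketch of that idea, assuming a hypothetical helper named `standardize` that takes an optional scaler (the name, signature, and synthetic data are illustrative and not taken from the lesson):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption) standing in for the train/val/test features.
rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
val = rng.normal(loc=5.0, scale=2.0, size=(50, 2))
test = rng.normal(loc=-3.0, scale=0.5, size=(50, 2))

def standardize(features, scaler=None):
    """Hypothetical helper: fit only when no scaler is supplied."""
    if scaler is None:
        scaler = StandardScaler()
        # fit() is indented under the if, so it runs for the training set only.
        scaler.fit(features)
    return scaler.transform(features), scaler

# First call creates and fits the scaler on the training data.
train_std, scaler = standardize(train)
train_mean = scaler.mean_.copy()

# Later calls reuse the fitted scaler and never refit it.
val_std, scaler = standardize(val, scaler)
test_std, scaler = standardize(test, scaler)

print("statistics unchanged:", np.allclose(scaler.mean_, train_mean))
```

With this arrangement the validation and test sets are scaled with the training mean and standard deviation, so their standardized values will generally not have zero mean or unit variance, which is expected.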
I may be overlooking something but if I can get some feedback or thoughts from others I'd appreciate it.
NOTE: I realized that indents are not copied in here so I am adding a screenshot of the function.
Best Answer
Hi @p.hanel ,
Thank you very much for reporting this.
You're absolutely right - as it is, we're fitting it every time and leaking data - it shouldn't be like that. Unfortunately, the call to fit() was misplaced; it should only happen inside the if statement, that is, only when we're creating the scaler for the first time (for the training set). The correct code should look like this instead:

Apologies for the confusion. We'll be updating this snippet to fix this.
Best,
Daniel1
Answers
My bad if I didn't put this in the correct category, if so please lmk.