03 > Datasets > Continuous Attributes - Lesson Error (data leakage)?

p.hanel · November 2025

In the lesson we are shown this example with this message:

--code

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(train_features.values)

--end

"Once it has statistics (computed on the training set only), you can apply it to all your datasets:"

--code

standardized_data = {}
standardized_data['train'] = scaler.transform(train_features)
standardized_data['val'] = scaler.transform(val_features)
standardized_data['test'] = scaler.transform(test_features)

--end

My Understanding
I have highlighted in bold the wording of concern. My understanding is that we call fit on the training data, which stores the computed statistics in the StandardScaler instance, and then apply those statistics to the other datasets with transform.
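To make that understanding concrete, here is a minimal sketch, assuming small made-up arrays in place of the lesson's `train_features` and `val_features`. It shows that fit stores the per-feature mean and standard deviation on the scaler (`mean_` and `scale_`), and that transform only reuses them:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration (not the lesson's datasets)
train = np.array([[1.0], [2.0], [3.0]])
val = np.array([[10.0], [20.0]])

scaler = StandardScaler()
scaler.fit(train)  # statistics are computed from the training set only

train_mean = scaler.mean_.copy()    # per-feature mean learned by fit()
train_scale = scaler.scale_.copy()  # per-feature standard deviation learned by fit()

val_scaled = scaler.transform(val)  # reuses the stored statistics, no refit

# transform() amounts to (x - mean) / scale with the training statistics
manual = (val - train_mean) / train_scale
stats_unchanged = np.allclose(scaler.mean_, train_mean)
```

After the call to transform, the stored statistics are still the ones from the training set, which is what the quoted sentence from the lesson describes.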

The Issue
In the function given as an example, however, we appear to refit the scaler every time we pass in a new set of features.

See code below, check bolded part:

--code

--end
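As a rough illustration of the issue described here (a hypothetical reconstruction of the pattern, not the lesson's actual function; the name `standardize_buggy` and the toy arrays are made up), a fit call placed outside the `if` overwrites the training statistics on every call:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def standardize_buggy(features, scaler=None):
    # Hypothetical helper illustrating the misplaced fit() call
    if scaler is None:
        scaler = StandardScaler()
    scaler.fit(features)  # runs on every call, so stored statistics are overwritten
    return scaler.transform(features), scaler

# Toy data, made up for illustration
train = np.array([[1.0], [2.0], [3.0]])
val = np.array([[10.0], [20.0]])

_, scaler = standardize_buggy(train)
mean_after_train = scaler.mean_.copy()  # mean of the training data

val_scaled, scaler = standardize_buggy(val, scaler)
mean_after_val = scaler.mean_.copy()    # now the mean of the validation data
```

In this sketch the validation set ends up standardized with its own mean and standard deviation, so the training statistics are never applied to it.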

Solution?

I think a solution is just to indent the fit call so the scaler is only fit on the first data passed (the training data, as desired), because at that point no scaler instance exists yet. We would then pass that scaler instance back in when using the function on the validation and test data.
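A minimal sketch of that idea, assuming a helper with a hypothetical name (`standardize`) and an optional `scaler` argument; the toy arrays are made up and stand in for the lesson's datasets:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def standardize(features, scaler=None):
    # Hypothetical helper: fit only when no scaler is passed in
    if scaler is None:
        scaler = StandardScaler()
        scaler.fit(features)  # happens once, on the first (training) data
    return scaler.transform(features), scaler

# Toy data, made up for illustration
train = np.array([[1.0], [2.0], [3.0]])
val = np.array([[10.0], [20.0]])

train_scaled, scaler = standardize(train)      # creates and fits the scaler
val_scaled, _ = standardize(val, scaler)       # reuses the fitted scaler

# The stored mean is still the training mean after processing validation data
mean_after_val = scaler.mean_.copy()
expected_val = (val - train.mean(axis=0)) / train.std(axis=0)
```

Here the validation data is scaled with the training mean and standard deviation, and the scaler's stored statistics do not change on the second call.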

I may be overlooking something but if I can get some feedback or thoughts from others I'd appreciate it.

NOTE: I realized that indents are not copied in here so I am adding a screenshot of the function.

dvgodoy · November 2025

Hi @p.hanel ,

Thank you very much for reporting this.

You're absolutely right: as it is, we're fitting the scaler every time and leaking data, and it shouldn't be like that. Unfortunately, the call to fit() was misplaced; it should only happen inside the if statement, that is, only when we're creating the scaler for the first time (for the training set).

The correct code should look like this instead:

Apologies for the confusion. We'll be updating this snippet to fix this.

Best,
Daniel

p.hanel · November 2025

My bad if I didn't put this in the correct category. If so, please lmk.

Flavia · November 2025

Hi @p.hanel

Thank you for flagging this. It has been fixed.

Regards,
Flavia
Linux Foundation Education
