
03 > Datasets > Continuous Attributes - Lesson Error (data leakage)?

In the lesson we are shown this example with this message:

--code

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(train_features.values)

"Once it has statistics (computed on the training set only), you can apply it to all your datasets:"

standardized_data = {}
standardized_data['train'] = scaler.transform(train_features)
standardized_data['val'] = scaler.transform(val_features)
standardized_data['test'] = scaler.transform(test_features)

--end

My Understanding
I have highlighted in bold the wording of concern. My understanding is that we call fit on the training data, which stores the computed statistics in the StandardScaler instance, and then apply those statistics to the other datasets with transform.

The Issue
What we appear to be doing, in the function given as an example, is refitting the scaler every time we pass in a new dataset.
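
To illustrate why refitting matters, here is a minimal sketch using made-up toy arrays. The data and variable names are assumptions for illustration only and are not taken from the lesson. It compares a scaler refit on the validation split against one fit only on the training split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): the validation split is shifted upward.
train = np.array([[1.0], [2.0], [3.0]])
val = np.array([[11.0], [12.0], [13.0]])

# Leaky pattern: a fresh fit on each split, so the validation data
# is standardized with its own mean and standard deviation.
leaky_val = StandardScaler().fit(val).transform(val)

# Intended pattern: fit once on the training split, reuse it for the others.
scaler = StandardScaler().fit(train)
clean_val = scaler.transform(val)

print(leaky_val.mean())  # centered on the validation split's own statistics
print(clean_val.mean())  # keeps the shift relative to the training data
```

With the refit, the shift between the two splits disappears from the standardized values, which is exactly the information that should not come from validation or test data.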

See code below, check bolded part:

--code

--end

Solution?

I think a solution is to indent the call to fit so that it runs only for the first data passed (the training data, as desired), when no scaler instance exists yet, and then to pass that scaler instance in when processing the validation and test data.
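
A minimal sketch of that idea follows. The helper name `standardize`, its signature, and the toy arrays are hypothetical; this is not the lesson's actual function, only one way the "fit inside the if" structure could look.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def standardize(features, scaler=None):
    """Hypothetical helper, not the lesson's actual function."""
    if scaler is None:
        # First call (training data): create the scaler and fit it here only.
        scaler = StandardScaler()
        scaler.fit(features)
    # Every call: transform with the statistics already stored in the scaler.
    return scaler.transform(features), scaler


# Toy data (made up for illustration).
train_features = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
val_features = np.array([[4.0, 40.0]])
test_features = np.array([[0.0, 0.0]])

# Fit on the training set, then reuse the same scaler for the other splits.
standardized_train, scaler = standardize(train_features)
standardized_val, _ = standardize(val_features, scaler)
standardized_test, _ = standardize(test_features, scaler)
```

Because fit sits inside the `if`, passing an existing scaler skips it entirely, so the validation and test sets are transformed with training statistics only.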

I may be overlooking something but if I can get some feedback or thoughts from others I'd appreciate it.

NOTE: I realized that indents are not copied in here so I am adding a screenshot of the function.

Best Answer

  • dvgodoy
    dvgodoy Posts: 12
    Answer ✓

    Hi @p.hanel ,

    Thank you very much for reporting this.

    You're absolutely right: as it is, we're fitting the scaler every time and leaking data, and it shouldn't be like that. Unfortunately, the call to fit() was misplaced. It should only happen inside the if statement, that is, only when we're creating the scaler for the first time (for the training set).

    The correct code should look like this instead:

    Apologies for the confusion. We'll be updating this snippet to fix this.

    Best,
    Daniel

Answers

  • p.hanel
    p.hanel Posts: 8

    My bad if I didn't put this in the correct category, if so please lmk.

  • fcioanca
    fcioanca Posts: 2,397

    Hi @p.hanel

    Thank you for flagging this. It has been fixed.

    Regards,
    Flavia
    Linux Foundation Education
