INNOVESH


Many models with suspiciously good offline metrics are silently cheating. Leakage, meaning information from the future or from the label itself sneaking into training, produces great offline numbers and disasters in production. Preventing it is one of the most important engineering skills in ML.

Split before you touch the data

Train: fit the model
Validation: tune + choose
Test: touch ONCE, at the end

python

from sklearn.model_selection import train_test_split

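# X, y: features and labels, assumed already loaded.
# The random split below is only safe when rows do not depend on time.
# Time matters? Split by time, never randomly:
# a random split on temporal data = leakage from the future.
# 70% train; the remaining 30% is halved into 15% validation and 15% test.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)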
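
When time does matter, the split can be made chronological instead. The following is a minimal sketch, not a definitive implementation: the helper name time_split, the 70/15/15 fractions, and the synthetic timestamps are all assumptions made for illustration. It sorts rows from oldest to newest and cuts the sorted order into three contiguous blocks, so validation and test rows are always later than every training row.

```python
import numpy as np

def time_split(timestamps, train_frac=0.7, val_frac=0.15):
    """Return (train, val, test) index arrays, oldest rows first.

    Hypothetical helper: the fractions are illustrative defaults.
    """
    order = np.argsort(timestamps, kind="stable")  # oldest to newest
    n = len(order)
    n_train = int(round(n * train_frac))
    n_val = int(round(n * val_frac))
    train_idx = order[:n_train]
    val_idx = order[n_train:n_train + n_val]
    test_idx = order[n_train + n_val:]             # the most recent rows
    return train_idx, val_idx, test_idx

# Synthetic example: 100 rows whose timestamps arrive in shuffled order.
rng = np.random.default_rng(0)
timestamps = rng.permutation(100)

train_idx, val_idx, test_idx = time_split(timestamps)

# Every training row is older than every validation row,
# and every validation row is older than every test row.
latest_train = timestamps[train_idx].max()
earliest_val = timestamps[val_idx].min()
latest_val = timestamps[val_idx].max()
earliest_test = timestamps[test_idx].min()
```

If several rows share the same timestamp, a cut point can land between them; in that case it is probably safer to pick an explicit cutoff date rather than a row fraction.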

Classic leakage sources

Scaling/encoding fit on the FULL dataset before splitting (fit on train only).
A feature that is a proxy for the label (e.g. "refund_amount" predicting "was_refunded").
Future information: using a value only known after the prediction moment.
Duplicate rows spanning train and test (the same user in both).
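
The last source above, the same user landing in both train and test, can be avoided by splitting on the user rather than on the row. Here is a hedged sketch using scikit-learn's GroupShuffleSplit; the user_ids array and the toy X and y are hypothetical data invented for the example.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 12 rows from 4 users, three rows per user.
user_ids = np.repeat(["u1", "u2", "u3", "u4"], 3)
X = np.arange(24).reshape(12, 2)
y = np.tile([0, 1], 6)

# test_size applies to groups (users), not to individual rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=user_ids))

# No user should appear on both sides of the split.
overlap = set(user_ids[train_idx]) & set(user_ids[test_idx])
```

An empty overlap set is the property to check for; the same idea extends to any entity that repeats across rows, such as a device, a session, or a document.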

Fit on train, transform everywhere

Every transformation (scaler, encoder, imputer) learns parameters from the data it is fit on. Learn them on TRAIN only, then apply them unchanged to val/test. A Pipeline does this for you as long as you only ever call fit on training data, so use one.

python

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("scale", StandardScaler()),     # fit on train inside .fit()
    ("clf",   LogisticRegression()),
])
pipe.fit(X_tr, y_tr)                  # leakage-safe by construction
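
The same pipeline also keeps cross-validation honest, because the scaler is refit inside each fold rather than once on all rows. The sketch below assumes synthetic data from make_classification and an 80/20 split, both chosen purely for illustration; max_iter=1000 is likewise an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for this example).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Each fold clones the pipeline and fits the scaler on that fold's
# training portion only, so held-out rows never influence the scaling.
cv_scores = cross_val_score(pipe, X_tr, y_tr, cv=5)

# Final fit on the training set; validation rows are only transformed.
pipe.fit(X_tr, y_tr)
val_accuracy = pipe.score(X_val, y_val)

# The scaler's learned mean comes from X_tr alone.
scaler_mean = pipe.named_steps["scale"].mean_
train_only = np.allclose(scaler_mean, X_tr.mean(axis=0))
```

Scaling X once up front and then passing the scaled array to cross_val_score would leak each fold's held-out statistics into training, which is exactly what wrapping the scaler in the Pipeline avoids.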

Takeaway

Split first (by time if time matters), fit transforms on train only, and hunt for label-proxy features. A Pipeline makes leakage-safety the default.


Data Splits, Leakage & the Pipeline
