Many "amazing" models are silently cheating. Leakage, meaning information from the future or from the answer itself sneaking into training, produces great offline numbers and disasters in production. Avoiding it is arguably the single most important engineering skill in ML.
Split before you touch the data
- Train: fit the model
- Validation: tune + choose
- Test: touch ONCE, at the end
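When rows carry a timestamp, the same three-way split is cut chronologically rather than at random. The sketch below is one minimal way to do that with pandas; the helper name `chronological_split`, the column name `event_time`, and the 70/15/15 fractions are assumptions for illustration, not part of the lesson's own code.

```python
import pandas as pd

def chronological_split(df, time_col, train_frac=0.70, val_frac=0.15):
    """Split rows into train/val/test by time order (oldest rows go to train)."""
    ordered = df.sort_values(time_col).reset_index(drop=True)
    n = len(ordered)
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    return (
        ordered.iloc[:train_end],         # oldest rows: fit the model
        ordered.iloc[train_end:val_end],  # later rows: tune + choose
        ordered.iloc[val_end:],           # newest rows: touch once, at the end
    )

# Hypothetical toy data: 20 daily events with a made-up feature and label.
events = pd.DataFrame({
    "event_time": pd.date_range("2024-01-01", periods=20, freq="D"),
    "feature": range(20),
    "label": [i % 2 for i in range(20)],
})
train_df, val_df, test_df = chronological_split(events, "event_time")
```

Every validation row is later than every training row, and every test row is later than every validation row, so the model is never evaluated on a period it has already seen.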
python
from sklearn.model_selection import train_test_split

# X, y: feature matrix and labels, assumed to be loaded already.
# Time matters? Split by time, never randomly.
# Random split on temporal data = leakage from the future.
# The random 70/15/15 split below is only appropriate when rows have no time order.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

Classic leakage sources
- Scaling/encoding fit on the FULL dataset before splitting (fit on train only).
- A feature that is a proxy for the label (e.g. "refund_amount" predicting "was_refunded").
- Future information: using a value only known after the prediction moment.
- Duplicate or near-duplicate rows spanning train and test (e.g. rows from the same user in both).
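For the last item, one possible guard is to split by group so that all rows from a given user land on the same side. The sketch below uses scikit-learn's `GroupShuffleSplit`; the toy arrays and the names `user_ids`, `X_users`, and `y_users` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: 12 rows from 6 users, two rows per user.
user_ids = np.repeat(np.arange(6), 2)
X_users = np.arange(24, dtype=float).reshape(12, 2)
y_users = np.tile([0, 1], 6)

# Split on users, not rows: each user is assigned wholly to train or to test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=42)
train_idx, test_idx = next(splitter.split(X_users, y_users, groups=user_ids))

# Sanity check: no user should appear on both sides of the split.
shared_users = set(user_ids[train_idx]) & set(user_ids[test_idx])
```

An empty `shared_users` set confirms the split is group-safe. Exact duplicate rows are a separate case and would need to be dropped before splitting.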
Fit on train, transform everywhere
Every transformation (scaler, encoder, imputer) learns parameters from the data it is fit on. Learn them on TRAIN only, then apply them to val/test. A Pipeline enforces this, so use one.
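Done by hand, the rule looks roughly like the sketch below, which may help show what the Pipeline that follows automates. The random arrays and the names `X_train_demo` and `X_val_demo` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for a real train/validation split.
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=10.0, scale=2.0, size=(80, 3))
X_val_demo = rng.normal(loc=10.0, scale=2.0, size=(20, 3))

scaler = StandardScaler()
scaler.fit(X_train_demo)                         # mean and std learned from train only
X_train_scaled = scaler.transform(X_train_demo)
X_val_scaled = scaler.transform(X_val_demo)      # reuses the train mean/std, no refit

# Leaky anti-pattern, shown only for contrast (do not do this):
# StandardScaler().fit(np.vstack([X_train_demo, X_val_demo]))
```

The validation columns will not have exactly zero mean after scaling, and that is expected: they are measured against the training statistics, just as production data would be.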
python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("scale", StandardScaler()),  # fit on train inside .fit()
    ("clf", LogisticRegression()),
])
pipe.fit(X_tr, y_tr)  # leakage-safe by construction

Takeaway
Split first (by time if time matters), fit transforms on train only, and hunt for label-proxy features. A Pipeline makes leakage-safety the default.
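One rough way to hunt for label-proxy features is to score each feature on its own against the label and flag anything that looks too good. The sketch below does this with a single-feature ROC AUC on synthetic data; the helper `single_feature_auc` and the 0.99 threshold are assumptions, and the `refund_amount` column is fabricated here to mimic the proxy example above. On real data this check would be run on the training split only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def single_feature_auc(X, y, feature_names):
    """Return {name: AUC of that feature alone}, folded so 0.5 means no signal."""
    scores = {}
    for j, name in enumerate(feature_names):
        auc = roc_auc_score(y, X[:, j])
        scores[name] = max(auc, 1.0 - auc)  # direction of the relationship does not matter
    return scores

# Synthetic data: one pure-noise feature and one label proxy.
rng = np.random.default_rng(0)
was_refunded = rng.integers(0, 2, size=200)
noise = rng.normal(size=200)
refund_amount = was_refunded * rng.uniform(5.0, 50.0, size=200)  # nonzero only when refunded

X_proxy = np.column_stack([noise, refund_amount])
scores = single_feature_auc(X_proxy, was_refunded, ["noise", "refund_amount"])

# Assumed threshold: a single feature that separates the classes almost perfectly is suspicious.
suspects = [name for name, score in scores.items() if score > 0.99]
```

A flagged feature is not automatically leakage, but it deserves a question: would this value actually be known at prediction time?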