Course 11 · Lesson 2 of 9

Data Splits, Leakage & the Pipeline

Many “amazing” models are silently cheating. Leakage, meaning information from the future or from the answer itself sneaking into training, produces great offline numbers and disasters in production. Preventing it is one of the most important engineering skills in ML.

Split before you touch the data

  • Train: fit the model.
  • Validation: tune + choose.
  • Test: touch ONCE, at the end.

python
from sklearn.model_selection import train_test_split

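# A random split is only safe when rows are independent and their order carries no meaning.
# Time matters? Split by time instead, never randomly:
# a random split on temporal data = leakage from the future.
# 70% train; the remaining 30% is halved into 15% validation and 15% test.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)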
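
When time does matter, the split has to follow the clock. The following is a minimal sketch, assuming the data sits in a pandas DataFrame with a timestamp column; the helper `time_split`, the column name `ts`, and the toy data are hypothetical and only illustrate the idea.

```python
import pandas as pd

def time_split(df, time_col, train_frac=0.7, val_frac=0.15):
    """Split chronologically: oldest rows go to train, newest rows to test."""
    df = df.sort_values(time_col).reset_index(drop=True)
    n = len(df)
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    return df.iloc[:train_end], df.iloc[train_end:val_end], df.iloc[val_end:]

# Hypothetical toy data: 20 daily observations, shuffled to mimic unordered storage.
events = pd.DataFrame({
    "ts": pd.date_range("2024-01-01", periods=20, freq="D"),
    "x": range(20),
}).sample(frac=1.0, random_state=0)

train_df, val_df, test_df = time_split(events, "ts")
```

Every training row is older than every validation row, and every validation row is older than every test row, so the model is never fit on information from after the period it is evaluated on.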

Classic leakage sources

  • Scaling/encoding fit on the FULL dataset before splitting (fit on train only).
  • A feature that is a proxy for the label (e.g. "refund_amount" predicting "was_refunded").
  • Future information: using a value only known after the prediction moment.
  • Duplicate rows, or rows from the same entity, spanning train and test (e.g. the same user in both).
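
The last source in the list above can be handled by splitting on the entity rather than on the row. This is a rough sketch using scikit-learn's `GroupShuffleSplit`; the `user_id` array and the toy data are assumptions made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: 12 rows from 4 users, 3 rows each.
X_rows = np.arange(24).reshape(12, 2)
y_rows = np.array([0, 1] * 6)
user_id = np.repeat(["u1", "u2", "u3", "u4"], 3)

# test_size is a fraction of the groups (users), not of the rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X_rows, y_rows, groups=user_id))

train_users = set(user_id[train_idx])
test_users = set(user_id[test_idx])
```

Because whole users are assigned to one side or the other, `train_users` and `test_users` never overlap. Exact duplicate rows are a separate problem and are usually removed before splitting, for example with pandas `drop_duplicates`.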

Fit on train, transform everywhere

Every transformation (scaler, encoder, imputer) learns parameters. Learn them on TRAIN only, then apply to val/test. A Pipeline enforces this — use one.
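
Done by hand, the rule looks roughly like the sketch below. The synthetic arrays are assumptions for illustration; the point is only that `fit` sees the training rows and nothing else.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic features standing in for a real train/validation split.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_holdout = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learns mean_ and scale_ from train
X_holdout_scaled = scaler.transform(X_holdout)   # reuses the train statistics as-is

# The learned parameters come from the training rows alone.
train_mean_matches = np.allclose(scaler.mean_, X_train.mean(axis=0))
```

The held-out columns will not have exactly zero mean after scaling, and that is expected: they are being measured against the training distribution, which is what happens in production. Repeating this by hand for every transform is easy to get wrong.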

python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("scale", StandardScaler()),     # fit on train inside .fit()
    ("clf",   LogisticRegression()),
])
pipe.fit(X_tr, y_tr)                  # scaler statistics come from X_tr only
val_acc = pipe.score(X_val, y_val)    # X_val is scaled with the train statistics
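
The same property carries over to cross-validation. As a hedged sketch on synthetic data from `make_classification` (the names `X_demo`, `y_demo`, and `pipe_cv` are made up here), passing the whole pipeline to `cross_val_score` means the scaler is refit on the training part of each fold:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic classification data for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

pipe_cv = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# Each fold clones the pipeline and fits scaler + model on that fold's training part only.
scores = cross_val_score(pipe_cv, X_demo, y_demo, cv=5)
```

Scaling `X_demo` once up front and then cross-validating only the classifier would let every fold's held-out rows influence the scaler, which is the leak the pipeline avoids. Note that a pipeline only protects the preprocessing steps; it cannot detect a label-proxy feature or future information already present in the columns.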

Takeaway

Split first (by time if time matters), fit transforms on train only, and hunt for label-proxy features. A Pipeline makes leakage-safety the default.