294 papers. Leaked.

Most ML pipelines leak. Most teams don't know.

Published papers retracted or undermined by data leakage. Across 17 scientific fields. Not typos. Structural errors that existing tools made easy to commit.

Kapoor & Narayanan, Leakage and the reproducibility crisis in machine-learning-based science, Patterns, 2023.

Models trained on leaked data report inflated metrics. Teams ship products based on numbers that don't replicate. Decisions get made on evidence that was never real. The tools could have warned you. They didn't. That was the problem.
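One common way a pipeline leaks, sketched in plain Python with hypothetical toy data (no real library or dataset is implied): preprocessing statistics computed on the full dataset before the train/test split let test-set information contaminate training.

```python
import random

random.seed(0)

# Toy dataset: 100 (feature, label) pairs. Purely illustrative.
data = [(random.gauss(0, 1), i % 2) for i in range(100)]

# LEAKY: normalize with the mean of the FULL dataset, then split.
# The training features now encode information about the test set.
full_mean = sum(x for x, _ in data) / len(data)
leaky_train = [(x - full_mean, y) for x, y in data[:80]]

# HONEST: split first, compute the mean on the training split only,
# then apply that same statistic to the held-out split.
train, test = data[:80], data[80:]
train_mean = sum(x for x, _ in train) / len(train)
honest_train = [(x - train_mean, y) for x, y in train]
honest_test = [(x - train_mean, y) for x, y in test]
```

Both orderings run without error and look nearly identical in code, which is exactly why tooling that does not check the order lets the structural error through.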

Biased Machines in the Realm of Politics

GSDS, University of Konstanz, 2022

Doctoral dissertation. ML methodology applied to political science, the domain where I first observed the leakage problem in practice.


Roth, S. (2022). KOPS Archive

Decision science

Once you have an honest prediction, how do you make an honest decision? When does a model output justify action?

Most deployed ML systems stop at prediction. Arguably the hard part is the step after: thresholds, costs, fairness constraints, human override. I'm working on it.

What I don't know yet

Honest research starts with what's unknown. These are the questions driving my current work.

Gaming the honest workflow
The API enforces the right order. But what stops you from running evaluate() fifty times, picking the best number, and then calling assess()? The structure prevents accidental leakage. Can it prevent intentional gaming?
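One answer the question points toward: structure cannot prevent intentional gaming, but it can make gaming visible. A minimal sketch, assuming a hypothetical workflow object (the class and its fields are illustrative, not the actual API), where every `evaluate()` call is counted and the count travels with the one permitted `assess()` result:

```python
class HonestRun:
    """Hypothetical sketch: logs every evaluation, permits one assessment."""

    def __init__(self):
        self.eval_count = 0
        self.assessed = False

    def evaluate(self, score):
        # Validation evaluations are allowed, but counted.
        # Running this fifty times is possible; hiding it is not.
        if self.assessed:
            raise RuntimeError("no evaluation after the final assessment")
        self.eval_count += 1
        return score

    def assess(self, score):
        # The held-out assessment happens exactly once, and the
        # number of prior evaluations is part of the reported result.
        if self.assessed:
            raise RuntimeError("assess() may be called only once")
        self.assessed = True
        return {"score": score, "prior_evaluations": self.eval_count}
```

Picking the best of fifty runs still works, but the final report says `prior_evaluations: 50`, which turns a hidden multiple-comparisons problem into a disclosed one.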
Prediction vs. decision
A model says "80% probability." Should you act? The answer depends on costs, stakes, and alternatives. None of which live in the model. Can the workflow enforce that distinction?
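The cost dependence can be made concrete with a standard expected-cost comparison. The helper below is a hypothetical illustration, not part of any workflow or library: the same 80% probability justifies opposite decisions once the costs of the two error types differ.

```python
def should_act(p, cost_false_positive, cost_false_negative):
    """Act iff the expected cost of acting is below that of waiting.
    Hypothetical helper for illustration only."""
    # Acting wrongly costs cost_false_positive with probability (1 - p);
    # waiting wrongly costs cost_false_negative with probability p.
    expected_cost_act = (1 - p) * cost_false_positive
    expected_cost_wait = p * cost_false_negative
    return expected_cost_act < expected_cost_wait

# Symmetric costs: 0.2 expected cost of acting vs 0.8 of waiting.
should_act(0.8, cost_false_positive=1, cost_false_negative=1)   # True

# A false alarm is ten times as expensive: 2.0 vs 0.8.
should_act(0.8, cost_false_positive=10, cost_false_negative=1)  # False
```

None of the three inputs besides `p` comes from the model, which is the point: the workflow would have to demand the costs explicitly before letting a prediction become an action.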
Replication at scale
The experiments cover 2,047 instances across 4 algorithms. Does the effect hold for deep learning? For time series? For domains where leakage looks different?

EPAGOGY METHOD

ἐπαγωγή — Aristotle's term for reasoning from particular observations to general principles. Prior Analytics, II.23.

Pre-registered predictions
Hypotheses registered before experiments run. Falsification is a valid outcome. Two confirmed, one falsified; I report all three.
Open methodology
Code, data, and analysis scripts are public. Replication is not a request — it's the default.
Implementation-first
Every idea ships as working code. Theory without implementation is speculation. Code without theory is a library.

Questions, critique, collaboration.