294 papers. Leaked.

Most ML pipelines leak. Most teams don't know.

Published papers retracted or undermined by data leakage. Across 17 scientific fields. Not typos. Structural errors that existing tools made easy to commit.

Kapoor & Narayanan, Leakage and the reproducibility crisis in machine-learning-based science, Patterns, 2023.

Models trained on leaked data report inflated metrics. Teams ship products based on numbers that don't replicate. Decisions get made on evidence that was never real. The tools could have warned you. They didn't. That was the problem.
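One common way a pipeline leaks, sketched in plain Python with hypothetical toy data (no real library or dataset is implied): preprocessing statistics computed on the full dataset before the train/test split let test-set information contaminate training.

```python
import random

random.seed(0)

# Toy dataset: 100 (feature, label) pairs. Purely illustrative.
data = [(random.gauss(0, 1), i % 2) for i in range(100)]

# LEAKY: normalize with the mean of the FULL dataset, then split.
# The training features now encode information about the test set.
full_mean = sum(x for x, _ in data) / len(data)
leaky_train = [(x - full_mean, y) for x, y in data[:80]]

# HONEST: split first, compute the mean on the training split only,
# then apply that same statistic to the held-out split.
train, test = data[:80], data[80:]
train_mean = sum(x for x, _ in train) / len(train)
honest_train = [(x - train_mean, y) for x, y in train]
honest_test = [(x - train_mean, y) for x, y in test]
```

Both orderings run without error and look nearly identical in code, which is exactly why tooling that does not check the order lets the structural error through.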

Biased Machines in the Realm of Politics

GSDS, University of Konstanz, 2022

Doctoral dissertation. ML methodology applied to political science, the domain where I first observed the leakage problem in practice.


Roth, S. (2022). KOPS Archive

Decision science

Once you have an honest prediction, how do you make an honest decision? When does a model output justify action?

Most deployed ML systems stop at prediction. Arguably the hard part is the step after: thresholds, costs, fairness constraints, human override. I'm working on it.

What I don't know yet

Honest research starts with what's unknown. These are the questions driving my current work.

Gaming the honest workflow
The API enforces the right order. But what stops you from running evaluate() fifty times, picking the best number, and then calling assess()? The structure prevents accidental leakage. Can it prevent intentional gaming?
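One answer the question points toward: structure cannot prevent intentional gaming, but it can make gaming visible. A minimal sketch, assuming a hypothetical workflow object (the class and its fields are illustrative, not the actual API), where every `evaluate()` call is counted and the count travels with the one permitted `assess()` result:

```python
class HonestRun:
    """Hypothetical sketch: logs every evaluation, permits one assessment."""

    def __init__(self):
        self.eval_count = 0
        self.assessed = False

    def evaluate(self, score):
        # Validation evaluations are allowed, but counted.
        # Running this fifty times is possible; hiding it is not.
        if self.assessed:
            raise RuntimeError("no evaluation after the final assessment")
        self.eval_count += 1
        return score

    def assess(self, score):
        # The held-out assessment happens exactly once, and the
        # number of prior evaluations is part of the reported result.
        if self.assessed:
            raise RuntimeError("assess() may be called only once")
        self.assessed = True
        return {"score": score, "prior_evaluations": self.eval_count}
```

Picking the best of fifty runs still works, but the final report says `prior_evaluations: 50`, which turns a hidden multiple-comparisons problem into a disclosed one.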
Prediction vs. decision
A model says "80% probability." Should you act? The answer depends on costs, stakes, and alternatives. None of which live in the model. Can the workflow enforce that distinction?
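The cost dependence can be made concrete with a standard expected-cost comparison. The helper below is a hypothetical illustration, not part of any workflow or library: the same 80% probability justifies opposite decisions once the costs of the two error types differ.

```python
def should_act(p, cost_false_positive, cost_false_negative):
    """Act iff the expected cost of acting is below that of waiting.
    Hypothetical helper for illustration only."""
    # Acting wrongly costs cost_false_positive with probability (1 - p);
    # waiting wrongly costs cost_false_negative with probability p.
    expected_cost_act = (1 - p) * cost_false_positive
    expected_cost_wait = p * cost_false_negative
    return expected_cost_act < expected_cost_wait

# Symmetric costs: 0.2 expected cost of acting vs 0.8 of waiting.
should_act(0.8, cost_false_positive=1, cost_false_negative=1)   # True

# A false alarm is ten times as expensive: 2.0 vs 0.8.
should_act(0.8, cost_false_positive=10, cost_false_negative=1)  # False
```

None of the three inputs besides `p` comes from the model, which is the point: the workflow would have to demand the costs explicitly before letting a prediction become an action.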
Replication at scale
The experiments cover 2,047 instances across 4 algorithms. Does the effect hold for deep learning? For time series? For domains where leakage looks different?

EPAGOGY METHOD

ἐπαγωγή — Aristotle's term for reasoning from particular observations to general principles. Prior Analytics, II.23.

Pre-registered predictions
Hypotheses registered before experiments run. Falsification is a valid outcome. Two confirmed, one falsified; I report all three.
Open methodology
Code, data, and analysis scripts are public. Replication is not a request — it's the default.
Implementation-first
Every idea ships as working code. Theory without implementation is speculation. Code without theory is a library.

Questions, critique, collaboration.