In many real-world cases, the researcher won’t have made notes or remember exactly what she did, so even she won’t be able to reproduce the model. Even if she can, the frameworks the model code depend on can change over time, sometimes radically, so she’d need to also snapshot the whole system she was using to ensure that things work. I’ve found ML researchers to be incredibly generous with their time when I’ve contacted them for help reproducing model results, but it’s often months-long task even with assistance from the original author.
Why does this all matter? I’ve had several friends contact me about their struggles reproducing published models as baselines for their own papers. If they can’t get the same accuracy that the original authors did, how can they tell if their new approach is an improvement? It’s also clearly concerning to rely on models in production systems if you don’t have a way of rebuilding them to cope with changed requirements or platforms. At that point your model moves from being a high-interest credit card of technical debt to something more like what a loan-shark offers. It’s also stifling for research experimentation; since making changes to code or training data can be hard to roll back it’s a lot more risky to try different variations, just like coding without source control raises the cost of experimenting with changes.