This is actually really similar to something I've been wanting to build for a long time. In my case I've thought it would be useful to have a way to estimate the likelihood that a given change breaks things, based on the history of breaking changes in the same file or area of the file. Basically, a riskiness score for each change. The risk score could be attached to each PR and would give reviewers a signal about which code deserves a bit of extra attention, as well as highlighting the risky changes when they are being deployed.
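To make the idea concrete, here's a minimal sketch of one way to score a file (the names, data shape, and exponential-decay scheme are all my own invention, not anyone's actual system): weight past breakages by recency, so a file that broke a deployment last week scores higher than one with the same number of breakages years ago.

```python
import math
from dataclasses import dataclass

@dataclass
class Change:
    file: str
    age_days: float   # how long ago the change landed
    broke: bool       # did it cause a breakage/rollback?

def file_risk(history: list[Change], half_life_days: float = 90.0) -> float:
    """Exponentially decayed breakage rate for one file's change history.

    Recent breakages dominate; old ones fade with the given half-life.
    Returns a score in [0, 1].
    """
    decay = math.log(2) / half_life_days
    weighted_breaks = 0.0
    weighted_total = 0.0
    for c in history:
        w = math.exp(-decay * c.age_days)
        weighted_total += w
        if c.broke:
            weighted_breaks += w
    return weighted_breaks / weighted_total if weighted_total else 0.0

history = [
    Change("app/deploy.py", age_days=3, broke=True),
    Change("app/deploy.py", age_days=40, broke=False),
    Change("app/deploy.py", age_days=200, broke=True),
]
```

A PR's score could then be, say, the max of `file_risk` over the files it touches. The half-life is a knob you'd have to tune against your own history of incidents.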
The tricky part would be tracking the same part of the code as it moves up and down because of insertions/deletions above it, which would cause problems for a naive algorithm based on line numbers.
Just doing it at the file level, like this does, might be good enough to be useful though.
We do it on a symbol level after statically analyzing each change, and everything in the monorepo daily. Our remedy for high-risk changes is to run more tests: client tests, not unit tests. Sometimes there are 100k client tests to pick from, so we rank them and run a small subset.
It is a hard problem. One interesting observation is that there is usually a culprit symbol or two in the culprit change, but their connectivity is very similar to that of the non-culprits in the same change.
Another observation is that the transitively modified call graph after a change is pretty big; a depth of 50 is not unusual. It is hard to get many useful signals out of it beyond the amount of overlap in transitively affected symbols between change and test.
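The overlap signal could look something like this (a toy sketch of the general idea, not their actual system, and all names here are made up): walk the reverse call graph to find every symbol that transitively calls a changed one, then rank tests by Jaccard overlap between that affected set and the symbols each test exercises.

```python
from collections import deque

def transitively_affected(changed: set[str], callers: dict[str, set[str]]) -> set[str]:
    """BFS up the reverse call graph: everything that transitively
    calls a changed symbol is considered affected."""
    affected = set(changed)
    queue = deque(changed)
    while queue:
        sym = queue.popleft()
        for caller in callers.get(sym, ()):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected

def rank_tests(affected: set[str], test_deps: dict[str, set[str]]) -> list[str]:
    """Order tests by Jaccard overlap between the affected symbols
    and the symbols each test depends on."""
    def score(deps: set[str]) -> float:
        union = affected | deps
        return len(affected & deps) / len(union) if union else 0.0
    return sorted(test_deps, key=lambda t: score(test_deps[t]), reverse=True)

# Toy reverse call graph: callers["f"] = symbols that call f.
callers = {"parse": {"load"}, "load": {"main"}}
affected = transitively_affected({"parse"}, callers)  # {"parse", "load", "main"}

tests = {"test_parse": {"parse", "load"}, "test_ui": {"render"}}
```

At depth 50 the affected set gets huge, which is presumably why raw overlap is one of the few signals that survives; a real system would need per-test dependency sets from build metadata or coverage data.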
We found file level and build target level to be too coarse, but AST symbols are working.
Really interesting! I wanted to implement this kind of system at Wikimedia, but I quit my release engineering job at the beginning of 2022. I still think about this specific problem pretty often, though. I never thought to use the score to determine how much testing needs to be done; that's actually genius! If I had thought of that, I probably could have pitched it and gotten more people behind the whole risk-scoring idea, since overall testing times were getting really long on Wikimedia's codebase, and targeted testing could have had real benefits for the velocity of changes through the pipeline (with associated knock-on effects on developer productivity and job satisfaction).
We add support by project, and the prototypical project we started with had 1M reverse test dependencies; a quarter of those were eligible test targets that we could recommend (based on the language they were written in). This is probably the biggest project we would ever find to support in the monorepo.
Some are UI tests, but we don't recommend those: we found they don't catch breakages as often, so we don't support the language they're written in. The tests we do recommend are often integration-style tests, in that they call very high-level functions, and often many of them.
A friend working in the office of a big-tech company in Denmark said, "one bad engineer like me working in Copenhagen can put food on the table for 20 Bulgarians working in customer support."

Since that day I've always wanted to get into a FAANG-type company; writing buggy code is basically philanthropy.
Facebook release engineering famously kept a riskiness score for each developer, based on their history of broken deployments, and used it as a signal for whether a developer's changes got deployed directly or received extra scrutiny.