Blog of Rob Galanakis (@robgalanakis)

TDD for legacy code, graphics code, and legacy graphics code?

We’re currently undergoing a push to do more ‘Agile testing’ at work. At CCP, we “do” Agile pretty well, but I don’t think we “code” Agile really well. A large part (the largest part?) of Agile coding and Agile testing is TDD/Unit Testing, of which I’m a huge fan if not an experienced practitioner.

I know how to do TDD for what I do: that is, side-by-side replacement of an internal legacy codebase in a high-level language. What I don’t have experience with is TDD for expanding and maintaining a huge, high-performance, very active legacy codebase, and specifically its graphics components and the C++ bits.

So if you have experience with these sorts of things, I’d love to hear about it.

At this point I’m sticking my neck out as a TDD and unit testing advocate studio-wide, but am reluctant to evangelize too strongly outside of my areas of expertise. I don’t think it’d be fair to the very talented people in those areas, and I also don’t want to be wrong, even if I know I’m right :) So I’d really like to hear about your experiences with TDD and unit testing in the games and graphics space, and on legacy codebases, because people are coming to me with questions and I don’t have good answers. I’d love to give them places to turn, articles to read, people to contact.

Thanks for any help.

10 thoughts on “TDD for legacy code, graphics code, and legacy graphics code?”

  1. I can heartily recommend the book “Working Effectively with Legacy Code” by Michael Feathers. The whole book is basically dedicated to getting legacy code under test. It’s a good read, even if it focuses on compiled languages (Java and C++ specifically).

  2. Adam Skutt says:

    You can’t do what you want, unless you have carte blanche to put the entire codebase under automated test and the specifications to do so. The odds of you having both are virtually nil. So first, you need to figure out what you want to accomplish and what you think you’ll gain from doing it. Without knowing how they already test their software and their development process, it’s hard to provide meaningful input.

  3. Adam, that is a chicken-and-egg scenario and something I’d like to skirt around. You’re right, it isn’t realistic to get the current codebase under test. What is realistic is to, first, understand conceptually how to TDD graphics, gameplay, and legacy code. Then, to educate the programmers about it (another area that would benefit from some anecdotes), get feedback, and start to map the conceptual onto the practical. So the question is actually easier than usual: whiteboard answers or pure anecdotes are more useful to me than in-depth, specific ones, because I don’t yet know enough about my own situation to judge such a specific or nuanced answer.

    Sebastian, thanks for the book reference.

  4. Adam Skutt says:

    That blog post covers most of the ideas I would have. This isn’t my area of expertise so I can’t offer anything more clever, unfortunately. However, I will say that I think the most important thing to consider is the cost involved in setting up such schemes and whether it’s worth it or not.

    For example, coding up a tool to render some static scenes and compare them to reference images probably isn’t all that difficult. However, figuring out where to set the pass/fail thresholds is pretty difficult and may not be possible. If the set of reference images ends up changing all of the time, then automating the comparison is largely pointless. If setting useful thresholds is too difficult, then a human will have to review the results all the time anyway.

    As a practical example, I write numerical analysis and DSP software professionally. Most of our code doesn’t have substantial amounts of automated tests around the core algorithms for the reasons I mentioned above. Doing the analysis to figure out acceptable thresholds for change due to floating-point error is very difficult and not worth our time. If we break something that badly, it’ll be plainly apparent when we do the manual testing.

    Even if those tests existed, we’d expect them to fail frequently when we changed the algorithms on purpose, since our goal is to improve performance in a statistically significant fashion. However, not all improvements would cause the tests to fail, since there’s no such thing as a singular representative dataset. Some changes will not improve all datasets.

    Since we’re constantly improving the algorithms (it’s a large part of what we’re paid to do), we’d end up changing the reference results and increasing their size all of the time. Some of our datasets are very large and are unsuitable for long-term usage as a reference dataset, especially if they only demonstrate a few cases.

    In the end, automating the analysis doesn’t buy me anything, as I’m going to have to do it by hand anyway in order to be confident about my code changes. What does provide me value is tools to make the analysis easy: load the dataset, perform the calculations, and show me the results so I can quickly draw my own conclusions and show them to others.

    So make sure whatever you do has a clear cost/benefit. A tool to generate images and compare them to reference images probably has lots of value. Extending the tool to tell you whether two images are “too different” may or may not have any value, depending on the circumstance.
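
    To make that concrete, a minimal sketch of that kind of reference-image check might look like the following (Pillow and NumPy assumed; the paths and the 2% tolerance are placeholders, not values from any real pipeline):

    ```python
    # Rough sketch: compare a rendered frame against a stored reference and
    # fail if the mean per-pixel difference exceeds a tolerance.
    import numpy as np
    from PIL import Image

    def images_match(rendered_path, reference_path, tolerance=0.02):
        rendered = np.asarray(Image.open(rendered_path).convert("RGB"), dtype=np.float32)
        reference = np.asarray(Image.open(reference_path).convert("RGB"), dtype=np.float32)
        if rendered.shape != reference.shape:
            return False
        # Mean absolute difference, normalized to the 0..1 range.
        diff = np.abs(rendered - reference).mean() / 255.0
        return diff <= tolerance

    # Example use (placeholder paths):
    # assert images_match("output/scene01.png", "reference/scene01.png")
    ```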

  5. Yup, that’s a good point, and we have the same situation in games with performance: we can’t really test it at the unit level, but we can track it over time and warn if we get a bad spike. People are expected to profile their individual performance-impacting changes by hand, since covering that via testing would be tricky if not impossible. We just track overall and system performance at a high level.
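
    A toy sketch of what “warn if we get a bad spike” could look like (the frame-time numbers, the rolling window, and the 20% threshold are all made up for illustration):

    ```python
    # Rough sketch: warn when the latest build's average frame time spikes
    # relative to a rolling baseline of recent builds.
    def check_for_spike(frame_times_ms, window=10, threshold=1.20):
        """frame_times_ms: average frame time per build, oldest first."""
        if len(frame_times_ms) <= window:
            return None
        baseline = sum(frame_times_ms[-window - 1:-1]) / window
        latest = frame_times_ms[-1]
        if latest > baseline * threshold:
            return "Frame time spiked: %.2fms vs %.2fms baseline" % (latest, baseline)
        return None

    print(check_for_spike([33.1, 33.4, 33.0, 33.2, 41.8], window=4))
    ```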

    Likewise, it’s important that the tests don’t become a burden, like the algorithm tests you describe that would need to change frequently, or reference datasets that grow too large. Those are two things our old unit tests failed at miserably, and one reason there’s some hesitance here toward resurrecting a focus on testing (brittle tests, and tests that ran on app startup and slowed everyone down).

  6. +1 for the “Working Effectively with Legacy Code” book. “Debug It!” might be another useful book (don’t let the title mislead you; it’s a lot about testing).

    One important thing when working with legacy code is to resist the temptation to make the whole thing testable – just change/make testable the part that you need to touch for the new feature/bugfix. Otherwise, you’ll think Sisyphus had it easy :)

    For a practical example, take a look at the “Double Dawg Dare” video/blog series at http://anarchycreek.com/2009/06/02/the-double-dawg-dare/ – even though it’s in Java, the concepts are applicable everywhere.

    If you are comfortable with polyglot programming, http://approvaltests.sourceforge.net/ might be something to look at – there are videos at the bottom. Or you could just use the approach of saving rendered images and diffing the actual vs. the expected in Python (there might even be a library to do GitHub-style image diffing).
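
    A minimal sketch of that save-and-diff approach, assuming Pillow (the file names are just placeholders):

    ```python
    # Rough sketch: write a difference image for a human to eyeball,
    # rather than producing an automated pass/fail verdict.
    from PIL import Image, ImageChops

    def write_visual_diff(actual_path, expected_path, diff_path):
        actual = Image.open(actual_path).convert("RGB")
        expected = Image.open(expected_path).convert("RGB")
        diff = ImageChops.difference(actual, expected)
        diff.save(diff_path)
        # getbbox() is None when the images are identical.
        return diff.getbbox() is not None

    # Example use (placeholder paths):
    # if write_visual_diff("actual.png", "expected.png", "diff.png"):
    #     print("Images differ; see diff.png")
    ```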

    HTH,

    Peter

  7. Erik Lundh says:

    Back in 2006 I had some teams working on full-scale simulator systems. Some of the front-end people claimed that they could not do TDD and other agile practices because visual tests were not possible to automate. It was my good fortune that I knew Thomas Akenine-Moller, one of the authors of the modern classic “Real-Time Rendering”. He sent me a paper about a tool created at DreamWorks that could do a “perceptual image diff” on two bitmaps that did not have to match bitwise; a lot of math and filtering reduced the comparison to *what is relevant to the eye*. And then DreamWorks released the code on SourceForge (pdiff.sf.net).
    However, when we offered the pdiff tool to the team that said “agile is impossible because…”, it turned out that they had just been using it as an excuse not to change their hero-programmer behavior. That was the real problem, back then, in that particular team.
    DreamWorks developed pdiff to monitor rendering pipelines and trigger alarms when rendering models went bad. They had generated milestone key frames that they could compare to actual rendering output, triggering alerts to whoever was on call.

    /Erik Lundh
    Still recovering from XP2012, the original annual agile conference, held in Europe since 2000

  8. Phillip says:

    Very informative post, and beneficial for all readers. We can work more effectively with large, untested legacy code bases. The average software project in our industry was written under some form of code-and-fix, without automated unit tests, and we can’t just throw that code away; it represents a significant investment in debugging and maintenance, and it contains many latent requirements decisions. Agile processes and adoption are incremental.
