Archive of articles classified as' "testing"

Back home

Behavioral testing is the bee’s knees


I have been doing Test Driven Development (TDD) with xUnit-based frameworks (like unittest and NUnit) for a number of years now, and started with RSpec in Ruby and Jasmine in JavaScript a few months ago. RSpec and Jasmine are behavioral testing frameworks that facilitate Behavioral Driven Development (BDD). BDD is really no different from “normal” TDD except for the frameworks used. BDD frameworks facilitate a different way of designing tests.

Behavioral testing is awesome and makes xUnit-style testing seem barbaric and uncivilized in comparison. I didn’t see the big deal for a long time. My xUnit style tests served me just fine. But now I’m hooked.

If you are a TDD person (or do any unit testing), I encourage you to try out BDD. You should get the hang of it quickly. It can take a little while to learn BDD idioms, but once you start doing things like custom matchers, it’s absolutely amazing.

In future posts I’ll try to discuss more of the differences between TDD and BDD, but for now I just provide a hearty endorsement of BDD and frameworks like RSpec and Jasmine.


Two weeks is the worst sprint length


Mike Cohn over at Mountain Goat Software says this in My Primary Criticism of Scrum:

In today’s version of Scrum, many teams have become overly obsessed with being able to say they finished everything they thought they would. This leads those teams to start with the safe approach. Many teams never try any wild ideas that could lead to truly innovative solutions.

I believe a great deal of this is the result of the move to shorter sprints, with two weeks being the norm these days. Shorter sprints leave less time to recover if a promising but risky initial approach fails to pan out.

Most teams I have joined have been working in two week sprints, and it has always seemed like the wrong length. In fact, at CCP, I championed “anything but two weeks.” I saw teams unable to get features done and tested in two weeks, which caused teams to not take the “shippable increment” idea very seriously. Nearly every sprint was a failed sprint in one way or another.

My preferred solution was to move to one week sprints. I’ve written about one week sprints before. If Agile is all about feedback, then halving your sprint length gives you double the feedback and chances to improve. You can try something out for a week, find it doesn’t work, and move on without putting a release at risk. A sprint twice as long carries twice the risk. I feel this, more than longer sprints, much better addresses Mike’s goal to get back to creativity. That said, it requires an intense commitment from the team to adopt Agile and eXtreme Programming development practices, such as through automated testing, shared code ownership, and more. It also requires a team that can function and perform well, and someone to coach. Without these things, one week sprints will fail.

My backup solution, when I feel a team isn’t yet ready to go through the transformation one-week sprints bring about, is to move to longer sprints. This does not encourage the behavior I ultimately want to see, but at least it allows more experimentation. More importantly, teams can take the “shippable increment” ideal seriously.

This reasoning can be taken to absurdity, but avoid it. I once had someone seriously suggest two releases a year was too often, and we should go down to one release a year, so we could make sure it was high quality and bug free. These people just need a reminder of the history of software (and often their own company). Even stuff like printer firmware can be iterated on less than 4 weeks.

In the end, I still prefer one week sprints, which isn’t surprising as I’m an XP advocate. But if a team isn’t able or willing to move to one week sprints, don’t default to two weeks.

1 Comment

A short letter to a unit testing newcomer


One of my friends asked how to get started with unit testing and Test Driven Development and figured I could write a short post. I also mention TDD a few times in my book so I think it could use some more attention.

I got started with unit testing when I wrote a smallish Python project, and realized I couldn’t maintain large Python projects without tests*. So I wrote unit tests for my code. I then quickly got into TDD. I don’t remember what resources I used, but there are plenty.

I’d encourage you to start unit testing by writing your code first, then the tests. As soon as this becomes easy and is paying off, start writing the tests first (TDD). You will come to appreciate that this is a powerful way of thinking and working. Even if you don’t ultimately stick with this approach, it is worth becoming skilled with it. You will uncover many suboptimal areas in your coding process that are otherwise extremely difficult to find and fix.

Keep in mind that learning TDD isn’t like already knowing Mercurial, then reading a book about Git**, and then being skilled with Git because you are skilled with Mercurial. You are embarking on a long journey, and will need to refer to many tutorials, blogs, and books. You will do some TDD, and look back on that code and process in a year and be horrified, just like you were when you started programming. Do not think of unit testing and TDD like learning a new framework or library. Think of it like learning how to program all over again.

So I don’t know exactly where to get started, only that you must, and keep going once you do start.

* I’d soon realize that no project was really maintainable without tests.

** Pro Git is my favorite Git book, by the way.

1 Comment

Mike Bland’s profound analysis of the Apple SSL bug


Mike Bland has done a series of articles and presentations about the Apple SSL bug over on his blog. To be frank, it’s pretty much the only good coverage of the bug I’ve seen. He’s submitted an article to the ACM; crossing my fingers it gets printed, because it’s an excellent analysis. I’ve been frustrated it hasn’t gotten more traction.

Mike’s take is that this bug could have easily been caught with simple unit testing. It’s sad this has been so overlooked. We gobble up the explanations that do not force us to confront our bad habits. We can point and laugh at the goto fail (“hahaha, never use goto!”), or just shrug at the merge fail (“we’ve all mistakes merging”). It’s much more difficult to wrestle with the fact that this error- like most- is because we are lazy about duplicating code and writing tests.

I have no problem categorically saying anyone who doesn’t blame the SSL bug on lack of automated testing and engineering rigor is making excuses. Not for Apple but for his or herself.

No Comments

Using code metrics effectively


As the lead of a team and then a director for a project, code metrics were very important to me. I talked about using Sonar for code metrics in my previous post. Using metrics gets to the very core of what I believe about software development:

  • Great teams create great software.
  • Most people want to, and can be, great.
  • Systematic problems conspire to cheat people into doing less than their best work.

It’s this last point that is key. Code metrics are a conversation starter. Metrics are a great way to start the conversation that says, “Hey, I notice there may be a problem here, what’s up?” In this post, I’ll go through a few cases where I’ve used metrics effectively in concrete ways. This is personal; each case is different and your conversations will vary.

Helping to recognize bad code

A number of times, I’ve worked with one of those programmers who can do amazing things but write code that is unintelligible to mortal minds. This is usually a difficult situation, because they’ve been told what an amazing job they’ve been doing, but people who have to work with that code know otherwise. Metrics have helped me have conversations with these programmers about how to tell good code apart from bad.

While we may not agree what good code is, we all know bad code when we see it, don’t we? Well, it turns out we don’t. Or maybe good code becomes bad code when we aren’t looking. I often use cyclomatic complexity (CC) as a way to tell good code from bad. There is almost never good code with a high CC. I help educate programmers about what CC is and how it causes problems, giving ample references for further learning. I find that because metrics have a basis in numbers and science, they can counteract the bad behaviors some programmers have that are continually reinforced because they get their work done. These programmers cannot argue against CC, and without exception have no desire to. They’re happy to have learned how they can keep themselves honest and write better code.

It’s important to help these programmers change their style. I demonstrate basic strategies for reducing CC. Usually this just means helping them split up monolithic functions or methods. Eventually I segue into more advanced techniques. I’ve seen lightbulbs go off, and people go from writing monolithic procedures to well-designed functions and classes, just because of a conversation based in code metrics and followup mentoring.

I use CC to keep an eye on progress. If the programmer keeps writing code with high CC, I have to work harder. Maybe we exclusively pair until they can stand on their own feet again. Bad code is a cancer, so I pay attention to the CC alarm.

Writing too much code

A curious thing happens in untested codebases: code grows fast. I think this happens because the code cannot be safely reused, so people copy and paste with abandon (also, the broken windows theory is alive and well). I’ve used lines of code (LoC) growth to see where it seems too much code is being written. Maybe a new feature should grow a thousand lines a week (based on your gut feeling), but if it grows 3000 lines for the last few weeks, I must investigate. Maybe I learn about some deficiency in the codebase that caused a bunch of code to be written, maybe I find a team that overlooked an already available solution, maybe I find someone who copy and pasted a bunch of stuff because they didn’t know better.

Likewise, bug fixing and improvements are good, so I expect some growth in core libraries. But why are a hundred lines a week consistently added to some core library? Is someone starting to customize it for a single use case? Is code going into the right place, do people know what the right place is, and how do they find out?

LoC change is my second favorite metric after CC, especially in a mature codebase. It tells me a lot about what sort of development is going on. While I usually can’t pinpoint problems from LoC like I can with CC, it does help start a conversation about the larger codebase: what trends are going on, and why.

Tests aren’t being written

A good metrics collection and display will give you a very clear overview on what projects or modules have tests and which do not. Test count and coverage numbers and changes can tell you loads about not just the quality of your code, but how your programmers are feeling.

If coverage is steadily decreasing, there is some global negative pressure you aren’t seeing. Find out what it is and fix it.

  • Has the team put themselves into a corner at the end of the release, and are now cutting out quality?
  • Is the team being required to constantly redo work, instead of releasing and getting feedback on what’s been done? Are they frustrated and disillusioned and don’t want to bother writing tests for code that is going to be rewritten?
  • Are people writing new code without tests? Find out why, whether it’s due to a lack of rigor or a lack of training. Work with them to fix either problem.
  • Is someone adding tests to untested modules? Give them a pat on the back (after you check their tests are decent).

Driving across-the-board change

I’ll close with a more direct anecdote.

Last year, we ‘deprecated’ our original codebase and moved new development into less coupled Python packages. I used all of the above techniques along with a number of (private) metrics to drive this effort, and most of them went up into some visible information radiators:

  • Job #1 was to reduce the LoC in the old codebase. We had dead code to clean up, so watching that LoC graph drop each day or week was a pleasure. Then it became a matter of ensuring the graph stayed mostly flat.
  • Job #2 was to work primarily in the new codebase. I used LoC to ensure the new code grew steadily; not too fast (would indicate poor reuse), and not too slow relative to the old codebase (would indicate the old codebase is being used for too much new code).
  • Job #3 was to make sure new code was tested. I used test count and coverage, both absolute numbers and of course growth.
  • Job #4 was to make sure new code was good. I used violations (primarily cyclomatic complexity) to know when bad code was submitted.
  • Job #5 was to fix the lowest-hanging debt, whether in the new or old codebase. Sometimes this was breaking up functions that were too long, more often it was merely breaking up gigantic (10k+ lines) files into smaller files. I was able to look at the worst violations to see what to fix, and work with the programmers on fixing them.

Aside from the deleting of dead code, I did only a small portion of the coding work directly. The real work was done by the project’s programmers. Code metrics allowed me to focus my time where it was needed in pairing, training, and mentoring. Metrics allowed the other programmers to see their own progress and the overall progress of the deprecation. Having metrics behind us seemed to give everyone a new view on things; people were not defensive about their code at all, and there was nowhere to hide. It gave the entire effort an air of believably and achievability, and made it seem much less arbitrary that it could have been.

I’ve used metrics a lot, but this was certainly the largest and most visible application. I highly suggest investing in learning about code metrics, and getting something like Sonar up on your own projects.

1 Comment

Using Sonar for static analysis of Python code


I’ve been doing static analysis for a while, first with C# and then with Python. I’ve even made an aborted attempt at a Python static code quality analyzer (pynocle, I won’t link to it because it’s dead). About a year ago we set up Sonar ( to analyze the Python code on EVE Online. I’m here to report it works really well and we’re quite happy with it. I’ll talk a bit about our setup in this post, and a future post will talk more about code metrics and how to use them.

Basic Info and Setup

Sonar consists of three parts:

  • The Sonar web interface, which is the primary way you interact with the metrics.
  • The database, which stores the metrics (Sonar includes a demonstration DB, production can run on any of the usual SQL DBs).
  • The Sonar Runner, which analyzes your code and sends data to the database. The runner also pulls configuration from the DB, so you can configure it locally and through the DB.

It was really simple to set up, even on Windows. The web interface has some annoyances which I’ll go over later, and sometimes the system has some unintuitive behavior, but everything works pretty well. There are also a bunch of plugins available, such as for new widgets for the interface or other code metrics checks. It has integrations with many other languages. We are using Sonar for both C++ and Python code right now. Not every Sonar metric is supported for Python or C++ (I think only Java has full support), but enough are supported to be very useful. There are also some worthless metrics in Python that are meaningful in Java, such as lines in a file.

The Sonar Runner

I’ll cover the Runner and then the Website. Every night, we have a job that runs the Runner over our codebase as a whole, and each sub-project. Sonar works in terms of “projects” so each code sub-project and the codebase as a whole have individual Sonar projects (there are some misc projects in there people manage themselves). This project setup gives higher-level people the higher-level trends, and gives teams information that is more actionable.

One important lesson we learned was, only configure a project on the runner side, or the web site. An example are exclusions: Sonar will only respect exclusions from the Runner, or the Web, so make sure you know where things are configured.

We also set up Sonar to collect our Cobertura XML coverage and xUnit XML test result files. Our Jenkins jobs spit these out, and the Runner needs to parse them. This caused a few problems. First, due to the way files and our projects were set up, we needed to do some annoying copying around so the Runner could find the XML files. Second, sometimes the files use relative or incomplete filenames, so parsing of the files could fail because the Python code they pointed to was not found. Third, the parsing errors were only visible if you ran the Runner with DEBUG and VERBOSE, so it took a while to track this problem down. It was a couple days of work to get coverage and test results hooked into Sonar, IIRC. Though it was among the most useful two metrics and essential to integrate, even if we already had them available elsewhere.

The Sonar Website

The Website is slick but sometimes limited. The limitations can make you want to abandon Sonar entirely :) Such as the ability to only few metrics for three time periods; you cannot choose a custom period (in fact you can see the enum value of the time period in the URL!). Or that the page templates cannot be configured differently for different projects (ie, the Homepage for the ‘Entire Codebase’ project must look the same as the Homepage for the ‘Tiny Utility Package’ project). Or that sometimes things just don’t make sense.

In the end, Sonar does have a good deal of configuration and features available (such as alerts for when a metric changes too much between runs). And it gets better each release.

The Sonar API

Sonar also has an API that exposes a good deal of metrics (though in traditional Sonar fashion, does not expose some things, like project names). We hook up our information radiators to display graphs for important trends, such as LoC and violations growth. This is a huge win; when we set a goal of deleting code or having no new violations, everyone can easily monitor progress.


If you are thinking about getting code metrics set up, I wholeheartedly recommend Sonar. It took a few weeks to get it to build up an expertise with it and configure everything how we wanted, and since then it’s been very little maintenance. The main struggle was learning how to use Sonar to have the impact I wanted. When I’ve written code analysis tools, they have been tailored for a purpose, such as methods/functions with the highest cyclomatic complexity. Sonar metrics end up giving you some cruft, and you need to separate the wheat from the chaff. Once you do, there’s no beating its power and expansive feature set.

My next post will go into more details about the positive effects Sonar and the use of code metrics had on our codebase.


Teaching TDD: The importance of expectations


I read this interesting article from Justin Searls about the failures of teaching TDD, and his proposed improvements. Uncle Bob wrote up an excellent response on how Justin’s grievances are valid but his solutions misguided.

Justin’s article included an excellent image which he calls the “WTF now, guys?” learning curve. I think it sums up the problems with teaching TDD perfectly, and its existence is beyond dispute. I’ll call the gap in learning curve the WTF gap.

WTF now, guys?  learning curve

Uncle Bob, in his post linked above, touches on a very important topic:

Learning to refactor is a hill that everyone has to climb for themselves. We, instructors, can do little more than make sure the walking sticks are in your backpack. We can’t make you take them out and use them.

Certainly, giving a student the tools to do TDD is essential; but it is equally essential to stress the inadequacies of any tool an instructor can provide. A walking stick is not enough. A good instructor must lay down a set of expectations about TDD for students, managers, and organizations so they can deal with the frustration that arises once the WTF gap is hit.

First expectation: TDD describes a set of related and complicated skills. TDD involves learning the skills of writing code, test design, emergent design, refactoring, mocking, new tools, testing libraries, and more. You cannot teach all of these in a week, or even on their own. Instructors must introduce them via TDD, slowly. Until the student has proficiency with all of these topics, they cannot get over the WTF gap. This fundamentally changes how TDD must be practiced; students need a mentor to pair with them on work assignments when new TDD skills are introduced. If an organization or team is starting TDD, they need to have a mentor, or budget considerable time for learning.

Second expectation: Your early TDD projects will suck. Just like you look back in horror on your projects from when you first learned programming, or learned a new language, your first projects you use TDD on will have lots of problems. Tests that are slow, brittle, too high-level, too granular, code that has bugs and is difficult to change. This is normal. The expectation must be set that the first couple libraries or projects you use TDD for are going to be bad; the goal is to learn from it. The WTF gap is real; people must expect it and persevere past it.

Third expectation: If your organization isn’t supportive, you will fail. If you are one lonely person using TDD on a team of people who are not, you will fail. If you are one lonely team trying to use TDD in part of a larger codebase where others are not, you will fail. If your organization sets ridiculous deadlines and does not allow you to learn the TDD skillset over time, being slower initially, you will fail. If you want to do TDD and are not enabled, go somewhere you will be, budget time for changing the culture, or live with endless frustration and broken dreams. You need help to bridge the WTF gap.

Fourth expectation: TDD for legacy code is considerably more difficult. Learning environments are pristine; our codebases are not. There is a different set of strategies for working with legacy code (see Michael Feathers’ book) that require a pretty advanced TDD skillset. Beginners should stay away; use TDD for new things and do not worry about refactoring legacy code at first. At some point, pulling apart and writing tests for tangled systems will get easier, and may even become a hobby. The WTF gap is big enough; don’t make it more difficult than it need be by involving legacy code.

My feeling on teaching TDD is that no matter how you teach it, whether with some of Justin’s flawed ideas or Bob’s proven ones, you need to set proper expectations for students, managers, and teams.

No Comments

TDD via Tic-Tac-Toe


For the last few years, I’ve done a fair bit of teaching of TDD, usually 1-on-1 during pair programming sessions but also through workshops. I’ve tried out lots of different subject matter for teaching TDD, but my favorite has been Tic-Tac-Toe (or whatever your regional variation of it is). It has these benefits:

  1. It’s a game. People like programming a game. Learning through games is good. It also reinforces the idea of ‘testability.’ A number of times I had students combine the ‘playing’ of the game with the ‘logic’ of the game. Teaching them to split out the algorithm from the input/driver was a useful exercise.
  2. It’s not trivial. It is a problem with some real meat that can be solved in a number of different ways. It’s significant enough that a real design and architecture emerges, unlike TDD for a single function. A a bonus, the diverse answers make it fun to teach because I keep being led to new solutions by students.
  3. The rules are just right. Simple and clear. It has a clear ‘happy path’ and ‘unhappy path’.
  4. It’s bounded. You can know when it’s good enough, but there are also endless exceptions to check if you want. Which is a great way to teach people to know when something is complete enough.
  5. It has no dependencies. I prefer not to introduce people to TDD and mocking at the same time.

In the workshops I ran, we did a ‘mob programming’ of a Tic-Tac-Toe game, and then students paired up to develop an AI. While the AI was fun, it was developing the game that was probably a better exercise. And like I mentioned already, I’ve done lots of introductory TDD pairing sessions using it, recently with someone interviewing here which is when I think the Tic-Tac-Toe game really proved itself a successful subject matter. I highly suggest you give Tic-Tac-Toe a try if you’re looking for a way to demonstrate/teach TDD to someone.

If you’re interested, the code and slides I created for the workshops is here: I may make a blog post in the future with some more detail about how the workshops went, if there’s any interest.


The perils of bunch/bundle?


I had some discussion on Friday about static vs. dynamic typing (grass is always green, eh?), and our frequent use of the “dictionary with attribute-like notation” object came up (commonly called a “bunch” or “bundle”). A number of people were saying how bad/unsafe it is. Not convinced, I brought the source code up on the projector and asked people: “If I were to set the ‘__’ key, what would happen?” My thinking was that, if this is a data container like a dictionary, what happens when the data conflicts with non-data keys?

A room full of programmers, and no one could answer (actually I still don’t know, nor do I remember if it raised a KeyError and AttributeError in the appropriate circumstances, but I’d guess not). I pointed out the reason this type of object is a problem is because, even though we have several thousand uses of it and it’s been around for a decade, how it works isn’t clear or documented. Static typing doesn’t change that. Bad code is bad code and there’s heaps of bad code in wonderful statically typed languages.

It so happens we have several more implementations of the same object. The only one that has documented edge case behavior is in probably the highest quality area of the codebase. Incidentally, it is only used during testing. So maybe bunches aren’t that useful after all…

1 Comment

Maya and PyCharm


Recently at work I had the opportunity of spending a day to get Maya and PyCharm working nicely. It took some tweaking but we’re at a point now where we’re totally happy with our Maya/Python integration.

Remote Debugger: Just follow the instructions here: We ended up making a Maya menu item that will add the PyCharm helpers to the syspath and start the pydev debugger client (basically, the only in-Maya instructions in that link).

Autocomplete: To get autocomplete working, we just copied the stubs from “C:\Program Files\Autodesk\Maya2013\devkit\other\pymel\extras\completion\py” and put them into all of the subfolders under: “C:\Users\rgalanakis\.PyCharm20\system\python_stubs” (obviously, change your paths to what you need). We don’t use the mayapy.exe because we do mostly pure-python development (more on that below), and we can’t add the stubs folder to the syspath (see below), so this works well enough, even if it’s a little hacky.

Remote Commands: We didn’t set up the ability to control Maya from PyCharm, we prefer not to develop that way. Setting that up (commandPort stuff, PMIP PyCharm plugin, etc.)  was not worth it for us.

Copy/Paste: To get copy/paste from PyCharm into Maya working, we followed the advice here: Note, we made a minor change to the code in the last post, because we found screenshots weren’t working at all when Maya was running. We early-out the function if “oldMimeData.hasImage()”.

And that was it. We’re very, very happy with the result, and now we’re running PyCharm only (some of us were using WingIDE or Eclipse or other stuff to remote-debug Maya or work with autocomplete).


A note about the way we work- we try to make as much code as possible work standalone (pure python), which means very little of our codebase is in Maya. Furthermore we do TDD, and we extend that to Maya, by having a base TestCase class that sends commands to Maya under the hood (so the tests run in Maya and report their results back to the pure-python host process). So we can do TDD in PyCharm while writing Maya code. It is confusing to describe but trivial to use (inherit from MayaTestCase, write code that executes in Maya, otherwise treat it exactly like regular python). One day I’ll write up how it works in more depth.