One of my favourite parts of research, probably because I don’t get to do it often, is trying to come up with a graph or similar visualisation to display some results. You often have a lot of choice in how you could visualise some data; the art is in trying to find the best way to convey the interesting aspect of the results so that it’s obvious to the reader’s gaze. I thought it might be interesting to demonstrate this through an example. In one strand of our work, we look at the frequency of different programming mistakes in our Blackbox data set. We’ve recently added a location API for the data so, as a descriptive statistic, I wanted to see if the frequency differed by geography. To that end, I produced a table of mistake (A-R) versus continent:
The data is already normalised by total number of mistakes for each continent, so each column will sum to 1. One option would be simply to use the table to portray the data. This is often the best option for single-column data, or for small tables (e.g. 4×4). But it’s too difficult to get any sense of the pattern from that table. So we need a graph of some sort. The particular challenge with this data is that although it’s only a two-dimensional table, both dimensions are categorical, with no obvious ordering available.
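The normalisation mentioned above is simple to express in code. Here is a minimal sketch, assuming the raw data is a count per mistake per continent; the continent names, mistake labels and numbers are made up purely for illustration and are not the Blackbox figures.

```python
# Hypothetical raw counts (not real Blackbox data): continent -> mistake -> count.
counts = {
    "Europe": {"A": 50, "B": 30, "C": 20},
    "Asia": {"A": 10, "B": 5, "C": 5},
}


def normalise(column):
    """Divide each count by the column total, so the column sums to 1."""
    total = sum(column.values())
    return {mistake: n / total for mistake, n in column.items()}


# One normalised column per continent.
proportions = {continent: normalise(col) for continent, col in counts.items()}
```

After this step each continent's column is comparable regardless of how much data it contributed, which is exactly why the amount of data per continent is no longer visible in the table.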
Since each column sums to one, we could use a stacked bar chart:
I don’t like this option, though, for several reasons. Eighteen different segments stacked in each bar is too visually noisy, and we still need to add labels, too. Stacked bar charts also make comparison hard: is there much of a difference in the proportion of C (the light blue segment, third from bottom) between areas? It’s too hard to tell when stacked.
If stacked doesn’t work, it’s tempting to try a clustered bar chart, with the bars for each area alongside each other instead. That looks like this:
This time, you can see the difference in the proportion of C between areas. But… it’s still too noisy. It’s too difficult to take in at a glance, and annoyingly fiddly to look at each cluster in turn and check the proportions. You actually get a reasonable sense of difference between mistakes (A-R) rather than between areas. You could thus try to cluster in the other dimension, giving you six clusters of eighteen rather than eighteen clusters of six:
A first obvious problem is that the colours repeat. It sounds like a simple fix to just pick more colours, but picking eighteen distinct colours would be very difficult. And even then, comparing one set of eighteen bars to another to look for differences is just too fiddly.
So maybe we should forget the bars. An alternative is to use line graphs. Now, technically, you shouldn’t use line graphs for this data because a line typically indicates continuous data rather than categorical. But I think this should be considered a strong guideline rather than an absolute rule; sometimes with difficult data, it’s about finding the most reasonable (or “least worst”) approach when nothing is ideal. So here’s a line graph with mistakes on the X and proportions on the Y:
Not really very clear; any differences at a given point are obscured by the lines rapidly merging again before and after. Also, I don’t like the mistakes on the X axis like this, as it does strongly suggest some sort of relation between adjacent items. Maybe it would help to make it radial:
Ahem. Let’s just forget I considered making it radial, alright? Looking back at the previous line graph, I said one issue was that the lines jump around too much, merging after a difference. We could try to alleviate this by sorting the mistakes by overall frequency (note the changed X axis):
Once the labels get added, it’s probably the one I like most so far. You at least get a sense, for the most frequent mistakes, of some difference in the pattern between the continents (one line per continent).
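The sorting step behind that graph can be sketched roughly as follows. I’m assuming here that “overall frequency” means the sum of a mistake’s proportions across continents (other definitions, such as pooled raw counts, would be equally defensible), and the numbers are again invented for illustration.

```python
# Hypothetical normalised proportions: continent -> mistake -> proportion.
proportions = {
    "Europe": {"A": 0.2, "B": 0.5, "C": 0.3},
    "Asia": {"A": 0.1, "B": 0.6, "C": 0.3},
}


def order_by_overall_frequency(proportions):
    """Return mistake labels sorted from most to least frequent overall.

    'Overall' is taken (as an assumption) to be the sum of the
    per-continent proportions for each mistake.
    """
    totals = {}
    for column in proportions.values():
        for mistake, p in column.items():
            totals[mistake] = totals.get(mistake, 0.0) + p
    return sorted(totals, key=lambda m: totals[m], reverse=True)


# The order in which mistakes would be laid out along the X axis.
x_order = order_by_overall_frequency(proportions)
```

Ordering the X axis this way gives the otherwise arbitrary categorical axis a meaning, which makes the line graph a little easier to defend.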
At this point, we’ve exhausted the bar and line graph possibilities, with either mistake or continent on the X axis, a different bar/line for the other variable, and proportion always on the Y axis. A final possibility I’m going to consider is a scatter graph. In this case, we can use continent as the X axis, and use the mistake label as the point itself on the scatter graph, with the proportion again on the Y:
At first glance, it’s a bit of a mess. Several of the letters overlap. But as it happens, I’m less interested in these. Minor differences in the low frequency mistakes are probably not that interesting for this particular data. The difference between the most frequent mistakes is, I think, more clearly displayed in this graph than in any of the others.
There is one finishing touch to add. One issue with normalising the data by total frequency for that continent is that it obscures the fact that we have quite different amounts of data for each continent. Happily, this scatter graph gives us the opportunity to scale the size of the points to match how much data we have for that continent. (There are two ways to do this: scale the height of each character by amount of data, or the area of each character by amount of data. Having tried both, the latter seemed to be a better choice.) That leaves us with this:
Easy to see where most of our data comes from! I may yet add some colour, but really the final decision is: is this interesting enough to include in a paper, or should I just scrap my work on this? Such is research.
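For anyone wanting to try the point scaling themselves, the height-versus-area choice comes down to whether the character size grows linearly or with the square root of the amount of data. A minimal sketch, where the function names and the maximum size of 40 are my own illustrative choices rather than anything from our actual plotting code:

```python
import math


def font_size_by_height(amount, max_amount, max_size=40.0):
    """Character height proportional to the amount of data."""
    return max_size * amount / max_amount


def font_size_by_area(amount, max_amount, max_size=40.0):
    """Character area proportional to the amount of data.

    Area grows with the square of the height, so the height has to
    grow with the square root of the amount.
    """
    return max_size * math.sqrt(amount / max_amount)


# A continent with a quarter of the largest continent's data:
small_by_height = font_size_by_height(25, 100)  # a quarter of the height
small_by_area = font_size_by_area(25, 100)      # half the height, a quarter of the area
```

Scaling by height makes the continents with less data shrink much faster, which is one plausible reason the area version reads better.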