Recently I was asked about use of hex and floating point literals (especially “E” notation) in the Blackbox data set: do beginners use them? I was intrigued enough to knock up a simple program to find out. My method is quite straightforward: I take the latest version of each source file which successfully compiled, run it through a Java lexer and pick out the literals. This gives us about 40 million source files to look at.
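For illustration, here is a minimal sketch of the "pick out the literals" step. It is not the program used for the numbers in this post: it assumes a lexer has already isolated each numeric literal token, and the class name LiteralKinds and its methods are hypothetical. It only sorts token text into the categories discussed below and does no validation.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: classifies the text of an already-lexed numeric literal.
class LiteralKinds {
    enum Kind { HEX_INT, OCTAL_INT, BINARY_INT, DECIMAL_INT, HEX_FLOAT, DECIMAL_FLOAT }

    static Kind classify(String token) {
        // Underscores and case do not change the kind of literal.
        String t = token.toLowerCase(Locale.ROOT).replace("_", "");
        if (t.startsWith("0x")) {
            // Java hex floats always carry a binary exponent marked with p.
            return t.contains("p") ? Kind.HEX_FLOAT : Kind.HEX_INT;
        }
        if (t.startsWith("0b")) {
            return Kind.BINARY_INT;
        }
        if (t.contains(".") || t.contains("e") || t.endsWith("f") || t.endsWith("d")) {
            return Kind.DECIMAL_FLOAT;
        }
        if (t.endsWith("l")) {
            t = t.substring(0, t.length() - 1); // drop the long suffix
        }
        // A leading zero followed by more digits means octal in Java.
        return (t.length() > 1 && t.startsWith("0")) ? Kind.OCTAL_INT : Kind.DECIMAL_INT;
    }

    // True for decimal literals written in E notation, e.g. 1e-6.
    static boolean usesExponent(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        return !t.startsWith("0x") && t.contains("e");
    }

    static boolean hasUnderscore(String token) {
        return token.indexOf('_') >= 0;
    }

    public static void main(String[] args) {
        // Made-up sample tokens, not taken from the Blackbox data.
        List<String> samples = List.of("0xFF", "0", "1_000_000", "017", "1e-6", ".5", "0x1.fe2p5");
        Map<Kind, Integer> tally = new EnumMap<>(Kind.class);
        for (String s : samples) {
            tally.merge(classify(s), 1, Integer::sum);
        }
        System.out.println(tally);
    }
}
```

The order of the checks matters: the 0x prefix has to be tested first, because hex digits include e, f and d, which would otherwise look like an exponent or a float suffix.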
Before we get into the results, here are some predictions I made about our data (where most users are assumed to be programming novices):
- Very few users use hex literals
- Most hex literals are 0xFF or similar bitmasks
- Almost no-one uses underscores (Java lets you write numbers with underscores, e.g. 1_000_000)
- Almost no-one uses E notation (and when they do, mainly 1e-6 for epsilon values in floating point comparison)
- Most floating point values are between 0 and 1
Let’s start with hexadecimal integer literals. There were 814,920 hex integer literals, compared to 29,044,559 decimal integer literals. So 2.7% of hex/decimal integer literals were in hex. (I didn’t bother going into octal, but there were a handful of uses. I suspect many of these were an accident.) That is a bit higher than I was expecting, admittedly. In terms of their value, here are the top five:
- 0xFF: frequency 89,663
- 0x0: frequency 52,732
- 0x30: frequency 16,742
- 0xF: frequency 16,009
- 0x1: frequency 13,799
There are two F bitmask values there as predicted. I was a bit surprised by how many zeroes and ones were in there: why write them as hex (0x0) and not just decimal (0)? My guess is that they are working with bitmasks nearby, and out of habit/consistency write the values as hex.
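To make the bitmask idea concrete, here is a small sketch of the kind of code where 0xFF and 0xF typically turn up. These are common Java idioms, not examples taken from the Blackbox data, and the class, method and constant names are made up for illustration.

```java
// Hypothetical examples of common masking idioms.
class MaskDemo {
    // Flag constants written in hex for consistency with the masks around them,
    // which is the habit guessed at in the text above.
    static final int NONE = 0x0;
    static final int READ = 0x1;
    static final int WRITE = 0x2;

    // Java bytes are signed; masking with 0xFF gives the value in 0..255.
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    // Lowest four bits of a value.
    static int lowNibble(int value) {
        return value & 0xF;
    }

    // Bits 4 to 7 of a value, shifted down.
    static int highNibble(int value) {
        return (value >> 4) & 0xF;
    }

    public static void main(String[] args) {
        System.out.println(toUnsigned((byte) -1));
        System.out.println(lowNibble(0xAB));
        System.out.println(highNibble(0xAB));
        System.out.println((READ | WRITE) & WRITE);
    }
}
```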
There’s not too much to say about decimal integer literals, but I will mention the most frequent items. It’s a sequence that runs as you might expect (zero being most frequent, and increasing numbers being less frequent), punctuated by some numbers which testify to computing’s love of powers of two. Most frequent first:
0, 1, 2, 255, 128, 3, 4, 5, 256, 8, 10, 7, 6, 100, 127, 16, 20, 1000, 9, 50
The frequency falls off roughly like a power law (1 is half as frequent as 0, 2 is half as frequent as 1, then the tail begins to flatten out).
Underscores are a relatively recent addition to Java (added in Java 7) and little-known. Indeed, only 692 decimal literals had underscores: 0.002% of all decimal literals. Oddly, 737 hex literals had underscores, which as a proportion is much higher: 0.09%. I suspect this is because underscores and hex literals are both used by more advanced users. Generally though, our users are clearly not making much use of this underscore feature.
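For anyone who has not met the feature, here is a quick sketch of what underscores look like in both decimal and hex literals. The class and constant names are made up for illustration. The underscores are purely visual: the compiler ignores them.

```java
// Hypothetical constants showing underscores as digit separators (Java 7 and later).
class UnderscoreDemo {
    static final int MILLION = 1_000_000;       // same value as 1000000
    static final int MAX_INT = 0x7FFF_FFFF;     // same value as Integer.MAX_VALUE
    static final long LOW_32_BITS = 0xFFFF_FFFFL; // long literal, grouped in 16-bit halves

    public static void main(String[] args) {
        System.out.println(MILLION == 1000000);
        System.out.println(MAX_INT == Integer.MAX_VALUE);
        System.out.println(LOW_32_BITS);
    }
}
```

Grouping hex digits in fours or twos maps neatly onto 16-bit words or bytes, which may be part of why the feature is proportionally more popular in hex.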
Decimal Floating Point
There were 1,791,915 floating point decimal literals. Of these, only 3,002 used the “E” notation (e.g. 1.15E12): 0.17%. Clearly not a widely used feature. As for their values, the top five were: 1e-3, 1e-8, 1e-6, 1e6, 1e-20. I’d say my prediction about the use for epsilon values was borne out.
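By "epsilon values" I mean the tolerance used when comparing floating point numbers. A minimal sketch of that idiom follows. The class name, the method name and the choice of 1e-6 as the tolerance are assumptions for illustration, and a fixed absolute tolerance like this is only appropriate when the values being compared are of a known, modest magnitude.

```java
// Hypothetical example of comparing doubles with an epsilon written in E notation.
class EpsilonDemo {
    static final double EPSILON = 1e-6; // assumed tolerance, i.e. 0.000001

    // True when a and b differ by less than EPSILON.
    static boolean nearlyEqual(double a, double b) {
        return Math.abs(a - b) < EPSILON;
    }

    public static void main(String[] args) {
        double sum = 0.1 + 0.2;
        System.out.println(sum == 0.3);            // exact comparison
        System.out.println(nearlyEqual(sum, 0.3)); // tolerant comparison
    }
}
```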
Regardless of notation, across all floating point decimal literals, the most frequent values were: 1.0, 0.0, 100.0, 3.0, 2.0. Technically, my prediction that most values were between zero and one was almost correct: 47% of values were between zero and one. But really, this is only because 23% of them were zero or one. As a last side note on these literals: 7,130 (0.40%) started with a dot (e.g. “.5”), something we disallowed in Stride due to the awkwardness of parsing that in expressions. But actually we could have banned E notation (also a pain) with less immediate impact.
Hexadecimal Floating Point
If you even knew that hexadecimal floating point notation was a thing in Java, then give yourself a pat on the back. Added in Java 5, they look like “0x1.fe2p5”, where p takes the place of the usual “E” notation because E is of course a valid hex digit. I only know about this because we have a parser in BlueJ, which does accept these. I found precisely four uses of this notation, which is probably more than I expected.
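To show how the notation reads, here is a small sketch that evaluates the example literal by hand. The digits after the point are hex fractions (sixteenths, 256ths and so on) and the number after p is a power of two, not ten. The class and method names are made up for illustration.

```java
// Hypothetical demo: a hex floating point literal and the same value built by hand.
class HexFloatDemo {
    static final double LITERAL = 0x1.fe2p5;

    // 1.fe2 in hex is 1 + 15/16 + 14/256 + 2/4096; p5 multiplies by 2^5.
    static double byHand() {
        double mantissa = 1 + 15 / 16.0 + 14 / 256.0 + 2 / 4096.0;
        return mantissa * 32;
    }

    public static void main(String[] args) {
        System.out.println(LITERAL);
        System.out.println(byHand() == LITERAL);
        // Double.toHexString prints a double back in this notation.
        System.out.println(Double.toHexString(LITERAL));
    }
}
```

The appeal of the format is that it names the exact bits of the value, so nothing is lost converting to and from decimal.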
This is a pretty cursory look at literals with a fairly crude methodology. Note that although we only looked at the latest version of each source file, source files in Blackbox are not independent of each other (e.g. if a teacher gives out a project with a floating point literal, that will show up identically in each student’s copy). For example, the four hex floating point literals were the same value, suggesting they are not independent. And on a related note, I’ve counted source files regardless of whether they come from the same user or not, so we’re only measuring source occurrences here, not the number of users who use a particular notation. But I think our N is high enough that individual users cannot tilt the statistics.