The Importance of Types

This is a post of two halves. I will start by explaining why I think types are so useful in professional programming, and then later discuss their place in learning to program.

I ♥ Types

From a software engineering viewpoint, I am a strong proponent of types. Unfortunately, most non-functional programming languages have included types in a fairly lacklustre way. When I say that types are useful, people may think about the distinction between integers, floats and strings (classic question: what type is a telephone number?), or the distinction between different record/class types. These are very useful distinctions but they are relatively basic uses of types. Let’s explore some more powerful uses of types.

NASA’s Mars Climate Orbiter project famously crashed due to a mixup in the units being used. The compiler did not complain because all the numbers involved were typed as floating point numbers, and you are permitted to manipulate them as you please. However, the Orbiter failure should be seen as a typing failure. The F# language has a units feature that can prevent such mixups. One number can be typed as a float<meters>. Another can be typed as a float<seconds>. Divide the former by the latter and you have a float<meters/seconds>. If you try to add that to an acceleration number (typed as float<meters/seconds^2>) then you will get a compile error. This cleverer, more thorough use of types begins to illustrate the benefits.
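As a rough illustration of the idea, here is a minimal Python sketch of a number that carries its units along with it. Unlike F#, Python can only perform this check at run time, so this shows the principle rather than the compile-time guarantee. The `Quantity` class and its representation of units as exponents are assumptions made for this example; they are not part of F# or of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    """A number tagged with unit exponents (hypothetical helper for this sketch)."""
    value: float
    metres: int = 0   # exponent of metres in the unit
    seconds: int = 0  # exponent of seconds in the unit

    def __truediv__(self, other: "Quantity") -> "Quantity":
        # Dividing subtracts unit exponents: m / s gives m^1 s^-1.
        return Quantity(self.value / other.value,
                        self.metres - other.metres,
                        self.seconds - other.seconds)

    def __add__(self, other: "Quantity") -> "Quantity":
        # Adding only makes sense when both sides have identical units.
        if (self.metres, self.seconds) != (other.metres, other.seconds):
            raise TypeError("cannot add quantities with different units")
        return Quantity(self.value + other.value, self.metres, self.seconds)


distance = Quantity(100.0, metres=1)
duration = Quantity(20.0, seconds=1)
speed = distance / duration                          # metres per second
acceleration = Quantity(9.8, metres=1, seconds=-2)   # metres per second squared

try:
    speed + acceleration
    units_error = False
except TypeError:
    units_error = True   # the mismatch is rejected, but only at run time
```

The difference from F# matters: here the mistake surfaces only when the offending line actually runs, whereas a units-aware compiler rejects the program before it ever executes.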

The NASA Mars Climate Orbiter, which crashed due to a mixup between pounds and Newtons.

Many problems in programming arise from letting too many variables in your program have a plain string type. A URL, a file path, a file’s contents, a MySQL query string, a GUI label and someone’s name are all text. Letting them all have the same type and be manipulated in the same ways leads to all sorts of accidental errors. Concatenating the contents of two files might make sense, but concatenating two absolute directory names or two URLs does not. Many, if not all, SQL injection bugs can be seen as the concatenation of two incompatible types: a string originating from user or external input (which could be typed as such) and a query string, resulting in a query string. To avoid SQL injection, you should not allow any unescaped user input into a query string. Similar logic applies to injecting user-originated JavaScript content into webpages. In fact, several dynamic languages have tried to cut this out by keeping a “dirty” (tainted) flag on strings that originate externally; they then prevent the use of unescaped dirty strings in the wrong place, such as HTML generation.
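To make the "different kinds of text are different types" idea concrete, here is a small Python sketch for the webpage case. The `UserText` and `Html` types and the `escape` function are hypothetical names invented for this example; only `html.escape` comes from the standard library. The check shown here happens at run time, although the type hints would also let a static checker flag the same mistake.

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class UserText:
    """Text that came from a user or another external source (hypothetical type)."""
    raw: str


@dataclass(frozen=True)
class Html:
    """Markup that is considered safe to send to a browser (hypothetical type)."""
    markup: str

    def __add__(self, other: "Html") -> "Html":
        # Only Html may be joined to Html; plain str and UserText are refused.
        if not isinstance(other, Html):
            raise TypeError("only Html can be appended to Html")
        return Html(self.markup + other.markup)


def escape(text: UserText) -> Html:
    # The only route from user text to Html goes through escaping.
    return Html(html.escape(text.raw))


comment = UserText("<script>alert(1)</script>")
page = Html("<p>") + escape(comment) + Html("</p>")

try:
    Html("<p>") + comment   # forgetting to escape
    unescaped_rejected = False
except TypeError:
    unescaped_rejected = True
```

The same shape would apply to query strings: if the only way to turn external input into part of a query is a function that escapes or parameterises it, the unsafe concatenation simply does not type-check.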

Just as not all strings are the same, neither are all integers. Recently, I have been developing some software that reads from a database. Each table has a 64-bit integer primary key named id (NB: I didn’t design the schema). You might be tempted to read that field into a plain 64-bit integer type. But, as well as permitting nonsensical addition and multiplication of ids, this means that you might accidentally read the id field from the users table and use it to find an entry in the posts table. A classic way this can happen is that you have a function that takes two ids, one for users and one for posts, and you pass the parameters in the wrong order. The better way to program this system is to use a different type for the id of the users table (e.g. id<users>) and for the posts table (id<posts>). That way your method can have two parameters with different types, and the compiler can issue an error if you get the order wrong.
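A minimal Python sketch of the two-id situation might look like the following. The `UserId` and `PostId` wrappers and the `describe_post` function are hypothetical, standing in for whatever lookup the real program performs. In Python the swapped call is caught at run time by the explicit check; with the type hints in place, a static checker could report it before the program runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserId:
    """Primary key of a row in the users table (hypothetical wrapper)."""
    value: int


@dataclass(frozen=True)
class PostId:
    """Primary key of a row in the posts table (hypothetical wrapper)."""
    value: int


def describe_post(author: UserId, post: PostId) -> str:
    # Stand-in for a real lookup; refuses ids of the wrong kind.
    if not isinstance(author, UserId) or not isinstance(post, PostId):
        raise TypeError("expected (UserId, PostId)")
    return f"post {post.value} by user {author.value}"


author = UserId(7)
post = PostId(42)
ok = describe_post(author, post)

try:
    describe_post(post, author)   # arguments in the wrong order
    swapped_rejected = False
except TypeError:
    swapped_rejected = True

try:
    author + author               # ids do not support arithmetic
    arithmetic_rejected = False
except TypeError:
    arithmetic_rejected = True
```

Because the wrappers define no arithmetic operators, the nonsensical addition of two ids is rejected as well.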

Types prevent errors, and static types prevent errors very early on in the development process: at compile-time. Static type declarations also serve as a form of documentation. What’s more, I know that the types must be kept up-to-date, unlike documentation. Out of date documentation requires a programmer to notice; out of date types cause a compiler error. And if you want some really interesting type-related programming, there are actually some functions where there is only a single implementation for a given type, and thus writing a type is enough for a programmer or a tool to work out the function implementation. The Djinn tool for Haskell does this: given a type, it generates an implementation.
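Two small examples of such functions, written with Python type hints purely for illustration (the names `identity` and `first` are my own). Python does not enforce these signatures, and nothing stops a Python function from raising an exception or ignoring its hints, so the "only one implementation" argument really applies to a language like Haskell, setting aside functions that crash or never return.

```python
from typing import Tuple, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def identity(x: A) -> A:
    # Knowing nothing about A, the only value of type A available is x itself.
    return x


def first(pair: Tuple[A, B]) -> A:
    # The only A in sight is the first component of the pair.
    return pair[0]
```

In both cases the signature leaves no room for choice: the function cannot invent a value of an unknown type, so it has to hand back one it was given.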

And Yet…

You get the picture: I’m a big fan of using types. However, there is one case where I’m not as certain about how prominently types should feature: programming education.

I accept that a good strategy for teaching is to pare down what the students are exposed to. You start by teaching the students variables, or method calls, and you leave out other concepts (like loops) until students have got the hang of learning the first concept. So should types be one of the later concepts, that are omitted until students are ready for them? Should we start in languages where the types are relatively hidden and then move to more obviously typed languages later on?

Early on, this is the distinction between writing:

x = 5
y = "hello"

and:

int x = 5
string y = "hello"

Obviously, the former involves fewer concepts to start with. But does the latter help your understanding in the long run? We get into further differences with the question of whether you can write:

x = 5
x = "hello"

Is a variable a place for storing anything, or does a variable have a type? If you start with the former, is it difficult to later teach the latter? Similarly, can you have heterogeneously-typed lists with all sorts of different elements:

x = [ 5, "hello", [3.0, 5.6] ]
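In Python, for example, all of the untyped snippets above are accepted as written. The short sketch below (the variable names are only for illustration) shows that the values still carry types at run time even though the variables do not, and that mixing them carelessly is still an error:

```python
x = 5
first_type = type(x).__name__    # the value 5 is an int
x = "hello"                      # rebinding x to a different type is allowed
second_type = type(x).__name__   # the value "hello" is a str

mixed = [5, "hello", [3.0, 5.6]]
element_types = [type(item).__name__ for item in mixed]

try:
    5 + "hello"                  # an int and a str cannot be added
    mixing_rejected = False
except TypeError:
    mixing_rejected = True
```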

Broadly, what I’m wondering is: are dynamically/flexibly typed systems a benefit to learners by hiding complexity, or are they a hindrance because they hide the types that are there underneath? (Aside from the lambda calculus and basic assembly language, I can’t immediately think of any programming languages that are truly untyped. Python, JavaScript et al do have types; they are just less apparent and more flexible.) Oddly, I haven’t found any research into these specific issues, which I suspect is because these variations tend to be per-language, and there are too many other confounds in comparing, say, Python and Java: they have many more differences than their type systems. I’m interested to hear anyone’s thoughts on this issue.

12 thoughts on “The Importance of Types”

  1. For the most part, I learned to program in Python. Shortly after, I made the transition to Java for classes at university.

    At first, the transition from such a concise language to one with a lot of type-related boilerplate was very frustrating. I felt like I was constantly fighting with the type checker, especially when using generics. I eventually turned to the internet to see if I was alone in these complaints and to see why, if they required so much more work, anyone still used statically typed languages.

    It didn’t take long to find that I was not alone; a lot of people do not like how types are done in Java. Rather than blaming static typing, however, they blamed the particular implementation in Java, and pointed to other resources for looking into “better” type systems. What followed was a several-year-long exploration of languages like ML, F#, Haskell, and more recent languages like Scala.

    I still do not like Java’s type system – it feels too much like a compromise. Type information is overly verbose when it’s obvious (local variable declarations, for instance) and not expressive enough to cover even the majority of use cases. What you end up with is a lot of work for relatively little payoff. But I love other statically typed languages.

    I think the hindrance was in moving from Python to Java, not from a dynamically to a statically typed language. I wasn’t able to see the benefits of using static typing because they were obscured by the low payoff scenario mentioned above. It would have been nicer, I think, to start somewhere closer to two extreme examples in the static/dynamic spectrum and work your way in. There are headaches at both ends; sometimes it’s unclear what’s going on in very dynamic code and sometimes a static type system fundamentally limits your ability to express a solution. And there are headaches if you compromise too much and end up in the middle, like with Java. But only by experiencing all three of those points in the spectrum was I able to appreciate the tradeoffs between them.

  2. Two things:

    First, the Climate Orbiter didn’t crash because of a units problem; it crashed because of a much larger problem with how NASA tested its code at the time. Due to insufficient funding and a tight schedule, they really only tested code that was on the actual launch vehicles. Code that stayed on the ground was not rigorously tested.

    The units mixup, had it been on the spacecraft code, would have been easily spotted. The problem was that NASA was not testing /all/ their code. There were further problems with understaffing, and a lack of communication between the people responsible for the ground code, and the people responsible for the spacecraft code. One team was using one set of units, the other another — and it was the intersection that was the problem — and lack of communication there.

    To say it was a units problem is an oversimplification. It was ultimately a problem with team communication and how code was tested.

    Secondly:

    As for research on static vs. dynamic typing, check out: http://courses.cs.washington.edu/courses/cse590n/10au/hanenberg-oopsla2010.pdf
    They found that the typing doesn’t make a difference with regard to development time or quality.

    1. Thanks for your comment. So was the NASA problem in the code or in a misreading of the specification? If it was in the code, the units mechanism still would have found the problem — they weren’t testing the ground code, but they must at least have compiled it.

      Thanks for the second link. Do you know if there is any research specifically into typing with learners? I had a look around the ACM DL and couldn’t seem to find anything. It seems odd that no-one has looked into this.

      1. Misreading of the spec. The spacecraft was sending NASA ground control the data (trajectory information) in one set of units, in the form of a raw text file with no annotations, and ground control was assuming the other units when reading in that text file. When ground control was tracking the trajectory and making flight adjustments, their information was hence faulty — sending the craft the wrong adjustments.

        I haven’t seen any literature with learning and types, unfortunately. If you find any, I’d love to see it. My impression is that it doesn’t seem to make a difference one way or the other.

    2. Hey Elizabeth,

      I would be very cautious of drawing too many conclusions from that paper by Stefan Hanenberg. It’s nice work, and it’s good that people are exploring this scientifically. However, they only look at very small programs, which is a significant weakness of the study. In general, the benefits from static typing become more pronounced as programs become larger (e.g. static types provide documentation, IDEs exploit static types to help navigation, bugs take exponentially longer to fix, etc).

      Cheers,

      Dave

  3. There are good type-inferring languages that allow (x = 7 ; y = “hello”;) and also ensure (x + y) is a type error at compile time.

    In particular, Haskell is used in programming education at several universities.

  4. For years, I’ve taught intro programming in Scheme/Racket, which lacks type annotations but checks types on values at run-time. I expect students to provide type signatures of functions in comments in a stylized yet unchecked form (ie, we grade it, but can be flexible).

    I love this approach when we are first starting and students are struggling to master core ideas. The types avoid additional syntax that can get in the way (but we still ask them to think about the concepts through the type comments).

    I love this approach when we want to mix up our data a bit (lists with multiple types of content) to practice different approaches to modeling the data in a given problem.

    I love not having types when we first discuss search and want to return a signal such as “false” to indicate that an item was not found. No need to cajole a type system into accepting this (but our commenting syntax can handle it).

    I miss types when we start creating rich data on which students are more likely to make typos. For example, once we are creating nested record types (2nd week) or trees (5th week), students often make typos when defining data for writing test cases — mix up orders of fields, etc. Once we’re into structured data, I want sanity checkers on data to prevent students from trying to “fix” perfectly good code over an error in creating data. This is a fairly limited use of type checking, but one that would save a lot of heartache.

    I’m quite drawn to the work on gradual type systems (type annotations may be included and are checked if included, but are not required). I’ve never gotten to use one of these systems in teaching, but think this flexibility would be a fantastic benefit pedagogically.

    Kathi

  5. An advantage of dynamic languages is that program symbols (variables) have and keep names and types. You can deal with them, write them, compare them, check them… In traditional static languages, variables lose their names and types, meaning all semantic information. Too bad…

  6. There is great benefit in languages that support polymorphic data types (in all its forms), but this is not the question. If “2”, 2, and 2.0 become “inter-changeable” in a language then so do “hello”, 4 and 3.4E-38….. Except then they don’t make sense.

    I’m all for operator overloading: e.g. 2+3, “hello ” + “world”, strict type rules can still be observed. Similarly with functions: write(2.0434); write(array); Easy for the programmer.

    But heightAbovePlanet:Natural should not be assigned -200m; not unless the Thunderbird Mole is employed!!

  7. I find compile-time type checking invaluable on large projects (ones that span multiple programmers or multiple years), but irritating on tiny projects (single programmer working for a week or less). Most of my prototyping work is done in Python, which has strong types for objects, but no types for variables. When I need to make something bigger, I use C++ (because I’ve used it for decades, not because it is the “best” statically typed language).

    There are times when I would like to have units or other more detailed information about how to interpret an object. I have used the Unum package in Python, to add units to physics simulations. It is a little unsatisfying, because it adds a lot of unnecessary run-time overhead, when all that is really needed is compile-time checking.

    Incidentally, “what type is a telephone number?” is the wrong question. The question is “what type do we use to represent a telephone number?” or “what operations do we want to perform on a type that represents telephone numbers?”

  8. For the Java course I’m currently helping teach, I agree that types are sort of a pain for the first half of the course. But in the second half where we have linked lists, data structures (generics), and multi-level APIs as part of assignments it’s indispensable in helping the students debug quickly. It’s also useful for our testing of student code: if they have the right API then most type errors won’t be an issue. I do love Python. But I wish you could add types to it to help do sanity checking.
