

Excellent error messages

Error messages are pervasive throughout programming, yet little has been written on the design of error messages for languages, libraries, and APIs. A simple web search turns up plenty of good advice on error messages shown to end users in GUIs, but standards for error messages intended for an audience of programmers are hard to find. This is not due to a lack of attention to error messages. There are certainly places where error messages are neglected, but neglect is far from universal. In fact, some of the best discussions on good error messages come from specific efforts by big projects to improve their error messages.

The sort-within-groups problem

There is an interesting edge case in data grammars when a grouped data table is sorted by non-group columns. For example, what should the following dplyr code produce?


library(dplyr)  # provides %>%, group_by(), and arrange()

df <- data.frame(
  group = c(2, 2, 1, 1, 2, 2),
  value = c(3, 4, 3, 1, 1, 3)
)

df %>% group_by(group) %>% arrange(value)
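
To make the ambiguity concrete, here is a minimal Python sketch of two outputs a reader might plausibly expect from the code above. The names `sorted_globally` and `sorted_within_groups` are illustrative, and the sketch does not claim which behavior any particular library chooses.

```python
# Rows mirror the data frame above as (group, value) pairs.
rows = [(2, 3), (2, 4), (1, 3), (1, 1), (2, 1), (2, 3)]

# Interpretation A: ignore the grouping and sort the whole table by value.
# sorted() is stable, so rows with equal values keep their original order.
sorted_globally = sorted(rows, key=lambda row: row[1])

# Interpretation B: keep each group together (ordered by group key)
# and sort by value only within each group.
sorted_within_groups = sorted(rows, key=lambda row: (row[0], row[1]))

print(sorted_globally)
print(sorted_within_groups)
```

Under interpretation A the groups end up interleaved; under interpretation B the rows of group 1 come first, followed by the rows of group 2.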

Leap Years: Something the Gregorian Calendar Gets Right

Calendars coordinate people with people. It is better to be on vacation the same week your family is. It is better for kids to be in school the same days as their teachers. It is better to be at work when everyone else is. (Though it is worse to be driving to said work when everyone else is.) In the modern world, it can be easy to think that coordinating people with people is all calendars do. If that were all, we could certainly do with a much simpler calendar: 4 weeks to a month, 12 months to a year, no leap days, no irregularities forever. I won't argue that it couldn't be simpler, but I will argue that it cannot be perfectly simple. Because calendars also coordinate people with nature.

Finally, a synthetic organism that worries me

It was recently in the news that researchers have genetically engineered tobacco with 40% more efficient photosynthesis. I had first seen this kind of research at a seminar during my graduate years at MIT. The presenter noted that the genetically engineered plants grew faster and showed memorable side-by-side pictures of scrawny-looking normal plants next to their larger and lusher engineered brothers. I tried to find out more after the fact, but had forgotten the name of the presenter and lab. When I searched online for the research, I found lots of people proposing doing this kind of thing, but not the lush success story I had just seen. I wanted to learn more not because it was super cool and hugely important (which it was), but because it was the first (and to date only) example of what I thought was a dangerous genetically modified organism.

Multidimensional Pairing Functions

In the previous post, I compared ways to take two infinite streams and generate a new stream that is all possible combinations of the elements in those streams. This post takes it up another level and generalizes this procedure to an arbitrarily long list of infinite streams. This is a trickier task than the 2-dimensional case, utilizing recursion into each dimension to cleanly generate all combinations.

Throughout this post, the caret ^ will indicate exponentiation and parentheses list(index) will be used to indicate indexing a list. As before, 0-indexing will be used because it makes the math simpler. 
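
As a rough illustration of the recursive idea, the sketch below combines any number of infinite streams by visiting index tuples in order of their index sum. This is one simple ordering chosen for illustration, not necessarily the method developed in the post. The names `index_tuples` and `combine_streams` are hypothetical, and the code assumes at least one stream and that every stream is infinite.

```python
from itertools import count, islice

def index_tuples(n_dims, total):
    """Yield every tuple of n_dims non-negative integers that sums to total."""
    if n_dims == 1:
        yield (total,)
        return
    # Fix the first index, then recurse into the remaining dimensions.
    for first in range(total + 1):
        for rest in index_tuples(n_dims - 1, total - first):
            yield (first,) + rest

def combine_streams(streams):
    """Yield every combination of elements, one from each infinite stream.

    Assumes at least one stream and that no stream ever runs out.
    """
    iterators = [iter(s) for s in streams]
    seen = [[] for _ in iterators]  # elements pulled so far, per stream
    for total in count():
        # Pull one more element from each stream so index `total` exists.
        for buffer, iterator in zip(seen, iterators):
            buffer.append(next(iterator))
        for indexes in index_tuples(len(iterators), total):
            yield tuple(buffer[i] for buffer, i in zip(seen, indexes))

first_four = list(islice(combine_streams([count(), count(), count()]), 4))
print(first_four)
```

Every index tuple has a finite sum, so every combination is reached after finitely many steps.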

Superior Pairing Function

Given two sequences of objects, it is often desirable to generate a sequence which is all possible pairwise combinations of those sequences—the Cartesian product. If the sequences are finite in length, then it is a trivial function to write in any programming language. The function even exists in many standard libraries and packages, such as itertools.product in Python. But if the sequences are infinite in length (that is, they are streams rather than arrays, depending on your terminology), the typical approach fails. Finding a way to iterate over all pairs of two infinitely long sequences is called a "pairing function" in mathematics and has practical uses. There are some existing pairing functions, but many have limitations. I describe the properties of a superior pairing function and a couple of methods that satisfy them.
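
For orientation, here is a minimal sketch of the classic diagonal walk over two infinite streams, which visits the pairs whose indexes sum to 0, then 1, then 2, and so on. It is shown only as a baseline example of a pairing scheme, not as the superior function the post describes. The name `pair_streams` is hypothetical, and the code assumes both inputs are infinite.

```python
from itertools import count, islice

def pair_streams(xs, ys):
    """Yield every (x, y) pair from two infinite iterables.

    Pairs are visited along anti-diagonals: all index pairs (i, j)
    with i + j == 0, then i + j == 1, and so on.
    Assumes neither input ever runs out.
    """
    xs, ys = iter(xs), iter(ys)
    seen_x, seen_y = [], []
    for total in count():
        # Pull one more element from each stream so index `total` exists.
        seen_x.append(next(xs))
        seen_y.append(next(ys))
        for i in range(total + 1):
            yield seen_x[i], seen_y[total - i]

first_six = list(islice(pair_streams(count(), count()), 6))
print(first_six)
```

A nested loop in the style of itertools.product would never advance the first stream here, because the inner loop over the second stream never ends. The diagonal walk avoids that by only ever doing a finite amount of work before moving on.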

Ready for a National ID

When Yahoo was hacked, we threw away our passwords and got new ones. When Target was hacked, we threw away our credit cards and got new ones. Now that Equifax has been hacked, we'll have to throw out our social security cards and get new ones. Alas, such a thing is not currently possible, and that's a big problem. It's not that we shouldn't have a national ID number. A robust credit system requires (1) a standardized system to identify who owes what so the government knows whose stuff to take if a debt is not paid and (2) a standardized system for recording past and current credit so that borrowers can support their creditworthiness. It was point (2) that got hacked, but it was the design of point (1) that makes the hack such a big problem. The social security number (SSN) is poorly suited for its role. As long as the SSN is both the account number and the unchangeable password for all our financial instruments, we will endure costly and rampant fraud. Just as the size of the Target hack forced the US to finally rethink credit card security, the size of the Equifax hack should force us to rethink our national ID security.

The Sanctification Button

I have a browser extension installed on my work laptop that blocks my access to Reddit, Facebook, and other news and social media sites. My employer didn't install it; I did. I have the same extension installed at home, albeit blocking a more limited set of time-wasting sites. On its face, setting up a system that does nothing but restrict my future options seems like a waste of time. Wouldn't it be easier just to choose to not visit those sites at imprudent times? In theory, sure. But I don't trust my future self, and restraint is taxing. I'd have trouble explaining why. Admittedly it's weird, but I think I can assume that you, dear human reader, at least understand where I am coming from, regardless of how much you rely on such things yourself. In behavioral economics, we call these kinds of mechanisms "precommitments". Odysseus bound himself to the mast before sailing past the Sirens. Gamblers leave behind their checkbooks and credit cards before a casino vacation. I am one of many who find it prudent to occasionally bind my future actions, restrict my future options, or simply nudge my older self in a certain direction. There remain plenty of mistakes that I would like to prevent, but for which there is no mechanism to preempt. Inevitably, technology will improve; new products will become available. Some of these, like the browser blocker, will be increasingly capable precommitment tools. How far should you go with this? Artificial intelligence combined with cybernetics could make any undesirable behavior potentially preemptable. Leaving aside the technical difficulties, if you could not only end your ability to lie, cheat, and steal, but also gossip and insult, should you? Taken to its extreme, if there were a button that removed your ability to sin, would you push it?

Marginal vs. Conditional Subtyping

In computer programming, a type is just a set of objects: not a physical set of objects in memory, but a conceptual set of objects. The type Integer is the set {0, 1, -1, ...}. Types are used to reason about the correctness of programs. A function that accepts an argument of type Integer will work for any value 0, 1, -1, etc., for some definition of "works". Subtyping is used to define relationships between types. Type B is a subtype of a type A if the set of objects defined by B is a subset of the objects defined by A. Every function that works on an instance of A also works on an instance of B, for some definition of "works". If we were just doing math, subsets would be the end of subtyping. But types as pure sets exist only conceptually. Actual types must be concretely defined. It is not very efficient to define the Integer class by writing all possible Integers! In most programs, the possible values of a type are constrained by their fields. Subtypes and supertypes typically differ by having more or fewer fields than the other. Sometimes, the subtype has more fields, and sometimes, the supertype has more fields. Many of you may be thinking, "What? It's always one way, not the other!" The funny thing is that some of you think the subtype always has more fields and others think the supertype always has more fields. That's because there are two kinds of subtyping. The first is what I call "marginal subtyping", which is encountered in application programming and is well modeled by inheritance. The second is what I call "conditional subtyping", which is encountered in mathematical programming and is well modeled by implicit conversions. Depending on the genre of programming you work in, the other kind of subtyping may be unknown and the language features needed to implement it may be maligned. But both needs are real and both features are necessary.
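
The two directions can be sketched in Python under one plausible reading of the paragraph above, which the text itself does not confirm: with inheritance the subtype adds a field, while with conversion the supertype carries more fields. All class and function names here are hypothetical. Python has no implicit conversions, so an explicit `to_complex` call stands in for one.

```python
from dataclasses import dataclass

# Inheritance: the subtype (Employee) has MORE fields than the supertype.
@dataclass
class Person:
    name: str

@dataclass
class Employee(Person):
    employer: str

def greet(person: Person) -> str:
    # Works on any Person, including an Employee, using only shared fields.
    return f"Hello, {person.name}"

# Conversion: the supertype (Complex) has MORE fields than the subtype.
@dataclass
class Real:
    value: float

@dataclass
class Complex:
    re: float
    im: float

def to_complex(x: Real) -> Complex:
    # Every Real corresponds to a Complex whose imaginary part is zero.
    return Complex(x.value, 0.0)

def magnitude(z: Complex) -> float:
    return (z.re ** 2 + z.im ** 2) ** 0.5

print(greet(Employee("Ada", "Acme")))
print(magnitude(to_complex(Real(3.0))))
```

In the first half, a function written for the supertype accepts the subtype directly. In the second half, a function written for the supertype accepts the subtype only after a conversion fills in the extra field.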

The Missing 23rd of the Month

Previously, I explained why the 11th of most months is mentioned far less than the other days in the Google Ngrams database of English literature from 1800-2008. This was to solve a long-standing question posed in an xkcd comic. While researching this, I encountered another mystery: the 2nd, 3rd, 22nd, and 23rd are unusually low as well—but only until the 1930s, at which point they become perfectly normal days. Last time, I set this question aside to focus on the 11th. In this installment, I explain the strange behavior of these four days.