Will It Learn? Can It Learn?

OK, we’re going to get a bit esoteric this morning. There are all kinds of things going on in the world, but I’m going to seek refuge for a little bit in abstraction, and if that’s your sort of thing, then let’s lift off. This is broadly on the hot topic of machine learning, which we will define non-rigorously as “having software dig through a large data set in search of rules, trends, and correlations”. We all have such large data piles in this business, and we all feel that there must be a lot of those correlations that we have missed. Certainly there are humungous gaps in our knowledge, so perhaps some more knowledge is hiding in there and needs the diligence of a machine to make it clear? If you announce yourself as a biopharma company interested in machine learning and with money to spend, then that long lonely period of no one wanting to set up meetings with you is at an end. There are more folks than I can count who are ready to sell you software, hardware, consultancies, conferences, and for all I know matching T-shirts and coffee mugs.

Now, for a machine learning algorithm to be able to extract something useful from a set of data, there has to be something useful in there in the first place. You do not want your snazzy new ML program announcing that it has found rules and patterns in piles of random noise (and in fact, deliberately dumping a pile of random noise into its hopper is a useful reality check; you want to be sure that it comes up blank). That’s for starters. You also have to wonder if these correlations you’re looking for are, in fact, extractable from the data you have. That’s a bit more of a stare-out-the-window idea: of course, if there are trends, the program will find them . . . right? This is the topic of “learnability”, and it’s intimately related to what sort of machine-learning model your system is using.
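To make that reality check concrete, here’s a minimal sketch of the idea in Python with scikit-learn (my illustration of the sanity check described above, not anyone’s production pipeline): train on features and labels that are pure random noise and confirm that cross-validated accuracy sits at chance.

```python
# Random-noise sanity check: a minimal sketch, assuming a binary
# classification setup with scikit-learn. The labels are coin flips,
# so there is nothing to learn, and a trustworthy evaluation should
# report accuracy near chance (0.5).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))       # 500 samples, 20 meaningless features
y = rng.integers(0, 2, size=500)     # random binary labels

scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print(f"mean CV accuracy on noise: {scores.mean():.3f}")  # should hover near 0.5
```

If that number comes back meaningfully above chance, you haven’t discovered anything in the noise; you’ve found a leak or an overfit somewhere in your evaluation.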

That already takes you a bit further under the hood than most people like to get. But there are all sorts of algorithmic choices for ML. For some datasets, it almost doesn’t seem to matter much which of the well-known ones you use; the program will extract the rule(s) for you. Other problems are pickier (not least in terms of computational time, because some of the methods are just more efficient than others for different sorts of problems). But can there ever be a complete mismatch? That is, could you have a dataset that really does have a rule in it that could theoretically be discovered, but you are luckless enough to have picked an ML algorithm that is incapable of finding such a rule? How general is learnability?
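For a toy picture of such a mismatch (my example, not anything from the paper): the XOR rule below is perfectly deterministic and sitting right there in the data, but no linear decision boundary can represent it, so a linear learner stalls while a nonlinear one picks it up immediately.

```python
# Algorithm/rule mismatch, sketched: XOR is trivially "in the data",
# but it is not linearly separable, so a linear model cannot find it.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(1000, 2)).astype(float)
y = np.logical_xor(X[:, 0], X[:, 1]).astype(int)   # the hidden "rule"

for model in (LogisticRegression(), RandomForestClassifier(random_state=0)):
    acc = model.fit(X, y).score(X, y)
    print(type(model).__name__, f"accuracy: {acc:.2f}")
# LogisticRegression stays well below 1.0; the forest nails the rule.
```

The rule is there either way; whether it’s learnable depends on what hypothesis class you brought to the party.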

A recent paper came up with a surprising answer to that question. (Update: here’s Ash/Wavefunction with a post on this as well). The authors have a particular learning model (“estimating the maximum”, or EMX) and investigate its behavior with data produced by several different underlying “rule” functions. And they show – as in no escape, mathematically prove – that there are functions for which it is absolutely impossible to know whether they are learnable in the EMX framework or not. Not just difficult to know, not computationally hard until someone comes up with a better way to evaluate the problem, none of that: the question of learnability in those cases is formally undecidable.
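For the curious, here is the EMX setup paraphrased in symbols (the notation is mine, condensed from the paper’s description): the learner sees a finite sample from an unknown distribution and must pick, out of a given family of sets, one that captures nearly the maximum possible probability mass.

```latex
% EMX ("estimating the maximum"), paraphrased; notation is mine.
% Given a family F of subsets of a domain X and an unknown probability
% distribution P over X, the learner sees a finite i.i.d. sample from P
% and must output, with high probability, a set whose mass is nearly
% maximal within the family:
\[
  \text{output } F \in \mathcal{F}
  \quad\text{such that}\quad
  P(F) \;\geq\; \sup_{F' \in \mathcal{F}} P(F') \;-\; \epsilon .
\]
% The paper's headline example takes F to be the finite subsets of the
% unit interval [0,1], which is where the set-theoretic trouble enters.
```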

The mathematically inclined will be wondering if I mean that term the way it sounds like, and the answer is, yep, Kurt Gödel rides again. At this point, it might be advisable to go do something else if you’ve been wondering whether to stick with this post. But if you’re game and want to know more, Gödel is famous for his 1931 proof (yeah, proof) that any consistent mathematical system that is complex enough to at least encompass the rules of arithmetic will have statements in it that are true (or false) but cannot be proven to be so using the rules of that system. His proof of that is rock-solid and is a work of genius, but even those without advanced mathematical training can be taken through versions of it (this is a classic text that does so, which I’ve gone through myself). When Gödel first presented his results in public, the only person in the audience who appears to have grasped his point and asked him questions was John von Neumann, which figures – he was probably one of the few people on Earth who could have understood that presentation in real time from a standing start. As another character witness, in his later years Einstein said that the main reason he still came in to his office at the Institute for Advanced Study was to have the chance to walk back home with Gödel.

How did undecidability crop up in machine learning? As explained here, the new paper first exploits a connection between ML and data compression. Basically, if you can apply algorithmic (lossless) compression to a given data set, that tells you that there are rules and patterns inside it that can be learned; there’s some low-complexity aspect of it that makes the compression possible. The authors use a particular sort of compression scheme that they show really is tied to learnability in their EMX machine-learning system. So far, so good.
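As a toy illustration of that intuition (emphatically not the paper’s actual compression scheme, which is combinatorial rather than byte-level): structured data compresses, random noise doesn’t, with an off-the-shelf compressor standing in for “algorithmic lossless compression”.

```python
# Compression as a crude detector of structure: a sketch, with zlib
# standing in for "algorithmic (lossless) compression". Patterned data
# shrinks; random bytes do not.
import zlib
import numpy as np

rng = np.random.default_rng(2)
structured = bytes(range(256)) * 40                                    # strong repeating pattern
noise = rng.integers(0, 256, size=256 * 40, dtype=np.uint8).tobytes()  # no pattern at all

for name, data in [("structured", structured), ("noise", noise)]:
    ratio = len(zlib.compress(data)) / len(data)
    print(f"{name}: compressed to {ratio:.2f} of original size")
# structured lands far below 1.0; noise stays at (or slightly above) 1.0
```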

And then they show that the ability to do that compression is related to the size of infinite sets – particularly the “unit interval”, which is all the real numbers between 0 and 1. Shout-out here to Georg Cantor, who introduced this whole crazy idea that there are all kinds of infinite sets and that some of them are larger than others. If that sounds nuts, consider all the rational numbers between 0 and 1. There’s all your fractions: 1/3, 1/2, 2/7, and what have you. It’s an infinite set; you can keep making fractions by throwing one integer over another integer forever. But the set of real numbers between 0 and 1 is even larger, because it includes the irrational numbers (never-ending, never-repeating decimal expansions). It’s also infinite, but it’s a bigger kind of infinite, as you find out when you try to pair the reals off with the integers: any such pairing necessarily leaves real numbers out, as sketched below. Cantor’s “continuum hypothesis” was that there are no infinite sets with a “size” in between those two. (The size, or more accurately the “cardinality”, of the rational-number set is the same as that of the integer set, so another way of stating the continuum hypothesis is that the set of real numbers has the smallest possible cardinality greater than that of the integers.)
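For readers who want that “pairing off” made precise, here is Cantor’s diagonal argument sketched in a few lines (standard textbook material; the notation is mine).

```latex
% Cantor's diagonal argument, sketched. Suppose the reals in [0,1] could
% be paired off with the integers, i.e. listed as x_1, x_2, x_3, ... with
% decimal expansions x_n = 0.d_{n1} d_{n2} d_{n3} ...
% Build a new number y = 0.e_1 e_2 e_3 ... by flipping each diagonal digit:
\[
  e_n =
  \begin{cases}
    5 & \text{if } d_{nn} \neq 5, \\
    6 & \text{if } d_{nn} = 5.
  \end{cases}
\]
% Then y differs from every x_n at the n-th decimal place (and the 5/6
% choice dodges the 0.999... = 1.000... ambiguity), so y is missing from
% the list: no pairing with the integers can cover all of [0,1].
```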

Settling that was one of the greatest unsolved mathematical issues of the 20th century. Gödel himself proved in 1940 that the continuum hypothesis cannot be disproved from standard mathematical set theory, and strongly suspected that it could not be proved from it either. That second half was nailed down by Paul Cohen in the early 1960s, work (which goes even further than I’m going into here) that rightly won him every major award that the mathematical world could bestow. The connection here is that the ability to do the data-compression in this ML setting, which is tied to learnability, is also tied to the continuum hypothesis. Which is undecidable – neither provable nor disprovable. And therefore the question of whether this ML algorithm is capable of learning a (real) rule from such a data set is also undecidable.

So this is fascinating – well, OK, you may or may not agree – but does it have any relevance to the real world? That, we don’t know. You could argue that this system is set up to be particularly Gödel-vulnerable and that the same undecidability will not rear its head under ordinary conditions. That’s probably true, although it’s too early to say. And it’s important to remember that the fact of not being able to prove that an ML system like this one can learn a rule doesn’t mean that such a system won’t learn such a rule. It’s just that you literally can’t be sure if it can or not; you can’t nail that larger question down. The larger field of mathematics (and mathematical logic) has been able to carry on post-Gödel, in some ways perhaps by just not letting itself get worried about it, and no doubt the machine learning field will as well. But it’s very interesting, and not a little unnerving, to know that the cold finger of undecidability can reach into this area as well. I was going to finish up by saying “You never know”, but it’s not just that. In some cases, you never can know.

Update: as per the comments, for those who really want to dive in, here’s an extended commentary.
