How To Deal With Machine Learning Papers

Here’s a very useful article in JAMA on how to read an article that uses machine learning to propose a diagnostic model. It’s especially good for that topic, but it’s also worth going over for the rest of us who may not be diagnosing patients but who would like to evaluate new papers that claim an interesting machine-learning result. I would definitely recommend reading it, and also this one on appropriate controls in the field. The latter is a bit more technical, but it has some valuable suggestions for people running such models, and you can check to see whether those have been implemented in any given paper. Edit: I should definitely mention Pat Walters’ perspective on this, too!

The new article has a pretty clear basic introduction to the ML field, and frankly, if you take it on board you’ll already be able to at least sound more knowledgeable than the majority of your colleagues. That’s the not-so-hidden secret of the whole ML field as applied to biomedical and chemical knowledge: there are some people who understand it pretty well, a few people who understand it a bit, and a great big massive crowd of people who don’t understand it at all. So here’s your chance to move into the “understand it a bit” classification, which for now, and probably for some time to come, will still be a relatively elite category (!).

As you’d imagine, most of the diagnostic applications for ML involve image processing. That’s widely recognized as an area where these techniques have made significant progress, and for several good reasons. Conceptually, we already had the visual cortex to lead the way as an example of multilayered neural-net processing. A key advantage is the relative data-richness of images themselves, and it’s especially useful that they come packaged in standardized digital formats: grids of pixels, each already assigned numerical values in some standard color space. There has also been a massive amount of time and money spent on developing the image-recognition field, not least for defense and security applications, which has had a big influence over the years.
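
For anyone who hasn’t looked under the hood, here’s a minimal sketch (Python, assuming the Pillow and NumPy libraries, with a made-up file name) of just how an image arrives as a grid of numbers ready for a model to chew on:

```python
# A minimal sketch: an image really is just a grid of numbers.
# Assumes Pillow and NumPy are installed; "scan.png" is a hypothetical file name.
import numpy as np
from PIL import Image

img = Image.open("scan.png").convert("RGB")   # a standard color space
pixels = np.asarray(img)                      # shape: (height, width, 3)

print(pixels.shape)      # e.g. (1024, 1024, 3)
print(pixels.dtype)      # uint8 -- each channel value runs from 0 to 255
print(pixels[0, 0])      # the [R, G, B] values of the top-left pixel
```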

All that work has also exposed some of the pitfalls of image recognition – see this recent article for a quick overview. Every deep-learning algorithm has vulnerabilities, just as our own visual processing system does (thus optical illusions). And you have to be alert to the ways in which your snazzy new software might be seeing the equivalent of lines that wiggle when they’re actually stationary, or missing the equivalent of that person in the gorilla suit weaving in between the basketball players. One characteristic of neural-network models can be brittleness: they work pretty well until they abruptly don’t, and although you would really like to know when that happens, the model may be constitutively unable to tell you.

Consider what is probably the absolute worst-case “adversarial image attack” for a given system – one where someone knows the ins and outs of just how it was developed and trained, and (more specifically) knows the various weights assigned to parameters during that training and optimization. With such data in hand, you can produce bespoke images that specifically address vulnerabilities in the algorithm, and such images are simply not detectable as altered by the human eye. The example shown (from this page at Towards Data Science) is a “projected gradient descent” attack against the ResNet50 model – the perturbations in the middle panel were specifically aimed at its workings (and have been magnified by a factor of 100 just so you can see what they’re like). As you will note, the resulting image is indistinguishable by inspection from the starting one, but the program is now more convinced that the bird is a waffle iron than it originally was that the bird was a bird at all. The potential problems of such adversarial attacks on medical imaging are already a subject of discussion, but errors do not need to rise to the deliberate bird-vs-waffle-iron level to be troublesome.
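
For the curious, here’s a rough sketch of what a projected gradient descent attack looks like in code. This is my own illustrative version (Python with PyTorch and torchvision, skipping the usual input normalization, with arbitrary step sizes), not the exact attack from the Towards Data Science post:

```python
# Sketch of a projected gradient descent (PGD) adversarial attack.
# Illustrative only: assumes PyTorch and torchvision, skips ImageNet input
# normalization, and the eps/alpha/steps values are arbitrary choices.
import torch
import torchvision.models as models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()

def pgd_attack(image, true_label, eps=0.01, alpha=0.002, steps=20):
    """Perturb `image` (a 1x3xHxW tensor with values in [0,1]) to raise the
    model's loss on the true label, staying within eps of the original pixels."""
    original = image.clone().detach()
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = torch.nn.functional.cross_entropy(model(adv), true_label)
        grad = torch.autograd.grad(loss, adv)[0]
        # Step in the direction that increases the loss for the true label...
        adv = adv.detach() + alpha * grad.sign()
        # ...then project back into the eps-ball and the valid pixel range.
        adv = original + torch.clamp(adv - original, -eps, eps)
        adv = torch.clamp(adv, 0.0, 1.0)
    return adv.detach()
```

The key point is the projection step: every pixel ends up within a hundredth of its original value, which is why the altered picture looks identical to you and me while the model’s answer can swing to something absurd.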

The JAMA paper recommends that you ask yourself (or perhaps a new paper’s authors!) several questions when you see an interesting new ML diagnostic method. For one thing, how good is the reference set? “Good” can and should be measured in several ways – size, real-world fidelity, coverage of the expected usage space, inclusion of deliberately difficult or potentially misleading images, etc. Note that when IBM’s attempt at using its Watson software for cancer diagnosis failed, one reason advanced for that wipeout was that the cases it trained up on were synthetic ones produced for its benefit (although to be sure, there were probably many other reasons besides).
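
If you want to kick the tires on a reference set yourself, a quick sketch along these lines covers the basics of size, label balance, and leakage between splits (Python with pandas; the DataFrames and column names here are hypothetical stand-ins for a paper’s own data):

```python
# Sketch of a quick reference-set sanity check: how big is it, how balanced
# are the labels, and does anything leak from the training split into the
# test split? The DataFrames and column names are hypothetical.
import pandas as pd

def summarize_reference_set(train_df: pd.DataFrame, test_df: pd.DataFrame) -> None:
    print("training cases:", len(train_df), "| test cases:", len(test_df))
    print("label balance in training data:")
    print(train_df["label"].value_counts(normalize=True))
    # A patient who shows up in both splits lets the model recognize rather
    # than generalize -- one common way for results to look too good.
    overlap = set(train_df["patient_id"]) & set(test_df["patient_id"])
    print("patients appearing in both train and test:", len(overlap))
```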

Another question to ask is two-pronged: do the results make sense, or are they perhaps a bit too counterintuitive? Counterintuitive and right is a wonderful combination, but counterintuitive and wrong is a lot easier to achieve. On the other side, are the results just too darn perfect? That’s a warning sign, too, perhaps of an overfitted model that has learned to deal perfectly with the peculiarities of its favorite data set, but will not do so well when presented with others. And then there are the ever-present questions of repeatability and reproducibility: if you feed the same data into the system, do you get the same answer every time? And can other people get it to work as well?
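
To see why a perfect-looking score should set off alarms, here’s a toy sketch (Python with scikit-learn, on deliberately meaningless random data) of the gap between fitting your own training set and handling data the model has never seen:

```python
# Toy sketch of the "too good to be true" warning sign: a flexible model can
# fit random noise perfectly and still know nothing. Uses scikit-learn on
# synthetic data purely for illustration.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))        # 200 samples, 50 meaningless features
y = rng.integers(0, 2, size=200)      # random labels -- nothing real to learn

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

print("training accuracy:", model.score(X_train, y_train))   # typically ~1.0
print("held-out accuracy:", model.score(X_test, y_test))     # hovers near 0.5
```

Fixed random seeds also make the repeatability question easy to check: run the same thing twice and the numbers should not move.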

The “adversarial controls” paper linked to in the first paragraph (Chuang and Keiser) also recommends a similar reality-check approach, and suggests seeing whether other models converge on the same answers. If not, that’s a sign that one or more of them (all of them?) are reacting to extraneous patterns that have nothing to do with the issue at hand. They also strongly suggest that people generating such models deliberately try to break them: take out parts that you would think are crucial, one at a time, and check to make sure that they really are. That’s what led to the situation I blogged about a year ago, when substituting random data for an ML model’s parameters did not seem to degrade its “performance”. If the authors of an ML system you’re interested in haven’t done things like this, then by all means try them yourself.
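
Here’s one simple flavor of that kind of control, sketched in Python with scikit-learn on synthetic data (it’s not the Chuang and Keiser protocol itself, just the scrambled-labels idea): retrain with the labels shuffled, and the apparent performance had better collapse.

```python
# Sketch of a scrambled-label control: if a model does nearly as well on
# labels that have been randomly shuffled as on the real ones, it is reacting
# to something other than the relationship you care about. Synthetic data,
# scikit-learn, purely illustrative.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # labels with a genuine signal

model = RandomForestClassifier(random_state=0)
real_score = cross_val_score(model, X, y, cv=5).mean()

y_scrambled = rng.permutation(y)             # break the feature-label link
scrambled_score = cross_val_score(model, X, y_scrambled, cv=5).mean()

print(f"real labels:      {real_score:.2f}")       # well above chance
print(f"scrambled labels: {scrambled_score:.2f}")  # should drop back to ~0.5
```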

So we’re all going to have to sharpen up our game, because this topic is definitely not going away. I know that there’s a blizzard of hype out there right now, but don’t use that as an excuse to dismiss the field or ignore it for now. The whole machine learning/deep learning field is moving along very briskly and producing real results, and there is absolutely no reason to think that this won’t continue. Underestimating it is just as big a mistake as overestimating it: avoid both.
