Friday, November 6, 2009

The OOM Killer

There is a small but vocal contingent of Linux "advocates" that are only too happy to tell you that Linux is super-awesome and will solve all your problems. "Linux will do everything that wind0ze will do, only better, and it'll do it for free!" and on and on like that.

I'm not writing this post specifically to annoy people like that, but it certainly wouldn't hurt. :3

So what's this "OOM killer"? It's something that Linux fanboys generally don't like to talk about, assuming they even know it exists. Let's get some background first, on memory allocation.

When a new process is created on Linux (with fork), the operating system needs to copy all the memory from the parent process into the child. It does this because, following the traditional process abstraction, the child process needs to inherit all the data from the parent, but also needs to be able to modify it without messing up the parent. Linux uses a technique called "copy on write", or COW, to do this really quickly. The trick with COW is, instead of actually copying the data, you just mark it read-only and point both the parent and the child at the same copy. Then, if either of them tries to write to it, you quietly copy it and pretend it was a writable copy all along. This works really, really well, since the vast majority of that memory never gets written to - most often because the child turns around and calls exec and throws away everything it inherited.
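Here's a minimal sketch of what COW has to preserve from the program's point of view (the deferred copying itself is invisible - all you can observe is that the child's writes never leak back into the parent):

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int value = 42;
    pid_t pid = fork();          /* duplicate the whole address space */

    if (pid < 0) {
        perror("fork");
        return EXIT_FAILURE;
    }

    if (pid == 0) {
        /* Child: this write is what triggers the hidden page copy. */
        value = 99;
        printf("child sees  value = %d\n", value);
        return EXIT_SUCCESS;
    }

    wait(NULL);                  /* let the child finish first */
    /* Parent: still sees 42; its page was never touched by the child. */
    printf("parent sees value = %d\n", value);
    return EXIT_SUCCESS;
}
```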

Normally, when the system runs out of memory, Linux handles it by returning an error to the process that asked for the memory - malloc hands back NULL. This works more or less well. There's an unfortunate tendency among programmers to ignore the result of malloc, though, which means that some programs will start to randomly crash when you get close to running out of memory. The point is, though, that at least there's a way to detect the situation: carefully written programs can avoid crashing by checking the return value of malloc and reacting sensibly if the system is out of memory.
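For the record, "checking the return value" just means something like this - a tiny sketch, nothing clever:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t n = 1024 * 1024;
    char *buf = malloc(n);

    if (buf == NULL) {
        /* Allocation failed: report it and back off gracefully,
         * instead of charging ahead and dereferencing NULL. */
        fprintf(stderr, "out of memory: could not allocate %zu bytes\n", n);
        return EXIT_FAILURE;
    }

    memset(buf, 0, n);
    puts("allocation succeeded");
    free(buf);
    return EXIT_SUCCESS;
}
```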

But there's another problem here. Notice that with COW, the actual memory allocation happens at some arbitrary later time, when you try to write to some random variable. If the system is out of memory, and has to make a copy, then you've got a problem. (Ideally, you'd set aside enough free memory when you create the process, but then you're "wasting" a lot of memory that will probably never be needed - can't have that, so Linux overcommits instead.) You can't just tell the program that you couldn't allocate memory, because the program didn't try to allocate memory in the first place! You have an error that can't be properly handled. So, Linux handles this situation with... the OOM killer.
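Here's a rough sketch of how that failure mode looks in practice, assuming the kernel's default overcommit behavior (don't run this on a machine you care about): malloc keeps "succeeding", and the process only gets into trouble later, while innocently touching the memory it was promised.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t chunk = 64UL * 1024 * 1024;      /* grab 64 MiB at a time */

    for (;;) {
        char *p = malloc(chunk);
        if (p == NULL) {
            /* Under strict accounting, this is where you'd find out. */
            puts("malloc finally said no");
            break;
        }
        /* Touching every page forces the kernel to hand over real
         * memory; this is where the OOM killer can strike - and not
         * necessarily at this process. */
        memset(p, 1, chunk);
    }
    return EXIT_SUCCESS;
}
```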

The OOM (out of memory) killer does exactly what it sounds like: when you run out of memory, it kills some process so that the system can keep going. The kernel developers have come up with some rather elaborate heuristics for how it selects the process to kill, so that it's less likely to be a process you really care about, but as described in this awesome analogy, that's somewhat akin to an airline trying to decide which passengers to toss out of the airplane when it's low on fuel. No matter what you do, the fact remains that you've gotten yourself into a bad situation.
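You can at least nudge those heuristics per-process. On reasonably recent kernels that knob is /proc/<pid>/oom_score_adj (kernels of this era had the older /proc/<pid>/oom_adj with a smaller range); writing a large positive value volunteers the process as the first one out of the airplane, while a negative value (which needs privileges) makes it harder to kill. A small sketch:

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Volunteer ourselves as the sacrificial passenger. +1000 is the
     * maximum; -1000 would exempt us entirely, but lowering the score
     * requires root. */
    FILE *f = fopen("/proc/self/oom_score_adj", "w");
    if (f == NULL) {
        perror("fopen /proc/self/oom_score_adj");
        return EXIT_FAILURE;
    }
    fprintf(f, "%d\n", 1000);
    fclose(f);

    puts("oom_score_adj set; carry on");
    return EXIT_SUCCESS;
}
```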

I've seen swap proposed as a solution to this, but that's basically just saying "don't run out of memory in the first place" - it's not terribly helpful. The fact is, no matter how carefully you write your program, and how meticulous you are about checking for errors, there's still a chance that it will simply be killed out from under you, through no fault of its own. Yay, Linux!

1 comment:

Frank Church said...

Well, you know me, I don't know doodley-squat about Linux, except for stuff I've picked up from you. So yeah, this OOM killer - boy, I hope Linux tells you when the OOM killer is being implemented. Hope you haven't encountered it too much.