Friday, May 15, 2009

Power-aware scheduling in TFlex

Whenever people ask me what I'm doing for my honors thesis, I find myself in a bit of a bind, because actually explaining what I'm doing properly would take entirely too long for casual conversation. I thought I'd do a post on what I'm actually doing, so that when I say I'm working on my research project this summer, y'all will have some idea what I'm actually talking about.

Unfortunately, actually explaining things properly requires a lot of background explanation.

Background: current architectures

When I say computer architecture, I'm not talking about buildings, but rather the internal structure of the processor, and the way it executes programs. If you take any current computer architecture, and handwave away a whole lot of implementation details, you'll almost always end up with about the same thing. Current architectures are all based on an extremely simple model: the processor reads an instruction from memory, does what the instruction says, and then moves on to the next instruction. Instructions are the basic units of computation - some commonly used instructions will perform various arithmetic operations, or read data from memory, or have the processor look elsewhere for the next instruction.

Well, it's not quite that simple. Modern processors can do all sorts of things to make the process faster, like reading the instructions from high-speed caches instead of main memory, or analyzing the dependencies between instructions so that they can execute more than one at a time. And instructions aren't that simple either; some can execute in a single cycle (where a cycle is on the order of nanoseconds), while others can take hundreds of cycles. With very few exceptions, though, computers maintain the illusion that they are executing instructions one at a time from memory.

Maintaining this illusion makes things really convenient for programmers, but it also means that computers aren't as fast as they could be. Moore's Law has given chip designers more transistors than they know what to do with. You could easily fit dozens or hundreds of simple execution units on a processor, but because the processor needs to pretend that it's executing one instruction at a time, it's very difficult for it to use more than four or so execution units at a time.

The problem doesn't lie in the current generation of processors, but in the current generation of architectures. To make computers significantly faster than they are now, we're going to need new architectures, explicitly designed for high levels of parallelism. Which brings me to my next topic...

The TRIPS architecture

TRIPS is an experimental architecture that was created here at the University of Texas. The most important feature of the TRIPS architecture is that it operates on blocks of instructions, rather than individual instructions. Memory access and control flow, both of which are traditionally foes of high-performance computing, are handled at the block level, rather than at each instruction. (Memory access is a problem because the amount of time it takes is highly unpredictable; control flow is a problem because the uncertainty about the next instruction to execute stops the processor for a short time.) Not handling these on a per-instruction basis also simplifies the individual execution cores.

Within each block, instructions are executed in dataflow order. This means that each instruction within a block runs as soon as its operands are ready for it, rather than having to wait for its turn in the instruction stream. This design implies that a block runs as fast as it's theoretically possible for it to run, unless it's constrained by the actual dimensions of the processor grid. TRIPS uses an 8x4 grid of processors, for 32 cores total.

(As an aside: this isn't just idle theorizing about abstract issues in computer architecture. They actually built a working prototype of TRIPS a few years back.)

This is all really neat stuff, and it solves a lot of scaling issues which current processors face really elegantly. But wait, it gets better!

The TFlex architecture

Parallelism is nice, but the sad fact is that some programs simply aren't parallelizable. Even if you were to try to run them on a processor like TRIPS, they would end up only using one core at a time. Wouldn't it be nice if you could put a program like that on one core, and use the other 31 cores for other stuff at the same time?

The TFlex architecture, a derivative of TRIPS, does exactly that. It allows you to divide the execution cores into different-sized tiles, and run totally different programs on each tile. So, for example, you could run two programs on two 4x2 tiles, and a third program on the remaining 4x4 block of space.

CPU scheduling is the process of allocating time on processors to programs on a computer. Generally, on computers like you or I would have, there are far fewer processors than there are programs running, so CPU scheduling involves switching between different programs quickly enough that you don't notice, while giving each one a fair share of CPU time. On TFlex, we have the luxury of approaching the problem a bit differently: since we can decide on the fly how many processors we want to have, and what sizes they should be, we can make the simplifying assumption that every program is running simultaenously. Context switches are fun (unless you're working on a project for OS, and it's due the next day, and the switch to kernel space is just a little bit off and the whole thing crashes when you run it... well, that's another story entirely), but for an experimental system they're not really necessary.

Every program will have slightly different performance characteristics on such a machine - the speed of a program doesn't vary linearly with the size of the tile it's running on, and the relation between performance and tile size varies tremendously. The reconfigurable nature of TFlex, then, adds a whole new dimension to the problem of CPU scheduling.

And that's it for background stuff.

My research project (finally)

My research is focused around power-aware CPU scheduling on the TFlex architecture. System performance isn't the only thing you might want to optimize for with scheduling - on TFlex, you can also schedule running programs to minimize power consumption, since the power consumption of a program varies with tile size. Currently, I'm working on figuring out the relationship between performance and power consumption. If they're strongly correlated, then it makes my life a lot easier, because optimizing for one will also optimize for the other. On the other hand, if there are significant tradeoffs between the two, the problem may turn out to be more interesting.

Once you end up considering both execution speed and power consumption, there are other interesting questions you can ask. For example: "I want to run these programs, within this power budget. How quickly can they run without going over?" On a power-constrained device, running off a battery, this could be a very useful thing to know. My main for this project is to design a CPU scheduler that addresses this problem, and others like it.


Kiriska said...

I always feel good when I understand at least 80% of your CS-heavy posts. I also wonder why people even bother asking what others' honor thesis/senior projects are in casual conversation because it RARELY can be summarized in less than 500 words, no matter what the area of study. Everyone always seems to need to take a deep breath first and ask you whether you actually care or if you're just making chitchat. :P

ANYWAY... have fun with that. XD

Frank Church said...

Me, on Friday: tldr :(
Me, on Tuesday: Hey, I understood some of that, and it wasn't that long! But I might have to ask you dumb questions about it, so fair warning.