Sunday, November 8, 2009

Decentralization I

Which is better, centralization or decentralization? I've been giving this question a lot of thought lately, in the context of distributed systems, and I've come to realize a few things.

First, the question applies to more things than you'd expect. For once, I'm not going to spend the entire time talking about tech - command and market economies, for instance, can be understood as centralized and decentralized systems, respectively, and analyzed as such.

Second, there are fundamental limitations in both directions. For example, a fully centralized system will always undergo some sort of failure eventually, thanks to Murphy's law, while a completely decentralized system has to deal with malicious individuals, asynchrony, and other such issues that keep people that work on distributed systems awake at night.

Third, it's not an all-or-nothing question; it may not even be a smooth gradient. In addition to the tradeoff between centralization and decentralization, we need to consider hybrid decentralized systems, which have proven to be a good compromise. Popular examples include e-mail, and XMPP (aka Jabber, aka Google Talk).

Over the course of this month, I'm going to be writing a series of posts about decentralization. Many of them won't even be about computers, but about other systems that I'm looking at from this perspective. I think that there are a few core principles that account for the difference between centralized and decentralized systems, and I'm trying to tease out what those are.


Saturday, November 7, 2009

The OOM Killer

There is a small but vocal contingent of Linux "advocates" that are only too happy to tell you that Linux is super-awesome and will solve all your problems. "Linux will do everything that wind0ze will do, only better, and it'll do it for free!" and on and on like that.

I'm not writing this post specifically to annoy people like that, but it certainly wouldn't hurt. :3

So what's this "OOM killer"? It's something that Linux fanboys generally don't like to talk about, assuming they even know it exists. Let's get some background first, on memory allocation.

When a process is created on Linux, the operating system needs to copy all the memory from the parent process into the new process. It does this because, following the traditional process abstraction, the child process needs to inherit all the data from the parent, but also needs to be able to modify it without messing up the parent. Linux uses a technique called "copy on write", or COW, to do this really quickly. The trick with COW is, instead of actually copying the data, you just mark it read-only, and point both the parent and the child to the same copy. Then, if either of them tries to write to it, you copy it secretly and pretend that it was a writable copy all along. This works really, really well, since the vast majority of memory ends up never being written to for various reasons.

Normally, Linux will handle when it runs out of memory by returning an error to the process that tried to request the memory. This works more or less well. There's an unfortunate tendency among programmers to ignore the result of malloc, though, which means that some programs will start to randomly crash when you get close to running out of memory. The point is, though, that at least there's a way to detect the situation - carefully written programs can avoid crashing by checking the return value of malloc and reacting properly if the system is out of memory.

But there's another problem here. Notice that with COW, the actual memory allocation happens at some random time, when you try to write to some random variable. If the system is out of memory, and has to make a copy, then you've got a problem. (Ideally, you'd make sure that there's enough free memory when you create the process, but then you're wasting a lot of memory - can't have that!) You can't just tell the program that you couldn't allocate memory, because the program didn't try to allocate memory in the first place! You have an error that can't be properly handled. So, Linux handles this situation with... the OOM killer.

The OOM (out of memory) killer does exactly what it sounds like: when you run out of memory, it kills a random process so that the system can keep going. They've developed some rather elaborate heuristics for how it selects the process to kill, so that it's less likely to be a process that you really care about, but as described in this awesome analogy, that's somewhat akin to an airline trying to decide which passengers to toss out the airplane if they're low on fuel. No matter what you do, the fact remains that you've gotten yourself into a bad situation.

I've seen swap provided as a solution to this, but that's basically just saying "don't run out of memory in the first place" - it's not terribly helpful. The fact is, no matter how carefully you write your program, and how meticulous you are about checking for errors, there's still a chance that your program will crash randomly, for no reason at all. Yay, Linux!

Friday, November 6, 2009

An Egg-and-Chicken Situation

I was reminded today that, while it's perfectly obvious to me that the egg came first, there are a lot of people that still aren't convinced. So, without further ado...

Initially, we can trivially say that all chickens were preceded by dinosaur eggs. This is no fun though, so I'll go ahead and strengthen the "paradox": which came first, the chicken or the chicken egg?

We define a creature as a "chicken" based on its DNA being sufficiently close to the modern species of chicken.

First, we take it as a given that any creature sufficiently chicken-like to be called one will have hatched from an egg. (Indeed, if a chicken-looking creature was born by some other means, we would probably not accept it as a chicken - and this is a proof in and of itself, albeit a less interesting one.) Thus, every chicken is preceded by at least one egg. However, this does not preclude an infinite cycle, which is the source of the paradox.

We next note that, because a chicken does not have the same DNA as either of its parents, and we're defining chicken-ness based on DNA, its possible that a chicken could be born where one or both of its parents are not-quite-chickens. Furthermore, I assert that, since there has been a point in time at which chickens did not exist, this must have happened at least once. If we consider that first chicken, it had the same chicken DNA when it was an egg, so it was in fact preceded by an egg. QED.

Thursday, November 5, 2009

An Unnecessary Evil

What do antivirus software and URL shorteners have in common? They're both elaborate solutions to fixable problems that should never have existed in the first place. They should have both been implemented by the one responsible for the problem, but both were instead solved by third parties. Also, they're both annoying. >_>

Start with antivirus software. The problem, obviously, is the rampant insecurity of Windows as a platform, combined with (let's be fair to Microsoft here, after all) a user base that's been trained to believe that running programs (installers) you've downloaded from random places on the Internet is okay, or even normal. The combination of these two factors has led to a malware industry that's actually pretty impressive, in terms of scope.

These days, Windows security is a lot better than it used to be, and Microsoft is even providing their own antivirus product. Some complain that it's anticompetitive, and that Microsoft is in a position to wipe out the rest of the antivirus market. I will be blunt: I would be perfectly happy to see that entire market shrivel up and die. It should never have existed in the first place, and if Microsoft can secure their OS to the point that we don't need antivirus, we'll all be better off for it. They haven't solved the problem yet, but they're trying, at least.

As for URL shorteners, let's look at why they exist at all. Twitter has a hard 140-byte limit on updates, a limit which is actually imposed by the 160-byte limit of text messages. People who receive updates through texts are the only ones for whom the limit is relevant; for the rest of us, it can easily be glossed over in the interface.

So, why doesn't twitter just run their own URL shortener for SMS users, keep the messages with shortened URLs within their own system, and automatically expand them when displaying them to users that aren't using SMS? Then users wouldn't have to worry about shortened URLs at all when reading tweets; Twitter clients could automatically shorten URLs to an appropriate length when posting updates; Twitter could even charge for use of really short URLs and finally have an actual revenue stream. (Or not, on that last one, I dunno if there's enough scarcity there to even support micropayments.) Ideally, the shortener would guarantee some minimum length of URL that it could provide, but then actually use the longest possible URL that would fit in the message, to preserve the URL space.

As with antivirus software, a market has sprung up to address this unnecessary flaw in Twitter's service. There was a rash of URL shorteners popping up with the rise in Twitter's popularity, though I think that's cooled off quite a bit with Twitter's use of bit.ly as the default, and is.gd going out of business (or not? I forget now). Given that a lot of people aren't using Twitter via SMS, URL shorteners are an unnecessary evil in most cases. Twitter should follow Microsoft's example, step up to the plate, and give us back our URLs.

Wednesday, November 4, 2009

Idea post: low-latency caching proxies for mobile internet

This is just an idea post, since I seriously don't have time to implement this right now.

A problem that has been getting worse with mobile internet (on cell phones, etc) is latency. Rather, I should say, the bandwidth on mobile devices has been getting better and better, and is starting to rival slow DSL connections for throughput, but the latency is still atrociously bad. This could be addressed by mobile network operators, but since I'd like to see this problem solved before I am dead, I think it's fair to look into alternate solutions.

Quick explanation first: Bandwidth is more complicated than people usually think. There are two main components to it: throughput, which is the number of bytes per second, and latency, which is the amount of time those bytes take to arrive. Internet connections are usually sold exclusively in terms of throughput, but latency can be really important, especially for applications that have to go back and forth between you and a server a lot. Wireless networks (cell networks and wi-fi both) generally have way higher latency than wired connections of any type. For a good wired connection, 30-150 milliseconds latency is normal, while for a wireless connection, it ranges from a few hundred to a few thousand milliseconds. Congratulations! You now know more about internet connections than a Level 2 AT&T tech support person.

idea 0: low-latency caching proxies
This is actually a pretty trivial thing - anybody can set up a proxy server, and point their mobile web browser at it. (Well, theoretically.) This doesn't gain us that much, though. It'll mainly reduce latency caused by the website, not by the mobile network, so it doesn't really address the actual problem. There are two reasons for even mentioning this: it gives us compression, since that's a relatively straightforward thing that proxies can add, and it makes the rest of the ideas possible.

idea 1: aggressive HTTP pipelining
HTTP allows for an optimization called "pipelining". Basically, the client makes a bunch of requests at the same time, on the same connection, and the server responds to them in order. This can do wonders for reducing latency. The usual sequence of actions goes something like this: send request, wait, get response, send request, wait, get response, send request... Pipelining eliminates most of the waiting. Clients will usually limit the amount of pipelining they do to be kind to web servers, but if we're getting everything from our own proxy server from step 0, we can send as many requests as we want.

It's not perfect, though. On the web, documents will usually pull in other files, like images, scripts, etc, and those can themselves pull in other files. So while we'd like to make all the requests up front, we usually don't know which files we'll need before the responses start coming back. Pipelining helps, but we still end up having to wait more than we'd like.

idea 2: inlining images and scripts and stylesheets
What if instead of depending on the client to fetch everything, we had the server help a bit? A lot of files on the web can actually be either linked to as external documents, or rendered inline inside a page. Javascript and CSS can be inlined pretty easily, and every web browser supports that. It turns out it's possible to also inline images this way, using something called a data URL.

The proxy server could fetch the web page, fetch all the other parts of the page that the client would normally have to request individually, and package them up into a giant page, before sending that to the client. If this works perfectly, then the client only has to make one request, and wait for one response.

This has some pretty serious disadvantages, though. For one thing, it bypasses caching on the client device, so anything included using this method would have to be fetched each time the page loads. This would be fine for small files, but there would have to be some kind of heuristic on the server for what to inline, and what to leave out. I'm also not entirely sure that javascript behaves exactly the same way when it's linked versus inline, but given how well mobile devices support javascript to begin with, that may not be such a big issue.

idea 3: image prescaling
Jumping back to decreasing bandwidth here. It seems kind of a waste to download a full-size image to a mobile device when it's just going to scale it down before it's displayed. Why not scale the image down to the display size before it leaves the proxy? We'd need some way to control this from the client end, since otherwise the server won't know what size to scale to. A custom request header would work, probably. This would also save a lot of CPU time on the mobile device, which translates to faster rendering and less battery usage.

This may not work so well for zoomable browsers, though, such as the iPhone. In this case, it would be interesting for the server to make the image a progressive scan image, so that the client could receive a scaled version first, and then the rest of it later. We could even imagine something fancy, with progressive scan images and HTTP Range headers, where the client first downloads the first progressive part of each image, enough to display a scaled version, and then goes back and fetches the rest of each image file.

(And while we're at it, we can convert GIFs to PNGs. :p)

idea 4: speculative server-initiated prefetch (needs a better name) blah blah
We've been trying to work around the fact that clients have to fetch everything that's on a webpage to load it. What if we bypass that entirely? (this one is more me thinking out loud, actually)

What I'm proposing here is a way (that I haven't thought through properly) to have the server push files to the client, rather than waiting for the client to request them. This would allow the client to cache page elements properly, but still have the same performance as inlining page elements. The problem we'd run into then is that the server has no way of knowing what's in the client's cache (and really, shouldn't), so there will still be a lot of wasted bandwidth here. Maybe the server could send a list of page elements and -- nope, then we're back to client fetch anyway. Hmm. This one might be a lost cause. >_>

Existing stuff

Naturally, I'm not the first one to think of this. There are a few implementations of parts of this; the one that comes to my mind first is Opera Turbo. What I'm proposing here, though, is an open-source server that anybody can run, rather than an add-on controlled by one company. After all, the Internet has proven time and time again that the usefulness of a technology is proportional to how easy it is for an enterprising geek to roll their own.

Tuesday, November 3, 2009

Lesser-known racist fallacies

"My friend, who is a member of (ethnic group), was not offended by this joke; therefore, it's not racist." This is really a generalization of the inexplicable tendency of people to assume that any member of another race speaks for all members of their race, which is roughly the intellectual equivalent of "You all look the same to me." I, for example, am a pretty nonrepresentative sample of Indians. If you want to know what all Indians think, give me a few days so I can go around and ask all 1.1 billion of them.

"I didn't mean for it to be taken as racist." a.k.a., "The road to racism is detoured by good intentions." The intent was never the problem to begin with; racism is as problematic as it is because it offends people, not because you wanted to. (okay, so this is debatable.)

"Minorities, when speaking about racism, are above reproach." If anything, minorities are frequently just as racist or more racist than others, just because nobody thinks to call them out on it. (Asians, for example, can be incredible nationalistic snobs. Just try asking a first- or second-generation immigrant parent about all the amazing things that were done first in their country.) This is still a bad thing!

"Traditions and culture are especially important to minorities." In my (unnecessarily inflammatory) opinion, traditions are mainly important to people who don't have much else. I am more than what I've been handed down by thousands of years of handing-down, or at least, I aspire to be.

"Meat is delicious, you should try it sometime. :o" Nah, see, I might be convinced by this reasoning, except that it kind of misses the entire point. >_> I will not be swayed by your delicious bacon!

Monday, November 2, 2009

Barriers to entry for open-source contributions

Contributing to open-source projects is relatively easy, but it could definitely be easier. There have been times that I've found a problem, decided on a solution, even written a patch once or twice, and then lost interest because of some random restriction along the way. This is wasted potential!

Moreover, contributing to open source is easy for me, but I'm already an open source contributor. For new contributors, the process can be really daunting, for a variety of reasons. I'm going to go through a list of things that projects can do to encourage contributions, roughly in order of increasing difficulty.

Ask!
It's really surprising how few open-source projects even get this far. Every project needs to have a page somewhere that says something along the lines of, "If you want to contribute, we could use a hand with X, Y, and Z. Contact so-and-so for details." If you don't say something like this, most people - especially those that aren't familiar with open source - will just assume that you don't really want outside contributions.

Keep some easy bugs around for new contributors
This is pretty easy for any project that has a reasonable volume of bug reports coming in. Invariably, some of them will be for really easy stuff - spelling fixes in documentation, or other really trivial fixes. Instead of just fixing these, give them a special tag or something on your bug tracker, or make a list of them on your website, and advertise this list to your users. The ones that are interested in contributing will have something quick and easy to get started on, and you'll have that many fewer bugs to fix - it's a win-win! Ideally, once you have a system like this in place, people will feel more comfortable about filing trivial bugs too, and you'll end up with higher-quality software overall.

Don't require registration to submit a bug report
Trac is kind of bad about this since everybody runs their own instances, though if I recall correctly recent versions support OpenID, which is a big step forward. If I find a bug, but I have to go through the whole registration dance just to report it, I'm just not going to bother reporting it unless it's a really severe bug. You may say that I'm not a serious contributor if I let a registration page stop me, but that's kind of missing the point. Casual contributors can be just as valuable as really serious ones, and there are a lot more of them out there.

Allow editing through a web interface
This one doesn't actually exist yet, but it could. The normal process of submitting a patch involves checking out the source code, making modifications, generating a patch, and submitting that to the maintainers of the program. 3/4 of those steps could be automated on a code hosting site such as SourceForge or github or bitbucket. The code could be cloned on the server; the user could be presented with a simple text editor in the browser (or something more fancy, like Bespin), the user could save the code with a commit message when they're done, and the commit could automatically be submitted as a pull request on the server. I wouldn't want to use this to make serious changes to the code, but this isn't designed for people that are already making serious changes. For somebody who's just making a small cosmetic change, this would be a huge timesaver, and that in turn increases the number of contributions you get.