Wednesday, January 27, 2010

Project Blog: Collatz

(Note: this post is an extra-credit assignment for my software engineering class.)


The first project is a simple numerical problem based around the Collatz Conjecture. The assignment involves finding the maximum "cycle length" (the number of steps it takes to reach 1) for any number in a given range. Because this is a problem which would involve a lot of repeated computation if it was solved naively, we implemented it using a memoizing cache. Our cache implementation is pretty simple, and uses a HashMap in Java and a dict in Python. This implementation seems to perform well.

Of course, the focus of this assignment isn't on the actual problem, but on the infrastructure we build around it. In particular, we were encouraged to use an issue tracker, plan out the design on a project wiki, use a version control system, and create a comprehensive test suite. We were to use Google Code for the first two. As usual, Google's online code hosting tools are excellent, and even though we didn't actually store any code on it, the wiki and issue tracker were very easy to get started using once we created the project.

My partner and I used Mercurial for version control. Instead of setting up some sort of hosted version control thing, we simply created world readable (chmod 755) clones of a repository on our CS accounts. This isn't exactly secure, but it was the simplest way to get started, and we're reasonably sure that other students won't try to poke around and guess the directory name we used. In my opinion, Mercurial is the simplest of the three suggested VCSes, once you wrap your head around the separation between the repository and the files.

A helpful Mercurial tip: If you add a .hgignore file (note the . at the front of the name - it's a hidden file) to the top level of your repository, Mercurial will ignore files matching certain patterns. This can keep you from accidentally committing a bunch of compiled class files or other generated files, and it also makes the output of "hg status" a lot more usable.

Unit testing isn't something I do often, but I think it's something I should get used to doing on all my projects. For any codebase that I intend to keep around for a while, the value of a test framework that I can quickly try out new changes in would definitely outweigh the cost of writing the tests in the first place. It's pretty cool to be able to get a sanity check that all the code actually does what you think it does. One thing I didn't initially realize is that you can do more than just test the provided functions; I now have a test case which checks the contents of the cache after a test run.

Implementation stuff

It should be possible to write the divide-by-two step as a right shift, and the times-three-plus-one step as a left shift and two adds. This might improve performance, but I haven't done it because I strongly believe in leaving that sort of thing to the compiler. The way I see it, if I'm not using a language that has a good optimizing compiler, I probably don't care about performance all that much to begin with.

(There's another possible optimization that was mentioned in passing during class - combining iterations when possible. Since it's guaranteed that the result of a (3n+1) operation is even, you can implement ((3n+1)/2) as (n + (n >> 1) + 1), and get two iterations done at once.)

There's a tradeoff between using an array for the cache (faster, more memory usage) versus using a mapping type (slower, less wasted memory). It would be interesting to evaluate both of these against a hybrid which uses a fixed-size array for values below a certain threshold, and a mapping type for values above the threshold, since it seems like this would get the best aspects of both types of cache. We haven't implemented this, however, because it would complicate the project unnecessarily.

Problems

We haven't yet submitted our project to the UVa online thing, despite a few unsuccessful tries. Now that the class has discovered the class name requirement, and the fact that the inputs aren't necessarily given in order, we intend to try submitting again. It would certainly be nice if UVa would give a more descriptive error message than "runtime error", though.

Despite what UVa says about the problem, there are a few values (113383, for one) which actually will overflow a 32-bit signed integer. They probably designed the problem with C/C++ in mind, where you would have access to unsigned data types. We worked around this by using long integers.

The assignment seems to imply that it's a good idea to check user input using assertions, but as Professor Downing mentioned in class, that's not right. Because they're usually disabled, assertions should only be used as a sanity check for "impossible" situations. Anything that could actually happen (like the user entering bad input) should be checked separately from the assertions. I think the bit in the assignment about using assertions to check argument validity should be clarified, so people don't get the wrong idea.

Final remarks

Overall, this seems like a very worthwhile introductory project. The actual program is pretty simple, but the real goal is to get the whole class some experience with version control and unit testing, both of which are necessary tools, and both of which are frequently neglected. It always amazes me how far you can get in a computer science program without using version control. :(

Tuesday, January 19, 2010

Gentoo customizations

I've had this Gentoo install running for several (three? four? I forget) years now, because reinstalling your OS is for chumps. Unfortunately, that means that I've forgotten a lot of the tweaks I've made to my system, so that when (not if!) my main hard drive gives out, I'll be out of luck. :(

First, a note: my system hard drive is actually an 8 GB CF disk, because I'm way too cheap to get an SSD. Programs load really quickly, but writing to it can be really slow, so I try to keep write-mostly stuff off of it in general.

SquashFS portage tree

The portage package tree is made up of about a squillion tiny text files, and it's mostly read-only - it's basically the ideal use case for SquashFS. I've combined it with unionfs to make a fully read-write portage tree that only takes up 32 MB, instead of hundreds.

SquashFS is included in the Linux kernel now (gzip compression only, though - not a huge deal), but unionfs isn't. You'll have to grab the patch from here. Apply it with "gunzip unionfs-whatever_for_whatever.diff.gz | patch -p1", if I recall correctly. Then enable SquashFS, unionfs, and tmpfs - I'm assuming you know how to build a kernel, not gonna get into that. You'll also need sys-fs/squashfs-tools installed.

I use a few incredibly trivial scripts to manage this; you can get them here. To get started, run make-portage-snapshot.sh, which clones /usr/portage into a new SquashFS image, called portage-snapshot-new.sfs. Then, create /mnt/portage-ro and /mnt/portage-rw. My setup has a read-only mount of the SquashFS image in -ro, and a tmpfs mount in -rw. Next, rename portage-snapshot-new.sfs to portage-snapshot.sfs, and run mount-portage.sh. If all of this went smoothly, you have a fully-functional copy of the portage tree at /usr/portage. If so, run umount-portage.sh, wipe the portage tree, and mount the unioned portage tree again. Congratulations, you have a few hundred MB extra disk space! To update the SquashFS image, just run make-portage-snapshot.sh again, unmount portage, and then replace the old snapshot with the new snapshot and remount.

For bonus points, you can drop the ChangeLog files from the tree, since portage doesn't actually use them - they're just there because they're sometimes useful. Add the following line to your make.conf:


PORTAGE_RSYNC_EXTRA_OPTS="--exclude ChangeLog --delete-excluded"

and then sync your portage tree again. 

Moving stuff to /tmp

One thing that kind of sucks about Gentoo is having to keep an eye on /usr/portage/distfiles and /var/tmp/portage. I would guess that a lot of users don't even know that you have to, which ends up leading to a lot of wasted disk space. I solved this by moving both of them to /tmp, which (if you have WIPE_TMP enabled in /etc/conf.d/bootmisc) is wiped automatically at every reboot. The following two lines from my make.conf are the only configuration necessary:

PORTAGE_TMPDIR="/tmp"
DISTDIR="/tmp/portage-distfiles"

Another benefit of this: I'm one step closer to being able to mount /usr read-only! Which I'll get around to doing someday, I swear. >_>

Mounting by UUID

Know what sucks? Getting a new hard drive, or a new version of udev, and suddenly having all your disk devices under different names. None of your filesystems will mount! You have to go through them and fix all the entries in /etc/fstab! Oh noes!

Luckily there's a solution: you can list your filesystem UUIDs in fstab instead of your device names. Let's start with this:

/dev/hdb1   /home     ext4    defaults,noatime,data=ordered          0 0

You can find the filesystem UUID like so:

$ ls -l /dev/disk/by-uuid
total 0
lrwxrwxrwx 1 root root 10 Jan 17 15:24 0f36c009-feff-4271-866d-a8796f5c92a7 -> ../../hdc2
lrwxrwxrwx 1 root root 10 Jan 17 15:24 1cb3076f-94ee-4a02-ab01-6186aaeb429a -> ../../sda1
lrwxrwxrwx 1 root root 21 Jan 17 15:24 43dd79e6-af80-4791-938b-0a53f4d6d645 -> ../../hdb1
lrwxrwxrwx 1 root root 20 Jan 17 15:24 902c0d0b-b184-4571-90c3-246f6664e7cc -> ../../hdb2
lrwxrwxrwx 1 root root 10 Jan 17 15:24 cbdcbf38-c8f3-4b32-82f1-26ebb7245273 -> ../../sdb1

We're looking for hdb1, so the UUID for the filesystem is 43dd79e6-af80-4791-938b-0a53f4d6d645. Then we can slap that into the fstab:

UUID=43dd79e6-af80-4791-938b-0a53f4d6d645   /home     ext4    defaults,noatime,data=ordered          0 0

...and the system will find that filesystem no matter what the drive it's on is called.

Alternate keyboard layout

I use the Colemak keyboard layout, and switching was kind of a pain. :( You have to configure the console and X separately.

Console: Download this tarball, grab colemak-1.0/linux_console/colemak.iso15.kmap from it, gzip it, and drop it in /usr/share/keymaps/i386/colemak/ (which you'll have to create). Then, set:

keymap="colemak"

in /etc/conf.d/keymaps.

X: It used to be enough to set "XkbLayout" "us(colemak)" in the input section of your xorg.conf, but then they went and rewrote the configuration for Xorg completely. Awesome! For Xorg with HAL, save this file as /etc/hal/fdi/policy/10-x11-input.fdi, and you should be good.

Apparently, they're rewriting the configuration system from scratch again for the next version of Xorg. I can't wait to see what I have to do to be able to type with Xorg+udev. :-/

Default applications

There are two complementary approaches I'm experimenting with right now for default applications, because I don't especially trust gnome or KDE to get it right.

First: I have a symlink in /usr/local/bin that points to my current browser, and an eselect script which manages that symlink. I can then point all my applications at "browser" as the default browser, and control it pretty easily. Next best thing to the OS actually having some way to set a default browser built-in. This is pretty similar to debian's /etc/alternatives thing... actually, hmm, it's basically identical. I should look into this. >_>

Second: I've written a tool which can automatically find a program that supports a given file, and open it. I'm still not terribly fond of the configuration system for it, but it does work, and if I set it as the handler for a given file type in all my programs, I can change the program used to open that type in a single place, instead of doing it everywhere. 

Automatically mounting external drives

Last I checked, there's no good way to do this that doesn't involve using gnome or KDE. halevt is a mostly acceptable solution, but it has some annoying quirks, like absolutely refusing to mount drives that are listed in fstab. Not only that, it's doomed in the long run, thanks to the inexplicable and premature deprecation of HAL. (2006: "Everybody, use HAL! It will make everything awesome!" 2009: "Guys, it turns out HAL is terrible! Stop using it immediately! Also, we've got this new thing called udev - it'll make everything awesome!")

Actually, I'm not even sure why I'm mentioning this, except to be bitter. Thanks, Gnome and KDE, for your tremendous contributions to Linux UNIX free software Gnome and KDE!

Monday, January 11, 2010

How Not to be Evil

"Don't be evil." Words to live by, especially if you're a giant corporation like Google, which has more potential to do evil than perhaps any other organization in history. They have become pervasive over the past decade - at the moment, I am blogging on a Google service, with another one (GMail) open in another tab; my browser (like most browsers today) has a Google search box built right in at the top, and if I wanted to I could switch to Chrome, a Google-branded browser, in a few seconds. And that's just what I see right now in front of me. They're also pervasive in the search and advertising spaces - the vast majority of Internet users use Google first and foremost for search, and I believe the majority of ads shown on the web are from Google as well.

Once upon a time, Google made a reasonable effort not to abuse their position of influence too much. Sure, they show you ads when you search, but that's understood to be an implicit transaction - the ads pay for the search service. Sure, GMail reads your email to target ads at you, but users understood that that was the whole reason Google was running a mail service in the first place when they chose to use GMail. Whenever Google did something like that, there was a reasonable enough explanation. Notably, they wouldn't give themselves free advertising, and abuse their position as the owner of this massive ad network.

Recently, though, that's changed. It may have started with Chrome, it may have started earlier - I don't really know - but these days Google advertises their own stuff on the search page. When Chrome was first released, there were big ads on the Google home page urging you to install Chrome if you hadn't already. This has a tremendous effect on the market - just recently, Chrome overtook both Safari and Opera, which have been around for far longer. It's now the third most-used browser, behind only Firefox and Internet Explorer. It turns out that, in a shocking turn of events, if Google tells everybody that uses Google to install something, a lot of them will do it.

Today, I'm looking at an ad for the Nexus One, the long-awaited Googlephone, or "gPhone" to those that have a lingering iPhone fixation. It's a significant device, in that it's Google's attempt to tweak the collective nose of the cell phone industry in the US, and introduce people to the benefits of buying unlocked phones. That's still no excuse for giving it ad space - for free - that any company in the world would sell their souls for. Used to be that the most change you would see on the Google homepage was an artistic logo for special occasions, but apparently, that's changed.

Google is usually a benevolent corporate overlord, as corporate overlords go, but this sort of advertising raises all kinds of ethical questions. They are in a unique position as a company which controls online advertising, and also sells many other services, and they should realize that they've crossed some sort of line.