Wednesday, January 27, 2010

Project Blog: Collatz

(Note: this post is an extra-credit assignment for my software engineering class.)


The first project is a simple numerical problem based around the Collatz Conjecture. The assignment involves finding the maximum "cycle length" (the number of steps it takes to reach 1) for any number in a given range. Because this is a problem which would involve a lot of repeated computation if it was solved naively, we implemented it using a memoizing cache. Our cache implementation is pretty simple, and uses a HashMap in Java and a dict in Python. This implementation seems to perform well.

Of course, the focus of this assignment isn't on the actual problem, but on the infrastructure we build around it. In particular, we were encouraged to use an issue tracker, plan out the design on a project wiki, use a version control system, and create a comprehensive test suite. We were to use Google Code for the first two. As usual, Google's online code hosting tools are excellent, and even though we didn't actually store any code on it, the wiki and issue tracker were very easy to get started using once we created the project.

My partner and I used Mercurial for version control. Instead of setting up some sort of hosted version control thing, we simply created world readable (chmod 755) clones of a repository on our CS accounts. This isn't exactly secure, but it was the simplest way to get started, and we're reasonably sure that other students won't try to poke around and guess the directory name we used. In my opinion, Mercurial is the simplest of the three suggested VCSes, once you wrap your head around the separation between the repository and the files.

A helpful Mercurial tip: If you add a .hgignore file (note the . at the front of the name - it's a hidden file) to the top level of your repository, Mercurial will ignore files matching certain patterns. This can keep you from accidentally committing a bunch of compiled class files or other generated files, and it also makes the output of "hg status" a lot more usable.

Unit testing isn't something I do often, but I think it's something I should get used to doing on all my projects. For any codebase that I intend to keep around for a while, the value of a test framework that I can quickly try out new changes in would definitely outweigh the cost of writing the tests in the first place. It's pretty cool to be able to get a sanity check that all the code actually does what you think it does. One thing I didn't initially realize is that you can do more than just test the provided functions; I now have a test case which checks the contents of the cache after a test run.

Implementation stuff

It should be possible to write the divide-by-two step as a right shift, and the times-three-plus-one step as a left shift and two adds. This might improve performance, but I haven't done it because I strongly believe in leaving that sort of thing to the compiler. The way I see it, if I'm not using a language that has a good optimizing compiler, I probably don't care about performance all that much to begin with.

(There's another possible optimization that was mentioned in passing during class - combining iterations when possible. Since it's guaranteed that the result of a (3n+1) operation is even, you can implement ((3n+1)/2) as (n + (n >> 1) + 1), and get two iterations done at once.)

There's a tradeoff between using an array for the cache (faster, more memory usage) versus using a mapping type (slower, less wasted memory). It would be interesting to evaluate both of these against a hybrid which uses a fixed-size array for values below a certain threshold, and a mapping type for values above the threshold, since it seems like this would get the best aspects of both types of cache. We haven't implemented this, however, because it would complicate the project unnecessarily.

Problems

We haven't yet submitted our project to the UVa online thing, despite a few unsuccessful tries. Now that the class has discovered the class name requirement, and the fact that the inputs aren't necessarily given in order, we intend to try submitting again. It would certainly be nice if UVa would give a more descriptive error message than "runtime error", though.

Despite what UVa says about the problem, there are a few values (113383, for one) which actually will overflow a 32-bit signed integer. They probably designed the problem with C/C++ in mind, where you would have access to unsigned data types. We worked around this by using long integers.

The assignment seems to imply that it's a good idea to check user input using assertions, but as Professor Downing mentioned in class, that's not right. Because they're usually disabled, assertions should only be used as a sanity check for "impossible" situations. Anything that could actually happen (like the user entering bad input) should be checked separately from the assertions. I think the bit in the assignment about using assertions to check argument validity should be clarified, so people don't get the wrong idea.

Final remarks

Overall, this seems like a very worthwhile introductory project. The actual program is pretty simple, but the real goal is to get the whole class some experience with version control and unit testing, both of which are necessary tools, and both of which are frequently neglected. It always amazes me how far you can get in a computer science program without using version control. :(

No comments: