P. Static has got Opinions: April 2010

Friday, April 30, 2010

Stupid Python tricks, volume 1

So my thesis is due in less than a week, but that doesn't mean I can't write random nifty things in my free time. >_>

Like I've said in previous posts, UNIX pipes are pretty neat. So let's say I want to replicate them in Python. First: is it even possible? Turns out yeah; Python has operator overloading. In general I am opposed to operator overloading, since it makes it possible to write really horrible code. Really horrible. On the other hand, it also enables Stupid Language Tricks, which might just balance out. :D

I'll do this in two parts, 'cos the full trick is really a combination of two stupid tricks.

Trick 1

First: we want a pipe-like syntax for doing stuff. We can do this by overloading the | operator in Python, using the __or__ and __ror__ functions. Initially, I tried to write classes that implemented the __or__ function to sort of stick together, and strung instances of these classes into a pipeline. Let's say I want to take the absolute value of an input, convert it to a string, and reverse the digits. (No reason - it's just an operation that takes several functions to do.) It'd look like this:

pipe_wrap(abs) | pipe_wrap(str) | pipe_wrap(lambda s: s[::-1])

This turned out to be really messy. I want to write a pipe using regular functions, and having to wrap the functions like that just looks messy.

The approach I ended up with is having "joiner" objects in between. First, I defined a joiner class which implements both __or__ and __ror__, so it can handle |s on both the left and right sides. When it's received things on both sides, it magically turns into the result of calling a provided function on the two things. Here's the class:

class incomplete_join(object):
  def __init__(self, compose, l=None, r=None):
    self.compose = compose
    self.l = l
    self.r = r
  
  def __call__(self, *args, **kwargs):
    raise Exception("Incomplete join: l=%s r=%s" % (repr(self.l), repr(self.r)))
  
  def _do_join(self, l=None, r=None):
    if l is None:
      l = self.l
    if r is None:
      r = self.r
  
    if l is not None and r is not None:
      return self.compose(l, r)
   
    return incomplete_join(self.compose, l, r)
 
  def __or__(self, right):
    return self._do_join(r=right)
  
  def __ror__(self, left):
    return self._do_join(l=left)

def compose(f1, f2):
  return lambda x: f2(f1(x))

# Print function, because Python doesn't yet let you treat "print" as a function
def prnt(x):
  print x

j = incomplete_join(compose)

This is all we need to start composing functions together using pipes. compose implements standard functional composition, and incomplete_join handles the pipe-like syntax. To use our previous example, the following two things are equivalent:

temp = abs |j| str |j| (lambda s: s[::-1]) |j| prnt
temp2 = compose(compose(compose(abs, str), (lambda s: s[::-1])), prnt)

Pretty neat! However, we started out wanting to implement pipes. We've got pipe syntax, sort of, but not pipe behavior. Ideally, we'd like to have a set of objects which can run in parallel in separate processes, which have user-defined behaviors, and which can be combined into a pipeline. So:

Trick 2

In the first example, compose was just a simple thing that composed functions together. What if we used something fancier in its place? What if, for instance, we had a function that took two objects (which stored their results in multiprocessing queues), and called the second one in a new process, and took care of shuffling data from the output queue of the first into the second?

import multiprocessing

class queue_pipe(object):
  def __init__(self, func):
    self.func = func
    self.output_queue = multiprocessing.Queue(1)
  
  def __call__(self, item):
    if item is None:
      self.output_queue.put(None)
    else:
      self.output_queue.put(self.func(item))
  
  def next(self):
    item = self.output_queue.get()
    if item is None:
      raise StopIteration()
    return item

class queue_sink(object):
  def __init__(self, func):
    self.func = func
  
  def __call__(self, item):
    if item is not None:
      self.func(item)

def queue_compose(q1, q2):
  def transfer(q1, q2):
    try:
      item = q1.next()
      while True:
        q2(item)
        item = q1.next()
    except StopIteration:
      q2(None)
  
  proc = multiprocessing.Process(target=transfer, args=(q1, q2))
  proc.start()
  return q2
 
q = incomplete_join(queue_compose)

So what does this do? Using this, you can create a pipeline out of callable iterators. queue_pipe implements both the callable and iterator interfaces, and queue_sink is a (sort of unnecessary) callable which filters out the terminal None value in the pipe. transfer() runs in a new process, and bridges the gap between two pipe components.

An example would be a lot more descriptive here.

@queue_pipe
def double(x):
        return x * 2

qprnt = queue_sink(prnt)

iter(range(50)) |q| double |q| qprnt

This example demonstrates two different ways you can use the wrapper classes, just for the heck of it. The single line of code at the end constructs a pipeline across three processes, and executes each function for each item in the input, in a separate process. (Just like a UNIX pipe!) As an added bonus, the ends of the pipe are standard Python objects, so you can use any iterable at the front and any callable at the back. It even cleans up the processes afterward, since they exit when the end of the input is reached.

So, yeah. This is the kind of stuff I work on when I'm procrastinating on my thesis.

Disclaimer: I do not recommend using this trick in any critical code. It is a hack, and an incomplete one at that. I take no responsibility for bad things that may happen if you actually use this in production code. I cannot guarantee that this code will not fork-bomb your systems, give you the flu, or kidnap your pets.

Thursday, April 8, 2010

JSON pipes

So I mentioned this in my pre-vacation post, but I'm actually running with this now - I've got a sort of prototype going and everything. It's written in Python, but I'm designing it to be trivial to use from any language with a good JSON library, which I think is pretty much every language these days.

Rationale

The UNIX command line really has two problems. First, it's too user-friendly. ("Whaa?" Keep reading.) Any tools which deal with structured data have to present that data to the user, we assume, so programs commonly do things like wrapping or truncating long lines, rounding numbers to be more readable, lining up columns, things like that. However, one of the founding principles of UNIX is that tools should be designed so that their output can be passed to other programs as input. What's good for humans is bad for programs; too much cleaning up the output and it's basically unusable in a pipeline. (The curses library is an extreme example of where that can lead.) There's a constant tension between user-friendly and parsable, so most programs have landed at a weird compromise, and aren't very good at either one.

The second problem is delimiting and escaping. UNIX handles them in a completely ad-hoc and disorganized manner, when it addresses them at all. The worst example of this in my mind is "find -print0". find passes filenames around as newline-separated raw data, and this works fine as long as the filenames don't contain blanks or newlines. Since it turns out that some filenames actually do, they added a special case to find where it would delimit its output with null characters, and then they went and had xargs accept null-delimited data. Problem solved! As long as every program you want to pipe into has implemented the same hack, of course.

And really, find is one of the better examples out there. Most programs ignore delimited and escaping completely, and just do whatever comes to mind. As an example: I'm pretty sure it's impossible to correctly parse the output of "df" if any of your mounted devices contains spaces.

Solution

Programs should use a more structured data format on their stdin and stdout streams, when it makes sense. I'm using JSON for this, because it has everything I need, nothing I don't, and it's dead easy to parse. In a nutshell, every line going through the pipe is a valid JSON value, usually either an object or a string. This buys us a lot - records are trivial to parse (and even named, because you can use dicts for them - hallelujah!), strings are properly escaped, and if we want to then convert to nice human-readable output, the program to do it only needs to be a few lines long. And, it turns out, writing programs to use JSON streams isn't any harder than writing programs to use raw text streams.

(I should emphasize this point: I'm using exactly one line per JSON document. It's entirely possible to parse JSON streams without this restriction - I have a parser that does it, actually - but it's more trouble than necessary. Every language can parse lines out of a stream, and just about every language has a JSON parser. No point requiring a more advanced parser when adding this restriction lets us parse a stream using only a few lines of code.)

Here is a simple example, using the tools I've written so far. It finds the three largest files in a directory:

jls | jsort size | head -n 3

(Note that it works seamlessly with existing line-oriented UNIX tools.)

For comparison, here's the best standard UNIX equivalent I can write:

ls -l | sort -rk 5 | head -n 3

It's about the same length, but it took longer to write, and I had to consult the sort man page and try it out a few times before it worked. (And, really, it would have taken much longer if I hadn't happened to know that sort could sort by arbitrary fields in whitespace-delimited data.) In this example, the main thing JSON buys us is named fields for the sort.

"But hey," you might say, "that's a pretty basic example! Show me something that's really difficult to do with standard command line tools."

RSS feeds are a great example of the sort of structured data that's difficult to fit in a pipe. Let's say you want to write a generic, reusable UNIX program for fetching an RSS feed. Your options for output are either dumping the raw XML (ewww, gross! plus, XML streams are poorly defined and hard to parse), or making up an ad-hoc format ("okay, so every line will be one piece of data, in Key: Value format, and a double newline will be the separator between RSS entries, and the entry text will be base64 encoded so we don't have to worry about double newlines in it, and...."). Or, you could dump the data as a JSON stream! Not only is this trivial (my example of this is only 35 lines long), it's also really easy to work with. For example:

jrss $feed_url | jget link | jxargs -n 1 firefox

I dare you to solve this problem this nicely using standard UNIX practices. (No, you can't just write a program that reads an RSS feed and only spits out the links. The whole point is to get stuff done by combining very general pieces. Yes, I know that jrss has the feedparser library doing most of the work. The point isn't really the implementation; it's the fact that jrss is an extremely general-purpose command line tool.)

Conclusion

I've always thought that pipes were far more elegant and useful than the tools we use with them. I feel like I've finally managed to justify that, by coming up with a better way to pass data through them. Even if you're not convinced that we should start putting JSON everywhere, I hope I've at least convinced some people to start thinking about the shortcomings of traditional UNIX tools. There has been entirely too little of that lately.

For the curious: the code I've been playing around with is online at bitbucket. Suggestions and discussion are, naturally, welcome.

Thursday, April 1, 2010

Vacation miscellany

* For some reason I've been sitting on this post for a week. Wanted to make sure I didn't miss anything, I guess. Plus I've been really busy. :(

* ISPs are by nature jerks about SMTP servers, but on this vacation for the first time, their stupid policies have actually cost me a meal. :[

* Everybody in Guyana apparently has a nickname, or "call name". People get them when they're kids, usually, and they just sort of stick. My mom's nickname is Sister, because one of my aunts used to call her that all the time when they were kids. My Uncle Val's nickname is Valmiki, after the poet. One of my great-uncles is Doctor, because he wanted to be a doctor when he was a kid, and would go around giving people pretend injections. (This, despite the fact that he's been a practicing lawyer for several decades now.)

* The next time you're vacationing in the Caribbean, here's an Authentic Caribbean Experience: Find a coconut vendor; they'll be the guys with the carts full of coconuts and the machetes. They chop the tops right off the coconuts, stick a straw in them, and give them to you to drink. When you're done, you can go back to them with the shell and they'll chop it up for you, and cut off a piece of the shell so you can scrape the jelly out from the inside. If you like coconut, it's kind of awesome. If you don't like coconut, it's possible that you've never had a good coconut D:

* Somehow, I feel like I was able to think more clearly after being Internet-deprived for a few days. It was the weirdest thing, like the steady stream of distractions I get online have more than the obvious effect of distracting me. This seems like something worth looking into later on.

* Dear Blogger: If I am going through the trouble of writing my own HTML, feel free to take that as a hint that I don't want you attempting to rewrite my URLs and breaking them in the process every time I edit a post. Wordpress has its faults, but I'm pretty sure it can handle a damn anchor link.

P. Static has got Opinions

Friday, April 30, 2010

Stupid Python tricks, volume 1

Thursday, April 8, 2010

JSON pipes

Thursday, April 1, 2010

Vacation miscellany

Followers

Blog Archive

About Me