Rationale
The UNIX command line really has two problems. First, it's too user-friendly. ("Whaa?" Keep reading.) Any tools which deal with structured data have to present that data to the user, we assume, so programs commonly do things like wrapping or truncating long lines, rounding numbers to be more readable, lining up columns, things like that. However, one of the founding principles of UNIX is that tools should be designed so that their output can be passed to other programs as input. What's good for humans is bad for programs; too much cleaning up the output and it's basically unusable in a pipeline. (The curses library is an extreme example of where that can lead.) There's a constant tension between user-friendly and parsable, so most programs have landed at a weird compromise, and aren't very good at either one.
The second problem is delimiting and escaping. UNIX handles them in a completely ad-hoc and disorganized manner, when it addresses them at all. The worst example of this in my mind is "find -print0". find passes filenames around as newline-separated raw data, and this works fine as long as the filenames don't contain blanks or newlines. Since it turns out that some filenames actually do, they added a special case to find where it would delimit its output with null characters, and then they went and had xargs accept null-delimited data. Problem solved! As long as every program you want to pipe into has implemented the same hack, of course.
And really, find is one of the better examples out there. Most programs ignore delimiting and escaping completely, and just do whatever comes to mind. As an example: I'm pretty sure it's impossible to correctly parse the output of "df" if the name of any of your mounted devices contains spaces.
Solution
Programs should use a more structured data format on their stdin and stdout streams, when it makes sense. I'm using JSON for this, because it has everything I need, nothing I don't, and it's dead easy to parse. In a nutshell, every line going through the pipe is a valid JSON value, usually either an object or a string. This buys us a lot - records are trivial to parse (and even named, because you can use dicts for them - hallelujah!), strings are properly escaped, and if we want to then convert to nice human-readable output, the program to do it only needs to be a few lines long. And, it turns out, writing programs to use JSON streams isn't any harder than writing programs to use raw text streams.
(I should emphasize this point: I'm using exactly one line per JSON document. It's entirely possible to parse JSON streams without this restriction - I have a parser that does it, actually - but it's more trouble than necessary. Every language can parse lines out of a stream, and just about every language has a JSON parser. No point requiring a more advanced parser when adding this restriction lets us parse a stream using only a few lines of code.)
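To give a feel for the "few lines of code" claim, here is a minimal sketch in Python of reading and writing such a stream. The function names read_json_lines and write_json_line are made up for illustration and are not taken from my tools; the only assumption is the one-document-per-line rule described above.

```python
import io
import json


def read_json_lines(stream):
    """Yield one parsed JSON value per non-blank line of a text stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def write_json_line(value, stream):
    """Write one JSON value on a single line.

    json.dumps without an indent never emits a raw newline, and it escapes
    newlines inside strings, so one value always occupies exactly one line.
    """
    stream.write(json.dumps(value) + "\n")


# Demo on an in-memory stream; a real filter would use sys.stdin / sys.stdout.
demo_in = io.StringIO('{"name": "a b.txt", "size": 3}\n"just a string"\n')
demo_out = io.StringIO()
for value in read_json_lines(demo_in):
    write_json_line(value, demo_out)
print(demo_out.getvalue(), end="")
```

Because the streams are passed in as arguments, the same two functions work on stdin, a file, or a socket wrapped in a text stream.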
Here is a simple example, using the tools I've written so far. It finds the three largest files in a directory:
jls | jsort size | head -n 3
(Note that it works seamlessly with existing line-oriented UNIX tools.)
For comparison, here's the best standard UNIX equivalent I can write:
ls -l | sort -rnk 5 | head -n 3
It's about the same length, but it took longer to write, and I had to consult the sort man page and try it out a few times before it worked. (And, really, it would have taken much longer if I hadn't happened to know that sort could sort by arbitrary fields in whitespace-delimited data.) In this example, the main thing JSON buys us is named fields for the sort.
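As a rough illustration of what sorting on a named field can look like, here is a sketch in Python. It is not the actual jsort source: the function signature, the reverse flag, and the sample records (including the "name" and "size" keys) are assumptions made for the example.

```python
import io
import json


def jsort(in_stream, out_stream, field, reverse=False):
    """Sort one-JSON-object-per-line records by a named field."""
    records = [json.loads(line) for line in in_stream if line.strip()]
    # JSON numbers arrive as Python numbers, so sizes compare numerically.
    records.sort(key=lambda rec: rec[field], reverse=reverse)
    for rec in records:
        out_stream.write(json.dumps(rec) + "\n")


# Demo with made-up records; a real tool would pass sys.stdin / sys.stdout.
sample = io.StringIO(
    '{"name": "notes.txt", "size": 120}\n'
    '{"name": "my movie.avi", "size": 73400320}\n'
    '{"name": "todo", "size": 9}\n'
)
out = io.StringIO()
jsort(sample, out, "size", reverse=True)
print(out.getvalue(), end="")
```

Note that the file name with a space in it needs no special handling, and there is no field number to look up.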
"But hey," you might say, "that's a pretty basic example! Show me something that's really difficult to do with standard command line tools."
RSS feeds are a great example of the sort of structured data that's difficult to fit in a pipe. Let's say you want to write a generic, reusable UNIX program for fetching an RSS feed. Your options for output are either dumping the raw XML (ewww, gross! plus, XML streams are poorly defined and hard to parse), or making up an ad-hoc format ("okay, so every line will be one piece of data, in Key: Value format, and a double newline will be the separator between RSS entries, and the entry text will be base64 encoded so we don't have to worry about double newlines in it, and...."). Or, you could dump the data as a JSON stream! Not only is this trivial (my example of this is only 35 lines long), it's also really easy to work with. For example:
jrss $feed_url | jget link | jxargs -n 1 firefox
I dare you to solve this problem this nicely using standard UNIX practices. (No, you can't just write a program that reads an RSS feed and only spits out the links. The whole point is to get stuff done by combining very general pieces. Yes, I know that jrss has the feedparser library doing most of the work. The point isn't really the implementation; it's the fact that jrss is an extremely general-purpose command line tool.)
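For a sense of how small the general pieces can be, here is a sketch in Python of a field extractor in the spirit of jget. It is an illustration, not the real tool: the function signature, the choice to skip records that lack the field, and the example.com feed entries are all assumptions.

```python
import io
import json


def jget(in_stream, out_stream, field):
    """Emit the value of `field` from each JSON object, one value per line."""
    for line in in_stream:
        if not line.strip():
            continue
        record = json.loads(line)
        # Assumption: records that are not objects, or lack the field, are skipped.
        if isinstance(record, dict) and field in record:
            out_stream.write(json.dumps(record[field]) + "\n")


# Demo with made-up feed entries; a real tool would use sys.stdin / sys.stdout.
entries = io.StringIO(
    '{"title": "First post", "link": "http://example.com/1"}\n'
    '{"title": "No link here"}\n'
    '{"title": "Second post", "link": "http://example.com/2"}\n'
)
out = io.StringIO()
jget(entries, out, "link")
print(out.getvalue(), end="")
```

The output is itself a JSON stream (one JSON string per line), so it can feed the next stage of the pipeline without any re-escaping.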
Conclusion
I've always thought that pipes were far more elegant and useful than the tools we use with them. I feel like I've finally managed to justify that, by coming up with a better way to pass data through them. Even if you're not convinced that we should start putting JSON everywhere, I hope I've at least convinced some people to start thinking about the shortcomings of traditional UNIX tools. There has been entirely too little of that lately.
For the curious: the code I've been playing around with is online at bitbucket. Suggestions and discussion are, naturally, welcome.
5 comments:
Very nice. It's totally time to rethink this. For any nonbelievers, see David Wheeler's essay. UNIX + JSON = UNISON ? It wouldn't be impossible to include an interpreter for old scripts either.
UNISON. Heh, I like that. :D
As far as interacting with old programs, it should be pretty trivial to write adapters, right? On one side, you just need something that takes newline- or null-delimited strings, and escapes them, and then the inverse on the other side.
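A quick sketch of that pair of adapters in Python, to show how little is involved. The function names are invented for the example, and it assumes the whole input fits in memory and that the JSON side carries only strings.

```python
import json


def delimited_to_json(data, delimiter="\0"):
    """Turn delimiter-separated text into one JSON string per line."""
    items = data.split(delimiter)
    if items and items[-1] == "":
        items.pop()  # drop the empty piece after a trailing delimiter
    return "".join(json.dumps(item) + "\n" for item in items)


def json_to_delimited(text, delimiter="\0"):
    """Inverse adapter: one JSON string per line back to delimited text."""
    # Assumes every non-blank line is a JSON string, not an object or number.
    lines = [line for line in text.split("\n") if line.strip()]
    return "".join(json.loads(line) + delimiter for line in lines)


# Demo: null-delimited names like those from "find -print0", one with a newline.
names = "a b.txt\0new\nline.txt\0"
as_json = delimited_to_json(names)
print(as_json, end="")
print(json_to_delimited(as_json) == names)
```

The embedded newline survives the round trip because json.dumps escapes it, which is exactly the property the null-delimiter hack was invented to get.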
In the spirit of full disclosure, I do have to openly admit my predilection for AllThingsTechie (for example, my wife thinks it may be a bit odd when I get excited about work sending me to a week of in-depth training seminars)...
But I have to say the potential for this practically gives me goosebumps :)
I may have to spend a few hours tinkering with this idea.
I came across this post a few months ago, and I'd be really interested in what has become of this. Do you have your code hosted somewhere to try it?
Other than that, your standard shell example could be written more simply with
ls -S | head -n3
if you only care to get the file names, or
ls -lS | sed -n '2,4p'
if you want the long info, making it only two commands piped in either case.
Yeah, code for the tools I'd written is up on my bitbucket page: http://bitbucket.org/pstatic/jcmd Haven't done any work on them in a while, though, been distracted by work mainly :(