jq or xsltproc alternative for s-expressions?

2019-04-14 11:47发布

问题:

I have a project which contains a bunch of small programs tied together using bash scripts, as per the Unix philosophy. Their exchange format originally looked like this:

meta1a:meta1b:meta1c AST1
meta2a:meta2b:meta2c AST2

Where the :-separated fields are metadata and the ASTs are s-expressions which the scripts pass along as-is. This worked fine, as I could use cut -d ' ' to split the metadata from the ASTs, and cut -d ':' to dig into the metadata. However, I then needed to add a metadata field containing spaces, which breaks this format. Since no field uses tabs, I switched to the following:

meta1a:meta1b:meta1c:meta 1 d\tAST1
meta2a:meta2b:meta2c:meta 2 d\tAST2

Since I envision more metadata fields being added in the future, I think it's time to switch to a more structured format rather than playing a game of "guess the punctuation".

Instead of delimiters and cut I could use JSON and jq, or I could use XML and xsltproc, but since I'm already using s-expressions for the ASTs, I'm wondering if there's a nice way to use them here instead?

For example, something which looks like this:

(echo '(("foo1" "bar1" "baz1" "quux 1") ast1)'
 echo '(("foo2" "bar2" "baz2" "quux 2") ast2)') | sexpr 'caar'

"foo1"
"foo2"

My requirements are:

  • Straightforward use of stdio with minimal boilerplate, since that's where my programs read/write their data
  • Easily callable from shell scripts or provide a very compelling alternative to bash's process invocation and pipelining
  • Streaming I/O if possible; ie. I'd rather work with one AST at a time rather than consuming the whole input looking for a closing )
  • Fast and lightweight, especially if it's being invoked a few times; each AST is only a few KB, but they can add up to hundreds of MB
  • Should work on Linux at least; cross-platform would be nice

The obvious choice is to use a Lisp/Scheme interpreter, but the only one I'm experienced with is Emacs, which is far too heavyweight. Perhaps another implementation is more lightweight and suited to this?

In Haskell I've played with shelly, turtle and atto-lisp, but most of my code was spent converting between String/Text/ByteString, wrapping/unwrapping Lisps, implementing my own car, cdr, cons, etc.

I've read a little about scsh, but don't know if that would be appropriate either.

回答1:

You might give Common Lisp a try.

Straightforward use of stdio with minimal boilerplate, since that's where my programs read/write their data

(loop for (attributes ast) = (safe-read) do (print ...)
  • Read/write from standard input and output.
  • safe-read should disable execution of code at read-time. There is at least one implementation. Don't eval your AST directly unless you perfectly know what's in there.

Easily callable from shell scripts or provide a very compelling alternative to bash's process invocation and pipelining

In the same spirit as java -jar ..., you can launch your Common Lisp executable, e.g. sbcl, with a script in argument: sbcl --load file.lisp. You can even dump a core or an executable core of your application with everything preloaded (save-lisp-and-die). Or, use cl-launch which does the above automatically, and portably, and generates shell scripts and/or makes executable programs from your code.

Streaming I/O if possible; ie. I'd rather work with one AST at a time rather than consuming the whole input looking for a closing )

If the whole input stream starts with a (, then read will read up-to the closing ) character, but in practice this is rarely done: source code in Common Lisp is not enclosed in one pair of parenthesis per-file, but as a sequence of forms. If your stream produces not one but many s-exps, the reader will read them one at a time.

Fast and lightweight, especially if it's being invoked a few times; each AST is only a few KB, but they can add up to hundreds of MB

Fast it will be, especially if you save a core. Lightweight, well, it is well-known that lisp images can take some disk space (e.g. 46MB), but this is rarely an issue. Why is is important? Maybe you have another definition about what lightweight means, because this is unrelated to the size of the AST you will be parsing. There should be no problem reading those AST, though.

Should work on Linux at least; cross-platform would be nice

See Wikipedia. For example, Clozure CL (CCL) runs on Mac OS X, FreeBSD, Linux, Solaris and Windows, 32/64 bits.



回答2:

Working on a slightly different task, I again found the need to process a bunch of s-expressions. This time I needed to perform some non-trivial processing of the given s-expressions (extracting lists of symbols used, etc.), rather than having the option to pass them along as opaque strings.

I gave Racket a try and was pleasantly surprised; it was much nicer than the other Lisps I've used before (Emacs Lisp and various application-specific Scheme scripts), since it has nice documentation and a batteries included standard library.

Some of the relevant points for this kind of task:

  • "Ports" for reading and writing data. These can be (dynamically?) scoped across an expression, and default to stdio (i.e. (current-input-port) defaults to stdin and (current-output-port) defaults to stdout). Ports make stdio and file access about as nice to use as a shell: more verbose, but fewer gnarly edge-cases.
  • Various conversion functions like port->string, file->lines, read, etc. make it easy to get data at the appropriate form of granularity (characters, lines, strings, expressions, etc.).
  • I couldn't find a "standard" way to read multiple s-expressions, since read only returns one, so iteration/recursion would be needed to do this in a streaming fashion.
  • If streaming isn't needed, I found it easiest to read the whole input as a string, append "(\n" and "\n)", then use (with-input-from-string my-modified-input read) to get one big list.

I found Racket's startup time to be pretty slow, so I wouldn't recommend invoking a script over and over as part of a loop if speed is a concern. It was easy enough to move my looping into Racket and have the script invoked once though.