Some thoughts on jq and statelessness
I came across a great article via lobsters recently: Introducing zq: an Easier (and Faster) Alternative to jq. I posted some brief thoughts on it over on the lobsters thread, and in the spirit of "owning your own words", I thought I'd write them up here too.
I do like articles like this, that show and lay out the thinking behind the conclusion, and along the way, impart knowledge about the topic at hand. Especially when they're on a subject I'm eager to learn more about.
While reading the article a couple of things struck me.
First, I'd not really heard of the phrase "stateless dataflow" (and its opposite "stateful dataflow"). I did look it up via Google and found that there were very few results, most of them being scholarly papers either in PDF or even PostScript form. So I sort of forgave myself for not really knowing what was implied, although I had taken a guess anyway.
Basically the author was explaining that the reason for finding the jq language difficult was down to the computational model. I don't think jq is the easiest language, and in my experience so far that is down to a number of things: not least the relative terseness of the official manual, but also my inability to grasp powerful constructs, as well as having to manipulate complex object and array structures in my head - not only statically, but also having to imagine how they might change when processed through filters.
It seems that the author's issue with the "stateless dataflow" was down to the fact that what's being processed by jq is very often a stream of discrete JSON values, rather than a single value.
So what do I mean by "JSON value"? Well, in the article Introducing JSON there's a McKeeman form expressing the JSON grammar, and the building blocks of what we know as JSON are described as "JSON values" thus:
value
  object
  array
  string
  number
  "true"
  "false"
  "null"
These JSON values are described as fundamental building blocks in RFC 8259.
Anything expressed in JSON will be one of these value types. This is why, for example, a bare string like "hello world" is valid JSON, as is a bare number like 42.
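To see that concretely, here's a small check using Python's standard json module (my own illustration, not something from the article): each of these strings parses as a complete JSON value in its own right.

```python
import json

# Each string below is, on its own, a complete and valid JSON value:
# an object, an array, a string, a number, and the three literals.
texts = ['{"a": 1}', '[1, 2, 3]', '"hello world"', '42', 'true', 'false', 'null']

for text in texts:
    value = json.loads(text)  # no error: every one of them is valid JSON
    print(f'{text!r} -> {value!r}')
```
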
Processing JSON with jq
In the "Invoking jq" section of the manual, it says:
jq filters run on a stream of JSON data. The input to jq is parsed as a sequence of whitespace-separated JSON values which are passed through the provided filter one at a time. The output(s) of the filter are written to standard out, again as a sequence of whitespace-separated JSON data.
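A rough Python sketch of that input model (my own illustration; jq's actual parser is written in C, and the function name here is made up): scan the input, decode one whitespace-separated JSON value at a time, and hand each one over separately.

```python
import json

def json_value_stream(text):
    """Yield each whitespace-separated JSON value in text, one at a
    time, roughly the way jq consumes its input."""
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(text):
        # skip the whitespace between values; raw_decode doesn't tolerate it
        while idx < len(text) and text[idx].isspace():
            idx += 1
        if idx == len(text):
            break
        value, idx = decoder.raw_decode(text, idx)
        yield value

# Two discrete JSON values in one input string
for value in json_value_stream('[1,2,3] [4,5,6]'):
    print(value)
```
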
Key for me, in my journey towards a deeper understanding of jq, is that the "filter" here is the entire jq program, whether that's something short expressed literally on the command line, or stored in a file and pointed to with the -f option.
So each and every JSON value that is passed into jq is processed by the entire program.
There's the "slurp" option (with -s) which will "read the entire input stream into a large array and run the filter just once". This is maybe what one might initially assume or expect jq to do, but one needs to be explicit.
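In Python terms (my analogy, not jq's implementation), the difference looks something like this: without slurping, each value gets its own independent run of the filter; with slurping, everything is first gathered into one enclosing array.

```python
import json

# seq 3 would produce the lines "1", "2", "3": three discrete JSON values.
stream = "1\n2\n3\n"

# Default behaviour: one independent "run" of the filter per value.
for token in stream.split():
    value = json.loads(token)
    # ...the whole filter would run here, seeing only this single value...

# Slurp (-s): gather every value into one array, run the filter once on it.
slurped = [json.loads(token) for token in stream.split()]
print(slurped)  # [1, 2, 3] -- now a single JSON value
```
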
Perhaps a small example might help, based on a sequence of JSON values that we can produce with seq:
1
2
3
If we pass this sequence of JSON values through the simplest of jq filters -- the identity function -- like this:
seq 3 | jq .
then we get this:
1
2
3
One might think "well, what else would you expect?" but this illustrates the nature of running discrete JSON values through a filter quite nicely.
Before we continue, let's use the compact output (-c) option here:
seq 3 | jq -c .
The output is the same:
1
2
3
For me, this drives home the "discrete JSON values" approach to both jq's input and output processing - there are three discrete values in, and three out.
I guess this also helps explain what the author of the article means by "stateless". As far as the filter is concerned, it's seeing the values 1, 2 and 3 separately, and in a new context each time. And as the article illustrates, this is where the slurp (-s) option comes in. Adding the option to the above example:
seq 3 | jq -c -s .
[1,2,3]

A single JSON value. This is because what the filter received was actually this:

[1,2,3]

Three discrete values, but wrapped in an outer enclosing array. A single JSON value, in the form of an array. And being the simple identity function, just regurgitating what it reads, it produces in turn that same, single JSON value as output. It's on one line here, rather than pretty printed with more whitespace, because of the -c option.
The --slurp option brings about a sort of statefulness, in that every discrete JSON value, previously independent, now shares the same single context of a single invocation of the filter.
Changing the filter from the . identity function to the add function* demonstrates this singular context, this "statefulness":

seq 3 | jq -s add

This yields the single JSON value:

6
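The Python analogue (mine, not the article's) makes the difference in context clear: a total can only be computed by the run that sees all the values together.

```python
import json

stream = "1\n2\n3\n"

# Per-value processing: each run sees exactly one number, so no single
# run has enough context to produce the total.
values = [json.loads(token) for token in stream.split()]

# Slurped processing: one run sees the single array [1, 2, 3], and an
# aggregate like jq's add (sum here) becomes possible.
total = sum(values)
print(total)  # 6
```
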
*I'm calling them "functions", but the manual actually calls them "filters".
There's one more observation I'd like to make in these ramblings. The article describes the task of adding up the numbers here:
echo '[1,2,3] [4,5,6]'
In other words, the result should be 21.
We know by now that this:

[1,2,3] [4,5,6]

is actually two discrete JSON values. Two arrays. So, as the author demonstrates, the --slurp option is called for, thus:
echo '[1,2,3] [4,5,6]' | jq -s '[.[] | add] | add'

So in this invocation, the filter is executed once only, and actually receives:

[[1,2,3],[4,5,6]]
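Sketched in Python (my analogy, not the article's): the single slurped value is a nested list, the inner add runs once per inner array, and the outer add combines those two sums.

```python
# The one JSON value the slurped filter receives: an array of two arrays.
slurped = [[1, 2, 3], [4, 5, 6]]

inner_sums = [sum(inner) for inner in slurped]  # the per-array adds: [6, 15]
total = sum(inner_sums)                         # the final, outer add
print(total)  # 21
```
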
The article does a great job of describing the author's thought process here, and also showing how some of the basic filters work. And I guess the filter used here is possibly deliberately complex, or at least contrived to illustrate a point:

[.[] | add] | add
However, to be fair to the language, it has some syntactic sugar in the form of map. In the description in the manual, we read:

map(x) is equivalent to [.[] | x]. In fact, this is how it's defined.
And we can see this definition in jq's source, specifically in the builtin.jq file:

def map(f): [.[] | f];
This definition helps the mental model, and helps me a lot, not only to reduce noise, but also to relate the computation to an arguably well-known function (map). So the entire line turns into a much simpler:
echo '[1,2,3] [4,5,6]' | jq -s 'map(add) | add'
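Python has the same two spellings (again just my analogy): an explicit comprehension over the outer list, and the sugared built-in map - which mirrors how jq's map(f) is just [.[] | f] underneath.

```python
data = [[1, 2, 3], [4, 5, 6]]

# The explicit form, akin to jq's [.[] | add]:
explicit = [sum(inner) for inner in data]

# The sugared form, akin to jq's map(add):
sugared = list(map(sum, data))

print(explicit, sugared, sum(sugared))  # [6, 15] [6, 15] 21
```
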
This has turned into a bit of a longer ramble, beyond what I'd originally commented. But writing it has helped me think about this a bit more. Perhaps it helps you too - I hope so!
And most importantly, my thoughts in this post should not detract from the article nor from the author's conclusions with zq - more power to them!