Summing and grouping values with jq

Here's yet another note-to-self on using jq, this time to transform a flat list of order totals and dates into a summary of total order values by year.

In doing some research for an upcoming live stream I was looking at the Northwind OData v4 service and in particular at the Summary_of_Sales_by_Years entity set. It is not what I initially expected; rather than be a summary of sales by year, it was a list of orders each with a shipping date, order ID and order total. There are over 800 entries, and I grabbed all of them and stored them in a single JSON file Summary_of_Sales_by_Years.json using a Bash shell script slurp that auto-follows the @odata.nextLink annotation trail on each chunk response.

I wanted to group the list by year and get grand totals for each year. This blog post describes how I went about it, and also describes a sort of preparation stage too where I created an initially much smaller dataset to experiment with.

I've created snippets on jqplay for each of the stages here - you'll see the links at the relevant points in this post.

Preparing the sample data

For the sake of brevity in this post, I cut the data down to just 6 entries, two for each of the years represented (1996, 1997 and 1998). I did this with jq too, redirecting the output into a new file subset.json, thus:

jq \
  '.value |= (
    group_by(.ShippedDate[:4])
    | map(.[:2])
    | flatten
  )' \
  Summary_of_Sales_by_Years.json \
  > subset.json

This resulted in the following content in subset.json, which I can now use to more easily illustrate the summing and grouping.

{
  "value": [
    {
      "ShippedDate": "1996-07-16T00:00:00Z",
      "OrderID": 10248,
      "Subtotal": 440
    },
    {
      "ShippedDate": "1996-07-10T00:00:00Z",
      "OrderID": 10249,
      "Subtotal": 1863.4
    },
    {
      "ShippedDate": "1997-01-16T00:00:00Z",
      "OrderID": 10380,
      "Subtotal": 1313.82
    },
    {
      "ShippedDate": "1997-01-01T00:00:00Z",
      "OrderID": 10392,
      "Subtotal": 1440
    },
    {
      "ShippedDate": "1998-01-02T00:00:00Z",
      "OrderID": 10771,
      "Subtotal": 344
    },
    {
      "ShippedDate": "1998-01-21T00:00:00Z",
      "OrderID": 10777,
      "Subtotal": 224
    }
  ]
}

Before we move on, let's briefly examine the jq used to produce this.

Examining the preparation phase

Here's that jq program again:

.value |= (
  group_by(.ShippedDate[:4])
  | map(.[:2])
  | flatten
)

First, there's this construct: .value |= (...). The |= is the update assignment operator and whatever the filter on the right hand side produces becomes the new value for the value property. The parentheses in this particular instance ensure that the output from the entire expression within is used. It's needed here because the expression contains pipes (|) which would otherwise short circuit.

With group_by(.ShippedDate[:4]) the group_by function collects objects by the ShippedDate property - but not the entire property value, just the first four characters, which represent the year, for example "1996" in "1996-07-16T00:00:00Z" (there's the strptime function too, which will parse a date into its component parts, but knowledge of the data and laziness won through here). Note that the [:4] construct (which is short for [0:4]) is the array/string slice filter operating on a string value in this case, which will return a substring.

The use of group_by produces an array of arrays, with one subarray for each year.

This is then piped into map(.[:2]). The [:2] (again, short for [0:2]) is the array/string slice filter again, but this time, it's operating on an array rather than a string. I'm using map to run the filter .[:2] against each element of the input array, which contains a subarray for each of the years. And the .[:2] filter, in an array context, will return the first two elements.

The result of this is still an array of arrays, but now each subarray has only two objects each. Now they can all be merged, i.e. taken out of their respective subarrays and collected together. This is done with the flatten filter.

▶ You can see how this jq program reduces the input data to the subset in this jqplay snippet: Initial input data reduction.

Producing the totals by year

So, (now based on the subset of data above), what I actually want is a summary of total order value for each year, something like this:

[
  [
    "1996",
    2303
  ],
  [
    "1997",
    2753
  ],
  [
    "1998",
    568
  ]
]

Grouping by year

It makes sense that the approach required will also employ the group_by function as we want total order values for each year, and we can determine the years in the same way as we've seen in the preparation stage, i.e. with the array/string slice filter ([:4]).

Let's start to explore:

jq \
  '.value
  | group_by(.ShippedDate[:4])' \
  subset.json

[
  [
    {
      "ShippedDate": "1996-07-16T00:00:00Z",
      "OrderID": 10248,
      "Subtotal": 440
    },
    {
      "ShippedDate": "1996-07-10T00:00:00Z",
      "OrderID": 10249,
      "Subtotal": 1863.4
    }
  ],
  [
    {
      "ShippedDate": "1997-01-16T00:00:00Z",
      "OrderID": 10380,
      "Subtotal": 1313.82
    },
    {
      "ShippedDate": "1997-01-01T00:00:00Z",
      "OrderID": 10392,
      "Subtotal": 1440
    }
  ],
  [
    {
      "ShippedDate": "1998-01-02T00:00:00Z",
      "OrderID": 10771,
      "Subtotal": 344
    },
    {
      "ShippedDate": "1998-01-21T00:00:00Z",
      "OrderID": 10777,
      "Subtotal": 224
    }
  ]
]

This is a nice illustration of the array of arrays structure we talked about earlier. There's a subarray for the objects for each year.

Totalling the values

Now the data is in the right "shape", it's time to focus on summing the Subtotal values within each subarray.

jq \
  '.value
  | group_by(.ShippedDate[:4])
  | map(map(.Subtotal))' \
  subset.json

This produces the following:

Note the nested calls to map, i.e. map(map(...). This is because the outer map processes the outer array, and passes each element (each of which are also arrays - the by-year subarrays) to the function specified, which is also map, which processes (in turn) each inner array, which contain the objects. The simple filter .Subtotal will just return the value of the Subtotal property, so we see a list of lists of subtotals, remembering that we've got two for each of the three years.

So we have an array of arrays of subtotal values. As a next step let's add these grouped subtotal values together, using add (which is a filter that operates on arrays). While we're at it, we'll use floor to hard round down to the nearest whole number:

jq \
  '.value
  | group_by(.ShippedDate[:4])
  | map(map(.Subtotal) | add | floor)' \
  subset.json

This produces the following:

Adding the year

Almost there - but it's not that useful without the year. To get the structure we want, which is an array of arrays each containing the year and total, we'll need to add the year, and enclose that, with the total, in an array:

jq \
  '.value
  | group_by(.ShippedDate[:4])
  | map([
      first.ShippedDate[:4],
      (map(.Subtotal) | add | floor)
    ])' \
  subset.json

We've expanded what's passed to the outer map function to the following:

[
  first.ShippedDate[:4],
  (map(.Subtotal) | add | floor)
]

What's happening here is that we're using array construction ([...]) to produce an array, with two elements, starting with the value of first.ShippedDate[:4].

Each of the expressions in this array construction receives an array (one of the year-specific subarrays), but for our first element we only want the value from one of the elements in the incoming array, so we can use the first function to do that. This is a lovely bit of syntactic sugar through a definition, along with definitions for its siblings last and nth, in builtin.jq:

def first: .[0];
def last: .[-1];
def nth($n): .[$n];

I'm tempted to want to define another function rest thus:

def rest: .[1:length];

See The beauty of recursion and list machinery for why - in particular, a slight obsession about x:xs, first and rest, head and tail, and so on.

But I digress.

The second element in the constructed array, i.e. (map(.Subtotal) | add | floor), is the same as before, except that it's now surrounded in parentheses to ensure the whole thing is evaluated in one go (specifically, so that it's only the map(.Subtotal) that gets passed through those pipes to add and floor, and not anything else).

So this is where we've ended up:

jq \
  '.value
  | group_by(.ShippedDate[:4])
  | map([
      first.ShippedDate[:4],
      (map(.Subtotal) | add | floor)
    ])' \
  subset.json

Running this produces the desired result:

[
  [
    "1996",
    2303
  ],
  [
    "1997",
    2753
  ],
  [
    "1998",
    568
  ]
]

Very nice!

▶ You can see how this result is achieved in this jqplay snippet: Producing the 'array' style final result.

An alternative result structure

As alternative way of representing the totals by year, and knowing that the year values are stable enough to be property names in objects, we could instead go for something like this:

{
  "1996": 2303,
  "1997": 2753,
  "1998": 568
}

To get this, it's not much of a departure from what we previously ended up with. First, instead of using array construction ([...]) we can use object construction. As expected, we need to specify the property name and value, in this form:

property: value

Let's make that change, noting that because the expression for the property (the key) is not "identifier-like", i.e. it's an expression to be evaluated, we need to enclose it in parentheses like this: (first.ShippedDate[:4]). Here we go:

jq \
  '.value
  | group_by(.ShippedDate[:4])
  | map({
      (first.ShippedDate[:4]): map(.Subtotal)|add|floor
    })' \
  subset.json

This produces almost but not quite what we want:

[
  {
    "1996": 2303
  },
  {
    "1997": 2753
  },
  {
    "1998": 568
  }
]

But that's OK, because the more you get the feel for how jq behaves, the more you'll likely guess that there'll be a simple way to merge these objects. And there is - the versatile add filter. We've used add already to sum up an array of numeric values (the subtotals) but "adding" an array of objects together merges them.

So let's pipe the output of the map({...}) into add:

jq \
  '.value
  | group_by(.ShippedDate[:4])
  | map({
      (first.ShippedDate[:4]): map(.Subtotal)|add|floor
    }) | add' \
  subset.json

This merges the three year:total pairs from the three objects into a single object, and thus gives us what we want:

{
  "1996": 2303,
  "1997": 2753,
  "1998": 568
}

Lovely!

▶ You can see how this alternative result is achieved in this jqplay snippet: Producing the 'array' style final result.

Summing up

This turned out (again) to be a slightly longer post than expected, but in writing it, and in manipulating the source data, I've learned more about jq. So that's a result. I hope this helps you too.