JSON object values into CSV with jq

| 8 min read

I wanted to grab a CSV version of a JSON representation of an entityset in the Northwind OData service. Here's how I converted the JSON structure into CSV with jq, and along the way, I talk about arrays, objects, iterators, object indices, variables, identity, array construction and format strings.

In our current Back to basics: OData series on the Developer Advocates' Hands-on SAP Dev show I'm using various aspects of the classic Northwind OData service:

https://services.odata.org/V4/Northwind/Northwind.svc/

There's an entityset that I wanted to grab the data from, but have in CSV form. It's the Customer_and_Suppliers_by_Cities entityset that looks like this:

{
"@odata.context": "https://services.odata.org/V4/Northwind/Northwind.svc/$metadata#Customer_and_Suppliers_by_Cities",
"value": [
{
"City": "Berlin",
"CompanyName": "Alfreds Futterkiste",
"ContactName": "Maria Anders",
"Relationship": "Customers"
},
{
"City": "México D.F.",
"CompanyName": "Ana Trujillo Emparedados y helados",
"ContactName": "Ana Trujillo",
"Relationship": "Customers"
},
{
"City": "México D.F.",
"CompanyName": "Antonio Moreno Taquería",
"ContactName": "Antonio Moreno",
"Relationship": "Customers"
}
]
}

I've reduced the actual representation down to just three entries to save space here, and will retrieve these three entries only (with OData's $top system query option) to keep the display of data in this blog post under control.

I wanted to turn this JSON into something like this:

"Berlin","Alfreds Futterkiste","Maria Anders","Customers"
"México D.F.","Ana Trujillo Emparedados y helados","Ana Trujillo","Customers"
"México D.F.","Antonio Moreno Taquería","Antonio Moreno","Customers"

As I'm trying to learn more about jq I thought I'd use that.

Before doing anything else, I grab the representation into a local file called entities.json like this (restricting the entities to the first three):

curl \
--url 'https://services.odata.org/v4/northwind/northwind.svc/Customer_and_Suppliers_by_Cities?$top=3' \
> entities.json

Iterating through the entities

OK, so it's the values in the objects that I want, in the value property. So I start out with the simple object identifier-index like this:

jq '.value' entities.json

This gives me:

[
{
"City": "Berlin",
"CompanyName": "Alfreds Futterkiste",
"ContactName": "Maria Anders",
"Relationship": "Customers"
},
{
"City": "México D.F.",
"CompanyName": "Ana Trujillo Emparedados y helados",
"ContactName": "Ana Trujillo",
"Relationship": "Customers"
},
{
"City": "México D.F.",
"CompanyName": "Antonio Moreno Taquería",
"ContactName": "Antonio Moreno",
"Relationship": "Customers"
}
]

The value of the value property is indeed an array of objects. So far so good. But I want to do something with each of those objects, so next I add the array value iterator ([]) thus:

jq '.value[]' entities.json

This results in something that looks almost but not quite the same:

{
"City": "Berlin",
"CompanyName": "Alfreds Futterkiste",
"ContactName": "Maria Anders",
"Relationship": "Customers"
}
{
"City": "México D.F.",
"CompanyName": "Ana Trujillo Emparedados y helados",
"ContactName": "Ana Trujillo",
"Relationship": "Customers"
},
{
"City": "México D.F.",
"CompanyName": "Antonio Moreno Taquería",
"ContactName": "Antonio Moreno",
"Relationship": "Customers"
}

What's happening here is that the iterator causes jq to emit a JSON value for each item in the array. This is an important concept and also relates to the fact that jq can process -- as well as emit -- multiple JSON values. I discuss this in the post Some thoughts on jq and statelessness which you may be interested to read.

In other words, while the first invocation (.value) emitted a single JSON value (an array), the second (.value[]) caused three JSON values (three objects) to be emitted, effectively one at a time.

Understanding the iterator a bit more

To understand what the iterator does, we can run a simple experiment. We'll revisit each of the two jq invocations we've done so far, adding a second filter into the mix via the pipe. We'll use the simple length filter which, given an array will return the number of elements, and given an object will return the number of key-value pairs (and, for that matter, given a string, will return the length of that string).

Remember that the value of the value property is an array. So this:

jq '.value | length' entities.json

returns the following:

3

Doing the same with the second invocation, in other words this:

jq '.value[] | length' entities.json

has a slightly different output:

4
4
4

In this case, what's happening is that the array iterator is causing each element of the array to be passed, one at a time, through the filter(s) that follow (i.e. through length in this case). And when passed an object, length returns the number of key-value pairs. There are four key-value pairs in each of the objects, so we get 4, but we get that three times, one for each object (and remember, each of these instances of 4 are valid JSON values).

Jumping ahead temporarily to CSV output

So that we better understand where we're heading, I want to introduce the @csv format string, which is described as follows:

The input must be an array, and it is rendered as CSV with double quotes for strings, and quotes escaped by repetition.

So this:

echo '[1,2,"buckle my shoe"]' | jq --raw-output '@csv'

(note the use of the --raw-output (-r) option so that jq won't try to emit JSON values but instead output the values directly) results in CSV like this:

1,2,"buckle my shoe"

So our aim is to produce a list of arrays, one for each JSON object in the input (one for each entity, effectively). Then each of these arrays can then be fed through the @csv format string, to produce CSV records.

Producing the arrays with hardcoded key names

Each CSV record needs four values, the values for each key in the object(s):

{
"City": "México D.F.",
"CompanyName": "Antonio Moreno Taquería",
"ContactName": "Antonio Moreno",
"Relationship": "Customers"
}

In other words, values for City, CompanyName, ContactName and Relationship.

The simplest way to do this would be to just use object identifier-indices directly, something like this:

jq --raw-output '
.value[]
| [.City, .CompanyName, .ContactName, .Relationship]
| @csv
'
entities.json

This gives us what we want:

"Berlin","Alfreds Futterkiste","Maria Anders","Customers"
"México D.F.","Ana Trujillo Emparedados y helados","Ana Trujillo","Customers"
"México D.F.","Antonio Moreno Taquería","Antonio Moreno","Customers"

But of course that's somewhat unsatisfactory. We'd have to examine the input data and then adjust the object identifier-indices each time we had different input data.

Producing the arrays dynamically

According to the great Larry Wall, the three great virtues of a programmer are laziness, impatience and hubris. And we can get a little nearer to laziness and also somewhat to impatience here by striving to make our solution determine the keys automatically.

There's a keys function in jq which will return the keys of an object. That might get us part of the way. Let's try it out:

jq '
.value[]
| keys
'
entities.json

This produces what we expect, or at least hope for:

[
"City",
"CompanyName",
"ContactName",
"Relationship"
]
[
"City",
"CompanyName",
"ContactName",
"Relationship"
]
[
"City",
"CompanyName",
"ContactName",
"Relationship"
]

In fact, we have the structure that we want and are now only really one "indirection" away from our goal. Let's put this into the CSV output context to see, by piping the result into @csv:

jq --raw-output '
.value[]
| keys
| @csv
'
entities.json

This gives us:

"City","CompanyName","ContactName","Relationship"
"City","CompanyName","ContactName","Relationship"
"City","CompanyName","ContactName","Relationship"

We can make use of these key values like City with the object identifier-index construct. Well, almost. We need the more generic form for which the object identifier-index is just a shorthand version for when identifiers are simple and "string-like".

In other words, the generic object index can be used when the identifier is not "string-like" ... such as when it's a variable.

Let's step back and focus for a moment on just one of the objects - the first (0th) one - using the array index construction ([n]):

jq '
.value[0]
| keys
'
entities.json

This gives us:

[
"City",
"CompanyName",
"ContactName",
"Relationship"
]

Let's assign the keys to a variable $k, and just emit the value of that variable:

jq '
.value[0]
| keys as $k | $k
'
entities.json

Perhaps unsurprisingly, this gives us the same result:

[
"City",
"CompanyName",
"ContactName",
"Relationship"
]

But now we have an array of key names to work with!

Note that keys produces an array, so we can use the array value iterator ([]) to cause each of the keys to be emitted separately (looped through, effectively) and passed to subsequent filters.

Adding the iterator [] to the keys function like this:

jq '
.value[0]
| keys[] as $k | $k
'
entities.json

produces this:

"City"
"CompanyName"
"ContactName"
"Relationship"

This is a similar effect to what we've seen earlier; it causes jq to iterate over the output of keys one item at a time, so the $k after the pipe in this sample is called four times, one for each key, and each time producing a JSON value (the key name as a string) as output.

We may be focusing deeper and deeper on the keys here, but don't forget we always have the identity filter (.) to give us access to the input, to whatever came through the pipe to where we are now, as it were.

Variable assignment with 'as' as a foreach loop

Let's understand this, by way of something perhaps unexpected. Replacing the $k at the end of the pipeline with simply ., like this:

jq '
.value[0]
| keys[] as $k | .
'
entities.json

actually gives us this:

{
"City": "Berlin",
"CompanyName": "Alfreds Futterkiste",
"ContactName": "Maria Anders",
"Relationship": "Customers"
}
{
"City": "Berlin",
"CompanyName": "Alfreds Futterkiste",
"ContactName": "Maria Anders",
"Relationship": "Customers"
}
{
"City": "Berlin",
"CompanyName": "Alfreds Futterkiste",
"ContactName": "Maria Anders",
"Relationship": "Customers"
}
{
"City": "Berlin",
"CompanyName": "Alfreds Futterkiste",
"ContactName": "Maria Anders",
"Relationship": "Customers"
}

Odd, the same object, four times. But when we stare at that for a second, we realise that it's exactly what we asked for. With keys[] we're iterating through the keys of the object (City, CompanyName, ContactName and Relationship). Four of them. So whatever is beyond the pipe after that, which is simply the identity filter (.), is being called four times. And the identity filter (which simply outputs whatever it receives as input) receives as input the original object.

What we might expect . to output is one key, each time. That would be the case if we didn't assign keys[] to the variable $k with keys[] as $k. Let's remove the as $k bit to see:

jq '
.value[0]
| keys[] | .
'
entities.json

This produces:

"City"
"CompanyName"
"ContactName"
"Relationship"

So in this case, .'s input are (each time) the keys of the object. The important thing to realise here is that the variable assignment as $k means that the input that came into that expression (the object) passes straight through unconsumed to the next filter. This part of the manual for the section on variables helps to explain:

The expression exp as $x | ... means: for each value of expression exp, run the rest of the pipeline with the entire original input, and with $x set to that value. Thus as functions as something of a foreach loop.

With this in mind, we should now be able to understand why this:

jq '
.value[0]
| keys[] as $k | .
'
entities.json

produced four identical copies of the object.

While that's an odd thing to produce, it helps a lot here. Having the input at this stage in the pipeline (.) set to the object, combined with the "foreach loop" (as the manual described it) iterating over the values in $k, is very useful!

Let's look at that in a basic form; how about emitting an array with two elements, the first being the value of $k and the second being the input, each time:

jq '
.value[0]
| keys[] as $k | [$k, .]
'
entities.json

This gives us a combination of values like this:

[
"City",
{
"City": "Berlin",
"CompanyName": "Alfreds Futterkiste",
"ContactName": "Maria Anders",
"Relationship": "Customers"
}
]
[
"CompanyName",
{
"City": "Berlin",
"CompanyName": "Alfreds Futterkiste",
"ContactName": "Maria Anders",
"Relationship": "Customers"
}
]
[
"ContactName",
{
"City": "Berlin",
"CompanyName": "Alfreds Futterkiste",
"ContactName": "Maria Anders",
"Relationship": "Customers"
}
]
[
"Relationship",
{
"City": "Berlin",
"CompanyName": "Alfreds Futterkiste",
"ContactName": "Maria Anders",
"Relationship": "Customers"
}
]

And of course, look what we can do with that combination of data in . and $k, using the generic object index like this .[$k] to look up the value of each of the keys:

jq '
.value[0]
| keys[] as $k | .[$k]
'
entities.json

This results in:

"Berlin"
"Alfreds Futterkiste"
"Maria Anders"
"Customers"

Great! And if we wrap this entire expression in an array construction ([...]), we then have the right shape (an array) to give to the @csv format string (and as we're emitting CSV again we'll use the --raw-output option again here):

jq --raw-output '
.value[0]
| [ keys[] as $k | .[$k] ]
| @csv
'
entities.json

This produces a perfect single CSV record:

"Berlin","Alfreds Futterkiste","Maria Anders","Customers"

Now all we need to do is remove the array index (the 0 from .value[0]) to go back to an iteration over all the items in the array:

jq --raw-output '
.value[]
| [ keys[] as $k | .[$k] ]
| @csv
'
entities.json

and we get exactly what we're looking for:

"Berlin","Alfreds Futterkiste","Maria Anders","Customers"
"México D.F.","Ana Trujillo Emparedados y helados","Ana Trujillo","Customers"
"México D.F.","Antonio Moreno Taquería","Antonio Moreno","Customers"

Storing as a function

I'm likely to want to use this approach again some time, so I'll store the core construct here as a function in my local ~/.jq file (see the modules section of the manual for more detail):

def onlyvalues: [ keys[] as $k | .[$k] ];

Now I can use that function wherever I want; here's a great place, because it also simplifies the entire invocation:

jq --raw-output '
.value[]
| onlyvalues
| @csv
'
entities.json

And yes, this produces the same output:

"Berlin","Alfreds Futterkiste","Maria Anders","Customers"
"México D.F.","Ana Trujillo Emparedados y helados","Ana Trujillo","Customers"
"México D.F.","Antonio Moreno Taquería","Antonio Moreno","Customers"

Wrapping up

This turned out to be a longer post than I'd intended to write. I found that I wanted to make sure I explained each part of the solution, and why it was how it was. Of course, this has the benefit of causing me to think a little harder about what jq is doing, which in turn helps me learn a little bit more about it.