Exploring GitHub repo name distribution with jq

I wanted a brief rundown of the name prefixes for repositories in the SAP-samples organisation on GitHub. With the gh CLI it was easy to grab the names, and it gave me the opportunity to practise a bit of jq. Here's what I did.

The SAP-samples organisation on GitHub is where we keep lots of sample code, configuration and more for various SAP services and products. We also store our workshop and CodeJam material in repositories there.

There's a sort of loose naming convention, where the first part of the name gives a general indication of topic. For example, the first part of the cloud-messaging-handsonsapdev repository, "cloud", gives an indication that the topic is the cloud in general, and the first part of the btp-setup-automator repository, "btp", indicates that the main topic is the SAP Business Technology Platform.

I wanted to find out the names of all the repositories in the SAP-samples organisation, and understand how they were distributed across the different topics. Something like this, showing here that the most common topic is "cloud":

1    abap
1    artifact
2    btp
3    cloud
2    sap
1    ui5

Using the gh CLI

Requesting the names of public repositories with the GitHub CLI gh is easy. Here's an example:

gh repo list SAP-samples --limit 10 --public

This produces output something like this (abbreviated for display purposes):

SAP-samples/cloud-sdk-js                   This re...  public  7h
SAP-samples/cloud-cap-samples-java         A sampl...  public  15h
SAP-samples/btp-setup-automator            Automat...  public  15h
SAP-samples/btp-ai-sustainability-bootcamp This gi...  public  15h
SAP-samples/cloud-cap-samples              This pr...  public  17h
SAP-samples/ui5-exercises-codejam          Materia...  public  19h
SAP-samples/cap-sflight                    Using S...  public  1d
SAP-samples/cloud-cf-feature-flags-sample  A sampl...  public  1d
SAP-samples/cloud-espm-cloud-native        Enterpr...  public  2d
SAP-samples/iot-edge-samples               Showcas...  public  2d

This is a slightly contrived example, because I wanted to illustrate the distribution over a small number of repositories (10 in this case). To this end, I cut down the actual output to a list of repositories that would illustrate the point. If you want to know how I turned that list into what gh would output, in particular the JSON structure it would produce (see the next section in this post), you may want to read the "prequel" to this post: Converting strings to objects with jq.

With regular shell tools I could parse out the names, split off the topic prefix, and go from there. But I'm trying to improve my skills in jq, and the gh CLI gives me an opportunity to do that, with the combination of two options.
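For comparison, here's a minimal sketch of that regular shell tools route. It assumes the names have already been reduced to one per line, and uses a hard-coded sample of ten names rather than a live gh call:

```shell
# One repository name per line (hard-coded sample, not a live gh call)
names='artifact-of-the-month
cloud-sdk-js
sap-tech-bytes
cloud-cap-samples-java
btp-setup-automator
btp-ai-sustainability-bootcamp
sap-iot-samples
abap-platform-fundamentals-01
cloud-cap-samples
ui5-exercises-codejam'

# Take the part before the first dash, then count occurrences of each prefix;
# awk normalises the padded output of uniq -c into count<TAB>prefix
counts=$(printf '%s\n' "$names" \
  | cut -d- -f1 \
  | sort \
  | uniq -c \
  | awk '{ printf "%d\t%s\n", $1, $2 }')

printf '%s\n' "$counts"
```

This works, but it relies on the names arriving as plain text lines, whereas the jq route operates directly on the structured JSON that gh can produce.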

Requesting JSON output with --json

With --json I can specify fields I want to have returned to me. At first I was at a loss as to which fields were available to specify, but leaving off the value for --json gives a list.

In other words, invoking this:

gh repo list --json

results in a list like this (cut short for brevity):

Specify one or more comma-separated fields for `--json`:
  assignableUsers
  codeOfConduct
  contactLinks
  createdAt
  defaultBranchRef
  deleteBranchOnMerge
  description
  diskUsage
  forkCount
  ...

A field called name is available, and specifying it as the value for --json like this:

gh repo list SAP-samples --limit 10 --public --json name

gives this JSON output:

[
  {
    "name": "artifact-of-the-month"
  },
  {
    "name": "cloud-sdk-js"
  },
  {
    "name": "sap-tech-bytes"
  },
  {
    "name": "cloud-cap-samples-java"
  },
  {
    "name": "btp-setup-automator"
  },
  {
    "name": "btp-ai-sustainability-bootcamp"
  },
  {
    "name": "sap-iot-samples"
  },
  {
    "name": "abap-platform-fundamentals-01"
  },
  {
    "name": "cloud-cap-samples"
  },
  {
    "name": "ui5-exercises-codejam"
  }
]

Filtering JSON output with --jq

With the --jq option, a jq filter can be supplied that will be applied to the JSON output produced. Let's start with a very simple example.

As we can see, the structure returned is an array of objects, each containing the property or properties requested with the --json option. So to obtain the value of each of the name properties from the JSON output that we saw earlier, we can use .[] | .name, or, more succinctly, .[].name:

gh repo list SAP-samples --limit 10 --public \
  --json name \
  --jq '.[].name'

This returns the following:

artifact-of-the-month
cloud-sdk-js
sap-tech-bytes
cloud-cap-samples-java
btp-setup-automator
btp-ai-sustainability-bootcamp
sap-iot-samples
abap-platform-fundamentals-01
cloud-cap-samples
ui5-exercises-codejam

The --jq option in gh is applied with --raw-output

We can make one side observation here. Normally, we'd expect to see JSON values output from jq; in other words, double-quoted strings like this:

"artifact-of-the-month"
"cloud-sdk-js"
"sap-tech-bytes"
"cloud-cap-samples-java"
"btp-setup-automator"
"btp-ai-sustainability-bootcamp"
"sap-iot-samples"
"abap-platform-fundamentals-01"
"cloud-cap-samples"
"ui5-exercises-codejam"

So it seems like when a jq filter is applied via the --jq option to gh, it's applied with the --raw-output (-r) option implicitly. I think that makes sense, especially if the output is to be used with other Unix command line tools later on in a pipeline.
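To see the difference outside of gh, here's a small sketch using standalone jq on an inline two-element sample (two of the names above, hard-coded), first without and then with --raw-output:

```shell
sample='[{"name":"cloud-sdk-js"},{"name":"sap-tech-bytes"}]'

# Default: each result is a JSON value, so strings are double-quoted
quoted=$(printf '%s' "$sample" | jq '.[].name')

# With --raw-output (-r): strings are written as-is, without the quotes
raw=$(printf '%s' "$sample" | jq -r '.[].name')

printf '%s\n' "$quoted" "$raw"
```

The first two lines printed are the quoted JSON strings, and the last two are the same names in raw form.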

Using the power of jq

Now we have the context in which we can invoke a jq filter on the JSON output from gh, let's dig in a little more. Bear in mind that this may not be the most efficient way of doing things, but I thought it might still be useful, and it certainly helps me to try to express something in jq in public, as it were.

To be kind to the API, I'll grab the JSON output from the gh invocation and use that while I build up the filter:

gh repo list SAP-samples --limit 10 --public \
  --json name \
  > names.json

As a reminder, the content of names.json will look like this:

[
  {
    "name": "artifact-of-the-month"
  },
  {
    "name": "cloud-sdk-js"
  },
  {
    "name": "sap-tech-bytes"
  },
  {
    "name": "cloud-cap-samples-java"
  },
  {
    "name": "btp-setup-automator"
  },
  {
    "name": "btp-ai-sustainability-bootcamp"
  },
  {
    "name": "sap-iot-samples"
  },
  {
    "name": "abap-platform-fundamentals-01"
  },
  {
    "name": "cloud-cap-samples"
  },
  {
    "name": "ui5-exercises-codejam"
  }
]

Get the first part of the name

The convention is to use dashes to separate the different parts of the repository names, so it occurs to me that I can use split, which produces an array, and then grab the first element.

Let's have a first go, based on the name property access we saw earlier:

jq '.[].name | split("-") | .[0]' names.json

This produces the following list:

"artifact"
"cloud"
"sap"
"cloud"
"btp"
"btp"
"sap"
"abap"
"cloud"
"ui5"

Stay within the context of an array

In jq, there are plenty of functions that operate on arrays, such as sort, min, max and reverse. There's also group_by, which is what will be useful for our requirements here. The manual's description is as follows:

group_by(.foo) takes as input an array, groups the elements having the same .foo field into separate arrays, and produces all of these arrays as elements of a larger array, sorted by the value of the .foo field.

We're starting from an array (note the outer enclosing [...] in the data we're working on) so it makes sense to try to keep that array context. So rather than use the array / object iterator, which "explodes" an array into separate results, we can use map here:

jq 'map(.name | split("-") | .[0])' names.json

This produces the same values, but within an array:

[
  "artifact",
  "cloud",
  "sap",
  "cloud",
  "btp",
  "btp",
  "sap",
  "abap",
  "cloud",
  "ui5"
]
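As a small illustration of how map relates to the iterator, the sketch below runs both forms on a hard-coded two-element sample. According to the jq manual, map(f) is equivalent to [.[] | f], so both should give the same array (the -c option just makes the output compact):

```shell
sample='[{"name":"cloud-sdk-js"},{"name":"btp-setup-automator"}]'

# map applies the filter to each element and collects the results in an array
with_map=$(printf '%s' "$sample" | jq -c 'map(.name | split("-") | .[0])')

# The same thing spelled out: iterate, apply the filter, wrap results in [...]
with_iterator=$(printf '%s' "$sample" | jq -c '[.[] | .name | split("-") | .[0]]')

printf '%s\n' "$with_map" "$with_iterator"
```

Both lines of output are the same compact array of the two prefixes.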

Using group_by

Now we can use group_by on this (switching here to a multi-line version for better readability):

jq \
  'map(.name | split("-") | .[0])
  | group_by(.)' \
  names.json

This seems to "do exactly what it says on the tin":

[
  [
    "abap"
  ],
  [
    "artifact"
  ],
  [
    "btp",
    "btp"
  ],
  [
    "cloud",
    "cloud",
    "cloud"
  ],
  [
    "sap",
    "sap"
  ],
  [
    "ui5"
  ]
]

Note that the value passed to group_by is ., i.e. the path_expression is the entire string value, for example "artifact", "cloud", "sap" etc.

Great. We can already start to see the distribution of topics now, but let's go a bit further.
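Before that, a brief aside on the path expression passed to group_by. It doesn't have to be the identity (.): as a sketch of an alternative, and not the approach followed in the rest of this post, the objects themselves can be grouped by a computed prefix. The three-element sample here is hard-coded for illustration:

```shell
sample='[{"name":"cloud-sdk-js"},{"name":"btp-setup-automator"},{"name":"cloud-cap-samples"}]'

# Group the original objects by name prefix, instead of reducing them to
# prefix strings first; the groups are sorted by the computed prefix
grouped=$(printf '%s' "$sample" | jq -c 'group_by(.name | split("-") | .[0])')

printf '%s\n' "$grouped"
```

Each inner array then holds whole objects rather than bare prefix strings, which could be handy if other fields, such as description, were requested with --json and needed later.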

Creating a list of topic counts

I think ideally I'd like a flat, tab-separated list of topics with their counts, as that lends itself to further processing on the command line should I want it. In other words, I want this sort of line for each topic:

[count][tab][topic-name]

Producing the raw data

First, let's produce the raw data for this list. While we wanted to avoid exploding the array earlier, now would be the time to use the array / object iterator:

jq \
  'map(.name | split("-") | .[0])
  | group_by(.)
  | .[]' \
  names.json

This produces a JSON value for each of the array items. Here, each item, and thus each value produced, is an array containing one or more instances of a topic name:

[
  "abap"
]
[
  "artifact"
]
[
  "btp",
  "btp"
]
[
  "cloud",
  "cloud",
  "cloud"
]
[
  "sap",
  "sap"
]
[
  "ui5"
]

In effect, this removes the outermost [...] array that contains all these inner arrays.

Now it's just a matter of defining what we want to see, using the array constructor. In this case we want two elements, the length of the array and the first value of the array, i.e. [length, .[0]]:

jq \
  'map(.name | split("-") | .[0])
  | group_by(.)
  | .[]
  | [length, .[0]]' \
  names.json

Remember that this construct .[] | ... will iterate through each array element and pass them one at a time to the filter that follows the pipe. And this produces the following:

[
  1,
  "abap"
]
[
  1,
  "artifact"
]
[
  2,
  "btp"
]
[
  3,
  "cloud"
]
[
  2,
  "sap"
]
[
  1,
  "ui5"
]

Output the results as a tab-separated list

We have our list of topic counts, so now let's add the final touch to have a tab-separated list. There's nothing further we need to do to the data; it's as we want it, so we just need some formatting. In the Format strings and escaping section of the jq manual, we see that there's @tsv, which is described thus:

The input must be an array, and it is rendered as TSV (tab-separated values). Each input array will be printed as a single line.

This is exactly what we're looking for. Note that here, the "input array" referred to is each of the individual arrays in the output above, i.e. this is the first array:

[
  1,
  "abap"
]

Let's try it:

jq \
  'map(.name | split("-") | .[0])
  | group_by(.)
  | .[]
  | [length, .[0]]
  | @tsv' \
  names.json

"1\tabap"
"1\tartifact"
"2\tbtp"
"3\tcloud"
"2\tsap"
"1\tui5"

Close! Remember that by default, an invocation of jq on the command line will output JSON values, and these strings are JSON values. But here we want the raw form, via the --raw-output (-r) option, to benefit from (and see) the tab characters (\t) that @tsv has put in for us:

jq -r \
  'map(.name | split("-") | .[0])
  | group_by(.)
  | .[]
  | [length, .[0]]
  | @tsv' \
  names.json

This gives us what we're looking for:

1    abap
1    artifact
2    btp
3    cloud
2    sap
1    ui5

And in fact, remembering that when a jq filter is invoked from gh via the --jq option the raw output is used by default, we can now put everything together and benefit from that in the final gh invocation, which looks like this:

gh repo list SAP-samples --limit 10 --public \
  --json name \
  --jq \
  'map(.name | split("-") | .[0])
   | group_by(.)
   | .[]
   | [length, .[0]]
   | @tsv'

This gives us the same result, i.e.:

1    abap
1    artifact
2    btp
3    cloud
2    sap
1    ui5

So I can see that the most common topic here is "cloud".
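Because the result is plain tab-separated text, it can be handed to other command line tools. As a sketch of that kind of further processing, the following orders the topics by count, highest first. The counts are hard-coded from the output above so that the example runs without calling gh:

```shell
# The tab-separated result from above, hard-coded so the sketch runs offline
tsv=$(printf '1\tabap\n1\tartifact\n2\tbtp\n3\tcloud\n2\tsap\n1\tui5')

# Sort numerically on the first (count) field, highest first; the secondary
# key on the second field keeps topics with equal counts in alphabetical order
sorted=$(printf '%s\n' "$tsv" | sort -t "$(printf '\t')" -k1,1nr -k2,2)

printf '%s\n' "$sorted"
```

With this ordering, "cloud" moves to the top of the list, followed by "btp" and "sap" with two repositories each.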

Wrapping up

I'm happy with this approach, with how I'm starting to get a better feel for how data flows through a jq filter, and with the fact that I can use such filters with the GitHub CLI.