DJ Adams

Extracting blog post dates from URLs with jq

I had a JSON array of objects from a list of GitHub repo issues. Each object contained a blog post URL and a title. The URL had the post date embedded in the path, and I wanted to sort them all based on the post date. Here's how I did it.

I have a working list of blog posts, as issues in a GitHub repo (as a sort of temporary data store). Each issue has the blog post title as the issue title, and just the blog post URL in the issue body, like this:

[Screenshot: example blog post issue]

The base data

I had retrieved the issue data as JSON like this:

gh issue list \
  --limit 500 \
  --label dj-adams-sap \
  --json number,title,body \
  > dj-adams-sap.json 

Here's what the first and last couple of items in dj-adams-sap.json look like (extracted with jq '.[:2] + .[-2:]' dj-adams-sap.json):

[
  {
    "body": "https://blogs.sap.com/2018/03/26/monday-morning-thoughts-cloud-native/",
    "number": 224,
    "title": "Monday morning thoughts- cloud native"
  },
  {
    "body": "https://blogs.sap.com/2018/03/31/scripting-the-workflow-api-with-bash-and-curl/",
    "number": 223,
    "title": "Scripting the Workflow API with bash and curl"
  },
  {
    "body": "https://blogs.sap.com/2022/08/04/introducing-sap-codejam-btp-a-new-group-and-a-first-event/",
    "number": 83,
    "title": "Introducing “SAP CodeJam BTP” - a new group, and a first event"
  },
  {
    "body": "https://blogs.sap.com/2022/10/06/devtoberfest-2022-week-2/",
    "number": 82,
    "title": "Devtoberfest 2022 Week 2"
  }
]

Extracting the dates

The date of each blog post is embedded in the first part of the path in its URL. So I decided to map over each object and add a new postdate property, a YYYY-MM-DD formatted string built from that data.

First, I decided to define a function to extract the date:

def date: 
  sub(
    "^https.+?com/(?<yyyy>[0-9]{4})/(?<mm>[0-9]{2})/(?<dd>[0-9]{2})/.+$";
    "\(.yyyy)-\(.mm)-\(.dd)"
  );

This uses the sub function to perform a regexp-based substitution. Because the pattern is anchored at both ends (^ and $), it replaces the entire input string (the URL) with a new string built from the capture groups.
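A quick way to try the function in isolation is to feed it a couple of strings directly. This is a minimal sketch, assuming jq is on the PATH; the second URL (https://example.com/about/) is a made-up value, included only to show that sub returns its input unchanged when the pattern doesn't match.

```shell
# Apply the date function to two strings: a real post URL and a
# hypothetical non-post URL. -n means "no input", -r prints raw strings.
results=$(jq -n -r '
  def date:
    sub(
      "^https.+?com/(?<yyyy>[0-9]{4})/(?<mm>[0-9]{2})/(?<dd>[0-9]{2})/.+$";
      "\(.yyyy)-\(.mm)-\(.dd)"
    );
  "https://blogs.sap.com/2018/03/26/monday-morning-thoughts-cloud-native/",
  "https://example.com/about/"
  | date
')
echo "$results"
```

The first line of output is the extracted date; the second is the non-matching string, passed through untouched. That pass-through is worth bearing in mind if any issue body contains something other than a post URL.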

These are named capture groups. Here's one of them, which matches 4 consecutive digits into a capture group named yyyy:

(?<yyyy>[0-9]{4})

Looking at the argument supplied for the second parameter of sub/2, the \( ... ) syntax is string interpolation, which has an expression (in this example .yyyy, .mm and .dd) evaluated and expanded within a string.
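To see where .yyyy, .mm and .dd come from, it can help to look at the named captures on their own. The sketch below uses jq's capture function, which builds an object from the named groups; this is the same kind of object that sub makes available to its replacement string. The shortened, unanchored pattern here is just for illustration.

```shell
# Show the named captures as a JSON object (-c for compact output).
url='https://blogs.sap.com/2018/03/26/monday-morning-thoughts-cloud-native/'
groups=$(jq -n -c --arg url "$url" '
  $url | capture("com/(?<yyyy>[0-9]{4})/(?<mm>[0-9]{2})/(?<dd>[0-9]{2})/")
')
echo "$groups"
```

The result is an object with the keys yyyy, mm and dd, holding the strings "2018", "03" and "26" respectively.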

Adding the postdate property

With the date function ready, I could then simply iterate over the items in the array, adding a new postdate property to each object, with the value of whatever the date function extracts from the item's .body property:

map(. + { postdate: .body|date })

Based on the reduced data set above, this then produces:

[
  {
    "body": "https://blogs.sap.com/2018/03/26/monday-morning-thoughts-cloud-native/",
    "number": 224,
    "title": "Monday morning thoughts- cloud native",
    "postdate": "2018-03-26"
  },
  {
    "body": "https://blogs.sap.com/2018/03/31/scripting-the-workflow-api-with-bash-and-curl/",
    "number": 223,
    "title": "Scripting the Workflow API with bash and curl",
    "postdate": "2018-03-31"
  },
  {
    "body": "https://blogs.sap.com/2022/08/04/introducing-sap-codejam-btp-a-new-group-and-a-first-event/",
    "number": 83,
    "title": "Introducing “SAP CodeJam BTP” - a new group, and a first event",
    "postdate": "2022-08-04"
  },
  {
    "body": "https://blogs.sap.com/2022/10/06/devtoberfest-2022-week-2/",
    "number": 82,
    "title": "Devtoberfest 2022 Week 2",
    "postdate": "2022-10-06"
  }
]
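For anyone wanting to reproduce this without the gh CLI, here is a minimal end-to-end sketch. It assumes jq is on the PATH and uses two of the items shown earlier as inline sample data; the variable names sample and postdates are just for illustration.

```shell
# Two items from the data above, held inline so no file is needed.
sample='[
  {"body": "https://blogs.sap.com/2018/03/26/monday-morning-thoughts-cloud-native/",
   "number": 224, "title": "Monday morning thoughts- cloud native"},
  {"body": "https://blogs.sap.com/2022/10/06/devtoberfest-2022-week-2/",
   "number": 82, "title": "Devtoberfest 2022 Week 2"}
]'

# Add the postdate property to each object, then print just that property.
postdates=$(printf '%s' "$sample" | jq -r '
  def date:
    sub(
      "^https.+?com/(?<yyyy>[0-9]{4})/(?<mm>[0-9]{2})/(?<dd>[0-9]{2})/.+$";
      "\(.yyyy)-\(.mm)-\(.dd)"
    );
  map(. + { postdate: .body|date })
  | .[].postdate
')
echo "$postdates"
```

Against the real data, the same jq program can be given dj-adams-sap.json as a file argument instead of reading from standard input.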

Sorting

Then it's just a simple case of using sort_by (followed optionally by reverse) to get the post date order I want:

map(. + { postdate: .body|date })
| sort_by(.postdate)
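As an illustration of the optional reverse step, the sketch below sorts three of the sample items (deliberately supplied out of order) and lists them newest first. It assumes jq is on the PATH; the one-line "date title" output format is just a choice made for this example.

```shell
# Three items from the data above, in no particular order.
sample='[
  {"body": "https://blogs.sap.com/2018/03/31/scripting-the-workflow-api-with-bash-and-curl/",
   "number": 223, "title": "Scripting the Workflow API with bash and curl"},
  {"body": "https://blogs.sap.com/2022/10/06/devtoberfest-2022-week-2/",
   "number": 82, "title": "Devtoberfest 2022 Week 2"},
  {"body": "https://blogs.sap.com/2018/03/26/monday-morning-thoughts-cloud-native/",
   "number": 224, "title": "Monday morning thoughts- cloud native"}
]'

# Sort ascending by postdate, flip the array, then print one line per post.
newest_first=$(printf '%s' "$sample" | jq -r '
  def date:
    sub(
      "^https.+?com/(?<yyyy>[0-9]{4})/(?<mm>[0-9]{2})/(?<dd>[0-9]{2})/.+$";
      "\(.yyyy)-\(.mm)-\(.dd)"
    );
  map(. + { postdate: .body|date })
  | sort_by(.postdate)
  | reverse
  | .[]
  | "\(.postdate) \(.title)"
')
echo "$newest_first"
```

Sorting on the string works here because YYYY-MM-DD values, being zero-padded and ordered from most to least significant part, sort chronologically when compared as plain text.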

Of course, I could combine the two parts if I didn't want the postdate property to be an explicit fixture in my downstream processing. Something like this:

sort_by(.body | date)

Wrapping up

It did occur to me that, given the pattern of blog post URLs, I could just sort by them directly. Then again, that wouldn't have been as interesting, and I wouldn't have learned anything about named capture groups. Anyway, this post is mostly for me, for when my future self forgets how to use capture groups and the sub function.
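For the curious, here is a rough check of that direct-sort idea: a sketch that compares the order produced by sorting on the raw URL with the order produced by sorting on the extracted date. It assumes jq is on the PATH, and that every URL shares the same prefix before the date, as in the sample items used here.

```shell
# Three items from the data above, in no particular order.
sample='[
  {"body": "https://blogs.sap.com/2018/03/31/scripting-the-workflow-api-with-bash-and-curl/",
   "number": 223, "title": "Scripting the Workflow API with bash and curl"},
  {"body": "https://blogs.sap.com/2022/10/06/devtoberfest-2022-week-2/",
   "number": 82, "title": "Devtoberfest 2022 Week 2"},
  {"body": "https://blogs.sap.com/2018/03/26/monday-morning-thoughts-cloud-native/",
   "number": 224, "title": "Monday morning thoughts- cloud native"}
]'

# Compare the issue-number order from both sorting approaches.
same=$(printf '%s' "$sample" | jq '
  def date:
    sub(
      "^https.+?com/(?<yyyy>[0-9]{4})/(?<mm>[0-9]{2})/(?<dd>[0-9]{2})/.+$";
      "\(.yyyy)-\(.mm)-\(.dd)"
    );
  (sort_by(.body) | map(.number))
  == (sort_by(.body | date) | map(.number))
')
echo "$same"
```

For this sample the comparison yields true. With two posts published on the same day the two approaches could order them differently, since the raw URL comparison would then fall through to the slug.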