Columnar layout with AWK
Here's a breakdown of a simple AWK script I wrote to format values into neatly aligned columns.
(Jump to the end for a couple of updates, thanks gioele and oh5nxo!)
I'm organising my GitHub repositories locally by creating a directory structure representing the different GitHub servers that I use and the orgs and users that I have access to, with symbolic links at the ends of these structures pointing to where I've cloned the actual repositories.
Here's an example of what I started out with:
; find ~/gh -type l
/Users/dja/gh/github.tools.sap/developer-relations/advocates-team-general
/Users/dja/gh/github.com/SAP-samples/teched2020-developer-keynote
/Users/dja/gh/github.com/qmacro-org/auto-tweeter
and what I wanted to end up with (you can see the invocation of the script here too):
; find ~/gh -type l | awk -F/ -vCOLS=5,6,7 -f ~/.dotfiles/scripts/cols.awk
github.tools.sap developer-relations advocates-team-general
github.com       SAP-samples         teched2020-developer-keynote
github.com       qmacro-org          auto-tweeter
In other words, I wanted to select columns from the output and have them printed neatly and aligned. Don't ask me why, I guess it's just some form of OCD.
Anyway, I decided to write this in AWK, partly because I don't know AWK that well, but mostly as a meditation on the early days of Unix and a homage to Brian Kernighan. Talking of homages, I've also decided to share this script by describing it line by line, in homage to Randal L Schwartz, that maverick hero that I learned a great deal from in the Perl world.
Randal wrote columns for magazines, each time listing and describing a Perl script he'd written, line by line. I learned so much from Randal and enjoyed the format, so I thought I'd reproduce it here.
Let's start with the script, in full, courtesy of GitHub's embeddable Gist mechanism. Incidentally, I created the gist from the command line using GitHub's CLI, gh, like this:
; gh gist create --public scripts/cols.awk
I subsequently edited it too (there are now multiple revisions) with:
; gh gist edit c84f5a17dc4740dc2defa6a913cd3c2c
OK, so here's the entire script.
Remember that AWK scripts are generally data driven, in that you describe patterns and then what to do when those patterns are matched. This is described nicely in the Getting Started with awk section of the GNU AWK manual. The approach is <pattern> <action>, where the actions are within a {...} block. In this script, there are two special (and common) patterns used: BEGIN and END, i.e. before and after all lines have been processed. There's an <action> block in the middle which has no pattern; that means it's called for each and every line in the input. There's also an <action> block with a specific pattern, which we'll look at shortly.
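To make the <pattern> <action> shape concrete, here's a tiny standalone sketch. It isn't part of cols.awk; the sample input and the names pattern_out and count are made up for illustration.

```shell
# Hypothetical sample input; the four blocks mirror the shapes described above.
pattern_out=$(printf 'apple\nbanana\ncherry\n' | awk '
BEGIN  { print "start" }            # runs once, before any input is read
/an/   { print "match:", $0 }       # runs only for records matching the pattern
       { count++ }                  # no pattern: runs for every record
END    { print "records:", count }  # runs once, after all input is read
')
printf '%s\n' "$pattern_out"
```

Only banana contains "an", so the middle pattern fires once, while the pattern-less block counts all three records.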
The invocation
Note the invocation earlier looks like this:
awk -F/ -vCOLS=5,6,7 -f ~/.dotfiles/scripts/cols.awk
Here's what the options do:
-F/ says that the input field separator is the / character
-vCOLS=5,6,7 sets the value 5,6,7 for the variable COLS
-f <script> tells AWK where to find the script
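Why 5, 6 and 7? A quick sketch, using one of the paths from the find output above, shows how -F/ numbers the fields. The variable name field_out is just for this illustration, and I've written -v with a space before the assignment, which is the form POSIX documents.

```shell
# Field 1 is the empty string before the leading slash, so the
# server, org and repository names land in fields 5, 6 and 7.
field_out=$(echo '/Users/dja/gh/github.com/qmacro-org/auto-tweeter' |
    awk -F/ -v COLS=5,6,7 '{ print NF, COLS, $5, $6, $7 }')
printf '%s\n' "$field_out"
```

This prints the field count (7), the raw value of COLS, and then the three fields we're after.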
OK, let's start digging in.
The BEGIN pattern
Lines 7-9 just make sure that the optional GAP variable, if not explicitly set (using a -v option in the invocation), is set to 1. That's how many spaces we want between each column. If we wanted a value other than the default, we'd add an extra option to the invocation, for example -vGAP=2.
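A defaulting BEGIN block along these lines might look like the sketch below. The exact test in cols.awk may differ; comparing against the empty string is my assumption here, chosen so that an explicit GAP of 0 would be left alone.

```shell
# Assumed form of the default: only set GAP when nothing was passed via -v.
gap_default=$(awk 'BEGIN { if (GAP == "") GAP = 1; print GAP }')
gap_custom=$(awk -v GAP=2 'BEGIN { if (GAP == "") GAP = 1; print GAP }')
printf 'default: %s, custom: %s\n' "$gap_default" "$gap_custom"
```

Variables assigned with -v are already available inside BEGIN, which is why the check can live there.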
The NR == 1 pattern
The action in this block is executed on only one occasion: when the value of NR is 1.
NR is a special AWK variable that represents the record number, i.e. the value is 1 for the first record, 2 for the second, and so on. Note that there's also FNR (file record number), which comes in handy when you're processing multiple input files. So the <action> block related to this NR == 1 pattern is only executed once, when processing the first record in the input.
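The difference between NR and FNR only shows up with more than one input file. Here's a small sketch using two throwaway files (the file names and contents are invented for the example):

```shell
# Two hypothetical input files in a temporary directory.
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/one.txt"
printf 'c\nd\n' > "$tmpdir/two.txt"

# NR keeps counting across files; FNR restarts at 1 for each file.
nr_out=$(awk '{ print NR, FNR, $0 }' "$tmpdir/one.txt" "$tmpdir/two.txt")
printf '%s\n' "$nr_out"

rm -r "$tmpdir"
```

With a single input stream, as in the find pipeline, the two variables always have the same value.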
This <action> block, specifically lines 18-24, deals with the value for the COLS variable. If it's been set (as in our invocation: -vCOLS=5,6,7) it splits out the column numbers (5, 6 and 7 here) into an array fieldlist. If it's not been set, then the default should be all columns, which are put into the fieldlist array using the loop in lines 21-23. Note that NF is another special variable, the value of which tells us the number of fields in the current record.
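Here's a sketch of that logic in isolation. It's written from the description above rather than copied from the script; the counter n, the shell variable pick and the sample input are my own additions.

```shell
pick='
NR == 1 {
    if (COLS != "") {
        n = split(COLS, fieldlist, ",")             # "2,4" becomes fieldlist[1]=2, fieldlist[2]=4
    } else {
        n = NF                                      # no COLS given: choose every field
        for (i = 1; i <= NF; i++) fieldlist[i] = i
    }
}
{
    # Print only the chosen fields, separated by single spaces.
    for (i = 1; i <= n; i++)
        printf "%s%s", $(fieldlist[i]), (i < n ? " " : "\n")
}'
chosen=$(echo 'a/b/c/d' | awk -F/ -v COLS=2,4 "$pick")
everything=$(echo 'a/b/c/d' | awk -F/ "$pick")
printf '%s\n%s\n' "$chosen" "$everything"
```

The split function returns the number of elements it created, which is handy for looping over the array in a predictable order.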
The default pattern
Lines 31-36 represent the action for the default pattern, i.e. this is executed for each line in the input. That includes even the first record, although we've done some processing for the first record in the <action> block for the NR == 1 pattern already. That's because all patterns are tested, in sequence, unless an action invokes an explicit next to skip to the next input record (see update #2 at the end of this post for the attribution for this info).
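A two-record sketch shows the difference that next makes. The messages and variable names are invented for the example.

```shell
# Without next, the first record is handled by both blocks.
fallthrough=$(printf 'a\nb\n' | awk '
NR == 1 { print "setup for", $0 }
        { print "record", $0 }')

# With next, the first record never reaches the second block.
skipped=$(printf 'a\nb\n' | awk '
NR == 1 { print "setup for", $0; next }
        { print "record", $0 }')

printf '%s\n---\n%s\n' "$fallthrough" "$skipped"
```

In cols.awk we want the first behaviour, since the first record needs to be stored and measured like every other one.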
The script has to work out what the longest word in each column is, and for that it needs to read through the entire input. I think perhaps there may be better ways of doing this, but here's what I did.
Because this script needs two passes over the input, we store the current record in an array called records in line 32. Worthy of note here is that each field in a record is represented by its positional variable, i.e. $1, $2, and so on, and $0 represents the entire record. In lines 33-35 we build up an array fieldlengths of the longest field length by position. Arguably we only really need to remember the longest lengths of the fields in fieldlist, but hey.
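As a sketch of this first pass on its own, here's the store-and-measure step with an END block that just reports what was collected. The sample input and the maxnf helper are assumptions for this illustration, not something from the script.

```shell
lengths_out=$(printf 'aa/b/cccc\na/bbb/c\n' | awk -F/ '
{
    records[NR] = $0                      # keep the raw record for the second pass
    if (NF > maxnf) maxnf = NF            # hypothetical helper: widest record seen
    for (i = 1; i <= NF; i++)
        if (length($i) > fieldlengths[i])
            fieldlengths[i] = length($i)  # longest value seen in column i
}
END {
    for (i = 1; i <= maxnf; i++)
        print "column", i, "width", fieldlengths[i]
    print "stored", NR, "records, first:", records[1]
}')
printf '%s\n' "$lengths_out"
```

An unset fieldlengths[i] compares as 0, so the first value seen in each column always wins the initial comparison.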
The END pattern
Lines 40-49 represent the action for the special END pattern, i.e. once the records have been processed (once). At this stage we have the longest lengths for each of the fields (columns), and now we just need to go through the input again, which we have in the records array.
In line 42 we use the split function to split out the record we're processing into an array called fields:
split(records[record], fields, FS)
The third argument supplied to this call is FS, which is another special variable representing the field separator for this execution. Remember the -F/ option in the invocation, shown earlier? In this case, the value of FS is also therefore /. If the field separator is different (the default is whitespace) then the value of FS will be different too.
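Here's a short sketch of re-splitting a stored record in END, to show that FS still carries the value given with -F. The input line and the variable names n and split_out are just for this example.

```shell
split_out=$(echo 'github.com/qmacro-org/auto-tweeter' | awk -F/ '
{ records[NR] = $0 }
END {
    n = split(records[1], fields, FS)   # FS is "/" here because of -F/
    print n, fields[2]
}')
printf '%s\n' "$split_out"
```

The array produced by split is indexed from 1, matching the numbering of $1, $2 and so on.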
Then in lines 43-46 we start printing out each chosen field (remember, the chosen ones are in fieldlist). The printf call in line 45 is special, so let's break that down here:
printf "%*-s", fieldlengths[f] + GAP, fields[f]
Like other flavours of printf, this one also takes a pattern and one or more variables to substitute into that pattern. The pattern here is for a single variable, and is %*-s. This means that the variable to print is a string (basic form is %s), which should be padded out, left justified (-) by a value also to be supplied as a variable (*).
So we need to supply two variables, the width to which the variable value should be padded, and the variable itself. And that's what is supplied. First, we have fieldlengths[f] + GAP, which works out to be the longest length for that field (column), plus zero or more spaces as defined in GAP. Then we have the variable that we want printed, i.e. fields[f].
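To see the padding on its own, here's a one-off sketch. I've written the format with the flag before the width (%-*s), which is the order the C printf format grammar uses; the trailing bar is only there to make the padding visible, and the width calculation stands in for fieldlengths[f] + GAP.

```shell
padded=$(awk 'BEGIN {
    width = length("github.tools.sap") + 1    # longest value plus a GAP of 1
    printf "%-*s|\n", width, "github.com"     # the bar marks where the next column starts
}')
printf '%s\n' "$padded"
```

The ten characters of github.com are followed by seven spaces, so the bar lands in column 18.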
Noting that printf won't print a newline unless it's explicitly given (as \n), this works well because then the consecutive fields are printed on the same line. Line 47 takes care of printing a newline when all the fields are output for that record.
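For reference, here's a sketch of how the pieces described above might fit together. It's reconstructed from the prose in this post rather than copied from the gist, so the line numbers quoted above won't match it, and the nchosen counter, the GAP == "" test and the %-*s flag order are all my assumptions.

```shell
cols_sketch='
BEGIN {
    if (GAP == "") GAP = 1                          # default gap between columns
}
NR == 1 {
    if (COLS != "") {
        nchosen = split(COLS, fieldlist, ",")       # explicit column choice
    } else {
        nchosen = NF                                # default: all columns
        for (i = 1; i <= NF; i++) fieldlist[i] = i
    }
}
{
    records[NR] = $0                                # remember the record for pass two
    for (i = 1; i <= NF; i++)
        if (length($i) > fieldlengths[i]) fieldlengths[i] = length($i)
}
END {
    for (r = 1; r <= NR; r++) {
        split(records[r], fields, FS)
        for (c = 1; c <= nchosen; c++) {
            f = fieldlist[c]
            printf "%-*s", fieldlengths[f] + GAP, fields[f]
        }
        printf "\n"
    }
}'

aligned=$(printf '%s\n' \
    /Users/dja/gh/github.tools.sap/developer-relations/advocates-team-general \
    /Users/dja/gh/github.com/SAP-samples/teched2020-developer-keynote \
    /Users/dja/gh/github.com/qmacro-org/auto-tweeter |
    awk -F/ -v COLS=5,6,7 "$cols_sketch")
printf '%s\n' "$aligned"
```

Note that in this sketch the last column is padded too, so each line ends with trailing spaces.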
And that's it. As the tagline for this blog says, I reserve the right to be wrong. I'm not a proficient AWK scripter, but this works for me.
Happy scripting!
Update #1, later the same day: Over on Lobsters, the user gioele contributed a pipeline version, which also helps me in a different area (small pieces loosely joined) of the same Unix meditation: find ~/gh -type l | cut -d/ -f5,6,7 | column -s/ -t. Thanks gioele!
Update #2, even later the same day: Over on Reddit, the user oh5nxo puts me right; in an earlier version of this script (and this blog post) I'd put the lines of code that are now in the NR == 1 <action> block inside the main (default) <action> block, as I'd mistakenly thought that I'd have to otherwise repeat some code. That wasn't the case. Thanks for sharing your knowledge, oh5nxo! I've updated the script and this post to reflect that.