Introduction
Not everyone can afford (or needs) an ELK stack for analyzing logs. Sometimes reading files from the command line is the fastest and most straightforward way to spot a problem, if you know how to filter the data.
AWK is an excellent UNIX tool (actually it’s an interpreted programming language) for processing structured data, for example rows and columns generated by applications and tools.
The name AWK comes from the initials of the language's three developers: Alfred Aho, Peter Weinberger and Brian Kernighan, who presented their work in 1977.
AWK
The powerful logic of AWK is common to other UNIX tools: it works on lines of text. AWK processes one line at a time and can apply rules to it, each composed of
pattern {action}
which means that if a line matches the pattern, then the action is performed on it.

- both pattern and action are optional
- if pattern is not specified, the action is applied to all records
- if action is not specified, the records matching the pattern are printed to standard output
In short, AWK reads the text one line at a time, splits each line into multiple fields using a delimiter (whitespace by default), assigns those fields to variables ($1, $2, $3, …) and applies the actions defined by the developer.
Like most UNIX tools, AWK can receive its input through a pipe, therefore it's often used in combination with cat. But it can also process existing files directly.
Let’s see a couple of basic examples. Given this input file:
foo 21
bar 14
baz 44
qux 10
quux 53
quuz 29
waldo 75
The following command will print the second column
cat input.txt | awk '{print $2}'
The previous command is equivalent to
awk '{print $2}' input.txt
As you can see, there is no pattern, so the print action is applied to all rows.
Now let’s say that we want to retrieve the number only for waldo
awk '/waldo/ { print $2 }' input.txt
This time the action is applied only when the pattern matches. The output is of course 75.
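And if we specify only a pattern, without any action, AWK prints the whole matching record:
awk '/waldo/' input.txt
which in this case outputs the entire line, waldo 75.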
More complex examples
From time to time in my daily job I have to analyze log files written by NGINX, the popular HTTP server and reverse proxy.
In NGINX, all client requests to the server are recorded in the access log in a specified format using the ngx_http_log_module module.
The default logging format is called combined and is usually defined in the configuration as follows:
http {
...
log_format combined '$remote_addr - $remote_user [$time_local] '
'"$request" $status $body_bytes_sent '
'"$http_referer" "$http_user_agent"';
...
}
This produces a log entry like the following:
127.0.0.1 - - [08/Apr/2020:06:25:16 +0000] "GET /api/event HTTP/2.0" 200 937 "https://www.mywebsite.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36"
As you can see, each field has a specific meaning:

- $remote_addr – the IP address the request was made from
- $remote_user – the HTTP authenticated user; this will be blank for most apps, as modern apps do not use HTTP-based authentication
- [$time_local] – the timestamp, in the server's timezone
- "$request" – the HTTP method (GET, POST, etc.), the requested path without arguments and the HTTP protocol version
- $status – the HTTP response code returned by the server
- $body_bytes_sent – the size of the server response in bytes
- "$http_referer" – the referral URL (if present)
- "$http_user_agent" – the user agent as seen by the server
Since NGINX writes information about every client request, on large websites with a lot of traffic you can have access log files with hundreds of thousands of lines. That’s why having a tool like AWK is so important.
For example, knowing the default NGINX format, we can extract single pieces of information:
awk '{print $1}' access.log # print the IP address
awk '{print $4,$5}' access.log # print the date/time
awk '{print $9}' access.log # print the status code
awk '{print $10}' access.log # print the size
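Note that the quoted request string is itself split on whitespace, so the HTTP method, the requested path and the protocol end up in three separate fields (with the double quotes still attached to the first and the last one):
awk '{print $6,$7,$8}' access.log # print the method, path and protocol, e.g. "GET /api/event HTTP/2.0"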
Count requests by response code
Let’s say that we want to count the number of requests and group them by response code.
We can extract this information using a combination of AWK and other UNIX utilities, such as sort and uniq.
awk '{print $9}' access.log | sort | uniq -c | sort -rn
All we’re doing here is using AWK to extract the status code field from the log file and piping it through some other commands. The first sort is there to enable uniq to properly identify and count (-c) unique status codes. The final sort orders the result numerically (-n) in descending (-r) order.
254982 200
57281 204
442 499
318 502
122 404
102 400
5 302
2 401
2 304
1 408
As you can see, we can already identify some issues. We have a bunch of 502 errors, which usually indicate serious issues in the application, but that is a different story.
The output of the previous command can also be achieved directly with AWK, using a little bit more of its power:
awk '{count[$9]++}; END { for (i in count) print i, count[i] }' access.log
AWK supports associative arrays, which are like traditional arrays except that they use strings as indexes rather than numbers.
The syntax is simple:
arrayname[string]=value
In the previous command we create an array keyed by each HTTP status code found in the input file and progressively increment the corresponding value on each occurrence. At the end of the file we print the content of the array by iterating over its keys and values.
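The same technique works for any field. For example, here is a minimal sketch that uses an associative array keyed by the client IP address ($1) to count the requests made by each client:
awk '{requests[$1]++}; END { for (ip in requests) print requests[ip], ip }' access.log | sort -rn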
Find which URLs returned a status code
The following searches for requests that returned a 404 response code and counts how many times each URL appears among them:
awk '($9 ~ /404/)' access.log | awk '{count[$7]++}; END { for (i in count) print i, count[i] }'
However, in this case a combination of awk, sort and uniq is slightly more intuitive:
awk '($9 ~ /404/)' access.log | awk '{print $7}' | sort | uniq -c | sort -rn
Notice the tilde (~) operator, which in AWK means pattern matching.
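If you prefer, the match and the grouping can also live in a single AWK program; this is a variation of the first command above that additionally sorts the URLs by the number of hits:
awk '($9 ~ /404/) {count[$7]++}; END { for (i in count) print count[i], i }' access.log | sort -rn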
Similarly, for the 502 error code we can run the following command:
awk '($9 ~ /502/)' access.log | awk '{print $7}' | sort | uniq -c | sort -rn
Or maybe we want to go further and list all the response codes other than 200 (OK). We can do that with the negated match operator (!~):
awk '($9 !~ /200/)' access.log | awk '{print $9}' | sort | uniq | sort -rn
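If we also want to count how many times each of those codes appears, the filter and the grouping can again be combined into a single AWK program; here is a sketch that uses a plain comparison instead of a regular expression:
awk '$9 != 200 {count[$9]++}; END { for (i in count) print count[i], i }' access.log | sort -rn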
Find most used user agents
Maybe you want to list all user agents ordered by the number of times they appear. Intuitively you can do:
awk '{print $12}' access.log | sort | uniq -c | sort -rn
but since the default separator in AWK is whitespace, you're going to miss a lot of information, because a user agent like the following
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36"
will simply become "Mozilla/5.0".
Therefore, this time we have to split our rows using a different separator. The double-quote character (") is the way to go:
awk -F\" '{print $6}' access.log | sort | uniq -c | sort -rn
Count how many times Googlebot indexes the website
awk -F\" '($6 ~ /Googlebot/)' access.log | wc -l
The following is a variation that groups the number of Googlebot hits by date, assuming that the format of the date is like 22/Aug/2021:00:16:03 +0000
for FILE in access.log.*.gz; do zcat "$FILE" | awk -F\" '($6 ~ /Googlebot/)' - | awk '{a[substr($4,2,11)]+=1}END{for(b in a){print b,a[b]}}'; done
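The same grouping can of course be run against the current, uncompressed log too; substr($4,2,11) extracts the 22/Aug/2021 portion of the timestamp field:
awk -F\" '($6 ~ /Googlebot/)' access.log | awk '{a[substr($4,2,11)]+=1}END{for(b in a){print b,a[b]}}'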
Most requested URLs
awk -F\" '{print $2}' access.log | awk '{print $2}' | sort | uniq -c | sort -rn
However, this command does not consider the HTTP method and protocol (e.g. GET requests are counted together with POST requests). You can remove the second AWK command to also group by method and protocol:
awk -F\" '{print $2}' access.log | sort | uniq -c | sort -rn
Output:
23617 GET /api/cart HTTP/2.0
5635 POST /api/cart HTTP/2.0
235 GET /api/cart HTTP/1.1
68 POST /api/cart HTTP/1.1
Maybe you want to discard requests that appear fewer than 100 times. You can do that, again with AWK:
awk -F\" '{print $2}' access.log | sort | uniq -c | awk '$1>=100{print}' | sort -rn
Output:
23617 GET /api/cart HTTP/2.0
5635 POST /api/cart HTTP/2.0
235 GET /api/cart HTTP/1.1
Most requested URLs containing pattern
awk -F\" '($2 ~ "/api/event"){print $2}' access.log | awk '{print $2}' | sort | uniq -c
Output:
563 /api/event
46 /api/event/1
21 /api/event/3
18 /api/event/4
...
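As in the previous examples, you can append a final numeric sort if you prefer to order the result by the number of hits instead of by URL:
awk -F\" '($2 ~ "/api/event"){print $2}' access.log | awk '{print $2}' | sort | uniq -c | sort -rn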
Group most requested URLs containing pattern by hour
Let’s say that we want to know how many times a particular URL has been requested for each hour. The following command
awk -F\" '($2 ~ "/api/event"){print $1}' access.log
gives us back the first part of each row (up to the first " character) where the URL matches our pattern:
1.2.3.4 - - [07/Apr/2020:09:46:06 +0000]
4.3.2.1 - - [07/Apr/2020:09:46:06 +0000]
...
then we can split again using the [ character:
awk -F\[ '{print $2}'
07/Apr/2020:09:46:06 +0000]
07/Apr/2020:09:46:06 +0000]
...
and finally split again using the : character, before grouping and sorting:
awk -F\: '{print $1,$2}'
07/Apr/2020 09
07/Apr/2020 09
...
The final command will then be
awk -F\" '($2 ~ "/api/event"){print $1}' access.log | awk -F\[ '{print $2}' | awk -F\: '{print $1,$2}' | sort | uniq -c
Output:
07/Apr/2020 09 7894
07/Apr/2020 10 7566
07/Apr/2020 11 6464
07/Apr/2020 12 7809
07/Apr/2020 13 7548
07/Apr/2020 14 9547
07/Apr/2020 15 8497
07/Apr/2020 16 9451
07/Apr/2020 17 8283
07/Apr/2020 18 9920
07/Apr/2020 19 10247
07/Apr/2020 20 10366
07/Apr/2020 21 8675
07/Apr/2020 22 8419
07/Apr/2020 23 8657
08/Apr/2020 00 7756
08/Apr/2020 01 6668
08/Apr/2020 02 6142
08/Apr/2020 03 6145
08/Apr/2020 04 6641
08/Apr/2020 05 5860
08/Apr/2020 06 2045
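For completeness, the same hourly grouping can also be expressed as a single AWK program, using the default whitespace splitting and substr() to carve the date and the hour out of the timestamp field ($4). Consider it a sketch, assuming the combined format shown earlier:
awk '$7 ~ "/api/event" {
  date = substr($4, 2, 11)   # e.g. 07/Apr/2020
  hour = substr($4, 14, 2)   # e.g. 09
  hits[date " " hour]++
}
END { for (h in hits) print h, hits[h] }' access.log | sort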
I will continue to update the post whenever I find useful variations.