Parse log files with AWK


Introduction

Not everyone can afford (or needs) an ELK stack for analyzing logs. Sometimes reading files from the command line is the fastest and most straightforward way to spot a problem, if you know how to filter the data.

AWK is an excellent UNIX tool (actually, it’s an interpreted programming language) for processing structured data, such as the rows and columns generated by applications and tools.

The name AWK comes from the initials of the language’s three developers: Alfred Aho, Peter Weinberger and Brian Kernighan, who presented their work in 1977.

The AWK programming language book

AWK

The powerful logic of AWK is common to other UNIX tools: working on lines of text. AWK processes one line at a time and can apply rules to it, each composed of

pattern {action}

which means that if a line matches the pattern, then the action is performed on it.

  • both pattern and action are optional
  • if the pattern is not specified, the action is applied to all records
  • if the action is not specified, each record that matches the pattern is printed to standard output

In short, AWK reads the text one line at a time, splits the input into fields using a delimiter (whitespace by default1), assigns those fields to variables ($1, $2, $3…) and applies the actions defined by the developer.

Like most UNIX tools, AWK can receive its input through a pipe, so it’s often used in combination with cat. But it can also process existing files directly.

Let’s see a couple of basic examples. Given this input file:

foo 21
bar 14
baz 44
qux 10
quux 53
quuz 29
waldo 75

The following command will print the second column

cat input.txt | awk '{print $2}'

The previous command is equivalent to

awk '{print $2}' input.txt

As you can see, there is no pattern so the action print is applied to all rows.

Now let’s say that we want to retrieve the number only for waldo

 awk '/waldo/ { print $2 }' input.txt

This time the action is applied only when the pattern matches. The output is of course 75.
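
Since printing the matching record is the default action, a rule with a pattern and no action block behaves much like grep:

awk '/waldo/' input.txt

which prints the entire matching line: waldo 75.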

More complex examples

From time to time in my daily job I have to analyze log files written by NGINX, the popular HTTP server and reverse proxy.

In NGINX, all client requests to the server are recorded in the access log in a specified format using the ngx_http_log_module module.

The default logging format is called combined and is usually defined in the configuration as follows

http {
  ...
  log_format combined '$remote_addr - $remote_user [$time_local] '
                      '"$request" $status $body_bytes_sent '
                      '"$http_referer" "$http_user_agent"';
  ...
}

This produces a log entry like the following one

127.0.0.1 - - [08/Apr/2020:06:25:16 +0000] "GET /api/event HTTP/2.0" 200 937 "https://www.mywebsite.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36"

As you can see, each field has a specific meaning

  • $remote_addr – the IP address from which the request was made
  • $remote_user – the HTTP authenticated user. This will be blank for most apps, as modern apps do not use HTTP-based authentication
  • [$time_local] – timestamp as per server timezone
  • "$request" – HTTP request type (GET, POST, etc) + requested path without args + HTTP protocol version
  • $status – HTTP response code from server
  • $body_bytes_sent – size of server response in bytes
  • "$http_referer" – referral URL (if present)
  • "$http_user_agent" – user agent as seen by server

Since NGINX writes information about every client request, on large websites with a lot of traffic you can have access log files with hundreds of thousands of lines. That’s why having a tool like AWK is so important.

For example, knowing the NGINX default format we can extract single pieces of information

awk '{print $1}' access.log         # print the IP address
awk '{print $4,$5}' access.log      # print the date/time
awk '{print $9}' access.log         # print the status code
awk '{print $10}' access.log        # print the size

Count requests by response code

Let’s say that we want to count the number of requests and group them by response code.

We can extract this information using a combination of AWK and other UNIX utilities, such as sort and uniq.

awk '{print $9}' access.log | sort | uniq -c | sort -rn

All we’re doing here is using AWK to extract the status code field from the log file and piping it through some other commands. The first sort is to enable uniq to properly identify and count (-c) unique status codes. The final sort orders the result by number (-n) in descending (-r) order.

254982 200
 57281 204
   442 499
   318 502
   122 404
   102 400
     5 302
     2 401
     2 304
     1 408

As you can see, we can already identify some issues. We have a bunch of 502 errors, which usually indicate serious problems in the application, but that’s a different story.

The same output as the previous command can be achieved directly with AWK, using a little more of its power:

awk '{count[$9]++}; END { for (i in count) print i, count[i] }' access.log

AWK supports associative arrays, which are like traditional arrays except that they use strings as their indexes rather than numbers.

The syntax is simple

arrayname[string]=value

In the previous command we create an array with a key for each HTTP status code found in the input file and progressively increment the corresponding value on each occurrence. At the end of the file we print the content of this array by iterating over its keys and values.
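
The same technique works for any field. As another small illustration (assuming the combined format above, where $1 is the client IP and $10 the response size), we can sum the bytes sent to each client address:

awk '{ bytes[$1] += $10 } END { for (ip in bytes) print ip, bytes[ip] }' access.log | sort -k2 -rn | head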

Find which URLs returned a status code

The following searches for requests that returned a 404 response code and counts how many times each URL was requested:

awk '($9 ~ /404/)' access.log | awk '{count[$7]++}; END { for (i in count) print i, count[i] }'

However, in this case a combination of awk, sort and uniq is slightly more intuitive:

awk '($9 ~ /404/)' access.log | awk '{print $7}' | sort | uniq -c | sort -rn

Notice the tilde (~) operator, which in AWK performs a regular-expression match.
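
If you prefer an exact string comparison instead of a regular-expression match, the == operator works as well:

awk '($9 == "404")' access.log | awk '{print $7}' | sort | uniq -c | sort -rn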

Similarly, for the 502 error code we can run the following command:

awk '($9 ~ /502/)' access.log | awk '{print $7}' | sort | uniq -c | sort -rn

Or maybe we want to go further and list every status code other than 200 (OK). We can do that with a negated regular-expression match:

awk '($9 !~ /200/)' access.log | awk '{print $9}' | sort | uniq | sort -rn
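
If you also want to know how many times each of those codes appears, you can combine the pattern and the action in a single AWK program and count with uniq -c, as before:

awk '($9 !~ /200/) {print $9}' access.log | sort | uniq -c | sort -rn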

Find most used user agents

Maybe you want to list all user agents ordered by the number of times they appear. Intuitively you can do:

awk '{print $12}' access.log | sort | uniq -c | sort -rn

but since the default separator in AWK is whitespace, you’re going to miss a lot of information, because a user agent like the following

"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36"

will simply become "Mozilla/5.0.

Therefore this time we have to split our rows using a different separator. The double-quote character " is the way to go:

awk -F\" '{print $6}' access.log | sort | uniq -c | sort -rn

Count how many times Googlebot indexes the website

awk -F\" '($6 ~ /Googlebot/)' access.log | wc -l

The following is a variation that groups the number of Googlebot hits by date, assuming that the format of the date is like 22/Aug/2021:00:16:03 +0000

for FILE in access.log.*.gz; do zcat "$FILE" | awk -F\" '($6 ~ /Googlebot/)' - | awk '{a[substr($4,2,11)]+=1} END {for (b in a) {print b, a[b]}}'; done

Most requested URLs

awk -F\" '{print $2}' access.log | awk '{print $2}' | sort | uniq -c | sort -rn

However, this command does not consider the HTTP method and protocol (e.g. GET requests are treated like POST requests). You can remove the second AWK command to also group by method and protocol:

awk -F\" '{print $2}' access.log | sort | uniq -c | sort -rn

Output:

23617 GET /api/cart HTTP/2.0
 5635 POST /api/cart HTTP/2.0
  235 GET /api/cart HTTP/1.1
   68 POST /api/cart HTTP/1.1

Maybe you want to discard requests that appear fewer than 100 times. You can do that again with AWK:

awk -F\" '{print $2}' access.log | sort | uniq -c | awk '$1>=100{print}' | sort -rn

Output:

23617 GET /api/cart HTTP/2.0
 5635 POST /api/cart HTTP/2.0
  235 GET /api/cart HTTP/1.1
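
The same threshold filter can also be applied entirely in AWK with an associative array, as in the earlier status-code example:

awk -F\" '{count[$2]++} END { for (r in count) if (count[r] >= 100) print count[r], r }' access.log | sort -rn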

Most requested URLs containing pattern

awk -F\" '($2 ~ "/api/event"){print $2}' access.log | awk '{print $2}' | sort | uniq -c

Output:

  563 /api/event
   46 /api/event/1
   21 /api/event/3
   18 /api/event/4
   ...

Group most requested URLs containing pattern by hour

Let’s say that we want to know how many times a particular URL has been requested for each hour. The following command

awk -F\" '($2 ~ "/api/event"){print $1}' access.log

gives us back the first part of each row (up to the first " character) where the URL matches our pattern.

1.2.3.4 - - [07/Apr/2020:09:46:06 +0000]
4.3.2.1 - - [07/Apr/2020:09:46:06 +0000]
...

then we can split again using the [ character

awk -F\[ '{print $2}'
07/Apr/2020:09:46:06 +0000]
07/Apr/2020:09:46:06 +0000]
...

and finally split again using the : character, before grouping and sorting:

awk -F\: '{print $1,$2}'
07/Apr/2020 09
07/Apr/2020 09
...

The final command will then be

awk -F\" '($2 ~ "/api/event"){print $1}' access.log | awk -F\[ '{print $2}' | awk -F\: '{print $1,$2}' | sort | uniq -c

Output:

  7894 07/Apr/2020 09
  7566 07/Apr/2020 10
  6464 07/Apr/2020 11
  7809 07/Apr/2020 12
  7548 07/Apr/2020 13
  9547 07/Apr/2020 14
  8497 07/Apr/2020 15
  9451 07/Apr/2020 16
  8283 07/Apr/2020 17
  9920 07/Apr/2020 18
 10247 07/Apr/2020 19
 10366 07/Apr/2020 20
  8675 07/Apr/2020 21
  8419 07/Apr/2020 22
  8657 07/Apr/2020 23
  7756 08/Apr/2020 00
  6668 08/Apr/2020 01
  6142 08/Apr/2020 02
  6145 08/Apr/2020 03
  6641 08/Apr/2020 04
  5860 08/Apr/2020 05
  2045 08/Apr/2020 06
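
The same hourly breakdown can be computed with an associative array as well, in a single counting pass (a variation on the commands above; substr($4,2,14) extracts the date and hour, e.g. 07/Apr/2020:09, from the bracketed timestamp):

awk -F\" '($2 ~ "/api/event"){print $1}' access.log | awk '{ count[substr($4,2,14)]++ } END { for (h in count) print h, count[h] }' | sort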

I will continue to update the post whenever I find useful variations.


  1. By default, any run of spaces and/or tabs and/or newlines. ↩︎
