Debugging, Unicorns, Rails and Timeouts

The goal of this post is to describe a debugging session, and different possible approaches for debugging in general. It also contains some tips and tricks to use with unicorn and Rails to debug request timeouts.

I usually enjoy debugging because of the unknown thing to fix, like a pirate finding a hidden box of gold :). Seriously, I think one of the most important things in writing software is debugging. Even Rob Pike thinks that “it leads to better software”. So being good and fast at debugging will really help your productivity as a software engineer.

The problem

A few days ago, a colleague notified me about one of our rails project which sporadically was returning 502 error pages. Surprisingly, after a refresh the page rendered correctly. This issue was reported to him by a QA engineer, who is working in another time zone (our day was their night).

So I was like um…um… interesting, time for some gold digging :)

For some background, the application is a typical rails stack with Nginx serving as a proxy to unicorn upstreams replying to http requests.

Debugging session

Rob Pike said “Ken taught me that thinking before debugging is extremely important”, so I started to build, in my head, the whole lifecycle of the request ending up in 502 error pages. The request will hit our load balancer first, which will choose one of the nginx web servers behind it. After the nginx server receives the request, if it’s a dynamic request that is not cached, it will proxy that to the application server, a unicorn master. The unicorn master will send that to one of its available workers to serve the request.

So I need to start by checking the nginx error log to see what’s happening when the request hit that part of the stack.

tail -f /var/log/nginx/myapp_error.log
2013/10/31 12:13:58 [error] 15893#0: *1080 upstream prematurely closed connection while reading response header from upstream, client: 8.82.88.88, server: -, request: "GET /api/v1/endpoints?page=24&perPage=100 HTTP/1.1", upstream: "http://unix:/myapp/shared/sockets/myapp.socket:/api/v1/endpoints?&page=24&perPage=100", host: "myapp.meltwater.com"
2013/10/31 12:27:04 [error] 15894#0: *1217 upstream prematurely closed connection while reading response header from upstream, client: 10.102.128.11, server: -, request: "GET /api/v1/endpoints?page=104&perPage=100 HTTP/1.1", upstream: "http://unix:/myapp/shared/sockets/myapp.socket:/api/v1/endpoints?page=104&perPage=100", host: "myapp.meltwater.com"

As you can see, when nginx was reading the response from the upstream unicorn worker handling the request, the unicorn prematurely closed the connection prompting nginx to send a 502 error response to the browser client. So the problem is at the upstream side, unicorn. Time to check unicorn logs.

A few characters on my terminal, I was served by:

tail -f unicorn.stderr.log
W, [2013-10-31T07:05:44.389967 #15146]  WARN -- : #<Unicorn::HttpServer:0x00000001286210>: worker (pid: 15146) exceeds memory limit (380649472 bytes > 212010887 bytes)
W, [2013-10-31T07:05:44.390662 #15146]  WARN -- : Unicorn::WorkerKiller send SIGQUIT (pid: 15146) alive: 66593 sec (trial 1)
I, [2013-10-31T07:05:44.928483 #6171]  INFO -- : reaped #<Process::Status: pid 15146 exit 0> worker=10
I, [2013-10-31T07:05:47.322550 #8323]  INFO -- : worker=10 ready
W, [2013-10-31T07:05:54.551811 #11959]  WARN -- : #<Unicorn::HttpServer:0x00000001286210>: worker (pid: 11959) exceeds memory limit (381841408 bytes > 261913909 bytes)
W, [2013-10-31T07:05:54.551981 #11959]  WARN -- : Unicorn::WorkerKiller send SIGQUIT (pid: 11959) alive: 68810 sec (trial 1)
I, [2013-10-31T07:05:55.027671 #6171]  INFO -- : reaped #<Process::Status: pid 11959 exit 0> worker=7

Some background here before we interpret the logs. In the rails application, we’re using a gem called unicorn-worker-killer. Basically the gem provides automatic restart of Unicorn workers based on max number of requests, and process memory size (RSS), without affecting any requests. This greatly improves site’s stability by avoiding unexpected memory exhaustion at the application nodes.

Now looking at the unicorn error logs, we can see that the unicorn-worker-killer was doing its job. The rails application was exceeding the memory limit set. This means that either 1) the application really needed more memory so the maxmemory set should be increased or 2) some optimizations needed to be applied to the code.

I wasn’t very familiar with the application code to know exactly if I could improve some parts of the code right away. But I figure out that I needed to know the ballpark of the memory needed to run the application smoothly. So, I decided I would disable the unicorn-worker-killer for a while then monitor the unicorn workers process memory size to see which max memory size would be appropriate.

After a test deployment with capistrano with the previous changes, I went and look at the unicorn error logs again to see what was happening.

tail -f unicorn.stderr.log
E, [2013-11-04T21:47:34.396924 #31952] ERROR -- : reaped #<Process::Status: pid 32025 SIGKILL (signal 9)> worker=3
I, [2013-11-04T21:47:40.379604 #19377]  INFO -- : worker=3 ready
E, [2013-11-04T21:53:03.425294 #31952] ERROR -- : reaped #<Process::Status: pid 32030 SIGKILL (signal 9)> worker=4
I, [2013-11-04T21:53:07.548663 #19773]  INFO -- : worker=4 ready
E, [2013-11-04T21:53:31.131844 #31952] ERROR -- : reaped #<Process::Status: pid 32037 SIGKILL (signal 9)> worker=5

Even after disabling the gem unicorn-worker-killer, there were still some unicorn workers being “reaped”. Now, something to know here and the reason you should choose unicorn as your application server is that, with unicorn, there is a master process that create and manage child processes as workers that would handle requests. The master process would kill a worker that is processing a request that has timed out. That’s an awesome feature because you don’t want to bring down the whole system because of requests that timed out.

So, the previous logs are saying that some requests are timing out.

The immediate question that was in my head was which requests (URI) the unicorn workers were serving at that time. A look at the rails logs did not help because if the worker get killed, the whole process is gone, and nothing can stay behind to log which requests were being processed at that time. Now, something to learn is that there are two places that have timeout settings in our rails stack, 1) nginx proxy timeout while sending requests or getting replies from unicorn upstreams, 2) unicorn request timeout.

For nginx:

# /etc/nginx/conf.d/myapp.conf
proxy_connect_timeout 60;
proxy_read_timeout    60;
proxy_send_timeout    60;

For unicorn:

# config/unicorn.rb
# nuke workers after 60 seconds (the default)
timeout 60

So I quickly checked those settings to see if everything was set to the right value. A request should not be taking more than 60 seconds and timeout. The catch is that we couldn’t reproduce the 502 errors at will.

We are at a point where we know that the main problem is that some requests are being made and those requests are timing out, they eat a lot of resources affecting other requests too. But we are not sure which ones.

The next question is how to find which requests a unicorn worker is currently serving while timing out and getting killed by its master.

So I was like if I can write a wrapper around the lifecycle of a request while it’s being processed in rails, I could log the requests that are about to timeout. If you’ve read the guides on rails, you will realise that you can write a middleware that could be injected around the request as it’s being processed.

So, I came up with:

# config/initializers/log_before_timeout.rb
class LogBeforeTimeout
  def initialize(app)
    @app = app
  end

  def call(env)
    thr = Thread.new do
      sleep(UNICORN_TIMEOUT - 1) # set this to Unicorn timeout - 1
      unless Thread.current[:done]
        path = env["PATH_INFO"]
        qs = env["QUERY_STRING"] and path = "#{path}?#{qs}"
        Rails.logger.warn "#{path} I’m about to timeout bringing down my unicorn worker too :("
      end
    end
    @app.call(env)
  ensure
    thr[:done] = true
    thr.run
  end
end

Basically, what the above code does is that for every request that is being processed by the rails app:

it will hook into the current thread processing the request
it will wait until just before the max timeout set in unicorn (by default 60 seconds)
if the thread is still not done yet, meaning the request is still being processed, it will log the current path being processed with an identifier log message saying this request is about to timeout.

With that, by checking the rails log, we could easily spot all the requests that are causing the unicorn workers to get killed by their master.

After creating a file config/initializers/log_before_timeout.rb with the above code content, I deployed that code on our servers, and I was like, bring the nasty requests!!!

A cat/grep on the rails log reveals :

cat /var/log/myapp.log | grep "I’m about to timeout bringing down my unicorn worker too :("
Nov 28 22:57:55 server1 myapp[48500]: /api/v1/endpoints?page=1&perPage=1000 I’m about to timeout bringing down my unicorn worker too :(
Nov 28 23:14:20 server1 myapp[48514]: /api/v1/business?page=1&perPage=1000 I’m about to timeout bringing down my unicorn worker too :(
Nov 29 01:34:32 server1 myapp[48533]: /api/v1/endpoints?page=2&perPage=1000 I’m about to timeout bringing down my unicorn worker too :(
Nov 29 01:35:39 server1 myapp[48497]: /api/v1/business?page=2&perPage=1000 I’m about to timeout bringing down my unicorn worker too :(

As you can see, they were doing some nasty requests with a perPage of 1000 that the application couldn’t handle. These requests were timing out and hence eating a lot of system resources. Those were ultimately also affecting other requests that under normal circumstances would be very fast.

The solution was to put some limit of the parameters of the requests and optimize that part of the code.

Conclusion

Rob Pike said “By thinking about how that problem could happen, he’d intuit where the model was wrong or where our code must not be satisfying the model.” Even though I was not very familiar with the code, I was able to find the problem by looking at how those 502s errors could have happened.

Brian W. Kernighan said in his book The Elements of Programming Style that “Everyone knows that debugging is twice as hard as writing a program in the first place.”, so sometimes, it gets really difficult to debug some problems. “By thinking about how that problem could happen”, you will find that debugging becomes easier, writing program will then become even easier as you will learn what not to do, which kind of code to write in certain situations etc

Do you have any suggestion on how to improve debugging skills or how to become more productive as a software engineer? Please leave a comment below.