(this article originally appeared on Medium)
If you have ever had to deal with issues in production systems, you know how difficult the situation is. There’s a lot of pressure to fix things as soon as possible, and because those issues are by nature unexpected, they tend to strike when we would rather be doing something else (like sleeping!).
Keep calm and trust only your data
There are some best practices to follow and good reflexes to have in those stressful situations. Mark Simms gave an excellent talk on that topic at Build 2017 and I can’t recommend it enough. Mark’s main tips are:
- resist the urge to rush, keep your head cool and “slow down to move fast”
- trust data, not your intuition (that is, if you do have telemetry data to work on!)
Those are excellent pieces of advice, but even when you manage to keep calm and have valuable data at hand, it can still be challenging to decide where to start and how to proceed. Data can be overwhelming and give you wrong hints about the issue. Sometimes you may even come across data that reveals a totally different issue, and it’s tempting to explore that path and lose focus on the most urgent problem.
To handle those situations efficiently, I apply a technique I’ve nicknamed “walking backwards”. I’m not claiming that this is my invention; it’s nothing but common sense, but I thought I would explain its rationale and illustrate it with an example. Here it goes.
Walking backwards along the positive path
We usually start the exploration of an issue with “the thing that doesn’t work” and try to find data that would explain why it doesn’t work. What I do instead, starting from “that thing that doesn’t work”, is to ask myself “what would make it work?”. Or starting from “something that didn’t happen”, “what would make it happen?”.
The logic behind this is that, in the complex systems we’re dealing with, the root cause of a problem often resides many layers away from the visible symptoms. How many times have you wondered “How can that be broken? There’s no reason for it not to work”? That’s usually because reasoning about what could directly explain the symptoms leads to a dead end, and so we start looking at random data, hoping to make sense of the issue by accident. Following what I would call the “positive path”, that is, the backwards chain of events that would not have produced the issue we’re investigating, instead sets a trail we can walk until we eventually find the root cause.
Let me illustrate this technique with a real-life example. I’m currently working on the back-end systems of Keakr, a France-based social network app for urban music lovers. Keakr users can upload videos and share them with their friends and followers.
In the context of a real production issue
A couple of months ago, some users suddenly started to report that videos were loading slowly. Now before we start the exploration of that issue, let me briefly explain how videos are processed in Keakr:
Every time a video is created (1), the frontline web servers store the raw mp4 file coming from the app in blob storage (2), then dispatch a request (3) that is received asynchronously by background workers, which transcode the video into a streaming format using Azure Media Services (4). Because transcoding is not a quick operation, and because there may be a queue of videos waiting to be transcoded, we serve the raw mp4 until the video is ready to be streamed.
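The fallback described above can be sketched in a few lines of Python. This is only an illustration, not the actual Keakr code: the `Video` class, its field names and the example URLs are assumptions made up for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Video:
    # Hypothetical fields; the real schema is not shown in this article.
    raw_mp4_url: str
    streaming_url: Optional[str] = None  # set once transcoding has completed


def playback_url(video: Video) -> str:
    """Prefer the streaming format, fall back to the raw mp4 until it exists."""
    return video.streaming_url or video.raw_mp4_url


fresh = Video(raw_mp4_url="https://example.invalid/raw/42.mp4")
print(playback_url(fresh))  # the raw mp4, because no streaming URL is set yet
```

The consequence matters for what follows: a video that never gets transcoded keeps being served as a raw mp4, so nothing visibly breaks, it just loads slowly.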
Now back to our problem. Follow me as we apply the “walking backwards” technique, starting from the reported symptoms:
Some videos are slow to load
What makes videos fast to load?
- Serving them in streaming format. A quick check in the database revealed that many recent videos had not been transcoded.
Transcoding of videos fails
What makes transcoding of videos succeed?
- Successful operation of the background workers. Looking at our logs and metrics, the workers seemed to work fine, processing other requests without any problem.
- The reception and execution of transcoding requests. Because the workers store incoming requests in cold storage for auditing, we could check this, and we found that no such requests had been stored recently.
Transcoding requests don’t arrive at the workers
What makes those requests arrive?
- Successful operation of the message queue. As stated previously, other requests were processed by the workers so the message queue was running fine.
- The logical deduction was that the requests were not being fed into the message queue in the first place.
Transcoding requests are not issued
What makes the requests get issued?
- Successful execution of the “create video” HTTP request handler on the web server. That request handler (1) stores the raw mp4 in blob storage and some metadata in the database, (2) dispatches a push notification to the user and finally (3) sends the transcoding request to the message queue. Looking at the blob storage and database, we knew that step (1) had completed, so we dug into the dispatch of push notifications… and realized that the certificate we were using to interface with APNS (Apple’s push notification system) had expired! This led to an uncaught exception, which stopped the execution at that point and prevented the transcoding requests from being issued.
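To make that failure mode concrete, here is a minimal Python sketch of a handler shaped like the one described above. It is not the real handler: every function name and the `ExpiredCertificateError` exception are hypothetical, and the blob storage, push service and message queue are replaced by plain lists and a function that always raises.

```python
class ExpiredCertificateError(Exception):
    """Hypothetical stand-in for the error raised on an expired certificate."""


def store_video(video_id, stored):
    # Step (1): stands in for blob storage plus the database record.
    stored.append(video_id)


def send_push_notification(video_id):
    # Step (2): simulates the push call failing on an expired certificate.
    raise ExpiredCertificateError("push certificate has expired")


def enqueue_transcoding(video_id, queue):
    # Step (3): stands in for sending the transcoding request.
    queue.append(video_id)


def create_video(video_id, stored, queue):
    store_video(video_id, stored)
    send_push_notification(video_id)  # raises, so the next line never runs
    enqueue_transcoding(video_id, queue)


stored, queue = [], []
try:
    create_video("v1", stored, queue)
except ExpiredCertificateError as exc:
    print("handler aborted:", exc)

print(stored)  # ['v1'] -> step (1) completed
print(queue)   # []     -> step (3) was never reached
```

The sketch reproduces what the data showed: the video is stored, yet no transcoding request ever reaches the queue, because an unrelated step in between raised.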
From the symptoms to a very unexpected root cause
Now take a moment to consider the root cause and its visible symptoms: videos were loading slowly because a push notification certificate had expired. There is just no way that we would have spontaneously associated those two things.
We could have spent an awful lot of time trying to investigate the CDN or the video transcoding pipeline, sending test videos to Azure Media Services and eventually finding out that this part was working fine. It is only by applying a rather simple technique that we were able to guide our analysis down to the source of the problem in as little time as possible.
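For readers who think better in code, the walk above can be sketched as a small tree of conditions, where each condition lists what it needs in order to hold. This is only a way of picturing the reasoning, not a tool we actually used: the `Condition` class and `walk_backwards` function are invented for this sketch, and the `holds` values are hard-coded from the findings described earlier rather than computed from real telemetry.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Condition:
    """Something that must hold for its parent condition to hold."""
    name: str
    holds: bool  # what the data told us (hard-coded in this sketch)
    requires: List["Condition"] = field(default_factory=list)


def walk_backwards(condition: Condition,
                   trail: Optional[List[str]] = None) -> List[str]:
    """Follow failing prerequisites until none of them is failing."""
    trail = (trail or []) + [condition.name]
    for prerequisite in condition.requires:
        if not prerequisite.holds:
            return walk_backwards(prerequisite, trail)
    return trail  # the last entry is the deepest failing condition found


push = Condition("push notification is dispatched", holds=False)
stored = Condition("raw mp4 and metadata are stored", holds=True)
issued = Condition("transcoding request is issued", False, [stored, push])
queue_ok = Condition("message queue is running", holds=True)
arrives = Condition("transcoding request arrives at the workers", False,
                    [queue_ok, issued])
workers_ok = Condition("background workers are running", holds=True)
transcoded = Condition("video is transcoded", False, [workers_ok, arrives])
fast = Condition("video loads fast", False, [transcoded])

for step in walk_backwards(fast):
    print(step)
# The last line printed is "push notification is dispatched".
```

Healthy conditions (the workers, the message queue, the stored mp4) are checked and set aside, so the trail only ever follows what is actually broken.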
Are you following similar methods when troubleshooting production systems? Maybe variants of what I’ve described, or some totally different approach? Please share your thoughts and suggestions in the comments!