(this article originally appeared on Medium)
I have been using Application Insights on most of the projects I have been working on for the past couple of years. It is an impressive, comprehensive service that lets you monitor your application’s health and behavior, and also easily investigate issues by exploring the telemetry it gathers from your app. In this post, I will demonstrate how Application Insights can be used to perform the first steps one would typically follow when exploring a performance anomaly.
Looking at a real-life example
Among the essential metrics one should always monitor in order to have a basic understanding of a web app’s activity are the number of requests served and the average response time. I like to put those two on the same chart in order to correlate them (response time typically increases with throughput, but I will leave that for a future post!).
Imagine that you are checking your dashboard one morning and see the following:
It seems that we got a pretty nasty spike in response time (the green line), peaking at an average of almost 10 seconds around 10:38pm. That spike was probably too brief to trigger an alert, so it was most likely a transient issue. Still, it’s a good opportunity to explore how we can use Application Insights to dig deeper into what happened.
As we’re dealing with a performance issue, a good place to start is the “Performance” section that is available under the “Investigate” category.
Here we see pretty much the same information as in the previous chart: response times and request count.
Focusing on the event
Let’s zoom in to have a better sense of the actual duration of the issue. We just have to drag and drop the blue time sliders and center the time window around 10:30pm.
This shows that the issue lasted for around 5 minutes and also that the response time degradation was much worse than originally thought! That’s because we are looking at an average of the response time: a wide aggregation window dilutes a short spike with the normal traffic around it, so the average gets closer to the real peak as we reduce the window.
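The effect of the aggregation window on an average can be sketched in a few lines of Python. The sample data is made up for illustration (it is not the telemetry from this incident), and `average_by_bin` is a hypothetical helper, not something Application Insights exposes.

```python
from collections import defaultdict

def average_by_bin(samples, bin_seconds):
    """Average (timestamp_seconds, duration_seconds) samples per time bin."""
    totals = defaultdict(lambda: [0.0, 0])  # bin start -> [sum, count]
    for timestamp, duration in samples:
        bin_start = (timestamp // bin_seconds) * bin_seconds
        totals[bin_start][0] += duration
        totals[bin_start][1] += 1
    return {start: total / count for start, (total, count) in sorted(totals.items())}

# Made-up data: one request per minute for an hour, 0.2 s each,
# except during a 5-minute incident where each request takes 30 s.
samples = [(minute * 60, 30.0 if 30 <= minute < 35 else 0.2) for minute in range(60)]

hourly = average_by_bin(samples, 3600)        # a single coarse bin
per_5_minutes = average_by_bin(samples, 300)  # twelve finer bins

# The coarse bin dilutes the spike; the fine bins show its real magnitude.
print(max(hourly.values()), max(per_5_minutes.values()))
```

With the one-hour bin, the five slow minutes are averaged together with fifty-five quiet ones, so the peak looks far smaller than it does with five-minute bins.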
Investigating the underlying cause
Time to find out what caused the spike. From this same chart, we can display the average CPU load of our web app instead of the request count to see if there’s any correlation.
The CPU load did increase during the event, but to a very reasonable maximum of around 10%. So the CPU was definitely not our point of contention; its load probably increased as a side effect of the underlying root cause.
Just below this chart lies a list of “operations” — actually, all the requests issued to your web app within the time window — ordered by duration. This can help to find out if there’s any particular operation that caused the issue.
In our case, it seems that the problem was impacting all requests, pointing to a systemic issue rather than anything application-related. Also note that, once again, the average latency we saw earlier was, like any average, very optimistic! Some requests actually took more than a minute to complete…
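To see why an average can look so optimistic, here is a minimal Python sketch comparing the mean with a high percentile and the maximum. The durations are invented for the example and only mimic the shape of the situation described above (mostly fast requests, a few very slow ones).

```python
import statistics

# Made-up request durations in seconds: most are fast, a few are very slow.
durations = [0.2] * 90 + [5.0] * 8 + [65.0] * 2

mean = statistics.mean(durations)
# quantiles(n=100) returns the 99 cut points between the 100 percentile groups,
# so index 94 is the 95th percentile.
p95 = statistics.quantiles(durations, n=100)[94]
slowest = max(durations)

print(f"mean={mean:.2f}s p95={p95:.2f}s max={slowest:.2f}s")
```

The mean stays well below both the 95th percentile and the slowest request, which is why looking at the individual operations (or at percentiles) gives a more honest picture than the average alone.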
Correlating with dependencies
The next thing to look at is the dependencies, which are all the external services that our web app uses to serve its requests. That would typically include databases, caches, social platforms, notification dispatchers etc.
Bingo! We see a significant spike in dependency latency at the very same time, so the overall degradation in our web app’s response time is most likely coming from a dependency. But which one? Under that chart, we get the same kind of “operations” table, listing all dependency operations.
It seems that our slow dependency was the service behind the 10.1.0.6 IP address — a database in our case. But our web app interacts with half a dozen different dependencies, so we should verify that the database was the only one having problems. Time to unleash the full power of Application Insights’ Advanced Analytics!
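The check itself is conceptually simple: group the dependency calls by target and compare their average durations. Here is a minimal sketch in Python, assuming the calls have been exported as `(target, duration)` pairs. Only the 10.1.0.6 address comes from this post; the other target names, the durations and the threshold are invented for illustration, and this is not the query that Application Insights runs.

```python
from collections import defaultdict

def average_duration_by_target(calls):
    """Return the average duration in ms for each dependency target."""
    durations = defaultdict(list)
    for target, duration_ms in calls:
        durations[target].append(duration_ms)
    return {target: sum(values) / len(values) for target, values in durations.items()}

def slow_targets(calls, threshold_ms):
    """Return the sorted targets whose average duration exceeds threshold_ms."""
    averages = average_duration_by_target(calls)
    return sorted(target for target, average in averages.items() if average > threshold_ms)

# Made-up sample: only the 10.1.0.6 address appears in the article;
# the other targets and all durations are illustrative.
calls = [
    ("10.1.0.6", 42000), ("10.1.0.6", 61000), ("10.1.0.6", 38000),
    ("cache.example.internal", 3), ("cache.example.internal", 5),
    ("notifications.example.com", 120), ("notifications.example.com", 180),
]

print(slow_targets(calls, threshold_ms=1000))
```

If only one target ends up in the list, the other dependencies can be ruled out.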
Moving up a gear with Advanced Analytics
Under the hood, Application Insights is powered by a query engine code-named Kusto (Kusto is now also integrated with Azure Log Analytics, and the query language documentation is the best place to learn about it). Most of the charts and tables that you see on the Azure portal’s Application Insights pane are the results of Kusto queries. What’s great is that you can access those queries directly from the “Analytics” links available in most sections. In our example, we have a “View in Analytics” drop-down menu that points to the Analytics side of each chart.
Choosing “Trends: response time” leads us to the Advanced Analytics portal of Application Insights and shows us the exact query that got executed to create the previous chart:
And just below, we see the very same chart showing the spike in dependency latency:
Notice that purple dot on top of the highest spike? That’s Application Insights’ Smart Diagnostics kicking in. We just have to click on that dot to let Application Insights perform some clustering on the data and try to identify a pattern that would explain the spike. Here’s what we got in our case:
Without any manual guidance, the pattern it has found is a combination of dependency target (the database) and operation (creation of a cursor). And by looking at the dependency response time with and without that pattern (a chart that is automatically generated by Smart Diagnostics), we can further confirm the diagnosis:
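The idea behind that "with and without the pattern" comparison can be sketched in Python as follows. This is only an illustration of the comparison, not how Smart Diagnostics works internally; the record layout, the operation labels and the durations are all hypothetical.

```python
def average_with_and_without(records, matches_pattern):
    """Split records with a predicate; return (avg_matching_ms, avg_other_ms)."""
    def average(values):
        return sum(values) / len(values) if values else None

    matching = [r["duration_ms"] for r in records if matches_pattern(r)]
    other = [r["duration_ms"] for r in records if not matches_pattern(r)]
    return average(matching), average(other)

# Hypothetical dependency records; "create cursor" is an illustrative label
# for the cursor-creation operation mentioned above.
records = [
    {"target": "10.1.0.6", "operation": "create cursor", "duration_ms": 50000},
    {"target": "10.1.0.6", "operation": "create cursor", "duration_ms": 70000},
    {"target": "10.1.0.6", "operation": "read", "duration_ms": 20},
    {"target": "cache.example.internal", "operation": "get", "duration_ms": 4},
]

def is_cursor_creation_on_database(record):
    return record["target"] == "10.1.0.6" and record["operation"] == "create cursor"

with_pattern, without_pattern = average_with_and_without(records, is_cursor_creation_on_database)
print(with_pattern, without_pattern)
```

When the average for the matching records dwarfs the average for everything else, the pattern is a good explanation for the spike.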
From zero to full insights in minutes
I’ll stop this already long post here; obviously the next step would be to understand what happened with the database at that particular moment but that didn’t involve Application Insights in our case.
I’ve hopefully demonstrated how easy and straightforward it is to navigate through the various charts and analytics provided by Application Insights. The complete process usually takes a couple of minutes for someone who’s familiar with the UI and its different sections.
It’s also important to keep in mind that all the metrics used in this analysis are collected by default as soon as you add Application Insights to your web app (which is done with just a few clicks from Visual Studio).
I would love to hear your feedback about this real-life case study. Did you find it useful? Any interest in more posts in the same format? What about short videos? Just let me know!