Exploring performance issues with Azure Application Insights

(this article originally appeared on Medium)

I have been using Application Insights on most of the projects I have been working on for the past couple of years. It is an impressive, comprehensive service that lets you monitor your application’s health and behavior, and also easily investigate issues by exploring the telemetry it gathers from your app. In this post, I will demonstrate how Application Insights can be used to perform the first steps one would typically follow when exploring a performance anomaly.

Looking at a real life example

Among the essential metrics one should always monitor in order to have a basic understanding of a web app’s activity are the number of requests served and the average response time. I like to put those two on the same chart in order to correlate them (response time typically increases with throughput, but I will leave that for a future post!).

Imagine that you are checking your dashboard one morning and see the following:

spotting a performance problem

It seems that we got a pretty nasty spike in response time (the green line), peaking at an average of almost 10 seconds around 10:38pm. That spike was probably too brief to trigger an alert, so it was most likely a transient issue. Still, it’s a good use-case to explore how we can use Application Insights to dig deeper into what happened.

As we’re dealing with a performance issue, a good place to start is the “Performance” section that is available under the “Investigate” category.

an overview of the Performance section

Here we see pretty much the same information as in the previous chart: response times and request count.

Focusing on the event

Let’s zoom in to get a better sense of how long the issue actually lasted. We just have to drag the blue time sliders to center the time window around 10:30pm.

zooming on the issue

This shows that the issue lasted for around 5 minutes and also that the response time degradation was much worse than originally thought! That’s because the chart plots an average of the response time per time bucket: the narrower the aggregation window, the less a short spike gets diluted by the normal requests around it.
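To make that dilution effect concrete, here is a minimal Python sketch. It has nothing to do with how Application Insights computes its charts internally; the `bucket_averages` helper and the sample data are made up for illustration. It averages the same hypothetical requests over 30-minute and then 1-minute buckets.

```python
from statistics import mean

def bucket_averages(samples, bucket_seconds):
    """Average (timestamp_seconds, duration_seconds) samples per fixed-size time bucket."""
    buckets = {}
    for ts, duration in samples:
        start = (ts // bucket_seconds) * bucket_seconds
        buckets.setdefault(start, []).append(duration)
    return {start: mean(values) for start, values in sorted(buckets.items())}

# Hypothetical data: one request every 10 s for an hour, normally 0.2 s,
# but 30 s during a 5-minute incident (seconds 1800 to 2099).
samples = [(t, 30.0 if 1800 <= t < 2100 else 0.2) for t in range(0, 3600, 10)]

coarse = bucket_averages(samples, 1800)  # 30-minute buckets
fine = bucket_averages(samples, 60)      # 1-minute buckets

print(f"worst 30-minute average: {max(coarse.values()):.1f} s")
print(f"worst 1-minute average:  {max(fine.values()):.1f} s")
```

With this made-up data, the worst 30-minute bucket averages a bit over 5 seconds while the worst 1-minute bucket shows the full 30 seconds, which is the same effect we observed when zooming in on the chart.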

Investigating the underlying cause

Time to find out what caused the spike. From this same chart, we can display the average CPU load of our web app instead of the request count to see if there’s any correlation.

looking at the CPU load

The CPU load did increase during the event, but to a very reasonable maximum of around 10%. So the CPU was definitely not our contention point; its load probably increased as a side effect of the underlying root cause.

Just below this chart lies a list of “operations” — actually, all the requests issued to your web app within the time window — ordered by duration. This can help to find out if there’s any particular operation that caused the issue.

the list of operations

In our case, it seems that the problem was impacting all requests, pointing to a systemic issue rather than anything application-related. Also note that once again, the average latency we saw earlier was, like any average, very optimistic! Some requests actually took more than a minute to complete…
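A quick way to see why an average is so optimistic is to compare it with percentiles and the maximum. The sketch below uses only Python's standard library and invented durations (not the real telemetry from this incident):

```python
from statistics import mean, quantiles

# Hypothetical durations in seconds: 95 fast requests and 5 very slow ones.
durations = [0.3] * 95 + [70.0] * 5

average = mean(durations)
# quantiles(..., n=100) returns the 99 cut points between percentiles.
cuts = quantiles(durations, n=100, method="inclusive")
p50, p99 = cuts[49], cuts[98]
slowest = max(durations)

print(f"average: {average:.2f} s")
print(f"median:  {p50:.2f} s")
print(f"p99:     {p99:.2f} s")
print(f"slowest: {slowest:.2f} s")
```

Here the average sits below 4 seconds even though some requests took more than a minute, and the median does not move at all. Looking at high percentiles or at the slowest requests, as the operations list lets us do, tells a more honest story.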

Correlating with dependencies

The next thing to look at is the dependencies, which are all the external services that our web app uses to serve its requests. That would typically include databases, caches, social platforms, notification dispatchers etc.

looking at the dependencies

Bingo! We have a significant spike in dependency latency at the very same time, so the overall degradation in our web app’s response time is most likely coming from a dependency. But which one? Under that chart, we get the same kind of “operations” table, listing all dependency operations.

the list of dependency operations

It seems that our slow dependency was the service behind the 10.1.0.6 IP address — a database in our case. But our web app interacts with half a dozen different dependencies, so we should verify that the database was the only one having problems. Time to unleash the full power of Application Insights’ Advanced Analytics!
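The check we want to run boils down to grouping dependency calls by target and comparing their latencies. As a rough sketch of that idea in plain Python (the record layout, the field names `target` and `duration_ms`, and every target except 10.1.0.6 are assumptions made up for this example, not an export from Application Insights):

```python
from collections import defaultdict
from statistics import mean

def average_duration_by_target(calls):
    """Group dependency calls by target and average their durations (ms)."""
    grouped = defaultdict(list)
    for call in calls:
        grouped[call["target"]].append(call["duration_ms"])
    return {target: mean(values) for target, values in grouped.items()}

# Hypothetical dependency calls recorded during the incident window.
calls = [
    {"target": "10.1.0.6", "duration_ms": 42000},
    {"target": "10.1.0.6", "duration_ms": 38000},
    {"target": "cache.example.internal", "duration_ms": 3},
    {"target": "cache.example.internal", "duration_ms": 5},
    {"target": "notifications.example.internal", "duration_ms": 120},
]

averages = average_duration_by_target(calls)
# Slowest target first.
for target, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(f"{target}: {avg:.0f} ms")
```

If only one target stands out, as the database does in this invented data, the other dependencies can be ruled out.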

Moving up a gear with Advanced Analytics

Under the hood, Application Insights is backed by a very powerful query engine code-named Kusto (Kusto is now also integrated with Azure Log Analytics, and here is the best place to learn about the query language). Most of the charts and tables that you see on the Azure portal’s Application Insights pane are the results of Kusto queries. What’s great is that you can have direct access to those queries from the “Analytics” links available in most sections. In our example, we have a “View in Analytics” drop-down menu that points to the Analytics side of each chart.

opening the graph in advanced analytics

Choosing “Trends: response time” leads us to the Advanced Analytics portal of Application Insights and shows us the exact query that got executed to create the previous chart:

the original Kusto query

And just below, we see the very same chart showing the spike in dependency latency:

the same chart in advanced analytics

Notice that purple dot on top of the highest spike? That’s Application Insights’ Smart Diagnostics kicking in. We just have to click on that dot to let Application Insights perform some clustering on the data and try to identify a pattern that would explain the spike. Here’s what we got in our case:

highlighting a potential pattern

Without any manual guidance, the pattern it has found is a combination of dependency target (the database) and operation (creation of a cursor). And by looking at the dependency response times with and without that pattern (a chart that Smart Diagnostics generates automatically), we can further confirm the diagnosis:

confirming the impact of the pattern
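The "with and without the pattern" comparison is easy to reproduce by hand once a candidate pattern is known. The sketch below is not the clustering that Smart Diagnostics performs; it only illustrates the final comparison, using invented records, hypothetical field names, and a hypothetical operation name `create cursor`:

```python
from statistics import mean

def split_by_pattern(calls, pattern):
    """Split call durations into those matching every key/value in `pattern` and the rest."""
    matching, others = [], []
    for call in calls:
        is_match = all(call.get(key) == value for key, value in pattern.items())
        (matching if is_match else others).append(call["duration_ms"])
    return matching, others

# Hypothetical dependency calls recorded during the spike.
calls = [
    {"target": "10.1.0.6", "name": "create cursor", "duration_ms": 45000},
    {"target": "10.1.0.6", "name": "create cursor", "duration_ms": 39000},
    {"target": "10.1.0.6", "name": "ping", "duration_ms": 12},
    {"target": "cache.example.internal", "name": "get", "duration_ms": 4},
]

# Candidate pattern: a combination of dependency target and operation.
pattern = {"target": "10.1.0.6", "name": "create cursor"}
with_pattern, without_pattern = split_by_pattern(calls, pattern)

print(f"with pattern:    {mean(with_pattern):.0f} ms")
print(f"without pattern: {mean(without_pattern):.0f} ms")
```

When the calls matching the pattern account for nearly all of the extra latency and the remaining calls look normal, the pattern is a credible explanation for the spike.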

From zero to full insights in minutes

I’ll stop this already long post here; obviously the next step would be to understand what happened with the database at that particular moment but that didn’t involve Application Insights in our case.

I’ve hopefully demonstrated how easy and straightforward it is to navigate through the various charts and analytics provided by Application Insights. The complete process usually takes a couple of minutes for someone who’s familiar with the UI and its different sections.

It’s also important to keep in mind that all the metrics that have served this analysis are collected by default as soon as you add Application Insights to your web app (which is done with just a few clicks from Visual Studio).

I would love to hear your feedback about this real-life case study. Did you find it useful? Any interest in more posts in the same format? What about short videos? Just let me know!
