With the emergence of hybrid work, poor user experience has become far too common: users complaining about slow applications, network outages, and machine crashes are now an everyday occurrence. But the vast majority of these issues are assumed to be resolved simply because they go away, not because their root cause was ever found.
Root cause analysis requires data—and lots of it—captured in a manner that is time-aligned, contextual, and broad enough to identify (or rule out) potential culprits. The problem is that this data is not easy to gather and analyze using traditional monitoring techniques. Moreover, it doesn’t help that performance issues affect users and applications that can be anywhere, making it all the more difficult to capture the right data from the right places.
In my former life as a Gartner analyst, the main challenge my clients faced wasn’t a lack of performance data; it was the inability to use multiple silos of uncorrelated performance data to actually fix an issue. Siloed monitoring toolsets that focused on one area were being used to shift blame from one team to another, such as app teams pointing to network teams, which in turn would point to security or end user compute. And that was if toolsets existed at all. In some client environments, there was a complete lack of visibility.
Figure 1: Digital experience monitoring requires diverse performance telemetry data collected, correlated, and visualized for actionable insights across all users
Any comprehensive diagnostic exercise captures both time series and event data across three main areas of potential cause: the application, the network, and the endpoint device. This approach secures enough evidence to confidently point to where the problem lies so that it can be solved.
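To make "time-aligned" concrete, here is a minimal Python sketch that groups time series samples and events from the three areas into shared time buckets so they can be read side by side. It is an illustration only: the Sample and Event types, the align function, and the five-minute bucket size are all hypothetical names and choices of mine, not part of ZDX or any other product.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    """One time series reading (hypothetical schema)."""
    ts: int        # epoch seconds
    source: str    # "application", "network", or "endpoint"
    metric: str
    value: float


@dataclass
class Event:
    """One discrete event, e.g. a device attribute change (hypothetical schema)."""
    ts: int
    source: str
    description: str


def align(samples, events, bucket_s=300):
    """Group samples and events into shared buckets of bucket_s seconds.

    Returns a dict keyed by bucket start time, in ascending order.
    """
    buckets = defaultdict(lambda: {"samples": [], "events": []})
    for s in samples:
        buckets[s.ts - s.ts % bucket_s]["samples"].append(s)
    for e in events:
        buckets[e.ts - e.ts % bucket_s]["events"].append(e)
    return dict(sorted(buckets.items()))
```

With the data bucketed this way, a dip in an application metric and a device event from the same five-minute window land in the same entry, which is the property the rest of the analysis relies on.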
As we know, data can be messy, and problems rarely come with a smoking gun. For example, a garbled Microsoft Teams call or a sluggish application can have any number of underlying causes. DEM solutions were built to help with this very problem, but they need to capture a true end user experience and also scan across all of the potential underlying causes, such as the endpoint, network, application, and security, to get to a root cause.
As seen in Figure 2, starting with an objective end user experience measure (slow page load, poor call quality) is key. Then it comes down to correlating poor user experience with the various potential underlying causes.
Figure 2: Root cause analysis requires a breadth of data points to successfully interpret the signal from the noise
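One simple way to picture the correlation step is to rank each candidate metric by how strongly it moves with the experience score. The sketch below uses a plain Pearson correlation for this. It is my own simplification and not how ZDX computes anything; the function names and the sample values are hypothetical, and a strong correlation only tells you where to look next, not what the cause is.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series.

    Returns 0.0 when either series is flat, since a flat series
    cannot explain dips in the other one.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_candidates(score, candidates):
    """Sort candidate series by absolute correlation with the score."""
    ranked = [(name, pearson(score, series)) for name, series in candidates.items()]
    return sorted(ranked, key=lambda item: abs(item[1]), reverse=True)


# Made-up values for illustration only, not measurements from any real incident.
experience_score = [90, 40, 88, 35, 91]
candidates = {
    "dns_ms": [12, 12, 12, 12, 12],       # flat series
    "gateway_lost": [0, 1, 0, 1, 0],      # 1 = next hop missing in that interval
}
ranking = rank_candidates(experience_score, candidates)
```

In this toy data the flat DNS series scores zero and drops to the bottom of the ranking, while the series that tracks the dips rises to the top.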
As an example, a few weeks ago, a Zscaler employee suddenly started experiencing severe performance degradation affecting all apps, but most noticeably Zoom. Because Zoom is a real-time application, fluctuations in connectivity are especially noticeable. An examination in ZDX confirmed the Zoom issue: its ZDX score had dropped, showing a number of dips over the course of the day.
Figure 3: Zoom showing performance drops
Step one was to look at the server response and DNS resolution times to see if they could be correlated with those dips. Both were flat, so this wasn’t the issue.
Figure 4: Server response times and DNS response times were flat
The next step was to look at end-to-end network latency, and while it was slightly bursty, total latency was under 25 ms and likely not the root cause of the problem.
Figure 5: Latency was relatively flat and under 25 ms the entire time
Finally, it was time to look at the end user device itself. The health metrics on the device looked fine, with CPU, memory, and disk utilization all within acceptable limits.
Figure 6: Device health, in terms of CPU, memory, and disk usage, was fine
While the device metrics looked fine, the user’s device events, which record changes in the device’s attributes, told a different story. They showed that the Gateway_MAC_Address was flipping between a valid and a null value. A null value means that the device has temporarily lost its connection to its next hop. Since this chain of events indicated a layer 2 issue between the endpoint and the gateway, the user rebooted their gateway (which didn’t help) and finally replaced the gateway device, which resolved the problem.
Figure 7: Device events highlight changes in the device’s attributes, showing a layer 2 issue between the endpoint and the gateway.
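The pattern in the device events can be described as an attribute "flapping" between a present and a missing value. As a rough sketch of how such a pattern might be flagged automatically, the Python below counts those transitions in a sequence of observed values. The function names and the threshold of three are assumptions of mine for illustration, and the MAC address is a placeholder from the range reserved for documentation; none of this reflects how ZDX records or evaluates events.

```python
def count_flaps(values):
    """Count transitions between a present value and a missing one.

    A value counts as missing when it is None or an empty string.
    """
    flaps = 0
    prev_present = None
    for value in values:
        present = bool(value)
        if prev_present is not None and present != prev_present:
            flaps += 1
        prev_present = present
    return flaps


def is_flapping(values, min_flaps=3):
    """Flag a series that toggles at least min_flaps times (hypothetical threshold)."""
    return count_flaps(values) >= min_flaps


# Placeholder observations of a gateway MAC attribute, oldest first.
mac = "00:00:5e:00:53:01"
observed = [mac, None, mac, None, mac]
```

A stable attribute yields zero flaps, while the sequence above toggles four times and would be flagged, pointing attention at the link between the endpoint and its gateway.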
Finding the root cause of performance issues comes down to having all the right data in all the right places.
For more information on how Zscaler and ZDX provide this level of visibility to enable root cause analysis, see here.
(Figure 2 inspired by Gartner analyst Greg Murray’s excellent work found here—Gartner subscription required.)