Getting in on the Ground Floor of the ‘New Observability’


By Ozan Unlu, CEO, Edge Delta

In fall 2022, Splunk issued its annual State of Observability Survey and Report, highlighting the increasingly critical role observability plays in enabling multi-cloud visibility and dramatically improving end-user experiences. Growth in observability – or the ability to measure a system’s current state based on the data it generates – has occurred in lockstep with cloud adoption and is viewed by many as an essential counterpart, helping IT teams overcome more complex and extensive monitoring challenges.

According to Splunk’s report, the benefits of observability are very real: organizations that are observability leaders (two years or more with an observability practice) report a 69 percent better mean time to resolution for unplanned downtime or performance degradation; and leaders’ average annual cost of downtime associated with business-critical applications is $2.5 million, versus $23.8 million for the beginner group (one year or less).

So where exactly do government IT organizations fall? Solidly in the beginner category, with 78 percent of government IT teams registering as beginners (versus an average of 59 percent across other industries). In fact, no government IT team surveyed registered as a leader.

While it may seem that government IT is trailing in observability, there’s actually tremendous upside to be had here. Observability as a discipline is undergoing a dramatic shift – driven largely by rapidly growing data volumes – that nascent observability practices are ideally positioned to leverage. This shift entails:

  • Moving from ‘Big Data’ to ‘Small Data’ – Traditional observability approaches pool all data in a central repository for analysis, on the theory that collecting and correlating every dataset is the key to quickly determining root causes. The problem is that all of this data typically lands in hot storage tiers, which are exceedingly expensive – especially as data volumes explode due to the cloud and microservices. An IT team unknowingly exceeding a data limit and getting hit with a huge, unexpected bill is far more common than one would expect. To avoid this – and knowing that the vast majority of data is never used – an organization may begin indiscriminately discarding certain datasets, but problems can lurk anywhere, and random discarding introduces significant blind spots. The solution is not to discard randomly, but to inspect all data in parallel, in smaller, bite-sized chunks.
  • Analyzing Data in Real Time – Besides cost, another drawback of the “centralize and analyze” approach is that ingesting and indexing all this data takes time – time an organization may not have when a mission-critical system is down. Despite great technology advances, recent outage analyses have found that the overall costs and consequences of downtime are worsening, likely due in part to an inability to harness, access and manipulate all this data in milliseconds. Many data pipelines simply can’t keep up with the pace and volume of data – and the more data there is to ingest, the slower these pipelines become. They also do little to help IT teams understand their datasets and determine what is worth indexing, leading to overstuffed central repositories that take far longer to return queries and further elongate mean time to repair (MTTR). With the average cost of downtime estimated at $9,000 per minute (over $500,000 per hour), a far better approach is to analyze data for anomalies in real time.
  • Pushing Data Analysis Upstream – Analyzing data in real time helps IT teams not just detect anomalies faster, but also immediately identify the root cause, based on which systems and applications are throwing the errors. Furthermore, when data is analyzed at the source, data limits in a central repository become a non-issue. For organizations that want to keep a central repository, high-volume, noisy datasets can be converted into lightweight KPIs that are baselined over time, making it much easier to tell when something is anomalous – a good sign that the data is worth indexing (a minimal sketch of this approach follows this list). Some organizations find they don’t actually need a central repository at all.
  • Keeping All Your Data and Making it Accessible – As noted above, by pushing analytics upstream in a distributed model, an organization can keep an eye on all of its data, even if that data is not ultimately relegated to a high-cost storage tier. There will still be times when access to all of this data is needed, and it should be accessible – either in a streamlined central repository or in cold storage. In line with the DevOps principle of self-service, developers should have access to all of their datasets regardless of the storage tier they’re in, and they should be able to retrieve them easily, without having to ask operations team members, who often serve as gatekeepers in the central repository model.
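To make the “push analysis upstream” idea concrete, the sketch below shows one way an agent running near a data source could roll raw log lines into a lightweight, baselined KPI and flag anomalous windows in real time. It is a minimal illustration only – the window size, log format, error pattern and three-sigma threshold are all assumptions chosen for the example, not a description of any particular vendor’s implementation.

"""
Minimal sketch of upstream log analysis: instead of shipping every raw
line to a central repository, an agent near the source rolls noisy logs
into a small per-window KPI (error rate), baselines that KPI over time,
and flags windows that deviate from the baseline. Window size, log
format and the 3-sigma threshold are illustrative assumptions.
"""
import math
import re
from collections import deque
from typing import Iterable, Iterator

ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL)\b")  # assumed log format
WINDOW_SIZE = 500          # log lines per "bite-sized chunk"
BASELINE_WINDOWS = 60      # how many past windows form the rolling baseline
SIGMA_THRESHOLD = 3.0      # flag windows more than 3 std devs above the mean


def window_error_rate(lines: Iterable[str]) -> float:
    """Collapse a chunk of raw log lines into one lightweight KPI."""
    lines = list(lines)
    if not lines:
        return 0.0
    errors = sum(1 for line in lines if ERROR_PATTERN.search(line))
    return errors / len(lines)


def chunked(lines: Iterable[str], size: int = WINDOW_SIZE) -> Iterator[list[str]]:
    """Group a log stream into small, fixed-size chunks."""
    batch: list[str] = []
    for line in lines:
        batch.append(line)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def detect_anomalies(chunks: Iterable[Iterable[str]]) -> Iterator[tuple[int, float, bool]]:
    """Yield (window index, error rate, is_anomalous) for each chunk.

    The baseline is a rolling mean/std of recent windows, so the KPI
    adapts as normal behavior drifts over time.
    """
    history: deque[float] = deque(maxlen=BASELINE_WINDOWS)
    for i, chunk in enumerate(chunks):
        rate = window_error_rate(chunk)
        anomalous = False
        if len(history) >= 10:  # need some history before judging
            mean = sum(history) / len(history)
            std = math.sqrt(sum((x - mean) ** 2 for x in history) / len(history))
            anomalous = rate > mean + SIGMA_THRESHOLD * max(std, 1e-6)
        history.append(rate)
        yield i, rate, anomalous


if __name__ == "__main__":
    import sys
    # e.g. tail -F app.log | python edge_kpi_sketch.py
    for idx, rate, flag in detect_anomalies(chunked(sys.stdin)):
        status = "ANOMALY" if flag else "ok"
        print(f"window={idx} error_rate={rate:.3f} {status}")

Under a model like this, only the per-window KPIs – and perhaps the raw lines from flagged windows – need to travel to a hot storage tier, while everything else can be routed to low-cost cold storage and retrieved on demand.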

There’s little question that government IT teams trail other industries when it comes to observability. This is reflected in comparatively slow cloud adoption (on average, government IT teams report that 24 percent of internally developed applications are cloud-based, versus an average of 32 percent across other industries) and, perhaps more importantly, in a lack of confidence: only 22 percent of government IT teams reported feeling completely confident in their ability to meet application availability and performance SLAs, compared to 48 percent of respondents across other industries. The good news is that, with relatively few investments already made, government IT teams are in a better position to capitalize on more modern, cost-efficient and agile observability strategies – helping them increase cloud adoption while building confidence.