The Future of Observability
How has observability changed in recent years, and what comes next?
In recent years, businesses have become increasingly reliant on observability to manage and maintain complex systems and infrastructure. As systems become even more complex, observability must evolve to keep pace with changing demands. The big question for 2023: what next for observability?
The proliferation of microservices and distributed systems has made it more difficult to understand real-time system behavior, which is critical to troubleshooting problems. Recently, more businesses have addressed this problem with automated monitoring of distributed architectures, deep-dive tracking, and real-time observability.
However, each decade has brought a sea change in how observability is expected to function. The last three decades have seen transformation after transformation: from on-premises to cloud to, now, cloud-native. With each generation have come new problems to solve, opening the door for new companies to form:
- The on-premises era led to a few companies like SolarWinds, BMC, and CA Technologies.
- The cloud era (i.e., where AWS came in) led to a market shakeup, with new companies like Datadog, New Relic, Sumo Logic, Dynatrace, AppDynamics, and more.
- Cloud-native era (starting in 2019–20) has resulted in another market shakeup.
Why is observability changing in 2023 and beyond?
The main reason for the current shakeup is that businesses are building software with entirely different technology than they used in 2010. Rather than monolithic architectures, they use microservices, Kubernetes, and distributed architecture.
There are a number of reasons why this is the case:
- Better security 🔐
- Easy scalability 📈
- More efficiency for distributed teams 👬
However, there are challenges as well. According to Gartner, more than 95% of new digital workloads will be deployed on cloud-native platforms by 2025. Because cloud-native systems generate much more data than previous generations of technology, hosting and scaling that data becomes more challenging. This presents three major problems.
Prohibitive costs
The first problem is relatively straightforward: cost. All legacy observability companies have become so expensive that most startups and medium businesses can’t afford them. As a result, they’re using old technology to host and process their data — technology that can’t respond to startups’ needs in 2023.
Evolving priorities in observability
Additionally, as the capabilities of observability have become more advanced, the KPIs and OKRs that dev and ops teams track have evolved.
Before, the primary focus was on ensuring applications and infrastructure didn’t crash. Now, dev and ops teams are operating at a deeper level, prioritizing:
- Request latency
- Saturation
- Scalability
- Traffic maps for where usage is happening
- Optimizing and predicting future outcomes
- How new code changes cloud usage
In a sentence, dev and ops teams have become more proactive than reactive. This requires technology that can keep up.
Changing expectations for observability
Finally, the rise of microservices architecture changes how IT teams observe application changes. One microservice can run across a hundred machines, and a hundred small services can run in one machine. There’s no “one-size-fits-all” approach. Dev and ops teams need deeper analysis to understand what is happening across their infrastructure.
What will the new generation of observability tools need in 2023?
These are the challenges. So how should the new generation of observability tools respond in 2023? From my perspective, here are eight things we will need to win the market.
Note: I’m looking at a 30,000-foot view of a vast market. It’s unlikely that a single company will do all these things. But these are the needs, and it’s going to require new companies, technologies, and platforms to meet them all.
1. Unified observability
All the legacy companies say they’re a unified observability platform. What this really means is that they have different tabs for metrics, logs, traces, etc., accessible from their platform.
This doesn’t actually solve the problem. What dev and ops teams need is one place from which to view all this data in a single timeline. Only then will they be able to trace correlations, determine the root causes of issues, and solve them quickly.
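To make the "single timeline" idea concrete, here is a minimal sketch in Python. It assumes each signal type (metrics, logs, traces) arrives as its own stream, already sorted by timestamp. The `Event` type, its field names, and the sample values are hypothetical and not taken from any vendor's data model.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    ts: float    # seconds since some shared clock origin (assumed)
    source: str  # "metric", "log", or "trace"
    detail: str


def unified_timeline(*streams):
    """Merge per-signal streams (each already sorted by ts) into one timeline."""
    return list(heapq.merge(*streams, key=lambda e: e.ts))


# Hypothetical sample data for illustration only.
metrics = [Event(100.0, "metric", "cpu=95%"), Event(130.0, "metric", "cpu=40%")]
logs = [Event(101.5, "log", "ERROR db connection pool exhausted")]
traces = [Event(99.0, "trace", "GET /checkout took 4.2s")]

timeline = unified_timeline(metrics, logs, traces)
for e in timeline:
    print(f"{e.ts:>7.1f}  {e.source:<6}  {e.detail}")
```

Reading the merged output top to bottom shows the slow trace, the CPU spike, and the error log next to each other, which is the kind of correlation separate tabs make hard to spot.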
2. Integrated observability & business data
As Bogomil from Sequoia mentioned in this blog, most businesses don’t correlate their observability and business data. This is a problem because there are powerful insights to be gained from analyzing the two side by side.
For example, Amazon recently tracked that if their website slows by one extra second, they lose millions of dollars daily. This can be huge for eCommerce businesses, especially if they track a slowdown in orders — it could be due to poor application performance. The faster they fix the application, the more orders they receive, and the more revenue they earn.
The same goes for software companies. If the application is fast, this improves its usability, which improves user experience, which impacts a number of business metrics. Only by integrating these two sets of data can businesses start to make these connections to improve the bottom line.
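As a rough illustration of analyzing the two data sets side by side, the sketch below computes a Pearson correlation between request latency and orders per interval. The numbers are made up for the example, and the `pearson` helper is written out by hand so the snippet needs no extra libraries. A strong negative value would only hint that slow periods coincide with fewer orders; it would not prove cause.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / (spread_x * spread_y)


# Hypothetical hourly samples: p95 latency (ms) and completed orders.
latency_ms = [120, 150, 400, 900, 130]
orders = [510, 495, 380, 210, 505]

r = pearson(latency_ms, orders)
print(f"latency vs. orders correlation: {r:.3f}")
```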
3. Vendor-agnostic (OTel)
Companies are looking for a solution that doesn’t lock them in to one vendor. That’s why many tech companies are contributing to OpenTelemetry (OTel) and making it the go-to standard for data collection agents. OTel has many benefits, like interoperability, flexibility, and improved performance monitoring.
4. Predictive observability
In the AI era, everything is moving to become a human-less experience. This can enable systems to do the things that humans simply cannot, like predicting errors before they even happen via machine learning.
This is not common in observability right now, and there is a major need for more innovation. By adding an AI layer to observability platforms, businesses can predict issues before they happen, and solve them before the user or customer even knows that something is wrong.
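A real predictive layer would likely use trained models, but the core idea can be sketched with something much simpler. The example below swaps in a plain least-squares linear trend (not machine learning) to estimate how long until a metric crosses a threshold. The function name, the sample disk-usage readings, and the 90% threshold are all assumptions for illustration.

```python
def seconds_until_threshold(samples, threshold):
    """Estimate seconds after the last sample until a linear trend hits threshold.

    samples: list of (t_seconds, value) pairs.
    Returns None if there is too little data or the trend is not rising.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / denom
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    t_hit = (threshold - intercept) / slope
    return max(0.0, t_hit - samples[-1][0])


# Hypothetical disk usage (%) sampled once a minute.
disk_usage = [(0, 70.0), (60, 72.0), (120, 74.0), (180, 76.0)]
eta = seconds_until_threshold(disk_usage, threshold=90.0)
print(f"estimated seconds until 90% full: {eta:.0f}")
```

An alert on the estimate ("disk will fill within the hour") fires before the outage, while an alert on the current value fires only once the problem has arrived.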
5. Predictive security in observability
Observability and security work very closely together. Most observability companies are moving into security because they control all the data collected from applications and infrastructure.
By reading metrics, logs, and traces, specifically those that demonstrate unusual behavior, AI should be able to understand security threats. Most SIEM and XDR tools don’t do this. And even if they do, they rely on a rule-based model rather than analyzing and learning from behaviors.
6. Cost optimization
Perhaps the biggest challenge in observability is cost. Although cloud storage is getting cheaper and cheaper, most observability companies aren’t lowering their prices to match. Customers get the short end of the stick, mainly because there are no alternatives.
OpenTelemetry collects over 200 data points every second; however, we don’t need all of them. So rather than charge users for storage they don’t need, observability platforms should collect and store only the useful data points and delete the rest. This can reduce the cost of storing and processing data.
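What counts as "useful" is a judgment call for each team. One simple interpretation, sketched below, is a deadband filter that keeps a data point only when it differs meaningfully from the last one stored. The function name, the threshold, and the sample values are hypothetical. This is just one possible reduction strategy, not how any particular platform does it.

```python
def deadband_filter(points, min_change):
    """Keep a point only if its value moved at least min_change since the last kept point."""
    kept = []
    for ts, value in points:
        if not kept or abs(value - kept[-1][1]) >= min_change:
            kept.append((ts, value))
    return kept


# Hypothetical (timestamp, cpu_percent) readings.
raw = [(0, 50.0), (1, 50.1), (2, 50.2), (3, 58.0), (4, 58.1), (5, 49.0)]
kept = deadband_filter(raw, min_change=5.0)

print(f"stored {len(kept)} of {len(raw)} points: {kept}")
```

Here half of the readings are dropped, yet the stored series still captures both the jump and the later drop.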
7. Correlation to causation analysis
Most legacy observability platforms give basic information about what’s happening in the cloud or application. However, many times the inciting event takes place hours or even days before. As such, it’s important to monitor CI/CD pipelines to see when code gets pushed, as well as which regulation or request starts to create the problem.
Let’s say there’s one network socket that’s slow, and it starts to clog requests. As a result, your backend starts to slow, which then produces an error. Then the front end slows, producing another error. Then the application crashes. You may only notice the front end slowing down, and think that caused the application crash. But in reality, the problem started elsewhere.
In a distributed architecture, this root cause analysis takes more time than in a monolith. Observability platforms need to adapt to this new reality.
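The slow-socket scenario above can be sketched as a small dependency walk. Given which services are unhealthy and what each one depends on, a probable root cause is an unhealthy service whose own dependencies are all healthy. The service names, timestamps, and the `probable_root_causes` helper are hypothetical. Real systems would derive the graph from traces and would have to cope with cycles and missing data.

```python
def probable_root_causes(depends_on, first_anomaly):
    """Return unhealthy services with no unhealthy dependencies, earliest anomaly first.

    depends_on: service -> list of services it calls.
    first_anomaly: service -> timestamp of its first anomaly (unhealthy services only).
    """
    candidates = [
        svc
        for svc in first_anomaly
        if not any(dep in first_anomaly for dep in depends_on.get(svc, []))
    ]
    return sorted(candidates, key=first_anomaly.get)


# Hypothetical topology matching the example: frontend -> backend -> network socket.
depends_on = {
    "frontend": ["backend"],
    "backend": ["network-socket"],
    "network-socket": [],
}
first_anomaly = {"frontend": 1030, "backend": 1010, "network-socket": 1000}

roots = probable_root_causes(depends_on, first_anomaly)
print("probable root cause(s):", roots)
```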
8. AI-based alerts
Alert fatigue is a real challenge. When developers receive so many alerts that they mute email threads or Slack channels, this hides issues and slows down time to resolution.
AI-based alert systems, by contrast, can predict which alerts are essential and which are not. AI can also provide context, and even suggest possible solutions.
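Even before any AI is involved, much of the noise can be cut by collapsing duplicates and ranking what remains. The sketch below is a plain heuristic, not a learned model. It groups identical alerts, then orders them by severity and frequency. The severity weights and sample alerts are assumptions for illustration.

```python
from collections import Counter

# Assumed weights; a learned model would replace this hand-tuned table.
SEVERITY_WEIGHT = {"critical": 3, "warning": 2, "info": 1}


def rank_alerts(alerts):
    """Collapse duplicate alerts and rank by severity, then by how often they fired.

    alerts: list of (service, message, severity) tuples.
    Returns a list of ((service, message, severity), count) pairs.
    """
    counts = Counter(alerts)
    return sorted(
        counts.items(),
        key=lambda item: (-SEVERITY_WEIGHT[item[0][2]], -item[1]),
    )


# Hypothetical alert stream.
alerts = (
    [("api", "high latency", "warning")] * 3
    + [("db", "disk full", "critical")]
    + [("web", "deploy finished", "info")] * 2
)

ranked = rank_alerts(alerts)
for (service, message, severity), count in ranked:
    print(f"{severity:<8} x{count}  {service}: {message}")
```

Six raw notifications become three ranked lines, with the single critical alert at the top instead of buried in the channel.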
Final thoughts on the state of observability
This is an exciting time to be in observability. As I mentioned earlier, the changes we’re seeing are opening the door to untold opportunities. The question remains: who will rise to the top in 2023?