Written by Jon Greaves
Since the first computers entered server rooms, the need to monitor them has been well understood. The earliest forms of monitoring were as simple as status lights attached to each module, showing whether it was powered up or in a failed state. Today’s datacenter is still awash with lights, the inside joke being that many of them are simply “randomly pleasing patterns” that, in all honesty, provide very little use.
In 1988, RFC 1065 was released. Requests for Comments (RFCs), typically published under the umbrella of organizations like the Internet Engineering Task Force (IETF), allow like-minded individuals to band together and build standards. RFC 1065 and its two sister RFCs outline a protocol, the Simple Network Management Protocol (SNMP), and a data structure, the Management Information Base (MIB). SNMP was originally focused on network devices, but its value was soon realized across all connected IT assets, including servers.
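To make the idea concrete, here is a minimal sketch of how a console might read a single MIB value from a device over SNMP. It assumes the net-snmp command-line tools are installed and that the device answers the default “public” read community; the host address and the Python wrapper are illustrative, not part of any particular product.

    import subprocess

    def snmp_get(host, oid, community="public"):
        """Ask a device for one MIB value over SNMP v2c (wraps net-snmp's snmpget)."""
        result = subprocess.run(
            ["snmpget", "-v2c", "-c", community, host, oid],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

    # sysUpTime (1.3.6.1.2.1.1.3.0) is a standard MIB-II object most
    # SNMP-capable devices will answer.
    print(snmp_get("192.0.2.10", "1.3.6.1.2.1.1.3.0"))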
Today, SNMP has been through three major releases and is still a foundation for many monitoring solutions.
At the highest-level, three forms of monitoring exist today:
- Reactive – a device (server, storage, network, etc.) sends a message to a console when something bad happens
- Proactive – the console asks the device if it is healthy
- Predictive – based on a number of values, the health of a device is inferred
Each of the above has pros and cons. Reactive monitoring, for example, tends to offer the most specific diagnostics, e.g. “my fan is failing.” One scenario, however, limits its use as your only solution: should the device die, or fall off the network, it will not generate messages. Since the console is purely reacting to messages, it cannot determine whether the device is alive and well or completely dead. This is a major flaw in purely reactive monitoring solutions.
Proactive monitoring, on the other hand, has the console polling the device at predetermined intervals. During each poll the console asks the device a number of questions to gauge its health and function. This solves the blind spot of reactive monitoring, but it creates significantly more network traffic and load on the device; in fact, there have been cases where devices were polled so hard they could not operate.
What typically happens, then, is that reactive monitoring is paired with proactive polling: you get the benefits of both approaches and negate the disadvantages of each.
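As a rough illustration of that pairing, the sketch below assumes a hypothetical trap queue filled by a separate listener and uses a crude TCP reachability probe in place of a real SNMP poll - both simplifications, not how any particular console is built.

    import queue, socket, time

    trap_queue = queue.Queue()  # a separate listener thread would fill this with device events

    def poll_device(host, port=22, timeout=2.0):
        """Crude liveness probe: can we still reach a management port on the device?"""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def monitor(host, interval=60):
        while True:
            # Reactive: handle anything the device reported on its own.
            while not trap_queue.empty():
                print("device event:", trap_queue.get())
            # Proactive: silence from the device is itself a symptom.
            if not poll_device(host):
                print(host, "did not answer the poll - it may be dead or off the network")
            time.sleep(interval)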
While reactive and proactive monitoring may be the norm today, they still leave computer systems vulnerable to outages. As complexity continues to grow, a different approach to monitoring is needed. Two very interesting fields of research - prognostics and autonomics - are emerging to take on these challenges.
Prognostics makes use of telemetry to look for early signs of failure, often by applying complex mathematical models. These models take into account many streams of data and look not only at directly correlated failure conditions, but also at what might best be described as the harmonics of a system. For example, by looking at the frequency of alarms and health data from multiple components of a system, small variations can be detected that can lead to failures.
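A minimal sketch of that idea, assuming an arbitrary window size and threshold rather than anything from a production prognostic model, might watch a single telemetry stream and flag readings that drift away from their recent baseline:

    from collections import deque
    from statistics import mean, stdev

    def drift_detector(readings, window=50, threshold=3.0):
        """Yield (value, z-score) for readings that stray from the recent norm."""
        history = deque(maxlen=window)
        for value in readings:
            if len(history) >= 10 and stdev(history) > 0:
                z = (value - mean(history)) / stdev(history)
                if abs(z) > threshold:
                    yield value, z   # an early warning, not yet a hard failure
            history.append(value)

Fed fan speeds, temperatures, or alarm rates, a detector like this raises an early warning well before a hard threshold trips; real prognostic models correlate many such streams at once.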
This prognostic approach has been used with great success in other industries. The commercial nuclear industry has deployed it to help detect issues and false alarms; a false alarm can result in the shutdown of a facility and cost millions of dollars per day. There are also many military applications for this kind of advanced monitoring, including next-generation battlefield systems and the Joint Strike Fighter, where thousands of telemetry streams are analyzed in real time to look for issues that could impact a mission.
While these applications may seem far removed from the problems of monitoring today’s computer systems, several companies have made huge advances in this technology. Most notable is Sun Microsystems, which has used such approaches in several high-end servers not only to detect pending hardware failures, but has also applied them to software to look for “software aging,” where memory leaks, runaway threads, and general software bloat can lead to outages of long-running applications. Pair detection of aging with “software rejuvenation,” where applications are periodically cleansed, and large improvements in application availability can be realized.
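A minimal sketch of aging detection, assuming the third-party psutil package is available and using nothing more sophisticated than a steadily growing resident set as the proxy for a leak, might look like this (the rejuvenation step itself is left as a placeholder):

    import time
    import psutil  # assumed available; exposes per-process memory statistics

    def watch_for_aging(pid, samples=12, interval=300):
        """Flag a long-running process whose memory grows in almost every sample."""
        proc = psutil.Process(pid)
        rss_history = []
        while True:
            rss_history.append(proc.memory_info().rss)
            rss_history = rss_history[-samples:]
            growth = sum(b > a for a, b in zip(rss_history, rss_history[1:]))
            if len(rss_history) == samples and growth >= samples - 1:
                print(f"process {pid} appears to be aging - schedule rejuvenation")
            time.sleep(interval)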
Autonomics and autonomic computing can also be applied to these challenges, allowing IT infrastructure to take corrective action to prevent outages and optimize application performance. Autonomic computing is an initiative started in early 2001 by IBM with the goal of helping manage complex distributed systems. It tends to manifest itself in tools implemented as decision trees, mimicking the actions a system administrator might perform to correct issues before they become outages. Academia is leading the charge in this area, with key projects in supercomputing centers where scale and complexity require a new approach to the problem.
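A minimal sketch of that decision-tree style, with thresholds and remediation functions that are purely illustrative, shows how each branch mirrors a step an administrator might take before a symptom becomes an outage:

    def restart_service(name): print("restarting", name)
    def expand_log_volume():   print("growing the log volume")
    def page_on_call(msg):     print("paging on-call:", msg)

    def remediate(metrics):
        if metrics["heartbeat_missed"]:
            # No automated action can help a silent host; escalate to a human.
            page_on_call("host stopped responding")
        elif metrics["disk_used_pct"] > 95:
            expand_log_volume()
        elif metrics["app_response_ms"] > 2000:
            restart_service("crm-app")
        # ...otherwise the system is healthy; do nothing.

    remediate({"heartbeat_missed": False, "disk_used_pct": 97, "app_response_ms": 300})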
With the advances in systems monitoring and management come new kinds of risks - some of which arise from seemingly harmless data. Take the example of a publicly traded company that outsources the hosting and management of its infrastructure. The application management company enables monitoring, and the customer is careful to exclude any sensitive data from what is being monitored, allowing only basic data to be collected on memory, disk, network, and CPU. At first impression, this seems like harmless data.
Each quarter, as the company closes its books, its CRM and ERP systems (both monitored) crunch the quarter’s data. In Q1, the customer has a great quarter, as publicly disclosed in its filings. The provider monitoring the environment now has a benchmark from which transactional volume can be inferred based on disk I/O, memory, and CPU utilization. But let’s say the customer misses its numbers in Q2. Now the provider has data from which a bad quarter can be inferred. As Q3 is in the process of closing, and before the CFO has even seen the results, the hosting provider - armed with just basic performance data from CPU, memory, and disk - can in theory predict the quarter’s results.
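To see how little data that inference needs, consider a back-of-the-envelope sketch in which every number is invented: one disclosed quarter provides the benchmark, and the peak I/O observed during the next close becomes a crude proxy for its results.

    q1_peak_io = 4200.0   # MB/s observed during the Q1 close (invented)
    q1_revenue = 310.0    # $M, later disclosed publicly (invented)
    q3_peak_io = 3650.0   # MB/s observed while Q3 is still closing (invented)

    revenue_per_io = q1_revenue / q1_peak_io
    print(f"inferred Q3 revenue: roughly ${q3_peak_io * revenue_per_io:.0f}M")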
This simplistic scenario highlights the future value of telemetry, even telemetry that seems low risk today. As our ability to infer failures, performance, and eventually business results grows, new kinds of risks will emerge, requiring mitigation.
To this point we have focused on what is essentially “node level” monitoring, i.e., the performance and health of a server or other piece of IT infrastructure on its own. This is, and will likely always be, the foundation for managing IT systems. However, it does not tell the full story - arguably the most important factor in today’s environments - of how the business processes supported by the infrastructure are performing.
IT Service Management focuses on the customer’s experience of a set of IT systems as defined by their business functions. For example, assume a customer has a CRM system deployed. While the servers may be reporting a healthy status, if the application has been misconfigured or a batch process is hung, the end user will be experiencing degraded operations while a traditional monitoring solution is likely still reporting the system as functioning and “green.” Taking an IT Service Management approach, the CRM solution would be modeled with its service dependencies (e.g., it depends on web, application, and database tiers and requires network, servers, and storage to be functioning). This model is then enhanced with simulated end-user transactions, application performance metrics, and statistics from an IT service desk to identify issues beyond the availability of the core IT infrastructure. This holistic approach to monitoring provides greater visibility to CIOs, typically expressed as a dashboard of how their IT investment is performing from their user community’s point of view.
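A minimal sketch of that service-level view, using a placeholder URL and trivial dependency checks rather than any real CRM deployment, makes the point: the service is only “green” when the infrastructure is healthy and a simulated end-user transaction completes within its budget.

    import time
    import urllib.request

    def synthetic_login(url="https://crm.example.com/login", budget_s=2.0):
        """Simulated end-user transaction: fetch the login page within a time budget."""
        start = time.monotonic()
        try:
            urllib.request.urlopen(url, timeout=budget_s)
        except OSError:
            return False
        return (time.monotonic() - start) <= budget_s

    def service_status(dependency_checks):
        # Node-level health alone is not enough; the user-facing check must also pass.
        infra_ok = all(check() for check in dependency_checks.values())
        return "green" if infra_ok and synthetic_login() else "degraded"

    checks = {"web tier": lambda: True, "app tier": lambda: True, "database": lambda: True}
    print(service_status(checks))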
Virtualization technology, and its use to enable cloud computing, has opened up many opportunities for organizations to realize the agility we all seek from our IT investments. Virtualization has not, however, simplified the administration of IT as originally promised - instead, it has greatly increased its complexity. Consider a typical use of virtualization: server consolidation. Pre-consolidation, each server had a function, typically supported by a single operating system image running on bare metal. Should the server or operating system experience a problem, it was easy to uniquely identify the issue and initiate an incident-handling process to remediate it. In a consolidated environment, a single server may be running tens of virtual machines, each with its own unique function. These virtual machines may also be migrated between physical servers in an environment. Traditional monitoring solutions were not designed around the concept that a resource may move dynamically, or even be offline while it is not needed and started on demand.
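The sketch below illustrates the shift this forces on a monitoring tool: rather than working from a fixed host list, it rebuilds its inventory every cycle so it can notice machines that have migrated or simply disappeared. The list_running_vms() helper is a hypothetical stand-in for a hypervisor or cloud inventory API.

    import time

    def list_running_vms():
        """Placeholder: a real tool would query vCenter, libvirt, a cloud API, etc."""
        return {"crm-app-01": "host-a", "crm-db-01": "host-c"}

    def monitor_dynamic(interval=60):
        last_seen = {}
        while True:
            inventory = list_running_vms()   # rebuilt each cycle, never assumed static
            for vm, host in inventory.items():
                if last_seen.get(vm) not in (None, host):
                    print(vm, "migrated from", last_seen[vm], "to", host)
            for vm in set(last_seen) - set(inventory):
                print(vm, "is no longer running - powered off or waiting to start on demand")
            last_seen = inventory
            time.sleep(interval)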
Now, taking the extreme of virtualization to the next logical level - cloud computing - today’s monitoring tools are taxed even more. Your servers are now hosted by an infrastructure- or platform-as-a-service provider, and you have even less control over your resources. This hasn’t gone unnoticed by providers. In fact, over the past month several monitoring consoles have been released (including for Amazon EC2) to start addressing this challenge. Independent solutions are also appearing, most notably Hyperic, which launched http://www.cloudstatus.com/ where you can view Amazon and Google App Engine availability gathered through proactive monitoring. The natural evolution will be these tools interfacing with more traditional solutions to give companies more holistic views of their environments. This takes the old concept of a “Manager of Managers” to the next level.
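A minimal sketch of that “Manager of Managers” idea, with hypothetical feed functions standing in for real console and cloud-status APIs, is simply a thin layer that pulls status from several sources and rolls it up into one view:

    def internal_console_status():  return {"crm-db-01": "ok", "crm-app-01": "warning"}
    def ec2_console_status():       return {"web-frontend": "ok"}
    def cloudstatus_feed():         return {"amazon-ec2": "ok", "google-app-engine": "ok"}

    def manager_of_managers():
        rollup = {}
        for source in (internal_console_status, ec2_console_status, cloudstatus_feed):
            rollup.update(source())   # last writer wins; real tools reconcile conflicts
        overall = "ok" if all(v == "ok" for v in rollup.values()) else "attention needed"
        return overall, rollup

    print(manager_of_managers())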
Today’s computing architectures are taxing the foundations of monitoring solutions. This does, however, create great opportunities for tools vendors and solution providers. It also brings into sharper focus the idea of IT Service Management, where understanding end users’ performance and expectations, and mapping them back to SLAs, becomes the norm.
A brief interview with Javier Soltero, co-founder and CEO of Hyperic, the leader in multi-platform, open source IT management.
Q. Monitoring is typically seen as the last step of any deployment, often not considered during development. Do you see customers embracing a tighter coupling of the entire software lifecycle with the engineering of IT Service Management solutions?
Absolutely, it’s a very encouraging trend, especially among SaaS companies and other businesses that are heavily dependent on their application performance. The really successful ones spend time building a vision for how they want to manage the service. That vision then helps them select which technologies they use and how they use them. Companies that build instrumentation into their apps have an easier time managing their application performance and will resolve issues faster.
Q. Customers are embracing IT Service Monitoring as a key element in understanding not only performance but also the ROI of IT investments. What challenges do you see for customers adopting these technologies?
The biggest challenge we see is the customer’s ability to extract the right insight from the vast amount of data available. The usability of these products also tends to make the task of figuring out things like ROI and other business metrics difficult. Oftentimes a tool that can successfully collect and manage the massive amounts of data required to dig deep into performance metrics lacks an analytics engine capable of displaying the data in an insightful way, and vice versa.
Q. End-user monitoring has typically been delivered with synthetic transactions, and this has certainly been a valuable tool. How do you see this technology evolving?
The technology for external monitoring of this type will continue to evolve as the clients involved for these applications get more and more sophisticated. For example, a user might interact with a single application that includes components from many other external applications and services. The ability for these tools to properly simulate all types of end-user interactions is one of the many challenges. More important is the connection of the external transaction metrics to the internal ones.
Q. Monitoring is one part of the equation; mapping availability and performance makes this data useful. With virtualization playing such a big part in datacenters today, how do you see tools adapting to meet the challenges of portable and dynamic workloads?
The most important element of monitoring in these types of environments is visibility into all layers of the infrastructure and the ability to correlate information. Driving efficiency in dynamic workload scenarios like on-premise virtualization or infrastructure services like Amazon EC2 requires information about the performance and state of the various layers of the application. Providing that level of visibility has been a big design objective of Hyperic HQ from the beginning and it’s helped our customers do very cool things with their infrastructure.
Q. How do you see monitoring and IT service management evolve as cloud computing becomes more pervasive?
Cloud computing changes the monitoring and service management world in two significant ways. First, the end user of cloud environments is primarily a developer who is now directly responsible for building, deploying, and managing his or her application. This might change over time, but I’m pretty sure that regardless of the outcome, Web and IT operations roles will be changed dramatically by this platform. Second, this new “owner” of the cloud application is trapped between two SLAs: an SLA he provides to his end user and an SLA that is provided by the cloud to him. Cloudstatus.com is designed to help people address this problem.
Q. Do you see SaaS model reemerging for the delivery of monitoring tools, where customers will use hosted monitoring solutions?
Yes, but it will be significantly different from the types of SaaS based management solutions that were built in the past. The architecture of the cloud is the primary enabler for a monitoring solution that, like the platform that powers it, is consumed as a service.