Researchers from North Carolina State University in the US have developed a software tool that automatically detects and responds to cloud computing problems.
Cloud computing depends on data centres filled with virtual machines (multiple software-based machines running on a single physical server), so a problem with one virtual machine can disrupt the entire cloud.
The new tool monitors metrics such as memory use, network traffic and CPU load to define 'normal' behaviour for each virtual machine in the cloud. It then watches for deviations from that baseline and predicts anomalies. If a deviation is spotted, diagnostics determine which metrics may be affected and trigger the appropriate prevention mechanism.
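The article doesn't describe the researchers' actual algorithm, but the general idea of learning a per-VM baseline and flagging deviations can be sketched with a simple z-score test (the metric names and threshold here are illustrative assumptions, not details from the paper):

```python
# Illustrative sketch only: learn a "normal" profile per VM from historical
# metric samples, then flag metrics that deviate sharply from the baseline.
from statistics import mean, stdev

def build_profile(samples):
    """Learn a baseline (mean, stdev) for each metric from historical
    samples. `samples` is a list of dicts, e.g. {"cpu": 0.21, "net_kbps": 120}."""
    profile = {}
    for metric in samples[0]:
        values = [s[metric] for s in samples]
        profile[metric] = (mean(values), stdev(values))
    return profile

def detect_deviations(profile, sample, threshold=3.0):
    """Return the metrics whose current value lies more than `threshold`
    standard deviations from the learned baseline."""
    flagged = []
    for metric, (mu, sigma) in profile.items():
        if sigma > 0 and abs(sample[metric] - mu) / sigma > threshold:
            flagged.append(metric)
    return flagged

# Example: a VM whose network traffic suddenly spikes while CPU stays normal.
history = [{"cpu": 0.20 + 0.01 * (i % 3), "net_kbps": 100 + (i % 5)}
           for i in range(50)]
profile = build_profile(history)
print(detect_deviations(profile, {"cpu": 0.21, "net_kbps": 900}))  # → ['net_kbps']
```

In a real system the flagged metrics would then feed the diagnostic step the article mentions, which decides what preventive action to take.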
“If we can identify the initial deviation and launch an automatic response, we can not only prevent a major disturbance, but actually prevent the user from even experiencing any change in system performance,” says Dr. Helen Gu, an assistant professor of computer science at NC State and co-author of a paper describing the research.
The software consumes less than 1% of CPU capacity and only 16 megabytes of memory once it's up and running. In benchmark tests it identified up to 98% of anomalies, reportedly a much higher rate than existing approaches achieve.
Review: The use of cloud computing is growing rapidly, which is no bad thing, given that centralised, data centre-based ICT operations can be far more energy efficient than decentralised solutions. Of course the efficiency of data centres varies considerably. Power Usage Effectiveness (the PUE metric, which compares overall data centre power use with that used by the ICT equipment alone) can be as high as 3.0 or as low as 1.05. But there is always more that can be done to make the data centre more energy efficient.
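The PUE figures quoted above follow directly from the metric's definition: total facility power divided by the power drawn by the ICT equipment alone, so 1.0 is the theoretical ideal. A minimal illustration (the kW figures are made up to reproduce the article's two endpoints):

```python
def pue(total_facility_kw, it_equipment_kw):
    """Power Usage Effectiveness: total facility power divided by the
    power used by the ICT equipment alone. Lower is better; 1.0 is ideal."""
    return total_facility_kw / it_equipment_kw

# An inefficient site: 300 kW total to support 100 kW of IT load.
print(pue(300, 100))  # → 3.0
# A highly efficient site: 105 kW total for the same 100 kW of IT load.
print(pue(105, 100))  # → 1.05
```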
The real problem may be that relying on cloud computing puts all your eggs in one basket. And when things do go wrong it can be a frustrating experience, particularly if you rely on some third party supplier for basic applications. So anything that helps to minimise problems has to be a good thing.
It’s not clear how good this software is, though – ‘up to’ 98% identification of problems is not much use as an indication of reliability – although a reported false-alarm rate of only 1.7% is encouraging. The approach still sounds intuitively better than the alternative, which is to identify abnormal behaviour after the fact and write software to manage it. If you know how a machine looks as it starts to go wrong, you should be able to prevent it from failing in the first place.