As enterprise software platforms expand in complexity and importance, performance anomalies have become a serious threat that can result in millions of dollars in losses. Faced with this challenge, performance engineering experts have begun utilizing machine learning algorithms to predict performance issues, remedy them, and even avoid them altogether.
Machine learning solutions can analyze and interpret thousands of statistics per second, providing real-time (or near real-time) insight into a system's performance. They can be used to recognize data patterns, build statistical models, and make predictions that are invaluable to the process of performance monitoring and testing.
With these abilities, machine learning tools are able to solve performance issues faster and more accurately than performance teams, significantly improving efficiency. Furthermore, they can help teams understand the platform's behavior quickly while mitigating the risks associated with poor performance, such as reputational damage, a reduction in customers, and financial losses.
Here we take a closer look at the necessary considerations for organizations looking to harness the power of machine learning algorithms to improve performance anomaly detection and, as a result, overall performance testing.
Recognizing Performance Anomalies
During testing, there are numerous signs that an application is exhibiting a performance anomaly: delayed response times, increased latency, decreased throughput, or systems that hang, freeze, or crash.
The root cause of these issues can be traced to any number of sources, including operator errors, hardware/software failures, over- or under-provisioning of resources, or unexpected interactions between system components in different locations.
There are three types of performance anomalies that performance testing experts look out for.
- Point anomalies: A single instance of data that's vastly different from the rest of the data in the dataset or database.
- Contextual anomalies: Here the anomaly is specific to a given context. These are common in time-series data, such as performance peaks caused by a traffic increase.
- Collective anomalies: In this case, there is a set of data instances that together indicate abnormal behavior.
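As a quick illustration of the first category, a simple z-score test flags a single sample that sits far from the rest of the dataset. This is a minimal sketch with made-up latency numbers; the 2.5 cutoff is an assumption (small samples cap how large a z-score can get, so the textbook 3.0 would miss the spike here).

```python
def zscore_anomalies(values, threshold=2.5):
    """Return the indices of points whose z-score exceeds the threshold."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = variance ** 0.5
    if std == 0:
        return []  # all values identical: nothing can be an outlier
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]

# Response times in ms; the 900 ms spike is a point anomaly.
latencies = [120, 118, 125, 122, 119, 900, 121, 123]
print(zscore_anomalies(latencies))  # → [5]
```

A contextual anomaly needs the extra dimension (time of day, traffic level) folded into the model, which is where the machine learning approaches below come in.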
In all cases, anomaly detection is similar to what's known as noise removal or novelty detection during the performance testing process. The difference is that while anomaly detection looks to flag potential threats, novelty detection works to identify patterns that were not observed in training data, and noise removal works to separate unwanted observations from the desired data for analysis.
[RELEVANT READING: Using Machine Learning to Detect Anomalies in Performance Engineering]
Performance Anomaly Detection Without Machine Learning
One of the most basic methods of anomaly detection in performance testing is to identify and flag data points that stray from the common model through simple statistical techniques.
- Reactive approach
Here a team may set a threshold for specific performance metrics, like CPU utilization, disk I/O, memory consumption, or network traffic, and raise alarms when that threshold is violated. The challenging aspect of this approach is that larger data systems can have variable workloads, so setting static thresholds can trigger false alarms and won't help the team understand the effect of application changes or updates on performance.
- Proactive approach
In this category, teams continuously evaluate a system by comparing it to baselines or statistical models. Since systems are continuously evolving, stable baselines are rare. Additionally, this charges the team with the arduous task of keeping performance models up to date with the system's changing behavior.
- Rule-of-thumb approach
This method relies heavily on the past experience of trained "gurus" who monitor key performance indicators through manual checkups and work mainly from personal observations and routine inspections.
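The contrast between the reactive and proactive approaches can be sketched in a few lines. The metric, thresholds, and tolerance below are illustrative assumptions: a static 90% CPU alarm never fires on a 70% spike, while a rolling baseline does.

```python
from collections import deque

def reactive_alarm(cpu_percent, threshold=90.0):
    """Reactive: raise an alarm only when a static threshold is violated."""
    return cpu_percent > threshold

class RollingBaseline:
    """Proactive: compare each sample against a moving baseline."""
    def __init__(self, window=5, tolerance=0.5):
        self.samples = deque(maxlen=window)
        self.tolerance = tolerance  # allowed fraction above the baseline

    def check(self, value):
        # Baseline is the mean of recent samples (or the value itself at start).
        baseline = sum(self.samples) / len(self.samples) if self.samples else value
        anomalous = value > baseline * (1 + self.tolerance)
        self.samples.append(value)
        return anomalous

monitor = RollingBaseline()
readings = [40, 42, 41, 43, 40, 70]            # 70% CPU after a ~41% baseline
print([reactive_alarm(r) for r in readings])   # all False: 90% is never crossed
print([monitor.check(r) for r in readings])    # the baseline flags the last one
```

The proactive version still carries the maintenance burden the article mentions: the window size and tolerance have to be re-tuned as the system evolves, which is exactly the work machine learning models automate.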
Machine Learning-based Approaches
Machine learning algorithms can be used to help determine statistical models of "normal" behavior in a piece of software. They are also invaluable for predicting future values and comparing them against the values collected in real time, which means they are constantly redefining what "normal" behavior entails.
A great advantage of machine learning algorithms is that they learn over time. When new data is received, the model can adapt automatically and help define what "normal" is month-to-month or week-to-week. This means we can account for new data patterns and make more accurate predictions and forecasts than the ones based on the data's original pattern. Best of all, these updates would happen without human intervention.
There are several ways machine learning can be utilized to detect anomalies in performance. Here are a few of the most popular methods:
- Density-based: Assuming that normal data points occur around a dense neighborhood and, therefore, anomalies are far away, this method of detection is based on the K-Nearest Neighbors algorithm, or alternatively the local outlier factor (LOF).
- Clustering-based: This method is one of the most popular concepts for unsupervised learning. It assumes that data points that are similar usually belong to similar groups. Here, a K-means algorithm creates 'k' similar clusters of data points, with instances that fall outside the clusters being marked as potential anomalies.
- Support vector machine-based: Support Vector Machines are generally associated with supervised learning, but extensions such as One-Class SVM can be leveraged to treat anomaly detection as an unsupervised problem. Here the algorithm learns a soft boundary around the normal data and then identifies instances that fall outside the learned region.
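The density-based idea can be sketched in pure Python: score each point by its average distance to its k nearest neighbors, so isolated points get high scores. A production system would typically reach for scikit-learn's LocalOutlierFactor, KMeans, or OneClassSVM instead; the (cpu %, latency ms) samples here are illustrative assumptions.

```python
def knn_scores(points, k=2):
    """Average Euclidean distance from each point to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(
            sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
            for j, q in enumerate(points)
            if j != i
        )
        scores.append(sum(dists[:k]) / k)
    return scores

# Four samples form a dense cluster; the fifth sits far away from it.
samples = [(40, 120), (42, 118), (41, 122), (43, 119), (95, 900)]
scores = knn_scores(samples)
print(scores.index(max(scores)))  # → 4, the outlier
```

LOF refines this same score by comparing each point's local density to its neighbors' densities, which handles clusters of differing density better than a raw distance cutoff.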
Now that we have these methods in mind, let's look at how to establish what a specific system's needs are, and how to satisfy them effectively to fully realize the potential of a machine learning system for performance testing.
Getting Started: Benchmarking, Learning, and Finding Experts
It's essential to first establish which machine learning model best serves the platform's specific requirements, and what metrics the model needs to report on in order to gain valuable insight into the system's performance.
It's extremely important to define "normal" behavior in the system, as well as to determine what is considered an anomaly. This information allows the machine learning model to understand what it needs to look for and to fine-tune the concept as it goes.
Human insight should always form the foundation of these optimization endeavors. Decisions on changes or tweaks to the system must lie in the hands of experienced performance engineers who can determine the best strategies to keep systems working.
Here are some strategic considerations that must be addressed before tackling the difficult task of developing a machine learning platform for performance testing.
- Timeliness: This relates to how quickly you need performance data and whether it should be tied to business decisions in real-time or to longer-term planning. Do you need to be able to take action right away? Or would you benefit more from retrospective analysis to help inform infrastructure changes?
- Scale: Here we consider the amount of data the system needs to process. How many data metrics does the system need to process? How big are the datasets?
- Rate of change: Some systems can see frequent changes in their environment with the regular release of feature updates or new versions. Others can evolve very slowly and don't see many changes over time. It's important to know how often a system might experience change so the chosen algorithm can be programmed to adapt appropriately.
- Conciseness: It's also important to consider how a system will generate a result. For example, should it produce an overall answer that takes all metrics into account, or should it give a detailed answer, specifying each metric individually? These are usually classified into univariate anomaly detection (each metric is analyzed individually) and multivariate anomaly detection (all metrics are considered together). A third option is a mix of both, starting with an overview but later focusing on just a few key metrics.
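The univariate/multivariate distinction above is worth making concrete. In the sketch below (all numbers are illustrative assumptions), the sample (45, 150) is inside each metric's normal range on its own, so univariate checks pass, but it breaks the strong cpu/latency correlation, so a multivariate score (here the Mahalanobis distance, written out for two metrics) flags it.

```python
def mahalanobis_2d(points, sample):
    """Mahalanobis distance of `sample` from 2-D `points` (population covariance)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    det = sxx * syy - sxy * sxy  # determinant of the 2x2 covariance matrix
    dx, dy = sample[0] - mx, sample[1] - my
    # Quadratic form with the inverse covariance, expanded for the 2x2 case.
    return ((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det) ** 0.5

# (cpu %, latency ms): historically, latency tracks CPU almost linearly.
history = [(10, 105), (20, 195), (30, 310), (40, 390), (50, 500)]
print(mahalanobis_2d(history, (25, 260)))  # on the trend: small distance
print(mahalanobis_2d(history, (45, 150)))  # off the trend: large distance
```

This is why the mixed option is attractive: a multivariate score catches the joint anomaly, and a per-metric drill-down then explains which metrics drove it.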
- Definition of incidents: It's important to determine what counts as "normal" and what counts as an "anomaly" in order to choose the right class of algorithm. A system with well-defined incidents can utilize supervised learning techniques, where the model receives examples of anomalies and checks for them. Conversely, a system lacking well-defined incidents can be approached through unsupervised learning, though that method only yields good results once "normal" behavior has been established. In either case, knowing what you're looking for is key.
Finding experts in this growing field can be challenging, so it's worth forming a partnership with a mature nearshore software development outsourcing provider to address your company's performance needs. Nearshoring makes a vast pool of global talent accessible, providing world-class performance and machine learning experts at a fraction of the domestic cost.
Today's organizations are discovering that peak software performance is not just a benefit for customers but a necessity. An inability to respond quickly to system lag or crashes can result in significant financial and reputational losses that should not be taken lightly.
A company with the right foresight can utilize machine learning technology to take a more preemptive approach to performance anomalies, resulting in a system that exceeds user expectations. This goes a long way toward turning customers into loyal supporters of a brand, while also empowering companies to expand their business without overwhelming their development and operations teams.
Looking for a performance team to help your company keep its systems on track? Schedule a call with a PSL representative to find out how PSL can help you implement performance engineering best practices.