Getting the most out of your HPC infrastructure is complex, and however hard you work as an administrator to optimize it, your efforts can be hampered by end users’ behavior.
From predicting individual jobs’ resource needs based on their characteristics to projecting the cluster load or its energy consumption, Predict-IT is the framework for forecasting the behavior of your clusters. Predict-IT embeds a series of machine learning algorithms that feed on the cluster’s logs: job scheduler accounting database, energy readings, applications’ logs… It learns from this historical data and continuously improves.
Integrated with your job scheduler, Predict-IT can warn users about potential issues with their submission parameters, or even act directly and update the jobs’ requirements to optimize the use of your HPC resources. Integrated with Analyze-IT, Predict-IT can help you plan future maintenance periods while limiting the impact on production.
Predict-IT is now powered by OKA™
By analyzing the historical job-submission data in the job scheduler’s logs, Predict-IT uses machine-learning techniques to build a computational model of your cluster. This model is then applied at job submission time to give the end user feedback on optimized submission parameters for the job, on its risk of failure, and on the estimated time to result.
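To make the idea of submission-time feedback concrete, here is a minimal sketch in Python. It is not Predict-IT's actual model: simple per-user, per-application statistics stand in for the machine-learning step, and the record layout, the `build_model` and `advise` helpers, and the 20% safety margin are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median

# Hypothetical accounting records (not a real scheduler format):
# (user, application, requested_minutes, actual_minutes, succeeded)
HISTORY = [
    ("alice", "cfd", 600, 120, True),
    ("alice", "cfd", 600, 150, True),
    ("alice", "cfd", 600, 135, True),
    ("bob", "md", 60, 60, False),  # killed at the walltime limit
    ("bob", "md", 60, 60, False),
    ("bob", "md", 120, 95, True),
]


def build_model(history):
    """Summarise past runtimes and failure rate per (user, application)."""
    groups = defaultdict(list)
    for user, app, _requested, actual, ok in history:
        groups[(user, app)].append((actual, ok))
    model = {}
    for key, jobs in groups.items():
        runtimes = [actual for actual, _ in jobs]
        failures = sum(1 for _, ok in jobs if not ok)
        model[key] = {
            "median_minutes": median(runtimes),
            "max_minutes": max(runtimes),
            "failure_rate": failures / len(jobs),
        }
    return model


def advise(model, user, app, requested_minutes, margin=1.2):
    """Return feedback for a new submission, or None when there is no history."""
    stats = model.get((user, app))
    if stats is None:
        return None
    # Assumed policy: longest observed runtime plus a safety margin.
    suggested = round(stats["max_minutes"] * margin)
    return {
        "expected_minutes": stats["median_minutes"],
        "suggested_walltime": suggested,
        "over_requested": requested_minutes > 2 * suggested,
        "failure_rate": stats["failure_rate"],
    }


MODEL = build_model(HISTORY)
ADVICE = advise(MODEL, "alice", "cfd", 600)
print(ADVICE)
```

In this toy data set, the 600-minute request is flagged as over-requested because past runs of the same application never exceeded 150 minutes. A production model would draw on many more job characteristics than the user and application name.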
Configured specifically for your cluster, Predict-IT steadily improves over time: it adapts to your HPC environment by learning from your cluster logs and newly arrived jobs, becoming more accurate in its predictions with each training run. Each newly trained model is compared to the previous one and is only deployed to production if it yields more accurate predictions. The more data you feed it, and the more precise that data is, the higher the prediction accuracy will be. Train Predict-IT on selected workloads for even greater accuracy.
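The "only deploy if better" rule can be sketched as a simple comparison on held-out jobs. The example below is an illustration under stated assumptions, not Predict-IT's implementation: the error metric (mean absolute error), the hold-out data, and the two toy models are all hypothetical.

```python
def mean_absolute_error(predict, holdout):
    """Average absolute gap between predicted and observed values."""
    return sum(abs(predict(x) - y) for x, y in holdout) / len(holdout)


def select_model(current, candidate, holdout):
    """Keep the current model unless the candidate is strictly more accurate."""
    current_err = mean_absolute_error(current, holdout)
    candidate_err = mean_absolute_error(candidate, holdout)
    chosen = candidate if candidate_err < current_err else current
    return chosen, current_err, candidate_err


# Hypothetical hold-out set: (requested cores, observed runtime in minutes)
HOLDOUT = [(8, 40), (16, 80), (32, 160)]


def current_model(cores):
    """Toy stand-in for the model currently in production."""
    return 4 * cores


def candidate_model(cores):
    """Toy stand-in for the freshly trained model."""
    return 5 * cores


CHOSEN, CURRENT_ERR, CANDIDATE_ERR = select_model(
    current_model, candidate_model, HOLDOUT
)
print(CHOSEN.__name__, CURRENT_ERR, CANDIDATE_ERR)
```

The strict comparison means a tie keeps the model already in production, which avoids needless redeployments when retraining brings no measurable gain.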
Instead of being reactive, what if you could be proactive about the problems you face with your cluster?
MeteoCluster is our framework for cluster behavioral analysis. Plugged into multiple data sources, it forecasts the evolution and trend of selected metrics such as the cluster load, the energy consumption or any other time series available in the data. Coupled with a set of detection plugins, it lets you explore “what-if” scenarios to help you plan future evolutions of the cluster and prepare for your upcoming maintenance by identifying the next peak or trough…
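As an illustration of how a load forecast can point to a maintenance window, the sketch below averages past load by hour of day and then looks for the quietest stretch. This seasonal-mean forecast is a deliberately simple stand-in, not MeteoCluster's method; the synthetic load figures, the assumption that the history starts at hour 0, and both helper functions are hypothetical.

```python
def hourly_profile(history, period=24):
    """Forecast one period ahead by averaging the load seen at each hour."""
    if len(history) < period:
        raise ValueError("need at least one full period of history")
    sums = [0.0] * period
    counts = [0] * period
    for i, load in enumerate(history):
        sums[i % period] += load
        counts[i % period] += 1
    return [s / c for s, c in zip(sums, counts)]


def quietest_window(forecast, width):
    """Return (start index, mean load) of the lowest-load window of given width."""
    best_start, best_mean = 0, float("inf")
    for start in range(len(forecast) - width + 1):
        mean = sum(forecast[start:start + width]) / width
        if mean < best_mean:
            best_start, best_mean = start, mean
    return best_start, best_mean


# Synthetic hourly cluster load (% of cores busy) for two days, starting at 00:00.
DAY_1 = [30 if 2 <= hour < 6 else 80 for hour in range(24)]
DAY_2 = [20 if 2 <= hour < 6 else 90 for hour in range(24)]

FORECAST = hourly_profile(DAY_1 + DAY_2)
START, MEAN = quietest_window(FORECAST, width=4)
print(f"Quietest 4-hour window starts at {START:02d}:00, mean load {MEAN:.0f}%")
```

With this synthetic data the trough falls in the early morning, which is where a four-hour maintenance would disturb the fewest jobs. Swapping `min` logic for `max` in the window search would locate the next peak instead.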
Soon, MeteoCluster will embed an extensible library of problematic-pattern detectors that will help system administrators quickly and accurately identify error-prone or instability-inducing behaviors of users and servers.