OKA Predict

Getting the best out of your HPC infrastructure is complex, and while you do your best as an administrator to optimize it, your efforts can be hampered by end-users’ behavior.

From predicting individual jobs’ resource needs based on their characteristics to projecting the cluster load or its energy consumption, OKA Predict is the framework to forecast the behavior of your clusters. OKA Predict embeds a series of machine learning algorithms that feed on the cluster’s logs: job scheduler accounting database, energy readings, application logs, and more. It learns from this historical data and continuously improves.

Integrated with your job scheduler, OKA Predict can warn users about potential issues with their submission parameters, or even act directly and update the jobs’ requirements to optimize the use of your HPC resources. Integrated with the whole OKA Suite, OKA Predict can help you plan future maintenance periods while limiting the impact on production.

OKA Predict is a product of the OKA™ Suite

Increase Resources Productivity

HPC clusters are often fully loaded, yet many HPC jobs never run to completion.

OKA Predict can detect whether a job is at risk of being killed by the job scheduler, and recommend an appropriate walltime parameter.
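
As a rough illustration of the kind of check described above, the following Python sketch flags a job whose requested walltime falls below a high percentile of the runtimes of similar past jobs. It is not OKA Predict's actual algorithm or API: the function names, the percentile heuristic, and the 10% safety margin are all assumptions made for this example.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile (0 < q <= 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def walltime_advice(past_runtimes_s, requested_walltime_s, q=95, margin_pct=10):
    """Hypothetical helper: compare a requested walltime with past runtimes.

    past_runtimes_s: runtimes (seconds) of similar, completed jobs.
    The job is flagged when its requested walltime is shorter than the
    q-th percentile of those runtimes; the suggestion adds a safety margin.
    """
    estimate = percentile(past_runtimes_s, q)
    suggested = math.ceil(estimate * (100 + margin_pct) / 100)
    return {
        "at_risk": requested_walltime_s < estimate,
        "suggested_walltime_s": suggested,
    }


# Illustrative data only: five earlier runs of a similar job.
history = [3000, 3200, 3400, 3600, 4000]
print(walltime_advice(history, requested_walltime_s=3600))
# {'at_risk': True, 'suggested_walltime_s': 4400}
```

A production model would typically condition on many more job characteristics (user, application, core count, input size) instead of a single list of runtimes.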

Leverage Machine Learning Power

Job schedulers store a lot of historical information on cluster usage and job characteristics.

OKA Predict is specially configured for your cluster and keeps learning from data to continuously improve its predictions.

Limit Waste of Resources

How many of your jobs reserve more resources than necessary, or could run on a different topology?

OKA Predict can advise on the right amount of resources for a job, reducing idle allocations and freeing capacity for other users.
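
To make the idea of right-sizing concrete, here is a minimal Python sketch that compares what a job requested with the peak usage observed on comparable past runs. The function name, the resource keys, and the 20% headroom are hypothetical choices for illustration; they do not describe how OKA Predict computes its advice.

```python
import math


def right_size(requested, peak_used, headroom_pct=20):
    """Hypothetical helper: suggest a smaller request per resource.

    requested / peak_used: dicts keyed by resource name (same keys).
    The suggestion is the observed peak plus some headroom, and it is
    never larger than what the user originally asked for.
    """
    suggestion = {}
    for resource, asked in requested.items():
        needed = math.ceil(peak_used[resource] * (100 + headroom_pct) / 100)
        suggestion[resource] = min(asked, needed)
    return suggestion


# Illustrative numbers only.
requested = {"cores": 64, "mem_gb": 256}
peak_used = {"cores": 16, "mem_gb": 40}
print(right_size(requested, peak_used))
# {'cores': 20, 'mem_gb': 48}
```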

Operating mode

By analyzing the historical job-submission data in the job scheduler’s logs, OKA Predict builds a machine-learning model of your cluster. This model is then applied at job submission time to give the end-user feedback on optimized submission parameters for the job, on its risk of failure, and on its estimated time to result.
Configured specifically for your cluster, OKA Predict steadily improves over time: it adapts to your HPC environment by learning from your cluster logs and newly arrived jobs, becoming more accurate in its predictions with each training run. Each newly trained model is compared to the previous one and is deployed to production only if it yields more accurate predictions. The more data you feed it, and the more precise that data is, the higher the prediction accuracy will be. Train OKA Predict on selected workloads for even greater accuracy.
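
The "only deploy if better" rule described above can be sketched as follows. This is an assumption-laden illustration, not OKA Predict's implementation: models are represented as plain Python callables, accuracy is measured with mean absolute error on a held-out set of jobs, and the names `select_model` and `runtime_s` are hypothetical.

```python
def mean_absolute_error(predict, jobs):
    """Average absolute gap between predicted and actual runtime (seconds)."""
    return sum(abs(predict(job) - job["runtime_s"]) for job in jobs) / len(jobs)


def select_model(current, candidate, holdout_jobs):
    """Keep the current model unless the candidate is strictly more accurate.

    Both models are scored on the same held-out jobs so that the
    comparison is fair.
    """
    current_err = mean_absolute_error(current, holdout_jobs)
    candidate_err = mean_absolute_error(candidate, holdout_jobs)
    return candidate if candidate_err < current_err else current


# Illustrative data and toy "models" only.
holdout = [
    {"cores": 4, "runtime_s": 400},
    {"cores": 8, "runtime_s": 800},
]
old_model = lambda job: 500                  # constant guess
new_model = lambda job: 100 * job["cores"]   # scales with core count

chosen = select_model(old_model, new_model, holdout)
print(chosen is new_model)
# True
```

Requiring a strict improvement means that a retrained model which merely ties with the one in production is not swapped in.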

MeteoCluster

Instead of being reactive, what if you could be proactive about the problems you face with your cluster?

MeteoCluster is our framework for cluster behavioral analysis. Plugged into multiple sources of data, it forecasts the evolution and trend of selected metrics such as the cluster load, the energy consumption or any time series available in the data. Coupled with a set of detection plugins, it lets you explore “what-if” scenarios to help you plan future evolutions of the cluster, prepare for your upcoming maintenance by identifying the next peak or trough…

Soon, MeteoCluster will embed an extensible library of detectors for problematic patterns, helping system administrators quickly and accurately identify user and server behaviors that lead to errors or instability.

Features

Understand how the load and energy consumption of your cluster will evolve in the coming days, weeks, months… MeteoCluster can forecast the evolution of such metrics and many more.

Need to plan maintenance without impacting production? When will the next peak of usage be? MeteoCluster comes with an extensible library of interactive detection tools.
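
To give a feel for how a load forecast can feed maintenance planning, the Python sketch below uses a deliberately simple seasonal average (not MeteoCluster's forecasting method, which is not described here) and then searches the forecast for the quietest contiguous window. All function names and sample values are hypothetical.

```python
def seasonal_average_forecast(history, period):
    """Forecast one period ahead as the mean of each slot over past periods.

    history: load samples, oldest first (e.g. one value per hour).
    period:  number of samples in one season (e.g. 24 for a daily cycle).
    Only whole periods at the end of the history are used.
    """
    full_periods = len(history) // period
    if full_periods == 0:
        raise ValueError("need at least one full period of history")
    recent = history[len(history) - full_periods * period:]
    return [
        sum(recent[p * period + slot] for p in range(full_periods)) / full_periods
        for slot in range(period)
    ]


def quietest_window(forecast, length):
    """Return the start index of the contiguous window with the lowest total load."""
    if not 0 < length <= len(forecast):
        raise ValueError("window length must fit inside the forecast")
    starts = range(len(forecast) - length + 1)
    return min(starts, key=lambda s: sum(forecast[s:s + length]))


# Illustrative data only: two past "days" of four load samples each (% load).
history = [80, 60, 20, 40, 90, 70, 30, 50]
forecast = seasonal_average_forecast(history, period=4)
print(forecast)
# [85.0, 65.0, 25.0, 45.0]
print(quietest_window(forecast, length=2))
# 2
```

Here the two-slot window starting at index 2 has the lowest forecast load, so it would be the least disruptive candidate for a maintenance slot; finding the next peak is the same search with `max` in place of `min`.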