Introducing OKA Suite – Empowering HPC Performance!

Simulation is strategic: it provides competitive advantages to industries, helps move scientific research forward and, with the explosion of data and artificial intelligence, is becoming essential to our lives. Yet efficiently running an HPC infrastructure is complex, and administrators often lack the proper tools to gain insight into how users behave and how the cluster responds to demand.

UCit has packaged its HPC and machine learning expertise in a software suite to help HPC system administrators be even more effective: the OKA Suite.

The OKA Suite features the right tool for each of your clusters’ areas of optimization. It is composed of 5 distinct and complementary products: OKA Core, OKA Shaper, OKA Energy, OKA Financials, and OKA Predict.

OKA Core – Get the most out of your HPC resources

As the backbone of the OKA Suite, OKA Core provides an extensible platform that presents the state of your HPC infrastructure through simple and comprehensible dashboards.

Whether you need high-level KPIs to report on cluster usage, or low-level information to track down the origin of an issue, OKA Core gives you the right level of detail to:

  • Analyze all your HPC clusters’ KPIs in one single, dedicated platform.
  • Quickly diagnose and understand issues using advanced data cross-checks and zoom-ins.
  • Easily identify areas of optimization for your clusters and resources.

OKA Core contains the following tools:

  • Job Status – To spot wasted resources due to failed jobs (displays number of jobs and core-hours consumed per job status)
  • Load – To understand overall resource allocation (displays allocated cores through time, and number of jobs allocated per node)
  • Throughput – To analyze QoS and job submission patterns (displays submission frequency, slowdown, interarrival…)
  • Resources – To determine job typologies and their consumption (displays number of cores & core-hours, memory and nodes consumed by the jobs)
  • Consumers – To perform advanced cross-checks for in-depth workload analysis (allows grouping of jobs per Group, User, JobName, Queue/Partition, QoS, Parallel Environment. For each, displays number of cores & core-hours, execution & waiting time, slowdown…)
  • Concurrent users – To spot abnormal behaviors among users (displays active users per period)
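
To make some of the metrics listed above concrete, here is a minimal Python sketch of how core-hours per job status, slowdown, and interarrival times could be derived from job accounting records. It is not OKA Core's implementation: the `Job` record, its field names, and the sample values are hypothetical, and slowdown is taken here in its common definition of (waiting time + execution time) / execution time.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical accounting record; times are seconds since an arbitrary epoch."""
    status: str
    cores: int
    submit: float
    start: float
    end: float


def core_hours_by_status(jobs):
    """Sum the core-hours consumed for each job status."""
    totals = defaultdict(float)
    for job in jobs:
        totals[job.status] += job.cores * (job.end - job.start) / 3600.0
    return dict(totals)


def slowdown(job):
    """(waiting time + execution time) / execution time; 1.0 means no wait."""
    run = job.end - job.start
    wait = job.start - job.submit
    return (wait + run) / run


def interarrival_times(jobs):
    """Gaps in seconds between consecutive job submissions."""
    submits = sorted(job.submit for job in jobs)
    return [later - earlier for earlier, later in zip(submits, submits[1:])]


# Illustrative sample data, not taken from a real cluster.
jobs = [
    Job("COMPLETED", cores=4, submit=0, start=0, end=3600),
    Job("FAILED", cores=8, submit=600, start=1800, end=5400),
    Job("COMPLETED", cores=2, submit=900, start=3600, end=7200),
]
```

In this sample, the failed job accounts for 8 of the 14 core-hours consumed, which is the kind of waste the Job Status view is meant to surface. Jobs with zero execution time would need special handling before computing slowdown.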

OKA Shaper – Shape the future of your HPC infrastructure

OKA Shaper is the natural next step after OKA Core when it comes to proactively planning the future of your HPC resources:

  • Assess and predict the load of your clusters to better prepare for the years to come.
  • Identify which workloads are good candidates for Cloud migration.
  • Project the cost and size of your future hybrid HPC resources with dedicated shaper tools.

OKA Shaper provides the following advanced tools:

  • Congestion/Contention – provides a day-to-day update of the cluster status (Optimal, Acceptable, Contention, Congestion) based on resource needs, delivered computing power, and the job life cycle for each day. It helps identify whether the cluster is correctly sized and configured, whether upgrades should be performed, or whether additional/external resources could be beneficial.
  • MeteoCluster – is the framework for cluster behavioral analysis. Plugged into multiple sources of data, it forecasts the evolution and trend of selected metrics such as the cluster load, the energy consumption, or any time series available in the data. Coupled with a set of detection plugins, it lets you explore “what-if” scenarios to help you plan future evolutions of the cluster or prepare for upcoming maintenance by identifying the next peak or trough…
    • Multi-scale predictor – Understand how the load and energy consumption of your cluster will evolve in the coming days, weeks, months… MeteoCluster can forecast the evolution of such metrics and many more.
    • Scenario explorer – Need to plan maintenance without impacting production? When will the next peak of usage be? MeteoCluster comes with an extensible library of interactive detection tools.
  • CloudSHaper – provides the tooling to follow and project costs of HPC workloads whether your HPC cluster is on-premises or in the Cloud[1]. The CloudSHaper plugin provides the capability to explore what-if scenarios and plan budgets accurately.

[1] AWS & Azure as of June 2023.

OKA Energy – Optimize your clusters’ energy consumption

OKA Energy is the tool to optimize your clusters’ energy consumption. It helps you:

  • Decrease energy costs and environmental impact thanks to HPC-dedicated energy tools.
  • Contain your carbon footprint to comply with regulatory and corporate emission policies.

OKA Energy contains the following toolset:

  • Energy – OKA Energy includes a subset of EAR (Energy Aware Runtime[1]) to report cluster and job energy consumption and estimate their carbon footprint. After acquiring OKA Energy, the power and energy metrics become available to all OKA Core and OKA Shaper plugins (for example, MeteoCluster can project the future power consumption of the cluster).
  • RackOON – OKA Energy features a specific plugin dedicated to energy monitoring, which provides a physical view of the cluster.
  • Carbon – OKA Energy features a specific plugin dedicated to carbon footprint estimation for the cluster and its jobs.

[1] https://gitlab.bsc.es/ear_team/ear. Other energy consumption data sources can be integrated into the platform through OKA™ Data Enhancers.
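
As a rough illustration of how a carbon footprint can be estimated from energy data, the Python sketch below multiplies the energy a job consumed by the carbon intensity of the electricity grid. This is a common approach, not necessarily the one used by OKA Energy or EAR; the function names, the power figures, and the intensity value are hypothetical.

```python
def energy_kwh(average_power_w, duration_s):
    """Energy in kWh from an average power draw (W) over a duration (s)."""
    return average_power_w * duration_s / 3_600_000


def carbon_kg(energy_kwh_used, intensity_g_per_kwh):
    """Estimated emissions in kg CO2e for a given grid carbon intensity."""
    return energy_kwh_used * intensity_g_per_kwh / 1000


# Hypothetical job: 2 nodes drawing 500 W each for 3 hours.
job_energy = energy_kwh(average_power_w=2 * 500, duration_s=3 * 3600)
# Hypothetical grid intensity of 100 g CO2e per kWh.
job_carbon = carbon_kg(job_energy, intensity_g_per_kwh=100)
```

In practice the carbon intensity depends on the location of the data center and varies over time, so a single constant is only a first approximation.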

OKA Financials – Get a handle on your HPC clusters’ costs

OKA Financials enhances OKA Core with precise and manipulable cost metrics available in all other products of the suite, and extends OKA Shaper by adding cost projection through MeteoCluster.

OKA Financials contains the following toolset:

  • Cost propagation – After acquiring cost information, this metric becomes available to all OKA Core and OKA Shaper plugins (for example, MeteoCluster can project the future costs of the cluster). You can either define the core-hour cost of your cluster, or have more fine-grained control over how your costs are reported (e.g., per queue and account) through the creation of a custom Data Enhancer.
  • CloudSHaper – provides the tooling to follow and project costs of HPC workloads whether your HPC cluster is on-premises or in the Cloud[1]. The CloudSHaper plugin provides the capability to explore what-if scenarios and plan budgets accurately.

[1] AWS & Azure as of June 2023.
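
The two options described under Cost propagation can be illustrated with a short Python sketch: a single core-hour cost for the whole cluster, optionally overridden per queue and account. This is only an illustration under stated assumptions, not the Data Enhancer API: the function name, the rate table, and the prices are hypothetical.

```python
def job_cost(cores, hours, queue, account, default_rate, overrides=None):
    """Cost of one job: core-hours multiplied by the applicable rate.

    `overrides` maps (queue, account) to a core-hour rate; any job without
    a matching entry falls back to the cluster-wide `default_rate`.
    """
    rate = (overrides or {}).get((queue, account), default_rate)
    return cores * hours * rate


# Hypothetical rates, expressed in currency units per core-hour.
DEFAULT_RATE = 0.5
OVERRIDES = {("gpu", "research"): 2.0}

standard = job_cost(16, 2.0, "batch", "research", DEFAULT_RATE, OVERRIDES)
premium = job_cost(16, 2.0, "gpu", "research", DEFAULT_RATE, OVERRIDES)
```

Both jobs consume 32 core-hours, but the second is billed at the overridden rate for its queue and account, which is the kind of distinction a per-queue, per-account cost model would capture.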

OKA Predict – Optimize usage of your HPC resources in real-time with workload predictors

OKA Predict takes your cluster usage optimization to the next level by allowing real-time optimization of end users’ job submissions. With OKA Predict, your team can build powerful Machine Learning predictors that will suggest or automatically apply the optimal job parameters to minimize resource use and time-to-results for end users.

OKA Predict is a machine-learning tool to forecast job performance, costs and energy consumption. By integrating your job scheduler or submission portal with OKA Predict, you can:

  • Improve cluster productivity.
  • Reduce waste of resources.
  • Help end users get results faster.

OKA Predict trains periodically on new data collected from the job scheduler or from additional logs. Its filtering functionalities make it possible to define workloads of interest to build specific predictors, and thus learn more precisely about the most frequent or most important cases. The submission parameters that OKA Predict includes as standard are the execution time and the RAM required for the job per node.

OKA Predict works in 3 phases. The training phase builds the predictors. The prediction phase then suggests parameters for users to specify for their jobs, such as the maximum execution time or the memory required, or simply gives users feedback about the waiting or rendering time of their tasks.
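
The training and prediction phases can be pictured with a deliberately simple Python sketch. It is not OKA Predict's algorithm, which relies on machine learning: here the "predictor" is just the median past execution time per (user, job name) pair, and the suggested time limit adds a safety margin. The record layout, the filter, the margin, and the sample history are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def train(history, keep=lambda job: True):
    """Training phase: build a per-(user, job name) median runtime table.

    `keep` is a filter selecting the workloads of interest.
    """
    runtimes = defaultdict(list)
    for job in history:
        if keep(job):
            runtimes[(job["user"], job["name"])].append(job["runtime_s"])
    return {key: median(values) for key, values in runtimes.items()}


def suggest_time_limit(predictor, user, name, margin=1.25):
    """Prediction phase: suggest a time limit, or None for unseen workloads."""
    predicted = predictor.get((user, name))
    return None if predicted is None else predicted * margin


# Illustrative job history, not taken from a real scheduler.
history = [
    {"user": "alice", "name": "cfd", "queue": "batch", "runtime_s": 1000},
    {"user": "alice", "name": "cfd", "queue": "batch", "runtime_s": 1200},
    {"user": "alice", "name": "cfd", "queue": "batch", "runtime_s": 1100},
    {"user": "bob", "name": "render", "queue": "debug", "runtime_s": 60},
]

# Train only on the "batch" queue, then query at submission time.
predictor = train(history, keep=lambda job: job["queue"] == "batch")
limit = suggest_time_limit(predictor, "alice", "cfd")
```

Because the filter excludes the "debug" queue, the predictor has nothing to say about bob's job and returns None, which mirrors the idea of building specific predictors for selected workloads only.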

OKA Predict forecasts the following job characteristics at submission time:

  • State – detects the risk of a job failing or timing out.
  • Execution Time – predicts the execution time of jobs to plan resources and get results faster.
  • Memory – predicts how much memory should be requested.
  • Waiting Time & Time to Result – provides feedback on when your jobs will end.
  • Energy – estimates the impact on the environment.

OKA Suite Adaptability and Flexibility

OKA™ is extensible, adaptable, and dynamic, and provides consistent interfaces and integrated tools.

Each cluster view is accessible to authorized users and configurable through profiles.

Moreover, OKA’s filtering capabilities let you precisely select the workloads to analyze (for example, you could analyze separately the jobs created by each research department, or concentrate on a period last summer when jobs behaved strangely).

Commonly used filters can be saved and reused to quickly review the most important KPIs. It is even possible to train specific predictors on them.

OKA™ can be extended with additional features and metrics through Data Enhancers, which allow additional data sources to be plugged into OKA™, such as the EAR database, hardware vendor tools that measure power and energy, or application-specific logs.

You can learn more about the OKA Suite at https://oka.how or jump directly to https://doc.oka.how/ to understand how it works and how to start an evaluation.