The bottom blue zone represents the optimal state, where the cluster operates above the minimum target load yet below the maximum waiting load. This signifies maximum capacity usage without excessive queue waits.
The acceptable state, a grayish-yellow zone, occurs when demand for CPU hours is lower, leaving part of the cluster’s capacity unused. It is not ideal, as it may indicate overprovisioning or low-demand periods such as weekends.
Congestion and contention correspond to the top-right and top-left zones, respectively. Congestion occurs when the cluster consistently operates in an overloaded state, while contention arises during peak demand periods or from specific requests, such as very large jobs that ask for many CPUs.
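The four zones described above can be read as two threshold tests: one on cluster load and one on queue pressure. The sketch below is only an illustration of that reading, not OKA Core's actual implementation. The names `classify`, `zone_shares`, `load`, and `waiting_ratio` are hypothetical, and so is the mapping of congestion to "high load, long queue" and contention to "low load, long queue", which is inferred from the zone positions. The default thresholds (70% minimum load, waiting-to-running ratio of 1.5) are the example values quoted later in this article.

```python
from collections import Counter
from enum import Enum


class Zone(Enum):
    OPTIMAL = "optimal"
    ACCEPTABLE = "acceptable"
    CONGESTION = "congestion"
    CONTENTION = "contention"


def classify(load, waiting_ratio, min_load=0.70, max_waiting_ratio=1.5):
    """Map one sample to a zone with two threshold tests (assumed rule).

    load          -- fraction of cluster capacity in use, between 0.0 and 1.0
    waiting_ratio -- waiting demand divided by running demand
    """
    busy = load >= min_load                   # at or above the minimum target load
    queued = waiting_ratio > max_waiting_ratio  # above the maximum waiting load
    if busy and not queued:
        return Zone.OPTIMAL
    if not busy and not queued:
        return Zone.ACCEPTABLE
    if busy and queued:
        return Zone.CONGESTION
    return Zone.CONTENTION                    # long queue although the cluster is not full


def zone_shares(samples, **thresholds):
    """Return the fraction of (load, waiting_ratio) samples in each zone."""
    counts = Counter(classify(load, ratio, **thresholds) for load, ratio in samples)
    total = sum(counts.values())
    return {zone: (counts[zone] / total if total else 0.0) for zone in Zone}


# Hypothetical daily samples, one (load, waiting_ratio) pair per day.
days = [(0.85, 0.4), (0.55, 0.2), (0.95, 2.3), (0.60, 3.0)]
shares = zone_shares(days)
```

Tracking how these shares shift from one period to the next is one simple way to notice a cluster drifting from the acceptable zone toward congestion.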
The following example of congestion shows a cluster moving from an acceptable state at the start to a congested state over time, indicating that demand has grown beyond what the cluster can serve optimally:
Another cluster demonstrates more balance across the zones but faces contention, notably during peak periods like summer. These peaks might signal a need to offload jobs to external resources or could result from new users’ job submissions or maintenance periods, requiring further investigation:
For HPC administrators, achieving an ideal balance between the cluster’s capacity and users’ demands is pivotal. This is where UCit’s OKA Core framework stands as a tool of choice in navigating these challenges. By delving into the behavior of users and jobs within the cluster and scrutinizing logs (accounting, applications, etc.), OKA Core offers insights into identifying problematic events and prescribing solutions.
Unveiling the OKA Core framework: Real-world encounters with congestion and contention
The OKA Core framework empowers administrators with a suite of customizable tools designed to decode the labyrinthine behaviors within HPC clusters. At its heart, OKA Core assimilates data from various sources, including job schedulers like SLURM, LSF, PBS, SGE, TORQUE, and any additional logs you can gather about your jobs. This treasure trove of information, in turn, fuels OKA Core’s ability to present hundreds of Key Performance Indicators (KPIs) vital for understanding cluster operations.
Through a comprehensive presentation and analysis of this data, OKA Core unveils the nuanced dynamics of the cluster, pinpointing congestion and contention states. It categorizes these states into four zones: optimal, acceptable, congestion, and contention, providing administrators with a visual representation of the cluster’s health.
OKA Core generates this analysis from user-defined parameters such as the minimum and maximum cluster load and the waiting-to-running ratio. For instance, with a minimum cluster usage of 70% and a waiting-to-running ratio of 1.5, you can visualize the four zones: optimal, acceptable, congestion, and contention. The analysis might show the cluster as 40% optimal and 30% acceptable, with significant instances of contention. It provides a day-by-day overview of the cluster’s computational life over a year, and you can click on a specific date to dig into that day’s activity, such as CPU hours delivered, jobs waiting in the queue, and job types. This tool helps you understand congestion and contention situations, while other plugins offer detailed insights into resource consumption, user activity, and job specifics: