Simulation is strategic: it provides competitive advantages to industries, helps move scientific research forward and, with the explosion of data and artificial intelligence, is becoming essential to our lives. Running an HPC infrastructure efficiently is complex, and administrators often lack the proper tools to understand how users are behaving and how the cluster is responding to demand.
UCit has packaged its HPC and machine learning expertise into a software tool that helps HPC system administrators be even more effective. Analyze-IT provides an extensible platform that presents the state of your HPC infrastructure through simple, comprehensible dashboards. Whether you need high-level KPIs to report cluster usage or low-level information to track down the origin of an issue, Analyze-IT gives you the right level of detail.
Identify Atypical User Behaviors
Did you spot that novice user submitting bursts of jobs in the last 2 days?
Or that user with fewer than 10% of their jobs ending correctly?
Improve Cluster Quality of Service
How long do your jobs spend in the queue compared to their actual runtime?
Do you have a high proportion of failed, cancelled or timed-out jobs?
Limit Waste of Compute Resources
Which resources are requested by your users but left unused?
How many of your jobs could run on cheaper nodes?
Plan Future Cluster Evolution
When do you have peak capacity needs that require additional resources?
How should you size your future cluster?
Number of jobs and core-hours consumed per job status
Allocated cores through time, and number of jobs allocated per node
Submission frequency, slowdown, interarrival, number of active jobs
Number of users active on the cluster (running or requesting jobs)
Resources (cores, memory, nodes…) used by the jobs
Detailed information about jobs grouped along multiple categories
Detection and detailed analysis of resubmitted jobs
Cluster state (Optimal, Acceptable, Contention, Congestion), and job life cycle
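
To make a few of the metrics listed above concrete, here is a minimal Python sketch of how core-hours per job status, slowdown and interarrival times could be derived from job accounting records. It is an illustration only, not Analyze-IT's implementation: the `Job` record, its field names and the sample data are hypothetical, and slowdown is assumed to follow the common scheduling definition (wait time + runtime) / runtime.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record; times are seconds since an arbitrary origin."""
    user: str
    status: str   # e.g. "COMPLETED", "FAILED", "CANCELLED", "TIMEOUT"
    cores: int    # allocated cores
    submit: int   # submission time
    start: int    # start of execution
    end: int      # end of execution


def core_hours_by_status(jobs):
    """Sum allocated cores * runtime (in hours) for each job status."""
    totals = defaultdict(float)
    for job in jobs:
        totals[job.status] += job.cores * (job.end - job.start) / 3600.0
    return dict(totals)


def slowdown(job):
    """Assumed definition: (wait + runtime) / runtime, for runtime > 0."""
    wait = job.start - job.submit
    runtime = job.end - job.start
    return (wait + runtime) / runtime


def interarrival_times(jobs):
    """Gaps, in seconds, between consecutive job submissions."""
    submits = sorted(job.submit for job in jobs)
    return [later - earlier for earlier, later in zip(submits, submits[1:])]


# Made-up sample records, for illustration only.
jobs = [
    Job("alice", "COMPLETED", cores=4, submit=0, start=600, end=4200),
    Job("bob", "FAILED", cores=2, submit=300, start=900, end=2700),
    Job("bob", "COMPLETED", cores=8, submit=900, start=900, end=4500),
]

print(core_hours_by_status(jobs))
print([round(slowdown(job), 2) for job in jobs])
print(interarrival_times(jobs))
```

A slowdown of 1.0 means the job started as soon as it was submitted; larger values mean the job spent proportionally more time waiting in the queue than running.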