Rafay Systems, the leading platform provider for Kubernetes Operations, announced the expansion of the industry’s only turnkey solution for operating Kubernetes clusters with GPU support at scale by adding powerful new metrics and dashboards for deeper visibility into GPU health and performance.
The Rafay Kubernetes Operations Platform (KOP) now features a fully integrated GPU Resource Dashboard that visualizes critical GPU metrics so developers and operations teams can seamlessly monitor, operate, and improve performance for GPU-based container workloads – all from one unified platform.
Kubernetes has rapidly become the preferred orchestration layer for enterprises that need to provision and operate GPU-enabled AI and machine learning applications in the cloud and at edge or remote locations.
According to 2022 Gartner® Emerging Technologies: Edge Technologies Offer Strong Area of Opportunity — Adopter Survey Findings*, “The primary objectives for respondent organizations investing in and adopting edge technologies are to improve employees productivity (41%) and automate business processes (39%). This aligns with existing Gartner research (see Emerging Technologies: Use-Case Patterns in Edge AI) that edge AI is being used to improve business processes, delivering automation and productivity gains that translate into measurable ROI, such as cost savings.”*
However, as enterprises rapidly increase the number of AI and machine learning workloads, they must address challenges such as visibility and monitoring to prevent significant delays in application deployment and the wasted costs associated with idle or underperforming GPUs in their clusters.
For example, a factory that increasingly relies on AI-powered, real-time video detection applications needs a standardized approach for cross-functional teams to manage the IT infrastructure and applications. The following challenges often result in operational fragility and a lack of repeatability that hinder productivity:
- Flawed or overly restrictive access and visibility for developers and operations personnel who need GPU metrics on demand to tune and optimize GPU workloads.
- The struggle of hiring or training a team of experts and spending months to develop, operate and maintain a customized monitoring infrastructure to scrape and centrally aggregate GPU metrics.
- The complexity of developing and maintaining an integration with corporate single sign-on (SSO) systems to provide role-based access to metrics and dashboards.
- Accounting for the organization’s GPU-enabled workloads that are developed and maintained by external entities (e.g., partners and ISVs). These entities also need visibility into GPU metrics to ensure the workloads are performing optimally.