> **TIP**
> Explore the key features of the Monitoring & Ops module designed for Inference Services. This overview introduces the core capabilities that help users efficiently monitor, analyze, and optimize AI service operations.
## Features Overview
### Logging
- **Real-time Pod Logs**
  Stream logs from the Replica pods behind an inference service in real time. Debug issues instantly and track service behavior across deployments (see the sketch below).
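As a minimal sketch of how such a stream can be consumed programmatically (the module provides its own log viewer; this only illustrates the underlying mechanism), the following uses the official Kubernetes Python client to follow a pod's log. The pod and namespace names are hypothetical placeholders.

```python
from kubernetes import client, config, watch

# Assumes a kubeconfig with access to the cluster; when running inside
# a pod, use config.load_incluster_config() instead.
config.load_kube_config()
v1 = client.CoreV1Api()

# Hypothetical names for an inference-service replica pod.
POD_NAME = "my-inference-svc-replica-0"
NAMESPACE = "inference"

# watch.Watch().stream() yields log lines as they are produced,
# mirroring `kubectl logs -f`.
w = watch.Watch()
for line in w.stream(v1.read_namespaced_pod_log,
                     name=POD_NAME, namespace=NAMESPACE,
                     follow=True, tail_lines=100):
    print(line)
```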
### Monitoring
#### Resource Monitor
- **CPU/Memory Utilization**
  Track CPU and memory usage metrics for inference services to optimize resource allocation and prevent bottlenecks (see the sketch below).
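The docs don't specify the metrics backend, but if the cluster exposes cAdvisor metrics through Prometheus (a common setup), a sketch like the following could pull comparable CPU and memory figures over the Prometheus HTTP API. The endpoint URL and pod label pattern are assumptions.

```python
import requests

# Hypothetical Prometheus endpoint; adjust to your cluster (assumption).
PROM_URL = "http://prometheus.monitoring:9090/api/v1/query"

def instant_query(promql: str) -> list:
    """Run a PromQL instant query and return the result vector."""
    resp = requests.get(PROM_URL, params={"query": promql}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Standard cAdvisor metric names; the pod label regex is a placeholder.
cpu = instant_query(
    'rate(container_cpu_usage_seconds_total{pod=~"my-inference-svc-.*"}[5m])')
mem = instant_query(
    'container_memory_working_set_bytes{pod=~"my-inference-svc-.*"}')

for s in cpu:
    print(s["metric"].get("pod"), "CPU (cores):", s["value"][1])
for s in mem:
    print(s["metric"].get("pod"), "memory (bytes):", s["value"][1])
```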
#### Computing Monitor
- **GPU Metrics & VRAM**
  Monitor GPU compute utilization and video memory (VRAM) consumption to ensure efficient hardware usage for accelerated workloads (see the sketch below).
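For a sense of where these numbers come from, here is a minimal node-local sketch using NVIDIA's NVML Python bindings (`pip install nvidia-ml-py`). The module itself presumably aggregates similar data cluster-wide, so treat this as illustration only.

```python
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # .gpu is % busy
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes: .total/.used/.free
        print(f"GPU {i}: compute {util.gpu}% | "
              f"VRAM {mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB")
finally:
    pynvml.nvmlShutdown()
```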
#### Other Monitor
- **Token Throughput**
  Measure token processing rates to evaluate model performance and scalability.
- **Request Traffic Analytics**
  Analyze request volume and latency, and track successful and failed requests per second (QPS) to maintain service reliability and meet SLAs (see the sketch after this list).
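To make these two metrics concrete, below is a small in-process sketch that derives token throughput and success/failure QPS from a sliding window of request events. All names are hypothetical; a production service would typically export counters and let the monitoring backend compute the rates.

```python
import time
from collections import deque
from dataclasses import dataclass

@dataclass
class Event:
    t: float      # completion timestamp (seconds)
    tokens: int   # tokens generated by the request
    ok: bool      # whether the request succeeded

class TrafficWindow:
    """Sliding-window token throughput and QPS tracker (illustrative)."""

    def __init__(self, window_s: float = 60.0):
        self.window_s = window_s
        self.events: deque[Event] = deque()

    def record(self, tokens: int, ok: bool) -> None:
        self.events.append(Event(time.monotonic(), tokens, ok))

    def snapshot(self) -> dict:
        now = time.monotonic()
        # Drop events that have fallen out of the window.
        while self.events and now - self.events[0].t > self.window_s:
            self.events.popleft()
        toks = sum(e.tokens for e in self.events)
        ok = sum(e.ok for e in self.events)
        fail = len(self.events) - ok
        return {
            "tokens_per_s": toks / self.window_s,
            "success_qps": ok / self.window_s,
            "failure_qps": fail / self.window_s,
        }

# Example: record a few synthetic requests, then read the gauges.
win = TrafficWindow(window_s=10.0)
win.record(tokens=128, ok=True)
win.record(tokens=256, ok=True)
win.record(tokens=0, ok=False)
print(win.snapshot())
```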