Executive Summary
Cost Events — Last 7 Days
8 events

| Date | Event | Namespace | Delta | Status |
|---|---|---|---|---|
| Mar 18 | Deployment scaled up (8→24 pods) | payments-svc | +$1,840 | Critical |
| Mar 17 | PVC storage class changed | analytics | +$320 | Warning |
| Mar 16 | Idle pods terminated (rightsizing) | ml-training | -$640 | Saved |
| Mar 15 | Spot instance reclaimed, fallback | batch-jobs | +$210 | Warning |
| Mar 14 | Reserved instance applied | frontend | -$2,100 | Saved |
| Mar 13 | New workload deployed | data-pipelines | +$780 | Info |
Cluster Cost Analysis
Node Details
48 nodes

| Node | Instance Type | Pool | CPU Util | Mem Util | Daily Cost | Billing |
|---|---|---|---|---|---|---|
Namespace Analysis
Resource Efficiency
(avg actual / requested)
(avg actual / requested)
(utilization rate)
Workloads with Oversized Requests
18 workloads

| Workload | Namespace | CPU Req | CPU Avg | Mem Req | Mem Avg | Waste/mo | Action |
|---|---|---|---|---|---|---|---|
Idle Resources
All Idle Resources
34 items

| Resource | Type | Namespace | Idle Since | CPU (unused) | Mem (unused) | Cost/mo | Sev |
|---|---|---|---|---|---|---|---|
Rightsizing Recommendations
Rightsizing Queue
68 items

| Workload | Namespace | Current CPU | Rec CPU | Current Mem | Rec Mem | Savings/mo | Confidence |
|---|---|---|---|---|---|---|---|
Workload Breakdown
Workload Cost Table
89 workloads

| Workload | Namespace | Type | Pods | CPU Cost | Mem Cost | Total/mo | Efficiency |
|---|---|---|---|---|---|---|---|
Network Cost Analysis
Storage Cost Analysis
Chargeback / Team Attribution
Chargeback Summary
| Team | Namespace(s) | Budget | Actual | Variance | % Used | Status |
|---|---|---|---|---|---|---|
Alerts & Anomalies
Cost Spike — payments-svc
Spend increased 312% vs 7-day rolling avg. 24 pods running vs typical 8.
Root cause: load test not terminated. Cluster autoscaler added 6 nodes.
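The percentage in this alert can be reproduced by comparing current spend against a trailing mean. This is a minimal sketch, assuming daily spend figures are available as a plain list; the function name and the sample numbers are illustrative and are not taken from the dashboard's data.

```python
from statistics import mean


def pct_change_vs_rolling_avg(history, current, window=7):
    """Percent increase of `current` over the mean of the last `window` values."""
    if len(history) < window:
        raise ValueError("not enough history for the requested window")
    baseline = mean(history[-window:])
    if baseline == 0:
        raise ValueError("baseline is zero; percent change is undefined")
    return (current - baseline) / baseline * 100.0


# Hypothetical daily spend (USD) for the previous 7 days, then today's spend.
previous_days = [100.0] * 7
today = 412.0
increase = pct_change_vs_rolling_avg(previous_days, today)
```

With these made-up inputs the increase works out to 312%, the same magnitude as the alert. A real check would likely also need a minimum absolute delta so that tiny namespaces do not trigger on noise.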
GPU Idle — ml-training namespace
4x NVIDIA T4 GPUs idle for 31 hours. Training job completed but pods not terminated.
$96/hour wasted on idle GPU nodes.
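As a rough worked figure, the accumulated waste is the hourly rate multiplied by the idle duration. The sketch below assumes the $96/hour figure covers all idle GPU nodes combined and stayed constant for the full 31 hours; neither assumption is confirmed by the alert.

```python
def idle_cost(hourly_rate_usd, idle_hours):
    """Cost accrued by resources that sat idle for `idle_hours`."""
    return hourly_rate_usd * idle_hours


# Values from the alert: $96/hour (assumed combined rate) over 31 idle hours.
total_wasted = idle_cost(96, 31)
```

Under those assumptions the idle period cost about $2,976.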
Budget Threshold — analytics team at 87%
analytics namespace consumed $20,880 of $24,000 monthly budget.
At current rate, will exceed budget in ~6 days.
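The threshold and runway in this alert follow from simple arithmetic. This sketch assumes a constant daily burn rate; the $520/day value is back-calculated from the "~6 days" estimate and is not stated anywhere in the dashboard.

```python
def percent_used(spent, budget):
    """Share of the budget consumed so far, in percent."""
    return spent / budget * 100.0


def days_until_exhausted(spent, budget, daily_burn):
    """Days until the budget runs out at a constant daily burn rate."""
    if daily_burn <= 0:
        raise ValueError("daily_burn must be positive")
    return max(budget - spent, 0) / daily_burn


budget = 24_000   # monthly budget from the alert (USD)
spent = 20_880    # consumed so far from the alert (USD)
daily_burn = 520  # assumed: implied by the ~6 day estimate, not a reported figure

used = percent_used(spent, budget)
runway = days_until_exhausted(spent, budget, daily_burn)
```

Here `used` comes to 87% and `runway` to 6 days, matching the alert. If spend is accelerating, a constant-rate projection will overestimate the remaining time.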
Orphaned PVCs Detected
3 PVCs unattached for >7 days in staging namespace. gp3 storage costing $420/mo.
Likely residue from deleted deployments.
Anomaly Detected — egress spike
data-pipelines namespace generated 3.2TB egress in 4h (10x baseline).
ML model detected statistical anomaly. Review data export jobs.
Savings Roadmap
Terminate Idle GPU Pods
4 GPU pods in ml-training idle 31h. Implement job completion hooks or auto-terminate after TTL.
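One way to get TTL-based cleanup is the `ttlSecondsAfterFinished` field on a Kubernetes `batch/v1` Job, which lets the control plane delete a finished Job and its pods after a delay. The sketch below builds such a manifest as a Python dict and serializes it to JSON, which `kubectl apply -f` accepts. It assumes the training workloads run as Jobs; the job name, image, container name, and one-hour TTL are hypothetical placeholders.

```python
import json


def training_job_manifest(name, image, ttl_seconds):
    """Build a batch/v1 Job manifest that is cleaned up after it finishes."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "namespace": "ml-training"},
        "spec": {
            # Delete the Job (and its pods) this many seconds after completion.
            "ttlSecondsAfterFinished": ttl_seconds,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "trainer",  # hypothetical container name
                            "image": image,
                            "resources": {"limits": {"nvidia.com/gpu": 1}},
                        }
                    ],
                }
            },
        },
    }


# Hypothetical name and image; a one-hour TTL is an arbitrary example value.
manifest = training_job_manifest(
    "train-example", "registry.example.com/trainer:latest", 3600
)
manifest_json = json.dumps(manifest, indent=2)
```

This only helps when the training process actually exits. If the container keeps running after training completes, a completion hook or an `activeDeadlineSeconds` limit would be needed instead.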
Rightsize Overprovisioned Workloads
68 workloads using <40% of requested CPU/memory. Apply VPA recommendations progressively.
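The 40% criterion can be expressed as a filter on the ratio of average usage to requested resources. This is a minimal sketch: the `Workload` type and sample values are hypothetical, and it assumes a workload is flagged when either CPU or memory falls below the threshold, since the dashboard does not say whether both must.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    cpu_request: float  # cores
    cpu_avg: float      # cores
    mem_request: float  # GiB
    mem_avg: float      # GiB


def utilization(avg, requested):
    """Average usage as a fraction of the request (0.0 if nothing requested)."""
    return avg / requested if requested > 0 else 0.0


def oversized(workloads, threshold=0.40):
    """Names of workloads whose CPU or memory utilization is below threshold."""
    return [
        w.name
        for w in workloads
        if utilization(w.cpu_avg, w.cpu_request) < threshold
        or utilization(w.mem_avg, w.mem_request) < threshold
    ]


# Hypothetical sample data, not taken from the rightsizing queue.
sample = [
    Workload("api", 2.0, 0.5, 4.0, 3.0),     # CPU at 25% of request
    Workload("worker", 1.0, 0.8, 2.0, 1.5),  # CPU at 80%, memory at 75%
]
flagged = oversized(sample)
```

With this sample only `api` is flagged. Averages hide bursts, so a production filter would probably also compare peak or high-percentile usage before lowering requests.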
Purchase Reserved Instances
64% on-demand exposure on baseline workloads. 1yr RI commitment saves ~38% on stable nodes.
Reduce Cross-AZ Traffic
Topology-aware routing can reduce cross-AZ calls. Add pod affinity rules to co-locate chatty services.
Clean Orphaned PVCs
8 unattached volumes costing $1,120/mo. Implement PVC lifecycle policy + automated cleanup job.
Migrate to Spot for Batch Jobs
batch-jobs namespace runs on on-demand instances. Batch and fault-tolerant workloads are well suited to spot (70% discount).