Active Alerts

🔴 Cost Spike Detected

payments-svc spend up 312% vs 7d avg. CPU throttling likely cause.

🟡 High Idle CPU

ml-training namespace: 68% requested CPU unused for 48h.

🟡 PVC Orphaned

3 persistent volumes unattached for >7 days — $420/mo wasted.

🔵 Quota Approaching

analytics namespace at 87% of monthly budget limit ($24k).

Executive Summary
Top-Line KPIs
💵
Total Cluster Spend
$84,320
↑ 9.2% vs last month
💤
Idle Resource Waste
$18,740
↑ 14% of total spend
Efficiency Score
62%
↑ 3.1% vs last month
💰
Potential Savings
$23,100
identified this cycle
📈 Daily Spend Trend
Total cluster cost vs budget · 30 days
🍩 Cost by Category
Compute / Storage / Network breakdown
🏷️ Top Namespaces by Cost
Monthly spend ranked · all teams
🎯 Resource Utilization
CPU / Memory / Storage avg
Recent Activity

Cost Events — Last 7 Days

8 events
Date | Event | Namespace | Delta | Status
Mar 18 | Deployment scaled up (8→24 pods) | payments-svc | +$1,840 | Critical
Mar 17 | PVC storage class changed | analytics | +$320 | Warning
Mar 16 | Idle pods terminated (rightsizing) | ml-training | -$640 | Saved
Mar 15 | Spot instance reclaimed, fallback | batch-jobs | +$210 | Warning
Mar 14 | Reserved instance applied | frontend | -$2,100 | Saved
Mar 13 | New workload deployed | data-pipelines | +$780 | Info
🖥️
Node Count
48
↑ 6 vs last month
💻
Compute Cost
$52,100
↑ 7.4% · CPU + RAM
☁️
On-Demand vs Reserved
64%
on-demand exposure
🏷️
Spot Savings
$8,420
↓ cost vs on-demand
🏗️ Cost by Node Pool
Spend distribution across node groups
⏱️ Hourly Cost Heatmap
Average hourly spend by day of week
Node Inventory

Node Details

48 nodes
Node | Instance Type | Pool | CPU Util | Mem Util | Daily Cost | Billing
📅 Namespace Cost Over Time
Stacked spend — top 5 namespaces · 30d
💡 Request vs Actual Usage
Requested resources vs actual consumption
41%
CPU EFFICIENCY
(avg actual / requested)
58%
MEMORY EFFICIENCY
(avg actual / requested)
29%
GPU EFFICIENCY
(utilization rate)
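The CPU and memory figures above are labeled as average actual usage divided by requested resources. A minimal sketch of that ratio follows, assuming it is taken over aggregate usage and aggregate requests. The function name and the sample numbers are hypothetical and are not taken from the dashboard's implementation.

```python
def efficiency_pct(avg_used: float, requested: float) -> float:
    """Return average usage as a percentage of requested resources."""
    if requested <= 0:
        raise ValueError("requested must be positive")
    return round(avg_used / requested * 100, 1)


# Hypothetical workload: 410 millicores used on average, 1000 requested.
print(efficiency_pct(410, 1000))  # 41.0
```

A low percentage here means the scheduler is reserving capacity that the workload rarely touches, which is what the rightsizing views try to surface.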
📉 CPU: Requested vs Used (7d avg)
By workload · millicores
📉 Memory: Requested vs Used (7d avg)
By workload · GiB
Limit vs Request Gaps

Workloads with Oversized Requests

18 workloads
Workload | Namespace | CPU Req | CPU Avg | Mem Req | Mem Avg | Waste/mo | Action
Idle Pods
34
$4,200/mo
Orphaned PVCs
8
$1,120/mo
Zero-Traffic Svcs
12
$890/mo
Unused ConfigMaps
67
Housekeeping
Total Idle Waste
$6,210
per month
💤 Idle Cost Trend
Rolling 30-day idle resource spend
📊 Idle by Namespace
Which teams own the most idle resources
Idle Resource Inventory

All Idle Resources

34 items
Resource | Type | Namespace | Idle Since | CPU (unused) | Mem (unused) | Cost/mo | Sev
Total Savings if Applied
$12,840
per month
Workloads Analyzed
142
across all namespaces
Overprovisioned
68
need downsizing
Underprovisioned
9
risk of OOM / throttle
🏆 Top Rightsizing Opportunities
Estimated monthly savings by workload
🔵 Provision State Distribution
Over / under / correctly provisioned
Recommendations

Rightsizing Queue

68 items
Workload | Namespace | Current CPU | Rec CPU | Current Mem | Rec Mem | Savings/mo | Confidence
🚀 Cost by Deployment
Top 8 most expensive
📦 Cost by Controller Type
Deployment / DaemonSet / Job / CronJob
📈 Pod Count vs Cost
Correlation over 30 days
All Workloads

Workload Cost Table

89 workloads
Workload | Namespace | Type | Pods | CPU Cost | Mem Cost | Total/mo | Efficiency
Total Network Cost
$9,840
↑ 18%
External Egress
$5,120
internet outbound
Cross-AZ Traffic
$3,400
avoidable
Intra-Cluster
$1,320
pod-to-pod
🌐 Network Cost Breakdown
Egress / cross-AZ / intra-cluster · 30d trend
🗺️ Top Egress Sources
By service generating most outbound traffic
Total Storage Cost
$12,380
↑ 5.2%
Total PVC Capacity
48 TB
provisioned
Avg Utilization
54%
of provisioned
Orphaned PVCs
8
$1,120/mo waste
💾 Cost by Storage Class
gp3 / io1 / standard / ephemeral
📊 PVC Usage vs Provisioned
Top 10 PVCs — actual vs claimed capacity
Teams Tracked
9
active this period
Over Budget
2
teams this month
Tagged Resources
91%
attribution coverage
Unallocated Overhead
$6,240
shared infra
🏢 Team Spend vs Budget
Actual vs allocated budget · all teams
📅 Monthly Chargeback Trend
Team spend history · 6 months
Team Budget Status

Chargeback Summary

Team | Namespace(s) | Budget | Actual | Variance | % Used | Status
Critical Alerts
4
require action
Warnings
11
monitor closely
Anomalies (ML)
3
this week
Resolved (7d)
18
closed alerts
Active Alerts
🔴

Cost Spike — payments-svc

Spend increased 312% vs 7-day rolling avg. 24 pods running vs typical 8.
Root cause: load test not terminated. Cluster autoscaler added 6 nodes.

CRITICAL
2h ago
$1,840 excess
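The spike alert above compares current spend with a 7-day rolling average. One way that comparison could be computed is sketched here, assuming a simple percent change against the mean of the previous seven days. The function name and the daily figures are illustrative, not the dashboard's actual data or detection logic.

```python
from statistics import fmean


def spike_pct(today: float, previous_days: list[float]) -> float:
    """Percent change of today's spend versus the mean of prior days."""
    baseline = fmean(previous_days)
    if baseline <= 0:
        raise ValueError("baseline spend must be positive")
    return (today - baseline) * 100 / baseline


# Hypothetical daily spend in dollars; not the real payments-svc series.
last_7_days = [100.0] * 7
print(spike_pct(412.0, last_7_days))  # 312.0
```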
🔴

GPU Idle — ml-training namespace

4x NVIDIA T4 GPUs idle for 31 hours. Training job completed but pods not terminated.
$96/hour wasted on idle GPU nodes.

CRITICAL
31h ago
$2,976 wasted
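The wasted amount in this alert follows from the hourly rate and the idle duration. A small sketch of the arithmetic, using only the figures stated in the alert (the function name is hypothetical):

```python
def idle_cost(hourly_rate: float, idle_hours: float) -> float:
    """Cost accrued by a resource that sat idle for idle_hours."""
    return hourly_rate * idle_hours


# Figures from the alert: $96/hour for 31 hours.
print(idle_cost(96, 31))  # 2976
```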
🟡

Budget Threshold — analytics team at 87%

analytics namespace consumed $20,880 of $24,000 monthly budget.
At the current rate, spend will exceed the budget in ~6 days.

WARNING
4h ago
87% used
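The "days until exceeded" estimate can be reproduced as remaining budget divided by a daily burn rate. The sketch below assumes a constant burn rate. The budget and spend come from the alert, but the $520/day rate is an assumed value, chosen only because it is consistent with the "~6 days" estimate; the dashboard does not state it.

```python
def days_until_exceeded(budget: float, spent: float, daily_rate: float) -> float:
    """Days of spend remaining before the budget is exhausted."""
    if daily_rate <= 0:
        raise ValueError("daily_rate must be positive")
    return max(budget - spent, 0) / daily_rate


# Budget and spend are from the alert; the daily rate is an assumption.
print(days_until_exceeded(24_000, 20_880, 520))  # 6.0
print(round(20_880 / 24_000 * 100))              # 87
```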
🟡

Orphaned PVCs Detected

3 PVCs unattached for >7 days in staging namespace. gp3 storage costing $420/mo.
Likely residue from deleted deployments.

WARNING
2d ago
$420/mo
🔵

Anomaly Detected — egress spike

data-pipelines namespace generated 3.2TB egress in 4h (10x baseline).
ML model detected statistical anomaly. Review data export jobs.

ANOMALY
6h ago
$640 excess
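The alert does not say which model flagged the egress spike. As a stand-in, a basic z-score check against recent history is sketched below. The threshold, the function name, and the baseline values are all assumptions for illustration and should not be read as the dashboard's detector.

```python
from statistics import fmean, stdev


def is_anomalous(value: float, history: list[float], z_threshold: float = 3.0) -> bool:
    """Flag value when it lies more than z_threshold std devs above the mean."""
    mean = fmean(history)
    spread = stdev(history)
    if spread == 0:
        return value > mean
    return (value - mean) / spread > z_threshold


# Hypothetical egress per 4h window, in TB; baseline of roughly 0.32 TB.
baseline_tb = [0.30, 0.32, 0.34, 0.31, 0.33]
print(is_anomalous(3.2, baseline_tb))   # True
print(is_anomalous(0.33, baseline_tb))  # False
```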
Alert History
📊 Alert Volume — 30 Days
Critical / warning / anomaly counts per day
Quick Wins (this week)
$6,420
immediate action
This Month
$10,840
medium effort
This Quarter
$23,100
strategic changes
Annual Impact
$277K
if all applied
Top Opportunities
🎯

Terminate Idle GPU Pods

4 GPU pods in ml-training idle 31h. Implement job completion hooks or auto-terminate after TTL.

$3,456/mo
⚡ Quick win · 1h effort
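If the training runs are Kubernetes Jobs, the auto-terminate-after-TTL idea can be expressed with the Job field `spec.ttlSecondsAfterFinished`, which lets the cluster delete a finished Job and its pods after the given delay. The sketch below patches a manifest held as a Python dict. The Job name, image, and the one-hour TTL are placeholders, and the approach does not apply if the idle pods are owned by a Deployment or created directly.

```python
import copy


def with_ttl(job_manifest: dict, ttl_seconds: int) -> dict:
    """Return a copy of a Job manifest with ttlSecondsAfterFinished set."""
    patched = copy.deepcopy(job_manifest)
    patched.setdefault("spec", {})["ttlSecondsAfterFinished"] = ttl_seconds
    return patched


# Hypothetical training Job; name, image, and TTL are placeholders.
job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "train-model", "namespace": "ml-training"},
    "spec": {
        "template": {
            "spec": {
                "containers": [{"name": "trainer", "image": "example/trainer:latest"}],
                "restartPolicy": "Never",
            }
        }
    },
}

patched = with_ttl(job, ttl_seconds=3600)
print(patched["spec"]["ttlSecondsAfterFinished"])  # 3600
```

The original manifest is left untouched, so the patched copy can be reviewed or diffed before it is applied.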
⬇️

Rightsize Overprovisioned Workloads

68 workloads using <40% of requested CPU/memory. Apply VPA recommendations progressively.

$12,840/mo
📋 Medium effort · 1 week
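The "<40% of requested CPU/memory" rule can be sketched as a filter over per-workload usage. This version assumes a workload is flagged when either its CPU or its memory ratio falls below the threshold; the card does not say whether both are required. The `Workload` type and the sample rows are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    cpu_requested_m: float    # millicores requested
    cpu_used_m: float         # average millicores used
    mem_requested_gib: float  # GiB requested
    mem_used_gib: float       # average GiB used


def is_overprovisioned(w: Workload, threshold: float = 0.40) -> bool:
    """True when CPU or memory usage is below threshold of the request."""
    cpu_ratio = w.cpu_used_m / w.cpu_requested_m
    mem_ratio = w.mem_used_gib / w.mem_requested_gib
    return cpu_ratio < threshold or mem_ratio < threshold


workloads = [
    Workload("api", 2000, 500, 4.0, 3.0),     # CPU at 25% of request
    Workload("worker", 1000, 800, 2.0, 1.5),  # both ratios above 40%
]
flagged = [w.name for w in workloads if is_overprovisioned(w)]
print(flagged)  # ['api']
```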
🏷️

Purchase Reserved Instances

64% on-demand exposure on baseline workloads. 1yr RI commitment saves ~38% on stable nodes.

$8,200/mo
📋 Low effort · Billing change
🌐

Reduce Cross-AZ Traffic

Topology-aware routing can reduce cross-AZ calls. Add pod affinity rules to co-locate chatty services.

$3,400/mo
🔧 Medium effort · 2 weeks
💾

Clean Orphaned PVCs

8 unattached volumes costing $1,120/mo. Implement PVC lifecycle policy + automated cleanup job.

$1,120/mo
⚡ Quick win · 2h effort
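An automated cleanup job first needs a rule for what counts as orphaned. The sketch below assumes an inventory that records when each claim was last detached. The `detached_at` field is hypothetical (Kubernetes does not store it on the PVC object), as are the claim names and dates. The function only selects candidates; deletion would be a separate, reviewed step.

```python
from datetime import datetime, timedelta


def find_orphaned(pvcs, mounted_claims, now, min_age=timedelta(days=7)):
    """Return names of PVCs that no pod mounts and that exceed min_age detached.

    pvcs: list of dicts with 'name' and 'detached_at' (datetime).
    mounted_claims: set of claim names currently referenced by pods.
    """
    return [
        p["name"]
        for p in pvcs
        if p["name"] not in mounted_claims and now - p["detached_at"] > min_age
    ]


# Hypothetical inventory; names and timestamps are made up.
now = datetime(2024, 3, 18)
pvcs = [
    {"name": "data-old", "detached_at": datetime(2024, 3, 1)},   # 17 days
    {"name": "cache-tmp", "detached_at": datetime(2024, 3, 15)}, # 3 days
    {"name": "db-main", "detached_at": datetime(2024, 2, 1)},    # still mounted
]
print(find_orphaned(pvcs, mounted_claims={"db-main"}, now=now))  # ['data-old']
```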
☁️

Migrate to Spot for Batch Jobs

batch-jobs namespace runs on on-demand nodes. Batch and fault-tolerant workloads are ideal for spot (70% discount).

$4,200/mo
🔧 Medium effort · 3 weeks
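The savings estimate is the on-demand cost of the migrated workloads multiplied by the spot discount. In the sketch below, the $6,000 on-demand figure is an assumption back-derived from the stated $4,200/mo and 70% discount; the dashboard does not report it, and real spot discounts vary by instance type and over time.

```python
def spot_savings(on_demand_monthly: float, discount: float) -> float:
    """Monthly savings from moving on-demand spend to spot at a given discount."""
    if not 0 <= discount <= 1:
        raise ValueError("discount must be between 0 and 1")
    return round(on_demand_monthly * discount, 2)


# Assumed on-demand spend for batch-jobs; 70% discount is from the card.
print(spot_savings(6000, 0.70))  # 4200.0
```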
Implementation Timeline
Week 1 — Immediate
Terminate Idle GPU Pods + Clean Orphaned PVCs
No-risk removals. Set up auto-TTL on training jobs. Run PVC audit script. Expected save: $4,576/mo.
Weeks 2–3 — Short Term
Begin Rightsizing Rollout (Wave 1 — dev namespaces)
Apply VPA recommendations to non-prod first. Validate stability. Expected save: $3,200/mo from dev alone.
Month 1 — Medium Term
Rightsizing Wave 2 (prod) + Spot Migration for Batch
Roll rightsizing to prod with canary. Migrate batch-jobs to spot node group. Expected save: $13,640/mo.
Month 2 — Strategic
Reserved Instance Purchase + Topology-Aware Routing
Commit to 1yr RIs for stable baseline capacity. Deploy topology hints in service mesh. Expected save: $11,600/mo.
Quarter End — Ongoing
FinOps Culture: Budget Alerts + Team Dashboards
Embed cost metrics in CI/CD. Teams own their spend. Quarterly review cadence established.