How to Cut Cloud Bills by 40% Using Kubernetes Autoscaling
Most cloud bills are inflated by over-provisioned static resources sitting idle at 3am. Here is the exact autoscaling strategy I use to bring cloud costs down by 40% or more for my clients, no vendor lock-in required.
Alexis Morin
Senior DevOps & Go Engineer
12 March 2026
7 min read
The Problem With Static Resource Allocation
Most companies over-provision. They provision for peak traffic and pay for that peak capacity 24 hours a day, 7 days a week. For a startup running on GKE or AKS, this can easily mean you're wasting 40–60% of your cloud spend on resources that sit idle overnight and on weekends.
The fix is not magic. It's autoscaling, and most teams either skip it, configure it wrong, or only implement half of it.
Here is the strategy I use to cut cloud costs for my clients, based on real implementations at Fortune 500 companies and high-growth startups.
Step 1: Know What You're Paying For
Before touching a single YAML file, audit your current resource usage:
# See actual CPU/memory usage vs requested
kubectl top nodes
kubectl top pods --all-namespaces --sort-by=cpu

Then compare this to your cloud provider's cost explorer. In most cases, you'll find:
- Nodes running at 15–25% average utilization
- Pods with requests 3–5x higher than actual usage
- Dev/staging environments that never scale down at night
This audit alone will show you where the money is going.
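To put a number on the gap, divide actual usage by requested capacity. A quick awk sketch with made-up sample values (a pod that requests 1 CPU but, per kubectl top, uses only 180m):

```shell
# Hypothetical audit numbers: requested vs actual CPU for one pod
requested_mcpu=1000  # from the pod spec: 1 CPU = 1000m
actual_mcpu=180      # from `kubectl top pods`

awk -v r="$requested_mcpu" -v a="$actual_mcpu" \
  'BEGIN { printf "CPU utilization: %.0f%% of request\n", a / r * 100 }'
```

Anything far below 100% here is capacity you're paying for but not using.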
Step 2: Horizontal Pod Autoscaler (HPA)
HPA scales the number of pod replicas based on CPU or memory usage. This is the most impactful change for workloads with variable traffic.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60

Key points:
- Never set minReplicas to 1 for production workloads. You'll have a single point of failure during scale events
- 60% CPU target is a safe sweet spot: aggressive enough to scale before saturation, conservative enough to avoid thrashing
- For Go services, CPU utilization is a reliable proxy for load
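If you do see thrashing, the autoscaling/v2 API exposes a behavior field for tuning scale velocity. A sketch (the field names are from the v2 API; the values are illustrative, not a recommendation for your workload):

```yaml
# Add under spec: in the HPA above
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300 # wait 5 min of sustained low load before scaling down
    policies:
    - type: Percent
      value: 50          # remove at most half the replicas per minute
      periodSeconds: 60
  scaleUp:
    stabilizationWindowSeconds: 0   # scale up immediately when load rises
```

A longer scale-down window trades a little idle cost for stability during bursty traffic.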
Step 3: Cluster Autoscaler for Node-Level Scaling
HPA scales pods, but Cluster Autoscaler scales nodes. Without it, your HPA will schedule pods that remain in "Pending" because there's no node to place them on.
# Terraform: GKE node pool with autoscaling
resource "google_container_node_pool" "primary" {
  name    = "primary"
  cluster = google_container_cluster.main.name

  autoscaling {
    min_node_count = 1
    max_node_count = 10
  }

  node_config {
    machine_type = "e2-standard-4"
    preemptible  = true # 60–80% cheaper
  }
}

The preemptible = true line is important. Preemptible (GCP) or spot (AWS/Azure) nodes are 60–80% cheaper for workloads that can tolerate interruption. For stateless Go microservices, this is a no-brainer.
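Keep anything that can't tolerate interruption off those nodes. On GKE, preemptible nodes carry the cloud.google.com/gke-preemptible label, so a node affinity rule can steer sensitive pods elsewhere. A pod spec fragment as a sketch:

```yaml
# Pod spec fragment: keep interruption-sensitive workloads off preemptible nodes
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: cloud.google.com/gke-preemptible
          operator: DoesNotExist
```

On AWS or Azure the same pattern applies with the respective spot node labels.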
Step 4: Vertical Pod Autoscaler (VPA) for Right-Sizing
VPA analyzes historical usage and recommends (or applies) optimal CPU and memory requests. This is where you fix the "over-requested resources" problem.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off" # Start with "Off", just get recommendations first

Start VPA in recommendation mode (updateMode: "Off") and let it run for a week. Then check its recommendations:
kubectl describe vpa api-vpa

You'll typically find that services requested 1 CPU but only use 100–200m in practice. Apply those recommendations gradually.
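Applying a recommendation is just editing the Deployment's resource requests. A sketch, assuming VPA suggested roughly 200m CPU and 256Mi memory for the hypothetical api deployment:

```yaml
# In the api Deployment's container spec (values taken from the VPA recommendation)
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    memory: 512Mi # keep a memory limit; many teams leave CPU unlimited to avoid throttling
```

Roll this out one service at a time and watch for OOMKills before moving to the next.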
Step 5: Schedule-Based Scaling for Dev/Staging
Production needs HPA. Dev and staging just need to turn off at night.
# Scale down at 8pm, back up at 8am
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-dev
spec:
  schedule: "0 20 * * 1-5"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure # required; Job pods can't use the default Always
          containers:
          - name: kubectl
            image: bitnami/kubectl
            command:
            - kubectl
            - scale
            - deployment/api
            - --replicas=0
            - -n
            - dev

This alone can save 30–50% on dev/staging costs.
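The "back up at 8am" half is a mirror-image job. A sketch of the companion CronJob (same hypothetical api deployment and dev namespace; note that in both jobs the pod needs a service account with RBAC permission to scale deployments):

```yaml
# Companion job: scale back up at 8am on weekdays
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-up-dev
spec:
  schedule: "0 8 * * 1-5"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: kubectl
            image: bitnami/kubectl
            command: ["kubectl", "scale", "deployment/api", "--replicas=2", "-n", "dev"]
```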
Real Results
At IQVIA, implementing this strategy across dev/UAT/prod environments cut CI/CD execution time by 40% and significantly reduced idle resource waste. For a pharma startup I worked with, moving their Kubernetes workloads from static node pools to autoscaled spot instances cut their GCP bill by $8,000 a month.
The math is simple: if you're running 10 nodes 24/7 at $500/month each ($5,000 total), and autoscaling drops you to 4 nodes during off-hours (12 of 24 hours on weekdays, plus the whole weekend, which is 108 of the week's 168 hours), you save roughly $1,900/month from Cluster Autoscaler alone: 6 fewer nodes for about 64% of the time.
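Spelled out as a quick sanity check in Python (the node count and price are the example figures above):

```python
node_cost = 500            # $ per node per month
peak_nodes, offpeak_nodes = 10, 4

# Off-peak: 12 of 24 hours on each weekday, plus the whole weekend
offpeak_hours = 12 * 5 + 48                  # 108 hours
offpeak_fraction = offpeak_hours / (24 * 7)  # ~64% of the week

savings = (peak_nodes - offpeak_nodes) * node_cost * offpeak_fraction
print(f"${savings:,.0f}/month saved")        # → $1,929/month saved
```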
Add VPA right-sizing and spot nodes, and you're looking at 40%+ savings without changing a single line of application code.
Getting Started
The hardest part is not the configuration, it's the audit. Most teams don't know their actual resource utilization until they look.
If you want a free infrastructure audit to see exactly where your cloud budget is going, I offer a 15-minute call where I'll walk through your setup and identify the highest-impact optimizations. No commitment required.