How to Cut Cloud Bills by 40% Using Kubernetes Autoscaling
Most cloud bills are inflated by over-provisioned static resources sitting idle at 3am. Here is the exact autoscaling strategy I use to bring cloud costs down by 40% or more for my clients, no vendor lock-in required.
Alexis Morin
Senior DevOps & Go Engineer
12 March 2026
7 min read
The Problem With Static Resource Allocation
Most companies over-provision. They provision for peak traffic and pay for that peak capacity 24 hours a day, 7 days a week. For a startup running on GKE or AKS, this can easily mean you're wasting 40–60% of your cloud spend on resources that sit idle overnight and on weekends.
The fix is not magic. It's autoscaling, and most teams either skip it, configure it wrong, or only implement half of it.
Here is the strategy I use to cut cloud costs for my clients, based on real implementations at Fortune 500 companies and high-growth startups.
Step 1: Know What You're Paying For
Before touching a single YAML file, audit your current resource usage:
# See actual CPU/memory usage vs requested
kubectl top nodes
kubectl top pods --all-namespaces --sort-by=cpu

Then compare this to your cloud provider's cost explorer. In most cases, you'll find:
- Nodes running at 15–25% average utilization
- Pods with requests 3–5x higher than actual usage
- Dev/staging environments that never scale down at night
This audit alone will show you where the money is going.
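To put a number on the gap, divide actual usage by requested capacity. A quick awk sketch with made-up sample values (a pod that requests 1 CPU but, per kubectl top, uses only 180m):

```shell
# Hypothetical audit numbers: requested vs actual CPU for one pod
requested_mcpu=1000  # from the pod spec: 1 CPU = 1000m
actual_mcpu=180      # from `kubectl top pods`

awk -v r="$requested_mcpu" -v a="$actual_mcpu" \
  'BEGIN { printf "CPU utilization: %.0f%% of request\n", a / r * 100 }'
```

Anything far below 100% here is capacity you're paying for but not using.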
Step 2: Horizontal Pod Autoscaler (HPA)
HPA scales the number of pod replicas based on CPU or memory usage. This is the most impactful change for workloads with variable traffic.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60

Key points:
- Never set minReplicas to 1 for production workloads. You'll have a single point of failure during scale events
- 60% CPU target is a safe sweet spot: aggressive enough to scale before saturation, conservative enough to avoid thrashing
- For Go services, CPU utilization is a reliable proxy for load
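If you do see thrashing, the autoscaling/v2 API exposes a behavior field for tuning scale velocity. A sketch (the field names are from the v2 API; the values are illustrative, not a recommendation for your workload):

```yaml
# Add under spec: in the HPA above
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300 # wait 5 min of sustained low load before scaling down
    policies:
    - type: Percent
      value: 50          # remove at most half the replicas per minute
      periodSeconds: 60
  scaleUp:
    stabilizationWindowSeconds: 0   # scale up immediately when load rises
```

A longer scale-down window trades a little idle cost for stability during bursty traffic.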
Step 3: Cluster Autoscaler for Node-Level Scaling
HPA scales pods, but Cluster Autoscaler scales nodes. Without it, your HPA will schedule pods that remain in "Pending" because there's no node to place them on.
# Terraform: GKE node pool with autoscaling
resource "google_container_node_pool" "primary" {
  name    = "primary"
  cluster = google_container_cluster.main.name

  autoscaling {
    min_node_count = 1
    max_node_count = 10
  }

  node_config {
    machine_type = "e2-standard-4"
    preemptible  = true # 60–80% cheaper
  }
}

The preemptible = true line is important. Preemptible (GCP) or spot (AWS/Azure) nodes are 60–80% cheaper for workloads that can tolerate interruption. For stateless Go microservices, this is a no-brainer.
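Keep anything that can't tolerate interruption off those nodes. On GKE, preemptible nodes carry the cloud.google.com/gke-preemptible label, so a node affinity rule can steer sensitive pods elsewhere. A pod spec fragment as a sketch:

```yaml
# Pod spec fragment: keep interruption-sensitive workloads off preemptible nodes
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: cloud.google.com/gke-preemptible
          operator: DoesNotExist
```

On AWS or Azure the same pattern applies with the respective spot node labels.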
Step 4: Vertical Pod Autoscaler (VPA) for Right-Sizing
VPA analyzes historical usage and recommends (or applies) optimal CPU and memory requests. This is where you fix the "over-requested resources" problem.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off" # Start with "Off", just get recommendations first

Start VPA in recommendation mode (updateMode: "Off") and let it run for a week. Then check its recommendations:
kubectl describe vpa api-vpa

You'll typically find that services requested 1 CPU but only use 100–200m in practice. Apply those recommendations gradually.
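Applying a recommendation is just editing the Deployment's resource requests. A sketch, assuming VPA suggested roughly 200m CPU and 256Mi memory for the hypothetical api deployment:

```yaml
# In the api Deployment's container spec (values taken from the VPA recommendation)
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    memory: 512Mi # keep a memory limit; many teams leave CPU unlimited to avoid throttling
```

Roll this out one service at a time and watch for OOMKills before moving to the next.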
Step 5: Schedule-Based Scaling for Dev/Staging
Production needs HPA. Dev and staging just need to turn off at night.
# Scale down at 8pm, back up at 8am
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-dev
spec:
  schedule: "0 20 * * 1-5"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure # required; Job pods can't use the default Always
          containers:
          - name: kubectl
            image: bitnami/kubectl
            command:
            - kubectl
            - scale
            - deployment/api
            - --replicas=0
            - -n
            - dev

This alone can save 30–50% on dev/staging costs.
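The "back up at 8am" half is a mirror-image job. A sketch of the companion CronJob (same hypothetical api deployment and dev namespace; note that in both jobs the pod needs a service account with RBAC permission to scale deployments):

```yaml
# Companion job: scale back up at 8am on weekdays
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-up-dev
spec:
  schedule: "0 8 * * 1-5"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: kubectl
            image: bitnami/kubectl
            command: ["kubectl", "scale", "deployment/api", "--replicas=2", "-n", "dev"]
```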
Real Results
At IQVIA, implementing this strategy across dev/UAT/prod environments cut CI/CD execution time by 40% and significantly reduced idle resource waste. For a pharma startup I worked with, moving their Kubernetes workloads from static node pools to autoscaled spot instances cut their GCP bill by $8,000 a month.
The math is simple: if you're running 10 nodes 24/7 at $500/month each ($5,000 total), and autoscaling drops you to 4 nodes during off-hours (12 of 24 hours on weekdays, plus the whole weekend, which is 108 of the week's 168 hours), you save roughly $1,900/month from Cluster Autoscaler alone: 6 fewer nodes for about 64% of the time.
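Spelled out as a quick sanity check in Python (the node count and price are the example figures above):

```python
node_cost = 500            # $ per node per month
peak_nodes, offpeak_nodes = 10, 4

# Off-peak: 12 of 24 hours on each weekday, plus the whole weekend
offpeak_hours = 12 * 5 + 48                  # 108 hours
offpeak_fraction = offpeak_hours / (24 * 7)  # ~64% of the week

savings = (peak_nodes - offpeak_nodes) * node_cost * offpeak_fraction
print(f"${savings:,.0f}/month saved")        # → $1,929/month saved
```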
Add VPA right-sizing and spot nodes, and you're looking at 40%+ savings without changing a single line of application code.
Getting Started
The hardest part is not the configuration, it's the audit. Most teams don't know their actual resource utilization until they look.
If you want a free infrastructure audit to see exactly where your cloud budget is going, I offer a 15-minute call where I'll walk through your setup and identify the highest-impact optimizations. No commitment required.