This is the final post in our series that helps ensure efficient usage of GKE environments. The first two posts focused on tuning applications and clusters to reduce spend. This post focuses on creating a centralized dashboard that gives better visibility across clusters in different projects.
As we wrap up, here’s the high-level strategy we’ve followed to tune our clusters:
Cost Optimization Process Overview
Phase 1 - Workload-Level Efficiencies (first blog)
- Review metrics: requested vs. actual CPU and memory utilization
- Identify candidate workloads for optimization
- Explore service usage patterns
- Rightsize the workload
- Revisit pod-level metrics to validate more efficient utilization
- Summary: GKE Autopilot
Phase 2 - Cluster-Level Efficiencies (previous blog)
- Review node metrics for requested vs. actual utilization
- Review whether the cluster scales up and down appropriately based on load
- Tune the cluster autoscaler and/or node auto-provisioning
- Review node-level metrics and observe improvements
- Efficient use of Committed Use Discounts (CUDs)
- Spot instances for appropriate workloads
Phase 3 - Extended Visibility (this blog)
- Central collection of resource utilization and recommendations
- Dashboard for multi-project and multi-cluster visibility
Don’t Read THIS Blog
Know Before You Continue Reading
I wrote this months back and debated whether to publish it at all, but I’ve decided to go ahead and quietly publish it (no announcements). Things shifted quickly in this space, and Google released a new repository, new architecture, and a new documentation approach for deploying a dashboard that provides this type of visibility.
Ameenah and many others on our Solution Architecture team have been huge contributors to the progress in GKE cost optimization at scale, and much of the approach shifted before I could publish.
I was looking at the differences in detail this morning and they’re fairly significant, so I’m declaring bankruptcy on this effort and won’t try to adapt the blog, since Ameenah has some very helpful resources for the new process.
My recommendation is to follow the new process at the links below and learn the new method for deploying dashboards at scale. Come back to this document only if you run into limitations, as I detail how this could work across multiple projects, and it’s an alternative approach that may help you build a dashboard for your environment.
Explore these Resources First (Instead of This Blog)
New Dashboard Deployment Process
https://cloud.google.com/kubernetes-engine/docs/tutorials/right-size-workloads-at-scale
A Google blog post that gives a deployment overview
https://cloud.google.com/blog/products/containers-kubernetes/optimize-gke-workloads-at-scale
A video walkthrough by Ameenah!
The Old/Alternative Method to Get Visibility
Overview
The Cost Optimization tab in the Google Cloud console, shown in previous posts, is useful for communicating utilization and efficiency information for clusters and services. It also makes suggestions for rightsizing applications, but it requires clicking into each service one at a time, which makes it challenging to use at scale.
The Google solution architecture team has created a proof-of-concept dashboard that demonstrates an approach to aggregating utilization and recommendation data from multiple projects and clusters into a single view. The centrally collected data is visualized through a provided Looker Studio dashboard. 🎉 With this as a starting point to build on, we can filter, sort, and quickly find the workloads where optimization will have the largest impact.
Note that the dashboard is not officially supported. It hasn’t been hardened, load tested, or verified to scale to thousands of clusters, and it may require customization to meet your needs. It’s meant as a starting point for a proof of concept.
Read the README
This is pretty standard advice that I usually follow 😛 Know that there’s a small gotcha, called out below, if you plan to pull metrics from multiple projects.
Clone the repo:
https://github.com/GoogleCloudPlatform/gke-cost-optimization-monitoring/tree/main/metrics-exporter
I recommend deploying from the main branch, but for reference, this post was documented against commit id 808011623f34ce2e3f9233ee4dc80ffe35cf7171.
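If you want to reproduce exactly what’s documented here, a minimal sketch (the checkout pin is optional; main is still my recommendation):
# Clone the repo and move into the metrics-exporter directory
git clone https://github.com/GoogleCloudPlatform/gke-cost-optimization-monitoring.git
cd gke-cost-optimization-monitoring/metrics-exporter
# Optional: pin to the commit this post was written against
git checkout 808011623f34ce2e3f9233ee4dc80ffe35cf7171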
The top-level README links to a Google document with a thorough explanation that also walks through creating a test cluster and application.
Be aware that standing these resources up will incur additional charges.
Setup
Referring to the metrics-exporter README:
In the section “Before you begin”, the README asks that we create a new project. This can be accomplished through the console or with gcloud commands.
First, run this command to list your billing account IDs, then choose the billing account to link the new project to.
gcloud alpha billing accounts list --format yaml | grep billing | cut -d'/' -f2
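If your gcloud release includes the GA billing surface, a simpler equivalent (the basename() projection strips the billingAccounts/ prefix):
# Same listing without the grep/cut gymnastics
gcloud billing accounts list --format="value(name.basename())"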
Copy the desired billing account ID and set the BILLING_ACCOUNT_ID variable
BILLING_ACCOUNT_ID=<000000-000000-000000>
PROJECT_NAME=<my-new-project-name>
# Create the new project where the dashboard will be deployed.
gcloud projects create $PROJECT_NAME --name $PROJECT_NAME
gcloud billing projects link $PROJECT_NAME --billing-account $BILLING_ACCOUNT_ID
It’s highly recommended to create a new project, for a couple of reasons. If you plan to monitor multiple projects, you don’t want to affect the metrics scope of an existing project (this can happen, so be careful). A dedicated project also makes it easy to clean up all resources later by simply deleting the project.
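A quick way to confirm the billing link took effect (a small sketch using the GA billing commands):
# Confirm the new project is linked to the expected billing account
gcloud billing projects describe $PROJECT_NAME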
Support for Multiple Projects
I’d suggest most people go this route unless you’re only testing. Creating a scoping project makes it easy to add and remove projects from dashboard visibility.
This section should be done through the console.
In the console, select the newly created dashboard project as the active project. Open Cloud Monitoring and navigate to ‘Settings’ on the left.
In the middle of the screen under “GCP Projects”, click “Add GCP Projects”. A menu appears that allows selection of existing projects. Select the projects you want to import into this dashboard. Keep in mind whether you need environments separated into dedicated dashboards, such as production, staging, or other groupings.
At the bottom, because we’ve created a dedicated dashboard project with nothing in it, change the toggle to “Use this project as the scoping project” and add the projects.
The image below shows the setting we want to use: “Use this project as the scoping project”.
If you’re trying to reuse a project that already exists, you probably don’t want this!
A big scary warning appears to let us know that we shouldn’t do this on an existing project. This is the last chance to confess if you’re trying to reuse one! Click ‘Confirm’.
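For the command-line inclined, the Cloud Monitoring Metrics Scopes API can show what ended up in the scope. A hedged sketch (this assumes the API is reachable on the project, and it may want the project number rather than the ID):
# Inspect the metrics scope of the dashboard project
curl -s -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v1/locations/global/metricsScopes/${PROJECT_NAME}"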
Begin the installation process
Set variables to deploy the dashboard.
gcloud config set project $PROJECT_NAME
export PROJECT_ID=$PROJECT_NAME
export REGION=us-central1
export ZONE=us-central1-a
export CLUSTER_NAME=online-boutique
export SERVICE_ACCOUNT=svc-metric-exporter
export PUBSUB_TOPIC=mql_metric_export
export BIGQUERY_DATASET=metric_export
export BIGQUERY_MQL_TABLE=mql_metrics
export BIGQUERY_VPA_RECOMMENDATION_TABLE=vpa_container_recommendations
export EXPORT_METRIC_SERVICE_ACCOUNT=mql-export-metrics@$PROJECT_ID.iam.gserviceaccount.com
These are the services that need to be enabled before running the installation script:
gcloud services enable bigquery.googleapis.com cloudfunctions.googleapis.com pubsub.googleapis.com cloudbuild.googleapis.com
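To double-check what’s enabled, something like the following should work (the filter uses gcloud’s list-membership form):
# Verify the required APIs are enabled on the project
gcloud services list --enabled \
  --filter="name:(bigquery.googleapis.com cloudfunctions.googleapis.com pubsub.googleapis.com cloudbuild.googleapis.com)"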
Run the script!
# Run deploy pipeline
./deploy_pipeline.sh
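Once the script finishes, a few quick spot checks can confirm the pieces landed (exact resource names come from the deploy script, so treat this as a sketch):
# The exporter should appear as a Cloud Function
gcloud functions list
# The daily trigger should appear as a Cloud Scheduler job
gcloud scheduler jobs list
# The dataset should exist in BigQuery
bq ls --project_id=$PROJECT_ID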
A high-level overview of the components:
Cloud Function
This is the core of metric collection. The function pulls cost optimization and utilization data from Cloud Monitoring for all GKE pods in the scoping project. It’s configured to run once a day, and metrics land in a BigQuery table where a rolling 30 days is retained for the dashboard.
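Once the first run has completed, a hedged sanity check that rows are landing (table names assume the defaults exported earlier; this only counts rows, so it doesn’t depend on the repo’s schema):
# Count rows in the MQL metrics table
bq query --use_legacy_sql=false \
  "SELECT COUNT(*) AS row_count FROM \`${PROJECT_ID}.${BIGQUERY_DATASET}.${BIGQUERY_MQL_TABLE}\`"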
Cloud Scheduler + PubSub Topic
Used to define the daily cron schedule that triggers the Cloud Function.
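If you don’t want to wait a day for the first run, you can likely trigger the function yourself by publishing to the topic (this assumes the function fires on any message to the default topic; the message body here is arbitrary):
# Manually kick off a metrics export run
gcloud pubsub topics publish $PUBSUB_TOPIC --message="manual trigger"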
BigQuery
Tables to store 30 days of cost optimization and utilization data
Looker Dashboard
Central visibility across clusters and projects.
Load the Dashboard
Open the dashboard URL from the README.
On the top right, click ‘Use my own data’. Next, select the project that houses our data in BigQuery and choose the vpa_container_recommendations table.
A warning appears to let us know that we’re about to use our own data. Click “Add to report”.
At the top right, a green checkmark confirms that we’re using our own data.
The dashboard can be used as-is by removing ‘preview’ from the end of the URL:
https://lookerstudio.google.com/u/0/reporting/c4ac3f37-c7bc-48ba-bb7a-112298dce86e/page/tEnnC
It can also be customized under the top-right menu by selecting ‘Make a Copy’.
Summary: Post 3
Congratulations on making it through this cost optimization series. Post 1, about workload-level optimizations, should be the starting point for all efforts: it provides the foundation for accurate cluster-level tuning, and it delivers immediate savings, especially on Autopilot clusters.
After ensuring the applications are running efficiently, we moved on to cluster level optimizations such as host level visibility, cluster autoscaling, and ensuring efficient use of clusters.
Finally, in this post we covered how to stand up multi-cluster and multi-project cost visibility through use of a cost optimization dashboard.
This series was not an all-encompassing guide to cost optimization with GKE, but it should serve as a solid foundation. Each environment is unique and there are always additional options to explore, so reach out to me or your account team if you’d like a hand exploring.