BigQuery and Looker are great products that work nicely together. However, it is a challenge to analyze and understand exactly how Looker usage generates BigQuery costs, and without that understanding it is hard to optimize them. BigQuery gives you the cost data and Looker gives you the usage data, but you can't get both joined together in one place.
In BigQuery, we can gain deeper knowledge about costs using the INFORMATION_SCHEMA.JOBS view. This view provides valuable data, such as total bytes processed, total bytes billed, referenced tables, and the query SQL text.
However, a critical piece of the puzzle remains elusive: a direct correlation between these costs and the reports and dashboards that initiate these queries.
The BigQuery INFORMATION_SCHEMA views are read-only, system-defined views that provide metadata information about your BigQuery objects.
– From https://cloud.google.com/bigquery/docs/information-schema-intro
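As a quick sketch of what pulling those fields looks like, assuming the google-cloud-bigquery client library (the region qualifier and one-day lookback window here are just illustrative):

from google.cloud import bigquery

client = bigquery.Client()

# The region qualifier and the filter window below are illustrative
sql = """
    SELECT job_id, total_bytes_processed, total_bytes_billed, query
    FROM `region-eu`.INFORMATION_SCHEMA.JOBS
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
"""
for row in client.query(sql).result():
    print(row.job_id, row.total_bytes_billed)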
When it comes to dashboards, it's important to note that a single dashboard can execute multiple queries. In practice, it's not uncommon to encounter dashboards that trigger over 50 query runs, each contributing to the overall cost.
In this article, I want to go through how my team at Mathem and I linked Looker usage data with GCP billing data to get a better overview of how much queries, dashboards, and looks cost.
I’ve included a chart of the whole workflow.
Looker - GCP Guide
If you haven't already done so, first enable Cloud Billing export to BigQuery.
Create a new Looker user and grant the see_system_activity permission to have access to Looker's System Activity Explores.
Generate API credentials for the Looker user.
To link Looker usage data with BigQuery cost data, we will use the history.slug field from the History Explore in Looker and the looker-context-history_slug label from the gcp_billing_export table in BigQuery.
Starting with Looker data:
The fields we will export from Looker are:
history.created_raw
history.id
history.source
history.slug
history.result_source
history.status
history.connection_name
history.message
query.id
query.slug
query.model
query.view
dashboard.id
dashboard.title
look.id
look.slug
look.title
merge_query.id
sql_query.id
sql_query.slug
user.email
We will use Python to create the app for Looker data extraction.
I recommend doing this in a Python virtual environment (e.g., virtualenv).
Ensure that you have your credentials in order. The Looker credentials can be stored in a .ini file or as environment variables; we will use a .ini file.
Read more about it in sdk-codegen#securing-your-sdk-credentials
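For reference, a minimal looker.ini might look like the following; the section name and keys come from the Looker SDK docs, and the values are placeholders for your own instance URL and API credentials:

[Looker]
base_url=https://yourcompany.cloud.looker.com
client_id=your_client_id
client_secret=your_client_secret
verify_ssl=True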
The Application
In Looker, once a query is created, it is immutable and never deleted. If you create a query that is exactly like an existing query, the existing query will be returned and no new query will be created.
Read more: Looker API docs - create_query.
Looker's API offers multiple result formats; we will use "json".
Install the Looker SDK
pip install looker-sdk
import looker_sdk
from looker_sdk.sdk.api40 import models as models40

looker_client = looker_sdk.init40()  # using API 4.0

# Build a query against the System Activity model's history Explore
query_body = models40.WriteQuery(
    model="system__activity",
    view="history",
    fields=[
        "history.created_raw",
        "history.id",
        "history.source",
        "history.slug",
        "history.result_source",
        "history.status",
        "history.connection_name",
        "history.message",
        "query.id",
        "query.slug",
        "query.model",
        "query.view",
        "dashboard.id",
        "dashboard.title",
        "look.id",
        "look.slug",
        "look.title",
        "merge_query.id",
        "sql_query.id",
        "sql_query.slug",
        "user.email",
    ],
)

# Create the query (or get the existing identical one), then run it
looker_query = looker_client.create_query(body=query_body)
results = looker_client.run_query(query_id=looker_query.id, result_format="json")
The payload looks as follows (“flattened”):
[
  {
    "history.created_raw": "2023-11-26 00:00:23",
    "history.id": 12345678,
    "history.source": "dashboard",
    "history.slug": "123abc123abc123bac",
    "history.result_source": "query",
    "history.status": "complete",
    "history.connection_name": "connection_name",
    "history.message": null,
    "query.id": 123123,
    ...
  },
  ...
]
Let's convert the flattened data to nested dicts using an unflatten_dict function.
from typing import Any, Dict, Optional

def unflatten_dict(
    data: Dict[str, Any],
    result: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """
    Convert any keys containing dotted paths to nested dicts,
    e.g. {'a.b.c': 12} -> {'a': {'b': {'c': 12}}}
    """
    # Default to None to avoid the shared mutable default argument pitfall
    if result is None:
        result = {}
    for key, value in data.items():
        root = result
        ###
        # If a dotted path is encountered, create nested dicts for all but
        # the last level, then change root to that last level, and key to
        # the final key in the path.
        #
        # This allows one final setitem at the bottom of the loop.
        #
        if "." in key:
            *parts, key = key.split(".")
            for part in parts:
                root.setdefault(part, {})
                root = root[part]
        if isinstance(value, dict):
            value = unflatten_dict(value, root.get(key, {}))
        root[key] = value
    return result
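Since run_query with result_format="json" returns the rows as a JSON string, a minimal sketch of applying the function looks like this:

import json

records = json.loads(results)  # results is the JSON string from run_query
nested_records = [unflatten_dict(record) for record in records]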
Now the data is ready to be published. Each item is published to a Pub/Sub topic; from there, a Dataflow job processes the Pub/Sub messages and writes them to BigQuery.
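Here is a minimal publishing sketch, assuming the google-cloud-pubsub client library; the project and topic names are placeholders:

import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# "your-gcp-project" and "looker-history" are placeholder names
topic_path = publisher.topic_path("your-gcp-project", "looker-history")

for record in nested_records:
    data = json.dumps(record).encode("utf-8")  # Pub/Sub payloads are bytes
    publisher.publish(topic_path, data).result()  # wait for the publish ack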
Read more about our Dataflow setup in this great post from Robert Sahlin:
In BigQuery
Once the data is in BigQuery, it gets deduplicated, and additional fields are added to make it easier to navigate to a specific Looker query, dashboard, or look. We do this using dbt (data build tool).
Tip:
It's possible to create links to the dashboards and queries referenced in the Looker data we just exported; a small sketch follows the patterns below.
Query link = base_url_looker_instance/x/query_slug
Sql query link = base_url_looker_instance/x/sql_query_slug
Dashboard link = base_url_looker_instance/dashboards-next/dashboard_id
Look link = base_url_looker_instance/looks/look_id
Query page link = base_url_looker_instance/admin/queries/history_slug
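As a minimal Python sketch of these patterns (in our pipeline the links are added as dbt columns instead; BASE_URL is a placeholder for your instance URL):

BASE_URL = "https://yourcompany.cloud.looker.com"  # placeholder

def query_link(query_slug: str) -> str:
    return f"{BASE_URL}/x/{query_slug}"

def dashboard_link(dashboard_id: int) -> str:
    return f"{BASE_URL}/dashboards-next/{dashboard_id}"

def look_link(look_id: int) -> str:
    return f"{BASE_URL}/looks/{look_id}"

def query_page_link(history_slug: str) -> str:
    return f"{BASE_URL}/admin/queries/{history_slug}"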
Linking the Data
First, let’s create a query for the GCP billing information:
WITH
billing AS (
  SELECT
    billing_account_id,
    service.id,
    service.description,
    labels_unnested,
    cost,
    currency
  FROM
    `project.dataset.gcp_billing_export_table`,
    UNNEST(labels) AS labels_unnested
)
SELECT
  *
FROM
  billing
WHERE
  labels_unnested.key = "looker-context-history_slug"
  AND labels_unnested.value = "123abc456def" -- the Looker history slug
  -- only records that generate cost
  AND cost > 0
Next, join the GCP billing data with the Looker data and create a view/table from the result; let's call it looker_cost.
Note!
The usage/costs for all services are displayed with hourly granularity, which means long-running service usage is spread across multiple hourly windows.
We will therefore create a primary key/surrogate key from 'gcp_billing.export_time', 'gcp_billing.usage_start_time', and 'looker_history.id'
Read more: Structure of Standard data export.
# looker_cost.sql
WITH
gcp_billing AS (
  SELECT
    billing_account_id,
    export_time,
    usage_start_time,
    labels.value AS history_slug,
    cost
  FROM
    `gcp_billing_export*` AS gcp_billing
  CROSS JOIN
    UNNEST(labels) AS labels
  WHERE
    labels.key = 'looker-context-history_slug'
    /**
     * Some Looker runs have two records in the gcp_billing table, one with
     * cost 0 and one with cost > 0; only include records generating cost.
     */
    AND gcp_billing.cost > 0
)
SELECT
  looker_history.*,
  gcp_billing.cost,
  gcp_billing.history_slug,
  {{ dbt_utils.generate_surrogate_key([
      'gcp_billing.export_time',
      'gcp_billing.usage_start_time',
      'looker_history.id'
  ]) }} AS looker_cost_pk
FROM
  gcp_billing
LEFT JOIN
  looker_history_table AS looker_history
ON
  looker_history.slug = gcp_billing.history_slug
Lastly, create a dashboard-focused query; let's call it looker_dashboard_cost.
# looker_dashboard_cost.sql
SELECT
  DATE(created_raw, "your timezone") AS created_date,
  dashboard.id,
  dashboard.title,
  dashboard.link,
  ARRAY_AGG(
    (SELECT AS STRUCT
      query.model,
      query.view AS explore,
      query.link,
      IF(query.id IS NULL, NULL, cost) AS cost)
  ) AS queries,
  ARRAY_AGG(
    (SELECT AS STRUCT
      sql_query.link,
      IF(sql_query.id IS NULL, NULL, cost) AS cost)
  ) AS sql_query,
  ARRAY_AGG(
    (SELECT AS STRUCT
      merge_query.id,
      IF(merge_query.id IS NULL, NULL, cost) AS cost)
  ) AS merge_query,
  ARRAY_AGG(user.email) AS users,
  COUNT(dashboard.id) AS query_runs,
  SUM(cost) AS cost
FROM
  looker_cost
WHERE
  dashboard.id IS NOT NULL
GROUP BY
  created_date,
  dashboard.id,
  dashboard.title,
  dashboard.link
Voilà!