Scheduled serverless dbt + BigQuery service


My colleague Felipe Hoffa recently published a blog post titled Get started with BigQuery and dbt, the easy way. In it, he showed how to install dbt in Google Cloud Shell, configure it, and run it manually to create a temporary dataset in BigQuery. This is great for testing dbt + BigQuery, but how do you run this kind of setup in production?

The dbt documentation states that "running dbt in production simply means setting up a system to run a dbt job on a schedule, rather than running dbt commands manually from the command line."

Cloud Shell is just a temporary VM in the cloud and not suitable for production workloads. One obvious solution is to create a dedicated VM, install dbt, and set up some kind of cron job on that VM to run dbt on a schedule. This will work, but who wants to maintain a VM? Not to mention that you pay for the VM per second even when dbt is not running. It's wasteful. We can do better.

A better solution is to use Cloud Run. Cloud Run is fully managed, with no VMs to set up or maintain. Its pricing model is based on request time, so you only get charged while the dbt service is handling a request.

In this blog post, I want to show you how to take Felipe’s sample and make it more production-ready by running it as a serverless Cloud Run service on a schedule.

Challenges

Running dbt as a Cloud Run service has a few challenges, namely:

  1. dbt is mainly a command-line tool, whereas Cloud Run expects HTTP requests. How do you invoke a dbt command from a Cloud Run service?
  2. Cloud Run runs containers. How do you run dbt in a container?
  3. How do you authenticate dbt with BigQuery? OAuth works for end users, but it's probably not the right solution for services running in the cloud.

Let’s tackle them in that order.

Running shell commands from Cloud Run

Cloud Run has an example of how to run a shell command from an HTTP server deployed to Cloud Run. It involves setting up a Go-based HTTP server that simply calls a shell script upon receiving a GET request. You can take a look at the details in invoke.go.
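
For reference, here's a minimal sketch of what such a server looks like (abridged and simplified from the sample; error handling and logging are trimmed):

package main

import (
	"log"
	"net/http"
	"os"
	"os/exec"
)

func handler(w http.ResponseWriter, r *http.Request) {
	// Run the dbt wrapper script and return its combined output.
	out, err := exec.Command("/bin/sh", "script.sh").CombinedOutput()
	if err != nil {
		log.Printf("script.sh failed: %v\n%s", err, out)
		http.Error(w, "dbt run failed", http.StatusInternalServerError)
		return
	}
	w.Write(out)
}

func main() {
	http.HandleFunc("/", handler)
	// Cloud Run tells the container which port to listen on via PORT.
	port := os.Getenv("PORT")
	if port == "" {
		port = "8080"
	}
	log.Fatal(http.ListenAndServe(":"+port, nil))
}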

In our case, the shell script, script.sh, simply runs dbt and points it at the profiles directory:

#!/bin/sh
dbt run --profiles-dir .
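
The --profiles-dir . flag tells dbt to look for profiles.yml in the current directory instead of the default ~/.dbt. A minimal BigQuery profile might look like the following (a hypothetical sketch: the profile name must match the profile field in your dbt_project.yml, and the project and dataset values are placeholders). The oauth method makes dbt use Application Default Credentials, which on Cloud Run resolve to the service's attached service account, so no key file is needed:

# profiles.yml (hypothetical sketch)
default:
  target: prod
  outputs:
    prod:
      type: bigquery
      method: oauth              # use Application Default Credentials
      project: your-gcp-project  # placeholder: your GCP project ID
      dataset: dbt_quickstart    # placeholder: dataset dbt writes to
      threads: 1
      location: US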

Container image for dbt

dbt has some base images that you can rely on (although the documentation is pretty much non-existent). In the container, we want to include the HTTP server along with script.sh, as well as the dbt runtime and the dbt project itself. This is a sample Dockerfile that works:

# Stage 1: build the Go HTTP server that shells out to dbt
FROM golang:1.13 as builder
WORKDIR /app
COPY invoke.go ./
RUN CGO_ENABLED=0 GOOS=linux go build -v -o server

# Stage 2: runtime image with dbt preinstalled
FROM fishtownanalytics/dbt:0.17.0
USER root
WORKDIR /dbt
COPY --from=builder /app/server ./
COPY script.sh ./
COPY dbt_project ./

# Exec form avoids wrapping the server in an extra shell process
ENTRYPOINT ["./server"]

Build the container using gcloud:

export SERVICE_NAME=dbt-service
gcloud builds submit \
  --tag gcr.io/$(gcloud config get-value project)/${SERVICE_NAME}

Authentication

By default, Cloud Run uses the Compute Engine default service account, and that should be able to make BigQuery calls. However, it's best practice to give your Cloud Run service more granular permissions by assigning it a dedicated service account with more restricted IAM roles.

In our case, the Cloud Run service will only talk to BigQuery, so let's create a service account with the bigquery.admin role. You probably want to use an even finer-grained role in production:

export SERVICE_ACCOUNT=dbt-sa
gcloud iam service-accounts create ${SERVICE_ACCOUNT} \
   --display-name "DBT BigQuery Service Account"
gcloud projects add-iam-policy-binding \
  $(gcloud config get-value project) \
  --member=serviceAccount:${SERVICE_ACCOUNT}@$(gcloud config get-value project).iam.gserviceaccount.com \
  --role=roles/bigquery.admin

We will use this service account when we deploy the Cloud Run service.
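
If you want to double-check the binding, you can list the roles granted to the new service account (an optional sanity check):

gcloud projects get-iam-policy $(gcloud config get-value project) \
  --flatten="bindings[].members" \
  --filter="bindings.members:${SERVICE_ACCOUNT}@$(gcloud config get-value project).iam.gserviceaccount.com" \
  --format="value(bindings.role)"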

Deploy Cloud Run service

Now that we have all the pieces assembled, deploy to Cloud Run with the service account created earlier, along with the no-allow-unauthenticated flag to make it a private service:

gcloud run deploy ${SERVICE_NAME} \
    --image gcr.io/$(gcloud config get-value project)/${SERVICE_NAME} \
    --service-account ${SERVICE_ACCOUNT}@$(gcloud config get-value project).iam.gserviceaccount.com \
    --no-allow-unauthenticated

After a few seconds, you should see the service deployed and running:

dbt Cloud Run service

You should also see that the service is private:

dbt Cloud Run service
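
Before wiring up the scheduler, you can smoke-test the private service by calling it with an identity token (assuming your own account has permission to invoke the service). Note that this triggers an actual dbt run, so the response can take a while:

export SERVICE_URL="$(gcloud run services list --platform managed --filter=${SERVICE_NAME} --format='value(URL)')"
curl -H "Authorization: Bearer $(gcloud auth print-identity-token)" ${SERVICE_URL}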

Scheduling

The final step is to call the Cloud Run service on a schedule. You can do this with Cloud Scheduler.

First, make sure the Cloud Scheduler API is enabled:

gcloud services enable cloudscheduler.googleapis.com

Create a service account for Cloud Scheduler and grant it the run.invoker role on the service:

export SERVICE_ACCOUNT=dbt-scheduler-sa
gcloud iam service-accounts create ${SERVICE_ACCOUNT} \
   --display-name "DBT Scheduler Service Account"
gcloud run services add-iam-policy-binding ${SERVICE_NAME} \
   --member=serviceAccount:${SERVICE_ACCOUNT}@$(gcloud config get-value project).iam.gserviceaccount.com \
   --role=roles/run.invoker

Create a Cloud Scheduler job that uses this service account to call the Cloud Run service every 5 minutes:

export SERVICE_URL="$(gcloud run services list --platform managed --filter=${SERVICE_NAME} --format='value(URL)')"
gcloud scheduler jobs create http ${SERVICE_NAME}-job --schedule "*/5 * * * *" \
   --http-method=GET \
   --uri=${SERVICE_URL} \
   --oidc-service-account-email=${SERVICE_ACCOUNT}@$(gcloud config get-value project).iam.gserviceaccount.com \
   --oidc-token-audience=${SERVICE_URL}

To test that the service gets called and the temporary BigQuery dataset gets created, you can manually trigger the job:

gcloud scheduler jobs run ${SERVICE_NAME}-job
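
You can also check the dbt output in the service's logs, for example with gcloud logging read (adjust the filter and limit as needed):

gcloud logging read \
  "resource.type=cloud_run_revision AND resource.labels.service_name=${SERVICE_NAME}" \
  --limit 20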

You should see a temp dataset created in the BigQuery console:

dbt temp dataset


That’s it. Hopefully, this blog post gave you an idea of how to use dbt + BigQuery + Cloud Run together. This was a very simple dbt job, but the same setup can be used for more complicated scenarios. You can check out my Cloud Run Tutorial and its Scheduled Cloud Run dbt service with BigQuery section for all the code for this post.

Feel free to reach out to me on Twitter @meteatamel or read my previous posts on medium/@meteatamel.

