Databricks Notebooks Deployment in SQL Warehouse using GitHub Actions

Prashanth Kumar
6 min read · Aug 3, 2023


What is Databricks?

Databricks is a unified set of tools for building, deploying, sharing, and maintaining enterprise-grade data solutions at scale.

What is a SQL warehouse?

A SQL warehouse is a compute resource that lets you run SQL commands on data objects within Databricks SQL. Compute resources are infrastructure resources that provide processing capabilities in the cloud.

To navigate to the SQL warehouse dashboard, click SQL Warehouses in the sidebar. By default, warehouses are sorted by state (running warehouses first), then in alphabetical order.

To help you get started quickly, Databricks creates a small SQL warehouse called Starter Warehouse automatically.

Problem Statement

A SQL warehouse is a compute resource that lets you run SQL commands on data objects within Databricks SQL. Since we want to maintain consistency and keep the schema objects up to date, how do we maintain them?

We also don't want to give individuals write access to the SQL Warehouse cluster, so how do we push our object changes from the lower environment to the Prod (higher) environment?

And finally, we don't want to keep the SQL cluster running all the time; it should run only on demand, whenever there is a change to deploy.

Solution

To address the above problem statement, we designed a simple GitHub Actions workflow that captures changes from the lower environment and promotes them to the higher environment without any manual or human intervention.

Pre-requisites:

  • Databricks host name
  • Databricks Token
  • Databricks RepoID
  • Databricks ClusterID

You need to add all these values to your GitHub secrets.

Databricks hostname: you can get it from the Azure portal → open your Databricks instance → on the Overview screen, check the URL and copy that path.

Databricks token: to generate a token, follow this link:

https://docs.databricks.com/en/archive/dev-tools/cli/tokens-cli.html

Databricks RepoID: to get the RepoID, follow the link below:

https://docs.databricks.com/en/archive/dev-tools/cli/repos-cli.html#:~:text=To%20get%20the%20ID%2C%20run,the%20get%20or%20list%20command.

Databricks ClusterID: once you log in to Azure Databricks → go to the Compute section, where you can get the Cluster ID.
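As a minimal sketch of collecting these values from a terminal and storing them as GitHub secrets (assuming you have the legacy databricks-cli and the GitHub CLI installed; the secret names below are illustrative, not fixed):

# Configure the legacy Databricks CLI against your workspace (hostname from the Azure portal)
databricks configure --token

# Generate a personal access token (legacy tokens CLI)
databricks tokens create --comment "github-actions" --lifetime-seconds 7776000

# Find the RepoID of the repo checked out in the workspace
databricks repos list

# Store the values as GitHub secrets (names are illustrative)
gh secret set DATABRICKS_HOST --body "https://adb-xxxxxxxxxxxxxxx.14.azuredatabricks.net/"
gh secret set DATABRICKS_TOKEN --body "<token-from-above>"
gh secret set DATABRICKS_REPO_ID --body "<repo-id-from-above>"
gh secret set DATABRICKS_CLUSTER_ID --body "<cluster-id-from-compute-page>"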

Now let's look at how we can achieve this step by step using GitHub Actions.

  1. We mainly use views and functions, and we want to capture only those file changes and deploy them to the higher environment. So I have added a paths filter in my GitHub Action as shown below. Here “PrashanthTest1” is treated as the Main/Master branch, but the name has to be replaced with your real branch.
on:
  push:
    paths:
      - 'Functions/**'
      - 'Views/**'
    branches:
      - PrashanthTest1

2. Next, you need to add a step to refresh the repo in Azure Databricks. Once you make changes in the test repo and merge them to the Master branch, the changes have to be reflected in Azure Databricks. To make sure it stays in sync, you need to use the “databricks repos update” command.

      - name: execute py
        run: |
          python -m pip install --upgrade databricks-cli
          databricks repos update --repo-id "123456789012345" --branch "PrashanthTest1"
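Note that the databricks CLI in this step authenticates via the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables, so they have to be available to the job. A minimal sketch, assuming the (illustrative) secret names from the prerequisites section:

env:
  DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
  DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}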

3. Next, we don't want to run all notebooks/.sql scripts or views whenever a PR is merged to master, so I want to capture only the recent file changes. To do so I am using “git diff”, and then I capture the differential changes as a step output. Make sure to set “fetch-depth: 0”, as I found it is mandatory when pulling git differences.

      - name: Adding actions/checkout@v2
        uses: actions/checkout@v2
        with:
          fetch-depth: 0

      - name: Identify changed files
        id: get_changes
        run: |
          # Write the list of changed files to a file, then read it into a variable
          git diff --name-only ${{ github.event.before }} ${{ github.sha }} > changed_files.txt
          CHANGED_FILES=$(cat changed_files.txt)
          echo "::set-output name=changed-files::$CHANGED_FILES"

          ARTIFACT_PATHS=""
          for file in $CHANGED_FILES; do
            ARTIFACT_PATHS="$ARTIFACT_PATHS $file"
          done
          echo "::set-output name=artifact_paths::$ARTIFACT_PATHS"

      - name: upload changed files to artifact
        uses: actions/upload-artifact@v2
        with:
          name: changed-files-artifact
          path: ${{ steps.get_changes.outputs.artifact_paths }}

4. Once you have the output of changed files, we want to capture the names without the file extension, because in my repo these are all notebooks but they are saved with different extensions, e.g. .sql, .py, etc.

5. To strip the extensions from the file names, we use the step below; later I publish the result as a final artifact.

      - name: Show only files without paths
        id: changed_filepaths
        run: |
          fileswithextension="${{ steps.get_changes.outputs.artifact_paths }}"
          # Loop through the files and remove the file extension
          for file in $fileswithextension; do
            filename_without_extension="${file%.*}"
            echo "::set-output name=filename_without_extension::$filename_without_extension"
          done

      - name: Create generated artifact
        uses: actions/upload-artifact@v2
        with:
          name: modified-paths-artifact
          path: ${{ steps.changed_filepaths.outputs.filename_without_extension }}

6. Finally, to trigger the specific notebooks on the Azure Databricks SQL cluster, we need to run the task below.

      - name: Install Databricks CLI
        run: |
          pip install databricks-cli

      - name: Deploy SQL script to Azure Databricks SQL cluster
        uses: databricks/run-notebook@v0.0.3
        with:
          workspace-notebook-path: /Repos/SQL_Views/DataStoreDatabricks/${{ steps.changed_filepaths.outputs.filename_without_extension }}
          databricks-token: ${{ secrets.DATABRICKS_TOKEN }}
          existing-cluster-id: ${{ secrets.DATABRICKS_CLUSTER_ID }}

7. The GitHub Actions output is shown in the screenshots below: the recently modified files, the files without extensions, the file path combined with the existing Repos path, and the final notebook run.

Screenshots from the Azure Databricks cluster show the last updated view(s): a Databricks SQL view with its last modified timestamp. You can also open Workflows and see the executed jobs.
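If you prefer to verify from a terminal rather than the UI, a quick sketch using the legacy databricks-cli (assuming the same host/token environment variables are exported):

# List the most recent job runs in the workspace
databricks runs list --limit 5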

Please find the entire GitHub workflow below:

name: Deploy on File Changes
on:
  push:
    paths:
      - 'Functions/**'
      - 'Views/**'
    branches:
      - PrashanthTest1

env:
  DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
  DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
  REPO_ID: ${{ secrets.DATABRICKS_REPO_ID }}

jobs:

  build:
    runs-on: ubuntu-latest
    steps:
      - name: Check out code
        uses: actions/checkout@v2

      - name: Adding actions/checkout@v2
        uses: actions/checkout@v2
        with:
          fetch-depth: 0

      - name: setup python
        uses: actions/setup-python@v2
        with:
          python-version: 3.8

      - name: execute py
        run: |
          python -m pip install --upgrade databricks-cli
          databricks repos update --repo-id "$REPO_ID" --branch "PrashanthTest1"

      - name: Identify changed files
        id: get_changes
        run: |
          # Write the list of changed files to a file, then read it into a variable
          git diff --name-only ${{ github.event.before }} ${{ github.sha }} > changed_files.txt
          CHANGED_FILES=$(cat changed_files.txt)
          echo "::set-output name=changed-files::$CHANGED_FILES"

          ARTIFACT_PATHS=""
          for file in $CHANGED_FILES; do
            ARTIFACT_PATHS="$ARTIFACT_PATHS $file"
          done
          echo "::set-output name=artifact_paths::$ARTIFACT_PATHS"

      - name: upload changed files to artifact
        uses: actions/upload-artifact@v2
        with:
          name: changed-files-artifact
          path: ${{ steps.get_changes.outputs.artifact_paths }}

      - name: Show only files without paths
        id: changed_filepaths
        run: |
          fileswithextension="${{ steps.get_changes.outputs.artifact_paths }}"
          # Loop through the files and remove the file extension
          for file in $fileswithextension; do
            filename_without_extension="${file%.*}"
            echo "::set-output name=filename_without_extension::$filename_without_extension"
          done

      - name: Create generated artifact
        uses: actions/upload-artifact@v2
        with:
          name: modified-paths-artifact
          path: ${{ steps.changed_filepaths.outputs.filename_without_extension }}

      - name: Install Databricks CLI
        run: |
          pip install databricks-cli

      - name: Deploy SQL script to Azure Databricks SQL cluster
        uses: databricks/run-notebook@v0.0.3
        with:
          workspace-notebook-path: /Repos/SQL_Views/DataStoreDatabricks/${{ steps.changed_filepaths.outputs.filename_without_extension }}
          databricks-token: ${{ secrets.DATABRICKS_TOKEN }}
          existing-cluster-id: ${{ secrets.DATABRICKS_CLUSTER_ID }}




Written by Prashanth Kumar

IT professional with 20+ years experience, feel free to contact me at: Prashanth.kumar.ms@outlook.com
