Databricks Notebooks Deployment in SQL Warehouse using GitHub Actions

Prashanth Kumar
6 min read · Aug 3, 2023


What is Databricks?

Databricks is a unified set of tools for building, deploying, sharing, and maintaining enterprise-grade data solutions at scale.

What is a SQL warehouse?

A SQL warehouse is a compute resource that lets you run SQL commands on data objects within Databricks SQL. Compute resources are infrastructure resources that provide processing capabilities in the cloud.

To navigate to the SQL warehouse dashboard, click SQL Warehouses in the sidebar. By default, warehouses are sorted by state (running warehouses first), then in alphabetical order.

To help you get started quickly, Databricks creates a small SQL warehouse called Starter Warehouse automatically.

Problem Statement

A SQL warehouse is a compute resource that lets you run SQL commands on data objects within Databricks SQL. Since we want to maintain consistency and keep the schema objects up to date, how do we maintain them?

We also don't want to give individuals write access to the SQL Warehouse cluster, so how do we push our object changes from the lower environment to the Prod (higher) environment?

And finally, we don't want to keep the SQL cluster running all the time; it should run only on demand, whenever there is a change to deploy.

Solution

To address the above problem statement, we designed a simple GitHub Actions workflow that captures changes from the lower environment and promotes them to the higher environment without any manual or human intervention.

Pre-requisites:

  • Databricks host name
  • Databricks Token
  • Databricks RepoID
  • Databricks ClusterID

You need to add all these values to your GitHub secrets.

Databricks hostname: you can get it from the Azure portal → open your Databricks instance → on the Overview screen, check the URL and copy that path.

Databricks token: to generate a token, follow this link:

https://docs.databricks.com/en/archive/dev-tools/cli/tokens-cli.html

Databricks RepoID: to get the RepoID, follow the link below:

https://docs.databricks.com/en/archive/dev-tools/cli/repos-cli.html#:~:text=To%20get%20the%20ID%2C%20run,the%20get%20or%20list%20command.

Databricks ClusterID: once you log in to Azure Databricks → go to the Compute section, where you can get the Cluster ID.
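As a minimal sketch of collecting these values from a terminal and storing them as GitHub secrets (assuming you have the legacy databricks-cli and the GitHub CLI installed; the secret names below are illustrative, not fixed):

# Configure the legacy Databricks CLI against your workspace (hostname from the Azure portal)
databricks configure --token

# Generate a personal access token (legacy tokens CLI)
databricks tokens create --comment "github-actions" --lifetime-seconds 7776000

# Find the RepoID of the repo checked out in the workspace
databricks repos list

# Store the values as GitHub secrets (names are illustrative)
gh secret set DATABRICKS_HOST --body "https://adb-xxxxxxxxxxxxxxx.14.azuredatabricks.net/"
gh secret set DATABRICKS_TOKEN --body "<token-from-above>"
gh secret set DATABRICKS_REPO_ID --body "<repo-id-from-above>"
gh secret set DATABRICKS_CLUSTER_ID --body "<cluster-id-from-compute-page>"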

Now let's look at how we can achieve this step by step using GitHub Actions.

  1. We mainly use views and functions, and we want to capture only those file changes and deploy them to the higher environment. So I have added a paths filter in my GitHub Action as shown below. Here “PrashanthTest1” is treated as the Main/Master branch, but the name has to be replaced with your real branch.
on:
  push:
    paths:
      - 'Functions/**'
      - 'Views/**'
    branches:
      - PrashanthTest1

2. Next, you need to add a step to refresh the repo in Azure Databricks. Once you make changes in the test repo and merge them to the Master branch, the changes have to be reflected in Azure Databricks. To make sure it stays in sync, you need to use the “databricks repos update” command.

      - name: execute py
        run: |
          python -m pip install --upgrade databricks-cli
          databricks repos update --repo-id "123456789012345" --branch "PrashanthTest1"
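Note that the databricks CLI in this step authenticates via the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables, so they have to be available to the job. A minimal sketch, assuming the (illustrative) secret names from the prerequisites section:

env:
  DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
  DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}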

3. Next, we don't want to run all notebooks/.sql scripts or views whenever a PR is merged to master, so I want to capture only the recent file changes. To do so I am using “git diff”, and then I capture the differential changes as a step output. Make sure to set “fetch-depth: 0”, as I found it is mandatory when pulling git differences.

      - name: Adding actions/checkout@v2
        uses: actions/checkout@v2
        with:
          fetch-depth: 0

      - name: Identify changed files
        id: get_changes
        run: |
          # Write the list of changed files to a file, then read it into a variable
          git diff --name-only ${{ github.event.before }} ${{ github.sha }} > changed_files.txt
          CHANGED_FILES=$(cat changed_files.txt)
          echo "::set-output name=changed-files::$CHANGED_FILES"

          ARTIFACT_PATHS=""
          for file in $CHANGED_FILES; do
            ARTIFACT_PATHS="$ARTIFACT_PATHS $file"
          done
          echo "::set-output name=artifact_paths::$ARTIFACT_PATHS"

      - name: upload changed files to artifact
        uses: actions/upload-artifact@v2
        with:
          name: changed-files-artifact
          path: ${{ steps.get_changes.outputs.artifact_paths }}

4. Once you have the output of changed files, we want to capture the names without the file extension, because in my repo these are all notebooks but they are saved with different extensions, e.g. .sql, .py, etc.

5. To strip the extensions from the file names, we use the step below; later I publish the result as a final artifact.

      - name: Show only files without paths
        id: changed_filepaths
        run: |
          fileswithextension="${{ steps.get_changes.outputs.artifact_paths }}"
          # Loop through the files and remove the file extension
          for file in $fileswithextension; do
            filename_without_extension="${file%.*}"
            echo "::set-output name=filename_without_extension::$filename_without_extension"
          done

      - name: Create generated artifact
        uses: actions/upload-artifact@v2
        with:
          name: modified-paths-artifact
          path: ${{ steps.changed_filepaths.outputs.filename_without_extension }}

6. Finally, to trigger the specific notebooks on the Azure Databricks SQL cluster, we need to run the task below.

      - name: Install Databricks CLI
        run: |
          pip install databricks-cli

      - name: Deploy SQL script to Azure Databricks SQL cluster
        uses: databricks/run-notebook@v0.0.3
        with:
          workspace-notebook-path: /Repos/SQL_Views/DataStoreDatabricks/${{ steps.changed_filepaths.outputs.filename_without_extension }}
          databricks-token: ${{ secrets.DATABRICKS_TOKEN }}
          existing-cluster-id: ${{ secrets.DATABRICKS_CLUSTER_ID }}

7. The GitHub Actions output is shown in the screenshots below: the recently modified files, the files without extensions, the file path combined with the existing Repos path, and the final notebook run.

Screenshots from the Azure Databricks cluster show the last updated view(s): a Databricks SQL view with its last modified timestamp. You can also open Workflows and see the executed jobs.
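If you prefer to verify from a terminal rather than the UI, a quick sketch using the legacy databricks-cli (assuming the same host/token environment variables are exported):

# List the most recent job runs in the workspace
databricks runs list --limit 5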

Please find the entire GitHub workflow below:

name: Deploy on File Changes
on:
  push:
    paths:
      - 'Functions/**'
      - 'Views/**'
    branches:
      - PrashanthTest1

env:
  DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
  DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
  REPO_ID: ${{ secrets.DATABRICKS_REPO_ID }}

jobs:

  build:
    runs-on: ubuntu-latest
    steps:
      - name: Check out code
        uses: actions/checkout@v2

      - name: Adding actions/checkout@v2
        uses: actions/checkout@v2
        with:
          fetch-depth: 0

      - name: setup python
        uses: actions/setup-python@v2
        with:
          python-version: 3.8

      - name: execute py
        run: |
          python -m pip install --upgrade databricks-cli
          databricks repos update --repo-id "$REPO_ID" --branch "PrashanthTest1"

      - name: Identify changed files
        id: get_changes
        run: |
          # Write the list of changed files to a file, then read it into a variable
          git diff --name-only ${{ github.event.before }} ${{ github.sha }} > changed_files.txt
          CHANGED_FILES=$(cat changed_files.txt)
          echo "::set-output name=changed-files::$CHANGED_FILES"

          ARTIFACT_PATHS=""
          for file in $CHANGED_FILES; do
            ARTIFACT_PATHS="$ARTIFACT_PATHS $file"
          done
          echo "::set-output name=artifact_paths::$ARTIFACT_PATHS"

      - name: upload changed files to artifact
        uses: actions/upload-artifact@v2
        with:
          name: changed-files-artifact
          path: ${{ steps.get_changes.outputs.artifact_paths }}

      - name: Show only files without paths
        id: changed_filepaths
        run: |
          fileswithextension="${{ steps.get_changes.outputs.artifact_paths }}"
          # Loop through the files and remove the file extension
          for file in $fileswithextension; do
            filename_without_extension="${file%.*}"
            echo "::set-output name=filename_without_extension::$filename_without_extension"
          done

      - name: Create generated artifact
        uses: actions/upload-artifact@v2
        with:
          name: modified-paths-artifact
          path: ${{ steps.changed_filepaths.outputs.filename_without_extension }}

      - name: Install Databricks CLI
        run: |
          pip install databricks-cli

      - name: Deploy SQL script to Azure Databricks SQL cluster
        uses: databricks/run-notebook@v0.0.3
        with:
          workspace-notebook-path: /Repos/SQL_Views/DataStoreDatabricks/${{ steps.changed_filepaths.outputs.filename_without_extension }}
          databricks-token: ${{ secrets.DATABRICKS_TOKEN }}
          existing-cluster-id: ${{ secrets.DATABRICKS_CLUSTER_ID }}




Written by Prashanth Kumar

IT professional with 20+ years experience, feel free to contact me at: Prashanth.kumar.ms@outlook.com
