Azure Databricks — Databricks Repo Update Using GitHub Actions

Prashanth Kumar
3 min read · Jun 6, 2024

This article talks about using a GitHub Actions workflow to update a Databricks repo. The workflow checks out a repository, sets up a Python environment, creates an Azure AD token and then updates the Databricks repo. It can be triggered manually or in an automated way.

Manually updating Databricks repos can be cumbersome and error-prone. Data engineers often need to maintain and update their repos across environments (e.g. prod and dev) and branches (e.g. master and develop). This multi-step process includes checking out a repo, configuring the environment, creating auth tokens and then updating the repo. Performing these steps manually introduces inconsistencies, wastes time and creates security vulnerabilities if sensitive credentials are not handled securely.

Another reason is that we don’t want tight coupling with a Databricks-generated PAT token.

Workflow

So instead of using a Databricks PAT token, we now have the option to use an Azure Active Directory token scoped to Databricks. Here you use an Azure Service Principal with a client secret; an SPN with a certificate works as well.
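
Before wiring this into CI, it helps to sanity-check the token exchange on its own. The snippet below is a minimal local sketch of the client-credentials flow against Azure AD, assuming the service principal details are exported as ARM_TENANT_ID, ARM_CLIENT_ID and ARM_CLIENT_SECRET (the variable names are just for this example, not anything Databricks requires):

# Minimal sketch: exchange service principal credentials for an Azure AD
# token scoped to Azure Databricks. The environment variable names below
# are illustrative.
import os
import requests

tenant_id = os.environ["ARM_TENANT_ID"]
client_id = os.environ["ARM_CLIENT_ID"]
client_secret = os.environ["ARM_CLIENT_SECRET"]

token_url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
response = requests.post(
    token_url,
    data={
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # 2ff814a6-... is the well-known Azure Databricks resource ID.
        "scope": "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default",
    },
)
response.raise_for_status()
aad_token = response.json()["access_token"]
print("Token acquired; expires in", response.json().get("expires_in"), "seconds")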

Problem We Are Trying to Solve

In this article we will automate refreshing a Databricks repository. We will use GitHub Actions to create a workflow that keeps the repository up to date with the latest changes with minimal effort. The workflow will:

  1. Check out the repository code: To ensure we are using the latest code.
  2. Create a Python environment: To provide the required tools for the next steps.
  3. Install dependencies: To ensure we have all the required libraries.
  4. Create an Azure AD token: To ensure secure authentication with Databricks.
  5. Install the Databricks CLI: To interact with Databricks.
  6. Refresh the Databricks repository: To apply the changes to the required branch (the sketch after this list shows the equivalent REST call).
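
Step 6 uses the Databricks CLI, but it boils down to a single call to the Databricks Repos REST API. The sketch below shows the equivalent direct call, assuming DATABRICKS_HOST (the workspace URL), REPO_ID and an already-generated Azure AD token in DATABRICKS_TOKEN are available as environment variables:

# Minimal sketch of the REST call behind "databricks repos update":
# PATCH /api/2.0/repos/{repo_id} with the branch to check out.
import os
import requests

host = os.environ["DATABRICKS_HOST"].rstrip("/")
repo_id = os.environ["REPO_ID"]
token = os.environ["DATABRICKS_TOKEN"]  # Azure AD token from the previous step

response = requests.patch(
    f"{host}/api/2.0/repos/{repo_id}",
    headers={"Authorization": f"Bearer {token}"},
    json={"branch": "develop"},  # branch the repo should be switched to
)
response.raise_for_status()
print("Repo now tracking branch:", response.json().get("branch"))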

We will also use GitHub secrets to store sensitive data in a secure way and minimize the risk of credential exposure.

Benefits It’s Going to Bring

By automating the process of refreshing a repository using GitHub Actions, we gain the following main benefits:

  1. Consistency
  2. Efficiency
  3. Security
  4. Reduced errors

YAML File

Here is the full YAML file. Because it uses the workflow_call trigger, it is a reusable workflow: a caller workflow invokes it (manually or on a schedule) and passes the environment and branch inputs.

name: Refresh Repo

on:
  workflow_call:
    inputs:
      environment:
        description: "production or develop databricks environment"
        required: true
        type: string
      branch:
        description: "master or develop branch"
        required: true
        type: string

jobs:
  build:
    runs-on: ubuntu-latest
    environment:
      name: ${{ inputs.environment }}
    env:
      DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
      REPO_ID: ${{ secrets.REPO_ID }}
      TENANT_ID: ${{ secrets.SPN_TENANT_ID }}
      CLIENT_ID: ${{ secrets.AZURE_SPN_ID }}
      CLIENT_SECRET: ${{ secrets.AZURE_SPN_PASSWORD }}

    steps:
      - name: Checkout repository
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.x'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install requests

      - name: Generate Azure AD token
        id: generate-token
        run: |
          import os
          import requests

          tenant_id = '${{ secrets.SPN_TENANT_ID }}'
          client_id = '${{ secrets.AZURE_SPN_ID }}'
          client_secret = '${{ secrets.AZURE_SPN_PASSWORD }}'

          # Client-credentials flow against Azure AD; the scope is the
          # well-known Azure Databricks resource ID.
          token_url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
          token_data = {
              'grant_type': 'client_credentials',
              'client_id': client_id,
              'client_secret': client_secret,
              'scope': '2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default'
          }
          token_r = requests.post(token_url, data=token_data)
          token_r.raise_for_status()
          token = token_r.json().get('access_token')

          # Mask the token in the job log and expose it to later steps via
          # GITHUB_OUTPUT (the older ::set-output command is deprecated).
          print(f"::add-mask::{token}")
          with open(os.environ['GITHUB_OUTPUT'], 'a') as fh:
              fh.write(f"access_token={token}\n")
        shell: python

      - name: Install Databricks CLI
        run: |
          curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

      - name: Update Databricks repo
        run: |
          export DATABRICKS_TOKEN=${{ steps.generate-token.outputs.access_token }}
          databricks repos update ${{ env.REPO_ID }} --branch "${{ inputs.branch }}"
        env:
          DATABRICKS_HOST: ${{ env.DATABRICKS_HOST }}

Make sure you save the variables below in your GitHub secrets (for each environment used above):

      DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
      REPO_ID: ${{ secrets.REPO_ID }}
      TENANT_ID: ${{ secrets.SPN_TENANT_ID }}
      CLIENT_ID: ${{ secrets.AZURE_SPN_ID }}
      CLIENT_SECRET: ${{ secrets.AZURE_SPN_PASSWORD }}
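
Once the secrets are in place and the workflow has run, you can confirm that the repo is on the expected branch by querying the same Repos API with an Azure AD token. A small sketch (again assuming DATABRICKS_HOST, REPO_ID and DATABRICKS_TOKEN are exported locally, mirroring the secrets above):

# Quick check that the Databricks repo points at the expected branch.
import os
import requests

host = os.environ["DATABRICKS_HOST"].rstrip("/")
repo_id = os.environ["REPO_ID"]
token = os.environ["DATABRICKS_TOKEN"]

response = requests.get(
    f"{host}/api/2.0/repos/{repo_id}",
    headers={"Authorization": f"Bearer {token}"},
)
response.raise_for_status()
repo = response.json()
print("Repo path:", repo.get("path"))
print("Current branch:", repo.get("branch"))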
