Azure Databricks — Databricks Administration using Terraform Integration

Prashanth Kumar

--

Problem Statement

Managing Databricks resources manually can be cumbersome and complex, especially in Development/UAT environments where frequent updates and changes are necessary. Automating this process with an Infrastructure as Code (IaC) tool like Terraform, driven by GitHub Actions pipelines, helps you maintain consistency and security.

In this article I will walk through an approach to automating different Databricks administration tasks using GitHub Actions and Terraform.

Approach

The general idea is to create a GitHub Actions workflow that avoids Databricks personal access tokens (PATs) altogether. Instead, authentication is handled by a service principal using a certificate or client secret.

Pre-requisites

Before going further with the GitHub Actions approach, make sure the following pre-requisites are in place:

  1. Add your client ID, service principal (SPN) application ID, tenant ID, Databricks host and client secret to your GitHub Actions secrets (a small sanity-check sketch follows this list).
  2. Make sure the correct OAuth scope is set (see the token script below).
  3. Define your Terraform state storage account and set the correct backend storage access key for it.
  4. Have a valid Azure Databricks (ADB) workspace URL.
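Because a missing secret usually surfaces as a cryptic failure later in the pipeline, a small fail-fast guard at the top of the token step can help. This is only an optional sketch, assuming the secrets are exposed under the environment variable names used later in this article:

import os

# Hypothetical fail-fast guard: verify the required secrets were mapped into the environment.
required = ["SPN_TENANT_ID", "AZURE_SPN_ID", "AZURE_SPN_PASSWORD", "DATABRICKS_HOST"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing required environment variables: {', '.join(missing)}")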

Workflow

Let's start with the actual GitHub Actions workflow. In it I define a workflow_call trigger along with the environment variables that will be used.

Then I define a Python script that generates a short-lived bearer token on every run to authenticate against my ADB instance. Most folks use a Databricks PAT; however, as mentioned earlier, I want to reduce PAT usage and instead leverage a service principal with either a secret or a certificate. A Databricks PAT is tied to an individual user, whereas a service principal is a shared identity that the entire team can use.

Here is my YAML file, which I named Main.yml:

name: "Databricks Permissions Setup"

on:
workflow_call:
inputs:
environment:
description: "develop, prod"
required: true
type: string

jobs:
terraform:
name: "Terraform ${{ inputs.environment }}"
runs-on: ubuntu-latest
environment:
name: ${{ inputs.environment }}
env:
ARM_CLIENT_ID: ${{ secrets.SPN_CLIENT_ID }}
ARM_CLIENT_CERTIFICATE_PASSWORD: ${{ secrets.SPN_CERT_PASSWORD }}
ARM_CLIENT_CERTIFICATE_PATH: /home/runner/work/_temp/spn.pfx
ARM_SUBSCRIPTION_ID: ${{ secrets.SPN_SUBSCRIPTION_ID }}
ARM_TENANT_ID: ${{ secrets.SPN_TENANT_ID }}
ARM_ACCESS_KEY: ${{ secrets.TF_BACKEND_STORAGE_KEY }}
CERT: ${{ secrets.SPN_CERT }}
DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
REPO_ID: ${{ secrets.REPO_ID }}
TENANT_ID: ${{ secrets.SPN_TENANT_ID }}
CLIENT_ID: ${{ secrets.AZURE_SPN_ID }}
CLIENT_SECRET: ${{ secrets.SPN_CERT_PASSWORD }}

steps:
- name: Checkout
uses: actions/checkout@v3

- name: Convert BASE64 text to spn
run: |
echo $CERT
echo $CERT | base64 --decode > /home/runner/work/_temp/spn.pfx

- name: Setup Terraform
uses: hashicorp/setup-terraform@v2
with:
terraform_wrapper: false

- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install requests

- name: Generate Azure AD token
id: generate-token
shell: bash
env:
SPN_TENANT_ID: ${{ secrets.SPN_TENANT_ID }}
AZURE_SPN_ID: ${{ secrets.AZURE_SPN_ID }}
AZURE_SPN_PASSWORD: ${{ secrets.AZURE_SPN_PASSWORD }}
run: |
token=$(python costing/generate-token.py)
echo "access_token=$token" >> $GITHUB_ENV

One important point: when you export a value from a bash step, write it to $GITHUB_ENV (or $GITHUB_OUTPUT), because the old set-output workflow command has been deprecated by GitHub Actions.
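As an aside, the Python script itself could append to the file that $GITHUB_ENV points to, instead of relying on echo in the shell step. This is only an optional sketch of that alternative, not what the workflow above does:

import os

# Optional alternative: write the token straight into the GITHUB_ENV file so that
# later steps in the same job see it as an environment variable.
def export_to_github_env(name: str, value: str) -> None:
    env_file = os.environ.get("GITHUB_ENV")  # set automatically by GitHub Actions
    if env_file:
        with open(env_file, "a") as fh:
            fh.write(f"{name}={value}\n")

# e.g. export_to_github_env("access_token", token)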

Python script to generate a short-lived bearer token

Here is the Python script that generates the short-lived bearer token. You can either inline it in Main.yml or call it from another folder; in our case I created a folder called “costing” and saved the script as “generate-token.py”.

import os
import requests

tenant_id = os.environ['SPN_TENANT_ID']
client_id = os.environ['AZURE_SPN_ID']
client_secret = os.environ['AZURE_SPN_PASSWORD']

token_url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
token_data = {
    'grant_type': 'client_credentials',
    'client_id': client_id,
    'client_secret': client_secret,
    'scope': '2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default'
}
token_r = requests.post(token_url, data=token_data)
token_r.raise_for_status()
token = token_r.json().get('access_token')

print(token)

“2ff814a6-3304-4ab8-85cb-cd0e6f879c1d” is the well-known Azure AD application ID of the Azure Databricks resource; it is the same for every tenant, so the scope above never changes per workspace.

You can read more about this here: https://learn.microsoft.com/en-us/azure/databricks/dev-tools/service-prin-aad-token
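To confirm that the token the script prints is actually accepted by your workspace, you can call the Databricks SCIM “Me” endpoint with it. A minimal sketch, assuming the token is exported in a hypothetical DATABRICKS_AAD_TOKEN environment variable and the service principal has already been added to the workspace:

import os
import requests

# Hypothetical check: the workspace should accept the short-lived Azure AD token
# as a bearer token if the service principal has access to the workspace.
host = os.environ["DATABRICKS_HOST"].rstrip("/")
token = os.environ["DATABRICKS_AAD_TOKEN"]  # hypothetical variable holding the token

resp = requests.get(
    f"{host}/api/2.0/preview/scim/v2/Me",
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
resp.raise_for_status()
print("Token accepted; workspace identity:", resp.json().get("displayName") or resp.json().get("userName"))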

Terraform Integration

Now let's proceed with a sample Terraform configuration integrated with Databricks. For this demo there are a few pre-requisites to follow.

  1. Make sure you add providers.tf with the provider and backend block below.
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "3.16.0"
    }

    databricks = {
      source  = "databricks/databricks"
      version = "1.42.0"
    }
  }

  backend "azurerm" {
    storage_account_name = "StorageAccountName"
    container_name       = "StorageContainerName"
  }
}

provider "databricks" {
  host      = var.dbhost
  token     = var.dbtoken
  auth_type = "pat"
}

Note that although auth_type is set to "pat", the value passed to token is the short-lived Azure AD token generated in the workflow, not a long-lived Databricks PAT. Also make sure you use the latest version of the Databricks Terraform provider; you can get more info from the links below.

https://github.com/databricks/terraform-provider-databricks/
https://registry.terraform.io/providers/databricks/databricks/latest/docs

2. Next, add variable.tf with all the required variables. Just for the demo I have added the keys below.

#Access
variable "dbtoken" { type = string } #READ FROM ENV
variable "dbhost" { type = string }

#User groups & SPNs
variable "ug-reader" { type = string }
variable "ug-collaborator" { type = string }
variable "project-spn" { type = string }

#Catalogs
variable "catalogs" { type = list(string) }

#Init Script Path
variable "init_script_path" { type = string }

#environment tag
variable "environment" { type = string }

3. Then I use an environment-specific variable file (for example variables.develop.json) with the additional key/value pairs, since I am using Unity Catalog. The catalogs key only applies if you use Unity Catalog, and init_script_path only if your team installs a specific library through an init script; remove them otherwise. Note that the key names must match the variables declared above (project-spn), and that JSON files cannot contain comments.

{
  "ug-reader": "Readergroup_AAD_Group_Name",
  "ug-collaborator": "Collaborator_AAD_Group_Name",
  "project-spn": "ServicePrincipal_AppID",
  "catalogs": ["unitycatalog-dev", "unitycatalog-tst"],
  "init_script_path": "/Workspace/Repos/Release-TEST/github-repo/utils/install.sh",
  "environment": "development"
}
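Because JSON allows neither comments nor trailing commas, a quick local parse of the variable file before committing can save a failed pipeline run. A small optional sketch (the default file path is just an example matching the working-directory used later):

import json
import sys

# Optional local check: fail early if the per-environment variable file is malformed
# or is missing one of the keys declared in variable.tf.
path = sys.argv[1] if len(sys.argv) > 1 else "terraform/variables.develop.json"  # example path
with open(path) as fh:
    data = json.load(fh)  # raises json.JSONDecodeError on invalid JSON

expected = {"ug-reader", "ug-collaborator", "project-spn", "catalogs", "init_script_path", "environment"}
missing = expected - data.keys()
if missing:
    raise SystemExit(f"Missing keys: {sorted(missing)}")
print("OK:", sorted(data))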

4. install.sh file

Just for testing, I added a sample pip install of the Azure Key Vault secrets client; the script simply retries once if the first attempt fails.

#!/bin/bash
# Install the Azure Key Vault secrets client; retry once if the first attempt fails.
pip install azure-keyvault-secrets

if [ $? -ne 0 ]; then
  pip install azure-keyvault-secrets
fi

5. Terraform file for Databricks permissions

resource "databricks_grants" "permissions" {
for_each = toset(var.catalogs)
catalog = each.value
grant {
principal = var.ug-collaborator
privileges = [
"USE_CATALOG",
"USE_SCHEMA",
"CREATE_TABLE",
"CREATE_FUNCTION",
"CREATE_VOLUME",
"EXECUTE",
"MODIFY",
"REFRESH",
"SELECT",
"READ_VOLUME",
"WRITE_VOLUME",
"APPLY_TAG",
"BROWSE", # To Confirm Priveleges Exist
"CREATE_MATERIALIZED_VIEW", # To Confirm Priveleges Exist
"CREATE_MODEL"] # To Confirm Priveleges Exist
}
grant {
principal = var.ug-reader
privileges = [
"USE_CATALOG",
"USE_SCHEMA",
"BROWSE",
"SELECT",
"EXECUTE",
"READ_VOLUME"]
}
grant {
principal = var.project-spn
privileges = ["ALL_PRIVILEGES"]
}
}

Another sample file, compute.tf, manages the Databricks cluster policies and compute permissions, which tend to change very often.

# Define locals (used to specify the policy JSON)
locals {
  base_policy = {
    "node_type_id" : {
      "type" : "allowlist",
      "values" : [
        "Standard_DS3_v2"
      ],
      "defaultValue" : "Standard_DS3_v2",
      "hidden" : false
    },
    "autotermination_minutes" : {
      "type" : "fixed",
      "value" : 30,
      "hidden" : true
    },
    "spark_version" : {
      "type" : "allowlist",
      "values" : [
        data.databricks_spark_version.latest.id,
        data.databricks_spark_version.latest_ml.id
      ],
      "defaultValue" : data.databricks_spark_version.latest.id,
      "hidden" : false
    },
    "runtime_engine" : {
      "type" : "allowlist",
      "values" : ["PHOTON", "STANDARD"],
      "defaultValue" : "STANDARD"
    },
    "init_scripts.0.workspace.destination" : {
      "type" : "fixed",
      "value" : var.init_script_path
    },
    "custom_tags.taskname" : {
      "type" : "unlimited",
      "defaultValue" : "Interactive Development"
    },
    "custom_tags.environment" : {
      "type" : "fixed",
      "value" : var.environment
    }
  }

  single_node = {
    "spark_conf.spark.databricks.cluster.profile" : {
      "type" : "fixed",
      "value" : "singleNode",
      "hidden" : true
    }
  }

  extended_node_types = {
    "node_type_id" : {
      "type" : "allowlist",
      "values" : [
        "Standard_DS3_v2",
        "Standard_DS4_v2",
        "Standard_DS5_v2",
        "Standard_D4as_v5",
        "Standard_D8as_v5",
        "Standard_D16as_v5",
        "Standard_E4ds_v4",
        "Standard_E8ds_v4",
        "Standard_E16ds_v4"
      ],
      "defaultValue" : "Standard_DS3_v2",
      "hidden" : false
    }
  }

  multi_node = {
    "dbus_per_hour" : {
      "type" : "range",
      "maxValue" : 32
    }
  }
}

# Data source containing the latest DBR
data "databricks_spark_version" "latest" {
  long_term_support = false
  latest            = true
}

# Data source containing the latest DBR ML
data "databricks_spark_version" "latest_ml" {
  long_term_support = true
  ml                = true
  latest            = true
}

# Create policies
resource "databricks_cluster_policy" "Personal" {
  name       = "Personal-Compute"
  definition = jsonencode(merge(local.single_node, local.base_policy))
}

resource "databricks_cluster_policy" "Personal-Flexible" {
  name       = "Personal-Flexible"
  definition = jsonencode(merge(local.base_policy, local.extended_node_types, local.single_node))
}

resource "databricks_cluster_policy" "PowerUser" {
  name       = "PowerUser"
  definition = jsonencode(merge(local.base_policy, local.extended_node_types, local.multi_node))
}

# Assign policies
resource "databricks_permissions" "policy-usage-Personal" {
  cluster_policy_id = databricks_cluster_policy.Personal.id

  access_control {
    group_name       = var.ug-collaborator
    permission_level = "CAN_USE"
  }
}

resource "databricks_permissions" "policy-usage-Personal-Flexible" {
  cluster_policy_id = databricks_cluster_policy.Personal-Flexible.id

  access_control {
    group_name       = var.ug-collaborator
    permission_level = "CAN_USE"
  }
}
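One detail worth calling out: Terraform's merge() gives later arguments precedence when keys collide, which is why extended_node_types is listed after base_policy — its wider node_type_id allowlist replaces the restricted one. If it helps to see that rule concretely, here is a small Python sketch with simplified stand-in values (not the full policy maps above):

# Simplified stand-ins for the locals above; merging dicts in Python follows the
# same "last one wins" rule as Terraform's merge().
base_policy = {"node_type_id": {"values": ["Standard_DS3_v2"]}, "autotermination_minutes": {"value": 30}}
extended_node_types = {"node_type_id": {"values": ["Standard_DS3_v2", "Standard_DS4_v2", "Standard_DS5_v2"]}}

merged = {**base_policy, **extended_node_types}
print(merged["node_type_id"]["values"])   # the extended allowlist wins
print(merged["autotermination_minutes"])  # keys without a collision are kept as-is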

Terraform Integration in the GitHub Workflow

Now let's add the Terraform steps to the existing GitHub workflow. For this I have added four additional steps:

  • terraform init
  • terraform validate
  • terraform plan
  • terraform apply

- name: Terraform Init
  run: terraform init -backend-config "key=${{ inputs.environment }}"
  working-directory: terraform

- name: Terraform Validate
  run: terraform validate
  working-directory: terraform

- name: Terraform Plan
  run: terraform plan -var-file="variables.${{ inputs.environment }}.json"
  working-directory: terraform
  env:
    TF_VAR_dbtoken: ${{ env.access_token }}
    TF_VAR_dbhost: ${{ secrets.DATABRICKS_HOST }}

- name: Terraform Apply
  run: terraform apply -var-file="variables.${{ inputs.environment }}.json" -auto-approve
  working-directory: terraform
  env:
    TF_VAR_dbtoken: ${{ env.access_token }}
    TF_VAR_dbhost: ${{ secrets.DATABRICKS_HOST }}

As you can see, TF_VAR_dbtoken is set to the short-lived bearer token generated by the Python script earlier in the workflow.

Once you run the entire workflow, the init, validate, plan and apply steps execute in sequence and the Databricks permissions and cluster policies are created or updated in the target workspace.
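If you want a quick post-apply sanity check (optional), you can query the Unity Catalog permissions API with the same short-lived token and confirm the grants landed. A minimal sketch, again assuming the token is available in a hypothetical DATABRICKS_AAD_TOKEN variable:

import os
import requests

# Optional post-apply check: list the privilege assignments on one of the catalogs
# managed by the databricks_grants resource above.
host = os.environ["DATABRICKS_HOST"].rstrip("/")
token = os.environ["DATABRICKS_AAD_TOKEN"]  # hypothetical variable holding the token
catalog = "unitycatalog-dev"                # one of the catalogs from the variable file

resp = requests.get(
    f"{host}/api/2.1/unity-catalog/permissions/catalog/{catalog}",
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
resp.raise_for_status()
for assignment in resp.json().get("privilege_assignments", []):
    print(assignment["principal"], "->", ", ".join(assignment["privileges"]))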

Here is the full YAML file (Main.yml) for your reference.

name: "Databricks Permissions Setup"

on:
workflow_call:
inputs:
environment:
description: "develop, prod"
required: true
type: string

jobs:
terraform:
name: "Terraform ${{ inputs.environment }}"
runs-on: ubuntu-latest
environment:
name: ${{ inputs.environment }}
env:
ARM_CLIENT_ID: ${{ secrets.SPN_CLIENT_ID }}
ARM_CLIENT_CERTIFICATE_PASSWORD: ${{ secrets.SPN_CERT_PASSWORD }}
ARM_CLIENT_CERTIFICATE_PATH: /home/runner/work/_temp/spn.pfx
ARM_SUBSCRIPTION_ID: ${{ secrets.SPN_SUBSCRIPTION_ID }}
ARM_TENANT_ID: ${{ secrets.SPN_TENANT_ID }}
ARM_ACCESS_KEY: ${{ secrets.TF_BACKEND_STORAGE_KEY }}
CERT: ${{ secrets.SPN_CERT }}
DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
REPO_ID: ${{ secrets.REPO_ID }}
TENANT_ID: ${{ secrets.SPN_TENANT_ID }}
CLIENT_ID: ${{ secrets.AZURE_SPN_ID }}
CLIENT_SECRET: ${{ secrets.SPN_CERT_PASSWORD }}

steps:
- name: Checkout
uses: actions/checkout@v3

- name: Convert BASE64 text to spn
run: |
echo $CERT
echo $CERT | base64 --decode > /home/runner/work/_temp/spn.pfx

- name: Setup Terraform
uses: hashicorp/setup-terraform@v2
with:
terraform_wrapper: false

- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install requests

- name: Generate Azure AD token
id: generate-token
shell: bash
env:
SPN_TENANT_ID: ${{ secrets.SPN_TENANT_ID }}
AZURE_SPN_ID: ${{ secrets.AZURE_SPN_ID }}
AZURE_SPN_PASSWORD: ${{ secrets.AZURE_SPN_PASSWORD }}
run: |
token=$(python costing/generate-token.py)
echo "access_token=$token" >> $GITHUB_ENV


- name: Terraform Init
run: terraform init -backend-config "key=${{ inputs.environment }}"
working-directory: terraform

- name: Terraform Validate
run: terraform validate
working-directory: terraform

- name: Terraform Plan
run: terraform plan -var-file="variables.${{ inputs.environment }}.json"
working-directory: terraform
env:
TF_VAR_dbtoken: ${{ env.access_token }}
TF_VAR_dbhost: ${{ secrets.DATABRICKS_HOST }}

- name: Terraform Apply
run: terraform apply -var-file="variables.${{ inputs.environment }}.json" -auto-approve
working-directory: terraform
env:
TF_VAR_dbtoken: ${{ env.access_token }}
TF_VAR_dbhost: ${{ secrets.DATABRICKS_HOST }}

Likewise, you can automate many other Databricks tasks using additional Terraform resources and modules.

Conclusion

This workflow demonstrates how to automate permissions setup and other administrative tasks in Databricks using GitHub Actions and Terraform. Automating these tasks ensures consistency, reduces manual errors, and keeps the management of Databricks resources secure and efficient. The approach can be extended and customized for different environments and use cases while still following your DevOps guidelines.
