Automating .whl File Deployment to Azure Databricks with GitHub Actions

Prashanth Kumar
Oct 13, 2023

Introduction

In the world of data engineering and data science, the ability to streamline and automate workflows is invaluable. One common task is deploying Python packages, such as .whl files, to Azure Databricks. In this article, we’ll explore how to automate this process using GitHub Actions, a powerful and flexible workflow automation tool.

What is a .whl File?

A .whl file, short for "wheel," is Python's built package format: the standard binary distribution format for Python libraries. Because a wheel is pre-built, installing it requires no build step, which simplifies distributing Python projects.

Prerequisites

Before you begin, you should have the following:

  1. Azure Databricks Workspace: You need access to an Azure Databricks workspace.
  2. DBFS Mount Point: A valid mount point on Databricks.
  3. Azure Databricks Cluster: You should have a running cluster in your workspace.
  4. .whl File: The .whl file that you want to install. You can create this file by packaging your Python library, or you can obtain it from a trusted source.
  5. GitHub Repository: A repository to hold your code and workflow.
  6. Secrets: The required secrets configured in your GitHub repository.

Steps to Create and Push a .whl File to Azure Databricks Using GitHub Actions

Step 1: Build and Package Your Python Code

I have uploaded all my required Python code, along with a requirements.txt file, to my GitHub repository. In this section I build and package the Python code. As additional steps, I also add SonarQube and Python test stages, which provide extra assurance for my Python library and help me avoid vulnerabilities.

  1. Let's start with the GitHub Actions workflow. Since I want to capture the package version, I add a step that sets package_version along with the environment:
jobs:
  version_number:
    runs-on: ubuntu-latest
    outputs:
      output1: ${{ steps.step1.outputs.test }}
    steps:
      - id: step1
        run: |
          date_part=$(date +%y%m.%-d%H)
          PKG_VERSION="1.$date_part"
          echo "$PKG_VERSION"
          echo "PKG_VERSION=${PKG_VERSION}" >> $GITHUB_ENV
          # ::set-output is deprecated; write to GITHUB_OUTPUT instead
          echo "test=${PKG_VERSION}" >> $GITHUB_OUTPUT
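The date-based scheme above produces versions like 1.2310.1314 (two-digit year plus month, then day plus hour). As a sanity check, here is a small Python sketch that mirrors the shell expression `date +%y%m.%-d%H`:

```python
from datetime import datetime

def build_pkg_version(now: datetime) -> str:
    # Mirrors the shell expression: PKG_VERSION="1.$(date +%y%m.%-d%H)"
    # %y%m -> two-digit year + month; %-d -> day without zero padding; %H -> hour
    return f"1.{now:%y%m}.{now.day}{now:%H}"

print(build_pkg_version(datetime(2023, 10, 13, 14, 5)))  # 1.2310.1314
```

Because the version is derived from the build timestamp, every workflow run yields a distinct, monotonically increasing version without manual bookkeeping.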

2. Next, we need to build the Python package against different versions, so I use the matrix strategy and pass in the Python versions. Currently I am running only Python 3.8:

  test_on_ubuntu:
    runs-on: ubuntu-latest
    needs: version_number
    strategy:
      matrix:
        python-version: [ '3.8' ]

### If you want other Python versions, replace the last line with:
### python-version: [ '3.8', '3.9', '3.10', '3.11' ]

3. Now let's add further steps. Next I install the Python dependencies, which are needed for collecting coverage, running the Python tests, and so on:

    - name: Install dependencies
      run: |
        pip install -r requirements.txt
        pip install pylint
        pip install pytest
        pip install coverage
        pip install --upgrade databricks-cli

4. Some secrets need to be substituted at runtime, so I run sed replace commands:

    - name: Insert the credentials
      run: |
        sed -i "s/env_PITSSecret/${{ secrets.env_PITSSecret }}/g" tests/testconfig.ini
        sed -i "s/env_Azure/${{ secrets.env_Azure }}/g" tests/testconfig.ini
        sed -i "s/env_dbtoken/${{ secrets.env_dbtoken }}/g" tests/testconfig.ini
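The same placeholder substitution can be sketched in Python. The file name and placeholder tokens below mirror the workflow; the secret values are hypothetical. Note that plain `str.replace` also avoids sed's delimiter problem when a secret happens to contain a `/`:

```python
from pathlib import Path

def inject_secrets(path, replacements):
    # Replace each placeholder token with its runtime value,
    # equivalent to the sed -i "s/placeholder/value/g" commands above
    text = path.read_text()
    for placeholder, value in replacements.items():
        text = text.replace(placeholder, value)
    path.write_text(text)

# Hypothetical config file with the placeholders used in this article
cfg = Path("testconfig.ini")
cfg.write_text("[auth]\ntoken = env_dbtoken\nstorage = env_Azure\n")
inject_secrets(cfg, {"env_dbtoken": "dapi-example", "env_Azure": "example-account"})
print(cfg.read_text())
```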

5. In the next step I run a SonarQube scan to find bugs, vulnerabilities, security hotspots, and so on:

    - name: SonarQube Scan
      uses: sonarsource/sonarqube-scan-action@master
      with:
        args: >
          -Dsonar.projectKey=com.projectname
          -Dsonar.python.coverage.reportPaths=./reports/sonar-report.xml
          -Dsonar.python.coverage.testExecutionReportPaths=./reports/sonar-report.xml
      env:
        SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
        SONAR_HOST_URL: ${{ secrets.SONAR_HOST_URL }}

6. Next I build the wheel file and create a release to store the generated build in my GitHub releases:

    - name: Buildwheel
      run: |
        python -m pip install --user --upgrade build
        python -m build .
        echo "VERSION = $(python setup.py --version)" >> $GITHUB_ENV

    - name: Create Release
      uses: softprops/action-gh-release@v1
      with:
        files: ./dist/*.whl
        tag_name: v1.2.0-${{ matrix.python-version }} # Replace with the name of the existing tag you want to use
      env:
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
      if: startsWith(github.ref, 'refs/tags/') # Only run when the workflow is triggered by a tag

### Also upload the wheel file as an artifact, in case you want to debug on your local machine

    - name: Upload the wheels
      uses: actions/upload-artifact@v3
      with:
        name: built_wheel
        path: dist/*.whl
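The files published by the release and artifact steps follow the standard wheel naming scheme from PEP 427: `{distribution}-{version}-{python tag}-{abi tag}-{platform tag}.whl`. A small sketch for pulling those fields out of a filename (assuming no optional build tag is present, as with the wheels in this article):

```python
def parse_wheel_filename(filename):
    # PEP 427: {distribution}-{version}(-{build})?-{python}-{abi}-{platform}.whl
    # This sketch assumes no optional build tag is present.
    stem = filename[: -len(".whl")]
    parts = stem.split("-")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "distribution": parts[0],
        "version": "-".join(parts[1:-3]),
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }

print(parse_wheel_filename("analytics-2.0.0.dev0-py3-none-any.whl"))
```

A tag like `py3-none-any` means the wheel is pure Python and installable on any platform, which is why the same file can be uploaded once and used across clusters.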

7. I have commented out the Python test steps for now, but anyone who wants to use them can leverage this:

    # - name: Install dependencies
    #   run: |
    #     echo "Python version is: ${{ matrix.python-version }}"
    #     pip install dist/analytics-2.0.0.dev0-py3-none-any.whl[Full]
    #     python -m unittest discover

    # - name: List the tests
    #   run: pytest --collect-only tests

    # - name: Test with pytest
    #   run: |
    #     conda install pytest
    #     pytest

    # - name: Install tox
    #   run: pip install tox

    # - name: Run tests with tox
    #   run: tox

Step 2: Pushing the .whl file to the Azure Databricks DBFS mount

Now that your Python package is built and released, it's time to deploy it to Azure Databricks. This step involves setting up the Databricks CLI, configuring it with your Databricks host and access token, and copying the .whl file to your Databricks workspace.

    - name: Deploy .whl to Databricks DBFS
      run: |
        # Set the Databricks host and token from secrets
        DATABRICKS_TOKEN=${{ secrets.DATABRICKS_TOKEN }}
        DATABRICKS_HOST=${{ secrets.DATABRICKS_HOST }}

        # Configure the Databricks CLI with the token
        echo "[DEFAULT]" > ~/.databrickscfg
        echo "host = $DATABRICKS_HOST" >> ~/.databrickscfg
        echo "token = $DATABRICKS_TOKEN" >> ~/.databrickscfg

        # Copy wheel files (replace the dbfs path with your actual location)
        for f in dist/*.whl; do
          databricks fs cp "$f" dbfs:/analytics-automation/ --overwrite
        done

Step 3: Verify file upload

After the workflow uploads the package, you should verify that it is available in your Azure Databricks workspace.

  1. Log in to Azure Databricks → click Catalog → expand your metastore → click New Schema → make sure the "Browse DBFS" option is enabled.

Here is the complete GitHub Actions workflow file:

name: Python package

on:
  push:
    branches:
      - 'feature/prashanth-ver1'

jobs:
  version_number:
    runs-on: ubuntu-latest
    outputs:
      output1: ${{ steps.step1.outputs.test }}
    steps:
      - id: step1
        run: |
          date_part=$(date +%y%m.%-d%H)
          PKG_VERSION="1.$date_part"
          echo "$PKG_VERSION"
          echo "PKG_VERSION=${PKG_VERSION}" >> $GITHUB_ENV
          # ::set-output is deprecated; write to GITHUB_OUTPUT instead
          echo "test=${PKG_VERSION}" >> $GITHUB_OUTPUT

  test_on_ubuntu:
    runs-on: ubuntu-latest
    needs: version_number
    strategy:
      matrix:
        python-version: [ '3.8' ]

    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pylint
          pip install pytest
          pip install coverage
          pip install --upgrade databricks-cli

      - name: Insert the credentials
        run: |
          sed -i "s/env_PITSSecret/${{ secrets.env_PITSSecret }}/g" tests/testconfig.ini
          sed -i "s/env_Azure/${{ secrets.env_Azure }}/g" tests/testconfig.ini
          sed -i "s/env_dbtoken/${{ secrets.env_dbtoken }}/g" tests/testconfig.ini

      - name: Run pylint
        run: pylint ${{ github.workspace }}/*.py

      - name: SonarQube Scan
        uses: sonarsource/sonarqube-scan-action@master
        with:
          args: >
            -Dsonar.projectKey=com.analytics
            -Dsonar.python.coverage.reportPaths=./reports/sonar-report.xml
            -Dsonar.python.coverage.testExecutionReportPaths=./reports/sonar-report.xml
        env:
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
          SONAR_HOST_URL: ${{ secrets.SONAR_HOST_URL }}

      - name: Buildwheel
        run: |
          python -m pip install --user --upgrade build
          python -m build .
          echo "VERSION = $(python setup.py --version)" >> $GITHUB_ENV

      - name: Create Release
        uses: softprops/action-gh-release@v1
        with:
          files: ./dist/*.whl
          tag_name: v1.2.0-${{ matrix.python-version }} # Replace with the name of the existing tag you want to use
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        if: startsWith(github.ref, 'refs/tags/') # Only run when the workflow is triggered by a tag

      - name: Upload the wheels
        uses: actions/upload-artifact@v3
        with:
          name: built_wheel
          path: dist/*.whl

      # - name: Install dependencies
      #   run: |
      #     echo "Python version is: ${{ matrix.python-version }}"
      #     pip install dist/qgcanalytics-2.0.0.dev0-py3-none-any.whl[Full]
      #     python -m unittest discover

      # - name: List the tests
      #   run: pytest --collect-only tests

      # - name: Test with pytest
      #   run: |
      #     conda install pytest
      #     pytest

      # - name: Install tox
      #   run: pip install tox

      # - name: Run tests with tox
      #   run: tox

      - name: Deploy .whl to Databricks DBFS
        run: |
          # Set the Databricks host and token from secrets
          DATABRICKS_TOKEN=${{ secrets.DATABRICKS_TOKEN }}
          DATABRICKS_HOST=${{ secrets.DATABRICKS_HOST }}

          # Configure the Databricks CLI with the token
          echo "[DEFAULT]" > ~/.databrickscfg
          echo "host = $DATABRICKS_HOST" >> ~/.databrickscfg
          echo "token = $DATABRICKS_TOKEN" >> ~/.databrickscfg

          # Copy wheel files
          for f in dist/*.whl; do
            databricks fs cp "$f" dbfs:/xxx-analytics-automation/ --overwrite
          done


Conclusion

Automating the deployment of Python packages to Azure Databricks using GitHub Actions streamlines your development and data engineering workflows. With this automated process, you can ensure that the latest versions of your packages are readily available in your Databricks workspace, making it easier to collaborate and integrate data pipelines.

By implementing this workflow, you save time and reduce the chances of human error, enabling you to focus on building data solutions and applications.
