Azure Databricks — Multiple Asset Bundle Deployment and Runs
Introduction to Azure Databricks Asset Bundles
Azure Databricks Asset Bundles simplify the deployment and management of Databricks notebooks, libraries, and other resources as cohesive units. They ensure consistency across environments and streamline the deployment process.
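For reference, a bundle is defined by a databricks.yml file at the project root. Below is a minimal, hypothetical sketch (the bundle name, workspace host, and include path are placeholders, not taken from the project described later):

bundle:
  name: my_sample_bundle          # hypothetical bundle name

include:
  - Resources/*.yml               # pull in the job/pipeline resource definitions

targets:
  develop:
    mode: development
    workspace:
      host: https://adb-1234567890123456.7.azuredatabricks.net   # placeholder host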
Why Use Azure Databricks Asset Bundles?
Asset Bundles offer several advantages:
- Consistency and Reproducibility: Bundles package notebooks, libraries, and configurations together, ensuring consistent deployments across various environments.
- Simplified Deployment: By consolidating all necessary assets into a single deployable unit, errors during deployment are minimized.
- Version Control: Enables tracking of changes and easy rollback to previous versions.
- Environment Isolation: Facilitates controlled deployment across different environments, enhancing stability and testing capabilities.
Common Problems and Solutions
Problem:
A common problem when deploying and running an Asset Bundle in Databricks through any CI tool is that you have to pass the resource name (the job or pipeline key) to the run command, along with other parameters. How do you manage this when deploying multiple bundles?
databricks bundle run <name> --refresh-all
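Hard-coding one command per job quickly becomes unmanageable. With five jobs you would be maintaining something like the following (the job keys here are hypothetical):

databricks bundle run Events1 --refresh-all
databricks bundle run Events2 --refresh-all
databricks bundle run Events3 --refresh-all
# ...one line per job, edited by hand whenever a job is added or removed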
Solution:
Let’s walk through the steps. To handle multiple bundles, I added my new bundle definitions to my GitHub repository; you can store them in .dbc or .yml format, depending on your requirements. When saving the files, name each one after the job key it defines, since that is the name databricks bundle run expects once the job is deployed to Azure Databricks.
1. First, save all the new .dbc or .yml files in a dedicated folder; here, I am saving everything under the “Resources” folder.
2. Now, let’s look at the Events1.yml file, which contains my job definition and its tasks.
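The file itself isn’t reproduced here, but an Asset Bundle job resource of this kind generally has the following shape (the task keys and notebook paths below are illustrative assumptions, not the actual contents):

resources:
  jobs:
    Events1:                          # job key — matches the file name per the convention above
      name: Events1
      tasks:
        - task_key: ingest_events     # hypothetical first task
          notebook_task:
            notebook_path: ./notebooks/ingest_events
        - task_key: transform_events  # hypothetical downstream task
          depends_on:
            - task_key: ingest_events
          notebook_task:
            notebook_path: ./notebooks/transform_events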
3. Here, you can see that I have five jobs that need to be part of the Asset Bundle deployment and run. Instead of providing an individual command for each one, I want to pick all of them up dynamically.
4. For that, I created a new YAML workflow file in my GitHub repository and defined all my Databricks parameters in it. First, I want to verify that all the new .yml files are being captured, by adding this step and checking its output:
- name: List bundle files
  run: |
    ls xxx/Resources/*.yml | xargs -n 1 basename | sed 's/\.yml$//' > xxx/Resources/bundle_names.txt
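This pipeline strips the directory with basename and removes the .yml extension with sed, so for five job files the resulting bundle_names.txt holds one job key per line, for example (names beyond Events1 are hypothetical):

Events1
Events2
Events3
Events4
Events5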
5. This next step is optional; use it if you want to verify the file’s contents by printing each name:
- name: Read bundle names
  id: read-bundle-names
  run: |
    while IFS= read -r line; do
      echo "Found bundle: $line"
      # ::set-output is deprecated in GitHub Actions; write to $GITHUB_OUTPUT instead.
      # Note: each iteration overwrites the value, so only the last name survives.
      echo "bundle_names=$line" >> "$GITHUB_OUTPUT"
    done < xxx/Resources/bundle_names.txt
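If you did want to consume that output in a later step, GitHub Actions exposes it via the step id, as in this illustration (the workflow below reads the file directly instead, so this step is not required):

- name: Show last bundle name
  run: echo "Last bundle was ${{ steps.read-bundle-names.outputs.bundle_names }}"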
6. Finally, to run the bundles, I read the names back from bundle_names.txt and feed each one into the run command:
- name: Run Databricks bundles
  run: |
    cd xxx/Resources
    for bundle_name in $(cat bundle_names.txt); do
      echo "Running bundle: $bundle_name"
      databricks bundle run "$bundle_name" --refresh-all
    done
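One caveat: because GitHub Actions runs these scripts with bash -e by default, the loop above aborts on the first failing job. If you would rather attempt every job and fail the step at the end, a variant like this works (a sketch, not part of the original workflow):

failed=0
for bundle_name in $(cat bundle_names.txt); do
  echo "Running bundle: $bundle_name"
  # Record the failure but keep going through the remaining jobs.
  databricks bundle run "$bundle_name" --refresh-all || { echo "FAILED: $bundle_name"; failed=1; }
done
exit $failed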
How the Provided Script Helps
The GitHub Actions workflow Deploy Databricks Asset Bundle automates the deployment of Databricks asset bundles to a specified environment (develop in this case). Here’s how it works:
Trigger: It triggers on a manual workflow dispatch or a specific branch.
Deployment Steps:
Deploy New Bundle: Tears down any existing deployment with databricks bundle destroy, then deploys the bundle to the develop environment using the databricks bundle deploy command.
Pipeline Update: Once deployed, it triggers a pipeline update (pipeline_update-develop) to validate and execute the bundle.
Execution:
List and Read Bundle Names: Lists all .yml files in RootPath/Resources/ and writes their base names to bundle_names.txt, optionally echoing them as output variables.
Run Bundles: Iteratively runs each bundle found in bundle_names.txt using databricks bundle run, refreshing all associated resources.
Here is the full YAML file, which anyone can adapt:
name: Deploy Databricks Asset Bundle
on:
  workflow_dispatch:

jobs:
  deploy-develop:
    if: github.ref == 'refs/heads/feature/prashanth1'
    name: "Deploy develop bundle"
    runs-on: ubuntu-latest
    environment: develop
    env:
      DATABRICKS_HOST: ${{ vars.DATABRICKS_HOST }}
      DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
      DATABRICKS_BUNDLE_ENV: develop
    steps:
      - name: Checkout code
        uses: actions/checkout@v3
      - uses: databricks/setup-cli@main
      # Tear down the previous deployment before redeploying.
      - run: databricks bundle destroy --auto-approve
        working-directory: RootPath/
      - run: databricks bundle deploy -t develop
        working-directory: RootPath/

  pipeline_update-develop:
    if: github.ref == 'refs/heads/feature/prashanth1'
    name: "Run pipeline update for develop"
    runs-on: ubuntu-latest
    environment: develop
    env:
      DATABRICKS_HOST: ${{ vars.DATABRICKS_HOST }}
      DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
      DATABRICKS_BUNDLE_ENV: develop
    needs:
      - deploy-develop
    steps:
      - uses: actions/checkout@v3
      - uses: databricks/setup-cli@main
      # Derive one name per job from the resource .yml files.
      - name: List bundle files
        run: |
          ls RootPath/Resources/*.yml | xargs -n 1 basename | sed 's/\.yml$//' > RootPath/Resources/bundle_names.txt
      # Optional: print each discovered name for debugging.
      - name: Read bundle names
        id: read-bundle-names
        run: |
          while IFS= read -r line; do
            echo "Found bundle: $line"
            echo "bundle_names=$line" >> "$GITHUB_OUTPUT"
          done < RootPath/Resources/bundle_names.txt
      # Run the Databricks bundles for each retrieved bundle name.
      - name: Run Databricks bundles
        run: |
          cd RootPath/Resources
          for bundle_name in $(cat bundle_names.txt); do
            echo "Running bundle: $bundle_name"
            databricks bundle run "$bundle_name" --refresh-all
          done
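Because the workflow runs on workflow_dispatch, you can start it from the Actions tab in GitHub, or from a terminal if you have the GitHub CLI installed:

gh workflow run "Deploy Databricks Asset Bundle" --ref feature/prashanth1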
This is how the output looks while you deploy; in my case, these are Jobs.
And these are my internal task execution runs.
Conclusion
By using Azure Databricks Asset Bundles and automating their deployment with the provided GitHub Actions workflow, teams can ensure consistent, reproducible deployments across environments, mitigate common deployment issues, and maintain better control over their Databricks workflows and configurations. This approach enhances reliability, facilitates collaboration, and supports efficient development practices.