Azure Databricks — Different ways to achieve Platform level monitoring

Prashanth Kumar
Jan 13, 2024 · 13 min read

Here I will be talking about my experiences with Azure Databricks monitoring. A common concern for any backend support team is service availability and telemetry collection. I will focus on Azure Databricks monitoring: the different ways we can collect and view Databricks metrics and take corrective actions.

Insights

Azure Databricks offers powerful capabilities for big data analytics and machine learning workloads. To effectively manage and monitor your Databricks environment, you can use several methods (Azure Log Analytics, PowerShell, and the REST API) for tasks such as checking metrics, inspecting workflow runs, and gathering cluster details. In this article, we'll explore key aspects of Azure Databricks monitoring using each of these approaches.

Prerequisites

Before diving into monitoring, make sure you have the following:

  • An Azure Databricks workspace.
  • An Azure Databricks PAT Token.
  • An Azure Service Principal (optional).
  • Azure Log Analytics workspace.
  • Azure Monitor.
  • Appropriate permissions to access the Databricks REST API.
  • PowerShell / Azure CLI access.
  • Tools like Postman for API testing.

Various tools

  1. Azure Databricks UI
  2. Using PowerShell
  3. Using Azure REST API
  4. Using Azure AppInsights
  5. Azure Log Analytics Workspace

Getting Metrics

Monitoring the performance of your Databricks clusters is crucial for optimizing resource utilization and identifying potential issues. The Databricks REST API provides endpoints for retrieving metrics, cluster activity, workflow runs, and more.

Using PowerShell

Let's start with PowerShell.

Azure Databricks provides a powerful platform for big data analytics and machine learning. Monitoring and managing clusters are essential tasks for maintaining optimal performance. Here I specifically show how to use PowerShell in conjunction with the Azure Databricks REST API to query and retrieve metrics for a specific Databricks cluster.

Prerequisites:

  • Azure Databricks workspace deployed in the specified Azure region.
  • Azure CLI installed on your local machine.
  • PowerShell installed on your local machine.
  • Appropriate permissions to access the Databricks workspace.

Steps: Let's look at this in action. I will show each command along with its output.

Step1: Setting variables

Define the variables that will be used multiple times rather than passing the values each time. Open a PowerShell session and set the necessary variables such as Azure region, resource group name, Databricks workspace name, and cluster name (if you only want details for one known cluster).

$region = "region-where-your-databricksinstance-hosted"
$rgName = "Resource_group_name"
$workspaceName = "Databricks_workspace_name"
$clusterName = "Cluster-name-incase-if-you-want-to-retrieve-data-from-specific-cluster"

Step2: Retrieve Databricks Workspace ID

Next, retrieve the Databricks workspace ID using the Azure CLI. The workspace ID is a unique ID allocated to each environment.

$WORKSPACE_ID = (az resource show --resource-type Microsoft.Databricks/workspaces --resource-group $rgName --name $workspaceName --query 'id' --output tsv)

You can also get the workspace URL from the Databricks properties pane.

You can read more here: https://docs.databricks.com/en/workspace/workspace-details.html#:~:text=Workspace%20instance%20names%2C%20URLs%2C%20and%20IDs,-An%20instance%20name&text=Some%20types%20of%20workspaces%20have,the%20workspace%20ID%20is%206280049833385130%20.

Step3: Getting Databricks and Scope Token

Let's proceed and get the Databricks access token. Obtain the Databricks access token and the Azure Management token using the Azure CLI. Databricks has a specific scoped resource in Azure, identified by "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d".

$TOKEN = (az account get-access-token --resource 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d | ConvertFrom-Json).accessToken
$AZ_TOKEN = (az account get-access-token --resource https://management.core.windows.net/ | ConvertFrom-Json).accessToken

You might be wondering about the resource key "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d": it is the programmatic ID for Azure Databricks in Azure Active Directory. You can find more information at: https://learn.microsoft.com/en-us/azure/databricks/dev-tools/service-prin-aad-token

The first command gives you a bearer token for the Databricks scope.

The second command gives you a token for Azure Resource Manager (https://management.core.windows.net/), retrieved here in JSON format.

Both are short-lived tokens that expire after one hour, so you have to regenerate or refresh them; you can also configure a shorter lifetime if you want. Another reason for using the Azure AD token is to avoid a Databricks PAT token, which is linked to an individual user identity: if that person leaves the organization or moves on, the automation may stop working. You can still use a PAT, but I find this approach a good way to minimize risk.

You can read more about the Databricks PAT token → https://docs.databricks.com/en/dev-tools/auth/pat.html

Step4: Generating PAT token

If you want to generate a short-lived PAT token rather than relying on an individual's identity, you can use the script below.

$WORKSPACE_ID = (az resource show --resource-type Microsoft.Databricks/workspaces --resource-group $rgName --name $workspaceName --query id --output tsv)
$TOKEN = (az account get-access-token --resource 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d | ConvertFrom-Json).accessToken
$AZ_TOKEN = (az account get-access-token --resource https://management.core.windows.net/ | ConvertFrom-Json).accessToken
$HEADERS = @{
    "Authorization" = "Bearer $TOKEN"
    "X-Databricks-Azure-SP-Management-Token" = "$AZ_TOKEN"
    "X-Databricks-Azure-Workspace-Resource-Id" = "$WORKSPACE_ID"
}
# Request a PAT that expires after 20 minutes
$BODY = @'
{ "lifetime_seconds": 1200, "comment": "Azure DevOps pipeline" }
'@
$DB_PAT = ((Invoke-RestMethod -Method POST -Uri "https://$region.azuredatabricks.net/api/2.0/token/create" -Headers $HEADERS -Body $BODY).token_value)
# Expose the PAT as a pipeline variable when running inside Azure DevOps
Write-Output "##vso[task.setvariable variable=DB_PAT]$DB_PAT"

Here is the output

Step5: Setting headers for API requests

Next we set up the headers for the API request that will be sent to Azure Databricks. These authenticate the request without using an individual identity and provide context about the workspace we want to use.

Set the required headers for the API request.

$HEADERS = @{
    "Authorization" = "Bearer $TOKEN"
    "X-Databricks-Azure-SP-Management-Token" = "$AZ_TOKEN"
    "X-Databricks-Azure-Workspace-Resource-Id" = "$WORKSPACE_ID"
}

This is how my output looks

Step6: Construct URI for API requests

Let's proceed to construct the URI for the Clusters API request. This command builds the actual URL you will use in the next command, inserting your region into the endpoint.

$DB_URI_CLUSTERS = "https://{0}.azuredatabricks.net/api/2.0/clusters/list" -f $region

Step7: Getting clusters list

This command retrieves the list of clusters in use in your Databricks environment.

$clustersList = Invoke-RestMethod -Method GET -Uri $DB_URI_CLUSTERS -Headers $HEADERS
Write-Output $clustersList

By default the output comes back as one long line; for a more readable format you can use the Format-Table cmdlet.

$clustersList | Format-Table
$clustersinformattedtable = $clustersList | Format-Table -AutoSize -Wrap | Out-String

Step8: Getting Cluster specific information

Now you have the details of the clusters in use, and you can fetch cluster_id and cluster_name. If you want to filter further, use the $clusterName variable defined in the variables section.

$selectedCluster = $clustersList.clusters | Where-Object { $_.cluster_name -eq $clusterName }

This is how my output looks.

You can find the same information when you open your Databricks UI Console/instance, click on specific compute cluster, under configuration section you can find cluster related info.

Step9: Cluster level Information:

The next step is to retrieve cluster-specific data related to CPU, memory, swap space, and file system.


# Metrics to request for the selected cluster
$metrics = "cpu", "memory", "swap", "fs"
$BODY = @{
    "cluster_id" = $selectedCluster.cluster_id
    "metrics" = $metrics -join ","
} | ConvertTo-Json

$DB_URI_METRICS = "https://{0}.azuredatabricks.net/api/2.0/clusters/get" -f $region
$clusterMetrics = Invoke-RestMethod -Method POST -Uri $DB_URI_METRICS -Headers $HEADERS -Body $BODY

Write-Output $clusterMetrics

These are some of the common tasks any support team performs; you can also start and stop jobs/workflows and retrieve more granular details.

Here is full PowerShell script

$region = "region-where-your-databricksinstance-hosted"
$rgName = "Resource_group_name"
$workspaceName = "Databricks_workspace_name"
$clusterName = "Cluster-name-incase-if-you-want-to-retrieve-data-from-specific-cluster"

# Retrieve Databricks workspace ID
$WORKSPACE_ID = (az resource show --resource-type Microsoft.Databricks/workspaces --resource-group $rgName --name $workspaceName --query 'id' --output tsv)

# Get Databricks access token
$TOKEN = (az account get-access-token --resource 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d | ConvertFrom-Json).accessToken
$AZ_TOKEN = (az account get-access-token --resource https://management.core.windows.net/ | ConvertFrom-Json).accessToken

# Set headers for API request
$HEADERS = @{
    "Authorization" = "Bearer $TOKEN"
    "X-Databricks-Azure-SP-Management-Token" = "$AZ_TOKEN"
    "X-Databricks-Azure-Workspace-Resource-Id" = "$WORKSPACE_ID"
}

# Construct URI for API request to get list of clusters
$DB_URI_CLUSTERS = "https://{0}.azuredatabricks.net/api/2.0/clusters/list" -f $region

# Invoke REST API to get list of clusters
$clustersList = Invoke-RestMethod -Method GET -Uri $DB_URI_CLUSTERS -Headers $HEADERS

# Display the list of clusters (optional)
Write-Output $clustersList

# Select a specific cluster by name
$selectedCluster = $clustersList.clusters | Where-Object { $_.cluster_name -eq $clusterName }

# Check if the selected cluster exists
if ($null -eq $selectedCluster) {
    Write-Output "Cluster with name '$clusterName' not found."
} else {
    # Set body for API request
    $BODY = @{
        "cluster_id" = $selectedCluster.cluster_id
        "metrics" = "ALL"
    } | ConvertTo-Json

    # Construct URI for API request to get metrics for the selected cluster
    $DB_URI_METRICS = "https://{0}.azuredatabricks.net/api/2.0/clusters/get" -f $region

    # Invoke REST API to get metrics for the selected cluster
    $clusterMetrics = Invoke-RestMethod -Method POST -Uri $DB_URI_METRICS -Headers $HEADERS -Body $BODY

    # Display or process cluster metrics as needed
    Write-Output $clusterMetrics
}

Using REST API

Now let's see how we can do Databricks platform-level monitoring using REST API calls. Here I am using Postman for a quick demo.

Step 1: Get Databricks Access Token

Databricks personal access token (PAT) authentication uses Databricks users or service principals with short-lived or long-lived strings as credentials. These access tokens can be set to expire in as little as one day or less, or they can be set to never expire.

There are two ways to get a token: either you use an existing service principal's client ID and secret, or you generate a short-lived Databricks PAT token (e.g., "dapicXXX").

Let's first look at the case where you have a client ID and secret.

  • Open Postman and create a new request.
  • Set the request type to POST.
  • Enter the Databricks Access Token URL: https://login.microsoftonline.com/<your-tenant-id>/oauth2/token
  • Set the request body to x-www-form-urlencoded with the following parameters:
  • grant_type: client_credentials
  • client_id: <your-client-id>
  • client_secret: <your-client-secret>
  • resource: 2ff814a6-3304-4ab8-85cb-cd0e6f879c1d

Click “Send” to get the access token.

Step 2: Get Azure Resource Manager Token

Now let's generate the management token. For this, again open Postman (a PowerShell equivalent for both token requests follows the list below).

  • Create another request.
  • Set the request type to POST.
  • Enter the Azure Resource Manager Token URL: https://login.microsoftonline.com/<your-tenant-id>/oauth2/token
  • Set the request body to x-www-form-urlencoded with the following parameters:
  • grant_type: client_credentials
  • client_id: <your-client-id>
  • client_secret: <your-client-secret>
  • resource: https://management.core.windows.net/
  • Click “Send” to get the Azure Resource Manager token.
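If you would rather script these two token requests than click through Postman, here is a minimal PowerShell sketch of the same calls against the v1 OAuth endpoint shown above. The variable names and placeholder values are illustrative.

$tenantId = "<your-tenant-id>"
$clientId = "<your-client-id>"
$clientSecret = "<your-client-secret>"
$tokenUrl = "https://login.microsoftonline.com/$tenantId/oauth2/token"

# Step 1 equivalent: Databricks-scoped token (resource = Azure Databricks programmatic ID)
$dbxToken = (Invoke-RestMethod -Method POST -Uri $tokenUrl -ContentType "application/x-www-form-urlencoded" -Body @{
    grant_type    = "client_credentials"
    client_id     = $clientId
    client_secret = $clientSecret
    resource      = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"
}).access_token

# Step 2 equivalent: Azure Resource Manager token
$armToken = (Invoke-RestMethod -Method POST -Uri $tokenUrl -ContentType "application/x-www-form-urlencoded" -Body @{
    grant_type    = "client_credentials"
    client_id     = $clientId
    client_secret = $clientSecret
    resource      = "https://management.core.windows.net/"
}).access_token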

Step 3: Get PAT Token

Now:

  • Create another request.
  • Set the request type to POST.
  • Enter the token-create URL: https://<your-region>.azuredatabricks.net/api/2.0/token/create
  • Set the Authorization header using the Databricks access token obtained in Step 1.
  • Set X-Databricks-Azure-SP-Management-Token to the management token obtained in Step 2.
  • Set X-Databricks-Azure-Workspace-Resource-Id from your Azure Databricks properties window.
  • Click "Send" to get the Databricks PAT token.

If you reuse old (expired) tokens, you may receive a 401 Unauthorized response.

Step 4: List Clusters

We now have the PAT token, the Databricks-scoped bearer token, and the management token. To view clusters you can use either the Databricks-scoped token in the Authorization header together with "X-Databricks-Azure-SP-Management-Token" and "X-Databricks-Azure-Workspace-Resource-Id", or just the PAT token. Let me show you both.

First, let's try with the management and scoped tokens.

  • Create another request.
  • Set the request type to GET.
  • Enter the List Clusters URL: https://<your-region>.azuredatabricks.net/api/2.0/clusters/list
  • Set the Authorization header using the Databricks access token obtained in Step 1.
  • Set X-Databricks-Azure-SP-Management-Token to the management token obtained in Step 2.
  • Set X-Databricks-Azure-Workspace-Resource-Id from your Azure Databricks properties window.
  • Click “Send” to get the list of clusters.

I used a pm.visualizer script, which is why the response shows in graphical format; if you click the Pretty tab you can see it as JSON.

Now let's try the same using the generated PAT token, as in the sketch below.
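With a PAT, the single Authorization header is enough. Here is a minimal PowerShell sketch; it assumes the $DB_PAT variable generated in the earlier PAT script, and the region placeholder is yours to fill in.

# Listing clusters with only the generated PAT token - no management headers needed
$patHeaders = @{ "Authorization" = "Bearer $DB_PAT" }
$clusters = Invoke-RestMethod -Method GET -Uri "https://<your-region>.azuredatabricks.net/api/2.0/clusters/list" -Headers $patHeaders
$clusters.clusters | Format-Table cluster_id, cluster_name, state -AutoSize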

Step 5: Specific Cluster Information

Let's go further: if you want to view details for a specific cluster, modify your URL and add cluster_id to the GET request.

All other details remain the same.

  • Create another request.
  • Set the request type to GET.
  • Enter the Get Cluster URL:

https://<your-region>.azuredatabricks.net/api/2.0/clusters/get?cluster_id=XXXXX

Based on the response from Step 4, identify the cluster_id of the cluster you want metrics for. A PowerShell equivalent of this call is sketched below.
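For completeness, here is the same lookup from PowerShell, reusing the $HEADERS, $region, and $selectedCluster variables from the earlier script:

# Fetch one cluster's details, passing cluster_id as a query parameter
$DB_URI_GET = "https://{0}.azuredatabricks.net/api/2.0/clusters/get?cluster_id={1}" -f $region, $selectedCluster.cluster_id
Invoke-RestMethod -Method GET -Uri $DB_URI_GET -Headers $HEADERS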

Step 6: Get Cluster Metrics

Now if you want cluster-specific metrics you can use a filter; say, for this demo, I want to know the start and end time of my cluster.

To get cluster-specific metrics, you can use the request below.

  • Create another request.
  • Set the request type to GET.
  • Enter the filter condition URL:

https://<your-region>.azuredatabricks.net/api/2.0/clusters/list?filtercondition

Step7: Cluster specific events

If you want to view cluster-specific events, such as failures or inactivity, you can use the events endpoint.

  • Enter the events URL:

https://<your-region>.azuredatabricks.net/api/2.0/clusters/events?cluster_id=XXX

  • Set the request type to POST and supply the parameters in the request body, as in the sketch below.
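The events endpoint accepts a JSON body describing which events you want. Here is a hedged PowerShell sketch; it reuses $HEADERS and $region from earlier, and the event types and limit shown are just examples.

# Pull the most recent events for one cluster (newest first)
$eventsBody = @{
    "cluster_id"  = "<your-cluster-id>"
    "order"       = "DESC"                       # newest events first
    "event_types" = @("RUNNING", "TERMINATING")  # optional filter on event types
    "limit"       = 25
} | ConvertTo-Json
$DB_URI_EVENTS = "https://{0}.azuredatabricks.net/api/2.0/clusters/events" -f $region
Invoke-RestMethod -Method POST -Uri $DB_URI_EVENTS -Headers $HEADERS -Body $eventsBody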

Step8: Starting any workflows:

If you know the cluster_id and the specific job_id that needs to be triggered, you can start backend workflows via the REST API as well.

  • Create another request.
  • Set the request type to POST.
  • Enter the run-now URL:

https://<your-region>.azuredatabricks.net/api/2.0/jobs/run-now

Add job_id and cluster_id to your request body; a PowerShell equivalent follows the example body below.

{
  "job_id": "xxxxxxxxxxxxx",
  "existing_cluster_id": "0xxx-xxxxxxx-xxxxxxxx"
}
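The same trigger from PowerShell might look like this (a sketch reusing $HEADERS and $region from earlier; the job_id value is a placeholder):

# Trigger a job run; run-now returns the run_id of the new run
$runBody = @{ "job_id" = 123456789 } | ConvertTo-Json   # replace with your numeric job_id
$DB_URI_RUN = "https://{0}.azuredatabricks.net/api/2.0/jobs/run-now" -f $region
$run = Invoke-RestMethod -Method POST -Uri $DB_URI_RUN -Headers $HEADERS -Body $runBody
Write-Output "Triggered run_id: $($run.run_id)"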

You can find different type of REST API responses at: https://docs.databricks.com/api/workspace/introduction

Using Azure Monitor

The next option is Azure Monitor, an inbuilt capability from Microsoft. Once you enable monitoring, Azure Databricks can send its monitoring data to different logging services. Here is a quick snippet of the monitoring screen, followed by a CLI sketch for enabling diagnostic settings.
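If you prefer to enable diagnostic settings from the command line instead of the portal, here is a hedged sketch using the Azure CLI. The setting name and the two log categories shown are examples; check the portal for the full category list available on your workspace.

# A sketch: route Databricks diagnostic logs to a Log Analytics workspace
$dbxId = az resource show --resource-type Microsoft.Databricks/workspaces --resource-group $rgName --name $workspaceName --query 'id' --output tsv
$lawId = az resource show --resource-type Microsoft.OperationalInsights/workspaces --resource-group $rgName --name "<your-log-analytics-workspace>" --query 'id' --output tsv
az monitor diagnostic-settings create `
    --name "databricks-to-law" `
    --resource $dbxId `
    --workspace $lawId `
    --logs '[{"category": "clusters", "enabled": true}, {"category": "jobs", "enabled": true}]'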

Using Log Analytics workspace

Last but not least, you can check Databricks metrics in a Log Analytics workspace. Databricks specifically offers metrics across four data types:

  • Operation
  • SparkListenerEvent_CL
  • SparkMetric_CL
  • SparkLoggingEvent_CL

Step1: Let's look into all the operations

If you want a consolidated view of how much data has been ingested into the logs, you can run the Kusto query below.

let TotalIngestion = toscalar(
    Usage
    | where TimeGenerated > ago(7d)
    | summarize IngestionVolume = sum(Quantity));
Usage
| where TimeGenerated > ago(7d)
| where DataType in ('Operation', 'SparkListenerEvent_CL', 'SparkMetric_CL', 'SparkLoggingEvent_CL')
| project TimeGenerated, DataType, Solution, Quantity, IsBillable, PercentOfTotal = round(100.0 * Quantity / TotalIngestion, 2)

This is how the output will be

And if you want a more granular view, run the entire query, including the TotalIngestion calculation, to see each record's share of total ingestion.

Step2: Operations

This gives you information about the specific operational events happening inside the Databricks workspace.

Step3: SparkListenerEvent_CL

Spark provides a mechanism for users to implement custom listeners by extending the org.apache.spark.scheduler.SparkListener interface. Users can then register these listeners with a Spark application to receive events related to job, stage, task, and other lifecycle events.

Step4: SparkMetric_CL

This data type deals with custom logging for Spark clusters; you can use it to monitor and gauge performance and resource utilization.

Step5: SparkLoggingEvent_CL

These are used for event logging; you can also embed a log4j appender. Here you can capture events based on level: ERROR, INFO, WARNING, etc.
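As a quick sketch of querying this table from the command line, the Azure CLI (with the log-analytics extension installed) can run KQL directly. The Level_s column name is my assumption, based on how custom-log string fields are typically suffixed; verify the exact schema in your workspace.

# Pull the last day's ERROR-level Spark log events from Log Analytics
$lawCustomerId = "<your-log-analytics-workspace-guid>"   # the workspace (customer) ID, not the resource ID
$kql = "SparkLoggingEvent_CL | where TimeGenerated > ago(1d) | where Level_s == 'ERROR' | take 20"
az monitor log-analytics query --workspace $lawCustomerId --analytics-query $kql --output table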

If you want to capture anything extra, you need to make changes in your spark-monitoring.sh file. You can find more details at: https://github.com/mspnp/spark-monitoring/blob/main/src/spark-listeners/scripts/spark-monitoring.sh

Step6: Additional LogManagement options

Apart from the data types mentioned above, you can collect telemetry at different levels in Azure Log Analytics. Here is a quick snippet of the available functions.

Step7: User Auditing

If you want to audit who logged in at a certain time and get extended details, you can use the query below.

DatabricksAccounts
| where Identity contains "firstname.lastname@domain.com"
| where OperationName in ("Microsoft.Databricks/accounts/aadBrowserLogin", "Microsoft.Databricks/accounts/tokenLogin")
| where ServiceName == "accounts"

Points to remember:

  • Make sure to replace placeholders like <your-tenant-id>, <your-client-id>, <your-client-secret>, <your-subscription-id>, <your-resource-group>, <your-workspace-name>, <your-region>, and <your-cluster-id> with your actual values.

  • When you generate any bearer token, make sure it is in the right format and that no stray spaces are left behind when pasting it into Postman or your REST API headers.
  • Some important reference links:

https://docs.databricks.com/api/workspace/introduction

https://github.com/mspnp/spark-monitoring

https://docs.databricks.com/en/workspace/workspace-details.html#:~:text=Workspace%20instance%20names%2C%20URLs%2C%20and%20IDs,-An%20instance%20name&text=Some%20types%20of%20workspaces%20have,the%20workspace%20ID%20is%206280049833385130%20.
