Azure Databricks — Cost Management In-depth Analysis

Prashanth Kumar
17 min read · Jan 21, 2024


Overview

As I write this in January 2024, most organizations have begun their cost-savings planning for the year. As part of this initiative, I started reviewing my Databricks clusters and exploring opportunities to save costs, and went looking for references on how to track Databricks usage and identify where money can be saved.

I came across an article at https://www.linkedin.com/pulse/how-analyze-azure-databricks-costs-chris-payne/, which provided valuable insights into checking the costs of Databricks usage. Thanks to the author for showing how to investigate costs using the REST API or PowerShell, working outside of the actual Databricks environment.

How to start cost investigation?

Databricks stands out as one of the most powerful tools for data processing, ETL/ELT (Extract, Transform, Load), and number crunching. It is a cloud-based big data analytics platform built on top of Apache Spark. Offering an integrated environment for data processing, machine learning, and collaborative data science, its key components include the Databricks Workspace (a collaborative environment), the Databricks Runtime (optimized Spark clusters), and Databricks Notebooks (interactive, collaborative notebooks).

While Databricks provides exceptional features, it’s essential to be mindful of associated costs. As with any powerful tool, these capabilities come at a price, and understanding and managing these costs is crucial.


As we aim to be the best in the business, our journey often begins without any pre-existing usage data, which leads to cost implications. Some common causes include:

  1. Overprovisioning of clusters/backend compute/pools, etc.: Sometimes we allocate more resources (CPU, memory) to clusters than needed. For instance, when starting a new project without historical data, or when transitioning from an on-premises solution running on virtual machines, making accurate sizing comparisons is challenging.
  2. Underutilization of resources: In many cases, we heed business demands and run large clusters for small workloads, resulting in wasted resources.
  3. Storage: Storing unnecessary or redundant data, such as temporary logs, contributes to increased costs.
  4. Inefficient workloads: Poorly optimized Spark jobs or queries feed directly into point #1, leading to higher Databricks Unit (DBU) consumption. For us it has been a learning curve, but unfortunately we are already paying the price. :(

How to start identifying Databricks-incurred costs?

To begin identifying Databricks incurred costs, it’s essential to understand the various components of the pricing model and closely monitor your usage. Some common methods include:

  1. Azure Cost Management: The Azure Cost Management blade provides a wide variety of filtered conditions based on tags, service name, meter, meter category, meter subcategory, etc. (See screenshot below.)
  2. Databricks SKU/Tier: Databricks instances are offered in Standard and Premium tiers, each with different cost models and features. More information can be found here.
  3. Assigning Additional Tags: Databricks offers the capability to assign additional tags internally to workloads, compute, pools, etc., allowing cost data to be viewed by customized tags. (See the screenshots below; a minimal tagging sketch follows them.)
Tags based on compute level inside Databricks
Tags based on Jobs level inside Databricks
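For completeness, here is a minimal sketch of assigning custom tags at cluster-creation time through the Clusters API; the workspace URL, token, and all tag values below are placeholders, so adapt them to your environment. Tags applied this way propagate to the underlying Azure resources and become filterable in Cost Management.

# A minimal sketch; workspace URL, PAT, and values are placeholders.
$headers = @{ Authorization = "Bearer <databricks-pat>" }
$body = @{
    cluster_name  = "analytics-cluster"
    spark_version = "13.3.x-scala2.12"
    node_type_id  = "Standard_DS3_v2"
    num_workers   = 2
    custom_tags   = @{
        # These tags flow through to the underlying Azure resources and to Cost Management.
        CostCenter = "FIN-1234"
        Project    = "year-end-review"
    }
} | ConvertTo-Json -Depth 5

Invoke-RestMethod -Uri "https://<workspace>.azuredatabricks.net/api/2.0/clusters/create" `
    -Method Post -Headers $headers -ContentType 'application/json' -Body $body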

Limitations:

One significant limitation of the Azure Cost Analysis blade, particularly concerning meters and meter categories, is that it does not provide individual costs for components like Jobs or Compute, which can contribute significantly to the overall expense. Instead, it presents a consolidated cost without breaking down the specific costs incurred by each element.

Given these limitations, we can't fetch more detail from the blade alone, so let's deep-dive.

Deep Dive to Analyze and Reduce Costs

Now, let’s delve into the internals of Azure Databricks and identify areas where we are incurring significant costs. We’ll explore how to capture these costs and drill down to the individual compute and job ID levels.

I will be focusing on two methods:

  1. REST API
  2. PowerShell

Let's start

Databricks provides a wide range of REST API capabilities to collect internal information, including details on Jobs, Job runs, and Cluster information.

Our initial focus will be on Notebooks and their corresponding jobs.

REST API Way →

The first step in drilling down into job-specific costs is to retrieve the list of jobs.

Retrieving the jobs list:

  1. Get-Jobs Function: The first step is to retrieve the list of jobs from our Databricks instance/workspace, using the /api/2.0/jobs/list endpoint to obtain job details. A minimal request is sketched below.
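Before scripting anything, a quick sanity check of the endpoint can look like this (a minimal sketch with a placeholder workspace URL and token; the same request works from Postman):

# Placeholder workspace URL and PAT; replace with your own.
$headers = @{ Authorization = "Bearer <databricks-pat>" }
Invoke-RestMethod -Uri "https://<workspace>.azuredatabricks.net/api/2.0/jobs/list" `
    -Method Get -Headers $headers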

The REST API distinguishes jobs by their format, “MULTI_TASK” versus “SINGLE_TASK.” It also provides additional details such as “job_id” and tags. Here is the schema output.

{
  "jobs": [
    {
      "job_id": 12345678901234,
      "creator_user_name": "firstname.lastname@domain.com",
      "settings": {
        "name": "Name of your Job",
        "email_notifications": {
          "no_alert_for_skipped_runs": false
        },
        "timeout_seconds": 0,
        "schedule": {
          "quartz_cron_expression": "10 30 2 * * ?",
          "timezone_id": "Australia/Brisbane",
          "pause_status": "PAUSED"
        },
        "max_concurrent_runs": 1,
        "format": "MULTI_TASK"
      },
      "created_time": 1705644489523
    }
  ]
}

From this generated schema, our primary focus is on obtaining “job_id,” “Name,” and “format.”

For those who are new to Databricks, note that “created_time” is in Unix (epoch) format, in milliseconds. If you prefer the standard date/time format, you can use a tool like Epoch Converter.

As an example, 1705644489523 converts to 2024-01-19 06:08:09 UTC.

Retrieving the job runs list:

The next step involves retrieving the list of Databricks job runs within a specified date range. To accomplish this, we will utilize the /api/2.1/jobs/runs/list endpoint, which allows us to obtain detailed information about job runs. This endpoint provides options to filter runs based on start and end dates.
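As an illustration, a filtered request over a two-week window might look like the sketch below (placeholder workspace URL; $headers as defined earlier; start_time_from and start_time_to are epoch milliseconds). The truncated response that follows shows the fields returned for each run.

$from = 1704067200000   # 2024-01-01 00:00:00 UTC
$to   = 1705276800000   # 2024-01-15 00:00:00 UTC
$uri  = "https://<workspace>.azuredatabricks.net/api/2.1/jobs/runs/list" +
        "?completed_only=true&expand_tasks=true&limit=25&start_time_from=$from&start_time_to=$to"
$page = Invoke-RestMethod -Uri $uri -Method Get -Headers $headers
$page.runs | Select-Object job_id, run_id, run_name, state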

{
  "runs": [
    {
      "job_id": 12345678901234,
      "run_id": 386301660124979,
      "creator_user_name": "firstname.lastname@domain.com",
      "number_in_job": 386301660124979,
      "original_attempt_run_id": 386301660124979,
      "state": {
        "life_cycle_state": "RUNNING",
        "state_message": "",
        "user_cancelled_or_timedout": false
      },
      "schedule": {
        "quartz_cron_expression": "21 30 0/6 * * ?",
        "timezone_id": "Australia/Brisbane",
        "pause_status": "UNPAUSED"
      },
      "start_time": 1705717821256,
      "setup_duration": 0,
      "execution_duration": 0,
      "cleanup_duration": 0,
      "end_time": 0,
      "run_duration": 116440,
      "trigger": "PERIODIC",
      "run_name": "Name of your Job",
      "run_page_url": "https://adb-databricks.azuredatabricks.net/?o=workspaceid#job/xxxxxxxxxxx/run/386301660124979",
      "run_type": "JOB_RUN",
      "format": "MULTI_TASK"
    }
  ]
}

By comparing information from /api/2.0/jobs/list and /api/2.1/jobs/runs/list, you can gather details such as “job_id” and “run_name.” Searching /api/2.1/jobs/runs/list with a specific “job_id” also shows how many times the job ran during the specified period.

For example, from the screenshot below, you can see the job ran 28 times.
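To get the same count without the UI, you can filter the runs list by job_id (a sketch; the job_id is hypothetical, and each page returns at most `limit` runs, so page with offset/has_more as in the full script below for complete counts):

$jobId = 12345678901234
$uri = "https://<workspace>.azuredatabricks.net/api/2.1/jobs/runs/list?job_id=$jobId&completed_only=true&limit=25"
$page = Invoke-RestMethod -Uri $uri -Method Get -Headers $headers
"Job $jobId: $($page.runs.Count) run(s) on this page"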

Retrieving cost consumption details

Now let's fetch cost-consumption details using the REST API call below. From the response, we need to collect information based on ClusterId and JobId, and then filter costs by criteria for storage accounts, virtual machines, and Databricks DBUs.

For reference, here are three sample JSON responses. In them you can see “ClusterId”, “JobId”, and “meterId”. Likewise, you can see cost information broken down by consumed service, specifically:

microsoft.databricks

microsoft.compute

microsoft.storage
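Once a usage-details response is in hand, a quick per-service rollup of Databricks-job rows might look like this (a sketch assuming $usage holds the parsed usageDetails response; retrieval is shown in the PowerShell section below):

# Total cost per consumed service, for rows carrying a Databricks JobId tag.
$usage.value |
    Where-Object { $_.tags.JobId } |
    Group-Object { $_.properties.consumedService } |
    ForEach-Object {
        [PSCustomObject]@{
            Service   = $_.Name
            TotalCost = ($_.Group.properties | Measure-Object -Property cost -Sum).Sum
        }
    }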

Example 1:

{
  "kind": "legacy",
  "id": "/subscriptions/subscriptionId/providers/Microsoft.Billing/billingPeriods/20240101/providers/Microsoft.Consumption/usageDetails/name",
  "name": "00000-a53b-4c16-272c-0000000",
  "type": "Microsoft.Consumption/usageDetails",
  "tags": {
    "DataClassification": "Restricted",
    "ApplicationOwner": "firstname.lastname@domain.com",
    "SoldToCode": "1234567",
    "LineOfBusiness": "LOB",
    "ApplicationId": "ApplicationID",
    "MonitoringAlertContact": "firstname.lastname@domain.com",
    "SubscriptionSPN_Contributors_0": "",
    "PackageVersion_BusinessCI": "CI-Version",
    "SubscriptionSPN_Readers_0": "",
    "AADGroup_Contributors_0": "Contributors_0",
    "PackageVersion_CommonCI": "CommonCI-version",
    "DateOfLastSubscriptionUpdate": "2023-11-16 01:58:43",
    "WBSE": "",
    "AADGroup_Readers_0": "Readers",
    "VMWithJITException": "",
    "LeadContributor": "firstname.lastname@domain.com",
    "ServiceType": "NONBC",
    "RequestId": "requestid",
    "SpendLimit": "1000",
    "IOTOptIn": "false",
    "MBSOptIn": "true",
    "ClusterId": "0113-000000-xxxxxxxx",
    "DatabricksInstancePoolCreatorId": "1234512345123452",
    "DatabricksInstancePoolId": "0109-062012-lotus21-pool-0000000",
    "ClusterName": "job-9xxxxxxxxxxxxxxxx0-run-00000000000000",
    "JobId": "97065897648060",
    "RunName": "Run_Analytics_Notebook_Instance",
    "databricks-usage": "analytics",
    "Creator": "000000-000000-0000000-0000000",
    "Vendor": "Databricks",
    "environment": "development",
    "DatabricksInstanceGroupId": "-0000002206352200000",
    "databricks-pool-usage": "Pool_Name_prod"
  },
  "properties": {
    "billingAccountId": "00000000",
    "billingAccountName": "accountname",
    "billingPeriodStartDate": "2024-01-01T00:00:00.0000000Z",
    "billingPeriodEndDate": "2024-01-31T00:00:00.0000000Z",
    "billingProfileId": "00000000",
    "billingProfileName": "profilename",
    "accountOwnerId": "enterpriseaccountowner@domain.onmicrosoft.com",
    "accountName": "enterpriseaccountname",
    "subscriptionId": "subscriptionId",
    "subscriptionName": "subscriptionname",
    "date": "2024-01-13T00:00:00.0000000Z",
    "product": "Azure Databricks - Premium Jobs Compute Photon - DBU",
    "partNumber": "AAM-12345",
    "meterId": "1f776206-2ae1-54b8-98f7-fa1bcccf96a0",
    "quantity": 0.574796354166667,
    "effectivePrice": 0.2036,
    "cost": 0.117028537708333,
    "unitPrice": 0.2036,
    "billingCurrency": "EUR",
    "resourceLocation": "AustraliaEast",
    "consumedService": "Microsoft.Databricks",
    "resourceId": "/subscriptions/subscriptionid/resourceGroups/resourcegroupname/providers/Microsoft.Databricks/workspaces/workspacename",
    "resourceName": "workspace-resourcename",
    "invoiceSection": "IT Services & Operations",
    "resourceGroup": "resourcegroup-name",
    "offerId": "MS-AZR-00000",
    "isAzureCreditEligible": true,
    "publisherName": "Microsoft",
    "publisherType": "Azure",
    "planName": "Premium",
    "chargeType": "Usage",
    "frequency": "UsageBased",
    "payGPrice": 0.2808,
    "pricingModel": "OnDemand",
    "meterDetails": null
  }
}

Example 2:

{
  "kind": "legacy",
  "id": "/subscriptions/subscriptionId/providers/Microsoft.Billing/billingPeriods/20240101/providers/Microsoft.Consumption/usageDetails/name",
  "name": "00000-a53b-4c16-272c-0000000",
  "type": "Microsoft.Consumption/usageDetails",
  "tags": {
    "environment": "development",
    "databricks-usage": "analytics",
    "DataClassification": "Restricted",
    "ApplicationOwner": "firstname.lastname@domain.com",
    "SoldToCode": "1234567",
    "LineOfBusiness": "LOB",
    "ApplicationId": "ApplicationID",
    "MonitoringAlertContact": "firstname.lastname@domain.com",
    "SubscriptionSPN_Contributors_0": "",
    "PackageVersion_BusinessCI": "CI-Version",
    "SubscriptionSPN_Readers_0": "",
    "AADGroup_Contributors_0": "Contributors_0",
    "PackageVersion_CommonCI": "CommonCI-version",
    "DateOfLastSubscriptionUpdate": "2023-11-16 01:58:43",
    "WBSE": "",
    "AADGroup_Readers_0": "Readers",
    "VMWithJITException": "",
    "LeadContributor": "firstname.lastname@domain.com",
    "ServiceType": "NONBC",
    "RequestId": "requestid",
    "SpendLimit": "1000",
    "IOTOptIn": "false",
    "MBSOptIn": "true",
    "DatabricksEnvironment": "workerenv-658137803218454",
    "DatabricksInstanceGroupId": "-6460800000000002324",
    "DatabricksInstancePoolCreatorId": "1234512345123452",
    "DatabricksInstancePoolId": "0109-062012-lotus21-pool-0000000",
    "Vendor": "Databricks",
    "databricks-instance-name": "700b6c6000d000b58e00000c5d1fb009",
    "databricks-pool-usage": "Pool_Name_prod",
    "management_service": "instance_manager_service"
  },
  "properties": {
    "billingAccountId": "00000000",
    "billingAccountName": "accountname",
    "billingPeriodStartDate": "2024-01-01T00:00:00.0000000Z",
    "billingPeriodEndDate": "2024-01-31T00:00:00.0000000Z",
    "billingProfileId": "00000000",
    "billingProfileName": "profilename",
    "accountOwnerId": "enterpriseaccountowner@domain.onmicrosoft.com",
    "accountName": "enterpriseaccountname",
    "subscriptionId": "subscriptionId",
    "subscriptionName": "subscriptionname",
    "date": "2024-01-13T00:00:00.0000000Z",
    "product": "Rtn Preference: MGN - Standard Data Transfer Out - Zone 2",
    "partNumber": "Q5H-00002",
    "meterId": "fe167397-a38d-43c3-9bb3-8e2907e56a41",
    "quantity": 0.000000273808836936951,
    "effectivePrice": 0.081,
    "cost": 0.00000002,
    "unitPrice": 0.081,
    "billingCurrency": "EUR",
    "resourceLocation": "AustraliaEast",
    "consumedService": "microsoft.compute",
    "resourceId": "/subscriptions/subscriptionid/resourceGroups/resourcegroupname/providers/Microsoft.Databricks/workspaces/workspacename",
    "resourceName": "workspace-resourcename",
    "invoiceSection": "IT Services & Operations",
    "resourceGroup": "resourcegroup-name",
    "offerId": "MS-AZR-00000",
    "isAzureCreditEligible": true,
    "publisherName": "Microsoft",
    "publisherType": "Azure",
    "planName": "Standard",
    "chargeType": "Usage",
    "frequency": "UsageBased",
    "payGPrice": 0.112,
    "pricingModel": "OnDemand",
    "meterDetails": null
  }
}

Example 3:

{
  "kind": "legacy",
  "id": "/subscriptions/subscriptionId/providers/Microsoft.Billing/billingPeriods/20240101/providers/Microsoft.Consumption/usageDetails/name",
  "name": "00000-a53b-4c16-272c-0000000",
  "type": "Microsoft.Consumption/usageDetails",
  "tags": {
    "DataClassification": "Restricted",
    "ApplicationOwner": "firstname.lastname@domain.com",
    "SoldToCode": "1234567",
    "LineOfBusiness": "LOB",
    "ApplicationId": "ApplicationID",
    "MonitoringAlertContact": "firstname.lastname@domain.com",
    "SubscriptionSPN_Contributors_0": "",
    "PackageVersion_BusinessCI": "CI-Version",
    "SubscriptionSPN_Readers_0": "",
    "AADGroup_Contributors_0": "Contributors_0",
    "PackageVersion_CommonCI": "CommonCI-version",
    "DateOfLastSubscriptionUpdate": "2023-11-16 01:58:43",
    "WBSE": "",
    "AADGroup_Readers_0": "Readers",
    "VMWithJITException": "",
    "LeadContributor": "firstname.lastname@domain.com",
    "ServiceType": "NONBC",
    "RequestId": "requestid",
    "SpendLimit": "1000",
    "MBSOptIn": "true",
    "IOTOptIn": "false"
  },
  "properties": {
    "billingAccountId": "00000000",
    "billingAccountName": "accountname",
    "billingPeriodStartDate": "2024-01-01T00:00:00.0000000Z",
    "billingPeriodEndDate": "2024-01-31T00:00:00.0000000Z",
    "billingProfileId": "00000000",
    "billingProfileName": "profilename",
    "accountOwnerId": "enterpriseaccountowner@domain.onmicrosoft.com",
    "accountName": "enterpriseaccountname",
    "subscriptionId": "subscriptionId",
    "subscriptionName": "subscriptionname",
    "date": "2024-01-13T00:00:00.0000000Z",
    "product": "Rtn Preference: MGN - Standard Data Transfer Out - Zone 2",
    "partNumber": "Q5H-12345",
    "meterId": "fe167397-a38d-43c3-9bb3-8e2907e56a41",
    "quantity": 0.000054,
    "effectivePrice": 0.081,
    "cost": 0.00000432,
    "unitPrice": 0.081,
    "billingCurrency": "EUR",
    "resourceLocation": "AustraliaEast",
    "consumedService": "Microsoft.Storage",
    "resourceId": "/subscriptions/subscriptionid/resourceGroups/resourcegroupname/providers/Microsoft.Storage/storageAccounts/azasstaexnseq01629fid",
    "resourceName": "storageaccountname",
    "invoiceSection": "IT Services & Operations",
    "resourceGroup": "resourcegroup-name",
    "offerId": "MS-AZR-00000",
    "isAzureCreditEligible": true,
    "publisherName": "Microsoft",
    "publisherType": "Azure",
    "planName": "Standard",
    "chargeType": "Usage",
    "frequency": "UsageBased",
    "payGPrice": 0.112,
    "pricingModel": "OnDemand",
    "meterDetails": null
  }
}

However, the limitation with Postman is that you can't retrieve data in a grouped format. Instead, you have to download the response as JSON and then convert it (for example to .xlsx or CSV) for calculations.
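As a lightweight alternative to a custom converter, you could flatten the saved JSON to CSV, which Excel opens directly (a sketch; file names are illustrative):

$usage = Get-Content .\usageDetails.json -Raw | ConvertFrom-Json
$usage.value | ForEach-Object {
    [PSCustomObject]@{
        Date            = $_.properties.date
        ConsumedService = $_.properties.consumedService
        Product         = $_.properties.product
        Cost            = $_.properties.cost
        ClusterId       = $_.tags.ClusterId
        JobId           = $_.tags.JobId
    }
} | Export-Csv .\usageDetails.csv -NoTypeInformation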

PowerShell way →

Now let's retrieve Databricks usage data based on jobs, their associated cluster IDs, and non-cluster (serverless) workloads. Let's start by pulling the jobs list with a PowerShell script.

# Variable declaration section
$baseUrl = "https://databricksinstance.azuredatabricks.net"
$pat = "databrickspattoken"


# Part 1: Listing jobs.
function Get-Jobs {
    param (
        [string]$baseUrl,
        [string]$pat
    )

    $headers = @{
        'Authorization' = "Bearer $pat"
    }

    $response = Invoke-RestMethod -Uri "$baseUrl/api/2.0/jobs/list" -Method Get -Headers $headers

    # Collect the retrieved jobs in an array.
    $jobs = @()

    foreach ($job in $response.jobs) {
        $jobName = $job.settings.name
        $jobID = $job.job_id
        $clusterspec = $null

        # Branch on the "MULTI_TASK" vs "SINGLE_TASK" job format. Multi-task jobs keep
        # their cluster spec under job_clusters (fetched via jobs/get); single-task jobs
        # carry new_cluster directly in their settings. Add an extra condition here if
        # you want to treat "SINGLE_TASK" activities differently.

        if ($job.settings.format -eq "MULTI_TASK") {
            $mresponse = Invoke-RestMethod -Uri "$baseUrl/api/2.0/jobs/get?job_id=$jobID" -Method Get -Headers $headers

            # Check whether job_clusters and new_cluster properties exist
            if ($mresponse.settings.job_clusters -and $mresponse.settings.job_clusters.Length -gt 0 -and $mresponse.settings.job_clusters[0].new_cluster) {
                $clusterspec = $mresponse.settings.job_clusters[0].new_cluster.node_type_id
            }
        } else {
            $clusterspec = $job.settings.new_cluster.node_type_id
        }

        # Save JobName and cluster specification as one object.
        $j = @{
            'JobName' = $jobName
            'ClusterSpec' = $clusterspec
        }

        $jobs += New-Object PSObject -Property $j
    }

    return $jobs
}

# Get Databricks jobs
$databricksJobs = Get-Jobs -baseUrl $baseUrl -pat $pat

Here, you can see that I am retrieving data about the jobs along with their cluster specifications. Additionally, some jobs don’t have any cluster details because they are serverless.

The next step involves retrieving job runs, filtered based on completion status. Relevant details such as cluster IDs and run times are then extracted.

# Part 2: Getting Databricks job runs.
function Get-JobRuns {
    param (
        [string]$baseUrl,
        [string]$pat,
        [datetime]$startDate = (Get-Date).AddDays(-1),
        [datetime]$endDate = (Get-Date)
    )

    Write-Host "Retrieving $startDate to $endDate..."

    # The runs/list endpoint expects epoch milliseconds.
    $uStartDate = [long]($startDate.ToUniversalTime() - (Get-Date "1970-01-01").ToUniversalTime()).TotalMilliseconds
    $uEndDate = [long]($endDate.ToUniversalTime() - (Get-Date "1970-01-01").ToUniversalTime()).TotalMilliseconds

    $runs = @()
    $offset = 0

    $headers = @{
        'Authorization' = "Bearer $pat"
    }

    while ($offset -ge 0) {
        $jobrunsextract = "$baseUrl/api/2.1/jobs/runs/list?completed_only=true&limit=25&offset=$offset&expand_tasks=true&start_time_from=$uStartDate&start_time_to=$uEndDate"
        $response = Invoke-RestMethod -Uri $jobrunsextract -Method Get -Headers $headers

        foreach ($r in $response.runs) {
            $clusterID = $null
            $clusterSpec = $null

            if ($r.tasks[0].cluster_instance) {
                $clusterID = $r.tasks[0].cluster_instance.cluster_id
                $clusterLink = "$baseUrl/api/2.0/clusters/get?cluster_id=$clusterID"
                $cresponse = Invoke-RestMethod -Uri $clusterLink -Method Get -Headers $headers
                $clusterSpec = $cresponse.node_type_id
            } elseif ($r.tasks[0].new_cluster) {
                $clusterSpec = $r.tasks[0].new_cluster.node_type_id
            }

            $runTime = [TimeSpan]::FromMilliseconds($r.end_time - $r.start_time).TotalSeconds

            $o = @{
                'JobName' = $r.run_name
                'RunID' = $r.run_id
                'ClusterID' = $clusterID
                'RunTime' = $runTime
                # start_time is epoch milliseconds, so add it to the Unix epoch
                # (adding it to the current time would shift every date into the future).
                'StartTime' = (Get-Date "1970-01-01").AddMilliseconds($r.start_time)
                'ClusterSpec' = $clusterSpec
            }

            $runs += New-Object PSObject -Property $o
        }

        if ($response.has_more -eq $true) {
            $offset += 25
            if ($offset % 250 -eq 0) {
                Write-Host "Retrieved $offset runs..."
            }
        } else {
            $offset = -1
        }
    }

    Write-Host "Retrieved $($runs.Count) runs..."
    return $runs
}


# Get Databricks job runs ($startDate and $endDate are defined in Part 4 below)
$databricksRuns = Get-JobRuns -baseUrl $baseUrl -pat $pat -startDate $startDate -endDate $endDate

The next step involves extracting cost information from https://management.azure.com/subscriptions/subscriptionID/providers/Microsoft.Consumption/usageDetails. To do this, we need to obtain a bearer token for authentication. We collect the entire dataset and save it as an array with the required details: “UsageDate, Service, Cost, Product, InstanceName, ClusterSpec, JobName, ResourceType, ResourceID, ClusterID, ResourceLocation.” A ParsedJson helper class is included for turning JSON-style hashtables into PowerShell objects (Invoke-RestMethod already returns parsed objects, so it is optional).

# Part 3: Getting Databricks compute costs.
class ParsedJson {
    [hashtable]$__dict__

    ParsedJson ([hashtable]$kwargs) {
        $this.__dict__ = $kwargs
    }

    [string] ToString() {
        $keys = $this.__dict__.Keys | Sort-Object
        $items = $keys | ForEach-Object { "$_=$($this.__dict__[$_])" }
        return "$($this.GetType().Name)($($items -join ', '))"
    }

    [bool] Equals([object]$other) {
        return $this.__dict__ -eq $other.__dict__
    }
}

function Get-AzureCostData {
    param (
        [datetime]$startDate = (Get-Date).AddDays(-1),
        [datetime]$endDate = (Get-Date)
    )

    $baseURL = "https://management.azure.com/subscriptions/subscriptionID/providers/Microsoft.Consumption/usageDetails"

    $startDateS = $startDate.ToString('yyyy-MM-dd')
    $endDateS = $endDate.ToString('yyyy-MM-dd')

    Write-Host "Retrieving $startDateS through $endDateS..."

    # Acquire a bearer token for the ARM endpoint via the Azure CLI.
    $token = (az account get-access-token --resource https://management.core.windows.net/ | ConvertFrom-Json).accessToken
    $uri = "${baseURL}?api-version=2021-10-01&startDate=$startDateS&endDate=$endDateS&`$expand=properties/meterDetails,properties/additionalInfo"
    $results = @()

    $headers = @{
        'Authorization' = "Bearer $token"
    }

    do {
        try {
            $response = Invoke-RestMethod -Uri $uri -Headers $headers -Method Get
            $results += $response.value | ForEach-Object {
                [PSCustomObject]@{
                    UsageDate = [datetime]::ParseExact($_.properties.date.Substring(0, 10), 'yyyy-MM-dd', $null)
                    Service = $_.properties.consumedService
                    cost = $_.properties.cost
                    Product = $_.properties.product
                    InstanceName = $_.properties.instanceName
                    # additionalInfo may be returned as a JSON string; run it through
                    # ConvertFrom-Json first if these fields come back $null.
                    ClusterSpec = $_.properties.additionalInfo.resourceTags.ClusterSpec
                    JobName = $_.properties.additionalInfo.resourceTags.JobName
                    ResourceType = $_.properties.additionalInfo.resourceTags.ResourceType
                    ResourceID = $_.properties.instanceId
                    ClusterID = $_.properties.additionalInfo.resourceTags.ClusterID
                    ResourceLocation = $_.properties.instanceLocation
                }
            }
            # Follow the service's paging link until no more pages remain.
            $uri = $response.nextLink
        } catch {
            Write-Host "Error: $_"
            $uri = $null
        }
    } while ($uri)

    return $results
}

In the next and final part (Part 4), we combine the outputs of the three parts and perform join and group-by operations with the specified conditions.

# Part 4: Final output extraction
$startDate = (Get-Date).AddDays(-2)
$endDate = Get-Date


# Get Azure cost data
$allCostData = Get-AzureCostData -startDate $startDate -endDate $endDate

# Filter costs based on criteria
$storageCosts = $allCostData | Where-Object { $_.Service -eq "Microsoft.Storage" }
$vmCosts = $allCostData | Where-Object { $_.Service -eq "Microsoft.Compute" }
$dbuCosts = $allCostData | Where-Object { $_.Service -eq "Microsoft.Databricks" }

# Optional variants: restrict to rows that are tied to a cluster.
#$storageCosts = $allCostData | Where-Object { $_.Service -eq "Microsoft.Storage" -and $_.ClusterID -ne $null }
#$vmCosts = $allCostData | Where-Object { $_.Service -eq "Microsoft.Compute" -and $_.ClusterID -ne $null }
#$dbuCosts = $allCostData | Where-Object { $_.Service -eq "Microsoft.Databricks" -and $_.ClusterID -ne $null }


# Group costs by job name along with cluster details and costs
$jobCosts = $databricksRuns | ForEach-Object {
    # Capture the current run first: inside a nested Where-Object, $_ refers to the
    # inner pipeline item, so comparing $_.JobName to itself would match every row.
    $run = $_
    $job = $databricksJobs | Where-Object { $_.JobName -eq $run.JobName }
    [PSCustomObject]@{
        JobName = $run.JobName
        ClusterSpec = $run.ClusterSpec
        ClusterID = $run.ClusterID
        RunID = $run.RunID
        StartTime = $run.StartTime
        RunTime = $run.RunTime
        ClusterType = $job.ClusterSpec
        StorageCost = ($storageCosts | Where-Object { $_.JobName -eq $run.JobName }).cost | Measure-Object -Sum | Select-Object -ExpandProperty Sum
        VMCost = ($vmCosts | Where-Object { $_.JobName -eq $run.JobName }).cost | Measure-Object -Sum | Select-Object -ExpandProperty Sum
        DBUCost = ($dbuCosts | Where-Object { $_.JobName -eq $run.JobName }).cost | Measure-Object -Sum | Select-Object -ExpandProperty Sum
    }
} | Group-Object -Property JobName | ForEach-Object {
    $jobGroup = $_
    $totalStorageCost = $jobGroup.Group | Measure-Object -Property StorageCost -Sum | Select-Object -ExpandProperty Sum
    $totalVMCost = $jobGroup.Group | Measure-Object -Property VMCost -Sum | Select-Object -ExpandProperty Sum
    $totalDBUCost = $jobGroup.Group | Measure-Object -Property DBUCost -Sum | Select-Object -ExpandProperty Sum

    # Divide by 1000 to convert to the original unit
    $totalStorageCost /= 1000
    $totalVMCost /= 1000
    $totalDBUCost /= 1000

    $totalRunSeconds = ($jobGroup.Group | Measure-Object -Property RunTime -Sum).Sum
    $overallCost = $totalStorageCost + $totalVMCost + $totalDBUCost

    [PSCustomObject]@{
        JobName = $jobGroup.Name
        ClusterSpec = $jobGroup.Group[0].ClusterSpec
        ClusterID = $jobGroup.Group[0].ClusterID
        TotalRuns = $jobGroup.Count
        # RunTime is in seconds; report run times in minutes.
        TotalRunTime = $totalRunSeconds / 60
        AverageRunTime = ($jobGroup.Group | Measure-Object -Property RunTime -Average).Average / 60
        TotalStorageCost = $totalStorageCost
        TotalVMCost = $totalVMCost
        TotalDBUCost = $totalDBUCost
        OverallCost = $overallCost
        # Convert seconds to hours before dividing; a property of the object under
        # construction cannot be referenced, hence the $overallCost variable above.
        CostPerHour = if ($totalRunSeconds -gt 0) { $overallCost / ($totalRunSeconds / 3600) } else { $null }
    }
}

# Display the results
$jobCosts | Format-Table -AutoSize | Out-String -Width 4096

Here is the output.

Here, I have pulled details only for the jobs that have an associated cluster. However, you can add many other conditions in Part 4, as sketched below.
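For instance, to keep only runs that had a cluster and surface the five most expensive jobs (a sketch; the ordering and cutoff are just suggestions):

$jobCosts |
    Where-Object { $_.ClusterID -and $_.OverallCost -gt 0 } |
    Sort-Object OverallCost -Descending |
    Select-Object -First 5 |
    Format-Table JobName, TotalRuns, OverallCost, CostPerHour -AutoSize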

Here is the full PowerShell script

# Variable declaration section
$baseUrl = "https://databricksinstance.azuredatabricks.net"
$pat = "databrickspattoken"


# Part 1: Listing jobs.
function Get-Jobs {
    param (
        [string]$baseUrl,
        [string]$pat
    )

    $headers = @{
        'Authorization' = "Bearer $pat"
    }

    $response = Invoke-RestMethod -Uri "$baseUrl/api/2.0/jobs/list" -Method Get -Headers $headers

    # Collect the retrieved jobs in an array.
    $jobs = @()

    foreach ($job in $response.jobs) {
        $jobName = $job.settings.name
        $jobID = $job.job_id
        $clusterspec = $null

        # Branch on "MULTI_TASK" vs "SINGLE_TASK" format; here I only follow up on
        # multi-task jobs, whose cluster spec lives under job_clusters.

        if ($job.settings.format -eq "MULTI_TASK") {
            $mresponse = Invoke-RestMethod -Uri "$baseUrl/api/2.0/jobs/get?job_id=$jobID" -Method Get -Headers $headers

            # Check whether job_clusters and new_cluster properties exist
            if ($mresponse.settings.job_clusters -and $mresponse.settings.job_clusters.Length -gt 0 -and $mresponse.settings.job_clusters[0].new_cluster) {
                $clusterspec = $mresponse.settings.job_clusters[0].new_cluster.node_type_id
            }
        } else {
            $clusterspec = $job.settings.new_cluster.node_type_id
        }

        $j = @{
            'JobName' = $jobName
            'ClusterSpec' = $clusterspec
        }

        $jobs += New-Object PSObject -Property $j
    }

    return $jobs
}

# Get Databricks jobs
$databricksJobs = Get-Jobs -baseUrl $baseUrl -pat $pat



# Part 2: Getting Databricks job runs.
function Get-JobRuns {
    param (
        [string]$baseUrl,
        [string]$pat,
        [datetime]$startDate = (Get-Date).AddDays(-1),
        [datetime]$endDate = (Get-Date)
    )

    Write-Host "Retrieving $startDate to $endDate..."

    # The runs/list endpoint expects epoch milliseconds.
    $uStartDate = [long]($startDate.ToUniversalTime() - (Get-Date "1970-01-01").ToUniversalTime()).TotalMilliseconds
    $uEndDate = [long]($endDate.ToUniversalTime() - (Get-Date "1970-01-01").ToUniversalTime()).TotalMilliseconds

    $runs = @()
    $offset = 0

    $headers = @{
        'Authorization' = "Bearer $pat"
    }

    while ($offset -ge 0) {
        $jobrunsextract = "$baseUrl/api/2.1/jobs/runs/list?completed_only=true&limit=25&offset=$offset&expand_tasks=true&start_time_from=$uStartDate&start_time_to=$uEndDate"
        $response = Invoke-RestMethod -Uri $jobrunsextract -Method Get -Headers $headers

        foreach ($r in $response.runs) {
            $clusterID = $null
            $clusterSpec = $null

            if ($r.tasks[0].cluster_instance) {
                $clusterID = $r.tasks[0].cluster_instance.cluster_id
                $clusterLink = "$baseUrl/api/2.0/clusters/get?cluster_id=$clusterID"
                $cresponse = Invoke-RestMethod -Uri $clusterLink -Method Get -Headers $headers
                $clusterSpec = $cresponse.node_type_id
            } elseif ($r.tasks[0].new_cluster) {
                $clusterSpec = $r.tasks[0].new_cluster.node_type_id
            }

            $runTime = [TimeSpan]::FromMilliseconds($r.end_time - $r.start_time).TotalSeconds

            $o = @{
                'JobName' = $r.run_name
                'RunID' = $r.run_id
                'ClusterID' = $clusterID
                'RunTime' = $runTime
                # start_time is epoch milliseconds, so add it to the Unix epoch.
                'StartTime' = (Get-Date "1970-01-01").AddMilliseconds($r.start_time)
                'ClusterSpec' = $clusterSpec
            }

            $runs += New-Object PSObject -Property $o
        }

        if ($response.has_more -eq $true) {
            $offset += 25
            if ($offset % 250 -eq 0) {
                Write-Host "Retrieved $offset runs..."
            }
        } else {
            $offset = -1
        }
    }

    Write-Host "Retrieved $($runs.Count) runs..."
    return $runs
}

# Part 3: Getting Databricks compute costs.
class ParsedJson {
    [hashtable]$__dict__

    ParsedJson ([hashtable]$kwargs) {
        $this.__dict__ = $kwargs
    }

    [string] ToString() {
        $keys = $this.__dict__.Keys | Sort-Object
        $items = $keys | ForEach-Object { "$_=$($this.__dict__[$_])" }
        return "$($this.GetType().Name)($($items -join ', '))"
    }

    [bool] Equals([object]$other) {
        return $this.__dict__ -eq $other.__dict__
    }
}

function Get-AzureCostData {
    param (
        [datetime]$startDate = (Get-Date).AddDays(-1),
        [datetime]$endDate = (Get-Date)
    )

    $baseURL = "https://management.azure.com/subscriptions/subscriptionID/providers/Microsoft.Consumption/usageDetails"

    $startDateS = $startDate.ToString('yyyy-MM-dd')
    $endDateS = $endDate.ToString('yyyy-MM-dd')

    Write-Host "Retrieving $startDateS through $endDateS..."

    # Acquire a bearer token for the ARM endpoint via the Azure CLI.
    $token = (az account get-access-token --resource https://management.core.windows.net/ | ConvertFrom-Json).accessToken
    $uri = "${baseURL}?api-version=2021-10-01&startDate=$startDateS&endDate=$endDateS&`$expand=properties/meterDetails,properties/additionalInfo"
    $results = @()

    $headers = @{
        'Authorization' = "Bearer $token"
    }

    do {
        try {
            $response = Invoke-RestMethod -Uri $uri -Headers $headers -Method Get
            $results += $response.value | ForEach-Object {
                [PSCustomObject]@{
                    UsageDate = [datetime]::ParseExact($_.properties.date.Substring(0, 10), 'yyyy-MM-dd', $null)
                    Service = $_.properties.consumedService
                    cost = $_.properties.cost
                    Product = $_.properties.product
                    InstanceName = $_.properties.instanceName
                    # additionalInfo may be returned as a JSON string; run it through
                    # ConvertFrom-Json first if these fields come back $null.
                    ClusterSpec = $_.properties.additionalInfo.resourceTags.ClusterSpec
                    JobName = $_.properties.additionalInfo.resourceTags.JobName
                    ResourceType = $_.properties.additionalInfo.resourceTags.ResourceType
                    ResourceID = $_.properties.instanceId
                    ClusterID = $_.properties.additionalInfo.resourceTags.ClusterID
                    ResourceLocation = $_.properties.instanceLocation
                }
            }
            # Follow the service's paging link until no more pages remain.
            $uri = $response.nextLink
        } catch {
            Write-Host "Error: $_"
            $uri = $null
        }
    } while ($uri)

    return $results
}

# Part 4: Final output extraction
$startDate = (Get-Date).AddDays(-2)
$endDate = Get-Date

# Get Databricks jobs
$databricksJobs = Get-Jobs -baseUrl $baseUrl -pat $pat

# Get Databricks job runs
$databricksRuns = Get-JobRuns -baseUrl $baseUrl -pat $pat -startDate $startDate -endDate $endDate

# Get Azure cost data
$allCostData = Get-AzureCostData -startDate $startDate -endDate $endDate

# Filter costs based on criteria
$storageCosts = $allCostData | Where-Object { $_.Service -eq "Microsoft.Storage" }
$vmCosts = $allCostData | Where-Object { $_.Service -eq "Microsoft.Compute" }
$dbuCosts = $allCostData | Where-Object { $_.Service -eq "Microsoft.Databricks" }

# If you only want rows tied to a cluster, add an additional condition, e.g.:
#$storageCosts = $allCostData | Where-Object { $_.Service -eq "Microsoft.Storage" -and $_.ClusterID -ne $null }
#$vmCosts = $allCostData | Where-Object { $_.Service -eq "Microsoft.Compute" -and $_.ClusterID -ne $null }
#$dbuCosts = $allCostData | Where-Object { $_.Service -eq "Microsoft.Databricks" -and $_.ClusterID -ne $null }


# Group costs by job name along with cluster details and costs
$jobCosts = $databricksRuns | ForEach-Object {
    # Capture the current run first: inside a nested Where-Object, $_ refers to the
    # inner pipeline item, so comparing $_.JobName to itself would match every row.
    $run = $_
    $job = $databricksJobs | Where-Object { $_.JobName -eq $run.JobName }
    [PSCustomObject]@{
        JobName = $run.JobName
        ClusterSpec = $run.ClusterSpec
        ClusterID = $run.ClusterID
        RunID = $run.RunID
        StartTime = $run.StartTime
        RunTime = $run.RunTime
        ClusterType = $job.ClusterSpec
        StorageCost = ($storageCosts | Where-Object { $_.JobName -eq $run.JobName }).cost | Measure-Object -Sum | Select-Object -ExpandProperty Sum
        VMCost = ($vmCosts | Where-Object { $_.JobName -eq $run.JobName }).cost | Measure-Object -Sum | Select-Object -ExpandProperty Sum
        DBUCost = ($dbuCosts | Where-Object { $_.JobName -eq $run.JobName }).cost | Measure-Object -Sum | Select-Object -ExpandProperty Sum
    }
} | Group-Object -Property JobName | ForEach-Object {
    $jobGroup = $_
    $totalStorageCost = $jobGroup.Group | Measure-Object -Property StorageCost -Sum | Select-Object -ExpandProperty Sum
    $totalVMCost = $jobGroup.Group | Measure-Object -Property VMCost -Sum | Select-Object -ExpandProperty Sum
    $totalDBUCost = $jobGroup.Group | Measure-Object -Property DBUCost -Sum | Select-Object -ExpandProperty Sum

    # Divide by 1000 to convert to the original unit
    $totalStorageCost /= 1000
    $totalVMCost /= 1000
    $totalDBUCost /= 1000

    $totalRunSeconds = ($jobGroup.Group | Measure-Object -Property RunTime -Sum).Sum
    $overallCost = $totalStorageCost + $totalVMCost + $totalDBUCost

    [PSCustomObject]@{
        JobName = $jobGroup.Name
        ClusterSpec = $jobGroup.Group[0].ClusterSpec
        ClusterID = $jobGroup.Group[0].ClusterID
        TotalRuns = $jobGroup.Count
        # RunTime is in seconds; report run times in minutes.
        TotalRunTime = $totalRunSeconds / 60
        AverageRunTime = ($jobGroup.Group | Measure-Object -Property RunTime -Average).Average / 60
        TotalStorageCost = $totalStorageCost
        TotalVMCost = $totalVMCost
        TotalDBUCost = $totalDBUCost
        OverallCost = $overallCost
        # Convert seconds to hours before dividing; a property of the object under
        # construction cannot be referenced, hence the $overallCost variable above.
        CostPerHour = if ($totalRunSeconds -gt 0) { $overallCost / ($totalRunSeconds / 3600) } else { $null }
    }
}

# Display the results
$jobCosts | Format-Table -AutoSize | Out-String -Width 4096

Make sure to change the PAT token, the Databricks instance URL, the subscription ID, and the start/end times. I chose a two-day window, but you can change it based on your requirements.
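One way to avoid hard-coding these values is to read them from environment variables, for example set by your scheduler's secret store (a sketch; the variable names are just suggestions):

$baseUrl   = $env:DATABRICKS_HOST        # e.g. https://adb-xxxx.azuredatabricks.net
$pat       = $env:DATABRICKS_PAT         # PAT pulled from a secret store / pipeline variable
$startDate = (Get-Date).AddDays(-30)     # widen or narrow the window as needed
$endDate   = Get-Date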

Conclusion:

In summary, this script fetches information about running Databricks jobs and their runs, retrieves Azure cost data, filters and groups costs based on specific criteria, and presents the results in a formatted table for analysis. This script is useful for gaining insights into the costs associated with Databricks job executions and their impact on Azure consumption. Now, I can start taking actions to correct my notebooks, their execution flow, etc.

For converting Unix epoch time with PowerShell, you can use the code below:

$startTimeUnix = $response.runs[0].start_time
$endTimeUnix = $response.runs[0].end_time

# Convert Unix timestamps to DateTime objects
$startTime = Get-Date -Date (Get-Date "1970-01-01 00:00:00").AddMilliseconds($startTimeUnix) -Format "yyyy-MM-dd HH:mm:ss"
$endTime = Get-Date -Date (Get-Date "1970-01-01 00:00:00").AddMilliseconds($endTimeUnix) -Format "yyyy-MM-dd HH:mm:ss"

Make sure to apply the right conversion to the total VM and DBU costs, as the raw aggregated values may be off by a factor of 100 or 1,000 (hence the divide-by-1000 step in Part 4).

This script can be scheduled to run from outside of Databricks, for example via GitHub Actions, Azure DevOps pipelines, an Azure Automation runbook, a Function App, or an Azure VM (if it has internet access).

Feel free to provide your comments. Happy reading.

Written by Prashanth Kumar

IT professional with 20+ years experience, feel free to contact me at: Prashanth.kumar.ms@outlook.com
