Azure Speech Service — Automating Speech-to-Text Transcription using Python
In today’s digital landscape, converting speech to text is a powerful tool for creating accessible content, improving searchability, and analyzing audio data. Azure Cognitive Services offers a robust Speech-to-Text API that can be easily integrated into Python applications. This article will walk you through how to automate the transcription of an audio file using Azure’s Speech-to-Text service and Python.
Overview
We’ll use the Azure Speech-to-Text API to transcribe an audio file hosted on Azure Blob Storage. Our Python script will:
- Create a new transcription job.
- Poll for the transcription job status.
- Retrieve and display the transcription results.
Prerequisites
Before diving into the code, ensure you have:
- An Azure subscription with the Speech-to-Text API enabled.
- A Python environment set up with the requests library installed.
- Access to an audio file stored in Azure Blob Storage.
- Postman
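If you still need the Python packages, they can be installed with pip (azure-storage-blob is only required for the multi-file variant at the end of this article):
pip install requests azure-storage-blob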
Let's Start
Let's start our Speech service journey with the Postman tool first, which gives us step-by-step guidance for generating text from audio files.
1. Let's first download a sample .wav file. I used: https://filesampleshub.com/format/audio/wav
2. Create a new Azure Speech service resource with the Standard tier.
3. Next, open Postman → create a new POST request → copy the endpoint address from your Azure Speech service and use: https://<region>.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=en-US
You can find more Azure Speech service endpoints in the Speech-to-Text REST API documentation.
4. Go to the Headers tab → add “Ocp-Apim-Subscription-Key” with your Azure Speech service key.
Next, add “Content-Type”: audio/wav.
Go to Body → change it to Binary → upload your local .wav file to check connectivity with the Azure Speech service.
5. Now let's proceed further to generate the actual output; this time I uploaded my .wav file to an Azure Storage account.
6. Create a new POST request and add the endpoint /speechtotext/v3.2/transcriptions.
→ Go to Headers and add “Ocp-Apim-Subscription-Key” with your Azure Speech service key.
→ Set “Content-Type” to “application/json”.
7. Go to Body and add your storage account .wav file location along with a SAS token (a sample request body is shown after these steps).
You should get a 201 response along with the transcription and file output.
8. Next, we need to check the generated transcription. Open a new GET request and add your Azure Speech service key under Headers.
Make sure you copy the API URL returned in step 7 (the “self” field).
9. The final request is to see the actual output: copy the “contentUrl” value into your browser.
This is how the output looks.
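For reference, the JSON body submitted in step 7 has the same shape as the one used in the Python script later in this article; the blob URL and SAS token below are placeholders:
{
    "displayName": "prademoss2",
    "description": "Azure Speech Service",
    "locale": "en-US",
    "contentUrls": [
        "https://<storageaccount>.blob.core.windows.net/<container>/<file>.wav?<SAS-token>"
    ],
    "properties": {}
}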
The requests above are only for when you are testing with a single file and starting your Azure Speech service journey.
Let's enhance it further with a Python script.
Python Script
Let's proceed with creating a Python script that you can use for batch or continuous processing.
I have now converted the Postman logic above into an actual Python script.
Let's start with the base logic to get a transcript from a single .wav file. All parameters such as SUBSCRIPTION_KEY, SERVICE_REGION, and TRANSCRIPTION_API_URL are defined in the configuration section below.
1. Configuration
Start by setting up the configuration variables:
- SUBSCRIPTION_KEY: Your Azure subscription key for the Speech-to-Text API.
- SERVICE_REGION: The region where your Azure Speech service is hosted.
- TRANSCRIPTION_API_URL: The endpoint URL for the Speech-to-Text API.
- AUDIO_FILE_URI: The URI of the audio file you want to transcribe.
- DISPLAY_NAME, DESCRIPTION, LOCALE: Metadata and language settings for the transcription job.
# Configuration
SUBSCRIPTION_KEY = "xxxxxx"
SERVICE_REGION = "eastus"
TRANSCRIPTION_API_URL = f"https://{SERVICE_REGION}.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions"
AUDIO_FILE_URI = "https://xxxxx.blob.core.windows.net/source/sample-0.wav?sp=r&st=2024-08-31T00:09:47Z&se=2024-08-31T08:09:47Z&spr=https&sv=2022-11-02&sr=c&sig=xxcxxxxxx%3D"
DISPLAY_NAME = "prademoss2"
DESCRIPTION = "Azure Speech Service"
LOCALE = "en-US"
2. Create Transcription Job
The create_transcription function sends a POST request to the Speech-to-Text API to initiate a new transcription job. It includes metadata about the job and the URI of the audio file.
def create_transcription():
    headers = {
        "Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY,
        "Content-Type": "application/json"
    }
    body = {
        "displayName": DISPLAY_NAME,
        "description": DESCRIPTION,
        "locale": LOCALE,
        "contentUrls": [AUDIO_FILE_URI],
        "properties": {}
    }
    response = requests.post(TRANSCRIPTION_API_URL, headers=headers, json=body)
    response.raise_for_status()
    return response.json()["self"]
Output → Creating new transcription…
3. Transcription Status
The get_transcription_status function checks the status of the transcription job by making a GET request. It will help you monitor the progress of the transcription process.
def get_transcription_status(transcription_id):
    headers = {
        "Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY
    }
    response = requests.get(f"{TRANSCRIPTION_API_URL}/{transcription_id}", headers=headers)
    response.raise_for_status()
    return response.json()
Output → Polling for transcription status…
4. Transcription Results
The get_transcription_results function retrieves the transcription results once the job is complete. It looks for the result file in JSON format and fetches the transcription data.
def get_transcription_results(transcription_id):
    headers = {
        "Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY
    }
    response = requests.get(f"{TRANSCRIPTION_API_URL}/{transcription_id}/files", headers=headers)
    response.raise_for_status()
    files = response.json()["values"]
    for file in files:
        if file["name"].endswith(".json"):
            content_url = file["links"]["contentUrl"]
            result_response = requests.get(content_url)
            result_response.raise_for_status()
            return result_response.json()
Output → Status: Running
Status: Running
Status: Running
Status: Succeeded
Retrieving transcription results…
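The returned JSON contains more than just the plain text (phrase timings, confidence scores, and so on). If you only need the display text, a small helper like the following can pull it out; it assumes the v3.x batch result schema, which typically includes a combinedRecognizedPhrases array:
def extract_display_text(result_json):
    # The batch transcription result JSON usually contains a
    # "combinedRecognizedPhrases" list holding the full "display" text.
    phrases = result_json.get("combinedRecognizedPhrases", [])
    return " ".join(p.get("display", "") for p in phrases)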
5. Main Function
The main function orchestrates the workflow:
- Creates a transcription job.
- Polls the job status until it’s either succeeded or failed.
- Retrieves and displays the transcription results.
def main():
    # Step 1: Create a new transcription
    print("Creating new transcription...")
    transcription_url = create_transcription()
    transcription_id = transcription_url.split("/")[-1]

    # Step 2: Poll for transcription status
    print("Polling for transcription status...")
    status = "Running"
    while status not in ["Succeeded", "Failed"]:
        time.sleep(5)  # Wait for 5 seconds before polling again
        status_response = get_transcription_status(transcription_id)
        status = status_response["status"]
        print(f"Status: {status}")

    if status == "Failed":
        print("Transcription failed.")
        return

    # Step 3: Retrieve and view transcription results
    print("Retrieving transcription results...")
    results = get_transcription_results(transcription_id)
    print("Transcription Results:")
    print(json.dumps(results, indent=4))

if __name__ == "__main__":
    main()
Final Output → the full transcription JSON printed to the console.
This entire example deals with a single .wav file. Here is the full script:
import requests
import time
import json
# Configuration
SUBSCRIPTION_KEY = "xxxxxx"
SERVICE_REGION = "eastus"
TRANSCRIPTION_API_URL = f"https://{SERVICE_REGION}.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions"
AUDIO_FILE_URI = "https://storageaccount.blob.core.windows.net/source/sample-0.wav?sv=2022-11-02&ss=bfqt&srt=sco&sp=rwdlacupiytfx&se=2024-09-03T08:56:37Z&st=2024-09-03T00:56:37Z&spr=https&sig=xxxxxxx"
DISPLAY_NAME = "prademoss2"
DESCRIPTION = "Azure Speech Service"
LOCALE = "en-US"
def create_transcription():
    headers = {
        "Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY,
        "Content-Type": "application/json"
    }
    body = {
        "displayName": DISPLAY_NAME,
        "description": DESCRIPTION,
        "locale": LOCALE,
        "contentUrls": [AUDIO_FILE_URI],
        "properties": {}
    }
    response = requests.post(TRANSCRIPTION_API_URL, headers=headers, json=body)
    response.raise_for_status()
    return response.json()["self"]

def get_transcription_status(transcription_id):
    headers = {
        "Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY
    }
    response = requests.get(f"{TRANSCRIPTION_API_URL}/{transcription_id}", headers=headers)
    response.raise_for_status()
    return response.json()

def get_transcription_results(transcription_id):
    headers = {
        "Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY
    }
    response = requests.get(f"{TRANSCRIPTION_API_URL}/{transcription_id}/files", headers=headers)
    response.raise_for_status()
    files = response.json()["values"]
    for file in files:
        if file["name"].endswith(".json"):
            content_url = file["links"]["contentUrl"]
            result_response = requests.get(content_url)
            result_response.raise_for_status()
            return result_response.json()

def main():
    # Step 1: Create a new transcription
    print("Creating new transcription...")
    transcription_url = create_transcription()
    transcription_id = transcription_url.split("/")[-1]

    # Step 2: Poll for transcription status
    print("Polling for transcription status...")
    status = "Running"
    while status not in ["Succeeded", "Failed"]:
        time.sleep(5)  # Wait for 5 seconds before polling again
        status_response = get_transcription_status(transcription_id)
        status = status_response["status"]
        print(f"Status: {status}")

    if status == "Failed":
        print("Transcription failed.")
        return

    # Step 3: Retrieve and view transcription results
    print("Retrieving transcription results...")
    results = get_transcription_results(transcription_id)
    print("Transcription Results:")
    print(json.dumps(results, indent=4))

if __name__ == "__main__":
    main()
Now, if you are dealing with multiple files inside a storage account container, you can use something like the following. Add a connection string to your configuration section.
STORAGE_CONNECTION_STRING = "XXXXXX"
and then initialize the Blob Storage client. Note that CONTAINER_NAME (the container holding your .wav files, “source” in my example) and the datetime imports also need to be defined:
from azure.storage.blob import BlobServiceClient, generate_blob_sas, BlobSasPermissions
from datetime import datetime, timedelta

CONTAINER_NAME = "source"  # container that holds your .wav files
POLLING_TIMEOUT = 600
POLLING_INTERVAL = 5

blob_service_client = BlobServiceClient.from_connection_string(STORAGE_CONNECTION_STRING)
def generate_sas_token(blob_name):
    sas_token = generate_blob_sas(
        account_name=blob_service_client.account_name,
        container_name=CONTAINER_NAME,
        blob_name=blob_name,
        account_key=blob_service_client.credential.account_key,
        permission=BlobSasPermissions(read=True),
        expiry=datetime.utcnow() + timedelta(hours=1)
    )
    return sas_token

def get_latest_blob():
    print("Connecting to Azure Blob Storage...")
    container_client = blob_service_client.get_container_client(CONTAINER_NAME)
    print(f"Looking for .wav files in container '{CONTAINER_NAME}'...")
    blobs = list(container_client.list_blobs())
    wav_files = [blob for blob in blobs if blob.name.endswith(".wav")]
    if not wav_files:
        print("No .wav files found.")
        return None
    latest_blob = max(wav_files, key=lambda b: b.last_modified)
    print(f"Found latest .wav file: {latest_blob.name}")
    return latest_blob
And then, in your main() function, add these additional lines:
latest_blob = get_latest_blob()
sas_token = generate_sas_token(latest_blob.name)
blob_url = f"https://{blob_service_client.account_name}.blob.core.windows.net/{CONTAINER_NAME}/{latest_blob.name}?{sas_token}"
Conclusion
This Python script demonstrates how to automate the transcription of audio files, first exploring the API with Postman and then automating it with Python. By integrating the script into your applications, you can efficiently convert spoken content into text, making it more accessible and searchable. For further customization, you can adjust the configuration parameters and handle additional scenarios as needed.
The script can be hosted with Azure Functions, an Azure Automation account, or even Azure Batch if you want to run it at larger scale.
I have used the same script on my remote IoT devices to capture end-device anomalies.
Meanwhile, I am writing a Microsoft .NET console app version of this Python code so it can be hosted as a WebJob.