Azure Data Platform & BigData

Prashanth Kumar
May 8, 2020


Building a data platform using different Azure components

1 Purpose of the Document

The purpose of this document is to

a. Present high-level design guidelines that can be adopted and followed for the Azure Data Platform program.

b. Outline the details of the proposed architecture.

2 Scope of the Document

The scope of the document covers the data ingestion, processing, and loading implementation phases.

· Ingestion Phase

o Real-time data ingestion

o Batch data ingestion

· Processing Phase

o Cleansing

o Transformation

· Loading Phase

o Loading mechanism using PolyBase

3 Big Data Solution Design

The following image represents the big data storage and processing layers, elaborated from the overall solution architecture:

4 Azure Components

4.1 Azure Data Factory

· Azure Data Factory is a globally deployed data movement service in the cloud. It can reliably move data from internal and external sources to a common platform for enterprise analytics and decision support systems. Data Factory supports movement of both structured and unstructured data located on-premises and in cloud data sources and targets, making it easier to integrate varied data types originating from diverse sources like traditional databases, machine data sources, and web sources.

· Azure Data Factory helps us create, schedule, monitor, and manage data pipelines. A data pipeline consists of activities that can move data between on-premises and cloud data stores and transform data.

· Azure Data Factory will copy data between various data stores in a secure, reliable, performant and scalable way. As our volume of data or data movement throughput needs grow, Azure Data Factory will scale out to meet those needs.

· Copy Activity in Data Factory will copy data from a source data store to a sink data store (Azure Blob storage).

· Monitor data pipelines in one place. For this, we will use an up-to-the-moment monitoring dashboard to quickly assess end-to-end data pipeline health, pinpoint issues, and take corrective action if needed.

Pipelines and Activities in Azure Data Factory

In addition to the above functionalities, ADF also provides a dynamic monitoring dashboard where the users can monitor the data pipelines.

Dataset: Datasets are named references/pointers to the data you want to use as an input or an output of an Activity, e.g. tables or files.

Activity: Activities define the actions to perform on our data.

Pipeline: Pipelines are a logical grouping of Activities that are managed, monitored, and scheduled as a unit.

Linked service: Linked services define the information Data Factory needs to connect to external resources (connection strings), e.g. SQL Server, FTP, etc.

The following are the key components of ADF used to load data from an external source into Blob storage.

A) Data Movement Activities

Copy Activity in Data Factory copies data from a source data store to a sink data store.

To securely move data between an on-premises data store and a cloud data store, the Data Management Gateway needs to be installed on an on-premises machine. The Data Management Gateway is an agent that enables hybrid data movement and processing. It can be installed on the same machine as the data store itself, or on a separate machine that has access to the data store. As a final step, connect on-premises sources with the Data Management Gateway and use Data Factory to get data to its destination on the Azure platform.

B) Data Transformation Activities

Azure Data Factory supports the following transformation activities, which can be added to pipelines either individually or chained with another activity. Each data transformation activity runs on a compute environment:

· Spark: HDInsight (Hadoop)

· Stored Procedure: Azure SQL, Azure SQL Data Warehouse (Synapse), or SQL Server

C) Pipeline with Copy Activity

We will use Data Factory Editor in the Azure portal, Visual Studio, or Azure PowerShell to create a JSON definition for a pipeline (by using Copy Activity). Then, we will deploy it to create the pipeline in Data Factory.

The activities section can have one or more activities defined within it. Each activity has the following top-level structure:

{
    "name": "ActivityName",
    "description": "description",
    "type": "<ActivityType>",
    "inputs": [],
    "outputs": [],
    "linkedServiceName": "MyLinkedService",
    "typeProperties":
    {
    },
    "policy":
    {
    },
    "scheduler":
    {
    }
}

D) Data Factory scheduling and execution

A pipeline is active only between its start time and end time. It is not executed before the start time or after the end time. If the pipeline is paused, it is not executed irrespective of its start and end time. For a pipeline to run, it should not be paused. You find these settings (start, end, paused) in the pipeline definition:

"start": "2019-04-01T08:00:00Z",
"end": "2019-04-01T11:00:00Z",
"isPaused": false

E) Monitor data pipelines in one place

The Azure portal can be used to view details about the Data Factory resource, such as linked services, datasets and their details, the latest runs of the activities and their status, etc. You can also configure the resource to send notifications when an operation is complete or has failed. Using the Azure portal, you can monitor an ADF pipeline with:

o Diagram View

To see the Diagram view of your data factory, click Diagram on the home page for the data factory.

Right-click the pipeline, and then click Open pipeline to see all activities in the pipeline, along with input and output datasets for the activities. This feature is useful when your pipeline includes more than one activity and you want to understand the operational lineage of a single pipeline.

o Monitor & Manage App

To launch the Monitor and Management app, click the Monitor & Manage tile on the Data Factory blade for your data factory.

4.2 Spark

Apache Spark is a powerful open-source engine that provides real-time stream processing, interactive processing, graph processing, in-memory processing, and batch processing with high speed, ease of use, and a standard interface.

In any big data project there is a demand for a powerful engine that can process data in real time (streaming) as well as in batch mode.

Spark SQL, Apache Spark's module for working with structured data, will be used.

Spark Streaming, which enables powerful interactive and analytic applications across live streaming data, will also be used in the project.
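
As a simple illustration of Spark SQL over data landed in Blob storage, the following minimal PySpark sketch is included; the storage path, schema, and query are illustrative assumptions rather than the project's actual objects.

# Minimal Spark SQL sketch; path, columns, and query are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

# Read a CSV file that ADF has landed in Blob storage.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("wasbs://raw@<storageaccount>.blob.core.windows.net/PRADEMO/sales/"))

# Register a temporary view and query it with Spark SQL.
df.createOrReplaceTempView("sales")
daily_totals = spark.sql(
    "SELECT order_date, SUM(amount) AS total_amount "
    "FROM sales GROUP BY order_date")
daily_totals.show()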

https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-overview

4.3 PolyBase

PolyBase is a technology that accesses data outside of the database via the T-SQL language. In SQL Server 2016, it allows us to run queries on external data in Hadoop or to import/export data from Azure Blob Storage.

PolyBase introduces the below capabilities:

· Query data stored in Azure blob storage. Azure blob storage is a convenient place to store data for use by Azure services. PolyBase makes it easy to access the data by using T-SQL.

· Import data from Azure blob storage. Leverage the speed of Microsoft SQL’s columnstore technology and analysis capabilities by importing data from Azure blob storage into relational tables. There is no need for a separate ETL or import tool.

PolyBase is the most straightforward approach to import data into SQL DW from Blob storage. The steps are as follows; a consolidated sketch appears after the steps.

Step one: Create the master key

The first step is to ensure that a master key has been created in the database.

Step two: Create the data source

An external table is mapped to a container in Azure blob storage using an external data source.

Step three: Create the file format

The file format provides instructions on how to interpret the files in your container.

Step four: Create the external table

The CREATE EXTERNAL TABLE command is used to wire up a new external table to the data source and file format.

Step five: Fire SQL at your new table

Data isn’t actually accessed until you throw a SQL statement at it, at which point PolyBase gets to work processing all the files it can find in the data source. From here loading data is a matter of writing SQL scripts that move data from one table to another.

Step six: Create a stored procedure to load data from the external table into the final internal SQL DW table.
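
The following minimal sketch consolidates these steps as T-SQL executed through pyodbc (any T-SQL client would work); the connection string, storage account, container, credential secret, and table definitions are illustrative placeholders, and the target staging table dbo.stg_orders is assumed to already exist.

# Minimal sketch of the PolyBase steps above, run through pyodbc.
# Connection details, storage account, container, secret, and table
# definitions are illustrative placeholders; dbo.stg_orders is assumed to exist.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server>.database.windows.net;DATABASE=<sqldw>;"
    "UID=<user>;PWD=<password>")
conn.autocommit = True  # run each DDL statement in its own implicit batch
cur = conn.cursor()

# Step one: create the master key (once per database).
cur.execute("CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong-password>'")

# Step two: credential and external data source pointing at the blob container.
cur.execute("""CREATE DATABASE SCOPED CREDENTIAL BlobCredential
               WITH IDENTITY = 'user', SECRET = '<storage-account-key>'""")
cur.execute("""CREATE EXTERNAL DATA SOURCE AzureBlob
               WITH (TYPE = HADOOP,
                     LOCATION = 'wasbs://raw@<storageaccount>.blob.core.windows.net',
                     CREDENTIAL = BlobCredential)""")

# Step three: file format describing the files in the container.
cur.execute("""CREATE EXTERNAL FILE FORMAT CsvFormat
               WITH (FORMAT_TYPE = DELIMITEDTEXT,
                     FORMAT_OPTIONS (FIELD_TERMINATOR = ','))""")

# Step four: external table wired to the data source and file format.
cur.execute("""CREATE EXTERNAL TABLE dbo.ext_orders
               (order_id INT, customer_id INT, amount DECIMAL(18, 2))
               WITH (LOCATION = '/PRADEMO/sales/',
                     DATA_SOURCE = AzureBlob, FILE_FORMAT = CsvFormat)""")

# Step five: querying the external table makes PolyBase read the files.
cur.execute("SELECT COUNT(*) FROM dbo.ext_orders")
print(cur.fetchone()[0])

# Step six: stored procedure that loads the external table into the internal staging table.
cur.execute("""CREATE PROCEDURE dbo.usp_load_orders AS
               BEGIN
                   INSERT INTO dbo.stg_orders (order_id, customer_id, amount)
                   SELECT order_id, customer_id, amount FROM dbo.ext_orders;
               END""")
cur.execute("EXEC dbo.usp_load_orders")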

https://docs.microsoft.com/en-us/sql/relational-databases/polybase/polybase-guide?view=sql-server-ver15

4.4 Blob Storage

Azure Blob storage will be used to store and retrieve ingested data. Blob storage serves here as the raw data store because it provides the most cost-effective persistent data storage on the Azure cloud.

Other benefits of using Blob storage are compatibility, performance, availability, durability, scalability, and security.

It is assumed that customer data in the cloud will grow at an exponential pace, and we need ways to manage the cost of storing this data for extended periods of time. An important aspect of managing storage costs is tiering this ever-growing data based on attributes such as frequency of access and retention period.

A common category of customer data is cool data, which is infrequently accessed but requires latency and performance similar to hot data. We will use the cool tier to store infrequently accessed data and the hot tier for frequently accessed data.
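
As an illustration, the following minimal sketch uses the azure-storage-blob SDK to move an infrequently accessed blob to the cool tier; the connection string, container, and blob name are placeholders.

# Minimal sketch: demote an infrequently accessed blob to the Cool tier.
# Connection string, container, and blob name are illustrative placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="raw",
                               blob="PRADEMO/sales/20/05/08/orders.csv")

blob.set_standard_blob_tier("Cool")          # infrequently accessed raw data
print(blob.get_blob_properties().blob_tier)  # expected: Cool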

https://azure.microsoft.com/en-in/services/storage/blobs/

4.5 Event Hub

Azure Event Hubs is a hyper-scale streaming ingestion service that collects, transforms, and stores millions of events/messages. As a distributed streaming platform, it gives us low latency and configurable time retention, which enables us to ingress massive amounts of telemetry into the cloud and read the data from multiple applications using publish-subscribe semantics.

Most of the time we will leverage the Event Hubs Archive feature, which allows a single stream to support both real-time and batch-based pipelines and reduces the complexity of our solution. We can build for today's batch processing on a platform that supports tomorrow's real-time analytics and processing. Event Hubs Archive is the easiest way to load data into Azure.
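
For illustration, the following minimal sketch publishes events to an event hub using the azure-eventhub Python SDK; the connection string, event hub name, and payloads are illustrative placeholders.

# Minimal sketch: publish telemetry events to Event Hubs.
# Connection string, event hub name, and payloads are illustrative placeholders.
import json
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    "<event-hubs-namespace-connection-string>", eventhub_name="telemetry")

with producer:
    batch = producer.create_batch()
    for reading in ({"device": "pump-01", "temp": 71.2},
                    {"device": "pump-02", "temp": 69.8}):
        batch.add(EventData(json.dumps(reading)))
    producer.send_batch(batch)  # downstream consumers read with publish-subscribe semantics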

https://azure.microsoft.com/en-in/services/event-hubs/

5 ETL Design

5.1 ETL Overview

The ETL process will consist of extracting data from the on-premises source systems and loading it into the Azure SQL Data Warehouse staging layer. The data undergoes transformation as per the cleansing and standardization rules and is loaded into the destination structures. Azure Data Factory (pipeline activity) will facilitate data ingestion from multiple on-premises data sources using the Data Management Gateway and copy the data into Azure Blob storage.

The below steps will be taken into consideration for the ETL Flow:

A. Full load/incremental load: This is the process of identifying which data needs to be copied to Blob storage.

a. There will be a one-time full load that extracts data from the history start date (TBD) up to the current date.

b. Incremental loads will be scheduled based on the frequency requirements for facts and dimensions.

c. Delta detection will be done at the source system, based on a timestamp, to identify the fresh records for every scheduled incremental load.

B. Validation process: This is the process of measuring the accuracy and integrity of the loaded data against the defined business rules.

C. Transformation process: This is the process of performing transformations in the ETL flow. In this design, transformation is handled by PolyBase and the Spark job activity.

D. Exception handling: In the Spark jobs we will use log4j as the standard library for custom logging.

E. We will use the Monitoring and Management app to monitor, manage, and debug our Data Factory pipelines. This app also provides information on how to create alerts to get notified on failures.

https://azure.microsoft.com/en-in/services/data-factory/

5.2 Historical and Incremental Data Load Strategy

The ADF pipelines are designed so that both the historical and incremental loads can be captured using the Copy Data pipeline activity. For historical data processing (the one-time full load), the entire data set up to the date of execution will be ingested by the copy activity of ADF.

Incremental delta detection will happen at the source layer. The process of incremental data loading is described below.

- Each source table will have a last-modified datetime column.

- Identify the last-updated date column for each table of the source system.

- Write parameterized queries to pick only new or modified values, using time slices and system variables in ADF (see the sketch after this list).

- Copy the incremental data into Blob storage.

- Process the data using Spark SQL and store the processed data in the staging layer.

- PolyBase will load the data from the staging layer into the data warehouse using T-SQL.
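
For illustration, the following minimal Python sketch builds such a per-slice delta query; the table and watermark column names are hypothetical, and in ADF the window boundaries are supplied through the WindowStart and WindowEnd system variables.

# Minimal sketch of the per-slice delta query; the table and watermark column
# names are hypothetical. In ADF the window boundaries come from the
# WindowStart and WindowEnd system variables.
from datetime import datetime

def build_delta_query(table: str, watermark_col: str,
                      window_start: datetime, window_end: datetime) -> str:
    """Select only rows created or modified inside the current time slice."""
    return (f"SELECT * FROM {table} "
            f"WHERE {watermark_col} >= '{window_start:%Y-%m-%d %H:%M:%S}' "
            f"AND {watermark_col} < '{window_end:%Y-%m-%d %H:%M:%S}'")

print(build_delta_query("dbo.orders", "last_modified_dt",
                        datetime(2019, 4, 1, 8), datetime(2019, 4, 1, 9)))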

5.3 Modeling Directory Storage on Blob Storage

Ingested data will be stored in a specified folder. Each source system will have its own folder, and subfolders will be created based on the timestamp. For example, for the PRADEMO data source, the incremental load destination folder will be PRADEMO/<<subject-area>>/YY/MM/DD/<<file_name>>.

Automatic folder creation, orchestration, and scheduling will be done by Azure Data Factory.
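
To make the layout concrete, here is a minimal sketch that builds the destination path described above; the subject area and file name are illustrative, and in practice ADF creates these folders automatically.

# Minimal sketch of the folder convention above; the subject area and file
# name are illustrative (in practice ADF creates these folders automatically).
from datetime import datetime

def destination_path(source_system: str, subject_area: str,
                     file_name: str, slice_date: datetime) -> str:
    return f"{source_system}/{subject_area}/{slice_date:%y/%m/%d}/{file_name}"

print(destination_path("PRADEMO", "sales", "orders.csv", datetime(2020, 5, 8)))
# -> PRADEMO/sales/20/05/08/orders.csv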

5.4 Copy data on-premises to Azure BLOB STORAGE

A Lambda architecture is used here to handle massive amounts of data by taking advantage of both batch and stream processing methods. This approach attempts to balance latency, throughput, and fault tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data.

The ADF framework is used here to monitor the ETL process while loading data from on-premises sources to Azure Blob Storage using the ADF pipeline activity.

All source system data, whether batch or real-time (historical and incremental), first lands in Blob storage, and a copy of this original data will be kept indefinitely for future use.

There are two different kinds of batch data: historical and incremental.

It is expected that only relational data is ingested into the system through batch mode. To facilitate the loading of batch data, the Data Management Gateway will be installed on an on-premises system; it will establish connections to the different source systems and, with the help of ADF, copy the data from each source system into the specified location in Blob storage.

5.5 Load/ETL using SPARK

Spark Processing:

· Load the file from the blob container by initializing the Spark core context.

· Split each line of the raw file into separate tokens using Spark's map() function, which creates a new Spark RDD.

· Next, iterate over the RDD and apply validation and transformation rules on the required fields using standard Spark SQL functions.

· Segregate valid and rejected records into separate DataFrames using the toDF() function.

· Finally, the DataFrames write records into the Blob staging layer. The Blob staging layer will be used to create external tables and provides the opportunity to keep validation-passed and validation-failed records in separate tables.

· The Blob staging layer will have a temporary area that is completely replaced/refreshed after each incremental load. A PySpark sketch of this flow follows the list.
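
The following minimal PySpark sketch illustrates this flow; the blob paths, column names, and the validation rule (a numeric amount) are illustrative assumptions rather than the actual project rules.

# Minimal PySpark sketch of the flow above; blob paths, column names, and the
# validation rule (numeric amount) are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("blob-cleansing").getOrCreate()

# Load the raw file from the blob container and split each line into tokens,
# producing a new RDD.
raw = spark.sparkContext.textFile(
    "wasbs://raw@<storageaccount>.blob.core.windows.net/PRADEMO/sales/20/05/08/orders.csv")
tokens = raw.map(lambda line: line.split(","))

# Convert the token RDD into a DataFrame with toDF().
df = tokens.toDF(["order_id", "customer_id", "amount"])

# Apply a validation rule and segregate valid and rejected records.
checked = df.withColumn("amount_num", F.col("amount").cast("double"))
valid = checked.filter(F.col("amount_num").isNotNull()).drop("amount_num")
rejected = checked.filter(F.col("amount_num").isNull()).drop("amount_num")

# Write the two record sets to separate folders in the Blob staging layer.
valid.write.mode("overwrite").csv(
    "wasbs://staging@<storageaccount>.blob.core.windows.net/valid/orders/")
rejected.write.mode("overwrite").csv(
    "wasbs://staging@<storageaccount>.blob.core.windows.net/rejected/orders/")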

5.6 Load/ETL using Polybase

PolyBase processing:

· Process data from the Blob staging layer into SQL DW using PolyBase.

· Create datasets: create input and output datasets that represent the input and output data for the copy operation from the Blob staging layer to SQL DW using PolyBase.

· Create and configure external objects in SQL Data Warehouse and point them to the data stored in Blob storage.

· Create a stored procedure to load data into a SQL DW staging table with the required field transformations (i.e. data type conversions if required).

· Schedule and run the ADF pipeline.

5.7 Real time Processing

1. Real-time messages will be published to Event Hubs via the BizTalk server.

2. The messages are retrieved from Event Hubs in real time using an application running on a Spark cluster on Azure HDInsight.

3. We will build streaming analytics pipelines to persist the data to Azure Blob storage.

4. After landing in Blob storage, the storage and processing of real-time data will be handled the same way as batch data; however, the frequency will be higher to meet the latency requirements of the application. A streaming sketch is shown below.
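
The following minimal sketch illustrates steps 2 and 3, assuming the azure-eventhubs-spark connector is available on the HDInsight Spark cluster; the connection string, landing path, and output format are illustrative placeholders.

# Minimal sketch of the streaming path, assuming the azure-eventhubs-spark
# connector is installed on the HDInsight Spark cluster; connection string,
# paths, and formats are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("eventhub-to-blob").getOrCreate()

# The connector expects the connection string to be encrypted with its helper.
conn_str = "<event-hub-connection-string>"
eh_conf = {
    "eventhubs.connectionString":
        spark.sparkContext._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn_str)
}

stream = spark.readStream.format("eventhubs").options(**eh_conf).load()

# The event body arrives as binary; cast it to string before persisting.
messages = stream.selectExpr("CAST(body AS STRING) AS message", "enqueuedTime")

# Persist the stream to Blob storage; batch-style processing picks it up from there.
query = (messages.writeStream
         .format("parquet")
         .option("path", "wasbs://raw@<storageaccount>.blob.core.windows.net/realtime/")
         .option("checkpointLocation",
                 "wasbs://raw@<storageaccount>.blob.core.windows.net/checkpoints/realtime/")
         .start())
query.awaitTermination()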

5.8 Reject Record handling

The records rejected during the ETL load will be handled as follows.

All records that pass the validation rules during data processing through the Spark jobs will be loaded into the staging table, and failed records will be stored in Blob storage in a separate folder (TBD).

The data warehouse will therefore hold the records that passed validation; based on requirements, selected records will be used for further analysis or reporting.

Notes:

· Since all rejected records are stored in Blob storage, no additional effort or logic is required for reprocessing or segregation.

5.9 DATA load frequency

The load frequency varies per source and type of integration. ADF pipelines will be scheduled to run at a specified interval for each source system. The detailed design document for each source system will contain this information.

5.10 ETL jobs and scheduling strategy

An ETL pipeline is active only between its start time and end time. It is not executed before the start time or after the end time. If the pipeline is paused, it is not executed regardless of its start and end times. For a pipeline to run, it should not be paused. These settings (start, end, isPaused) are found in the pipeline definition:

"start": "2019-04-01T08:00:00Z",
"end": "2019-04-01T11:00:00Z",
"isPaused": false

Specify schedule for an activity:

It is not the pipeline itself that is executed; it is the activities in the pipeline that are executed in the overall context of the pipeline. We specify a recurring schedule for an activity by using the scheduler section of the activity JSON. For example, we can schedule an activity to run hourly as follows:

"scheduler": {
    "frequency": "Hour",
    "interval": 1
},

Specify schedule for a dataset:

An activity in a Data Factory pipeline can take zero or more input datasets and produce one or more output datasets. For an activity, you can specify the cadence at which the input data is available or the output data is produced by using the availability section in the dataset definitions.

Frequency in the availability section specifies the time unit. The allowed values for frequency are: Minute, Hour, Day, Week, and Month. The interval property in the availability section specifies a multiplier for frequency. In the following example, the input data is available hourly and the output data is produced hourly ("frequency": "Hour", "interval": 1).

In this example, the activity runs hourly between the start and end times of the pipeline. The output data is produced hourly for three one-hour windows (8-9 AM, 9-10 AM, and 10-11 AM).

Each unit of data consumed or produced by an activity run is called a data slice. The following diagram shows an example of an activity with one input dataset and one output dataset:

The diagram shows the hourly data slices for the input and output dataset. The diagram shows three input slices that are ready for processing. The 10–11 AM activity is in progress, producing the 10–11 AM output slice.

This example also uses WindowStart and WindowEnd to select relevant data for an activity run and copy it to a blob with the appropriate folderPath. The folderPath is parameterized to have a separate folder for every hour.

Offset Dataset availability:

By default, daily (“frequency”: “Day”, “interval”: 1) slices start at 12 AM UTC time (midnight). If you want the start time to be 6 AM UTC time instead, set the offset as shown in the following:

"availability":
{
    "frequency": "Day",
    "interval": 1,
    "offset": "06:00:00"
}

Dataset policy:

A dataset can have a validation policy defined that specifies how the data generated by a slice execution can be validated before it is ready for consumption. In such cases, after the slice has finished execution, the output slice status is changed to Waiting with a sub status of Validation. After the slices are validated, the slice status changes to Ready. If a data slice has been produced but did not pass the validation, activity runs for downstream slices that depend on this slice are not processed.
