ADF to ADLS Integration
Integration between ADF and ADLS
Overview
Data Lake Store is an enterprise-wide hyper-scale repository for big data analytics workloads. It is fully aligned with the Hadoop ecosystem and standards, with full support for Hadoop tools and engines as well as unique Microsoft capabilities.
Service Details
Architecture
The following picture depicts the Data Lake analytics architecture. Please note that it reflects the to-be design; some of the features were not available at the time of writing.
References:
https://docs.microsoft.com/en-us/azure/data-lake-store/
https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-data-scenarios
SKU: There are two different pricing models available for Data Lake Analytics.
· Pay-as-You-Go is a model that is paid by the second with no long-term commitments.
· Monthly commitment packages provide a significant discount (up to 33%) compared to Pay-as-You-Go pricing.
References:
https://azure.microsoft.com/en-us/pricing/details/data-lake-store/
Preferred deployment method: This service can easily be deployed through the Azure Portal. However, it is recommended to deploy the service with an ARM template.
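As a hedged illustration, the sketch below deploys a Data Lake Store (Gen1) account from an inline ARM template using the Python management SDK. The resource group, account name and location are placeholders, and the method names should be checked against the installed azure-mgmt-resource version:

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

subscription_id = "<subscription-id>"
client = ResourceManagementClient(DefaultAzureCredential(), subscription_id)

# Inline ARM template declaring a single Data Lake Store (Gen1) account.
template = {
    "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "resources": [{
        "type": "Microsoft.DataLakeStore/accounts",
        "apiVersion": "2016-11-01",
        "name": "mydatalakestore01",   # placeholder account name
        "location": "eastus2",
        "properties": {"encryptionState": "Enabled"},
    }],
}

# Deploy the template into an existing resource group (placeholder names).
poller = client.deployments.begin_create_or_update(
    "my-resource-group",
    "adls-deployment",
    {"properties": {"mode": "Incremental", "template": template}},
)
print(poller.result().properties.provisioning_state)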
Security
Access management: Data Lake Store is integrated with Azure Active Directory by default. Azure Data Lake Store implements an access control model that derives from HDFS, which in turn derives from the POSIX access control model. Access control lists (ACLs) can be defined at the file and folder level across the store's file system namespace. Navigate to Azure Portal -> Data Lake Store -> Data Explorer -> Access. Ensure that only the least necessary privileges are granted on each folder and file (a scripted sketch follows the requirement below).
· Requirement: Don’t grant any permissions (Read/Write/Execute) to everyone else on any folder or any file.
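A minimal sketch of applying this requirement with the azure-datalake-store Python package is shown below: it grants a single service principal rwx on a folder (plus a matching default ACL for new children) and leaves "everyone else" with no permissions. The store name, folder path and object ID are placeholders:

from azure.datalake.store import core, lib

# Authenticate as a service principal (all values are placeholders).
token = lib.auth(tenant_id="<tenant-id>",
                 client_id="<spn-app-id>",
                 client_secret="<spn-secret>")
adls = core.AzureDLFileSystem(token, store_name="mydatalakestore01")

folder = "/projects/sales"
spn_object_id = "<spn-object-id>"

# Grant rwx to this principal only, plus a default ACL so new children
# inherit it; "everyone else" (other) is granted nothing.
adls.modify_acl_entries(
    folder,
    f"user:{spn_object_id}:rwx,default:user:{spn_object_id}:rwx",
)
print(adls.get_acl_status(folder))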
Set up firewall rules: Firewall rules can be used to further lock down Data Lake Store at the network level. When the firewall feature is enabled, only the specified IP addresses or IP ranges are allowed. If other Azure services, such as Azure Data Factory or VMs, connect to the Data Lake Store account, Allow Azure Services must be turned on (see the ARM fragment after the requirement below).
· Requirement: Enable Firewall to restrict access to Data Lake Store
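A hedged sketch of the corresponding ARM template fragment is below, reusable in the template from the deployment section. Property names follow the Microsoft.DataLakeStore/accounts template reference as documented; the account name and IP range are placeholders:

# Account-level firewall settings plus one named firewall rule, expressed as
# ARM resources (Python dicts that can be merged into the earlier template).
firewall_resources = [
    {
        "type": "Microsoft.DataLakeStore/accounts",
        "apiVersion": "2016-11-01",
        "name": "mydatalakestore01",
        "location": "eastus2",
        "properties": {
            "firewallState": "Enabled",
            # Needed when ADF, VMs or other Azure services must connect.
            "firewallAllowAzureIps": "Enabled",
        },
    },
    {
        "type": "Microsoft.DataLakeStore/accounts/firewallRules",
        "apiVersion": "2016-11-01",
        "name": "mydatalakestore01/corp-egress",
        "dependsOn": [
            "[resourceId('Microsoft.DataLakeStore/accounts', 'mydatalakestore01')]"
        ],
        "properties": {
            "startIpAddress": "203.0.113.0",   # placeholder corporate range
            "endIpAddress": "203.0.113.255",
        },
    },
]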
References:
https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-security-overview#network-isolation
Use appropriate keys for Data Lake Store encryption: Encryption for Data Lake Store is configured during account creation and is always enabled by default. Two key options exist. The Data Lake Store service can use keys stored in a Key Vault in the same subscription; encryption stays transparent, but key rotation is the user's responsibility, and this option is recommended for high-confidential data. Alternatively, the Data Lake Store service can own the key, making both key management and encryption transparent; this option is recommended for non-high-confidential data. (A sketch of both options follows the requirements below.)
· Requirement: Use keys managed by Data Lake Store for non-High Confidential Data
· Requirement: Use keys from your own Key Vault for High-Confidential Data
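A hedged sketch of both options, expressed as ARM-style property fragments for the account resource (property names as documented in the Microsoft.DataLakeStore/accounts schema), is shown below; the Key Vault resource ID, key name and version are placeholders:

# Option 1: keys owned and rotated by the Data Lake Store service
# (recommended here for non-high-confidential data).
service_managed_encryption = {
    "encryptionState": "Enabled",
    "encryptionConfig": {"type": "ServiceManaged"},
}

# Option 2: customer-managed key held in your own Key Vault
# (recommended here for high-confidential data; rotation is your job).
user_managed_encryption = {
    "encryptionState": "Enabled",
    "encryptionConfig": {
        "type": "UserManaged",
        "keyVaultMetaInfo": {
            "keyVaultResourceId": "/subscriptions/<sub>/resourceGroups/<rg>"
                                  "/providers/Microsoft.KeyVault/vaults/<vault>",
            "encryptionKeyName": "adls-cmk",
            "encryptionKeyVersion": "<key-version>",
        },
    },
}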
Plan for backup and disaster recovery:
Data in a Data Lake Store account is resilient to transient hardware failures within a region through automated replicas, but to protect data from a region-wide outage, developers need to copy it to another Data Lake Store account in another region using tools such as ADF.
· Recommended: Backup and Disaster Recovery must be planned for the default Data Lake Store account
References:
https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-disaster-recovery-guidance
Monitoring
Azure Monitor can be used to monitor health and usage. OMS can be configured to maintain diagnostic logs and create alerts. Diagnostic logging allows you to collect data access audit trails (a scripted example follows the references below):
· Request logs capture every API request made on the Data Lake Store account.
· Audit Logs are like request Logs but provide a much more detailed breakdown of the operations being performed on the Data Lake Store account. For example, a single upload API call in request logs might result in multiple “Append” operations in the audit logs.
References:
https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-diagnostic-logs
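As a hedged example, the sketch below enables the "Requests" and "Audit" log categories and routes them to a Log Analytics (OMS) workspace using azure-mgmt-monitor. The resource IDs are placeholders, and the payload keys mirror the diagnostic-settings REST schema, so verify them against your SDK version:

from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient

subscription_id = "<subscription-id>"
adls_resource_id = (
    f"/subscriptions/{subscription_id}/resourceGroups/my-resource-group"
    "/providers/Microsoft.DataLakeStore/accounts/mydatalakestore01"
)
workspace_id = (
    f"/subscriptions/{subscription_id}/resourceGroups/my-resource-group"
    "/providers/Microsoft.OperationalInsights/workspaces/my-oms-workspace"
)

monitor = MonitorManagementClient(DefaultAzureCredential(), subscription_id)
monitor.diagnostic_settings.create_or_update(
    adls_resource_id,
    "adls-diagnostics",
    {
        "workspaceId": workspace_id,
        "logs": [
            {"category": "Requests", "enabled": True},  # every API request
            {"category": "Audit", "enabled": True},     # detailed operations
        ],
    },
)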
Integration between ADF and ADLS
I’ve been working on a POC recently in which we were exploring the different data and storage capabilities available in Azure.
One particular scenario we’ve been testing is using Azure Data Factory (ADF) to copy and transform data to Azure Data Lake Storage Gen1 (ADLS).
ADF has native support for an extensive library of linked data sources, including ADLS. However, connecting ADF to ADLS is not as seamless as you might expect, due to the requirement to configure a service principal (SPN) or managed identity and grant it permissions on the store.
Azure Data Factory
Azure Data Factory (ADF) is Microsoft’s cloud-hosted data integration service. In a nutshell, it’s a fully managed service that allows you to define ETL (Extract Transform Load) pipelines within Azure.
Those pipelines are typically authored through the Azure Portal (code free!) and are then scheduled, triggered and monitored as needed.
ADF supports a multitude of linked services covering data sources across cloud and on-premises environments. One of ADF’s most interesting features is the way it can transform data by leveraging other Azure services such as Azure HDInsight, Data Lake Analytics, Azure Databricks and Machine Learning.
Azure Data Lake Storage (Gen1)
Azure Data Lake Storage is Microsoft’s massive scale, Active Directory secured and HDFS-compatible storage system. ADLS is primarily designed and tuned for big data and analytics workloads.
Azure Data Lake (Gen1)
Open the Azure Portal -> click on “Create a resource” -> search for “Data Lake” -> select the option “Data Lake Storage Gen1” -> click on Create.
Enter the required details such as Name, Subscription, Resource group, Location, and the pricing package (“Pay-as-you-go”), and then click on Create.
When you open it for the first time, the screen will look like this:
Now we need to start creating the folder structure.
To do this, scroll down -> go to the Data Lake Storage Gen1 section -> click on the Data Explorer option.
Now we need to create individual folders for each project.
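If you prefer scripting the folder layout over clicking through Data Explorer, a minimal sketch with the azure-datalake-store package might look like this (store name, credentials and folder names are placeholders):

from azure.datalake.store import core, lib

token = lib.auth(tenant_id="<tenant-id>",
                 client_id="<app-id>",
                 client_secret="<secret>")
adls = core.AzureDLFileSystem(token, store_name="mydatalakestore01")

# One folder per project under raw/ and curated/ (placeholder layout).
for folder in ["/raw/projectA", "/raw/projectB", "/curated/projectA"]:
    if not adls.exists(folder):
        adls.mkdir(folder)

print(adls.ls("/raw"))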
Azure Data Factory
This demo walks through Azure Data Factory via its user interface. Here we will be creating the ADLS Gen1 connection, a pipeline and datasets, and will be monitoring the data factory. Triggers will also be executed along the way.
Creating Data Factory:
In the Azure portal, go to + Create Resource -> Analytics -> Data Factory. Give a name for the resource, select a subscription, a resource group, version, location and click on Create.
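For reference, the same data factory can be created with the azure-mgmt-datafactory SDK (a recent version that accepts azure.identity credentials); a minimal sketch with placeholder names is below:

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

adf_client = DataFactoryManagementClient(DefaultAzureCredential(),
                                         "<subscription-id>")
# Resource group, factory name and region are placeholders.
factory = adf_client.factories.create_or_update(
    "my-resource-group", "my-adf-demo", Factory(location="eastus2"))
print(factory.provisioning_state)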
In the overview page of the data factory, click on the Author & Monitor tile to navigate to the management portal of Azure Data Factory.
When you open it for the first time, it will show a window something like this.
To start new work -> click on the Author option.
Adding Connections in Data Factory:
In order to connect to different systems, we need to create connections first. In this case we are getting the data from a REST API, which will be consumed by ADF and then written to ADLS and Azure Synapse, so I need 3 connections:
1. ADLS connection
2. REST connection
3. Azure Synapse Analytics
1. ADLS Connection:
To begin creating the ADLS connection, go to the management portal, click on the Connections button at the bottom, and click on +New to add a connection. Select the option “Azure Data Lake Storage Gen1”
-> click on Continue. It will take you to a new screen to create a new linked service.
In this case I am going to use the Managed Identity option (however, it depends on the application architecture, and users can select the “Service Principal” option as well).
Before you click on the Create button, it is always good practice to click on Test Connection. This will verify whether connectivity to ADLS exists or whether any permissions need to be fixed at the ADLS level. (Below you can see a sample error message.)
In order to resolve this issue, go to Azure Data Lake -> select your ADLS -> select the folder where you want to drop the files -> click on the Access button -> add the data factory’s managed identity (search for the data factory’s name) and grant it “Read”, “Write” and “Execute”. Avoid granting these permissions under “Everyone else”, since that would conflict with the access-control requirement stated earlier in this document.
And then test the connectivity again.
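For reference, the linked service definition the authoring UI generates for this screen (viewable via its code view) looks roughly like the sketch below, expressed here as a Python dict. It assumes the Managed Identity option, so only the store URI, subscription and resource group are supplied; all names are placeholders:

adls_linked_service = {
    "name": "AzureDataLakeStore1",
    "properties": {
        "type": "AzureDataLakeStore",
        "typeProperties": {
            # Gen1 WebHDFS endpoint of the store created earlier.
            "dataLakeStoreUri":
                "https://mydatalakestore01.azuredatalakestore.net/webhdfs/v1",
            # Required when authenticating with the factory's managed identity.
            "subscriptionId": "<subscription-id>",
            "resourceGroupName": "my-resource-group",
        },
    },
}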
2. REST API Connection
For REST API connectivity, click on Connections -> go to Linked Services -> +New -> select “REST”
-> click on Continue. In the new window, provide:
· Integration runtime (in case you are using a separate one)
· Base URL (provide the base URL from which you are trying to fetch data)
· Authentication type (Anonymous, Basic, AAD Service Principal, Managed Identity)
· Enable server certificate validation (always keep this option enabled)
Click on Create
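The resulting REST linked service definition is roughly the following sketch (again as a Python dict, with placeholder values); the authenticationType would change if you pick Basic, AAD Service Principal or Managed Identity instead of Anonymous:

rest_linked_service = {
    "name": "RestService1",
    "properties": {
        "type": "RestService",
        "typeProperties": {
            "url": "<base-url>",                        # base URL of the API
            "authenticationType": "Anonymous",          # or Basic / AAD / MSI
            "enableServerCertificateValidation": True,  # keep this enabled
        },
    },
}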
3. Azure Synapse
To insert the data into Azure Synapse, we have to create a new linked service -> click on Connections -> go to Linked Services -> +New -> and provide the following (a JSON-style sketch of the resulting linked service follows the options below):
· Account selection method (based on your requirements you can either go with “From Azure subscription” or enter the details manually)
· Authentication type (SQL Authentication or Managed Identity); again, it is generally better to go with Managed Identity
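A rough sketch of the resulting Azure Synapse (SQL DW) linked service definition is below. With Managed Identity the connection string carries no credentials, and the data factory’s identity must be added as a database user on the SQL pool; server and database names are placeholders:

synapse_linked_service = {
    "name": "AzureSynapseAnalytics1",
    "properties": {
        "type": "AzureSqlDW",
        "typeProperties": {
            # No user/password: the factory's managed identity is used, so it
            # must exist as a database user on the dedicated SQL pool.
            "connectionString":
                "Server=tcp:<server>.database.windows.net,1433;"
                "Database=<sql-pool-name>;",
        },
    },
}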
Check connectivity
Make sure you update the firewall rules on the Synapse SQL pool (data warehouse) server,
under Firewalls and virtual networks.
After adding my IP address, I validated the connection again.
ADF Pipeline creation
Now we want to create a new pipeline to get the data from external sources via a REST call. Go to Factory Resources -> click on Pipelines -> click on New Pipeline.
It will open a new screen with additional details. In our case we will be selecting option “
We will start entering the details for the REST API call.
Now, in order to test whether the REST API call is working or not, click on the Debug option.
The output will be shown below with the result (either success or failure).
The next step is to put these files into Data Lake Storage, so I am going to add a new step to the pipeline.
Under Activities, search for the option “Copy Data”. Select “Copy Data” and drag and drop it next to the existing task.
Since I want to trigger “Copy Data” after my REST API step, select the green connector on the REST API task and drag it to join the “Copy Data” task.
Now we have to tell the “Copy Data” task where we want to drop the files. Select “Copy Data” -> in the properties window below, select Source -> click on +New.
Under Source we need to point the source at the REST connection; in this case we will be selecting the “RestResource1” dataset. Add the required request body and, below it, add any headers as well, such as “Content-Type” or “Ocp-Apim-Subscription-Key”.
Next, under Sink, as we want to put the data in ADLS -> click on +New -> in the search box type “Data Lake” -> select “Azure Data Lake Storage Gen1” as the data store -> Continue -> select the format in which you want to save the data -> select JSON -> Continue.
As a best practice we are going with JSON.
Under Settings -> set the Data Integration Unit to “Auto” (a sketch of the generated pipeline JSON follows below).
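A hedged sketch of the pipeline definition that this configuration produces (viewable via the pipeline’s code view) is below, expressed as a Python dict. Activity, dataset and linked-service names are placeholders for this walkthrough, including the "CallRestApi" name of the earlier REST step, and the source/sink property names follow the public Copy activity documentation for the REST connector and the ADLS Gen1 JSON sink:

pipeline = {
    "name": "CopyRestToAdls",
    "properties": {
        "activities": [
            {
                "name": "CopyFromRestApi",
                "type": "Copy",
                # Run only after the earlier REST call step succeeds.
                "dependsOn": [
                    {"activity": "CallRestApi",
                     "dependencyConditions": ["Succeeded"]}
                ],
                "inputs": [
                    {"referenceName": "RestResource1", "type": "DatasetReference"}
                ],
                "outputs": [
                    {"referenceName": "AdlsJsonOutput", "type": "DatasetReference"}
                ],
                "typeProperties": {
                    "source": {
                        "type": "RestSource",
                        "requestMethod": "POST",
                        "requestBody": "{ }",  # fill in the required body
                        "additionalHeaders": {
                            "Content-Type": "application/json",
                            "Ocp-Apim-Subscription-Key": "<key>",
                        },
                    },
                    "sink": {
                        "type": "JsonSink",
                        "storeSettings": {"type": "AzureDataLakeStoreWriteSettings"},
                    },
                    # Omitting dataIntegrationUnits lets ADF pick "Auto".
                },
            }
        ]
    },
}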
After entering all the details -> click on the Debug option.
Now we want to validate whether the files are being dropped to the target location or not.
Let’s open Azure Data Lake -> go to the folder.
If you want to see the content of the file, click on it; a new window will open showing the content.
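You can also verify the drop from code; a minimal sketch with the azure-datalake-store package, using placeholder folder and file names, is below:

from azure.datalake.store import core, lib

token = lib.auth(tenant_id="<tenant-id>",
                 client_id="<app-id>",
                 client_secret="<secret>")
adls = core.AzureDLFileSystem(token, store_name="mydatalakestore01")

print(adls.ls("/raw/projectA"))                          # confirm the file landed
with adls.open("/raw/projectA/output.json", "rb") as f:  # placeholder file name
    print(f.read(500).decode("utf-8"))                   # peek at the first bytes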