Big Data on Azure: How and Where to Start the Journey?
What differentiates today’s thriving organizations?
For any organization:
• Data is currency in the twenty-first century
• Companies that take advantage of data opportunities have the potential to outperform those that do not
• Data comes in all forms and sizes and is being generated faster than ever before
• Capture and combine it for new insights and better, faster decisions
Enormous amounts of data come from many different places, and it is a combination of many kinds of items:
- Social networking users
- Medical records
- Shoppers' details
- Crime statistics for cities
- Data coming from IoT devices
Introducing Big Data
Structured → Semi-structured → Unstructured
One of the main challenges is combining transactional data stored in relational databases with less structured data.
Get the right information to the right people at the right time in the right format.
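To make the challenge of combining relational and less structured data concrete, here is a minimal sketch in Python. It is an illustration only: the `orders` table, the event records, and the `combine` helper are hypothetical names invented for this example, and an in-memory SQLite database stands in for a real transactional store.

```python
import json
import sqlite3

# Hypothetical transactional data held in a relational store (in-memory SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "alice", 120.0), (2, "bob", 35.5), (3, "alice", 60.0)],
)

# Hypothetical semi-structured events (for example social or clickstream data).
# Note that the records do not share a fixed set of fields.
raw_events = """
{"customer": "alice", "channel": "twitter", "sentiment": "positive"}
{"customer": "bob", "channel": "web", "pages": ["home", "pricing"]}
{"customer": "carol", "channel": "web"}
"""
events = [json.loads(line) for line in raw_events.strip().splitlines()]


def combine(conn, events):
    """Attach total spend from the relational side to each event record."""
    totals = dict(
        conn.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer")
    )
    return [
        {**event, "total_spend": totals.get(event["customer"], 0.0)}
        for event in events
    ]


combined = combine(conn, events)
for row in combined:
    print(row)
```

The relational side is aggregated with SQL, while the event records keep whatever fields they arrived with; customers with no orders simply get a spend of zero.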
Big Data comes with five Vs:
- Variety
- Volume
- Veracity
- Velocity
- Value
Value is the most important of all: having access to Big Data is no good unless we can turn it into value.
This is where modern data architecture emerges, with the Big Data/Hadoop platform.
How do I start the architecture? We start with a canonical architecture platform, which suits any beginner.
This is where Hadoop comes into the picture.
- Apache Hadoop is open-source software for reliable, scalable, distributed computing on big data.
- It is a set of open-source projects that transform commodity hardware into a service that can:
- Store petabytes of data reliably
- Allow huge distributed computations
- Key components of Hadoop and its ecosystem:
- Hadoop Common: utilities that support the other modules
- HDFS (Hadoop Distributed File System): high-throughput access to data
- YARN: job scheduling and cluster resource management
- MapReduce: YARN-based system for parallel processing
- Spark: compute engine
- Pig: data-flow language and execution framework
- Oozie: workflow scheduler
- Ambari: provisioning, managing, and monitoring clusters
- Sqoop: bulk data transfer between Hadoop and relational databases
- Batch-processing centric, using the "MapReduce" processing paradigm
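The MapReduce paradigm mentioned above can be illustrated with the classic word-count example. The sketch below runs in a single Python process and only mimics the three phases (map, shuffle, reduce); it does not use Hadoop's own API, and the function names are chosen for this illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence over a list of text records."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["big data on azure", "big data with hadoop"])
print(counts)
```

On a real cluster the map and reduce calls run in parallel on many nodes and the shuffle moves data across the network, but the division of work is the same.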
While working with Hadoop, some of the most important considerations to look for:
Azure HDInsight:
- Azure HDInsight is Microsoft’s Hadoop-based service that enables big data solutions in the cloud
- A cloud implementation on Microsoft Azure of the rapidly expanding Apache Hadoop technology stack
- Based on the Hortonworks Data Platform (HDP), a popular solution for big data analysis
About Microsoft Azure HDInsight:
1) Microsoft’s managed Hadoop as a Service
2) 100% open source Apache Hadoop
3) Built on the latest releases across Hadoop (3.2.1)
a) YARN
b) Stinger Phase 2 (Faster queries)
4) Up and running in minutes with no hardware to deploy
5) Access Data with Pig and Hive
6) Utilize familiar BI tools for analysis including Microsoft Excel
7) Includes implementations of Apache Spark, HBase, Storm, Pig, Hive, Sqoop, Oozie, Ambari, and so on
8) Key attributes:
9) Integrates with Microsoft BI and scripting tools:
a) Power BI
b) Excel
c) SSAS
d) SSRS
e) PowerShell
Hortonworks Data Platform On Azure: HDP is the only completely open Hadoop data platform available. All solutions in HDP are developed as projects through the Apache Software Foundation (ASF). There are NO proprietary extensions in HDP.
Introducing the zoo: ZooKeeper is a centralized service that stores configuration, naming, and group-service information. Using this information, ZooKeeper keeps the Hadoop cluster operating as a single unit and is responsible for synchronizing Hadoop tasks.
HDInsight programming compatibility: Since HDInsight is a service-based implementation, you get immediate access to the tools you need to program against HDInsight/Hadoop:
1) Existing Ecosystem
a) Hive, Pig, Sqoop, Mahout, Cascading, Scalding, Scoobi, Pegasus, etc.
2) .NET
a) C#, F# Map/Reduce, LINQ to Hive, .NET management clients, etc.
3) JavaScript
a) JavaScript Map/Reduce, Browser-hosted Console, Node.js management clients
4) DevOps/IT Pros:
a) PowerShell, Cross-Platform CLI Tools
Microsoft Big Data Solution
Challenges of the modern data platform: inefficiencies from fragmented architecture.
• Disparate systems and processes
• Multiple tools and skillsets
• Siloed insights on disconnected data
• High cost of ownership
Some of the Big Data components on Azure are:
- Azure SQL
· On-premises and cloud-based
· Cloud-first but not cloud-only
· Use SQL Database to improve core SQL Server features and cadence
· Many interesting and compelling on-premises ↔ cloud scenarios
- Cortana Analytics Suite :
Cortana Analytics Suite delivers an end-to-end platform with an integrated and comprehensive set of tools and services to help you build intelligent applications that let you easily take advantage of advanced analytics.
First, Cortana Analytics Suite provides services to bring data in so that you can analyze it. It provides information management capabilities like Azure Data Factory, so that you can pull data from any source (a relational DB like SQL or a non-relational one like your Hadoop cluster) in an automated and scheduled way while performing the necessary data transforms (such as setting certain data columns as dates vs. currency). Think ETL (Extract, Transform, Load) in the cloud. Event Hubs does the same for IoT-type ingestion of data that streams in from lots of endpoints.
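As a rough illustration of the ETL idea, the following Python sketch extracts text rows, transforms date and currency columns into typed values, and loads the result into a table. It is not Azure Data Factory code: the CSV sample, the `sales` table, and the function names are assumptions made for this example, and SQLite stands in for the target store.

```python
import csv
import io
import sqlite3
from datetime import datetime
from decimal import Decimal

# Hypothetical source extract: every column arrives as text.
SOURCE_CSV = """order_date,amount,region
2015-07-01,"$1,200.50",EMEA
2015-07-02,$89.99,APAC
"""


def extract(text):
    """Extract: read raw rows from the source as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: parse date columns as dates and currency columns as decimals."""
    cleaned = []
    for row in rows:
        cleaned.append(
            {
                "order_date": datetime.strptime(row["order_date"], "%Y-%m-%d").date(),
                "amount": Decimal(row["amount"].replace("$", "").replace(",", "")),
                "region": row["region"],
            }
        )
    return cleaned


def load(rows, conn):
    """Load: write the typed rows into the target table."""
    conn.execute("CREATE TABLE sales (order_date TEXT, amount TEXT, region TEXT)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [(r["order_date"].isoformat(), str(r["amount"]), r["region"]) for r in rows],
    )


conn = sqlite3.connect(":memory:")
typed_rows = transform(extract(SOURCE_CSV))
load(typed_rows, conn)
print(conn.execute("SELECT * FROM sales").fetchall())
```

A managed service adds scheduling, monitoring, and connectors on top of this basic shape, but each pipeline still boils down to these three steps.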
The data brought in can then be persisted in flexible big data storage services like Data Lake and Azure SQL DW.
You can then use a wide range of analytics services, from Azure ML to Azure HDInsight to Azure Stream Analytics, to analyze the data stored in big data storage. This means you can create analytics services and models specific to your business need (say, real-time demand forecasting).
The resulting analytics services and models can then be surfaced as interactive dashboards and visualizations via Power BI.
These same analytics services and models can also be integrated into various UIs (web apps, mobile apps, or rich client apps) as well as with Cortana, so end users can interact with them naturally via speech and so on. End users can also be proactively notified by Cortana if the analytics model finds a new anomaly (for example, unusual growth in certain product purchases in the real-time demand forecasting example above) or anything else that deserves the attention of business users.
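To give a feel for what "finding an anomaly" in purchase data might mean, here is a deliberately simple sketch that flags days whose sales deviate strongly from the preceding week. It is an assumption-laden illustration, not how Azure ML or Cortana Analytics detect anomalies: the function name, the seven-day window, the threshold of three standard deviations, and the sales figures are all made up for this example.

```python
from statistics import mean, stdev


def find_anomalies(daily_units, window=7, threshold=3.0):
    """Flag days whose sales deviate strongly from the trailing window.

    Returns a list of (day_index, units) for points more than `threshold`
    standard deviations away from the mean of the previous `window` days.
    """
    anomalies = []
    for i in range(window, len(daily_units)):
        history = daily_units[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # A flat history has zero spread, so no z-score can be computed.
        if sigma > 0 and abs(daily_units[i] - mu) / sigma > threshold:
            anomalies.append((i, daily_units[i]))
    return anomalies


# Hypothetical daily unit sales for one product; day 10 shows a sudden jump.
sales = [100, 102, 98, 101, 99, 103, 100, 97, 102, 101, 180, 104]
print(find_anomalies(sales))
```

A production model would account for seasonality and trend, but the notification step is the same: when a point is flagged, someone is alerted.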
- Azure Data Factory: A managed cloud service for building and operating data pipelines; part of the Cortana Analytics Suite
- DocumentDB → Schema-free DB:
A NoSQL document database-as-a-service, fully managed by Microsoft Azure.
Designed for cloud apps where querying over schema-free data, reliable and predictable performance, and rapid development are key. It is the first database service of its kind to offer native support for JavaScript, SQL queries, and transactions over schema-free JSON documents.
Perfect for cloud architects and developers who need an enterprise-ready NoSQL document database.
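The idea of querying schema-free JSON documents can be sketched in plain Python. This is not the DocumentDB API or its SQL dialect; `get_path`, `query`, and the sample documents are hypothetical and exist only to show that documents with different shapes can live in one collection and still be filtered by field.

```python
import json

# Hypothetical JSON documents in one collection; no two share the same schema.
documents = [
    json.loads(s)
    for s in (
        '{"id": "1", "type": "book", "title": "Hadoop Basics", "price": 30}',
        '{"id": "2", "type": "sensor", "reading": {"temp": 21.5, "unit": "C"}}',
        '{"id": "3", "type": "book", "title": "Azure Notes", "tags": ["cloud"]}',
    )
]


def get_path(doc, path, default=None):
    """Follow a dotted path such as 'reading.temp' through nested objects."""
    current = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def query(docs, **conditions):
    """Return documents whose fields equal the given values.

    Use double underscores for nested fields, e.g. reading__unit="C".
    Documents that lack a field simply do not match.
    """
    missing = object()
    return [
        doc
        for doc in docs
        if all(
            get_path(doc, field.replace("__", "."), missing) == value
            for field, value in conditions.items()
        )
    ]


books = query(documents, type="book")
celsius = query(documents, reading__unit="C")
print([d["id"] for d in books], [d["id"] for d in celsius])
```

No schema had to be declared up front, and a missing field is treated as a non-match rather than an error.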
- SQL Server and Azure DocumentDB
Capability: Greatly enhances developer productivity
Benefits: Native JSON support added to the core database engine supports schema-free data. Tackle more diverse data types right in SQL Server, as well as in DocumentDB.
- Non-relational and NoSQL:
NoSQL’s popularity originally started because of limitations of relational stores. Many of these limitations have since been addressed, but there are still a lot of compelling reasons to choose a NoSQL store.
PolyBase: Provides a scalable, T-SQL-compatible query processing framework for combining data from both universes (relational and Hadoop). It is a component of the PDW Region in APS.
Highly parallelized distributed query engine accessing heterogeneous data via SQL
Seamless Integration
Unique Innovative Technology
PolyBase Architecture
PolyBase is agnostic = no vendor lock-in
PolyBase supports Hadoop on Linux & Windows
PolyBase integrates with the cloud
PolyBase supports HDInsight in APS & external Hadoop clusters
PolyBase builds the bridge
1) Just-in-Time data integration
a) Across relational and non-relational data
b) High performance parallel architecture
c) Fast, simple data loading
2) Best of both worlds
a) Uses computational power at source for both relational data & Hadoop
b) Opportunity for new types of analysis
3) Uses existing analytical skills
a) Familiar SQL semantics & behaviour
4) Query with familiar tools
a) SSDT
Dynamic Data Masking: Dynamic data masking helps prevent unauthorized access to sensitive data by enabling customers to designate how much of the sensitive data to reveal with minimal impact on the application layer. It’s a policy-based security feature that hides the sensitive data in the result set of a query over designated database fields, while the data in the database is not changed.
On-the-fly obfuscation of data in query results
Policy-driven on the table and column
Multiple masking functions available for various sensitive data categories
Flexibility to define a set of privileged logins for unmasked data access
By default, the database owner is unmasked
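The behaviour described above (masking applied to query results by column policy, with privileged logins exempt) can be sketched in a few lines of Python. This is a conceptual illustration, not SQL Server's implementation or syntax: the masking functions, the `MASKING_POLICY` mapping, the `PRIVILEGED_LOGINS` set, and the sample row are all hypothetical.

```python
def mask_email(value):
    """Keep the first character and the domain suffix, e.g. a***@***.com."""
    local, _, domain = value.partition("@")
    suffix = domain.rsplit(".", 1)[-1]
    return f"{local[:1]}***@***.{suffix}"


def mask_partial(value, visible=4):
    """Reveal only the last `visible` characters of a string."""
    keep = min(visible, len(value))
    return "X" * (len(value) - keep) + value[len(value) - keep:]


# Hypothetical policy: column name -> masking function.
MASKING_POLICY = {"email": mask_email, "card_number": mask_partial}

# Hypothetical set of logins allowed to see unmasked data.
PRIVILEGED_LOGINS = {"db_owner"}


def apply_masking(rows, login):
    """Mask designated columns in a result set; stored rows are not modified."""
    if login in PRIVILEGED_LOGINS:
        return [dict(row) for row in rows]
    return [
        {
            col: MASKING_POLICY[col](val) if col in MASKING_POLICY else val
            for col, val in row.items()
        }
        for row in rows
    ]


table = [{"name": "Ann", "email": "ann@contoso.com", "card_number": "4111111111111111"}]
print(apply_masking(table, login="report_user"))
print(apply_masking(table, login="db_owner"))
```

The stored row in `table` is never changed; only the copy returned to a non-privileged login is obfuscated, which mirrors the "on-the-fly" nature of the feature.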
Benefits of dynamic data masking
Finally, what is R?
R is a language and environment for statistical computing and graphics. … R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible
Why is R famous? Among other things, for its range of plot types:
- Box plot
- Bar plot
- Histogram
- Contour
- Dot plot
- Mosaic
- Scatter
- Latticist
Last but not least: a standard approach to learning R
Final wrap-up → Summary
1) Big Data refers to data sets so large and/or complex that they become awkward to work with in conventional ways
2) Hadoop and HDInsight = Microsoft’s answer to Big Data
3) Hadoop can store petabytes of data reliably and execute huge distributed computations
a) However — Big Data query results often involve significant latency
4) Power BI includes authoring add-ins to query, analyze and visualize data sourced from Azure HDInsight
a) Preload data in advance of business user queries
5) Big Data is just another data source!