Big Data
Variety, Volume, Velocity (and sometimes Volatility)
A Big Data project is one that has to handle data with one or more of the following characteristics:
Variety – many different types of data to be processed: maybe some from a relational database, some from a NoSQL database and some from “things” on “the Internet of Things” (sensors, cellphones, Raspberry Pi devices, smart speakers such as Amazon Echo/Alexa, cars, etc.)
Volume – huge quantities of data: terabytes, or even petabytes, to be processed
Velocity – data that arrives at high speed (real-time or near real-time) or changes a lot in real time
Big Data Problems
There are some problems that are specific to Big Data projects.
Data Collection
Data Collection is an area that usually has to be planned and designed carefully for Big Data projects. It may be important not to miss a single piece of data (for example, in an Internet of Things project, missing one piece of data may result in a “thing” not changing status when it should), or it may be acceptable to miss some in a real-time streaming environment (for example, missing some sales of stocks or shares might not affect the overall volume of transactions or the overall price of the share/stock).
Data Collection usually involves collecting data from “things” or external APIs, either directly into a simple message queue (Google Cloud Pub/Sub, AWS SQS, Azure Queues, or an AMQP queue on-premises using RabbitMQ or similar), into a data stream of some kind (for example, Apache Flink, Confluent Kafka or MSK Kafka, Striim, AWS Kinesis Data Streams/Kinesis Firehose, or Azure Event Hubs), or into an IoT Hub.
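As a rough illustration of the collection step, a reading can be serialised and published onto one of these queues/streams with just a few lines of code. A minimal sketch using Google Cloud Pub/Sub follows (the project, topic and payload fields are invented; SQS, Azure Queues and Kafka have very similar publish calls):

    # A minimal sketch of the "collection" step: pushing a sensor reading into a
    # message queue (Google Cloud Pub/Sub here; SQS, Azure Queues or Kafka would
    # look similar). The project id, topic name and payload fields are made up.
    import json
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "sensor-readings")  # hypothetical names

    reading = {"device_id": "thermometer-42", "temperature_c": 21.7, "ts": "2024-01-01T12:00:00Z"}

    # Pub/Sub messages are raw bytes; JSON-encode the reading before publishing.
    future = publisher.publish(topic_path, data=json.dumps(reading).encode("utf-8"))
    print("Published message id:", future.result())  # blocks until the publish is acknowledged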
IoT Hubs (Azure IoT Hub, AWS IoT Core, GCP Cloud IoT Core) were created in all three main clouds to manage the statuses of “things” when messages are lost due to devices going offline. For example, when a cellphone is switched to airplane mode or loses connectivity to a cell tower, it is important that when the cellphone comes back online it still gets any messages sent to it while it was offline. Similarly, Alexa/Echo devices and Raspberry Pi devices/sensors (thermometers, light sensors, wind sensors, etc.) should receive any messages missed while the device was turned off or not connected to WiFi.
Event Hubs in all three clouds handle the situation where real-time data streams (stocks/shares or weather data streams or other real-time data streams) experience “gaps” in data.
Azure Event Grid (push model – it pushes data to subscribers) is a highly scalable publish/subscribe service, a bit like Google Pub/Sub.
Azure Event Grid supports MQTT (a protocol RabbitMQ also speaks) and integrates with various Azure (and non-Azure) services, including Azure Functions, Azure Logic Apps, Azure Event Hubs, Azure Service Bus, and Azure Notification Hubs. It is a bit like AWS EventBridge – typically a bridge between different services inside the cloud. Low cost, scalable, serverless, with AT LEAST ONCE delivery.
There are two options: Azure Event Grid, a fully managed PaaS service on Azure, and Event Grid on Kubernetes with Azure Arc (on-premises/multi-cloud).
Azure Event Hubs lets subscribers pull. It is for Big Data streaming and decoupling of layers – huge volumes of data (millions of events per second) where missing a few events is acceptable. Event stream data: capture, retention, replay/telemetry of huge amounts of data. FAST, with AT LEAST ONCE delivery.
Azure Service Bus is a fully managed enterprise message broker with message queues and publish-subscribe topics for enterprise applications that require transactions, ordering, duplicate detection, and instantaneous consistency. High-value messages that cannot be lost or arrive out of order? Use Service Bus. Reliable, with optional ordering/FIFO and optional dead-letter queues – most like SQS/RabbitMQ.
Compare Event Tools
In short: Event Grid for discrete events pushed to subscribers (reactive, serverless integration); Event Hubs for high-throughput event streaming that consumers pull from; Service Bus for high-value enterprise messaging with ordering, transactions and dead-lettering.
For example, there might be a rule that if no data comes in for up to 10 seconds for a particular share or stock, that is acceptable, and will make no difference to later calculations – such as average stock price, stock sales volumes etc. However, if data is missing for more than 10 seconds, there may need to be a process set up for fixing the data later by rerunning the streams.
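A minimal sketch of such a rule (not taken from any particular product – the 10-second threshold and data shapes are assumptions) might look like this in Python:

    # Flag a stock symbol for later reprocessing if no tick arrives within the
    # allowed gap. Timestamps are assumed to be epoch seconds.
    ALLOWED_GAP_SECONDS = 10

    def find_gaps(ticks_by_symbol: dict[str, list[float]]) -> dict[str, list[tuple[float, float]]]:
        """Return, per symbol, the (start, end) pairs where the gap exceeded the limit."""
        gaps = {}
        for symbol, timestamps in ticks_by_symbol.items():
            ordered = sorted(timestamps)
            bad = [
                (earlier, later)
                for earlier, later in zip(ordered, ordered[1:])
                if later - earlier > ALLOWED_GAP_SECONDS
            ]
            if bad:
                gaps[symbol] = bad  # these windows would be re-run from the stream later
        return gaps

    print(find_gaps({"ACME": [0, 4, 9, 30]}))  # {'ACME': [(9, 30)]}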
Data Ingestion
Ingestion of Big Data usually involves the use of a Data Pipeline tool – AWS Data Pipeline, Azure Data Factory or Google Cloud Dataflow, for example – or Talend or Informatica for on-premises data.
A Data Pipeline typically consists of three steps or more. Standard three step pipelines are often known as ETL or ELT (Extract Transform Load or Extract Load Transform). Data is extracted from the “raw” source (and perhaps cleaned up) and then either processed and transformed into a different format before loading the data into a Data Warehouse or is loaded into the Data Warehouse first and then is processed and transformed into a different format within the Data Warehouse.
Whether a Data Pipeline process consists of three steps or five (or more), there is usually some form of data lake for storing the data at each step. In some Data Pipelines there will be a “Raw” data lake, a “Clean” data lake, a “Final” (“Trusted”) data lake, and a “Pre-load” data lake, for example. Usually data is transformed between each stage using either Spark/PySpark (parallel batch processing for speed), Pig/Hive (SQL-like), or a tool like Talend/Informatica/Azure Data Factory/Databricks which uses Spark or PySpark “under the hood”.
An example of what might be done in each stage in a Data Pipeline:
Raw to Clean – remove nulls, remove “bad” data, deduplication, remove data with missing relationships (foreign keys).
Clean to Final – consolidate/join data, create data relationships/views in such a way that it can be ingested into the Data Warehouse and used by BI tools like PowerBi, Tableau, Qlik etc.
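As a hedged sketch of what a Raw-to-Clean step like the one above might look like in PySpark (the paths and column names are invented for illustration; the Clean-to-Final step would follow the same pattern with joins and aggregations):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("raw-to-clean").getOrCreate()

    raw = spark.read.json("s3://my-datalake/raw/orders/")           # hypothetical raw zone
    customers = spark.read.parquet("s3://my-datalake/raw/customers/")

    clean = (
        raw.dropna(subset=["order_id", "customer_id", "amount"])    # remove nulls
           .filter(raw.amount > 0)                                  # remove "bad" data
           .dropDuplicates(["order_id"])                            # deduplication
           # drop rows whose customer_id has no matching customer ("missing foreign key")
           .join(customers.select("customer_id"), on="customer_id", how="inner")
    )

    clean.write.mode("overwrite").parquet("s3://my-datalake/clean/orders/")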
Storing and Management
Storing and management of huge amounts of data is another problem unique to Big Data projects. Blob storage (AWS S3, Azure Blobs or Data Lakes (ADLS2), and Google Cloud Storage) and Data Lakes are the main solutions to this problem because they make it cheap and easy to store huge amounts of structured/unstructured data in a manageable way. They also offer “lifecycle” management so that files can be stored more cheaply if they are used less frequently.
Processing Data at High Speeds
Huge amounts of data mean huge amounts of processing. The solution to Big Data processing at high speeds started with two distributed file system solutions: Google started to solve these issues with GFS (Google File System), and the Apache Hadoop open-source project started to solve them with HDFS and MapReduce. Distributed processing evolved as Hadoop evolved, from the original components of Hadoop (HDFS, MapReduce and YARN) to a host of components and options that can sit on top of Hadoop. Most cloud big data solutions sit on top of Hadoop components and have evolved Hadoop in some way.
Big Data Open Source Options
HADOOP
Apache Hadoop is a suite of tools designed to make it easier to distribute data processing over multiple nodes (servers). It is also open source which makes it very popular with many software engineers and architects because it is free to use and transparent (engineers can see the code that makes it work) and very extensible.
The main technologies which make up Hadoop are YARN, HDFS and Map Reduce. But many other technologies sit on top of Hadoop such as Hive, Pig, Spark/PySpark etc.
YARN
YARN, or Yet Another Resource Negotiator, is a core component of Apache Hadoop. YARN is a large-scale distributed operating system for big data applications. It manages resources and schedules jobs in Hadoop. YARN allows you to use various data processing engines to process data stored in HDFS (Hadoop Distributed File System). These engines include batch processing, stream processing, interactive processing, and graph processing. YARN increases the efficiency of the system.
YARN was introduced in Hadoop 2.0 to remove a bottleneck on Job Tracker. YARN is responsible for managing resources amongst applications in the cluster. It allocates system resources to the various applications running in a Hadoop cluster and schedules tasks to be executed on different cluster nodes.
HDFS
HDFS – the Hadoop Distributed File System – is an Apache file system that enables workloads to be distributed across a number of nodes/servers. The primary components are known as the Name Node and the Data Nodes. File metadata is stored on the Name Node – metadata such as file size, directory location and file block information.
Map Reduce
Map Reduce works with Hadoop and consists of a framework of APIs and Services to allow for development using the “Map Reduce programming model”, which is, effectively, a way to distribute work across many nodes using functional programming. An example is below:
For example, counting how many lines are in a file (something like Python’s readlines() followed by a count):
HDFS splits the file up into smaller parts (blocks) if it is a big file.
MapReduce lets each node count only the lines in the block(s) stored on that node.
So all nodes are counting only a few lines each.
If 14 nodes do it all at the same time, the response comes back roughly 14x faster.
Then the reduce() part takes the results from all 14 nodes and produces the total.
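The same line-count idea can be expressed in a few lines of PySpark, which uses the map/reduce model under the hood – each partition (block) is counted on its own node and the partial counts are reduced into a total. The file path here is hypothetical:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("line-count").getOrCreate()
    lines = spark.sparkContext.textFile("hdfs:///data/big_log_file.txt")

    # map: every line contributes a 1 (done in parallel on each node's blocks)
    # reduce: the 1s from all nodes are summed into the final total
    total_lines = lines.map(lambda _: 1).reduce(lambda a, b: a + b)
    print("Total lines:", total_lines)   # equivalent to lines.count()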
Hive SQL
Hive SQL, in simple terms, takes unstructured data and presents it as tables with rows and columns to make it easier to use SQL queries on that data. It provides DDL for creating tables and lets you use SQL to query that data – behind the scenes, Hive converts the queries into the back-end MapReduce work.
Hive was once based upon Map reduce (APIs and Services) for parallel processing.
Hive was slower than RDBMS queries because it had to convert to Map reduce.
MapReduce was only available in Java at first. However, to make it faster and able to work well with data lakes, Hive can now bypass MapReduce and go straight to HDFS. Hive also ran in containers managed by YARN but now has the option to use other container platforms. YARN forced the use of its containers on top of Hadoop, but the industry wanted more flexible container management, so Mesos, Docker and Kubernetes are now options on top.
Spark SQL
Apache Spark came along to improve on Hadoop by allowing multiple languages (previously MapReduce development was largely limited to Java – Spark allows Scala, Python, R or Java to be used on top of Hadoop), multiple container options, storage in data lakes, etc.
Apache Spark works with Hadoop Data Lakes (unstructured data) with or without the Hadoop layer.
Without the Hadoop layer, Apache Spark enables the Data Lakehouse architecture.
Spark can be 10 – 100 x faster than Hadoop in this architecture.
It is easy to develop using SQL queries/engine and API with Spark.
There are multiple storage options – including HDFS or Cloud Storage (EMR in AWS provides either option).
Map Reduce and Hive are less popular now that Spark SQL is available.
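A small sketch of how Spark SQL is typically used over data lake files – register the files as a view and then query them with plain SQL (the path and column names are illustrative only):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

    sales = spark.read.parquet("s3://my-datalake/final/sales/")
    sales.createOrReplaceTempView("sales")

    top_products = spark.sql("""
        SELECT product_id, SUM(amount) AS total_amount
        FROM sales
        GROUP BY product_id
        ORDER BY total_amount DESC
        LIMIT 10
    """)
    top_products.show()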
Presto
Presto is an open-source SQL query engine that runs on Hadoop. It was originally developed at Facebook to run interactive queries on their large data warehouse in Apache Hadoop. Presto is designed for fast, interactive queries on data in HDFS and other data sources. It can be used for interactive and batch workloads, and can scale from a few to thousands of users.
Presto is complementary to Hadoop because it doesn't have its own storage system. Organizations often use both Presto and Hadoop to solve business challenges.
Presto's architecture consists of one coordinator node that works with multiple worker nodes. It can connect to multiple data sources and allow you to query them at the same time. This is known as "federated queries".
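A hedged sketch of a federated query from Python using the trino client (Presto has an equivalent DBAPI client); the host, catalogs, schemas and table names are all assumptions for illustration:

    import trino

    conn = trino.dbapi.connect(host="presto.example.internal", port=8080, user="analyst")
    cur = conn.cursor()

    # Join data living in Hive/HDFS with data living in a MySQL catalog in one query.
    cur.execute("""
        SELECT o.order_id, o.amount, c.segment
        FROM hive.sales.orders AS o
        JOIN mysql.crm.customers AS c ON o.customer_id = c.customer_id
        LIMIT 10
    """)
    for row in cur.fetchall():
        print(row)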
Presto is best suited to interactive queries over moderate amounts of data (for example, e-commerce analytics); it is often better to use Hive when generating very large batch reports.
Other Open Tools on Top of Hadoop
Management options available include Yarn, Zookeeper, Mesos, Kubernetes and Docker.
Other tools that sit on top of Hadoop include Apache Airflow (a Data Pipeline tool), Apache Beam (for data workflows) and Apache Flink (data streaming in a distributed, open, reliable manner) etc.
Apache Iceberg is a high-performance table format for huge analytic tables – it brings ACID properties to big data in data lakes, enables the use of SQL on that data, and can be used from Spark, Trino, Flink, Presto, Hive, Impala, StarRocks, Doris, and Pig.
Apache Beam (which can run on top of Flink, among other runners) is the programming model behind Dataflow in GCP.
Apache Airflow is for orchestrating workflows in general (including processing steps), not just data pipelines.
DataBricks – Open, based upon Spark
Databricks is a unified platform for building, deploying, and maintaining data, analytics, and AI solutions. Databricks works on all major cloud platforms. On Azure, Microsoft Fabric is a newer all-in-one analytics solution for enterprises that covers much of the same ground.
Databricks marries Azure (or another cloud platform) with Spark and adds ACID guarantees and data governance to data held in data lakes.
Many data lakes are built today using Azure Databricks as a general-purpose data and analytics processing engine. The data itself is physically stored in data lakes - example ADLS Gen2 on Azure, or S3 on AWS, or Google Cloud Storage on GCP, but is transformed and cleaned using Azure Databricks.
Databricks also offers a faster price/performance data warehouse and a set of tools to make it easier to gain insights from data and to create and share analytics, reports, etc.
DataBricks simplifies the discovery of insights within the data and can be used with BI tools like Tableau and PowerBI, Looker, Qlik.
Authentication is simple – for example, in Azure, using Azure Active Directory.
Databricks comes with a native SQL editor with SQL syntax for exploring schemas, and it also offers snippets and notebooks to aid developers writing code and SQL.
It provides visualizations and dashboards within Databricks too. It is possible to share Databricks dashboards with stakeholders, configure dashboards, and optimize cluster configuration for the best price and performance for the queries being run; it is also possible to use Databricks for ML and Data Science.
Databricks is managed rather than serverless – you set up clusters and determine autoscaling, driver and worker instance details, and so on.
Note that Spark can run on top of HDFS, Mesos and DBFS (the Databricks File System).
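A minimal Databricks-style sketch of the pattern described above – read raw files from the data lake over an abfss:// path and write them back as a Delta table, which is what gives the lake its ACID behaviour (the storage account, container and table names are invented):

    # `spark` is provided by the Databricks runtime; no session setup is needed there.
    df = (spark.read
              .option("header", "true")
              .csv("abfss://raw@mydatalake.dfs.core.windows.net/sales/2024/"))

    (df.write
       .format("delta")
       .mode("overwrite")
       .saveAsTable("sales_raw"))          # registered in the metastore / Unity Catalog

    # Downstream users can now query it with SQL or BI tools:
    spark.sql("SELECT COUNT(*) FROM sales_raw").show()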
Big Data in Azure
Here is a typical Big Data architecture in Azure.
Azure Data Factory (Data Pipeline)
Azure Data Factory provides Data Pipelines for Azure. Typical Raw-to-Clean-to-Final data lake flows and ETL/ELT are provided out of the box. It is serverless and scalable, with connectors for bringing in data in all formats – structured and unstructured, from devices, from IoT hubs, and from databases (SQL or NoSQL). Code-free data flows are available, with Apache Spark under the hood for code generation.
It is a bit like Talend or Informatica, and also like an advanced SSIS (SQL Server Integration Services – cleaning, aggregating, merging, importing/exporting data, etc.).
In short, a no-code solution for transforming data.
Azure Data Lake Storage
ADLS2 is an evolution of Azure Blobs intended to make it easier to perform analytics on big data workloads; Azure Blobs is still better for storing files and unstructured data. In a typical architecture, Blobs might store the “raw” data and ADLS2 might store the data in the other stages of a Data Pipeline. ADLS2, however, still does not provide ACID (Atomicity, Consistency, Isolation, Durability) data governance without Databricks.
Azure Synapse Analytics (Data Warehouse)
Synapse is really a combination of a data warehouse (with SQL queries across metrics and dimensions), speedy big data processing with Apache Spark technologies, and Azure Data explorer for log and time series data analytics.
Azure Synapse Analytics is an enterprise analytics service that accelerates time to insight across data warehouses and big data systems. It brings together the best of the SQL technologies used in enterprise data warehousing, Apache Spark technologies for big data, and Azure Data Explorer for log and time series analytics. It favors ELT rather than ETL – data is loaded first, then transformed.
Synapse allows you to collect, transform, and analyze data from just one platform.
Azure Synapse utilizes a three-component architecture: data storage, processing, and visualization in a single, unified platform. Databricks, on the other hand, utilizes a lakehouse architecture that combines the best data warehouse and data lake features into one continuous platform and can support unstructured data.
Synapse serverless SQL pools analyze unstructured and semi-structured data in Azure Data Lake Storage by using standard T-SQL.
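As a hedged example, a serverless SQL pool query over Parquet files in the lake can be driven from Python via pyodbc using T-SQL’s OPENROWSET (the workspace endpoint, storage account and path below are placeholders):

    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=myworkspace-ondemand.sql.azuresynapse.net;"   # hypothetical workspace endpoint
        "DATABASE=master;Authentication=ActiveDirectoryInteractive;"
    )

    query = """
    SELECT TOP 10 *
    FROM OPENROWSET(
        BULK 'https://mydatalake.dfs.core.windows.net/final/sales/*.parquet',
        FORMAT = 'PARQUET'
    ) AS sales
    """

    for row in conn.cursor().execute(query):
        print(row)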

Azure Data Lake
Raw -> Enriched -> Curated
Raw -> Clean -> Final (Consumer ready data)
Azure Synapse Link for Azure Cosmos DB and Azure Synapse Link for Dataverse enable you to run near real-time analytics over operational and business application data, by using the analytics engines that are available from your Azure Synapse workspace: SQL Serverless and Spark Pools.
When using Azure Synapse Link for Azure Cosmos DB, use either a SQL Serverless query or a Spark Pool notebook. You can access the Azure Cosmos DB analytical store and then combine datasets from your near real-time operational data with data from your data lake or from your data warehouse.
The resulting datasets from your SQL Serverless queries can be persisted in your data lake. If you are using Spark notebooks, the resulting datasets can be persisted either in your data lake or data warehouse (SQL pool).
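A hedged sketch of that pattern from a Synapse Spark notebook – read the Cosmos DB analytical store through Synapse Link and combine it with data already in the lake (the linked service, container, path and table names are placeholders):

    # `spark` is pre-created in a Synapse Spark notebook session.
    orders_live = (spark.read
        .format("cosmos.olap")                                  # Synapse Link analytical store
        .option("spark.synapse.linkedService", "CosmosDbLink")  # hypothetical linked service
        .option("spark.cosmos.container", "orders")
        .load())

    orders_history = spark.read.parquet("abfss://final@mydatalake.dfs.core.windows.net/orders/")

    combined = orders_live.unionByName(orders_history, allowMissingColumns=True)
    combined.write.mode("overwrite").saveAsTable("orders_near_real_time")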
Azure CosmosDB
Cosmos DB comes in a number of flavors, as it supports flexible schemas and hierarchical data. In one “mode” it can mimic a MongoDB (NoSQL) document database, but it can also host transactional/operational data queried with a SQL-like syntax, and this is often what it is used for in Big Data scenarios.
PowerBI
Microsoft Power BI is a business intelligence tool that helps organizations analyze and visualize data to make data-driven decisions. Power BI is a collection of software services, apps, and connectors that can turn data from a variety of sources into static and interactive visualizations. Power BI can connect disparate data sets, transform and clean the data, and create charts or graphs. These visualizations can be shared with other Power BI users within the organization.
Power BI can read data directly from a database, webpage, PDF, or structured files such as spreadsheets, CSV, XML, and JSON. Power BI can also be combined with Azure analytics services to analyze petabytes of data.
Power BI is part of the Microsoft Power Platform. The full Power BI suite includes Power BI Desktop, the Power BI web service, and Power BI Mobile.
Microsoft Fabric (sort of a Lakehouse) – in preview only at the time of writing; Microsoft-specific, not open
Just announced recently, covers data science, real time analytics, business intelligence. Includes management of data lake, data engineering, and data integration, all in one place. Brings Power BI, Azure Synapse, and Azure Data Factory into a single integrated environment. Based upon OneLake, which unifies all your data lakes and connects the data without the need to copy, using shortcuts.
Azure Synapse is the Data Warehouse.
Synapse Data Warehouse is a fully managed SaaS solution. It separates storage and compute – storage can be on ADLS2 (or even Blobs) or in Data Warehouse tables; the compute engine might be Data Factory, Data Engineering, Data Science, Analytics, etc.
Azure Synapse Analytics expands this to allow for data pipelines, Power BI (data analytics), and data science with Spark/Python/SQL. This is evolving towards Data Lakehouse/Data Fabric/Data Mesh.
DATA LAKE HOUSE
Microsoft OneLake is a Data Lakehouse on top of Blobs or ADLS2 (the data lake). ADLS2 supports Parquet or Delta Parquet. It is similar to BigLake in Google Cloud and Redshift Spectrum in AWS.
Parquet is a columnar storage format known for its high compression and query performance.
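A tiny illustration of Parquet’s columnar nature using pandas/pyarrow – write a table to Parquet, then read back only the columns needed, which is a large part of why Parquet queries are fast (the file and column names are arbitrary):

    import pandas as pd

    df = pd.DataFrame({
        "order_id": [1, 2, 3],
        "product": ["widget", "gadget", "widget"],
        "amount": [9.99, 24.50, 9.99],
    })

    df.to_parquet("orders.parquet")                          # requires pyarrow or fastparquet
    amounts = pd.read_parquet("orders.parquet", columns=["product", "amount"])
    print(amounts)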
Delta Parquet essentially adds versioning and a transaction log to Parquet, making it easier to do updates, deletes and merges. Delta format files provide transactional capabilities and ACID compliance. Delta Parquet is essentially the Databricks format.
DELTA (Python, Scala and Java APIs; for versioning and updates/deletes/merges with ACID)
Delta stores the data as Parquet, with an additional layer over it providing advanced features: a history of events (the transaction log) and more flexibility for changing the content, such as update, delete and merge capabilities. The link below explains well how the files are organized. If there are lots of updates the files can get fragmented, so you have to be careful.
https://databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html
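A hedged sketch of the update/merge capabilities Delta adds on top of Parquet, using the delta-spark package (the table path and column names are invented, and a Delta-enabled Spark session – Databricks or delta-spark configured – is assumed):

    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delta-demo").getOrCreate()

    orders = DeltaTable.forPath(spark, "s3://my-datalake/delta/orders")

    # Upsert ("merge") new rows into the existing Delta table.
    updates = spark.read.parquet("s3://my-datalake/clean/orders_today/")
    (orders.alias("t")
        .merge(updates.alias("s"), "t.order_id = s.order_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

    # The transaction log also enables "time travel" back to earlier versions.
    yesterday = spark.read.format("delta").option("versionAsOf", 1).load("s3://my-datalake/delta/orders")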
Delta and Databricks versions may be a little incompatible in places.
Databricks can sit on top of Delta, but with some limitations.

ML using Databricks is a good idea.
Azure Machine Learning Data Studio
An SDK exists, along with TensorFlow and NVIDIA plugins. Notebooks – Jupyter (Fabric has other tooling).
Create models with notebooks or with Azure ML (which under the hood is largely Spark-based).
MLOps – DevOps practices for machine learning: monitor performance, detect data drift, trigger retraining of models.
ML Experiments
Data governance - Purview
Manage and govern data on-premises, multi-cloud and SaaS – automated data discovery, sensitive data classification, and a map of the data estate (where the data is).
Delta-Parquet
Problems with old paradigm
- Difficult to do updates/merges/deletes
- Difficult to track history
- ACID issues
Databricks and Fabric provide answers to that. Databricks is more open, Fabric is specific to Microsoft.
Notes
Fabric is more of a data mesh plus tools. It is larger in scope, and it builds on the same Delta Lake/Spark foundations that Databricks uses.
Fabric brings together experiences such as Data Engineering, Data Factory, Data Science, Data Warehouse, Real-Time Analytics, and Power BI onto a shared SaaS foundation, all seamlessly integrated into a single service. Microsoft Fabric comes with OneLake, an open & governed, unified SaaS data lake that serves as a single place to store organizational data.
Azure Service Fabric is for microservices and can sit alongside Microsoft Fabric for the data layer.
Databricks is more widely adopted by companies, particularly for Data Science.
Databricks works with OneLake.
Power BI in Direct Lake mode gives very fast access to Data Lake data (ADLS2). OneLake supports the Parquet format (a columnar format with column statistics and summary metadata held in the file footer for easier retrieval).

OneLake
OneLake is used for a unified view of all data across the company/organizations - think of it like “Onedrive for data” that everyone in the organization can access. One copy/version of data - single source of truth across many views. Simplified governance and security because all data is in one place.
You get one lake for the organization, just as in OneDrive – one OneLake for the whole tenant/organization.
Databricks and Fabric can both use OneLake. Databricks is a Lakehouse plus tools, and Fabric is a Lakehouse plus tools.

Microsoft Fabric
A Fabric Lakehouse is composed of two main folders – Files and Tables. You can also have shortcuts to data held in other places. KQL databases appear as shortcuts under the Tables section.
OneLake looks like a set of workspaces (like folders) in Fabric – it appears like a folder in the file explorer consisting of many folders/workspaces, which makes the data much more accessible and approachable. A workspace can be used to collaborate between teams on data; a workspace can be a whole data warehouse, a lakehouse, or just simpler files. You can drill into your data as though it were folders and files (even though underneath they are really tables). Both Tables and Files can be stored – files can be any format: JPEGs, GIFs, PNGs, JSON, JS files, etc. Everything is also stored as Parquet files, so you can take the URLs from there – for example an ABFSS URL (for Databricks). Shortcuts to external data make it look as though the data is in OneLake when it is really just a pointer, but you can use it like any other Parquet file.
Fabric Notebooks have IntelliSense, and you can also move to Visual Studio Code to edit.
You can drag and drop to insert snippets/images.
You can drag/drop snippets for things like reading Spark data as a dataframe, or drawing a chart with Matplotlib.
Typical notebook commands are display(dataframe/table) or dir(fileslist).
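A hedged example of the kind of cell you might run in a Fabric notebook – read a file from the Lakehouse Files area into a Spark dataframe, display it, and save it as a table (the relative path and table name are placeholders, and a default lakehouse is assumed to be attached):

    # `spark` is pre-created in the notebook session.
    df = (spark.read
            .option("header", "true")
            .csv("Files/sales/2024/january.csv"))

    display(df)                           # notebook-provided rich table/chart rendering
    df.write.mode("overwrite").saveAsTable("sales_january")   # lands under Tables as Delta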
Delta tries to add ACID/governance to unstructured data. Delta Lake sits on top of data lakes and below Databricks or the Fabric Lakehouse.
Pandas uses dataframes (table-like, in-memory structures for working with data).
OneLake is in Delta/Parquet format. Delta extends parquet format to add ACIDity to the file storage including logs and governance.
OneLake supports the same APIs as ADLS Gen2, enabling users to read, write, and manage their data in OneLake. It is not entirely one-to-one though – OneLake also enforces a set folder structure to support Fabric workspaces and items.
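A hedged sketch of using the ADLS Gen2 SDK against OneLake – the onelake.dfs.fabric.microsoft.com endpoint is the documented OneLake DFS endpoint, while the workspace and lakehouse names are placeholders:

    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(
        account_url="https://onelake.dfs.fabric.microsoft.com",
        credential=DefaultAzureCredential(),
    )

    # In OneLake the "file system" is the Fabric workspace, and items sit inside it.
    workspace = service.get_file_system_client("MyWorkspace")
    for path in workspace.get_paths(path="MyLakehouse.Lakehouse/Files", recursive=False):
        print(path.name)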

Google Cloud Big Data Architectures
Dataflow can detect sensitive data using Sensitive Data Protection; PII is tagged in the Data Catalog.
It uses Data Loss Prevention and sends findings to the Data Catalog and to Google Cloud Storage.
Avro, Parquet and Json are the three main formats for big data.
Google Cloud’s Data Lakehouse (like Fabric in Azure, or Redshift Spectrum in AWS) uses Dataplex for lake administration. BigLake is a bit like OneLake in Azure – it extends the data warehouse so that it can use the data lake in Cloud Storage, etc. Dataplex identifies “entities” in the data lakes and makes them available via BigQuery as though they were tables. Under the hood it can use Apache Iceberg.

Dataproc is Google Cloud’s managed Spark/Hadoop service (with a serverless option). It has an optional Apache Flink add-on; Flink is for data streaming.
Dataflow – data pipelines and real-time streaming options based on Apache Beam (which can also run on top of Flink). For batch or streaming, it executes logic at each component of the pipeline. Multi-language like Spark – Java, Python and Go SDKs, plus SQL.
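A minimal Apache Beam pipeline of the kind Dataflow runs – read text, transform, count and write. It runs locally on the DirectRunner by default, or on GCP by switching to the DataflowRunner (the bucket paths are placeholders):

    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Read"   >> beam.io.ReadFromText("gs://my-bucket/raw/events.txt")
            | "Parse"  >> beam.Map(lambda line: line.split(",")[0])   # keep the event type
            | "Count"  >> beam.combiners.Count.PerElement()           # (event_type, count)
            | "Format" >> beam.MapTuple(lambda kind, n: f"{kind},{n}")
            | "Write"  >> beam.io.WriteToText("gs://my-bucket/output/event_counts")
        )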

Authorized Views give access to the views/query results without giving access to the underlying data – to specific groups of people.
BigLake puts a Data Lakehouse architecture on top of Google Cloud Storage so that it is a little more ACID and accessible by the data warehouse.
Sample Architectures
Here are some sample architectures I have worked with during my career in the Cloud.
IoT Architectures – Azure IoT Hub, Event Hubs, Azure Data Factory as the Data Pipeline, Blobs and Data Lake for the data lakes, Synapse as the Data Warehouse, Power BI and various ML tools using the data in the Data Warehouse.
IoT Architectures – AWS IoT Core, Kafka and Kinesis Data Streams for incoming data, S3 for the data lakes, AWS Data Pipeline as the Data Pipeline, EMR with Hadoop/Hive/Pig/Spark, Redshift as the Data Warehouse, QuickSight using the data in the Data Warehouse.
Major International organization, used Google Cloud, Domain driven design, Informatica for Data Pipeline, BigQuery for both Data Lakes and Data Warehouse.
Multi Cloud Architecture – used AWS Well Architected Framework with Gruntworks, Active Directory/SSO with IQ and IG, Control Tower, Multi Account architecture, assumed roles, Security Policies. In each account, used VPCs, subnets, Elastic load balancers, AWS Route 53, subnets (private and public), NATs, EC2 instances of different types, kubernetes (EKS), AWS Lambda, Aurora DBs, DynamoDB, DataDog, Github, Jenkins, CloudWatch, CloudTrail.
-- used Azure Terraform template architecture, Azure AD/SSO, Virtual machines, VNets, Subnets, Load Balancers, SQL Server, Azure Synapse, Azure Devops, Azure Functions, Azure AppService, Azure FrontDoor, Kubernetes (AKS), Azure Monitoring
Financial Services – private cloud via Pivotal Cloud Foundry (PCF) – with HDFS, Hadoop, Hive, Pig, Spark/PySpark, Oracle and Teradata, and Watson and Einstein doing ML.
Snowflake
Snowflake is a fast, elastic, secure (data encrypted at rest and in transit, access controls) data warehousing product offering per-second pay-as-you-go pricing and massively parallel processing, and it is a multi-cloud product. Resiliency and data recovery are made easy. It uses columnar storage like most data warehouses.
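A hedged sketch of querying Snowflake from Python with the official connector (the account, credentials, warehouse and table names are placeholders):

    import snowflake.connector

    conn = snowflake.connector.connect(
        account="myorg-myaccount",
        user="ANALYST",
        password="********",
        warehouse="ANALYTICS_WH",
        database="SALES_DB",
        schema="PUBLIC",
    )

    cur = conn.cursor()
    cur.execute("SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY 2 DESC")
    for region, total in cur.fetchall():
        print(region, total)
    cur.close()
    conn.close()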
Problems with the Data Lakehouse
Separating storage (data lake) and compute (warehouse engine) can have problems/limitations:
- Complexity of managing separate services in the decoupled architecture
- Multiple separate bills make it hard to know the real cost
- Version history and time travel can be disrupted by changes to the external storage
- Governance is difficult to enforce when data can be accessed or changed at the storage layer
- Performance is highly variable, depending on the query type and the engine that’s processing it
Snowflake claims to have been built as an integrated lakehouse from the start, so that these problems do not arise:
A flexible platform like Snowflake allows you to use traditional business intelligence tools and newer, more advanced technologies devoted to artificial intelligence, machine learning, data science, and other forward-looking data analytic activities. It combines data warehouses, subject-specific data marts, and data lakes into a single source of truth that powers multiple types of workloads.
Data Mesh
A data mesh is a domain-oriented, self-service design that represents a new way of organizing data teams. It helps solve the challenges that often come with quickly scaling a centralized data approach relying on a data lake or data warehouse. Consumption, storage, transformation, and output of data are all decentralized, with each domain data team handling its own specific data.