
Databricks

From Banane Atomic

Links

Description

Databricks combines a Data Lakehouse with Generative AI into a Data Intelligence Platform.
Generative AI lets users query data in natural language and helps optimize storage and costs based on past usage.

Components

Delta Lake

The data lakehouse storage:

  • ACID transactions
  • Scalable data and metadata handling
  • Audit history and time travel (querying previous versions of the data)
  • Schema enforcement and evolution
  • Streaming and batch data processing
  • Uses Delta Tables (an enhanced version of Apache Parquet files, a columnar storage file format optimized for efficient storage and retrieval of large-scale datasets); see the sketch below
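
A minimal sketch of these features in a Databricks notebook with PySpark (table, column names and data are illustrative): write a DataFrame as a Delta table, modify it, then use time travel to read the earlier version.

# write a small DataFrame as a Delta table (names and data are illustrative)
events = spark.createDataFrame([(1, "created"), (2, "deleted")], ["id", "action"])
events.write.format("delta").mode("overwrite").saveAsTable("demo_events")  # version 0

# modify the table, which creates a new version
spark.sql("DELETE FROM demo_events WHERE action = 'deleted'")               # version 1

# time travel: query the previous version of the table
previous = spark.sql("SELECT * FROM demo_events VERSION AS OF 0")
display(previous)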

Unity Catalog

The data governance module:

  • data federation: unified view of data from multiple sources
  • handle access permissions to data
  • AI-driven monitoring and reporting
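
As a sketch, access permissions can be managed with SQL GRANT statements; the catalog, schema and table names reuse those from the Code section below, and the group name is illustrative.

# grant a group read-only access to a table (group name is illustrative)
spark.sql("GRANT USE CATALOG ON CATALOG catalog1 TO `data-analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA catalog1.schema1 TO `data-analysts`")
spark.sql("GRANT SELECT ON TABLE catalog1.schema1.table1 TO `data-analysts`")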

Photon

Query engine used to improve query performance, reduce costs, and optimize resource utilization.

Databricks SQL

Data warehouse component:

  • text-to-SQL queries
  • auto-scaling for better performance and cost efficiency
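
A minimal sketch of querying a SQL warehouse from Python with the databricks-sql-connector package; the hostname, HTTP path and access token are placeholders.

from databricks import sql

# connect to a SQL warehouse (connection details are placeholders)
connection = sql.connect(
    server_hostname="<workspace-host>.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
)

with connection.cursor() as cursor:
    cursor.execute("SELECT * FROM catalog1.schema1.table1 LIMIT 10")
    for row in cursor.fetchall():
        print(row)

connection.close()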

Workflows

Orchestration:

  • intelligent pipeline triggering (schedule, file arrival, Delta table update)
  • automatic resource allocation
  • automatic checkpointing and recovery (in the event of a failure, the pipeline recovers from the last checkpoint)
  • automatic monitoring and alerts (errors, timeouts)
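
A minimal sketch of creating a scheduled job with the Databricks SDK for Python (databricks-sdk); the job name, notebook path, cluster id and cron expression are placeholders.

from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # credentials are read from the environment or ~/.databrickscfg

# a job with a single notebook task, scheduled every day at 06:00 (placeholders)
job = w.jobs.create(
    name="daily-etl",
    tasks=[
        jobs.Task(
            task_key="ingest",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/etl/ingest"),
            existing_cluster_id="<cluster-id>",
        )
    ],
    schedule=jobs.CronSchedule(quartz_cron_expression="0 0 6 * * ?", timezone_id="UTC"),
)
print(job.job_id)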

Delta Live Tables

ETL & Real-time Analytics
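
A minimal sketch of a Delta Live Tables pipeline defined in Python; the source path under the volume and the column names are illustrative.

import dlt
from pyspark.sql.functions import col

# bronze: ingest raw JSON files from a Unity Catalog volume (path is illustrative)
@dlt.table(comment="Raw events loaded from the volume")
def raw_events():
    return spark.read.format("json").load("/Volumes/catalog1/schema1/volume1/events/")

# silver: keep only rows with a valid id
@dlt.table(comment="Cleaned events")
def clean_events():
    return dlt.read("raw_events").where(col("id").isNotNull())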

Databricks AI

Data Science and AI

Menu

Entry | Description
Workspace | Store code, files and projects.
  • Home: personal space
  • Workspace: shared space, repositories
Catalog | Manage and organize data assets: databases, tables, views and functions
Workflow | Manage jobs and pipelines
Compute | Manage clusters

Code

import os
import shutil
import requests
import pandas as pd
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.utils import AnalysisException

catalog_name = "catalog1"
schema_name = "schema1"
volume_name = "volume1"
table_name = "table1"

spark.sql(f"CREATE CATALOG IF NOT EXISTS `{catalog_name}`")  # create a catalog
spark.sql(f"USE CATALOG `{catalog_name}`")                   # set as the default catalog

spark.sql(f"CREATE SCHEMA IF NOT EXISTS `{schema_name}`")    # create a schema
spark.sql(f"USE SCHEMA `{schema_name}`")                     # set as the default schema

spark.sql(f"CREATE VOLUME IF NOT EXISTS `{volume_name}`")    # create a volume
volume_full_path = f"/Volumes/{catalog_name}/{schema_name}/{volume_name}"
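
# the read below assumes table1 already exists; one illustrative way to create it
# beforehand from a small sample DataFrame (sample data is hypothetical)
sample_schema = StructType([
    StructField("id", StringType()),
    StructField("name", StringType()),
])
sample_df = spark.createDataFrame([("1", "alice"), ("2", "bob")], sample_schema)
sample_df.write.mode("overwrite").saveAsTable(f"{catalog_name}.{schema_name}.{table_name}")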

target_table_full_path = f"`{catalog_name}`.`{schema_name}`.`{table_name}`"  # fully qualified name: catalog.schema.table
df = spark.table(target_table_full_path)  # load the table into a DataFrame
display(df)                               # render the DataFrame in the notebook

History

1980 - Data warehouse: collect and store structured data to support refined analysis and reporting.
2000 - Data lake: collect and store raw data and conduct exploratory analysis.
2021 - Data lakehouse: unified platform that combines the benefits of both data lakes and data warehouses.
Aspect | Data Warehouse | Data Lake | Data Lakehouse
Data Type | Structured, processed, and refined data | Raw data: structured, semi-structured, and unstructured | Combines raw and processed data
Schema | Schema-on-write: data is structured before storage | Schema-on-read: structure applied when accessed | Flexible: schema-on-read for raw data, schema-on-write for structured data
Purpose | Optimized for business intelligence (BI), reporting, and predefined analytics | Designed for big data analytics, machine learning, and exploratory analysis | Unified analytics platform for BI, AI/ML, streaming, and real-time analytics
Processing Approach | ETL: data is cleaned and transformed before storage | ELT: data is loaded first and transformed as needed | Both ETL and ELT; enables real-time processing
Scalability | Less scalable and more expensive to scale | Highly scalable and cost-effective for large volumes of diverse data | Combines the scalability of lakes with the performance optimization of warehouses
Users | Business analysts and decision-makers | Data scientists, engineers, and analysts | BI teams, data scientists, engineers
Accessibility | More rigid; changes to structure are complex | Flexible; easy to update and adapt | Highly adaptable; supports schema evolution
Security & Maturity | Mature security measures; better suited for sensitive data | Security measures still evolving; risk of a "data swamp" if not managed properly | Strong governance with ACID transactions; improved reliability
Use Cases | Operational reporting, dashboards, KPIs | Predictive analytics, AI/ML models, real-time analytics | Unified platform for BI dashboards, AI/ML workflows, streaming analytics