Links
Description
Databricks combines a Data Lakehouse with Generative AI into a Data Intelligence Platform.
Generative AI makes it possible to query data in natural language and to optimize storage and costs based on past usage patterns.
Data objects
Table
View
A temporary view is available only while the cluster is running; it is not stored in the schema.
SQL
create or replace temp view view1 as
select * from csv.`${csv_path}`;
Functions
Components
Delta Lake
The data lakehouse storage layer:
- ACID transactions
- Scalable data and metadata handling
- Audit history and time travel (querying previous versions of the data); see the sketch after this list
- Schema enforcement and evolution
- Streaming and batch data processing
- uses Delta tables, an enhanced version of Apache Parquet files (a columnar storage format optimized for efficient storage and retrieval of large-scale datasets)
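A minimal PySpark sketch of audit history and time travel, assuming a Delta table `catalog1`.`schema1`.`table1` (the name used later on this page) already exists with at least one earlier version:

# Audit history: list the versions and operations recorded for the table
history_df = spark.sql("DESCRIBE HISTORY `catalog1`.`schema1`.`table1`")
display(history_df)

# Time travel: query the table as it was at an earlier version (version 0 here)
old_df = spark.sql("SELECT * FROM `catalog1`.`schema1`.`table1` VERSION AS OF 0")
display(old_df)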
Unity Catalog
The data governance module:
- data federation: unified view of data from multiple sources
- access control: manage permissions on data objects (see the sketch after this list)
- AI-driven monitoring and reporting
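A minimal sketch of granting and inspecting permissions with Unity Catalog SQL, run through PySpark; the group name `analysts` is a hypothetical example, not something defined on this page:

# Grant read access on a table to an account group (hypothetical group name)
spark.sql("GRANT SELECT ON TABLE `catalog1`.`schema1`.`table1` TO `analysts`")

# List the privileges currently granted on the table
display(spark.sql("SHOW GRANTS ON TABLE `catalog1`.`schema1`.`table1`"))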
Photon
Query engine used to improve query performance, reduce costs, and optimize resource utilization.
Databricks SQL
Data warehouse component:
- text to SQL queries
- auto-scaling for better performance and cost
Workflows
Orchestration:
- intelligent pipeline triggering (scheduled, file arrival, Delta table update); see the sketch after this list
- automatic resource allocation
- automatic checkpointing and recovery (in the event of failure, the pipeline recovers from the last checkpoint)
- automatic monitoring and alerts (errors, timeouts)
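A rough sketch of creating a scheduled job through the Jobs REST API (2.1). The host, token, notebook path, cron expression, and e-mail address are placeholders, and compute settings are omitted (serverless compute is assumed):

import requests

host = "https://<workspace-host>"   # placeholder workspace URL
token = "<personal-access-token>"   # placeholder personal access token

job_payload = {
    "name": "daily_pipeline",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Workspace/pipelines/ingest"},
        }
    ],
    # scheduled trigger; file arrival and table update triggers use a separate trigger block
    "schedule": {"quartz_cron_expression": "0 0 6 * * ?", "timezone_id": "UTC"},
    # alerting on failure
    "email_notifications": {"on_failure": ["me@example.com"]},
}

response = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_payload,
)
print(response.json())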
Delta Live Tables
ETL & Real-time Analytics
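A minimal Delta Live Tables sketch in Python, assuming CSV files land in the volume path used elsewhere on this page; table and column names are illustrative:

import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw CSV files ingested as-is")
def bronze():
    return spark.read.csv("/Volumes/catalog1/schema1/volume1/", header=True)

@dlt.table(comment="Typed and filtered records")
def silver():
    return (
        dlt.read("bronze")  # read the bronze table defined above
        .withColumn("column1", F.col("column1").cast("float"))
        .filter(F.col("column1").isNotNull())
    )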
Databricks AI
Data Science and AI
Entry | Description
Workspace | Store code, files, and projects. Home: personal space; Workspace: shared space, repositories
Catalog | Manage and organize data assets: databases, tables, views, and functions
Workflow | Manage jobs and pipelines
Compute | Manage clusters
Code
import os
import shutil
import requests
import pandas as pd
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.utils import AnalysisException

# Unity Catalog object names used throughout the notebook
catalog_name = "catalog1"
schema_name = "schema1"
volume_name = "volume1"
table_name = "table1"

# Create the catalog, schema and volume if they do not exist, and set them as defaults
spark.sql(f"CREATE CATALOG IF NOT EXISTS `{catalog_name}`")
spark.sql(f"USE CATALOG `{catalog_name}`")
spark.sql(f"CREATE SCHEMA IF NOT EXISTS `{schema_name}`")
spark.sql(f"USE SCHEMA `{schema_name}`")
spark.sql(f"CREATE VOLUME IF NOT EXISTS `{volume_name}`")

# Volume path for raw files and fully qualified table name
volume_full_path = f"/Volumes/{catalog_name}/{schema_name}/{volume_name}"
target_table_full_path = f"`{catalog_name}`.`{schema_name}`.`{table_name}`"

# Read the target table (assumes it has already been created) and show it
df = spark.table(target_table_full_path)
display(df)
Data Analysis
catalog_name = "catalog1"
schema_name = "schema1"
bronze_table_name = "bronze"
silver_table_name = "silver"
gold_table_name = "gold"

# Expose the names as notebook widgets so the SQL cells below can reference them as parameters
dbutils.widgets.text("catalog_name", catalog_name)
dbutils.widgets.text("schema_name", schema_name)
dbutils.widgets.text("bronze_table", bronze_table_name)
dbutils.widgets.text("silver_table", silver_table_name)
dbutils.widgets.text("gold_table", gold_table_name)
SQL
use catalog identifier(:catalog_name);
use schema identifier(:schema_name);
select * from identifier(:bronze_table);
create or replace table identifier(:silver_table) as
select cast(column1 as float), cast(column2 as int)
from identifier(:bronze_table)
where try_cast(column1 as float) is not null
and try_cast(column2 as int) > 100
order by column1;
-- get the column names and types of a table
describe identifier(:silver_table);
The identifier() function ensures that the widget value is treated as a valid database object name.
dbutils.widgets.text("table_name", "table1")  # create a text widget with a default value
dbutils.widgets.remove("table_name")          # remove a single widget
dbutils.widgets.removeAll()                   # remove all widgets from the notebook
CSV
catalog_name = "catalog1"
schema_name = "schema1"
volume_name = "volume1"
volume_path = f"/Volumes/{catalog_name}/{schema_name}/{volume_name}"
csv_dataset_path = f"{volume_path}/data.csv"
SQL
select * from read_files(
  "${csv_dataset_path}",
  format => "csv",
  sep => ",",
  header => true,
  mode => "FAILFAST"
);
select * from csv.`${csv_dataset_path}` limit 10;
select * from csv.`/Volumes/catalog1/schema1/volume1/data.csv` limit 10;
History
1980 - Data warehouse | Collect and store structured data to support refined analysis and reporting
2000 - Data lake | Collect and store raw data and conduct exploratory analysis
2021 - Data lakehouse | Unified platform combining the benefits of data lakes and data warehouses
Aspect | Data Warehouse | Data Lake | Data Lakehouse
Data Type | Structured, processed, and refined data | Raw data: structured, semi-structured, and unstructured | Combines raw and processed data
Schema | Schema-on-write: data is structured before storage | Schema-on-read: structure applied when accessed | Flexible: schema-on-read for raw data; schema-on-write for structured data
Purpose | Optimized for business intelligence (BI), reporting, and predefined analytics | Designed for big data analytics, machine learning, and exploratory analysis | Unified analytics platform for BI, AI/ML, streaming, and real-time analytics
Processing Approach | ETL: data is cleaned and transformed before storage | ELT: data is loaded first and transformed as needed | Both ETL and ELT; enables real-time processing
Scalability | Less scalable and more expensive to scale | Highly scalable and cost-effective for large volumes of diverse data | Combines the scalability of lakes with the performance optimization of warehouses
Users | Business analysts and decision-makers | Data scientists, engineers, and analysts | BI teams, data scientists, engineers
Accessibility | More rigid; changes to structure are complex | Flexible; easy to update and adapt | Highly adaptable; supports schema evolution
Security & Maturity | Mature security measures; better suited for sensitive data | Security measures still evolving; risk of a "data swamp" if not managed properly | Strong governance with ACID transactions; improved reliability
Use Cases | Operational reporting, dashboards, KPIs | Predictive analytics, AI/ML models, real-time analytics | Unified platform for BI dashboards, AI/ML workflows, streaming analytics