
Databricks



Description

Databricks combines a Data Lakehouse with Generative AI into a Data Intelligence Platform.
Generative AI enables natural-language queries to fetch data and optimizes storage and costs based on previous usage patterns.

Data objects

Table

View

A temporary view is available while the cluster is running but it is not stored in the schema.

SQL:
create or replace temp view view1 as
select * from csv.`${csv_path}`;

Functions

Components

Delta Lake

The data lakehouse storage layer:

  • ACID transactions
  • Scalable data and metadata handling
  • Audit history and time travel (querying previous versions of the data)
  • Schema enforcement and evolution
  • Streaming and batch data processing
  • storage as Delta Tables, an enhanced version of Apache Parquet files (a columnar storage file format optimized for efficient storage and retrieval of large-scale datasets); see the sketch after this list
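A minimal sketch of a few of these features from a notebook cell, assuming a catalog and schema are already selected; the table name demo_delta is made up for the illustration. It writes a small Delta table twice, lists its version history, then uses time travel to read the first version.

from pyspark.sql import Row

# create a small Delta table (Delta is the default table format on Databricks)
spark.createDataFrame([Row(id=1, value="a"), Row(id=2, value="b")]) \
    .write.format("delta").mode("overwrite").saveAsTable("demo_delta")

# appending creates a new version of the table
spark.createDataFrame([Row(id=3, value="c")]) \
    .write.format("delta").mode("append").saveAsTable("demo_delta")

# audit history: list the versions and operations recorded for the table
display(spark.sql("DESCRIBE HISTORY demo_delta"))

# time travel: query the table as it was at version 0 (before the append)
display(spark.sql("SELECT * FROM demo_delta VERSION AS OF 0"))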

Unity Catalog

The data governance module:

  • data federation: unified view of data from multiple sources
  • handle access permissions to data (see the sketch after this list)
  • AI-driven monitoring and reporting
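A hedged sketch of permission handling through SQL from a notebook, reusing the catalog1/schema1/table1 names from the Code section below; the principal user@example.com is a made-up example and must exist in the workspace.

# grant read access on a table to a user (example principal)
spark.sql("GRANT SELECT ON TABLE catalog1.schema1.table1 TO `user@example.com`")

# the parent catalog and schema must also be usable for the table to be reachable
spark.sql("GRANT USE CATALOG ON CATALOG catalog1 TO `user@example.com`")
spark.sql("GRANT USE SCHEMA ON SCHEMA catalog1.schema1 TO `user@example.com`")

# inspect the permissions currently set on the table
display(spark.sql("SHOW GRANTS ON TABLE catalog1.schema1.table1"))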

Photon

Query engine used to improve query performance, reduce costs, and optimize resource utilization.

Databricks SQL

Data warehouse component:

  • text-to-SQL queries
  • auto-scaling for better performance and lower cost

Workflows

Orchestration:

  • intelligent pipeline triggering (scheduled, file arrival trigger, delta table update)
  • automatic resource allocation
  • automatic checkpoint and recovery (in the event of a failure, the pipeline recovers from the last checkpoint)
  • automatic monitoring and alerts (errors, timeouts); see the sketch after this list
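As a rough sketch, a job with a scheduled trigger can be created through the Jobs REST API using the requests library; the workspace URL, token, notebook path and cluster id below are placeholders, and the payload fields should be checked against the Jobs API reference.

import requests

host = "https://<workspace-url>"    # placeholder workspace URL
token = "<personal-access-token>"   # placeholder access token

# a minimal job definition: one notebook task, scheduled daily at 06:00 (Quartz cron)
job_spec = {
    "name": "daily-ingestion",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Workspace/Users/me/ingest"},
            "existing_cluster_id": "<cluster-id>",
        }
    ],
    "schedule": {"quartz_cron_expression": "0 0 6 * * ?", "timezone_id": "UTC"},
}

response = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
response.raise_for_status()
print(response.json())  # contains the new job_id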

Delta Live Tables

ETL & Real-time Analytics
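A minimal Delta Live Tables sketch, assuming it runs inside a DLT pipeline (the dlt module is only available there); the table names reuse the bronze/silver naming from the Data Analysis section and the volume path from the CSV section below.

import dlt
from pyspark.sql import functions as F

# bronze: raw CSV files landed in a volume
@dlt.table(comment="Raw CSV data")
def bronze():
    return spark.read.format("csv").option("header", "true").load("/Volumes/catalog1/schema1/volume1/")

# silver: cleaned and typed data derived from bronze
@dlt.table(comment="Cleaned data")
def silver():
    return (
        dlt.read("bronze")
        .withColumn("column1", F.col("column1").cast("float"))
        .filter(F.col("column1").isNotNull())
    )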

Databricks AI

Data Science and AI

Menu

  • Workspace: store code, files and projects
      • Home: personal space
      • Workspace: shared space, repositories
  • Catalog: manage and organize data assets: databases, tables, views and functions
  • Workflows: manage jobs and pipelines
  • Compute: manage clusters

Code

import os
import shutil
import requests
import pandas as pd
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.utils import AnalysisException

catalog_name = "catalog1"
schema_name = "schema1"
volume_name = "volume1"
table_name = "table1"

spark.sql(f"CREATE CATALOG IF NOT EXISTS `{catalog_name}`")  # create a catalog
spark.sql(f"USE CATALOG `{catalog_name}`")                   # set as the default catalog

spark.sql(f"CREATE SCHEMA IF NOT EXISTS `{schema_name}`")    # create a schema
spark.sql(f"USE SCHEMA `{schema_name}`")                     # set as the default schema

spark.sql(f"CREATE VOLUME IF NOT EXISTS `{volume_name}`")    # create a volume
volume_full_path = f"/Volumes/{catalog_name}/{schema_name}/{volume_name}"

target_table_full_path = f"`{catalog_name}`.`{schema_name}`.`{table_name}`"  # fully qualified table name
df = spark.table(target_table_full_path)                     # load the table as a dataframe
display(df)                                                  # render the dataframe in the notebook

Data Analysis

# variables
catalog_name = "catalog1"
schema_name = "schema1"
bronze_table_name = "bronze" # raw data
silver_table_name = "silver" # processed and cleaned data
gold_table_name = "gold" # aggregated data with business insights

# widgets
dbutils.widgets.text("catalog_name", catalog_name)
dbutils.widgets.text("schema_name", schema_name)
dbutils.widgets.text("bronze_table", bronze_table_name)
dbutils.widgets.text("silver_table", silver_table_name)
dbutils.widgets.text("gold_table", gold_table_name)
SQL:
use catalog identifier(:catalog_name);
use schema identifier(:schema_name);

select * from identifier(:bronze_table);

create or replace table identifier(:silver_table) as
select cast(column1 as float), cast(column2 as int)
from identifier(:bronze_table)
where try_cast(column1 as float) is not null
and try_cast(column2 as int) > 100
order by column1;

-- get the column names and types of a table
describe identifier(:silver_table);

Using Widgets for SQL Parameterization

The identifier() function ensures that the widget value is treated as a valid database object name.

# create a widget
dbutils.widgets.text("table_name", "table1")

dbutils.widgets.remove("table_name") # remove a widget
dbutils.widgets.removeAll()          # remove all widgets
SQL:
select * from identifier(:table_name);

CSV

catalog_name = "catalog1"
schema_name = "schema1"
volume_name = "volume1"
volume_path = f"/Volumes/{catalog_name}/{schema_name}/{volume_name}"
csv_dataset_path = f"{volume_path}/data.csv"
SQL:
select * from read_csv(
  "${csv_dataset_path}",
  sep => ",",
  header => true,
  mode => "FAILFAST"
);

select * from csv.`${csv_dataset_path}` limit 10;
select * from csv.`/Volumes/catalog1/schema1/volume1/data.csv` limit 10;

History

1980 - Data warehouse: collect and store structured data to support refined analysis and reporting.
2000 - Data lake: collect and store raw data and conduct exploratory analysis.
2021 - Data lakehouse: unified platform that combines the benefits of both data lakes and data warehouses.
Aspect | Data Warehouse | Data Lake | Data Lakehouse
Data Type | Structured, processed, and refined data | Raw data: structured, semi-structured, and unstructured | Combines raw and processed data
Schema | Schema-on-write: data is structured before storage | Schema-on-read: structure applied when accessed | Flexible: schema-on-read for raw data, schema-on-write for structured data
Purpose | Optimized for business intelligence (BI), reporting, and predefined analytics | Designed for big data analytics, machine learning, and exploratory analysis | Unified analytics platform for BI, AI/ML, streaming, and real-time analytics
Processing Approach | ETL: data is cleaned and transformed before storage | ELT: data is loaded first and transformed as needed | Both ETL and ELT; enables real-time processing
Scalability | Less scalable and more expensive to scale | Highly scalable and cost-effective for large volumes of diverse data | Combines the scalability of lakes with the performance optimization of warehouses
Users | Business analysts and decision-makers | Data scientists, engineers, and analysts | BI teams, data scientists, engineers
Accessibility | More rigid; changes to structure are complex | Flexible; easy to update and adapt | Highly adaptable; supports schema evolution
Security & Maturity | Mature security measures; better suited for sensitive data | Security measures evolving; risk of "data swamp" if not managed properly | Strong governance with ACID transactions; improved reliability
Use Cases | Operational reporting, dashboards, KPIs | Predictive analytics, AI/ML models, real-time analytics | Unified platform for BI dashboards, AI/ML workflows, streaming analytics