Databricks
Evolution
| Year | Technology | Description |
|---|---|---|
| 1980 | Data warehouse | Collect and store structured data to support refined analysis and reporting. |
| 2000 | Data lake | Collect and store raw data and conduct exploratory analysis. |
| 2021 | Data lakehouse | Unified platform that combines the benefits of data lakes and data warehouses. |
| Aspect | Data Warehouse | Data Lake | Data Lakehouse |
|---|---|---|---|
| Data Type | Structured, processed, and refined data | Raw data: structured, semi-structured, and unstructured | Combines raw and processed data |
| Schema | Schema-on-write: data is structured before storage | Schema-on-read: structure applied when accessed | Flexible: schema-on-read for raw data; schema-on-write for structured data (see the sketch after this table) |
| Purpose | Optimized for business intelligence (BI), reporting, and predefined analytics | Designed for big data analytics, machine learning, and exploratory analysis | Unified analytics platform for BI, AI/ML, streaming, and real-time analytics |
| Processing Approach | ETL: data is cleaned and transformed before storage | ELT: data is loaded first and transformed as needed | Both ETL and ELT; enables real-time processing |
| Scalability | Less scalable and more expensive to scale | Highly scalable and cost-effective for large volumes of diverse data | Combines the scalability of lakes with the performance optimization of warehouses |
| Users | Business analysts and decision-makers | Data scientists, engineers, and analysts | BI teams, data scientists, engineers |
| Accessibility | More rigid; changes to structure are complex | Flexible; easy to update and adapt | Highly adaptable; supports schema evolution |
| Security & Maturity | Mature security measures; better suited for sensitive data | Security measures evolving; risk of a "data swamp" if not managed properly | Strong governance with ACID transactions; improved reliability |
| Use Cases | Operational reporting, dashboards, KPIs | Predictive analytics, AI/ML models, real-time analytics | Unified platform for BI dashboards, AI/ML workflows, streaming analytics |
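
The schema-on-write versus schema-on-read distinction in the table above can be made concrete with a short PySpark example. This is only a minimal sketch, assuming a Databricks-style Spark environment; the table name, file path, and column names are illustrative assumptions rather than anything defined in this article.

```python
# Minimal sketch: schema-on-write vs. schema-on-read in PySpark.
# Assumes a Spark environment (e.g. Databricks); names and paths are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Schema-on-write (warehouse style): the structure is declared up front and
# enforced before the data is stored.
orders_schema = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("customer_id", StringType(), nullable=False),
    StructField("amount", DoubleType(), nullable=True),
])
orders = spark.createDataFrame([("o-1", "c-42", 19.99)], schema=orders_schema)
orders.write.mode("overwrite").saveAsTable("sales_orders")  # hypothetical table name

# Schema-on-read (lake style): raw files are stored as-is and the structure is
# only inferred when the data is read back.
raw_events = spark.read.json("/mnt/raw/events/")  # hypothetical landing zone
raw_events.printSchema()
```
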
Data warehouse 1980
Collect and store structured data to support refined analysis and reporting, following an ETL (extract, transform, load) approach; a minimal sketch follows the table below.
| Pros | Cons |
|---|---|
| Business intelligence | Struggles with upticks in data volume and velocity |
| Analytics | Long processing times |
| Structured data | No support for semi-structured or unstructured data |
| Predefined schemas | Inflexible schemas |
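
The ETL pattern that warehouses rely on (clean and transform first, then load) can be sketched in PySpark roughly as follows; the staging path, column names, and target table are hypothetical and only serve to illustrate the ordering of the steps.

```python
# Minimal ETL sketch (warehouse pattern): transform before loading.
# Assumes a Spark environment; the source file, columns, and target table are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

raw = spark.read.csv("/mnt/staging/transactions.csv", header=True, inferSchema=True)

cleaned = (
    raw.dropna(subset=["transaction_id"])               # drop incomplete rows
       .withColumn("amount", F.col("amount").cast("double"))
       .filter(F.col("amount") > 0)                      # keep only valid amounts
)

# Load only the cleaned, structured result into the curated table.
cleaned.write.mode("append").saveAsTable("finance_transactions")
```
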
Data lake 2000
Collect and store raw data and conduct exploratory analysis, typically following an ELT (extract, load, transform) approach; a minimal sketch follows the table below.
| Pros | Cons |
|---|---|
| Flexible data storage | Poor data reliability |
| Streaming support | No transactional support |
| Fast and cost-efficient storage in the cloud | Slow analysis performance |
| Support for AI and ML | Data governance concerns (security, privacy) |
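
By contrast, the lake's ELT pattern lands raw data first and applies structure only when the data is queried. The sketch below makes the same assumptions as above (a Spark environment, with illustrative paths and field names).

```python
# Minimal ELT sketch (lake pattern): load raw data first, transform on demand.
# Assumes a Spark environment; paths and field names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load: land semi-structured events in the lake without reshaping them.
raw = spark.read.json("/mnt/ingest/clickstream/")
raw.write.mode("append").parquet("/mnt/lake/clickstream/")

# Transform at query time: structure is applied only when the data is read.
events = spark.read.parquet("/mnt/lake/clickstream/")
events.createOrReplaceTempView("clickstream")
daily_clicks = spark.sql("""
    SELECT to_date(event_time) AS day, count(*) AS clicks
    FROM clickstream
    GROUP BY to_date(event_time)
""")
daily_clicks.show()
```
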
Data lakehouse 2021
Unified platform that combines the benefits of data lakes and data warehouses (a minimal Delta Lake sketch follows the table below).
| Pros | Cons |
|---|---|
| Combines the scalability of data lakes with the performance of data warehouses | Newer approach; ecosystem and best practices are still maturing |
| ACID transactions and strong data governance | Relies on open table formats (such as Delta Lake) and careful data management |
| Flexible schemas with support for schema evolution | |
| Unified platform for BI, AI/ML, streaming, and real-time analytics | |
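
On Databricks, the lakehouse properties listed above (ACID transactions, schema evolution) come from storing tables in the open Delta Lake format. The following is a minimal sketch assuming a Delta-enabled Spark environment; the storage path and columns are illustrative.

```python
# Minimal lakehouse sketch using the Delta Lake format, which provides
# ACID transactions and schema evolution on top of cloud object storage.
# Assumes a Delta-enabled Spark environment (e.g. Databricks); names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame([("o-1", 19.99)], ["order_id", "amount"])
orders.write.format("delta").mode("overwrite").save("/mnt/lakehouse/orders")

# Schema evolution: the new `currency` column is merged into the table
# instead of failing the append.
orders_v2 = spark.createDataFrame([("o-2", 5.0, "EUR")], ["order_id", "amount", "currency"])
(orders_v2.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/mnt/lakehouse/orders"))
```

Because every write goes through the Delta transaction log, concurrent readers see a consistent snapshot of the table, which is what the "strong governance with ACID transactions" entry in the comparison table refers to.
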