Data Engineering
Data Architecture
- 10 characteristics of a modern data architecture
- Modern data architecture schema
- Data pipelines
- Data Engineering blog posts by Robin Linacre
- Databases and Data Modelling: a quick crash course
- System Design: ElasticSearch
- Architecture based on feedback loops
Basics
- Python for Data Engineers
- JSON Lines
- Modern Data Engineering
- Introduction to Streaming Frameworks
- Data Engineering Design Patterns
- Airbyte Data Glossary
Data Engineering Vault
- Data Engineering Vault
- The Data Engineering Toolkit: Essential Tools for Your Machine
- BI-as-Code and the New Era of GenBI
- Modern Data Stack: The Struggle of Enterprise Adoption
- Data Lake and Lakehouse Guide: Powered by Data Lake Table Formats (Delta Lake, Iceberg, Hudi)
- Data Modeling – The Unsung Hero of Data Engineering: An Introduction to Data Modeling (Part 1)
- Data Engineering Design Patterns (DEDP) book
- The Rise of the Declarative Data Stack
- The Rise of the Semantic Layer
Data catalog
Data Structures
Database
- TimescaleDB
- LanceDB: Developer-friendly, serverless vector database for AI applications
- sqlite-vec: a vector search SQLite extension that runs anywhere!
- OpenDAL
- JameSQL: an in-memory NoSQL database implemented in Python
- icechunk: open-source, cloud-native transactional tensor storage engine
- xarray: N-D labeled arrays and datasets in Python
- zarr-python: an implementation of chunked, compressed, N-dimensional arrays for Python
- drawdb: free, simple, and intuitive online database diagram editor and SQL generator
DuckDB
- DuckDB and MotherDuck serverless analytics platform
- DuckDB: open source OLAP database
- QuackOSM: an open-source Python and CLI tool for reading OpenStreetMap PBF files using DuckDB
- DuckDB Doesn’t Need Data To Be a Database
- mosaic: an extensible framework for linking databases and interactive views
- Graph components with DuckDB
- Friendly SQL in DuckDB
- DuckDB blog: Friendly Lists and Their Buddies, the Lambdas
- DuckDB Tricks
- Moving from Pandas to DuckDB
- DuckERD: a CLI tool for generating ERD diagrams from DuckDB databases
- DuckDB in Python in the Browser with Pyodide, PyScript and JupyterLite
- A Beginner's Guide to DuckDB's Python Client
- Ducklake: A journey to integrate DuckDB with Unity Catalog
- 15+ companies using DuckDB in production: a comprehensive guide
- Mastering DuckDB when you're used to pandas or Polars
- Instant SQL is here: speedrun ad-hoc queries as you type
ACID
- deltabase: a lightweight, comprehensive solution for managing delta tables built on polars and deltalake
- strava-datastack: a modern Strava data pipeline fueled by dlt, duckdb, dbt, and evidence.dev
Apache Iceberg
- Apache Iceberg O'Reilly Training
- AWS Apache Iceberg technical guide
- Apache Polaris: the interoperable, open source catalog for Apache Iceberg
- 4 hours learning Apache Iceberg
- 7 hours learning Apache Iceberg
Apache DataFusion
Monitoring
OS
REST API
Tools
- Luigi
- d6t
- Metaflow
- Airflow
- Dagster
- GPU
- Observer pattern vs Pub Sub pattern
- Prefect
- Ray
- Splink: probabilistic data linkage at scale
- Haystack: an end-to-end framework for production-ready search pipelines
- filequery: query CSV and Parquet files using SQL
- dbt (data build tool): a command-line tool to transform data more effectively
- dlt: an open-source library to load data from various and often messy data sources into well-structured, live datasets
- Trilogy Python Semantic Layers
- yato: yet another transformation orchestrator
- xorq: deferred computational framework for multi-engine pipelines