Data Engineering
Data Architecture
- 10 characteristics of a modern data architecture
- Modern data architecture schema
- Data pipelines
- Data Engineering blog posts by Robin Linacre
- Databases and Data Modelling: a quick crash course
- System Design: Elasticsearch
- Architecture based on feedback loops
Basics
- Python for Data Engineers
- JSON Lines
- Modern Data Engineering
- Introduction to Streaming Frameworks
- Data Engineering Design Patterns
- Airbyte Data Glossary
Database
- TimescaleDB
- LanceDB: Developer-friendly, serverless vector database for AI applications
- sqlite-vec: a vector search SQLite extension that runs anywhere!
DuckDB
- DuckDB and the MotherDuck serverless analytics platform
- DuckDB: open source OLAP database
- QuackOSM: an open-source Python and CLI tool for reading OpenStreetMap PBF files using DuckDB
- DuckDB Doesn’t Need Data To Be a Database
- mosaic: an extensible framework for linking databases and interactive views
- Graph components with DuckDB
- Friendly SQL in DuckDB
- DuckDB blog: Friendly Lists and Their Buddies, the Lambdas
- DuckDB Tricks
- Moving from Pandas to DuckDB
- DuckERD: a CLI tool for generating ERD diagrams from DuckDB databases
- DuckDB in Python in the Browser with Pyodide, PyScript and JupyterLite
- A Beginner's Guide to DuckDB's Python Client
- Ducklake: A journey to integrate DuckDB with Unity Catalog
- 15+ companies using DuckDB in production: a comprehensive guide
ACID
Apache Iceberg
- Apache Iceberg O'Reilly Training
- AWS Apache Iceberg technical guide
- Apache Polaris: the interoperable, open source catalog for Apache Iceberg
- 4 hours learning Apache Iceberg
- 7 hours learning Apache Iceberg
Apache DataFusion
Monitoring
OS
REST API
Tools
- Luigi
- d6t
- Metaflow
- Airflow
- Dagster
- GPU
- Observer pattern vs Pub Sub pattern
- Prefect
- Ray
- Splink: probabilistic data linkage at scale
- Haystack: an end-to-end framework for production-ready search pipelines
- filequery: Query CSV and Parquet files using SQL
- dbt (data build tool): a command-line tool to transform data more effectively
- dlt: an open-source library to load data from various and often messy data sources into well-structured, live datasets
- Trilogy Python Semantic Layers