Database Learning Resources (updated 2022-03-18)


Author: 康凯森

Date: 2021-12-19

Category: OLAP


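Over nearly two years at StarRocks, our query team built from scratch a vectorized execution engine, a CBO (cost-based) query optimizer, and a Pipeline parallel query engine, setting a new performance milestone for Chinese-developed OLAP databases. This article collects the database materials I regularly consult and study. I hope it is useful to you, and you are welcome to join the StarRocks open source community.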

I will do my best to keep this article updated, and contributions and corrections are welcome.

TODO:

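  • [ ] Categorize the system survey section
  • [ ] Give each item a difficulty rating (beginner, intermediate, expert)
  • [ ] Give each item a necessity rating (optional, must-read)
  • [ ] Note the core ideas of each paper and what we can learn from it
  • [ ] For each database system architecture, note its strengths, its weaknesses, and what we can learn from it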

Database Overview

Books

University Courses

Tutorial Series

Articles

How does a relational database work

Database Blogs and Columns

Database System Surveys

Database System Introductions

https://dbdb.io/

Amazon Aurora

Amazon Aurora paper

AnalyticDB

AnalyticDB: Real-time OLAP Database System at Alibaba Cloud

Arrow

Apache Arrow: In Theory, In Practice

Apache Calcite

Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

Apache Pulsar

"Frontier Technologies in Distributed Systems" series | The Design Philosophy of Pulsar (「分布式系统前沿技术」专题 | Pulsar 的设计哲学)

AresDB

Introducing AresDB: Uber’s GPU-Powered Open Source, Real-time Analytics Engine

Ceph

Ceph: A Scalable, High-Performance Distributed File System

ChubaoFS

CFS: A Distributed File System for Large Scale Container Platforms

ClickHouse

The Secrets of ClickHouse Performance Optimizations

CockroachDB

CockroachDB: The Resilient Geo-Distributed SQL Database

CynosDB

DB2

DB2 with BLU Acceleration: So Much More than Just a Column Store

Dremio

Using Apache Arrow, Calcite and Parquet to build a Relational Cache

Dremio white-paper

Druid

Druid: A Real-time Analytical Data Store

F1

F1 Query: Declarative Querying at Scale

Flink

Apache Flink™: Stream and Batch Processing in a Single Engine

FoundationDB

FoundationDB: A Distributed Unbundled Transactional Key Value Store

Google BigQuery

Google Dremel

Dremel: Interactive Analysis of Web-Scale Datasets

Google Procella

Procella: Unifying serving and analytical data at YouTube

Google Napa

Napa: Powering Scalable Data Warehousing with Robust Query Performance at Google

Greenplum

Greenplum Architecture

Why Greenplum is the best compared with others

HAWQ

HAWQ: A Massively Parallel Processing SQL Engine in Hadoop

HBase

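HBase Basic Concepts (HBase 基本概念)

HiStore

HiStore: Alibaba's OLAP Solution for Massive-Data Scenarios (HiStore:阿里巴巴海量数据场景下的OLAP解决方案)

An Introduction to HiStore, a High-Performance Columnar Database for Massive Data (海量高性能列式数据库HiStore介绍)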

Hudi

Apache Hudi Design And Architecture

Hyper

Hyper Paper

Ignite

What is Apache Ignite?

Impala

Impala: A Modern, Open-Source SQL Engine for Hadoop

Kudu

Kudu: Storage for Fast Analytics on Fast Data

MySQL

Mesa

Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing

MemSQL

OceanBase

The Evolution and Engineering Practice of the OceanBase Storage System Architecture (OceanBase存储系统架构的演进历程及工程实践)

OmniSciDB

OmniSci Technical Whitepaper

Pinot

Pinot: Realtime OLAP for 530 Million Users

PostgreSQL

Presto

Presto: SQL on Everything

Redshift

Amazon Redshift and the Case for Simpler Data Warehouses

Data Warehousing in the Cloud: Amazon Redshift vs Microsoft Azure SQL

SAP HANA

SAP HANA: A Data Platform for Enterprise Applications Purpose Built for Modern Hardware ★★★★★

SAP HANA: A Data Platform for Enterprise Applications Purpose Built for Modern Hardware (video)

Spanner

Spanner: Google’s Globally-Distributed Database

SnappyData

SnappyData: A Unified Cluster for Streaming, Transactions, and Interactive Analytics

Snowflake

The Snowflake Elastic Data Warehouse

Building An Elastic Query Engine on Disaggregated Storage

Snowflake Database Internals

Spark

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Spark SQL: Relational Data Processing in Spark

Splice Machine

Splice Machine

Splice Machine – An HTAP DB at Scale

TiDB

TiDB: A Raft-based HTAP Database

TiDB Design Documents

Vertica

The Vertica Analytic Database: C-Store 7 Years Later

VoltDB

H-Store And VoltDB

X-Engine

X-Engine: An Optimized Storage Engine for Large-scale E-commerce Transaction Processing

Hologres

Alibaba Hologres: A Cloud-Native Service for Hybrid Serving/Analytical Processing

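The C++ Language

  1. https://www.learncpp.com/: Systematic and easy to follow. Good for structured study when you are starting out, and handy for looking up a detail you have forgotten.
  2. A Tour of C++ ★★★★★: If you already know another programming language, read this book directly to get started with C++.
  3. Build your own Database: Covers the C++ knowledge needed for database development. Once you understand these points, you can start developing databases in C++.
  4. https://isocpp.org/faq ★★★★★: A reference for when a C++ detail is unclear.
  5. Programming Notes for Professionals books: A systematic compilation of frequently asked and classic Stack Overflow C++ questions. A reference for when a C++ detail is unclear.
  6. CppCoreGuidelines: C++ coding guidelines and best practices. Worth reading through systematically when you have time, but you do not need to memorize every detail, because IDE plugins are available that flag violations as you code.
  7. CppCon talk slides from past years ★★★★★: Many excellent talks on C++ design, syntax, tooling, performance optimization, and data structures. Once you are past the basics of C++, try to go through one short talk a day.
  8. Effective C++ series: The Effective series has four books. If you have plenty of time, read them in full; if time is tight, read the notes that others have compiled online.

A programming language is a tool, and a tool is meant to be used. Learning a programming language is therefore like learning English: the fastest way to improve is to use it and practice a lot. You do not need to wait until you are fully familiar with or proficient in the language before you start using it. When something is unclear in practice, work through it one point at a time.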

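Query Optimizer

Papers

Articles

Query Execution

Papers

Slides

Vectorization

Papers

Courses

Articles & Slides

Query Compilation

Papers

Courses

Articles & Slides

Storage

Papers

Articles

Data Ingestion

Testing

Transactions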

  • A Critique of ANSI SQL Isolation Levels
  • Transaction Processing: Concepts and Techniques
  • Granularity of Locks and Degrees of Consistency in a Shared Data Base
  • (Percolator) Large-scale Incremental Processing Using Distributed Transactions and Notifications
  • (Omid) Omid, Reloaded: Scalable and Highly-Available Transaction Processing
  • (2PC) Database System Concepts
  • (2PC) Database Systems The Complete Book
  • (OCC) On Optimistic Methods for Concurrency Control
  • (TicToc) TicToc: Time Traveling Optimistic Concurrency Control
  • (Silo) Speedy Transactions in Multicore In-Memory Databases
  • (Hekaton) High-Performance Concurrency Control Mechanisms for Main-Memory Databases
  • Deferred Action Framework

Cloud-Native

To be completed.

Performance Optimization

Profiling Tools

CPU Microarchitecture

CPU Cache

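If you have studied some of this theory but do not know how to put it into practice, you are welcome to join the StarRocks open source community. All of the theory above has a place to be practiced in StarRocks, and your Stars, Issues, and PRs are welcome.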


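OLAP Performance Optimization Guide (《OLAP 性能优化指南》): Stars and contributions welcome

OLAP Performance Optimization Guide (《OLAP 性能优化指南》)

Follow my WeChat official account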