Why DuckDB: an embedded database that needs no installation, small yet high-performance, and well suited for data analysis.
0. Setup
0.1 Prepare the Python environment
apt install -y python3-pip
pip3 install --break-system-packages duckdb
apt install -y python3-numpy
0.2 The runner script (run.py)
import duckdb
import argparse

# Define the norm() function: strip whitespace and lowercase, passing NULLs through
def norm(x):
    return x.strip().lower() if x is not None else None

# Register norm() as a scalar UDF; the parameter and return types must be given
# explicitly because norm() has no type annotations
duckdb.create_function("norm", norm, [duckdb.typing.VARCHAR], duckdb.typing.VARCHAR)

# Set up the command-line argument parser
parser = argparse.ArgumentParser(description='Execute a SQL query from a file.')
parser.add_argument('sql_file', type=str, help='Path to the SQL file')

# Parse the command-line arguments
args = parser.parse_args()

# Read the query from the given SQL file
with open(args.sql_file, 'r') as file:
    sql_query = file.read()

# Execute the SQL query
result = duckdb.sql(sql_query).fetchall()

# Print the result
print(result)
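The `norm()` UDF is reused throughout this article, so it is worth seeing its behavior in isolation. A standalone sketch (no DuckDB needed); the sample inputs are made up for illustration:

```python
# Standalone sketch of what the norm() UDF does to each value.
def norm(x):
    # Strip surrounding whitespace and lowercase; pass NULLs (None) through.
    return x.strip().lower() if x is not None else None

print(norm("  JohnDoe@GMAIL.com "))  # johndoe@gmail.com
print(norm(None))                    # None
```

The `None` branch matters: DuckDB hands SQL NULLs to the Python function as `None`, and returning `None` keeps them NULL in the result.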
1. Use a CTE as a data pipeline
**Reuse intermediate result sets with the CTE's WITH structure.**
1.1 Sample data (users.csv)
id,email
1,JohnDoe@gmail.com
2,JaneSmith@yahoo.com
3,RobertBrown@gmail.com
4,LindaLee@hotmail.com
5,MichaelJordan@gmail.com
6,EmilyDavis@outlook.com
1.2 The query
WITH raw AS (
SELECT * FROM read_csv_auto('users.csv')
),
clean AS (
SELECT id, LOWER(email) AS email FROM raw
)
SELECT * FROM clean WHERE email LIKE '%gmail%';
1.3 Run it
python3 run.py cte.sql
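For intuition, the `raw → clean → filter` pipeline above is equivalent to this plain-Python sketch (the six rows are inlined here instead of read from users.csv):

```python
# Plain-Python equivalent of the raw -> clean -> filter CTE pipeline.
rows = [
    (1, "JohnDoe@gmail.com"),
    (2, "JaneSmith@yahoo.com"),
    (3, "RobertBrown@gmail.com"),
    (4, "LindaLee@hotmail.com"),
    (5, "MichaelJordan@gmail.com"),
    (6, "EmilyDavis@outlook.com"),
]

# clean: lowercase the email column
clean = [(uid, email.lower()) for uid, email in rows]

# final SELECT: keep only gmail addresses
result = [r for r in clean if "gmail" in r[1]]
print(result)  # ids 1, 3 and 5
```

Each CTE is just a named intermediate list; later stages read from the earlier ones by name.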
2. Use Python like a stored procedure
The input file data.csv:
name
Alice
Bob
Charlie
David
Eva
SELECT norm(name) as name_cleaned
FROM read_csv_auto('data.csv')
python3 run.py norm.sql
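SQL applies the registered UDF once per row; the same query amounts to a map over the name column in plain Python (names inlined from data.csv):

```python
# The UDF registered in run.py, applied row by row as the SQL engine would.
def norm(x):
    return x.strip().lower() if x is not None else None

names = ["Alice", "Bob", "Charlie", "David", "Eva"]
name_cleaned = [norm(n) for n in names]
print(name_cleaned)  # ['alice', 'bob', 'charlie', 'david', 'eva']
```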
3. Join a small table to a large one: filter first, then join
3.1 Data
customs.csv:
user_id,email,country
1,JohnDoe@gmail.com,US
2,JaneSmith@yahoo.com,CN
3,RobertBrown@gmail.com,UK
4,LindaLee@hotmail.com,CN
5,MichaelJordan@gmail.com,US
6,EmilyDavis@outlook.com,CN
orders.csv:
order_id,user_id,order_date,amount
1001,2,2025-11-01,150
1002,4,2025-11-03,200
1003,6,2025-11-05,250
1004,1,2025-11-06,300
1005,3,2025-11-07,350
3.2 The query
WITH f AS (
SELECT user_id FROM read_csv_auto('customs.csv') WHERE country='CN'
)
SELECT o.*
FROM f
JOIN read_csv_auto('orders.csv') o USING(user_id);
3.3 Run it
python3 run.py join.sql
**One rule to remember: shrink the data as early as you can.**
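The shape of the query, build the small set first, then probe it while scanning the large table once, looks like this in plain Python (data inlined from the two CSVs above):

```python
# Step 1: shrink the user table to just the CN user_ids (the small side).
users = [(1, "US"), (2, "CN"), (3, "UK"), (4, "CN"), (5, "US"), (6, "CN")]
cn_ids = {uid for uid, country in users if country == "CN"}

# Step 2: probe the small set while scanning the large orders table once.
orders = [
    (1001, 2, "2025-11-01", 150),
    (1002, 4, "2025-11-03", 200),
    (1003, 6, "2025-11-05", 250),
    (1004, 1, "2025-11-06", 300),
    (1005, 3, "2025-11-07", 350),
]
cn_orders = [o for o in orders if o[1] in cn_ids]
print(cn_orders)  # orders 1001, 1002, 1003
```

Filtering before the join means the hash table built for the join only holds CN users, not every user.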
4. Pivot and unpivot patterns
The input file stats.csv:
user_id,event,cnt
1,login,5
1,view,10
1,purchase,3
2,login,8
2,view,15
2,purchase,6
3,login,7
3,view,12
3,purchase,4
4,login,6
4,view,9
4,purchase,2
5,login,10
5,view,20
5,purchase,8
1. Rows to columns (PIVOT)
PIVOT (
    SELECT user_id, event, cnt FROM read_csv_auto('stats.csv')
) ON event USING avg(cnt);
python3 run.py stats.sql
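What PIVOT computes here, sketched in plain Python: one output row per user_id, one column per distinct event value (the `avg` is trivial because each (user_id, event) pair appears only once in stats.csv; only the first two users are inlined below):

```python
rows = [
    (1, "login", 5), (1, "view", 10), (1, "purchase", 3),
    (2, "login", 8), (2, "view", 15), (2, "purchase", 6),
]

# Pivot: dict keyed by user_id, with one entry per event type.
wide = {}
for uid, event, cnt in rows:
    wide.setdefault(uid, {})[event] = cnt
print(wide)  # {1: {'login': 5, 'view': 10, 'purchase': 3}, 2: {...}}
```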
2. Columns to rows (UNPIVOT)
The input file un_stats.csv:
user_id,login,view,purchase
1,3,5,2
2,4,8,1
3,2,7,1
UNPIVOT 'un_stats.csv'
ON login, view, purchase
INTO
    NAME type
    VALUE cnt;
python3 run.py un_stats.sql
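And UNPIVOT in plain Python: each wide row explodes into one (user_id, type, cnt) row per stacked column, with the column name landing in `type` and its value in `cnt` (data inlined from un_stats.csv):

```python
wide_rows = [
    (1, 3, 5, 2),   # user_id, login, view, purchase
    (2, 4, 8, 1),
    (3, 2, 7, 1),
]

# Unpivot: 3 wide rows x 3 stacked columns -> 9 long rows.
long_rows = []
for uid, login, view, purchase in wide_rows:
    for typ, cnt in (("login", login), ("view", view), ("purchase", purchase)):
        long_rows.append((uid, typ, cnt))
print(long_rows[:3])  # [(1, 'login', 3), (1, 'view', 5), (1, 'purchase', 2)]
```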
5. Window functions
The input file events.csv:
event_id,user_id,ts,event_type,description
1,1,2025-12-08 08:00:00,login,User 1 logged in
2,1,2025-12-08 09:00:00,logout,User 1 logged out
3,2,2025-12-08 08:30:00,login,User 2 logged in
4,2,2025-12-08 10:00:00,purchase,User 2 made a purchase
5,3,2025-12-07 18:00:00,signup,User 3 signed up
6,3,2025-12-08 11:00:00,login,User 3 logged in
7,3,2025-12-08 12:00:00,logout,User 3 logged out
SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) AS rn
FROM 'events.csv'
)
WHERE rn = 1;
python3 run.py events.sql
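The ROW_NUMBER trick picks the latest event per user. In plain Python that is a sort by timestamp descending plus "first row seen wins" per user_id, which is exactly what keeping `rn = 1` does (events inlined from events.csv, descriptions omitted):

```python
events = [
    (1, 1, "2025-12-08 08:00:00", "login"),
    (2, 1, "2025-12-08 09:00:00", "logout"),
    (3, 2, "2025-12-08 08:30:00", "login"),
    (4, 2, "2025-12-08 10:00:00", "purchase"),
    (5, 3, "2025-12-07 18:00:00", "signup"),
    (6, 3, "2025-12-08 11:00:00", "login"),
    (7, 3, "2025-12-08 12:00:00", "logout"),
]

# Sort by ts descending, then keep the first row seen per user_id (rn = 1).
latest = {}
for ev in sorted(events, key=lambda e: e[2], reverse=True):
    latest.setdefault(ev[1], ev)
print(sorted(latest.values()))  # events 2, 4, 7
```

The ISO-like timestamp strings sort correctly as plain strings, which is why no datetime parsing is needed here.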
6. Interactive practice
DuckDB online shell: https://shell.duckdb.org/#queries=v0,%20%20-Create-table-from-Parquet-file%0ACREATE-TABLE-train_services-AS%0A----FROM-'https%3A%2F%2Fblobs.duckdb.org%2Ftrain_services.parquet'~%0A%0A%20%20-Get-the-top%203-busiest-train-stations%0ASELECT%0A----station_name%2C%0A----count(*)-AS-num_services%0AFROM-train_services%0AGROUP-BY-ALL%0AORDER-BY-num_services-DESC%0ALIMIT-3~%0A
Related links:
- DuckDB repository: https://github.com/duckdb/duckdb
- DuckDB documentation: https://duckdb.org/docs/stable/data/overview