1 TPC-DS和TPC-H的区别
说起数据库测试基准,第一个想到的tpc-c,tpc-c常常被用于在线事务处理(OLTP)数据库的性能测试,比如linux上常用的sysbench测试工具就支持oltp测试,开源工具sysbench-tpcc就基于sysbench的tpcc测试工具。
oltp的测试基准是tpcc,olap(在线分析处理)的测试基准呢?广为人知的是tpc-H,TPC-H面向商品零售业,它定义了8张表,22个查询,遵循SQL92标准,它的表结构同oltp的表架构比较接近。
另一个不太为人知的标准就是本文要提到的TPC-DS,这个标准表结构是典型的数据仓库的表结构,采用星型、雪花型等多维数据模式。它包含7张事实表,17张纬度表,跟大数据的分析挖掘应用非常类似。对于数据仓库的初学者来说,通过实际的例子来学习相对起来容易一些,这个数据集是不错的选择。
2 TPC-DS的下载和编译
下载连接在这个位置
https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp
在这个页面填入信息后,点击下载,可能会出现下面这个错误
这个错误并不是输入有误,用必应搜索一下得知可以在火狐浏览器上安装gooreplacer插件来解决,插件的地址是:
https://addons.mozilla.org/zh-CN/firefox/addon/gooreplacer/
下载插件安装后,设置一下这个插件:
设置完之后重启一下浏览器,打开TPC-DS的下载连接,输入信息,点击下载后。输入的电子邮件地址会收到下载连接。将这个连接复制粘贴到浏览器就可以下载了。
这个工具的编译十分简单,进入/usr/local/tpcds/tools目录下运行make命令即可。
[root@ tools]# pwd /usr/local/tpcds/tools [root@ tools]# make
3 准备MySQL数据库
MySQL数据库的准备也不复杂,创建一个数据库,在数据库内创建表就可以了。
3.1 创建并切换到数据库
mysql>create database tpcds DEFAULT CHARSET utf8 COLLATE utf8_general_ci; Query OK,1 row affected (0.00 sec)mysql> use tpcds; Database changed
3.2 创建表
建表脚本比较长,就不在这里粘贴了,建表后的结果如下(一共是25个表):
mysql> show tables;+------------------------+| Tables_in_tpcds |+------------------------+| call_center || catalog_page || catalog_returns || catalog_sales || customer || customer_address || customer_demographics || date_dim || dbgen_version || household_demographics || income_band || inventory || item || promotion || reason || ship_mode || store || store_returns || store_sales || time_dim || warehouse || web_page || web_returns || web_sales || web_site |+------------------------+25 rows inset(0.00 sec)
4 生成测试数据
生成测试数据之前先要创建测试数据存放的目录
[root@ tools]# mkdir -p /tmp/tpcds_data
运行生成测试数据的命令
[root@ tools]# ./dsdgen -DIR /tmp/tpcds_data -SCALE 1 -TERMINATE N, tpcds.sql dsdgen Population Generator (Version 3.2.0) Copyright Transaction Processing Performance Council (TPC) 2001-2021 Warning: This scale factor is valid for QUALIFICATION ONLY
-SCALE 参数指定数据的大小,以G为单位。
5 生成导入数据的脚本
tpc-ds生成的测试数据可以用load命令导入MySQL数据库,可以用文本编辑器编辑,也可以用shell脚本生成,这里给出一个shell脚本,只需要调整数据文件所在的位置就可以生成25个表的导入脚本。
[root@ tpcds_data]# for file in `ls -l |awk '{print $9}'`; > do > echo"LOAD DATA INFILE '/tmp/tpcds_data/"$file"' INTO TABLE"${file%%.*}" FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';"; > done
脚本是一个for循环,ls -l |awk '{print $9}'打印出来是当前目录下的文件名,文件名是带有扩展名的,针对每一取到的文件名,利用linux字符串功能拼接成load语句,${file%%.*}截取文件名中'.'左边的部分,这时是花括号,不是括号。这个脚本的输出结果如下:
LOAD DATA INFILE '/tmp/tpcds_data/call_center.dat' INTO TABLE call_center FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n'; LOAD DATA INFILE '/tmp/tpcds_data/catalog_page.dat' INTO TABLE catalog_page FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n'; LOAD DATA INFILE '/tmp/tpcds_data/catalog_returns.dat' INTO TABLE catalog_returns FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n'; LOAD DATA INFILE '/tmp/tpcds_data/catalog_sales.dat' INTO TABLE catalog_sales FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n'; LOAD DATA INFILE '/tmp/tpcds_data/customer_address.dat' INTO TABLE customer_address FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n'; LOAD DATA INFILE '/tmp/tpcds_data/customer.dat' INTO TABLE customer FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n'; LOAD DATA INFILE '/tmp/tpcds_data/customer_demographics.dat' INTO TABLE customer_demographics FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n'; LOAD DATA INFILE '/tmp/tpcds_data/date_dim.dat' INTO TABLE date_dim FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n'; LOAD DATA INFILE '/tmp/tpcds_data/dbgen_version.dat' INTO TABLE dbgen_version FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n'; LOAD DATA INFILE '/tmp/tpcds_data/household_demographics.dat' INTO TABLE household_demographics FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n'; LOAD DATA INFILE '/tmp/tpcds_data/income_band.dat' INTO TABLE income_band FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n'; LOAD DATA INFILE '/tmp/tpcds_data/inventory.dat' INTO TABLE inventory FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n'; LOAD DATA INFILE '/tmp/tpcds_data/item.dat' INTO TABLE item FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n'; LOAD DATA INFILE '/tmp/tpcds_data/promotion.dat' INTO TABLE promotion FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n'; LOAD DATA INFILE '/tmp/tpcds_data/reason.dat' INTO TABLE reason FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n'; LOAD DATA INFILE '/tmp/tpcds_data/ship_mode.dat' INTO TABLE ship_mode FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n'; LOAD DATA INFILE '/tmp/tpcds_data/store.dat' INTO TABLE store FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n'; LOAD DATA INFILE '/tmp/tpcds_data/store_returns.dat' INTO TABLE store_returns FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n'; LOAD DATA INFILE '/tmp/tpcds_data/store_sales.dat' INTO TABLE store_sales FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n'; LOAD DATA INFILE '/tmp/tpcds_data/time_dim.dat' INTO TABLE time_dim FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n'; LOAD DATA INFILE '/tmp/tpcds_data/warehouse.dat' INTO TABLE warehouse FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n'; LOAD DATA INFILE '/tmp/tpcds_data/web_page.dat' INTO TABLE web_page FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n'; LOAD DATA INFILE '/tmp/tpcds_data/web_returns.dat' INTO TABLE web_returns FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n'; LOAD DATA INFILE '/tmp/tpcds_data/web_sales.dat' INTO TABLE web_sales FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n'; LOAD DATA INFILE '/tmp/tpcds_data/web_site.dat' INTO TABLE web_site FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
6 导入数据
登录MySQL数据库,切换到tcpds数据库下,粘贴生成的脚本就可以导入数据了。导入时往往报下面这个错误。
mysql> LOAD DATA INFILE '/tmp/tpcds_data/call_center.dat'INTOTABLE call_center FIELDS TERMINATED BY'|' LINES TERMINATED BY'\n'; ERROR 1290(HY000): The MySQL server is running with the --secure-file-priv option so it cannot execute this statement
出现这个错误的原因是secure_file_priv设置的问题,这个变量默认的设置为空
mysql> show variables like'%secure_file_priv%';+------------------+-------+| Variable_name | Value |+------------------+-------+| secure_file_priv |NULL|+------------------+-------+1 row inset(0.00 sec)
需要将它设置为存放要导入文件的目录
mysql>set secure_file_priv="/tmp/tpcds_data/"; ERROR 1238(HY000): Variable 'secure_file_priv'is a read only variable
这个变量是只读变量,只能在数据库启动之前设置,关闭MySQL数据库,重启后带上这个选项
[root@ tpcds_data]# mysqld_safe --user=mysql --datadir=/mysqldata --secure_file_priv="/tmp/tpcds_data/"& [1] 60061[root@ tpcds_data]# 2022-08-30T07:10:08.899244Z mysqld_safe Logging to '/mysqldata/iZ2ze0t8khaprrpfvmevjiZ.err'.2022-08-30T07:10:08.926603Z mysqld_safe Starting mysqld daemon with databases from /mysqldata
再次登录数据库就可以导入了。
mysql> LOAD DATA INFILE '/tmp/tpcds_data/call_center.dat'INTOTABLE call_center FIELDS TERMINATED BY'|' LINES TERMINATED BY'\n'; ERROR 1292(22007): Incorrect date value:'' for column 'cc_rec_end_date' at row 1
这个错误可以通过设置sql模式来消除
mysql>set sql_mode ='NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION'; Query OK,0 rows affected (0.00 sec)
再次导入
mysql> LOAD DATA INFILE '/tmp/tpcds_data/call_center.dat'INTOTABLE call_center FIELDS TERMINATED BY'|' LINES TERMINATED BY'\n'; Query OK,6 rows affected,9 warnings (0.00 sec) Records:6 Deleted:0 Skipped:0 Warnings:9
7 生成查询
# ./dsqgen -input ../query_templates/templates.lst -directory ../query_templates -output ./sql.ansi/ -DIALECT ansi -LOG ./sql.ansi/ansi.log
-directory是查询模板所在的目录,-DIALECT设置sql语言的版本
生成查询的过程中也有可能报下面这个错误
qgen2 Query Generator (Version 3.2.0) Copyright Transaction Processing Performance Council (TPC) 2001-2021 Warning: This scale factor is valid for QUALIFICATION ONLY ERROR: Substitution'_END' is used before being initialized at line 63in ../query_templates/query1.tpl
需要处理一下查询模板文件,在每个文件的末尾加上一行
[root@ query_templates]# for i in `ls query*tpl` > do > echo$i; > echo"define _END = \"\";" >> $i > done
生成查询之间仍然需要创建查询的存储目录
[root@ tools]# mkdir sql.ansi
生成查询
[root@ tools]# ./dsqgen -input ../query_templates/templates.lst -directory ../query_templates -output ./sql.ansi/ -DIALECT ansi -LOG ./sql.ansi/ansi.log qgen2 Query Generator (Version 3.2.0) Copyright Transaction Processing Performance Council (TPC) 2001-2021 Warning: This scale factor is valid for QUALIFICATION ONLY
8 运行查询
找一个查询语句运行一下
selectcount(*)from store_sales ,household_demographics ,time_dim, store where ss_sold_time_sk = time_dim.t_time_skand ss_hdemo_sk = household_demographics.hd_demo_skand ss_store_sk = s_store_sk and time_dim.t_hour=8and time_dim.t_minute>=30and household_demographics.hd_dep_count=5and store.s_store_name='ese'orderbycount(*);
看一下这条语句的执行计划
mysql> explain selectcount(*)from store_sales ,household_demographics ,time_dim, store where ss_sold_time_sk = time_dim.t_time_skand ss_hdemo_sk = household_demographics.hd_demo_skand ss_store_sk = s_store_sk and time_dim.t_hour=8and time_dim.t_minute>=30and household_demographics.hd_dep_count=5and store.s_store_name='ese'orderbycount(*);+----+-------------+------------------------+------------+--------+---------------+---------+---------+-----------------------------------+---------+----------+----------------------------------------------------+| id | select_type |table| partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |+----+-------------+------------------------+------------+--------+---------------+---------+---------+-----------------------------------+---------+----------+----------------------------------------------------+|1| SIMPLE | store |NULL| ALL | PRIMARY |NULL|NULL|NULL|12|10.00| Using where||1| SIMPLE | store_sales |NULL| ALL |NULL|NULL|NULL|NULL|3754283|10.00| Using where; Using join buffer (Block Nested Loop)||1| SIMPLE | household_demographics |NULL| eq_ref | PRIMARY | PRIMARY |4| tpcds.store_sales.ss_hdemo_sk|1|10.00| Using where||1| SIMPLE | time_dim |NULL| eq_ref | PRIMARY | PRIMARY |4| tpcds.store_sales.ss_sold_time_sk|1|5.00| Using where|+----+-------------+------------------------+------------+--------+---------------+---------+---------+-----------------------------------+---------+----------+----------------------------------------------------+4 rows inset,1 warning (0.00 sec)