一、窗口函数简介
1.1 什么是窗口函数
MySQL从8.0开始支持窗口函数,这个功能在大多数据库中早已支持,有的也叫分析函数。那么什么是窗口呢?
窗口的概念非常重要,它可以理解为记录集合,窗口函数也就是在满足某种条件的记录集合上执行的特殊函数。对于每条记录都要在此窗口内执行函数,有的函数随着记录不同,窗口大小都是固定的,这种属于静态窗口;有的函数则相反,不同的记录对应着不同的窗口,这种动态变化的窗口叫滑动窗口。简单的说窗口函数就是对于查询的每一行,都使用与该行相关的行进行计算。
窗口函数和普通聚合函数很容易混淆,二者区别如下:
- 聚合函数是将多条记录聚合为一条;而窗口函数是每条记录都会执行,有几条记录执行完还是几条。
- 聚合函数也可以用于窗口函数中。
1.2 窗口函数功能
名称 | 描述 |
---|---|
CUME_DIST() | 计算一组值中一个值的累积分布 |
DENSE_RANK() | 根据该ORDER BY子句为分区中的每一行分配一个等级。它将相同的等级分配给具有相等值的行。如果两行或更多行具有相同的排名,则排名值序列中将没有间隙 |
FIRST_VALUE() | 返回相对于窗口框架第一行的指定表达式的值 |
LAG() | 返回分区中当前行之前的第N行的值。如果不存在前一行,则返回NULL |
LAST_VALUE() | 返回相对于窗口框架中最后一行的指定表达式的值 |
LEAD() | 返回分区中当前行之后的第N行的值。如果不存在后续行,则返回NULL |
NTH_VALUE() | 从窗口框架的第N行返回参数的值 |
NTILE() | 将每个窗口分区的行分配到指定数量的排名组中 |
PERCENT_RANK() | 计算分区或结果集中行的百分数等级 |
RANK() | 与DENSE_RANK()函数相似,不同之处在于当两行或更多行具有相同的等级时,等级值序列中存在间隙 |
ROW_NUMBER() | 为分区中的每一行分配一个顺序整数 |
将上述函数按照功能划分,可以把MySQL支持的窗口函数分为如下几类:
- 序号函数:ROW_NUMBER()、RANK()、DENSE_RANK()
- 分布函数:PERCENT_RANK()、CUME_DIST()
- 前后函数:LAG()、LEAD()
- 头尾函数:FIRST_VALUE()、LAST_VALUE()
- 其他函数:NTH_VALUE()、NTILE()
二、窗口函数语法
窗口函数的相关语法是:
[WINDOW window_name AS (window_spec)
[, window_name AS (window_spec)] ...]
window_function_name(window_name/expression)
OVER (
[partition_defintion]
[order_definition]
[frame_definition]
)
先指定作为窗口函数的函数名,后面跟一个表达式,然后是OVER(…),就算OVER里面没有内容,括号也需要保留。
窗口函数的一个概念是当前行,当前行属于某个窗口,窗口由“[partition_defintion]”,“[order_definition]”,“[frame_definition]“确定。
window_name:给窗口指定一个别名,如果SQL中涉及的窗口较多,采用别名可以看起来更清晰易读。
partition_defintion:窗口按照指定字段进行分区,两个分区由分区边界分隔,窗口功能在分区内执行,并在跨越分区边界时重新初始化。
order_definition:按照指定字段进行排序,窗口函数将按照排序后的记录顺序进行编号。可以和partition子句配合使用,也可以单独使用。
frame子句:frame是当前分区的一个子集,在分区里面再进一步细分窗口,子句用来定义子集的规则,通常用来作为滑动窗口使用。
具体语法如下:
frame_unit {<frame_start>|<frame_between>}
frame_unit有两种,分别是ROWS和RANGE,由ROWS定义的frame是由开始和结束位置的行确定的,由RANGE定义的frame由在某个值区间的行确定。
- 基于行:
通常使用BETWEEN frame_start AND frame_end语法来表示行范围,frame_start和frame_end可以支持如下关键字,来确定不同的动态行记录:
CURRENT ROW 边界是当前行,一般和其他范围关键字一起使用
UNBOUNDED PRECEDING 边界是分区中的第一行
UNBOUNDED FOLLOWING 边界是分区中的最后一行
expr PRECEDING 当前行之前的expr(数字或表达式)行
expr FOLLOWING 当前行之后的expr(数字或表达式)行
比如,下面都是合法的范围:
rows BETWEEN 1 PRECEDING AND 1 FOLLOWING 窗口范围是当前行、前一行、后一行一共三行记录。
rows UNBOUNDED FOLLOWING 窗口范围是当前行到分区中的最后一行
rows BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING 窗口范围是当前分区中所有行,等同于不写。
- 基于范围:
和基于行类似,但有些范围不是直接可以用行数来表示的,比如希望窗口范围是一周前的订单开始,截止到当前行,则无法使用rows来直接表示,此时就可以使用范围来表示窗口:INTERVAL 7 DAY PRECEDING。Linux中常见的最近1分钟、5分钟负载是一个典型的应用场景。
如果未frame_definition在OVER子句中指定,则MySQL默认使用以下框架:
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
三、窗口函数示例
MySQL [test]> select * from t1;
+----+------+--------+---------------------+
| id | name | amount | time |
+----+------+--------+---------------------+
| 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 6 | a2 | 500 | 2019-01-05 00:00:00 |
| 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 10 | a2 | 600 | 2019-01-07 00:00:00 |
+----+------+--------+---------------------+
10 rows in set (0.00 sec)
MySQL [test]> select id,name,amount,time,avg(amount) over w as avg_sum from t1 window w as (partition by name order by time desc rows BETWEEN 1 PRECEDING AND 1 FOLLOWING);
+----+------+--------+---------------------+----------+
| id | name | amount | time | avg_sum |
+----+------+--------+---------------------+----------+
| 5 | a1 | 300 | 2019-01-04 00:00:00 | 300.0000 |
| 4 | a1 | 300 | 2019-01-03 00:00:00 | 266.6667 |
| 2 | a1 | 200 | 2019-01-02 00:00:00 | 266.6667 |
| 3 | a1 | 300 | 2019-01-02 00:00:00 | 200.0000 |
| 1 | a1 | 100 | 2019-01-01 00:00:00 | 200.0000 |
| 8 | a2 | 600 | 2019-01-07 00:00:00 | 600.0000 |
| 9 | a2 | 600 | 2019-01-07 00:00:00 | 600.0000 |
| 10 | a2 | 600 | 2019-01-07 00:00:00 | 600.0000 |
| 7 | a2 | 600 | 2019-01-06 00:00:00 | 566.6667 |
| 6 | a2 | 500 | 2019-01-05 00:00:00 | 550.0000 |
+----+------+--------+---------------------+----------+
10 rows in set (0.00 sec)
#从结果可以看出,id为5的记录属于边界值,没有前一行,因此avg_sum为(300+300)/2=300;id为4的记录前后都有记录,所以avg_sum为(300+300+200)/3=266.6667,以此类推就可以得到一个基于滑动窗口的动态平均值。此例中,窗口函数用到了传统的聚合函数avg(),用来计算动态的平均值。
MySQL [test]> select CUME_DIST() over (partition by name order by time desc ) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------+----+------+--------+---------------------+
| 0.2 | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 0.4 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 0.8 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 0.8 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 1 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 0.6 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 0.6 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 0.6 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 0.8 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 1 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
#在某种排序条件下,小于等于当前行值的行数/总行数,得到的是数据在某一个纬度的分布百分比情况
MySQL [test]> select DENSE_RANK() over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------+----+------+--------+---------------------+
| 1 | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 2 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 3 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 3 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 4 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 1 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 1 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 1 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 2 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 3 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.01 sec)
#dense_rank()的出现是为了解决rank()编号存在的问题的,rank()编号的时候存在跳号的问题,如果有两个并列第1,那么下一个名次的编号就是3,结果就是没有编号为2的数据。如果不想跳号,可以使用dense_rank()替代
MySQL [test]> select FIRST_VALUE(time) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------------------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------------------+----+------+--------+---------------------+
| 2019-01-04 00:00:00 | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 2019-01-04 00:00:00 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 2019-01-04 00:00:00 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 2019-01-04 00:00:00 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 2019-01-04 00:00:00 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 2019-01-07 00:00:00 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 2019-01-07 00:00:00 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------------------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
#first_value就是取某一组数据,按照某种方式排序的,最早的一个字段的值。
MySQL [test]> select LAG(time,1) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------------------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------------------+----+------+--------+---------------------+
| NULL | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 2019-01-04 00:00:00 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 2019-01-03 00:00:00 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 2019-01-02 00:00:00 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 2019-01-02 00:00:00 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| NULL | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 2019-01-06 00:00:00 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------------------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
#lag(column,n)获取当前数据行按照某种排序规则的上n行数据的某个字段
MySQL [test]> select LAST_VALUE(time) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------------------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------------------+----+------+--------+---------------------+
| 2019-01-04 00:00:00 | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 2019-01-03 00:00:00 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 2019-01-02 00:00:00 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 2019-01-02 00:00:00 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 2019-01-01 00:00:00 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 2019-01-07 00:00:00 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-06 00:00:00 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 2019-01-05 00:00:00 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------------------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
#last_value就是取某一组数据,按照某种方式排序的,最新的一个字段的值。
MySQL [test]> select LEAD(time,1) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------------------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------------------+----+------+--------+---------------------+
| 2019-01-03 00:00:00 | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 2019-01-02 00:00:00 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 2019-01-02 00:00:00 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 2019-01-01 00:00:00 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| NULL | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 2019-01-07 00:00:00 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-06 00:00:00 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-05 00:00:00 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| NULL | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------------------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
#lead(column,n)获取当前数据行按照某种排序规则的下n行数据的某个字段
MySQL [test]> select NTH_VALUE(time,2) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------------------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------------------+----+------+--------+---------------------+
| NULL | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 2019-01-03 00:00:00 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 2019-01-03 00:00:00 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 2019-01-03 00:00:00 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 2019-01-03 00:00:00 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 2019-01-07 00:00:00 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 2019-01-07 00:00:00 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 2019-01-07 00:00:00 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------------------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
#从排序的第n行还是返回nth_value字段中的值
MySQL [test]> select NTILE(2) over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------+----+------+--------+---------------------+
| 1 | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 1 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 1 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 2 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 2 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 1 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 1 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 1 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 2 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 2 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
#按照某列的倒序排列,将字段分成N组,可以得到哪个数据在N组中哪一部分
MySQL [test]> select PERCENT_RANK() over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------+----+------+--------+---------------------+
| 0 | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 0.25 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 0.5 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 0.5 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 1 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 0 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 0 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 0 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 0.75 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 1 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
#数据分布的计算方式:当前RANK值-1/总行数-1
MySQL [test]> select RANK() over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------+----+------+--------+---------------------+
| 1 | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 2 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 3 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 3 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 5 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 1 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 1 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 1 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 4 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 5 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
#排序条件一样的情况下,其编号也一样。
MySQL [test]> select ROW_NUMBER() over (partition by name order by time desc) as win_fun, id,name,amount,time from t1;
+---------+----+------+--------+---------------------+
| win_fun | id | name | amount | time |
+---------+----+------+--------+---------------------+
| 1 | 5 | a1 | 300 | 2019-01-04 00:00:00 |
| 2 | 4 | a1 | 300 | 2019-01-03 00:00:00 |
| 3 | 2 | a1 | 200 | 2019-01-02 00:00:00 |
| 4 | 3 | a1 | 300 | 2019-01-02 00:00:00 |
| 5 | 1 | a1 | 100 | 2019-01-01 00:00:00 |
| 1 | 8 | a2 | 600 | 2019-01-07 00:00:00 |
| 2 | 9 | a2 | 600 | 2019-01-07 00:00:00 |
| 3 | 10 | a2 | 600 | 2019-01-07 00:00:00 |
| 4 | 7 | a2 | 600 | 2019-01-06 00:00:00 |
| 5 | 6 | a2 | 500 | 2019-01-05 00:00:00 |
+---------+----+------+--------+---------------------+
10 rows in set (0.00 sec)
#对排序结果编号
四、总结
MySQL8.0中加入了窗口函数的功能,这一点方便了SQL的编写,可以说是MySQL8.0的亮点之一。