本套SQL题的答案是由许多小伙伴共同贡献的,1+1的力量是远远大于2的,有不少题目都采用了非常巧妙的解法,也有不少题目有多种解法。本套大数据SQL题不仅题目丰富多样,答案更是精彩绝伦!
注:以下参考答案都经过简单数据场景进行测试通过,但并未测试其他复杂情况。本文档的SQL主要使用Hive SQL。
一、行列转换
描述:表中记录了各年份各部门的平均绩效考核成绩。
表名:t1
表结构:
a -- 年份 b -- 部门 c -- 绩效得分
表内容:
a b c 2014 B 9 2015 A 8 2014 A 10 2015 B 7
问题一:多行转多列
问题描述:将上述表内容转为如下输出结果所示:
a col_A col_B 2014 10 9 2015 8 7
参考答案:
select a, max(case when b="A" then c end) col_A, max(case when b="B" then c end) col_B from t1 group by a;
问题二:如何将结果转成源表?(多列转多行)
问题描述:将问题一的结果转成源表,问题一结果表名为t1_2
。
参考答案:
select a, b, c from ( select a,"A" as b,col_a as c from t1_2 union all select a,"B" as b,col_b as c from t1_2 )tmp;
问题三:同一部门会有多个绩效,求多行转多列结果
问题描述:2014年公司组织架构调整,导致部门出现多个绩效,业务及人员不同,无法合并算绩效,源表内容如下:
2014 B 9 2015 A 8 2014 A 10 2015 B 7 2014 B 6
输出结果如下所示:
a col_A col_B 2014 10 6,9 2015 8 7
参考答案:
select a, max(case when b="A" then c end) col_A, max(case when b="B" then c end) col_B from ( select a, b, concat_ws(",",collect_set(cast(c as string))) as c from t1 group by a,b )tmp group by a;
二、排名中取他值
表名:t2
表字段及内容:
a b c 2014 A 3 2014 B 1 2014 C 2 2015 A 4 2015 D 3
问题一:按a分组取b字段最小时对应的c字段
输出结果如下所示:
a min_c 2014 3 2015 4
参考答案:
select a, c as min_c from ( select a, b, c, row_number() over(partition by a order by b) as rn from t2 )a where rn = 1;
问题二:按a分组取b字段排第二时对应的c字段
输出结果如下所示:
a second_c 2014 1 2015 3
参考答案:
select a, c as second_c from ( select a, b, c, row_number() over(partition by a order by b) as rn from t2 )a where rn = 2;
问题三:按a分组取b字段最小和最大时对应的c字段
输出结果如下所示:
a min_c max_c 2014 3 2 2015 4 3
参考答案:
select a, min(if(asc_rn = 1, c, null)) as min_c, max(if(desc_rn = 1, c, null)) as max_c from ( select a, b, c, row_number() over(partition by a order by b) as asc_rn, row_number() over(partition by a order by b desc) as desc_rn from t2 )a where asc_rn = 1 or desc_rn = 1 group by a;
问题四:按a分组取b字段第二小和第二大时对应的c字段
输出结果如下所示:
a min_c max_c 2014 1 1 2015 3 4
参考答案:
select ret.a ,max(case when ret.rn_min = 2 then ret.c else null end) as min_c ,max(case when ret.rn_max = 2 then ret.c else null end) as max_c from ( select * ,row_number() over(partition by t2.a order by t2.b) as rn_min ,row_number() over(partition by t2.a order by t2.b desc) as rn_max from t2 ) as ret where ret.rn_min = 2 or ret.rn_max = 2 group by ret.a;
问题五:按a分组取b字段前两小和前两大时对应的c字段
注意:需保持b字段最小、最大排首位
输出结果如下所示:
a min_c max_c 2014 3,1 2,1 2015 4,3 3,4
参考答案:
select tmp1.a as a, min_c, max_c from ( select a, concat_ws(',', collect_list(c)) as min_c from ( select a, b, c, row_number() over(partition by a order by b) as asc_rn from t2 )a where asc_rn <= 2 group by a )tmp1 join ( select a, concat_ws(',', collect_list(c)) as max_c from ( select a, b, c, row_number() over(partition by a order by b desc) as desc_rn from t2 )a where desc_rn <= 2 group by a )tmp2 on tmp1.a = tmp2.a;
三、累计求值
表名:t3
表字段及内容:
a b c 2014 A 3 2014 B 1 2014 C 2 2015 A 4 2015 D 3
问题一:按a分组按b字段排序,对c累计求和
输出结果如下所示:
a b sum_c 2014 A 3 2014 B 4 2014 C 6 2015 A 4 2015 D 7
参考答案:
select a, b, c, sum(c) over(partition by a order by b) as sum_c from t3;
问题二:按a分组按b字段排序,对c取累计平均值
输出结果如下所示:
a b avg_c 2014 A 3 2014 B 2 2014 C 2 2015 A 4 2015 D 3.5
参考答案:
select a, b, c, avg(c) over(partition by a order by b) as avg_c from t3;
问题三:按a分组按b字段排序,对b取累计排名比例
输出结果如下所示:
a b ratio_c 2014 A 0.33 2014 B 0.67 2014 C 1.00 2015 A 0.50 2015 D 1.00
参考答案:
select a, b, c, round(row_number() over(partition by a order by b) / (count(c) over(partition by a)),2) as ratio_c from t3 order by a,b;
问题四:按a分组按b字段排序,对b取累计求和比例
输出结果如下所示:
a b ratio_c 2014 A 0.50 2014 B 0.67 2014 C 1.00 2015 A 0.57 2015 D 1.00
参考答案:
select a, b, c, round(sum(c) over(partition by a order by b) / (sum(c) over(partition by a)),2) as ratio_c from t3 order by a,b;
四、窗口大小控制
表名:t4
表字段及内容:
a b c 2014 A 3 2014 B 1 2014 C 2 2015 A 4 2015 D 3
问题一:按a分组按b字段排序,对c取前后各一行的和
输出结果如下所示:
a b sum_c 2014 A 1 2014 B 5 2014 C 1 2015 A 3 2015 D 4
参考答案:
select a, b, lag(c,1,0) over(partition by a order by b)+lead(c,1,0) over(partition by a order by b) as sum_c from t4;
问题二:按a分组按b字段排序,对c取平均值
问题描述:前一行与当前行的均值!
输出结果如下所示:
a b avg_c 2014 A 3 2014 B 2 2014 C 1.5 2015 A 4 2015 D 3.5
参考答案:
select a, b, case when lag_c is null then c else (c+lag_c)/2 end as avg_c from ( select a, b, c, lag(c,1) over(partition by a order by b) as lag_c from t4 )temp;