热线电话:13121318867

登录
2022-01-22 阅读量: 1653
开窗函数之面试真题讲解(学习SQL的小朋友们赶紧看过来,顺便手有余香点个赞哦,ღ( ´・ᴗ・` )比心)

开篇定场诗

“问渠哪得清如许, 为有源头活水来。”——朱熹


不断挑战困难,才是成长正确的打开方式,当不断学习和进步成为一种习惯,有一天你自己也不知道你自己会变得有多优秀!💪🏻~


题目来源:腾讯

题目:

1、计算每天回流玩家数量

练习代码:



login_date date,
user_id int,
primary key(login_date,user_id)
);
insert into payment_log values
('2021-01-01',10001,50,801),
('2021-01-01',10001,60,802),
('2021-01-01',10001,80,803),
('2021-01-02',10001,55,804),
('2021-01-02',10001,56,805),
('2021-01-02',10002,70,806),
('2021-01-02',10002,71,807),
('2021-01-03',10002,72,808),
('2021-01-05',10002,73,809),
('2021-01-11',10005,90,810),
('2021-01-21',10005,91,811),
('2021-01-31',10005,92,812),
('2021-01-03',10022,40,813);
insert into user_log values
('2021-01-01',10001),
('2021-01-02',10001),
('2021-01-14',10022),
('2021-01-01',10005),
('2021-01-11',10005),
('2021-01-21',10005),
('2021-01-31',10005),
('2021-01-02',10002),
('2021-01-03',10002),
('2021-01-03',10022),
('2021-01-05',10002),
('2021-01-06',10024);
select * from payment_log;
select * from user_log;

题目拆解:

# 理解指标概念:"回流玩家"定义连续7天未登录的玩家

# 如何将指标意义运用到公式计算?

  • 题目的分组依据是登录日期,要聚合的字段是用户数量

  • 这里的限制条件是连续7天未登录的玩家,所以首先要对每个用户的玩家类型进行判断,看是不是上次登录时间是不是7天之前

  • 题目已给出了登录的日期,所以本题考虑用滑动窗口函数lag(),计算出“上次登录日期”,再用timestampdiff计算出“登录间隔天数”,得到新的表

  • 接下来就是条件判断,选出“登录间隔天数”大于7天的用户,在这里,我们不禁思考,可以尝试where或者having语句吗?答案是不可以,只能通过子查询

因为一个查询中,无法用where对聚合字段进行筛选(where执行顺序先与聚合),也无法用having语句作用于开窗函数lag(),帮助我们筛选出登录间隔天数大于7的用户


(具体原因我们通过下面的代码来演示)


参考答案:

select 
login_date,
count(distinct uid) as 回流玩家数量
from 
(select 
uid, 
login_date,
lag(login_date,1) over(partition by uid order by login_date) as 上次登录日期,
timestampdiff(day,lag(login_date,1) over(partition by uid order by login_date),login_date) as 登录间隔天数
from
login_user ) as T
where 登录间隔天数>7
group by login_date;

执行结果:

image.png

思考一:

有的小伙伴可能会尝试用开窗函数count()对回流玩家数进行计数,可以吗?我们来尝试一下

select 
login_date,
count(distinct uid) as 回流玩家数量,
count(uid) over(partition by login_date) as 回流玩家数数量1
from 
(select 
uid, 
login_date,
lag(login_date,1) over(partition by uid order by login_date) as 上次登录日期,
timestampdiff(day,lag(login_date,1) over(partition by uid order by login_date),login_date) as 登录间隔天数
from
login_user ) as T
where 登录间隔天数>7
group by login_date;

执行结果:一月14号的回流玩家数少了一个,为什么呢?

image.png

这是因为我们在主查询中有group by的分组限制,一个日期只能得到一个聚合结果,而在sql的执行顺序中,select字段后的开窗函数count()over()只能对剩下的5行进行计数,那有的小伙伴可能又会问了,那为什么count函数可以呢?如果我们只对group by后的结果select用户id,看看结果如何?

select 
login_date,
uid
from 
(select 
uid, 
login_date,
lag(login_date,1) over(partition by uid order by login_date) as 上次登录日期,
timestampdiff(day,lag(login_date,1) over(partition by uid order by login_date),login_date) as 登录间隔天数
from
login_user ) as T
where 登录间隔天数>7
group by login_date;

image.png

一月14号的回流玩家只有一个,这是为什么呢?需要注意,在分组规则中,如果分组后选择显示的字段不是聚合字段,如这里的UID,那么sql会默认选出在存储的数据中的一条uid记录,但实际数据表中还是有两条UID记录,所以,在使用count函数计数是,得到的结果是2,而如果使用count开窗函数,则是对显示在group by 分组后只有一条uid的字段再进行开窗,得到的结果自然只有1


思考2:为什么不能用where和having对子查询的结果进行条件筛选呢?我们分别来尝试一下

select 
uid, 
login_date,
lag(login_date,1) over(partition by uid order by login_date) as 上次登录日期,
timestampdiff(day,lag(login_date,1) over(partition by uid order by login_date),login_date) as 登录间隔天数
from
login_user 
group by login_date
having timestampdiff(day,lag(login_date,1) over(partition by uid order by login_date),login_date) >7;

结果直接报错:

You cannot use the window function 'lag' in this context.' 0.00044 sec

原因分析:其实这还是跟sql语句的执行顺序有关,在sql语句中,select后面的字段是最后才执行的,而having先于select后面的字段,所以当having+开窗聚合字段,自然会报错!那用where呢?

select 
uid, 
login_date,
lag(login_date,1) over(partition by uid order by login_date) as 上次登录日期,
timestampdiff(day,lag(login_date,1) over(partition by uid order by login_date),login_date) as 登录间隔天数
from
login_user 
where timestampdiff(day,lag(login_date,1) over(partition by uid order by login_date),login_date) >7
group by login_date;

同样也是报错!!!

Error Code: 3593. You cannot use the window function 'lag' in this context.'

道理也是一样的,因为在sql执行语句中,where作为条件筛选,执行顺序是先于开窗函数的!需要特别注意,所以在本题中,我们只能通过将子查询的结果作为一张表,再对这张表里的字段进行聚合运算!


本题讲解到此,希望对这几个知识点有迷惑的小伙伴有所帮助!



2、提取每个用户每日累计充值金额

练习代码:

create table payment_log(
pay_date date,
user_id int,
revenue decimal,
order_id int primary key
);
create table user_log(
login_date date,
user_id int,
primary key(login_date,user_id)
);
insert into payment_log values
('2021-01-01',10001,50,801),
('2021-01-01',10001,60,802),
('2021-01-01',10001,80,803),
('2021-01-02',10001,55,804),
('2021-01-02',10001,56,805),
('2021-01-02',10002,70,806),
('2021-01-02',10002,71,807),
('2021-01-03',10002,72,808),
('2021-01-05',10002,73,809),
('2021-01-11',10005,90,810),
('2021-01-21',10005,91,811),
('2021-01-31',10005,92,812),
('2021-01-03',10022,40,813);
insert into user_log values
('2021-01-01',10001),
('2021-01-02',10001),
('2021-01-14',10022),
('2021-01-01',10005),
('2021-01-11',10005),
('2021-01-21',10005),
('2021-01-31',10005),
('2021-01-02',10002),
('2021-01-03',10002),
('2021-01-03',10022),
('2021-01-05',10002),
('2021-01-06',10024);
select * from payment_log;
select * from user_log;

# 参考答案 ✅

select 
payment_log.user_id,
    sum(revenue) as 最后登录7天内的充值金额
from payment_log
left join (select user_id,max(login_date) as 最后登录日期 from user_log group by user_id) as t
on payment_log.user_id=t.user_id
where timestampdiff(day,pay_date,最后登录日期)<=7
group by payment_log.user_id;

image.png

# 错误答案❌

select 
distinct (payment_log.user_id),
sum(revenue) as 充值金额
from
payment_log
inner join (
select *
from(
select *,
max(login_date) over(partition by user_id) as 最大登录日期,
(max(login_date) over(partition by user_id)-login_date) as  登录天数差
from 
user_log) as T表
where 登录天数差<=7) as T1表
on T1表.user_id=payment_log.user_id
group by payment_log.user_id;

image.png

为什么呢?

这条错误的代码结果比正确答案多了10022这条记录,因为在条件筛选的时候是用最后登录的时间减去付款时间,而不是减去登录时间!!!所以平时计算时注意理清楚指标












63.7176
0
关注作者
收藏
评论(0)

发表评论

暂无数据
推荐帖子