开篇定场诗
“问渠哪得清如许, 为有源头活水来。”——朱熹
不断挑战困难,才是成长正确的打开方式,当不断学习和进步成为一种习惯,有一天你自己也不知道你自己会变得有多优秀!💪🏻~
题目来源:腾讯
题目:
1、计算每天回流玩家数量
练习代码:
login_date date,
user_id int,
primary key(login_date,user_id)
);
insert into payment_log values
('2021-01-01',10001,50,801),
('2021-01-01',10001,60,802),
('2021-01-01',10001,80,803),
('2021-01-02',10001,55,804),
('2021-01-02',10001,56,805),
('2021-01-02',10002,70,806),
('2021-01-02',10002,71,807),
('2021-01-03',10002,72,808),
('2021-01-05',10002,73,809),
('2021-01-11',10005,90,810),
('2021-01-21',10005,91,811),
('2021-01-31',10005,92,812),
('2021-01-03',10022,40,813);
insert into user_log values
('2021-01-01',10001),
('2021-01-02',10001),
('2021-01-14',10022),
('2021-01-01',10005),
('2021-01-11',10005),
('2021-01-21',10005),
('2021-01-31',10005),
('2021-01-02',10002),
('2021-01-03',10002),
('2021-01-03',10022),
('2021-01-05',10002),
('2021-01-06',10024);
select * from payment_log;
select * from user_log;
题目拆解:
# 理解指标概念:"回流玩家"定义连续7天未登录的玩家
# 如何将指标意义运用到公式计算?
题目的分组依据是登录日期,要聚合的字段是用户数量
这里的限制条件是连续7天未登录的玩家,所以首先要对每个用户的玩家类型进行判断,看是不是上次登录时间是不是7天之前
题目已给出了登录的日期,所以本题考虑用滑动窗口函数lag(),计算出“上次登录日期”,再用timestampdiff计算出“登录间隔天数”,得到新的表
接下来就是条件判断,选出“登录间隔天数”大于7天的用户,在这里,我们不禁思考,可以尝试where或者having语句吗?答案是不可以,只能通过子查询
因为一个查询中,无法用where对聚合字段进行筛选(where执行顺序先与聚合),也无法用having语句作用于开窗函数lag(),帮助我们筛选出登录间隔天数大于7的用户
(具体原因我们通过下面的代码来演示)
参考答案:
select login_date, count(distinct uid) as 回流玩家数量 from (select uid, login_date, lag(login_date,1) over(partition by uid order by login_date) as 上次登录日期, timestampdiff(day,lag(login_date,1) over(partition by uid order by login_date),login_date) as 登录间隔天数 from login_user ) as T where 登录间隔天数>7 group by login_date;
执行结果:
思考一:
有的小伙伴可能会尝试用开窗函数count()对回流玩家数进行计数,可以吗?我们来尝试一下
select login_date, count(distinct uid) as 回流玩家数量, count(uid) over(partition by login_date) as 回流玩家数数量1 from (select uid, login_date, lag(login_date,1) over(partition by uid order by login_date) as 上次登录日期, timestampdiff(day,lag(login_date,1) over(partition by uid order by login_date),login_date) as 登录间隔天数 from login_user ) as T where 登录间隔天数>7 group by login_date;
执行结果:一月14号的回流玩家数少了一个,为什么呢?
这是因为我们在主查询中有group by的分组限制,一个日期只能得到一个聚合结果,而在sql的执行顺序中,select字段后的开窗函数count()over()只能对剩下的5行进行计数,那有的小伙伴可能又会问了,那为什么count函数可以呢?如果我们只对group by后的结果select用户id,看看结果如何?
select login_date, uid from (select uid, login_date, lag(login_date,1) over(partition by uid order by login_date) as 上次登录日期, timestampdiff(day,lag(login_date,1) over(partition by uid order by login_date),login_date) as 登录间隔天数 from login_user ) as T where 登录间隔天数>7 group by login_date;
一月14号的回流玩家只有一个,这是为什么呢?需要注意,在分组规则中,如果分组后选择显示的字段不是聚合字段,如这里的UID,那么sql会默认选出在存储的数据中的一条uid记录,但实际数据表中还是有两条UID记录,所以,在使用count函数计数是,得到的结果是2,而如果使用count开窗函数,则是对显示在group by 分组后只有一条uid的字段再进行开窗,得到的结果自然只有1
思考2:为什么不能用where和having对子查询的结果进行条件筛选呢?我们分别来尝试一下
select uid, login_date, lag(login_date,1) over(partition by uid order by login_date) as 上次登录日期, timestampdiff(day,lag(login_date,1) over(partition by uid order by login_date),login_date) as 登录间隔天数 from login_user group by login_date having timestampdiff(day,lag(login_date,1) over(partition by uid order by login_date),login_date) >7;
结果直接报错:
You cannot use the window function 'lag' in this context.' 0.00044 sec
原因分析:其实这还是跟sql语句的执行顺序有关,在sql语句中,select后面的字段是最后才执行的,而having先于select后面的字段,所以当having+开窗聚合字段,自然会报错!那用where呢?
select uid, login_date, lag(login_date,1) over(partition by uid order by login_date) as 上次登录日期, timestampdiff(day,lag(login_date,1) over(partition by uid order by login_date),login_date) as 登录间隔天数 from login_user where timestampdiff(day,lag(login_date,1) over(partition by uid order by login_date),login_date) >7 group by login_date;
同样也是报错!!!
Error Code: 3593. You cannot use the window function 'lag' in this context.'
道理也是一样的,因为在sql执行语句中,where作为条件筛选,执行顺序是先于开窗函数的!需要特别注意,所以在本题中,我们只能通过将子查询的结果作为一张表,再对这张表里的字段进行聚合运算!
本题讲解到此,希望对这几个知识点有迷惑的小伙伴有所帮助!
2、提取每个用户每日累计充值金额
练习代码:
create table payment_log( pay_date date, user_id int, revenue decimal, order_id int primary key ); create table user_log( login_date date, user_id int, primary key(login_date,user_id) ); insert into payment_log values ('2021-01-01',10001,50,801), ('2021-01-01',10001,60,802), ('2021-01-01',10001,80,803), ('2021-01-02',10001,55,804), ('2021-01-02',10001,56,805), ('2021-01-02',10002,70,806), ('2021-01-02',10002,71,807), ('2021-01-03',10002,72,808), ('2021-01-05',10002,73,809), ('2021-01-11',10005,90,810), ('2021-01-21',10005,91,811), ('2021-01-31',10005,92,812), ('2021-01-03',10022,40,813); insert into user_log values ('2021-01-01',10001), ('2021-01-02',10001), ('2021-01-14',10022), ('2021-01-01',10005), ('2021-01-11',10005), ('2021-01-21',10005), ('2021-01-31',10005), ('2021-01-02',10002), ('2021-01-03',10002), ('2021-01-03',10022), ('2021-01-05',10002), ('2021-01-06',10024); select * from payment_log; select * from user_log;
# 参考答案 ✅
select payment_log.user_id, sum(revenue) as 最后登录7天内的充值金额 from payment_log left join (select user_id,max(login_date) as 最后登录日期 from user_log group by user_id) as t on payment_log.user_id=t.user_id where timestampdiff(day,pay_date,最后登录日期)<=7 group by payment_log.user_id;
# 错误答案❌
select distinct (payment_log.user_id), sum(revenue) as 充值金额 from payment_log inner join ( select * from( select *, max(login_date) over(partition by user_id) as 最大登录日期, (max(login_date) over(partition by user_id)-login_date) as 登录天数差 from user_log) as T表 where 登录天数差<=7) as T1表 on T1表.user_id=payment_log.user_id group by payment_log.user_id;
为什么呢?
这条错误的代码结果比正确答案多了10022这条记录,因为在条件筛选的时候是用最后登录的时间减去付款时间,而不是减去登录时间!!!所以平时计算时注意理清楚指标








暂无数据