题解 | #每篇文章同一时刻最大在看人数#
每篇文章同一时刻最大在看人数
http://www.nowcoder.com/practice/fe24c93008b84e9592b35faa15755e48
方法一:
思路:
- 原始数据
易错点:要筛去 artical_id=0的数据
2. 时刻的选取
- 题目要求“同一时刻”,而时间是连续变量,该选取什么时刻作为关注对象呢?根据例题画出了上图,可以直观发现,每一个有人进出的时刻都会人数变化,不妨取进出时刻为“关键时刻”point_time。
- in_time和out_time都是point_time,同时选取表格中的不同变量,可以分别选取再用union连接
- 易错点!!!一定要用union,而不是union all!!! (笔者因为这个问题卡了一下午)
前者去重,后者不去重。 若不去重,后续统计同一时刻在看的总人数时,会重复计算同时有进出的时刻的人数,导致结果错误
- 是否在看的判断标准
- 题目提到,若同一时刻有人进入也有离开,先记录进入,再记录离开,也就是同一时刻会对应两个总人数值,两值相差1。但既然题目最终只要求最大人数,对于同时有进出的时刻,我们不妨只取进入未离开的状态,也就是更大的总人数。
- 由此,将每个用户观看的时间段与point_time对照,判断某时刻某用户是否在看: if(point_time between in_time and out_time,1,0)
代码
- 取关键时刻point_time表q1
select artical_id,in_time as point_time
from tb_user_log
where artical_id!=0
union all
select artical_id,out_time as point_time
from tb_user_log
where artical_id!=0
order by artical_id,point_time
- 若不去重(union all)
- 若去重(union)
- 对每个point_time,统计所有用户在看情况,得到表q
select q1.artical_id,point_time,
sum(if(point_time between in_time and out_time,1,0)) as uv
from
(
select artical_id,in_time as point_time
from tb_user_log
where artical_id!=0
union
select artical_id,out_time as point_time
from tb_user_log
where artical_id!=0
order by artical_id,point_time
) as q1
join tb_user_log as q2
on q1.artical_id=q2.artical_id
group by artical_id,point_time
- 拼接结果(部分)
- 拼接结果整合(按id和日期sum后)
- 取每篇文章同一时刻最大在看人数
select artical_id,max(uv) as max_uv
from
(select q1.artical_id,point_time,
sum(if(point_time between in_time and out_time,1,0)) as uv
from
(
select artical_id,in_time as point_time
from tb_user_log
where artical_id!=0
union
select artical_id,out_time as point_time
from tb_user_log
where artical_id!=0
order by artical_id,point_time
) as q1
join tb_user_log as q2
on q1.artical_id=q2.artical_id
group by artical_id,point_time) as q
group by artical_id
order by max_uv desc
- 若不去重,所得结果:
所以一定要去重啊朋友们!!!
方法二:
题解区大部分友友的思路:
- 同样取关键时刻,但对进出予以区分,进diff=+1,退diff=-1
- 按时间排序后巧用**sum()over(partition by order by)**可得截止每个关键时刻的总人数
- 易错点:先进后出,故sum(diff)时不但要time asc,还要diff desc
代码:
select artical_id,max(uv) as max_uv
from
(select q1.artical_id,point_time,
sum(diff) over (partition by artical_id order by point_time,
diff desc) as uv
from
(select artical_id,in_time as point_time, 1 as diff
from tb_user_log
where artical_id!=0
union all
select artical_id,out_time as point_time,-1 as diff
from tb_user_log
where artical_id!=0
order by artical_id,point_time) as q1
) as q2
group by artical_id
order by max_uv desc
两方法对比:
- 易错点:
- 方法一由于最终取最大值,不考虑同时有进出时刻的“出”情况,故不区分in和out,但易错点也在于此,既然要sum,就要记得给point_time去重。
- 方法二区分了in和out,但易错点在于区分过后就要规定好先后,不然最大值就会受到影响。
- 思路本质:
- 方法一将时间段与时间点对照,但从结果出发取了巧,但偷懒也有偷懒要考虑的后果。
- 方法二只考虑时间点间的跳跃,只需要更严格的顺序要求便可由简洁的思路得到严谨的结果。
- 内存(虽然不太懂内存这些,但体感)
- 方法一会占用更多内存,若除id=0的情况有n条数据,那么point_time有2n个,并要逐一与n个时间段进行对照,则join后的表格会有2n*n条数据,大可不必如此!
- 方法二只需对2n个关键时刻进行窗口运算即可。