Useful summary functions - CDA Q&A Community

not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(
    # Average arrival delay:
    avg_delay1 = mean(arr_delay),
    # Average positive arrival delay:
    avg_delay2 = mean(arr_delay[arr_delay > 0])
  )
#> Source: local data frame [365 x 5]
#> Groups: year, month [?]
#>
#> year month day avg_delay1 avg_delay2
#> <int> <int> <int> <dbl> <dbl>
#> 1 2013 1 1 12.65 32.5
#> 2 2013 1 2 12.69 32.0
#> 3 2013 1 3 5.73 27.7
#> 4 2013 1 4 -1.93 28.3
#> 5 2013 1 5 -1.53 22.6
#> 6 2013 1 6 4.24 24.4
#> # ... with 359 more rows
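
The second summary above combines aggregation with logical subsetting. A minimal base R sketch of the same idea, using a small made-up vector (the values are illustrative and not taken from the flights data):

```r
# Hypothetical arrival delays in minutes; negative means an early arrival.
arr_delay <- c(-5, 10, 30, -20)

# Mean over all flights, early arrivals included.
avg_all <- mean(arr_delay)

# Mean over delayed flights only: the logical index keeps values > 0.
avg_positive <- mean(arr_delay[arr_delay > 0])

print(c(avg_all = avg_all, avg_positive = avg_positive))
```

Because the subsetting happens inside the summary call, the same pattern works per group inside summarize().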

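Measures of spread: sd(x), IQR(x), and mad(x)

The standard deviation sd(x), that is, the root mean squared deviation from the mean, is the standard measure of spread. The interquartile range IQR(x) and the median absolute deviation mad(x) are robust counterparts that can be more useful when the data contain outliers:

# Why is the distance to some destinations more variable than to others?
not_cancelled %>%
  group_by(dest) %>%
  summarize(distance_sd = sd(distance)) %>%
  arrange(desc(distance_sd))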
#> # A tibble: 104 × 2
#> dest distance_sd
#> <chr> <dbl>
#> 1 EGE 10.54
#> 2 SAN 10.35
#> 3 SFO 10.22
#> 4 HNL 10.00
#> 5 SEA 9.98
#> 6 LAS 9.91
#> # ... with 98 more rows
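
To see why the robust measures matter, here is a small base R sketch on made-up numbers (not from the flights data) that computes all three measures with and without a single extreme value. The helper name spread() is hypothetical and introduced only for this illustration:

```r
# Made-up measurements with one extreme value at the end.
x <- c(10, 10, 11, 11, 12, 12, 13, 300)

# Hypothetical helper: returns the three spread measures as a named vector.
spread <- function(v) c(sd = sd(v), IQR = IQR(v), mad = mad(v))

with_outlier <- spread(x)
without_outlier <- spread(x[-length(x)])  # drop the last (extreme) value

print(rbind(with_outlier, without_outlier))
```

The standard deviation changes dramatically once the outlier is included, while IQR() and mad() stay on the same small scale.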

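Measures of rank: min(x), quantile(x, 0.25), and max(x)

Quantiles are a generalization of the median. For example, quantile(x, 0.25) finds a value of x that is greater than 25% of the values and less than the remaining 75%:

# When do the first and last flights leave each day?
not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(
    first = min(dep_time),
    last = max(dep_time)
  )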
#> Source: local data frame [365 x 5]
#> Groups: year, month [?]
#>
#> year month day first last
#> <int> <int> <int> <int> <int>
#> 1 2013 1 1 517 2356
#> 2 2013 1 2 42 2354
#> 3 2013 1 3 32 2349
#> 4 2013 1 4 25 2358
#> 5 2013 1 5 14 2357
#> 6 2013 1 6 16 2355
#> # ... with 359 more rows
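
As a quick check of the quantile definition, the following base R sketch uses the integers 0 to 100 (an arbitrary illustrative vector) and relies on R's default quantile algorithm:

```r
# An evenly spaced illustrative vector.
x <- 0:100

# First quartile and the 0.5 quantile (which is the median).
q25 <- quantile(x, 0.25)
q50 <- quantile(x, 0.5)

# Share of values strictly below the first quartile: roughly one quarter.
share_below <- mean(x < q25)

print(c(q25, q50, share_below = share_below))
```

min(x) and max(x) correspond to the 0 and 1 quantiles, which is why they sit in the same family of functions.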

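Measures of position: first(x), nth(x, 2), and last(x)

These work like x[1], x[2], and x[length(x)], but let you set a default value for when the requested position does not exist (for example, asking for the third element of a group that has only two). For example, we can find the first and last departure for each day:

not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(
    first_dep = first(dep_time),
    last_dep = last(dep_time)
  )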
#> Source: local data frame [365 x 5]
#> Groups: year, month [?]
#>
#> year month day first_dep last_dep
#> <int> <int> <int> <int> <int>
#> 1 2013 1 1 517 2356
#> 2 2013 1 2 42 2354
#> 3 2013 1 3 32 2349
#> 4 2013 1 4 25 2358
#> 5 2013 1 5 14 2357
#> 6 2013 1 6 16 2355
#> # ... with 359 more rows
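
The default-value behavior can be tried on a two-element vector. This is a minimal sketch assuming dplyr is installed; the vector and the fallback value -1 are arbitrary choices for illustration:

```r
library(dplyr)

# A "group" with only two elements.
x <- c(10, 20)

first_value <- first(x)
last_value <- last(x)

# Position 3 does not exist, so nth() falls back to a missing value ...
third <- nth(x, 3)

# ... unless an explicit default is supplied.
third_default <- nth(x, 3, default = -1)

print(c(first_value, last_value, third, third_default))
```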
These functions are complementary to filtering on ranks. Filtering returns all the variables, with each observation in its own row:

not_cancelled %>%
  group_by(year, month, day) %>%
  mutate(r = min_rank(desc(dep_time))) %>%
  filter(r %in% range(r))
#> Source: local data frame [770 x 20]
#> Groups: year, month, day [365]
#>
#> year month day dep_time sched_dep_time dep_delay
#> <int> <int> <int> <int> <int> <dbl>
#> 1 2013 1 1 517 515 2
#> 2 2013 1 1 2356 2359 -3
#> 3 2013 1 2 42 2359 43
#> 4 2013 1 2 2354 2359 -5
#> 5 2013 1 3 32 2359 33
#> 6 2013 1 3 2349 2359 -10
#> # ... with 764 more rows, and 13 more variables:
#> # arr_time <int>, sched_arr_time <int>,
#> # arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>, r <int>
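
The r %in% range(r) idiom keeps the rows holding the smallest and largest rank, that is, the latest and earliest departures. A small sketch of the same logic on a plain vector, assuming dplyr is installed and using made-up departure times:

```r
library(dplyr)

# Made-up departure times in HHMM format.
dep_time <- c(517, 2356, 1200, 830)

# Rank in descending order: the latest departure gets rank 1.
r <- min_rank(desc(dep_time))

# range(r) is c(min(r), max(r)), so this keeps the latest and earliest times.
extremes <- dep_time[r %in% range(r)]

print(r)
print(extremes)
```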

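Counts

To count the number of non-missing values, use sum(!is.na(x)). To count the number of distinct (unique) values, use n_distinct(x):

# Which destinations are served by the most carriers?
not_cancelled %>%
  group_by(dest) %>%
  summarize(carriers = n_distinct(carrier)) %>%
  arrange(desc(carriers))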
#> # A tibble: 104 × 2
#> dest carriers
#> <chr> <int>
#> 1 ATL 7
#> 2 BOS 7
#> 3 CLT 7
#> 4 ORD 7
#> 5 TPA 7
#> 6 AUS 6
#> # ... with 98 more rows
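
The two counting idioms behave differently around missing values. A minimal sketch, assuming dplyr is installed and using a made-up character vector:

```r
library(dplyr)

# Made-up carrier codes with one missing value and one repeat.
carrier <- c("AA", NA, "UA", "AA")

# Number of non-missing values.
n_non_missing <- sum(!is.na(carrier))

# n_distinct() treats NA as its own value unless na.rm = TRUE.
n_unique_with_na <- n_distinct(carrier)
n_unique <- n_distinct(carrier, na.rm = TRUE)

print(c(n_non_missing, n_unique_with_na, n_unique))
```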

Counts are so common that dplyr provides a simple helper for when all you need is a count:

not_cancelled %>%
  count(dest)
#> # A tibble: 104 × 2
#> dest n
#> <chr> <int>
#> 1 ABQ 254
#> 2 ACK 264
#> 3 ALB 418
#> 4 ANC 8
#> 5 ATL 16837
#> 6 AUS 2411
#> # ... with 98 more rows

You can optionally supply a weight variable. For example, you could use this to "count" (in effect, sum) the total number of miles each plane flew:

not_cancelled %>%
  count(tailnum, wt = distance)
#> # A tibble: 4,037 × 2
#> tailnum n
#> <chr> <dbl>
#> 1 D942DN 3418
#> 2 N0EGMQ 239143
#> 3 N10156 109664
#> 4 N102UW 25722
#> 5 N103US 24619
#> 6 N104UW 24616
#> # ... with 4,031 more rows
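
To make the "weighted count is a sum" point concrete, the sketch below compares count(wt = ) with an explicit group_by() plus sum(). It assumes dplyr is installed; the data frame planes_demo and its values are made up for illustration:

```r
library(dplyr)

# Made-up data: two flights for plane A1, one for plane B2.
planes_demo <- data.frame(
  tailnum = c("A1", "A1", "B2"),
  distance = c(100, 200, 50)
)

# Weighted count: sums distance within each tailnum.
by_count <- planes_demo %>%
  count(tailnum, wt = distance)

# The same result written as an explicit grouped sum.
by_sum <- planes_demo %>%
  group_by(tailnum) %>%
  summarize(n = sum(distance))

print(by_count)
```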

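Counts and proportions of logical values: sum(x > 10) and mean(y == 0)

When used with numeric functions, TRUE is converted to 1 and FALSE to 0. This makes sum() and mean() very useful for logical vectors: sum(x) gives the number of TRUEs in x, and mean(x) gives the proportion.

# How many flights left before 5am? (These usually indicate flights delayed from the previous day.)
not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(n_early = sum(dep_time < 500))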
#> Source: local data frame [365 x 4]
#> Groups: year, month [?]
#>
#> year month day n_early
#> <int> <int> <int> <int>
#> 1 2013 1 1 0
#> 2 2013 1 2 3
#> 3 2013 1 3 4
#> 4 2013 1 4 3
#> 5 2013 1 5 3
#> 6 2013 1 6 2
#> # ... with 359 more rows
# What proportion of flights are delayed by more than an hour?
not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(hour_perc = mean(arr_delay > 60))
#> Source: local data frame [365 x 4]
#> Groups: year, month [?]
#>
#> year month day hour_perc
#> <int> <int> <int> <dbl>
#> 1 2013 1 1 0.0722
#> 2 2013 1 2 0.0851
#> 3 2013 1 3 0.0567
#> 4 2013 1 4 0.0396
#> 5 2013 1 5 0.0349
#> 6 2013 1 6 0.0470
#> # ... with 359 more rows
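
The TRUE-to-1 conversion is easy to verify outside of a pipeline. A minimal base R sketch on made-up delays (illustrative values only):

```r
# Made-up arrival delays in minutes.
arr_delay <- c(5, 70, -3, 120, 0)

# Logical vector: TRUE where the delay exceeds one hour.
is_late <- arr_delay > 60

# sum() counts the TRUEs; mean() gives their proportion.
n_late <- sum(is_late)
prop_late <- mean(is_late)

print(c(n_late = n_late, prop_late = prop_late))
```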