京公网安备 11010802034615号
经营许可证编号:京B2-20210330
R语言利器之ddply和aggregate
ddply和aggregate是两个用来整合数据的功能强大的函数。
aggregate(x, ...)
关于aggregate()函数的使用在《R语言实战》中P105有简单描述,这里重新说一下。此函数主要有一下几种用法:
## Default S3 method:
aggregate(x, ...)

## S3 method for class 'data.frame'
aggregate(x, by, FUN, ..., simplify = TRUE, drop = TRUE)
## S3 method for class 'formula'
aggregate(formula, data, FUN, ...,subset, na.action = na.omit)
## S3 method for class 'ts'
aggregate(x, nfrequency = 1, FUN = sum, ndeltat = 1,ts.eps = getOption("ts.eps"), ...)
例:
attach(mtcars)
aggdata <-aggregate(mtcars, by=list(cyl,gear), FUN=mean, na.rm=TRUE)
aggdata
Group.1 Group.2 mpg cyl disp hp drat wt qsec vs am gear carb
1 4 3 21.500 4 120.1000 97.0000 3.700000 2.465000 20.0100 1.0 0.00 3 1.000000
2 6 3 19.750 6 241.5000 107.5000 2.920000 3.337500 19.8300 1.0 0.00 3 1.000000
3 8 3 15.050 8 357.6167 194.1667 3.120833 4.104083 17.1425 0.0 0.00 3 3.083333
4 4 4 26.925 4 102.6250 76.0000 4.110000 2.378125 19.6125 1.0 0.75 4 1.500000
5 6 4 19.750 6 163.8000 116.5000 3.910000 3.093750 17.6700 0.5 0.50 4 4.000000
6 4 5 28.200 4 107.7000 102.0000 4.100000 1.826500 16.8000 0.5 1.00 5 2.000000
7 6 5 19.700 6 145.0000 175.0000 3.620000 2.770000 15.5000 0.0 1.00 5 6.000000
8 8 5 15.400 8 326.0000 299.5000 3.880000 3.370000 14.5500 0.0 1.00 5 6.000000
得到数据框aggdata,其中的Group.1和Group.2的列名可以指定,只需第二行写成:
1
aggdata <-aggregate(mtcars, by=list(Group.cyl=cyl, Group.gears=gear),FUN=mean, na.rm=TRUE)
即可。
注意:在使用aggregate()函数的时候, by中的变量必须在一个列表中(即使只有一个变量) 。 指定的函数FUN可为任意的内建或自编函数 。
其他的一些例子:
## Compute the averages for the variables in 'state.x77', grouped
## according to the region (Northeast, South, North Central, West) that
## each state belongs to.
aggregate(state.x77, list(Region = state.region), mean)
## Compute the averages according to region and the occurrence of more
## than 130 days of frost.
aggregate(state.x77,
list(Region = state.region,Cold = state.x77[,"Frost"] > 130),
mean)
## (Note that no state in 'South' is THAT cold.)
## example with character variables and NAs
testDF <- data.frame(v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99) )
by1 <- c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12)
by2 <- c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA)
aggregate(x = testDF, by = list(by1, by2), FUN = "mean")
# and if you want to treat NAs as a group
fby1 <- factor(by1, exclude = "")
fby2 <- factor(by2, exclude = "")
aggregate(x = testDF, by = list(fby1, fby2), FUN = "mean")
## Formulas, one ~ one, one ~ many, many ~ one, and many ~ many:
aggregate(weight ~ feed, data = chickwts, mean)
aggregate(breaks ~ wool + tension, data = warpbreaks, mean)
aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)
aggregate(cbind(ncases, ncontrols) ~ alcgp + tobgp, data = esoph, sum)
## Dot notation:
aggregate(. ~ Species, data = iris, mean)
aggregate(len ~ ., data = ToothGrowth, mean)
## Often followed by xtabs():
ag <- aggregate(len ~ ., data = ToothGrowth, mean)
xtabs(len ~ ., data = ag)
## Compute the average annual approval ratings for American presidents.
aggregate(presidents, nfrequency = 1, FUN = mean)
## Give the summer less weight.
aggregate(presidents, nfrequency = 1,
FUN = weighted.mean, w = c(1, 1, 0.5, 1))
ddplycda数据分析师培训
下面是ddply函数的一般用法:
# Summarize a dataset by two variables
dfx <- data.frame(
group = c(rep('A', 8), rep('B', 15), rep('C', 6)),
sex = sample(c("M", "F"), size = 29, replace = TRUE),
age = runif(n = 29, min = 18, max = 54)
)
head(dfx)
group sex age
1 A M 22.44750
2 A M 52.92616
3 A F 30.00443
4 A M 39.56907
5 A M 18.89180
6 A F 50.81139
#Note the use of the '.' function to allow
# group and sex to be used without quoting
ddply(dfx, .(group, sex), summarize,mean = round(mean(age), 2),sd = round(sd(age), 2))
group sex mean sd
1 A F 40.41 14.71
2 A M 30.35 13.17
3 B F 34.81 12.76
4 B M 34.04 13.36
5 C F 35.09 13.39
6 C M 28.53 4.57
# An example using a formula for .variables
ddply(baseball[1:100,], ~ year, nrow)
year V1
1 1871 7
2 1872 13
3 1873 13
4 1874 15
5 1875 17
6 1876 15
7 1877 17
8 1878 3
# Applying two functions; nrow and ncol
ddply(baseball, .(lg), c("nrow", "ncol"))
lg nrow ncol
1 65 22
2 AA 171 22
3 AL 10007 22
4 FL 37 22
5 NL 11378 22
6 PL 32 22
7 UA 9 22
# Calculate mean runs batted in for each year
rbi <- ddply(baseball, .(year), summarise,mean_rbi = mean(rbi, na.rm = TRUE))
head(rbi)
year mean_rbi
1 1871 22.28571
2 1872 20.53846
3 1873 30.92308
4 1874 29.00000
5 1875 31.58824
6 1876 30.13333
# Plot a line chart of the result
plot(mean_rbi ~ year, type = "l", data = rbi)
# make new variable career_year based on the
# start year for each player (id)
base2 <- ddply(baseball, .(id), mutate,career_year = year - min(year) + 1)
head(base2)
id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp career_year
1 aaronha01 1954 1 ML1 NL 122 468 58 131 27 6 13 69 2 2 28 39 NA 3 6 4 13 1
2 aaronha01 1955 1 ML1 NL 153 602 105 189 37 9 27 106 3 1 49 61 5 3 7 4 20 2
3 aaronha01 1956 1 ML1 NL 153 609 106 200 34 14 26 92 2 4 37 54 6 2 5 7 21 3
4 aaronha01 1957 1 ML1 NL 151 615 118 198 27 6 44 132 1 1 57 58 15 0 0 3 13 4
5 aaronha01 1958 1 ML1 NL 153 601 109 196 34 4 30 95 4 1 59 49 16 1 0 3 21 5
6 aaronha01 1959 1 ML1 NL 154 629 116 223 46 7 39 123 8 0 51 54 17 4 0 9 19 6
数据分析咨询请扫描二维码
若不方便扫码,搜微信号:CDAshujufenxi
【核心关键词】软件、洞察力、大数据、产品、经验、硬件、流量、创新、决策、数据安全、网络安全、数据分析、决策制定、数据挖 ...
2026-06-18在方案选型、效果复盘、产品评估、供应商筛选等各类业务决策场景中,仅凭单一指标下结论往往会陷入 “以偏概全” 的误区。多维度 ...
2026-06-18 很多数据分析师精通Excel单元格操作,但当被问到“表结构数据的基本处理单位是什么”“字段和记录的本质区别”“为什么表结 ...
2026-06-18在数据分析、用户运营与业务增长的工作体系中,漏斗拆解是最基础也最高频的问题定位方法。很多业务场景下,我们只能看到最终的转 ...
2026-06-17在数据库开发、数据清洗与报表统计场景中,数值类型转换为日期是高频刚需操作。业务系统常以 Unix 时间戳、整型日期(如20240617 ...
2026-06-17 数据分析师八成以上的时间在和数据表格打交道,但许多人拿到Excel后习惯性地先算、先分析,结果回头发现漏了一列关键数据, ...
2026-06-17【核心关键词】数据库、电商、知识、产品、数据产品、监管业务、产品经理、业务系统、用户行为分析、用户分析、数据分析、电商 ...
2026-06-16在 Python 动态类型与面向对象的编程体系中,变量定义与类实例化是构建代码逻辑的两大核心基石。变量是数据存储、传递与运算的基 ...
2026-06-16 很多数据分析师每天与Excel打交道,但当被问到“表格结构数据和表结构数据有什么区别”“数据类型误判会引发哪些分析错误” ...
2026-06-16在 MySQL 查询性能优化体系中,索引是降低查询耗时、提升数据库吞吐的核心手段。其中联合索引与覆盖索引是实际开发中最高频的两 ...
2026-06-15在数据仓库建设与商业智能分析体系中,维度建模是应用最广泛的建模方法论,而事实表与维度表是维度建模的两大核心构件,共同构成 ...
2026-06-15 很多数据分析师能熟练计算指标,但当被问到“这家企业的核心业务目标是什么”“如何把模糊的战略目标拆解为可量化的指标”“ ...
2026-06-15在数据分析、业务监控、运营复盘等场景中,列值趋势计算是核心需求之一。无论是分析销售额的月度增长、用户活跃的变化趋势、库存 ...
2026-06-12在数字经济深度渗透的当下,消费者的购买行为已从过去的 “被动接受” 转变为 “主动决策”。流量红利消退、获客成本攀升、用户 ...
2026-06-12CDA三级认证是三个级别中的塔尖,全面考察数据战略、团队领导和复杂项目的综合能力。它所对应的《敏捷数据挖掘》教材,不再局限 ...
2026-06-12在游戏产业的商业逻辑中,付费玩家是支撑游戏生存与发展的核心支柱。行业普遍遵循 “二八定律”:20% 的付费玩家贡献了游戏 80% ...
2026-06-11【核心关键词】企业、定位、传统、产品、互联网、可视化、业务侧、数字化、结构化、数据分析、传统制造业、市场状态、发展空间 ...
2026-06-11 解读《CDA二级教材:量化策略分析(2025)》的全景结构与学习逻辑 ” CDA二级认证是企业招聘数据分析师时最常提及的证书门槛 ...
2026-06-11【核心关键词】药企、可视化、营销、分类、数据分析师、销售数据、业务人员、指导方向、分析报告、营销数据、营销医生 【专访摘 ...
2026-06-10在统计学分析、问卷调研、实验验证、业务复盘等场景中,卡方检验与 T 检验是应用最广泛的两类基础假设检验方法。前者专门处理分 ...
2026-06-10