How to group pandas rows when a cumsum threshold is crossed

啊啊啊啊啊吖

2019-03-01 Views: 3965

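Whenever a given cumsum threshold is exceeded, I want to group the consecutive rows. When the threshold is exceeded, the cumsum should also restart (from zero), as shown below: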

Index  Values        Regular CumSum  Wanted CumSum  Wanted Column
1      0.0666666666  0.0666666666    0.000000       0.0
2      0.0238095238  0.0904761904    0.000000       1.0
3      0.0134146341  0.1038908246    0.000000       2.0
4      0.0210135970  0.1249044216    0.013414       2.0
5      0.0072639225  0.1321683441    0.000000       3.0
6      0.0158536585  0.1480220027    0.007263       3.0
7      0.0012004801  0.1492224829    0.000000       4.0
8      0.0144230769  0.1636455598    0.001200       4.0
9      0.0130331753  0.1766787351    0.015623       4.0

In this case the threshold is 0.02 (sorry for all the decimals).
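For anyone who wants to reproduce the table, here is a minimal sketch that rebuilds the example data. The names `df`, `threshold`, and the column name `RegularCumSum` are assumptions chosen for illustration; only the `Values` column and the 0.02 threshold come from the question.

```python
import pandas as pd

# Example values copied from the table in the question.
df = pd.DataFrame(
    {
        "Values": [
            0.0666666666, 0.0238095238, 0.0134146341,
            0.0210135970, 0.0072639225, 0.0158536585,
            0.0012004801, 0.0144230769, 0.0130331753,
        ]
    },
    index=pd.RangeIndex(1, 10, name="Index"),  # labels 1..9, as in the table
)

threshold = 0.02

# The ordinary running total that never resets ("Regular CumSum").
df["RegularCumSum"] = df["Values"].cumsum()

print(df)
```

The plain `cumsum()` never restarts, which is why it cannot produce the wanted grouping on its own.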

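Any entry larger than the threshold should immediately form or close a group (for example, the entries at indices 1, 2, and 4).

The entry at index 3 is below the threshold, so it waits for the next consecutive entry. If the next entry (alone or summed with the value at index 3) exceeds the threshold, they form a new group; otherwise the entry after that is included as well (here, the entry at index 4 is above the threshold, so a new group is formed).

Entry 5 is below the 0.02 threshold, but adding entry 6 brings the sum above 0.02, so a group is closed.

Entries 7, 8, and 9 together sum to more than 0.02, so they form one group.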
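The rules above can be sketched in plain Python. This is only an illustration: the function name `wanted_columns` is hypothetical, and it assumes that a running sum exactly equal to the threshold also closes the group. For each row it records the sum carried in from earlier rows of the same group (the "Wanted CumSum" column) and the group number (the "Wanted Column").

```python
def wanted_columns(values, threshold):
    """Return (carried_sums, groups) for a resetting cumulative sum."""
    carried, groups = [], []
    running, group = 0.0, 0
    for v in values:
        carried.append(running)   # sum carried in before this row is added
        groups.append(group)      # the row joins the currently open group
        running += v
        if running >= threshold:  # threshold reached: close group, restart
            running = 0.0
            group += 1
    return carried, groups


values = [
    0.0666666666, 0.0238095238, 0.0134146341,
    0.0210135970, 0.0072639225, 0.0158536585,
    0.0012004801, 0.0144230769, 0.0130331753,
]
carried, groups = wanted_columns(values, 0.02)
print(groups)
print([round(c, 6) for c in carried])
```

On the example data, the group list comes out as 0, 1, 2, 2, 3, 3, 4, 4, 4, matching the wanted column.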

....

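I was able to write the following simple code to do this, but I'm hoping someone can help me develop a faster approach, possibly using the pandas library: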

# Assumes df has a 'Values' column and Threshold (here 0.02) is already defined
FinalList = []
index = 0
cumsum = 0
i = 0
# Loop over all entries in df by position
while i < len(df):
    # When the entry is larger than (or equal to) the threshold,
    # immediately close the group and clear cumsum
    if df.Values.iloc[i] >= Threshold:
        FinalList.append(index)
        cumsum = 0
        index += 1
    # When the entry is smaller than the threshold
    else:
        # If the previous cumsum plus the current entry surpasses the
        # threshold, the group is closed
        if cumsum + df.Values.iloc[i] > Threshold:
            FinalList.append(index)
            cumsum = 0
            index += 1
        # Otherwise, keep increasing cumsum until it crosses the threshold
        else:
            cumsum = cumsum + df.Values.iloc[i]
            FinalList.append(index)
    i += 1


啊啊啊啊啊吖

2019-03-01

A more pandas-style approach is to iterate over the DataFrame:

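threshold = 0.02
cumsum = 0
group = 0
# Series.items() replaces iteritems(), which was removed in pandas 2.0
for idx, value in df.Values.items():
    cumsum += value
    # Label the row with the currently open group
    df.loc[idx, 'Group'] = group
    # Once the running sum reaches the threshold, reset it and start a new group
    if cumsum >= threshold:
        cumsum = 0
        group += 1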

         Values  Group
Index
1      0.066667    0.0
2      0.023810    1.0
3      0.013415    2.0
4      0.021014    2.0
5      0.007264    3.0
6      0.015854    3.0
7      0.001200    4.0
8      0.014423    4.0
9      0.013033    4.0
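Since the question asks for something faster: writing one cell at a time with `df.loc` tends to be the slow part of a row loop. One possible alternative, sketched here under assumptions, is to run the same resetting-sum logic over a NumPy array and assign the whole column once. The helper name `threshold_groups` is hypothetical, and this is still a Python loop rather than a truly vectorized solution, because each reset depends on the previous rows.

```python
import numpy as np
import pandas as pd


def threshold_groups(values, threshold):
    """Group labels for a cumulative sum that resets at `threshold`."""
    values = np.asarray(values, dtype=float)
    groups = np.empty(len(values), dtype=np.int64)
    running = 0.0
    group = 0
    for i, v in enumerate(values):
        groups[i] = group          # row belongs to the currently open group
        running += v
        if running >= threshold:   # threshold reached: reset and open a new group
            running = 0.0
            group += 1
    return groups


df = pd.DataFrame(
    {
        "Values": [
            0.0666666666, 0.0238095238, 0.0134146341,
            0.0210135970, 0.0072639225, 0.0158536585,
            0.0012004801, 0.0144230769, 0.0130331753,
        ]
    },
    index=pd.RangeIndex(1, 10, name="Index"),
)

# A single column assignment instead of one df.loc write per row.
df["Group"] = threshold_groups(df["Values"].to_numpy(), 0.02)

# The labels can then feed an ordinary groupby, e.g. per-group totals.
group_sums = df.groupby("Group")["Values"].sum()
print(df)
print(group_sums)
```

With this version the `Group` column is stored as integers rather than floats, because the labels are assigned as a complete integer array.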
