Pandas – 10.1 Aggregation: groupby with agg/aggregate
Methods and functions that can be used with groupby
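count / np.count_nonzero: frequency count, excluding NaN values (np.count_nonzero actually counts non-zero entries, so the two are guaranteed to match only when the data contains neither zeros nor NaN values)
size: frequency count, including NaN values
mean / np.mean: mean
std / np.std: sample standard deviation (np.std defaults to ddof=0 and needs ddof=1 to match)
min / np.min: minimum
quantile(q=0.25) / np.percentile(q=25): lower quartile
quantile(q=0.5) / np.percentile(q=50): median
quantile(q=0.75) / np.percentile(q=75): upper quartile
max / np.max: maximum
sum / np.sum: sum
var / np.var: unbiased variance (np.var defaults to ddof=0 and needs ddof=1 to match)
sem / scipy.stats.sem: standard error of the mean
describe / scipy.stats.describe: summary statistics
first: returns the first row
last: returns the last row
nth: returns the n-th row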
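The difference between count and size, and between the pandas and NumPy defaults for std, is easiest to see on a tiny example. The following is a minimal sketch on made-up data; the column names `group` and `value` and the values themselves are assumptions for illustration and are not part of the gapminder data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group "a" contains one missing value.
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b"],
    "value": [1.0, np.nan, 3.0, 2.0, 4.0],
})
grouped = toy.groupby("group")["value"]

counts = grouped.count()  # non-NaN values per group
sizes = grouped.size()    # all rows per group, NaN included
print(counts)
print(sizes)

# pandas std uses ddof=1 (sample form); np.std uses ddof=0 unless told otherwise.
b_values = toy.loc[toy["group"] == "b", "value"].to_numpy()
pandas_std = grouped.std()["b"]
numpy_std_default = np.std(b_values)
numpy_std_sample = np.std(b_values, ddof=1)
print(pandas_std, numpy_std_default, numpy_std_sample)
```

For group "a", count reports 2 while size reports 3, and the pandas standard deviation of group "b" agrees with np.std only when ddof=1 is passed.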
import pandas as pd
df = pd.read_csv('data/gapminder.tsv', sep='\t')  # the file is tab-separated
continent_describe = df.groupby('continent').lifeExp.describe()
print(continent_describe)
'''
           count       mean        std     min       25%      50%       75%  \
continent
Africa     624.0  48.865330   9.150210  23.599  42.37250  47.7920  54.41150
Americas   300.0  64.658737   9.345088  37.579  58.41000  67.0480  71.69950
Asia       396.0  60.064903  11.864532  28.801  51.42625  61.7915  69.50525
Europe     360.0  71.903686   5.433178  43.585  69.57000  72.2410  75.45050
Oceania     24.0  74.326208   3.795611  69.120  71.20500  73.6650  77.55250

              max
continent
Africa     76.442
Americas   80.653
Asia       82.603
Europe     81.757
Oceania    81.235
'''
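Aggregation functions
Besides the functions listed above, you can call the agg or aggregate method and pass in whichever aggregation function you want to use.
- Pass a function from another library
- Pass a user-defined function
Passing a function from another library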
import numpy as np
cont_le_agg = df.groupby('continent').lifeExp.agg(np.mean)
print(cont_le_agg)
'''
continent
Africa 48.865330
Americas 64.658737
Asia 60.064903
Europe 71.903686
Oceania 74.326208
Name: lifeExp, dtype: float64
'''
cont_le_agg2 = df.groupby('continent').lifeExp.aggregate(np.mean)
print(cont_le_agg2)
'''
continent
Africa 48.865330
Americas 64.658737
Asia 60.064903
Europe 71.903686
Oceania 74.326208
Name: lifeExp, dtype: float64
'''
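The two outputs above are identical, which suggests that agg and aggregate can be used interchangeably here. The sketch below checks this on a small made-up DataFrame (the data is an assumption for illustration) and also passes the function by its string name, which avoids the warning that some recent pandas releases emit when a NumPy callable such as np.mean is handed to agg.

```python
import pandas as pd

# Hypothetical toy data standing in for the gapminder columns.
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "lifeExp": [60.0, 70.0, 75.0, 80.0],
})
grouped = toy.groupby("continent")["lifeExp"]

by_method = grouped.mean()                # built-in groupby method
by_agg = grouped.agg("mean")              # short spelling, string name
by_aggregate = grouped.aggregate("mean")  # long spelling, string name
print(by_agg)
```

All three results hold the same per-continent means, so the choice between them is mostly a matter of style.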
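Custom functions
def my_mean(values):
    """Compute the arithmetic mean of a sequence of values."""
    n = len(values)
    total = 0  # named total to avoid shadowing the built-in sum()
    for value in values:
        total += value
    return total / n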
agg_my_mean = df.groupby('continent').lifeExp.aggregate(my_mean)
print(agg_my_mean)
'''
continent
Africa 48.865330
Americas 64.658737
Asia 60.064903
Europe 71.903686
Oceania 74.326208
Name: lifeExp, dtype: float64
'''
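A custom aggregation function can take several parameters: the first is the sequence of values, and the remaining ones are passed to agg as keyword arguments.
def my_mean_diff(values, diff_value):
    """Return the mean of values minus diff_value."""
    n = len(values)
    total = 0  # named total to avoid shadowing the built-in sum()
    for value in values:
        total += value
    mean = total / n
    return mean - diff_value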
global_mean = df.lifeExp.mean()
print(global_mean) # 59.47443936619713
agg_mean_diff = df.groupby('year').lifeExp.agg(my_mean_diff, diff_value=global_mean)
print(agg_mean_diff)
'''
year
1952 -10.416820
1957 -7.967038
1962 -5.865190
1967 -3.796150
1972 -1.827053
1977 0.095718
1982 2.058758
1987 3.738173
1992 4.685899
1997 5.540237
2002 6.220483
2007 7.532983
Name: lifeExp, dtype: float64
'''
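The keyword-argument mechanism above can be tried on data small enough to verify by hand. This is a minimal sketch under assumed toy values; the helper `mean_minus` is a hypothetical stand-in for my_mean_diff, and the lambda variant is shown only as an alternative way of binding the extra argument.

```python
import pandas as pd

def mean_minus(values, diff_value):
    """Return the mean of values minus diff_value (hypothetical helper)."""
    return values.mean() - diff_value

# Hypothetical toy data: two years, two observations each.
toy = pd.DataFrame({
    "year": [1952, 1952, 2007, 2007],
    "lifeExp": [40.0, 50.0, 70.0, 80.0],
})
overall_mean = toy["lifeExp"].mean()  # 60.0 for this toy data

# Extra keyword arguments given to agg are forwarded to the function.
by_keyword = toy.groupby("year")["lifeExp"].agg(mean_minus, diff_value=overall_mean)

# The same result, binding the extra argument with a lambda instead.
by_lambda = toy.groupby("year")["lifeExp"].agg(lambda s: mean_minus(s, overall_mean))
print(by_keyword)
```

With these values the 1952 group sits 15 below the overall mean and the 2007 group 15 above it.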
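Passing several functions at once
- To compute several aggregations for one series, put the functions in a Python list and pass the list to agg.
- To apply different aggregation functions to different series, pass a dictionary to agg.
Computing several aggregations for one series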
gdf = df.groupby('year').lifeExp.agg([np.mean, np.std, np.count_nonzero])
print(gdf)
'''
mean std count_nonzero
year
1952 49.057620 12.225956 142.0
1957 51.507401 12.231286 142.0
1962 53.609249 12.097245 142.0
1967 55.678290 11.718858 142.0
1972 57.647386 11.381953 142.0
1977 59.570157 11.227229 142.0
1982 61.533197 10.770618 142.0
1987 63.212613 10.556285 142.0
1992 64.160338 11.227380 142.0
1997 65.014676 11.559439 142.0
2002 65.694923 12.279823 142.0
2007 67.007423 12.073021 142.0
'''
# Same aggregation, with the columns renamed and the index reset to a regular column
gdf = (
    df.groupby('year').lifeExp
    .agg([np.mean, np.std, np.count_nonzero])
    .rename(columns={'mean': 'avg',
                     'count_nonzero': 'count',
                     'std': 'std_dev'})
    .reset_index()
)
print(gdf)
'''
year avg std_dev count
0 1952 49.057620 12.225956 142.0
1 1957 51.507401 12.231286 142.0
2 1962 53.609249 12.097245 142.0
3 1967 55.678290 11.718858 142.0
4 1972 57.647386 11.381953 142.0
5 1977 59.570157 11.227229 142.0
6 1982 61.533197 10.770618 142.0
7 1987 63.212613 10.556285 142.0
8 1992 64.160338 11.227380 142.0
9 1997 65.014676 11.559439 142.0
10 2002 65.694923 12.279823 142.0
11 2007 67.007423 12.073021 142.0
'''
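The rename step above can also be avoided by naming the output columns directly in the call to agg. The sketch below uses pandas named aggregation (keyword = function name) on made-up data; the toy values are assumptions, and the string 'count' is swapped in for np.count_nonzero, so it counts non-NaN values instead of non-zero ones.

```python
import pandas as pd

# Hypothetical toy data: two years, two observations each.
toy = pd.DataFrame({
    "year": [1952, 1952, 2007, 2007],
    "lifeExp": [40.0, 50.0, 70.0, 80.0],
})

# Each keyword becomes an output column; its value names the aggregation.
named = (
    toy.groupby("year")["lifeExp"]
    .agg(avg="mean", std_dev="std", count="count")
    .reset_index()
)
print(named)
```

The resulting columns are year, avg, std_dev, and count, in the order the keywords were given.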
Applying different aggregation functions to different series, on a DataFrame
gdf_dict = df.groupby('year').agg({
'lifeExp':'mean',
'pop':'median',
'gdpPercap':'median'})
print(gdf_dict)
'''
lifeExp pop gdpPercap
year
1952 49.057620 3943953.0 1968.528344
1957 51.507401 4282942.0 2173.220291
1962 53.609249 4686039.5 2335.439533
1967 55.678290 5170175.5 2678.334741
1972 57.647386 5877996.5 3339.129407
1977 59.570157 6404036.5 3798.609244
1982 61.533197 7007320.0 4216.228428
1987 63.212613 7774861.5 4280.300366
1992 64.160338 8688686.5 4386.085502
1997 65.014676 9735063.5 4781.825478
2002 65.694923 10372918.5 5319.804524
2007 67.007423 10517531.0 6124.371109
'''
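A dictionary value can also be a list of functions, in which case a column receives several aggregations at once. The following sketch, on made-up data (the values are assumptions for illustration), shows that the result then carries two-level column labels of the form (column, function).

```python
import pandas as pd

# Hypothetical toy data with two numeric columns.
toy = pd.DataFrame({
    "year": [1952, 1952, 2007, 2007],
    "lifeExp": [40.0, 50.0, 70.0, 80.0],
    "pop": [100, 300, 200, 600],
})

# lifeExp gets two aggregations, pop gets one.
multi = toy.groupby("year").agg({
    "lifeExp": ["mean", "max"],
    "pop": "median",
})
print(multi)

# Individual cells are addressed with a (column, function) tuple.
mean_1952 = multi.loc[1952, ("lifeExp", "mean")]
median_pop_2007 = multi.loc[2007, ("pop", "median")]
```

If flat column names are preferred, the two levels can be joined afterwards, or named aggregation can be used instead.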