2021-10-13 10:16 (edited) 荔枝FM_数据分析师

Data processing with Pandas

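Common aggregation methods

count: number of non-NA values
describe: summary statistics for each column
min, max: minimum and maximum
argmin, argmax: integer position of the minimum and maximum (Series only)
idxmin, idxmax: index label of the minimum and maximum
quantile: sample quantile
sum, mean: column sum and mean
median: median
mad: mean absolute deviation around the mean (deprecated in pandas 1.5, removed in 2.0)
var, std: variance and standard deviation
skew: skewness (third moment)
kurt: kurtosis (fourth moment)
cumsum: cumulative sum
cummin, cummax: cumulative minimum and cumulative maximum
cumprod: cumulative product
diff: first-order difference
pct_change: percentage change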
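
A few of these methods can be tried on a small Series. This is only a sketch: the Series `s` and its values are made up for illustration so that each result is easy to verify by hand.

```python
import pandas as pd

# Hypothetical sample data, chosen so the results can be checked by hand
s = pd.Series([3, 1, 4, 1, 5], index=list("abcde"))

total = s.sum()                     # sum of all values
middle = s.median()                 # median of the sorted values 1, 1, 3, 4, 5
pos_of_max = s.argmax()             # integer position of the maximum
label_of_max = s.idxmax()           # index label of the maximum
running_min = s.cummin().tolist()   # smallest value seen so far
running_sum = s.cumsum().tolist()   # running total
first_diff = s.diff().tolist()      # difference to the previous element; first is NaN

print(total, middle, pos_of_max, label_of_max)
print(running_min, running_sum, first_diff)
```

Note the difference between argmax() and idxmax(): the first returns a position, the second a label, which only coincide when the index is the default 0, 1, 2, ... range.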

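1. Removing duplicates

duplicated() detects duplicate rows. It returns a boolean Series with one element per row; an element is True if that row has already appeared earlier in the DataFrame.
Import the required packages: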

import pandas as pd
import numpy as np
from pandas import DataFrame, Series

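# Columns: Chinese (语文), Math (数学), English (英语); the values are random, so your numbers will differ
df = DataFrame(np.random.randint(98, 100, size=(6, 3)),
               columns=["语文", "数学", "英语"],
               index=["张三", "李四", "王五", "张三", "小李", "小赵"])
print(df)
print(df.duplicated())   # flag rows that repeat an earlier row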

The output is:

# output of print(df)
     语文  数学  英语
张三  99    99  98
李四  99    99  98
王五  98    98  99
张三  98    98  98
小李  99    98  99
小赵  98    99  99
# output of print(df.duplicated())
张三     False
李四     True
王五     False
张三     False
小李     False
小赵     False
dtype: bool

After detecting them, use drop_duplicates() to remove the duplicate rows.

print(df.drop_duplicates())  # drop rows that repeat an earlier row

The output is:

     语文  数学  英语
张三  99    99   98
王五  98    98   99
张三  98    98   98
小李  99    98   99
小赵  98    99   99
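
Both functions accept `keep` and `subset` parameters that control which rows count as duplicates. The sketch below uses a small hand-written frame (`df_demo` is a hypothetical name, and its values are invented) so the results are deterministic.

```python
import pandas as pd

# Hypothetical fixed data: the first two rows are identical
df_demo = pd.DataFrame(
    {"语文": [99, 99, 98, 98], "数学": [99, 99, 98, 99]},
    index=["张三", "李四", "王五", "小赵"],
)

# keep=False marks every member of a duplicate group, not just the repeats
flag_all = df_demo.duplicated(keep=False)

# keep="last" keeps the last occurrence instead of the first
last_kept = df_demo.drop_duplicates(keep="last")

# subset restricts the comparison to the listed columns
by_subject = df_demo.drop_duplicates(subset=["语文"])

print(flag_all)
print(last_kept)
print(by_subject)
```

With the default keep="first", the first occurrence is treated as the original and later copies are flagged, which matches the behaviour shown above.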

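# If pd.concat([df1, df2], axis=1) produces a DataFrame with repeated column names, duplicated() and drop_duplicates() may not do what you expect: they still compare whole rows and leave the repeated columns in place.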
df1 = pd.concat([df, df], axis=1)
print(df1)

     语文  数学  英语  语文  数学  英语
张三  99    99   98    99    99   98
李四  99    99   98    99    99   98
王五  98    98   99    98    98   99
张三  98    98   98    98    98   98
小李  99    98   99    99    98   99
小赵  98    99   99    98    99   99

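Calling drop_duplicates() on a DataFrame with repeated column names

print(df1.drop_duplicates())   # removes the duplicate row; the repeated columns remain

The output is: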

     语文  数学  英语  语文  数学  英语
张三  99    99   98    99    99   98
王五  98    98   99    98    98   99
张三  98    98   98    98    98   98
小李  99    98   99    99    98   99
小赵  98    99   99    98    99   99
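
Since drop_duplicates() works on rows, removing the repeated columns themselves needs a separate step. One possible approach, sketched here with hypothetical frames `base` and `wide`, is to filter on `columns.duplicated()`; it compares column labels only, not column contents.

```python
import pandas as pd

# Hypothetical small frame, concatenated with itself to repeat the column names
base = pd.DataFrame({"语文": [99, 98], "数学": [99, 98]}, index=["张三", "王五"])
wide = pd.concat([base, base], axis=1)   # columns: 语文, 数学, 语文, 数学

# Index.duplicated() flags labels that repeat an earlier one;
# negating the mask keeps the first occurrence of each label
deduped = wide.loc[:, ~wide.columns.duplicated()]
print(deduped)
```

If two columns share a name but hold different data, this keeps only the first one, so it is worth checking the contents before dropping anything.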

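2. Mapping

Mapping means defining a lookup that ties each value to a specific label or string. The lookup is written as a dictionary: mapping = {'label1': 'value1', 'label2': 'value2', …}.
Three operations use it:
(1) replace(): substitutes values; this is the most important of the three;
(2) map(): builds a new column;
(3) rename(): replaces index labels.

replace(): substituting values

replace({old_value: new_value})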

df = DataFrame({'item': ['ball', 'mug', 'pen'],
                'color': ['white', 'red', 'verde'],
                'price': [5.56, 4.20, 1.30]})
new_colors = {'red': 'black', 'verde': 'green'}
print(df)
print(df.replace(new_colors))

The output is:

   item  color  price
0  ball  white   5.56
1   mug    red   4.20
2   pen  verde   1.30

   item  color  price
0  ball  white   5.56
1   mug  black   4.20
2   pen  green   1.30

replace() is also often used to replace NaN values

df = DataFrame({'math': [100, 139, np.nan], 'English': [146, None, 119]},
               index=['张三', '李四', '王五'])
new_values = {np.nan: 100}
print(df)
print(df.replace(new_values))

The output is:

     math  English
张三  100.0    146.0
李四  139.0      NaN
王五    NaN    119.0

     math  English
张三  100.0    146.0
李四  139.0    100.0
王五  100.0    119.0
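
For missing values specifically, pandas also provides fillna(), which is not used in the text above but does the same job and can take a different fill value per column. A minimal sketch, reusing the same data under the hypothetical name `scores`:

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame(
    {"math": [100, 139, np.nan], "English": [146, None, 119]},
    index=["张三", "李四", "王五"],
)

# Same effect as replace({np.nan: 100}): every missing value becomes 100
same_fill = scores.fillna(100)

# A dict gives each column its own fill value (here: 0 and the column mean)
per_column = scores.fillna({"math": 0, "English": scores["English"].mean()})

print(same_fill)
print(per_column)
```

Which fill value is appropriate depends on the data; the choices of 0 and the column mean here are purely illustrative.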

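map(): building a new column

Python's built-in map(function, iterable) takes a function and an iterable; Series.map() takes a function or a {key: value} dictionary. Whatever is passed must yield one concrete value per element, not something to iterate over.

df = DataFrame({'color': ['red', 'green', 'blue'],
                'project': ['Math', 'English', 'Chemistry']})
price = {'red': 5.56, 'green': 3.14, 'chemistry': 2.79}
print(df)
df['price'] = df['color'].map(price)  # 'blue' is not a key in price, so it maps to NaN
print(df)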

The output is:

    color    project
0    red       Math
1  green    English
2   blue  Chemistry

    color    project  price
0    red       Math   5.56
1  green    English   3.14
2   blue  Chemistry    NaN
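
The NaN for 'blue' appears because map() returns NaN for any value missing from the dictionary. The sketch below, using a hypothetical frame `items`, shows one way to supply a default and also shows map() with a function instead of a dictionary. The default of 0.0 is an assumption for illustration only.

```python
import pandas as pd

items = pd.DataFrame({"color": ["red", "green", "blue"]})
price = {"red": 5.56, "green": 3.14}   # no entry for "blue"

# Dictionary lookup: values missing from the dict become NaN, then get a default
items["price"] = items["color"].map(price).fillna(0.0)

# Function: called once per element and returns one value per element
items["initial"] = items["color"].map(lambda c: c[0].upper())

print(items)
```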

rename(): replacing index labels

rename({old_label: new_label})

df = DataFrame({'color': ['white', 'gray', 'purple', 'blue', 'green'],
                'value': np.random.randint(0, 10, size=5)})
print(df)

The output is:

    color    value
0   white      5
1    gray      5
2  purple      7
3    blue      8
4   green      9

Use rename() to replace the row index labels

new_index = {0: 'first', 1: 'two', 2: 'three', 3: 'four', 4: 'five'}
print(df.rename(new_index))

The output is:

         color   value
first   white      5
two      gray      5
three  purple      7
four     blue      8
five    green      9
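
rename() can relabel columns as well as rows, and it accepts a function in place of a dictionary. A short sketch with a hypothetical frame `colors` (the new labels are invented for illustration):

```python
import pandas as pd

colors = pd.DataFrame({"color": ["white", "gray"], "value": [5, 7]})

# index= and columns= relabel rows and columns in one call
renamed = colors.rename(index={0: "first", 1: "second"},
                        columns={"value": "score"})

# A function is applied to every label on the chosen axis
upper = colors.rename(columns=str.upper)

print(renamed)
print(upper)
```

rename() returns a new DataFrame by default, so `colors` itself keeps its original labels.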

3. Detecting and filtering outliers

df = DataFrame(np.random.randint(0, 150, size=(6, 3)), index=list("ABCDEF"), columns=list("语数英"))
print(df)

The output is:

   语   数   英
A  115  13   22
B   38  96   89
C   11  25  128
D   42  15   37
E   51  66   67
F   52  67  146

1. Use describe() to view the descriptive statistics of each column

print(df.describe())

The output is:

               语          数           英
count    6.000000   6.000000    6.000000
mean    51.500000  47.000000   81.500000
std     34.483329  34.135026   49.212803
min     11.000000  13.000000   22.000000
25%     39.000000  17.500000   44.500000
50%     46.500000  45.500000   78.000000
75%     51.750000  66.750000  118.250000
max    115.000000  96.000000  146.000000

2. Use std() to get the standard deviation of each column of the DataFrame

print(df.std())

The output is:

语    34.483329
数    34.135026
英    49.212803
dtype: float64

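3. Filter the DataFrame by each column's standard deviation. Apply the condition to every column, then reduce the boolean result row by row: any(axis=1) returns True when at least one value in the row is True, while all(axis=1) returns True only when every value is. The code below uses all(axis=1), so a row is kept only if all of its columns pass.

# Treat a value as reliable if its absolute value is below 4 times the column's standard deviation
df_df = np.abs(df) < df.std() * 4
print(df_df.all(axis=1))
print(df[df_df.all(axis=1)])

The output is: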

A    True
B    True
C    True
D    True
E    True
F    True
dtype: bool

   语   数    英
A  115  13   22
B   38  96   89
C   11  25  128
D   42  15   37
E   51  66   67
F   52  67  146
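
The filter above compares raw values with a multiple of the standard deviation. A common variant, which is not the method used in this article, measures each value's distance from its column mean instead. The sketch below assumes a hypothetical frame `data` with one obvious outlier and a threshold of 2 standard deviations; both are chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: column "a" contains one value (100) far from the rest
data = pd.DataFrame({
    "a": [10, 11, 9, 10, 12, 10, 11, 9, 10, 100],
    "b": [5, 6, 5, 4, 5, 6, 5, 4, 5, 5],
})

# Distance of every value from its own column mean
deviation = (data - data.mean()).abs()

# True where the value lies within 2 standard deviations of the mean
within = deviation < 2 * data.std()

# Keep only rows where every column passes the test
clean = data[within.all(axis=1)]
print(clean)
```

In a small sample a single extreme value inflates the standard deviation itself, so a threshold that works on large datasets may miss outliers here; that is why the sketch uses 2 rather than a larger multiple.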

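4. Sorting and reordering

take() reorders rows by integer position; combined with np.random.permutation() it shuffles them randomly.

df = DataFrame(np.arange(25).reshape(5, 5))
new_order = np.random.permutation(5)  # a random permutation of 0 through 4
print(df)
print(new_order)
print(df.take(new_order))   # reorder the rows according to new_order

The output is: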

    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
4  20  21  22  23  24
[1 2 4 0 3]  # print(new_order)
    0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
4  20  21  22  23  24
0   0   1   2   3   4
3  15  16  17  18  19
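
For a repeatable shuffle, or to get back to a sorted order afterwards, the following sketch may help. The seed value 0 and the names `grid`, `shuffled`, `sampled`, `restored`, and `by_col` are assumptions for illustration; sample(), sort_index(), and sort_values() are standard pandas methods not used in the text above.

```python
import numpy as np
import pandas as pd

grid = pd.DataFrame(np.arange(25).reshape(5, 5))

# A seeded generator makes the random order reproducible between runs
rng = np.random.default_rng(0)
shuffled = grid.take(rng.permutation(len(grid)))

# sample(frac=1) is a built-in way to shuffle all rows
sampled = grid.sample(frac=1, random_state=0)

# sort_index() restores the original row order
restored = shuffled.sort_index()

# sort_values() orders rows by the values in a column, here column 0 descending
by_col = shuffled.sort_values(by=0, ascending=False)

print(shuffled)
print(restored)
print(by_col)
```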