pandas_day2_basic_exercises

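Pandas is a data processing tool built on top of NumPy, created to handle data analysis tasks. Pandas incorporates a large number of libraries and some standard data models, and provides the functions and methods needed to work efficiently with large datasets.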

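Pandas data structures: the main ones are Series (one-dimensional) and DataFrame (two-dimensional), and these two are by far the most widely used. Older Pandas releases also provided Panel (three-dimensional), Panel4D (four-dimensional), and PanelND (higher-dimensional) structures, but these have since been deprecated and removed from the library.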

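A Series is a one-dimensional labeled array that can hold any data type, including integers, strings, floating-point numbers, and Python objects. Elements of a Series can be located by label.
A DataFrame is a two-dimensional labeled data structure. Its data can also be located by label, which is something plain NumPy arrays do not offer.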
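
The difference between positional and label-based access can be sketched as follows. The values and the labels 'x', 'y', 'z' are made up for illustration and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# A plain NumPy array can only be addressed by integer position.
arr = np.array([10, 20, 30])

# A Series wraps the same values and attaches a label to each one.
# The labels 'x', 'y', 'z' are hypothetical, chosen for this example.
s = pd.Series(arr, index=['x', 'y', 'z'])

by_position = arr[1]   # NumPy: second element, by position
by_label = s['y']      # Series: the same element, looked up by label

# A DataFrame carries labels on both axes: row labels and column names.
df = pd.DataFrame({'score': [10, 20, 30]}, index=['x', 'y', 'z'])
cell = df.loc['y', 'score']   # row label 'y', column label 'score'

print(by_position, by_label, cell)
```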

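0.1 Topics covered

The main topics covered in this exercise are:

Creating a Series
Basic Series operations
Creating a DataFrame
Basic DataFrame operations
DataFrame file operations
Pivot tables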

0.2 Environment

Python 3.6
NumPy
Pandas

1 Basics

Import the Pandas module

In [2]:

import pandas as pd

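1.1 Creating a Series

In Pandas, a Series can be thought of as a dataset consisting of a single column of data.

The syntax for creating a Series is s = pd.Series(data, index=index). A Series can be created in several ways; three common ones are shown here.

Create a Series from a list

In the output of this example, the left-hand 0, 1, 2, 3, 4 is the index of the Series and the right-hand 0, 1, 2, 3, 4 holds its values.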

In [3]:

arr = [0,1,2,3,4]

s1 = pd.Series(arr)  # if no index is given, it defaults to integers starting at 0

s1

Out[3]:

0    0
1    1
2    2
3    3
4    4
dtype: int64

Create a Series from an ndarray

In [4]:

import numpy as np

In [5]:

n = np.random.randn(5)

index = ['a','b','c','d','e']

s2 = pd.Series(n,index=index)

s2

Out[5]:

a    0.258928
b   -1.213810
c    0.739202
d    1.183618
e   -0.900387
dtype: float64

Create a Series from a dict

In [6]:

d = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5}

s3 = pd.Series(d)

s3

Out[6]:

a    1
b    2
c    3
d    4
e    5
dtype: int64
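
When a dict is passed together with an explicit index, the index decides which keys are used and in what order. The sketch below assumes a smaller dict than the one above and a label 'z' that is deliberately absent from it.

```python
import pandas as pd

d = {'a': 1, 'b': 2, 'c': 3}

# Request the labels in a different order, plus a label 'z'
# that does not exist in the dict (hypothetical, for illustration).
s = pd.Series(d, index=['c', 'a', 'z'])

# 'c' and 'a' are looked up by key; 'z' has no match and becomes NaN.
# Because NaN is a float, the whole Series is stored as float64.
print(s)
```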

1.2 Basic Series operations

Change the index of a Series

In [7]:

s1.index=['A','B','C','D','E']

s1

Out[7]:

A    0
B    1
C    2
D    3
E    4
dtype: int64

Delete an element of a Series by its index label

In [13]:

s1 = s1.drop('E')

s1

Out[13]:

A    0
B    1
C    2
D    3
dtype: int64
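
Note that drop() does not modify the Series it is called on, which is why the result is assigned back to s1 above. A minimal sketch with a fresh Series (the names trimmed and trimmed_more are only for this example):

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4], index=['A', 'B', 'C', 'D', 'E'])

# drop() returns a new Series; s itself is left untouched.
trimmed = s.drop('E')

# A list of labels removes several entries in one call.
trimmed_more = s.drop(['A', 'E'])

print(len(s), len(trimmed), len(trimmed_more))
```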

Modify an element of a Series by its index label

In [14]:

s1['A'] = 6

s1

Out[14]:

A    6
B    1
C    2
D    3
dtype: int64

Look up an element of a Series by its index label

In [16]:

s1['B']

Out[16]:

1

Slicing a Series

For example, to access the first 3 elements

In [17]:

s1[:3]

Out[17]:

A    6
B    1
C    2
dtype: int64
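
Slices can be written by position or by label, and the two differ in how they treat the end point. The sketch below rebuilds a Series with the same values as s1 and uses .iloc and .loc to make the intent explicit.

```python
import pandas as pd

s = pd.Series([6, 1, 2, 3], index=['A', 'B', 'C', 'D'])

# Positional slice: the end position is excluded, as with Python lists.
first_three = s.iloc[:3]     # A, B, C

# Label slice: both end points are included.
a_to_c = s.loc['A':'C']      # A, B, C

print(first_three.equals(a_to_c))
```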

1.3 Creating a DataFrame

Unlike a Series, a DataFrame can hold multiple columns of data. In practice, the DataFrame is also the more commonly used of the two.

Create a DataFrame from a NumPy array

In [25]:

# Define a date range to use as the index

dates = pd.date_range('today',periods=6)

# Generate a NumPy array of random numbers as the data

num_arr = np.random.randn(6,4)

# Use a list as the column names

columns = ['A','B','C','D']

df1 = pd.DataFrame(num_arr,index=dates,columns=columns)

df1

Out[25]:

	A	B	C	D
2020-08-19 07:02:04.602751	0.256439	-0.602427	-1.321354	-1.984449
2020-08-20 07:02:04.602751	1.388259	-0.915227	-0.880789	0.457450
2020-08-21 07:02:04.602751	0.057342	0.206199	-1.048160	-0.640060
2020-08-22 07:02:04.602751	1.091020	0.351744	-0.408801	0.392707
2020-08-23 07:02:04.602751	-0.484573	-0.295999	-0.587875	-0.621808
2020-08-24 07:02:04.602751	1.359954	-0.599783	1.132802	-2.175532

Create a DataFrame from a dict of lists

In [26]:

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],

        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],

        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],

        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a','b','c','d','e','f','g','h','i','j']

df2 = pd.DataFrame(data,index=labels)

df2

Out[26]:

	animal	age	visits	priority
a	cat	2.5	1	yes
b	cat	3.0	3	yes
c	snake	0.5	2	no
d	dog	NaN	3	yes
e	dog	5.0	2	no
f	cat	2.0	3	no
g	snake	4.5	1	no
h	cat	NaN	1	yes
i	dog	7.0	2	no
j	dog	3.0	1	no

Check the data type of each DataFrame column

In [27]:

df2.dtypes

Out[27]:

animal       object
age         float64
visits        int64
priority     object
dtype: object

Preview the first 5 rows of the DataFrame

In [28]:

df2.head()

Out[28]:

	animal	age	visits	priority
a	cat	2.5	1	yes
b	cat	3.0	3	yes
c	snake	0.5	2	no
d	dog	NaN	3	yes
e	dog	5.0	2	no

View the last 3 rows of the DataFrame

In [29]:

df2.tail(3)

Out[29]:

	animal	age	visits	priority
h	cat	NaN	1	yes
i	dog	7.0	2	no
j	dog	3.0	1	no

View the index of the DataFrame

In [30]:

df2.index

Out[30]:

Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')

View the column names of the DataFrame

In [31]:

df2.columns

Out[31]:

Index(['animal', 'age', 'visits', 'priority'], dtype='object')

View the underlying values of the DataFrame

In [32]:

df2.values

Out[32]:

array([['cat', 2.5, 1, 'yes'],
       ['cat', 3.0, 3, 'yes'],
       ['snake', 0.5, 2, 'no'],
       ['dog', nan, 3, 'yes'],
       ['dog', 5.0, 2, 'no'],
       ['cat', 2.0, 3, 'no'],
       ['snake', 4.5, 1, 'no'],
       ['cat', nan, 1, 'yes'],
       ['dog', 7.0, 2, 'no'],
       ['dog', 3.0, 1, 'no']], dtype=object)

View summary statistics of the DataFrame

In [33]:

df2.describe()

Out[33]:

	age	visits
count	8.000000	10.000000
mean	3.437500	1.900000
std	2.007797	0.875595
min	0.500000	1.000000
25%	2.375000	1.000000
50%	3.000000	2.000000
75%	4.625000	2.750000
max	7.000000	3.000000

Transpose a DataFrame

In [34]:

df2.transpose()

Out[34]:

	a	b	c	d	e	f	g	h	i	j
animal	cat	cat	snake	dog	dog	cat	snake	cat	dog	dog
age	2.5	3	0.5	NaN	5	2	4.5	NaN	7	3
visits	1	3	2	3	2	3	1	1	2	1
priority	yes	yes	no	yes	no	no	no	yes	no	no

In [35]:

df2.T

Out[35]:

	a	b	c	d	e	f	g	h	i	j
animal	cat	cat	snake	dog	dog	cat	snake	cat	dog	dog
age	2.5	3	0.5	NaN	5	2	4.5	NaN	7	3
visits	1	3	2	3	2	3	1	1	2	1
priority	yes	yes	no	yes	no	no	no	yes	no	no

Slice the rows of a DataFrame

In [37]:

df2[1:3]

Out[37]:

	animal	age	visits	priority
b	cat	3.0	3	yes
c	snake	0.5	2	no

Select from a DataFrame by label (single column)

In [38]:

df2['age']

df2.age

Out[38]:

a    2.5
b    3.0
c    0.5
d    NaN
e    5.0
f    2.0
g    4.5
h    NaN
i    7.0
j    3.0
Name: age, dtype: float64

Select from a DataFrame by label (multiple columns)

In [39]:

df2[['age','animal']]

Out[39]:

	age	animal
a	2.5	cat
b	3.0	cat
c	0.5	snake
d	NaN	dog
e	5.0	dog
f	2.0	cat
g	4.5	snake
h	NaN	cat
i	7.0	dog
j	3.0	dog

Select from a DataFrame by position

In [36]:

df2.iloc[1:3]

Out[36]:

	animal	age	visits	priority
b	cat	3.0	3	yes
c	snake	0.5	2	no
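
The same rows can also be reached through their labels with .loc. The sketch below uses a cut-down, four-row version of the animal data (an assumption made to keep the example short) and compares the two selectors.

```python
import pandas as pd

df = pd.DataFrame({'animal': ['cat', 'cat', 'snake', 'dog'],
                   'visits': [1, 3, 2, 3]},
                  index=['a', 'b', 'c', 'd'])

by_position = df.iloc[1:3]    # rows at positions 1 and 2 (end excluded)
by_label = df.loc['b':'c']    # rows labelled 'b' through 'c' (end included)

# iloc also accepts a column position, loc a column label.
cell_by_position = df.iloc[1, 0]        # second row, first column
cell_by_label = df.loc['b', 'animal']   # row 'b', column 'animal'

print(by_position.equals(by_label), cell_by_position, cell_by_label)
```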

Check whether each DataFrame column contains any missing values (any)

In [40]:

df2.isnull().any()

Out[40]:

animal      False
age          True
visits      False
priority    False
dtype: bool
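
Besides a yes/no answer per column, it is often useful to know how many values are missing. A minimal sketch on a small made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'animal': ['cat', 'dog', 'dog', 'cat'],
                   'age': [2.5, np.nan, 5.0, np.nan]})

has_missing = df.isnull().any()               # one True/False per column
missing_count = df.isnull().sum()             # number of missing values per column
any_missing_at_all = df.isnull().any().any()  # a single bool for the whole frame

print(missing_count)
```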

Add a column of data

In [43]:

num  = pd.Series([0,1,2,3,4,5,6,7,8,9],index=df2.index)

df2['No. '] = num

df2

Out[43]:

	animal	age	visits	priority	No.
a	cat	2.5	1	yes	0
b	cat	3.0	3	yes	1
c	snake	0.5	2	no	2
d	dog	NaN	3	yes	3
e	dog	5.0	2	no	4
f	cat	2.0	3	no	5
g	snake	4.5	1	no	6
h	cat	NaN	1	yes	7
i	dog	7.0	2	no	8
j	dog	3.0	1	no	9
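
When a Series is assigned as a new column, Pandas matches it to the rows by index label rather than by position, which is why num is built with index=df2.index. The sketch below uses a small made-up frame and a hypothetical column name 'score' to show what happens when the labels are out of order or incomplete.

```python
import pandas as pd

df = pd.DataFrame({'animal': ['cat', 'dog', 'snake']}, index=['a', 'b', 'c'])

# The Series lists its labels in a different order and has no entry for 'b'.
extra = pd.Series([30, 10], index=['c', 'a'])

# Assignment aligns on index labels, not on position;
# rows without a matching label receive NaN.
df['score'] = extra

print(df)
```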

Modify a DataFrame value by position

In [46]:

# Change the value in the 2nd row, 2nd column from 3.0 to 2.0

df2.iloc[1,1]=2

df2

Out[46]:

	animal	age	visits	priority	No.
a	cat	2.5	1	yes	0
b	cat	2.0	3	yes	1
c	snake	0.5	2	no	2
d	dog	NaN	3	yes	3
e	dog	5.0	2	no	4
f	cat	2.0	3	no	5
g	snake	4.5	1	no	6
h	cat	NaN	1	yes	7
i	dog	7.0	2	no	8
j	dog	3.0	1	no	9

Modify a DataFrame value by label

In [47]:

df2.loc['f','age'] = 1.5

df2

Out[47]:

	animal	age	visits	priority	No.
a	cat	2.5	1	yes	0
b	cat	2.0	3	yes	1
c	snake	0.5	2	no	2
d	dog	NaN	3	yes	3
e	dog	5.0	2	no	4
f	cat	1.5	3	no	5
g	snake	4.5	1	no	6
h	cat	NaN	1	yes	7
i	dog	7.0	2	no	8
j	dog	3.0	1	no	9

Compute the mean of each DataFrame column

In [48]:

df2.mean(numeric_only=True)  # skip the text columns; newer Pandas versions raise an error without this

Out[48]:

age       3.25
visits    1.90
No.       4.50
dtype: float64
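
Two details of mean() are worth checking on a small example: missing values are skipped by default, and text columns have to be excluded (recent Pandas releases, 2.0 and later as far as I know, raise an error otherwise). The frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'animal': ['cat', 'dog', 'snake'],
                   'age': [2.0, np.nan, 4.0],
                   'visits': [1, 3, 2]})

# numeric_only=True restricts the calculation to numeric columns.
col_means = df.mean(numeric_only=True)

# NaN is skipped by default, so 'age' averages over 2 values, not 3.
age_mean = df['age'].mean()                      # (2.0 + 4.0) / 2
age_mean_strict = df['age'].mean(skipna=False)   # NaN, because a value is missing

print(col_means)
```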

Sum any single column of a DataFrame

In [56]:

df2.visits.sum()

Out[56]:

19

1.4 Handling missing values in a DataFrame

Fill missing values

In [50]:

df4 = df2.copy()

print(df4)

df4.fillna(3)

  animal  age  visits priority  No. 
a    cat  2.5       1      yes     0
b    cat  2.0       3      yes     1
c  snake  0.5       2       no     2
d    dog  NaN       3      yes     3
e    dog  5.0       2       no     4
f    cat  1.5       3       no     5
g  snake  4.5       1       no     6
h    cat  NaN       1      yes     7
i    dog  7.0       2       no     8
j    dog  3.0       1       no     9

Out[50]:

	animal	age	visits	priority	No.
a	cat	2.5	1	yes	0
b	cat	2.0	3	yes	1
c	snake	0.5	2	no	2
d	dog	3.0	3	yes	3
e	dog	5.0	2	no	4
f	cat	1.5	3	no	5
g	snake	4.5	1	no	6
h	cat	3.0	1	yes	7
i	dog	7.0	2	no	8
j	dog	3.0	1	no	9
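
fillna() returns a new DataFrame and leaves the original unchanged. A constant such as 3 is one option; filling a column with its own mean is a common alternative. The sketch below shows both on a small made-up frame and is not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [2.0, np.nan, 4.0], 'visits': [1, 3, 2]})

# Constant fill: the result is a new DataFrame; df still contains NaN.
filled_const = df.fillna(3)

# Fill only the 'age' column, using that column's own mean (3.0 here).
filled_mean = df.copy()
filled_mean['age'] = filled_mean['age'].fillna(filled_mean['age'].mean())

print(filled_mean)
```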

Drop rows that contain missing values

In [51]:

df5 = df2.copy()

print(df5)

df5.dropna(how='any')

  animal  age  visits priority  No. 
a    cat  2.5       1      yes     0
b    cat  2.0       3      yes     1
c  snake  0.5       2       no     2
d    dog  NaN       3      yes     3
e    dog  5.0       2       no     4
f    cat  1.5       3       no     5
g  snake  4.5       1       no     6
h    cat  NaN       1      yes     7
i    dog  7.0       2       no     8
j    dog  3.0       1       no     9

Out[51]:

	animal	age	visits	priority	No.
a	cat	2.5	1	yes	0
b	cat	2.0	3	yes	1
c	snake	0.5	2	no	2
e	dog	5.0	2	no	4
f	cat	1.5	3	no	5
g	snake	4.5	1	no	6
i	dog	7.0	2	no	8
j	dog	3.0	1	no	9
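
The how and subset parameters of dropna() control which rows count as incomplete. A minimal sketch on a made-up frame with three rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [2.0, np.nan, np.nan],
                   'visits': [1.0, 3.0, np.nan]},
                  index=['a', 'b', 'c'])

drop_any = df.dropna(how='any')    # drop a row if any value is missing: keeps 'a'
drop_all = df.dropna(how='all')    # drop a row only if every value is missing: keeps 'a', 'b'
drop_on_visits = df.dropna(subset=['visits'])   # only look at 'visits': keeps 'a', 'b'

print(drop_any.index.tolist(), drop_all.index.tolist(), drop_on_visits.index.tolist())
```

Like fillna(), dropna() returns a new DataFrame, so df itself still has all three rows afterwards.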