Pandas数据处理案例
学习了Pandas的一些基本语法后,通过下面的例子来初步了解一些pandas数据处理的过程,下面是三个要使用到的数据链接:https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-abbrevs.csvhttps://raw.githubusercontent.com/jakevdp/data-USstates/m..
学习了Pandas的一些基本语法后,通过下面的例子来初步了解一些pandas数据处理的过程,下面是三个要使用到的数据链接:
https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-abbrevs.csv
https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-areas.csv
https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-population.csv
一、读取数据并获取数据:
>>>import pandas as pd
>>>pop = pd.read_csv('state-population.csv')
>>>areas = pd.read_csv('state-areas.csv')
>>>abbrevs = pd.read_csv('state-abbrevs.csv')
>>>pop.head()
state/region ages year population
0 AL under18 2012 1117489.0
1 AL total 2012 4817528.0
2 AL under18 2010 1130966.0
3 AL total 2010 4785570.0
4 AL under18 2011 1125763.0
>>>areas.head()
state area (sq. mi)
0 Alabama 52423
1 Alaska 656425
2 Arizona 114006
3 Arkansas 53182
4 California 163707
>>>abbrevs.head()
state abbreviation
0 Alabama AL
1 Alaska AK
2 Arizona AZ
3 Arkansas AR
4 California CA
其中,pandas.DataFrame.head()默认读取的是前五行数据,pd.read_csv()得到的是一个DataFrame对象。
关于pandas.DataFrame.head()的官方文档如下:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html
二、得到美国各州人口的密度排名:
虽然可以计算每张表来的到结果,但在这里试着用数据集链接解决问题:
1、把pop中的state/region和abbrevs中的abbreviation合并:
>>>merged = pd.merge(pop, abbrevs, how='outer',
left_on='state/region', right_on='abbreviation')
>>>merged = merged.drop('abbreviation', 1) # 丢掉 axis=1(列)上的元素
>>>merged.head()
state/region ages year population state
0 AL under18 2012 1117489.0 Alabama
1 AL total 2012 4817528.0 Alabama
2 AL under18 2010 1130966.0 Alabama
3 AL total 2010 4785570.0 Alabama
4 AL under18 2011 1125763.0 Alabama
2、检查是否有缺失值并处理:
>>>merged.isnull().any()
state/region False
ages False
year False
population True
state True
dtype: bool
可以看出,部分的population有缺失值,看是哪些数据:
>>>merged[merged['population'].isnull()].head()
state/region ages year population state
2448 PR under18 1990 NaN NaN
2449 PR total 1990 NaN NaN
2450 PR total 1991 NaN NaN
2451 PR under18 1991 NaN NaN
2452 PR total 1993 NaN NaN
再看看哪些州也有缺失:
>>>merged.loc[merged['state'].isnull(), 'state/region'].unique()
array(['PR', 'USA'], dtype=object)
loc的用法:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html?highlight=loc#pandas.DataFrame.loc
可以发现,人口数据中包含PR和USA这两项数据,但是这两项的名称并没有出现在缩写表中,我们可以补全:
>>>merged.loc[merged['state/region'] == 'PR', 'state'] = 'Puerto Rico'
>>>merged.loc[merged['state/region'] == 'USA', 'state'] = 'United States'
>>>merged.isnull().any()
state/region False
ages False
year False
population True
state False
dtype: bool
这下state没有缺失值了。
3、将面积数据合并,用共同的state来合并:
下面how='left’的意思就是:以merged的数据为基础,把areas的数据向merged上合并。
>>>final = pd.merge(merged, areas, on='state', how='left')
>>>final.head()
state/region ages year population state area (sq. mi)
0 AL under18 2012 1117489.0 Alabama 52423.0
1 AL total 2012 4817528.0 Alabama 52423.0
2 AL under18 2010 1130966.0 Alabama 52423.0
3 AL total 2010 4785570.0 Alabama 52423.0
4 AL under18 2011 1125763.0 Alabama 52423.0
检查有没有缺失值:
>>>final.isnull().any()
state/region False
ages False
year False
population True
state False
area (sq. mi) True
dtype: bool
看看是哪些地区面积的缺失:
>>>final['state'][final['area (sq. mi)'].isnull()].unique()
array(['United States'], dtype=object)
是全美国的面积缺失,在这里,该数据没有什么作用,可以用pandas.DataFrame.dropna()方法去除:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html?highlight=dropna#pandas.DataFrame.dropna
>>>final.dropna(inplace=True)
>>>final.head()
可以再检查一下缺失值,发现没有缺失值了:
>>>final.isnull().any()
state/region False
ages False
year False
population False
state False
area (sq. mi) False
dtype: bool
3、选择2010年的各州人口以及总人口数据:
query()方法:查询DataFrame的列
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html?highlight=query#pandas.DataFrame.query
>>>data2010 = final.query("year == 2010 & ages == 'total'")
>>>data2010.head()
state/region ages year population state area (sq. mi)
3 AL total 2010 4785570.0 Alabama 52423.0
91 AK total 2010 713868.0 Alaska 656425.0
101 AZ total 2010 6408790.0 Arizona 114006.0
189 AR total 2010 2922280.0 Arkansas 53182.0
197 CA total 2010 37333601.0 California 163707.0
4、计算人口密度并按序排列:
set_index():https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html?highlight=set_index#pandas.DataFrame.set_index
sort_values(): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html?highlight=sort_value
>>>data2010.set_index('state', inplace=True) # 以 state 设置为行标签
>>>density = data2010['population'] / data2010['area (sq. mi)']
>>>density.sort_values(ascending=False, inplace=True) # 按任一轴的值排序,这里 ascending=False 为降序
>>>density.head()
state
District of Columbia 8898.897059
Puerto Rico 1058.665149
New Jersey 1009.253268
Rhode Island 681.339159
Connecticut 645.600649
dtype: float64
查看最后几名:
>>>density.tail()
state
South Dakota 10.583512
North Dakota 9.537565
Montana 6.736171
Wyoming 5.768079
Alaska 1.087509
dtype: float64
更多推荐



所有评论(0)