《Pandas Cookbook》第04章 选取数据子集

简介: 第01章 Pandas基础第02章 DataFrame运算第03章 数据分析入门第04章 选取数据子集第05章 布尔索引第06章 索引对齐第07章 分组聚合、过滤、转换第08章 数据清理第09章 合并Pandas对象第10章 时间序列分析第1...

第01章 Pandas基础
第02章 DataFrame运算
第03章 数据分析入门
第04章 选取数据子集
第05章 布尔索引
第06章 索引对齐
第07章 分组聚合、过滤、转换
第08章 数据清理
第09章 合并Pandas对象
第10章 时间序列分析
第11章 用Matplotlib、Pandas、Seaborn进行可视化


In[1]: import pandas as pd
       import numpy as np

1. 选取Series数据

# 读取college数据集,查看CITY的前5行
 In[2]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
        city = college['CITY']
        city.head()
Out[2]: INSTNM
        Alabama A & M University                   Normal
        University of Alabama at Birmingham    Birmingham
        Amridge University                     Montgomery
        University of Alabama in Huntsville    Huntsville
        Alabama State University               Montgomery
        Name: CITY, dtype: object
# iloc可以通过整数选取
 In[3]: city.iloc[3]
Out[3]: 'Huntsville'
# iloc通过整数列表选取多行,返回结果是Series
 In[4]: city.iloc[[10,20,30]]
Out[4]: INSTNM
        Birmingham Southern College                            Birmingham
        George C Wallace State Community College-Hanceville    Hanceville
        Judson College                                             Marion
        Name: CITY, dtype: object
# 选择等分的数据,可以使用切片语法
 In[5]: city.iloc[4:50:10]
Out[5]: INSTNM
        Alabama State University              Montgomery
        Enterprise State Community College    Enterprise
        Heritage Christian University           Florence
        Marion Military Institute                 Marion
        Reid State Technical College           Evergreen
        Name: CITY, dtype: object
# loc只接收行索引标签
 In[6]: city.loc['Heritage Christian University']
Out[6]: 'Florence'
# 随机选择4个标签
 In[7]: np.random.seed(1)
        labels = list(np.random.choice(city.index, 4))
        labels
Out[7]: ['Northwest HVAC/R Training Center',
         'California State University-Dominguez Hills',
         'Lower Columbia College',
         'Southwest Acupuncture College-Boulder']
# 通过标签列表选择多行
 In[8]: city.loc[labels]
Out[8]: INSTNM
        Northwest HVAC/R Training Center                Spokane
        California State University-Dominguez Hills      Carson
        Lower Columbia College                         Longview
        Southwest Acupuncture College-Boulder           Boulder
        Name: CITY, dtype: object
# 也可以通过切片语法均匀选择多个
 In[9]: city.loc['Alabama State University':'Reid State Technical College':10]
Out[9]: INSTNM
        Alabama State University              Montgomery
        Enterprise State Community College    Enterprise
        Heritage Christian University           Florence
        Marion Military Institute                 Marion
        Reid State Technical College           Evergreen
        Name: CITY, dtype: object
# 也可以不使用loc,直接使用类似Python的语法
 In[10]: city['Alabama State University':'Reid State Technical College':10]
Out[10]: INSTNM
         Alabama State University              Montgomery
         Enterprise State Community College    Enterprise
         Heritage Christian University           Florence
         Marion Military Institute                 Marion
         Reid State Technical College           Evergreen
         Name: CITY, dtype: object

更多

# 要想只选取一项,并保留其Series类型,则传入一个只包含一项的列表
 In[11]: city.iloc[[3]]
Out[11]: INSTNM
         University of Alabama in Huntsville    Huntsville
         Name: CITY, dtype: object
# 使用loc切片时要注意,如果start索引再stop索引之后,则会返回空,并且不会报警
 In[12]: city.loc['Reid State Technical College':'Alabama State University':10]
Out[12]: Series([], Name: CITY, dtype: object)
# 也可以切片逆序选取
 In[13]: city.loc['Reid State Technical College':'Alabama State University':-10]
Out[13]: INSTNM
         Reid State Technical College           Evergreen
         Marion Military Institute                 Marion
         Heritage Christian University           Florence
         Enterprise State Community College    Enterprise
         Alabama State University              Montgomery
         Name: CITY, dtype: object

2. 选取DataFrame的行

# 还是读取college数据集
 In[14]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
         college.head()
Out[14]: 
img_636fe67140c8275a80410a7e9043b0f6.png
# 选取第61行
 In[15]: pd.options.display.max_rows = 6
 In[16]: college.iloc[60]
Out[16]: 
img_1b319a9559600947ddfd214871365264.png
# 也可以通过行标签选取
 In[17]: college.loc['University of Alaska Anchorage']
Out[17]: CITY                  Anchorage
         STABBR                       AK
         HBCU                          0
                                     ...    
         UG25ABV                  0.4386
         MD_EARN_WNE_P10           42500
         GRAD_DEBT_MDN_SUPP      19449.5
         Name: University of Alaska Anchorage, Length: 26, dtype: object
# 选取多个不连续的行
 In[18]: college.iloc[[60, 99, 3]]
Out[18]: 
img_be79af68cf7a61c00d82e170d79baeb7.png
# 也可以用loc加列表来选取
 In[19]: labels = ['University of Alaska Anchorage',
                   'International Academy of Hair Design',
                   'University of Alabama in Huntsville']
         college.loc[labels]
Out[19]: 
img_291b4747594f515d7a56f3ce46637aa1.png
# iloc可以用切片连续选取
 In[20]: college.iloc[99:102]
Out[20]: 
img_90579e58ad1851bbcccdf916f5faf925.png
# loc可以用标签连续选取
 In[21]: start = 'International Academy of Hair Design'
         stop = 'Mesa Community College'
         college.loc[start:stop]
Out[21]: 
img_6ce0e865541ee318b96cb7ed0882e78d.png

更多

# .index.tolist()可以直接提取索引标签,生成一个列表
 In[22]: college.iloc[[60, 99, 3]].index.tolist()
Out[22]: ['University of Alaska Anchorage',
          'International Academy of Hair Design',
          'University of Alabama in Huntsville']

3. 同时选取DataFrame的行和列

# 读取college数据集,给行索引命名为INSTNM;选取前3行和前4列
 In[23]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
         college.iloc[:3, :4]
Out[23]: 
img_25fee28c88644d27637472023b06aecd.png
# 用loc实现同上功能
 In[24]: college.loc[:'Amridge University', :'MENONLY']
Out[24]: 
img_3f1bc707891e7fe584dc59e539eb9baa.png
# 选取两列的所有的行
 In[25]: college.iloc[:, [4,6]].head()
Out[25]: 
img_01ce899029226bb396240f72bcb71dc8.png
# loc实现同上功能
 In[26]: college.loc[:, ['WOMENONLY', 'SATVRMID']]
Out[26]: 
img_63f690d921d8c0b011cd60f10a27d428.png
# 选取不连续的行和列
 In[27]: college.iloc[[100, 200], [7, 15]]
Out[27]: 
img_785f74faf3829bdd4b0f30f0d0b3998e.png
# 用loc和列表,选取不连续的行和列
 In[28]: rows = ['GateWay Community College', 'American Baptist Seminary of the West']
         columns = ['SATMTMID', 'UGDS_NHPI']
         college.loc[rows, columns]
Out[28]: 
img_0d3a10e0c4dbf673356b73a30b18a9c2.png
# iloc选取一个标量值
 In[29]: college.iloc[5, -4]
Out[29]: 0.40100000000000002
# loc选取一个标量值
 In[30]: college.loc['The University of Alabama', 'PCTFLOAN']
Out[30]: 0.40100000000000002
# iloc对行切片,并只选取一列
 In[31]: college.iloc[90:80:-2, 5]
Out[31]: INSTNM
         Empire Beauty School-Flagstaff     0
         Charles of Italy Beauty College    0
         Central Arizona College            0
         University of Arizona              0
         Arizona State University-Tempe     0
         Name: RELAFFIL, dtype: int64
# loc对行切片,并只选取一列
 In[32]: start = 'Empire Beauty School-Flagstaff'
         stop = 'Arizona State University-Tempe'
         college.loc[start:stop:-2, 'RELAFFIL']
Out[32]: INSTNM
         Empire Beauty School-Flagstaff     0
         Charles of Italy Beauty College    0
         Central Arizona College            0
         University of Arizona              0
         Arizona State University-Tempe     0
         Name: RELAFFIL, dtype: int64

4. 用整数和标签选取数据

# 读取college数据集,行索引命名为INSTNM
 In[33]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
# 用索引方法get_loc,找到指定列的整数位置
 In[34]: col_start = college.columns.get_loc('UGDS_WHITE')
         col_end = college.columns.get_loc('UGDS_UNKN') + 1
         col_start, col_end
Out[34]: (10, 19)
# 用切片选取连续的列
 In[35]: college.iloc[:5, col_start:col_end]
Out[35]:
img_11910d35027576bc810a6055f9eee178.png

更多

# index()方法可以获得整数行对应的标签名
 In[36]: row_start = college.index[10]
         row_end = college.index[15]
         college.loc[row_start:row_end, 'UGDS_WHITE':'UGDS_UNKN']
Out[36]: 
img_4676eb1a2e7eee99cab5aed37e29eeef.png

5. 快速选取标量

# 通过将行标签赋值给一个变量,用loc选取
 In[37]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
         cn = 'Texas A & M University-College Station'
         college.loc[cn, 'UGDS_WHITE']
Out[37]: 0.66099999999999992
# at可以实现同样的功能
 In[38]: college.at[cn, 'UGDS_WHITE']
Out[38]: 0.66099999999999992
# 用魔术方法%timeit,对速度进行比较
 In[39]: %timeit college.loc[cn, 'UGDS_WHITE']
Out[39]: 9.93 µs ± 274 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
 In[40]: %timeit college.at[cn, 'UGDS_WHITE']
Out[40]: 6.69 µs ± 223 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

.iat.at只接收标量值,是专门用来取代.iloc.loc选取标量的,可以节省大概2.5微秒。

# 用get_loc找到整数位置,再进行速度比较
 In[41]: row_num = college.index.get_loc(cn)
         col_num = college.columns.get_loc('UGDS_WHITE')
 In[42]: row_num, col_num
Out[42]: (3765, 10)

 In[43]: %timeit college.iloc[row_num, col_num]
Out[43]: 11.1 µs ± 426 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

 In[44]: %timeit college.iat[row_num, col_num]
Out[44]: 7.47 µs ± 109 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

 In[45]: %timeit college.iloc[5, col_num]
Out[45]: 10.8 µs ± 467 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

 In[46]: %timeit college.iat[5, col_num]
Out[46]: 7.12 µs ± 297 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

更多

# Series对象也可以使用.iat和.at选取标量
 In[47]: state = college['STABBR']
 In[48]: state.iat[1000]
Out[48]: 'IL'

 In[49]: state.at['Stanford University']
Out[49]: 'CA'

6. 惰性行切片

# 读取college数据集;从行索引10到20,每隔一个取一行
 In[50]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
         college[10:20:2]
Out[50]: 
img_2f58b041f8765eccb30a9bceb991c4ff.png
# Series也可以进行同样的切片
 In[51]: city = college['CITY']
         city[10:20:2]
Out[51]: INSTNM
         Birmingham Southern College              Birmingham
         Concordia College Alabama                     Selma
         Enterprise State Community College       Enterprise
         Faulkner University                      Montgomery
         New Beginning College of Cosmetology    Albertville
         Name: CITY, dtype: object
# 查看第4002个行索引标签
 In[52]: college.index[4001]
Out[52]: 'Spokane Community College'
# Series和DataFrame都可以用标签进行切片。下面是对DataFrame用标签切片
 In[53]: start = 'Mesa Community College'
         stop = 'Spokane Community College'
         college[start:stop:1500]
Out[53]: 
img_74a07687f5e79453e7cb3f18b889b060.png
# 下面是对Series用标签切片
 In[54]: city[start:stop:1500]
Out[54]: INSTNM
         Mesa Community College                            Mesa
         Hair Academy Inc-New Carrollton         New Carrollton
         National College of Natural Medicine          Portland
         Name: CITY, dtype: object

更多

惰性切片不能用于列,只能用于DataFrame的行和Series,也不能同时选取行和列。

# 下面尝试选取两列,导致错误
 In[55]: college[:10, ['CITY', 'STABBR']]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-55-92538c61bdfa> in <module>()
----> 1 college[:10, ['CITY', 'STABBR']]

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   1962             return self._getitem_multilevel(key)
   1963         else:
-> 1964             return self._getitem_column(key)
   1965 
   1966     def _getitem_column(self, key):

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   1969         # get column
   1970         if self.columns.is_unique:
-> 1971             return self._get_item_cache(key)
   1972 
   1973         # duplicate columns & possible reduce dimensionality

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1641         """Return the cached item, item represents a label indexer."""
   1642         cache = self._item_cache
-> 1643         res = cache.get(item)
   1644         if res is None:
   1645             values = self._data.get(item)

TypeError: unhashable type: 'slice'
# 只能用.loc和.iloc选取
 In[56]: first_ten_instnm = college.index[:10]
         college.loc[first_ten_instnm, ['CITY', 'STABBR']]
Out[56]: 
img_e903f1aa1e160685607042b0fbceef67.png

7. 按照字母切片

# 读取college数据集;尝试选取字母顺序在‘Sp’和‘Su’之间的学校
 In[57]: college = pd.read_csv('data/college.csv', index_col='INSTNM')
         college.loc['Sp':'Su']
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_slice_bound(self, label, side, kind)
   3483             try:
-> 3484                 return self._searchsorted_monotonic(label, side)
   3485             except ValueError:

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in _searchsorted_monotonic(self, label, side)
   3442 
-> 3443         raise ValueError('index must be monotonic increasing or decreasing')
   3444 

ValueError: index must be monotonic increasing or decreasing

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-57-c9f1c69a918b> in <module>()
      1 college = pd.read_csv('data/college.csv', index_col='INSTNM')
----> 2 college.loc['Sp':'Su']

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexing.py in __getitem__(self, key)
   1326         else:
   1327             key = com._apply_if_callable(key, self.obj)
-> 1328             return self._getitem_axis(key, axis=0)
   1329 
   1330     def _is_scalar_access(self, key):

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1504         if isinstance(key, slice):
   1505             self._has_valid_type(key, axis)
-> 1506             return self._get_slice_axis(key, axis=axis)
   1507         elif is_bool_indexer(key):
   1508             return self._getbool_axis(key, axis=axis)

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexing.py in _get_slice_axis(self, slice_obj, axis)
   1354         labels = obj._get_axis(axis)
   1355         indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop,
-> 1356                                        slice_obj.step, kind=self.name)
   1357 
   1358         if isinstance(indexer, slice):

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in slice_indexer(self, start, end, step, kind)
   3348         """
   3349         start_slice, end_slice = self.slice_locs(start, end, step=step,
-> 3350                                                  kind=kind)
   3351 
   3352         # return a slice

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in slice_locs(self, start, end, step, kind)
   3536         start_slice = None
   3537         if start is not None:
-> 3538             start_slice = self.get_slice_bound(start, 'left', kind)
   3539         if start_slice is None:
   3540             start_slice = 0

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_slice_bound(self, label, side, kind)
   3485             except ValueError:
   3486                 # raise the original KeyError
-> 3487                 raise err
   3488 
   3489         if isinstance(slc, np.ndarray):

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_slice_bound(self, label, side, kind)
   3479         # we need to look up the label
   3480         try:
-> 3481             slc = self._get_loc_only_exact_matches(label)
   3482         except KeyError as err:
   3483             try:

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in _get_loc_only_exact_matches(self, key)
   3448         get_slice_bound.
   3449         """
-> 3450         return self.get_loc(key)
   3451 
   3452     def get_slice_bound(self, label, side, kind):

/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2442                 return self._engine.get_loc(key)
   2443             except KeyError:
-> 2444                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2445 
   2446         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5280)()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5126)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20523)()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20477)()

KeyError: 'Sp'
# 对college进行排序
 In[58]: college = college.sort_index()
 In[59]: college = college.head()
Out[59]:
img_bcad331fc82ffd8ec5a7a66c32a2e297.png
# 再尝试选取字母顺序在‘Sp’和‘Su’之间的学校
 In[60]: pd.options.display.max_rows = 6
 In[61]: college.loc['Sp':'Su']
Out[61]:
img_bdaeb3c9fa894413f27dc9b0b942fc6b.png
# 可以用is_monotonic_increasing或is_monotonic_decreasing检测字母排序的顺序
 In[62]: college = college.sort_index(ascending=False)
         college.index.is_monotonic_decreasing
Out[62]: True
# 字母逆序选取
 In[63]: college.loc['E':'B']
Out[63]: 
img_5e70c001873e3f5f178ff64f6908438a.png

第01章 Pandas基础
第02章 DataFrame运算
第03章 数据分析入门
第04章 选取数据子集
第05章 布尔索引
第06章 索引对齐
第07章 分组聚合、过滤、转换
第08章 数据清理
第09章 合并Pandas对象
第10章 时间序列分析
第11章 用Matplotlib、Pandas、Seaborn进行可视化


目录
相关文章
|
2月前
|
机器学习/深度学习 数据可视化 搜索推荐
Python在社交媒体分析中扮演关键角色,借助Pandas、NumPy、Matplotlib等工具处理、可视化数据及进行机器学习。
【7月更文挑战第5天】Python在社交媒体分析中扮演关键角色,借助Pandas、NumPy、Matplotlib等工具处理、可视化数据及进行机器学习。流程包括数据获取、预处理、探索、模型选择、评估与优化,以及结果可视化。示例展示了用户行为、话题趋势和用户画像分析。Python的丰富生态使得社交媒体洞察变得高效。通过学习和实践,可以提升社交媒体分析能力。
61 1
|
4月前
|
存储 JSON 数据处理
从JSON数据到Pandas DataFrame:如何解析出所需字段
从JSON数据到Pandas DataFrame:如何解析出所需字段
286 1
|
10天前
|
数据采集 数据挖掘 数据处理
使用Python和Pandas处理CSV数据
使用Python和Pandas处理CSV数据
40 5
|
23天前
|
存储 移动开发 测试技术
在pandas中利用hdf5高效存储数据
在pandas中利用hdf5高效存储数据
|
22天前
|
数据采集 分布式计算 并行计算
Dask与Pandas:无缝迁移至分布式数据框架
【8月更文第29天】Pandas 是 Python 社区中最受欢迎的数据分析库之一,它提供了高效且易于使用的数据结构,如 DataFrame 和 Series,以及大量的数据分析功能。然而,随着数据集规模的增大,单机上的 Pandas 开始显现出性能瓶颈。这时,Dask 就成为了一个很好的解决方案,它能够利用多核 CPU 和多台机器进行分布式计算,从而有效地处理大规模数据集。
45 1
|
22天前
|
索引 Python
python pandas 把数据保存成csv文件,以及读取csv文件获取指定行、指定列数据
该文档详细介绍了如何使用Python的Pandas库处理图像数据集,并将其保存为CSV文件。示例数据集位于`test_data`目录中,包含5张PNG图片,每张图片名中的数字代表其标签。文档提供了将这些数据转换为CSV格式的具体步骤,包括不同格式的数据输入方法(如NumPy数组、嵌套列表、嵌套元组和字典),以及如何使用`pd.DataFrame`和`to_csv`方法保存数据。此外,还展示了如何读取CSV文件并访问其中的每一行和每一列数据,包括获取列名、指定列数据及行数据的操作方法。
29 1
|
2月前
|
存储 数据可视化 Python
Python中的数据可视化:使用Matplotlib和Pandas探索数据
【7月更文挑战第31天】 在数据科学的世界里,可视化是理解复杂数据集的关键。本文将引导你通过Python的Matplotlib库和Pandas库来揭示数据背后的故事。我们将一起构建图表,从简单的线图到复杂的热图,每一步都将用代码示例来丰富你的理解。文章不仅仅是展示如何绘制图表,更是一次深入探索数据结构和模式的旅程。
51 8
|
22天前
|
数据可视化 Python
Pandas可视化指南:从零教你绘制数据图表
Pandas可视化指南:从零教你绘制数据图表
|
1月前
|
Python
Pandas 读取Eexcel - 间隔N行,读取某列数据
Pandas 读取Eexcel - 间隔N行,读取某列数据
24 0