How to Read HTML Tables with Pandas (With Example)

By Benjamin Anderson · July 21, 2023 · Guides

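You can use the pandas read_html() function to read HTML tables into pandas DataFrames. The function returns a list that holds one DataFrame for each table it finds.

This function uses the following basic syntax:

    tabs = pd.read_html('https://en.wikipedia.org/wiki/National_Basketball_Association')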
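Because read_html() returns a list even when a page holds a single table, it can help to see that behavior on a tiny input first. The sketch below parses a small made-up HTML string (sample_html is an assumption for illustration, not data from the Wikipedia page) instead of fetching a URL, and it assumes lxml or another supported parser is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML snippet standing in for a real web page
sample_html = """
<table>
  <tr><th>Division</th><th>Team</th></tr>
  <tr><td>Atlantic</td><td>Boston Celtics</td></tr>
  <tr><td>Atlantic</td><td>Brooklyn Nets</td></tr>
</table>
"""

# read_html() accepts a URL, a path, or a file-like object such as StringIO
tables = pd.read_html(StringIO(sample_html))

# The result is a list, so select a table by position to get a DataFrame
first_table = tables[0]
print(type(tables).__name__, len(tables))
print(first_table)
```

Wrapping the string in StringIO avoids passing literal HTML directly, which newer pandas versions warn about.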

The following example shows how to use this function to read a table of NBA team names from this Wikipedia page.

Example: Read an HTML Table with Pandas

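Before using the read_html() function, you may need to install lxml:

    pip install lxml

Note: If you're using a Jupyter notebook, you must restart the kernel after performing this installation.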
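pandas needs an HTML parser for this work: by default read_html() tries lxml first and falls back to BeautifulSoup (bs4) together with html5lib. As a rough way to check which of these packages can be imported in the current environment, something like the following could be used. The helper name installed_html_parsers is made up for this sketch.

```python
from importlib.util import find_spec


def installed_html_parsers():
    """Return the names of parser packages that can be imported here."""
    # Package names that pandas can use for HTML parsing
    candidates = ["lxml", "bs4", "html5lib"]
    return [name for name in candidates if find_spec(name) is not None]


print(installed_html_parsers())
```

If lxml is missing from the printed list, running the pip command above and restarting the kernel should make it available.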

Next, we can use the read_html() function to read every HTML table on this Wikipedia page:

    import pandas as pd

    # read all HTML tables from specific URL
    tabs = pd.read_html('https://en.wikipedia.org/wiki/National_Basketball_Association')

    # display total number of tables read
    len(tabs)

    44

We can see that a total of 44 HTML tables were found on this page.
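When a page yields dozens of tables, a quick look at each table's shape and column names can help locate the one you need. The following sketch shows the idea on a made-up two-table HTML string rather than the Wikipedia page. The helper summarize_tables and the sample data are hypothetical.

```python
from io import StringIO

import pandas as pd

# Hypothetical page containing two small tables
sample_html = """
<table>
  <tr><th>Division</th><th>Team</th></tr>
  <tr><td>Atlantic</td><td>Boston Celtics</td></tr>
  <tr><td>Atlantic</td><td>Brooklyn Nets</td></tr>
</table>
<table>
  <tr><th>Player</th><th>Club</th><th>Points</th></tr>
  <tr><td>Sample Player</td><td>Sample Club</td><td>10</td></tr>
</table>
"""


def summarize_tables(tables):
    """Return (position, shape, column names) for each DataFrame in the list."""
    return [(i, t.shape, [str(c) for c in t.columns]) for i, t in enumerate(tables)]


tables = pd.read_html(StringIO(sample_html))
summary = summarize_tables(tables)
for position, shape, columns in summary:
    print(position, shape, columns)
```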

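I know that the table I'm interested in contains the word "Division", so I can use the match argument to retrieve only the HTML tables that contain this word: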

    # read HTML tables from specific URL with the word "Division" in them
    tabs = pd.read_html('https://en.wikipedia.org/wiki/National_Basketball_Association',
                        match='Division')

    # display total number of tables read
    len(tabs)

    1
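The match argument accepts a string or a regular expression, and only tables whose text matches it are returned. Here is a minimal offline sketch of that filtering, using a made-up HTML string with two tables (the sample data is an assumption, not taken from the Wikipedia page).

```python
from io import StringIO

import pandas as pd

# Hypothetical page: only the first table contains the word "Division"
sample_html = """
<table>
  <tr><th>Division</th><th>Team</th></tr>
  <tr><td>Atlantic</td><td>Boston Celtics</td></tr>
</table>
<table>
  <tr><th>Player</th><th>Points</th></tr>
  <tr><td>Sample Player</td><td>10</td></tr>
</table>
"""

# Without match, every table on the page is returned
all_tables = pd.read_html(StringIO(sample_html))

# With match, only tables containing matching text are kept
division_tables = pd.read_html(StringIO(sample_html), match="Division")

print(len(all_tables), len(division_tables))
```

Note that the pattern should not carry stray spaces around the word, since a pattern such as ' Division ' would only match text that includes those spaces.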

I can then list the column names of the table:

    # define table
    df = tabs[0]

    # list all column names of table
    list(df)

    [('Division', 'Eastern Conference'),
     ('Team', 'Eastern Conference'),
     ('Location', 'Eastern Conference'),
     ('Arena', 'Eastern Conference'),
     ('Capacity', 'Eastern Conference'),
     ('Coordinates', 'Eastern Conference'),
     ('Founded', 'Eastern Conference'),
     ('Joined', 'Eastern Conference'),
     ('Unnamed: 8_level_0', 'Eastern Conference')]
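The column names above are tuples because the table has a two-row header, which pandas represents as a MultiIndex. One possible way to simplify such columns, shown here on a small hand-built DataFrame rather than the real table, is to keep a single level of the MultiIndex.

```python
import pandas as pd

# Hand-built stand-in for a table with a two-row header
columns = pd.MultiIndex.from_tuples(
    [("Division", "Eastern Conference"), ("Team", "Eastern Conference")]
)
df = pd.DataFrame(
    [["Atlantic", "Boston Celtics"], ["Atlantic", "Brooklyn Nets"]],
    columns=columns,
)

# Keep only the first element of each column tuple
flat = df.copy()
flat.columns = df.columns.get_level_values(0)

print(list(df.columns))
print(list(flat.columns))
```

Renaming the columns by hand after selecting them by position is an equally valid choice and avoids depending on how the header was parsed.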

I'm only interested in the first two columns, so I can filter the DataFrame to contain only these columns:

    # filter DataFrame to only contain first two columns (copy to get an independent DataFrame)
    df_final = df.iloc[:, 0:2].copy()

    # rename columns
    df_final.columns = ['Division', 'Team']

    # view first few rows of final DataFrame
    print(df_final.head())

       Division                Team
    0  Atlantic      Boston Celtics
    1  Atlantic       Brooklyn Nets
    2  Atlantic     New York Knicks
    3  Atlantic  Philadelphia 76ers
    4  Atlantic     Toronto Raptors

The final table contains only the "Division" and "Team" columns.
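The steps in this example (read with match, take the first table, keep two columns, rename them) could be collected into one small function. The following is only a sketch under stated assumptions: read_first_two_columns is a hypothetical helper, and it is run here on a made-up HTML string so that it works offline.

```python
from io import StringIO

import pandas as pd


def read_first_two_columns(source, keyword, names):
    """Read the first table containing `keyword` and keep its first two columns."""
    tables = pd.read_html(source, match=keyword)
    # Copy so the result is independent of the parsed table
    trimmed = tables[0].iloc[:, 0:2].copy()
    trimmed.columns = names
    return trimmed


# Hypothetical HTML standing in for the real page
sample_html = """
<table>
  <tr><th>Division</th><th>Team</th><th>Arena</th></tr>
  <tr><td>Atlantic</td><td>Boston Celtics</td><td>Sample Arena A</td></tr>
  <tr><td>Atlantic</td><td>Brooklyn Nets</td><td>Sample Arena B</td></tr>
</table>
"""

df_small = read_first_two_columns(
    StringIO(sample_html), "Division", ["Division", "Team"]
)
print(df_small)
```

Selecting by position keeps the function indifferent to whether the header was parsed as plain names or as a MultiIndex, at the cost of relying on the column order of the source table.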

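Additional Resources

The following tutorials explain how to read other file types in pandas:

How to Read a Text File with Pandas
How to Read Excel Files with Pandas
How to Read CSV Files with Pandas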

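About the Author

Benjamin Anderson

Hi, I'm Benjamin, a retired statistics professor turned dedicated Statorials teacher. With extensive experience and expertise in statistics, I'm eager to share my knowledge and help students learn through Statorials.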
