Using BeautifulSoup to scrape table data and store the values for later calculations

啊啊啊啊啊吖

2019-02-25 Views: 676

I'm making some progress, but I'm stuck trying to extract the row data for each team and store it for later calculations. Here is my code so far:

from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.hockey-reference.com/leagues/NHL_2019.html"
html = urlopen(url)
soup = BeautifulSoup(html, "lxml")

all_stats = soup.find('div', {'id': 'all_stats'})
print(all_stats)

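With this code I can see the row information I need in the HTML output, but every attempt to pull that data out returns None. I think I need to assign a variable to each team and its td values so I can refer to them later. I need to collect 30 rows of data.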

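Solution: the Team Statistics table sits inside an HTML comment, so the parser does not treat it as markup and a normal search never finds it. In this case you can use Comment from bs4: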

from urllib.request import urlopen  # Python 3 location of urlopen
from bs4 import BeautifulSoup, Comment

search_url = 'https://www.hockey-reference.com/leagues/NHL_2019.html#'
page = urlopen(search_url)
soup = BeautifulSoup(page, "html.parser")

# Tables that are part of the regular (uncommented) HTML
tables = soup.find_all('table')

# Every HTML comment node; the Team Statistics table is inside one of them
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

for c in comments:
    # Re-parse the comment body as HTML so its tags become searchable
    a = BeautifulSoup(c, "html.parser")
    teams = a.find_all('td', attrs={'class': 'left'})    # team names
    values = a.find_all('td', attrs={'class': 'right'})  # stats
    for value in values:
        print(value.text)
    for team in teams:
        print(team.text)
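Since the goal is to keep each team's numbers around for later calculations, printing is probably not enough. Below is a minimal sketch of one way to do that: it collects each row into a dict keyed by team name. It runs against a small made-up HTML snippet rather than the live page, so the `SAMPLE_HTML` contents, the column layout (GP, W, L), the table id `"stats"`, and the helper name `parse_commented_table` are all assumptions for illustration and may not match the real page.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical miniature of the page: the stats table is wrapped in an HTML comment.
SAMPLE_HTML = """
<div id="all_stats">
<!--
<table id="stats">
  <thead><tr><th>Rk</th><th>Team</th><th>GP</th><th>W</th><th>L</th></tr></thead>
  <tbody>
    <tr><th class="right">1</th><td class="left">Team A</td>
        <td class="right">62</td><td class="right">40</td><td class="right">22</td></tr>
    <tr><th class="right">2</th><td class="left">Team B</td>
        <td class="right">61</td><td class="right">30</td><td class="right">31</td></tr>
  </tbody>
</table>
-->
</div>
"""


def parse_commented_table(html, table_id):
    """Return {team name: [numeric stats]} for a table hidden inside a comment."""
    soup = BeautifulSoup(html, "html.parser")
    comments = soup.find_all(string=lambda s: isinstance(s, Comment))
    for comment in comments:
        # Re-parse the comment text so the table inside it becomes searchable
        inner = BeautifulSoup(comment, "html.parser")
        table = inner.find("table", id=table_id)
        if table is None:
            continue
        stats = {}
        for row in table.find("tbody").find_all("tr"):
            team_cell = row.find("td", class_="left")
            if team_cell is None:
                continue  # skip rows without a team name cell
            values = [float(td.get_text(strip=True))
                      for td in row.find_all("td", class_="right")]
            stats[team_cell.get_text(strip=True)] = values
        return stats
    return {}


team_stats = parse_commented_table(SAMPLE_HTML, "stats")

# Later calculation on the stored values, e.g. win ratio for one team
games_played, wins, losses = team_stats["Team A"]
print(wins / games_played)
```

With the real page you would pass the downloaded HTML instead of `SAMPLE_HTML`. Real cells may be empty or non-numeric, in which case the `float(...)` conversion would need a guard.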
