I'm making some progress, but I'm stuck trying to extract the row data for each team and store it for later calculations. Here is my code so far:
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.hockey-reference.com/leagues/NHL_2019.html"
html = urlopen(url)
soup = BeautifulSoup(html, "lxml")
all_stats = soup.find('div', {'id': 'all_stats'})
print(all_stats)
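With this code I can see the row information I need in HTML form, but every attempt to pull that data out comes back as not found. I think I need to assign a variable for each team and its td values so I can refer to them later. I need to collect 30 rows of data.
Solution: The Team Statistics table is inside an HTML comment, so BeautifulSoup does not parse it as markup and searches for its tags find nothing. In this case you can use Comment from bs4: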
from urllib.request import urlopen  # Python 3 location of urlopen
from bs4 import BeautifulSoup, Comment

search_url = 'https://www.hockey-reference.com/leagues/NHL_2019.html#'
page = urlopen(search_url)
soup = BeautifulSoup(page, "html.parser")
tables = soup.find_all('table')  # tables in the regular (uncommented) HTML
# Collect every HTML comment node; the Team Statistics table sits inside one
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
# Detach the comments from the tree; their text stays available
for comment in comments:
    comment.extract()
for c in comments:
    # Re-parse the comment text as HTML so its tags become searchable
    a = BeautifulSoup(c, "html.parser")
    teams = a.find_all('td', attrs={'class': 'left'})    # team names
    values = a.find_all('td', attrs={'class': 'right'})  # stats
    for value in values:
        print(value.text)
    for team in teams:
        print(team.text)
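Since the goal was to keep each team's row around for later calculations, here is a rough sketch of one way to do that rather than printing the cells. It runs against a small made-up HTML snippet, not the live page: the teams "Team A" and "Team B", their numbers, the helper name commented_table_rows and the table id "stats" are all assumptions for illustration, so check the real page's markup before relying on them.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical miniature of the page: the stats table is wrapped in an HTML comment.
SAMPLE_HTML = """
<div id="all_stats">
<!--
<table id="stats">
  <tbody>
    <tr><td class="left"><a href="#">Team A</a></td><td class="right">82</td><td class="right">50</td></tr>
    <tr><td class="left"><a href="#">Team B</a></td><td class="right">82</td><td class="right">40</td></tr>
  </tbody>
</table>
-->
</div>
"""


def commented_table_rows(html, table_id):
    """Return one dict per team row of a table hidden inside an HTML comment."""
    soup = BeautifulSoup(html, "html.parser")
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        # Re-parse the comment text so the tags inside it become searchable.
        inner = BeautifulSoup(comment, "html.parser")
        table = inner.find("table", id=table_id)
        if table is None:
            continue  # this comment does not hold the table we want
        rows = []
        for tr in table.find_all("tr"):
            team = tr.find("td", class_="left")
            if team is None:
                continue  # header or separator row without a team cell
            stats = [td.get_text(strip=True) for td in tr.find_all("td", class_="right")]
            rows.append({"team": team.get_text(strip=True), "stats": stats})
        return rows
    return []


rows = commented_table_rows(SAMPLE_HTML, "stats")
# Key the rows by team name so a team's values can be looked up later.
stats_by_team = {row["team"]: row["stats"] for row in rows}
print(stats_by_team)
```

The values come back as strings, so convert them with int() or float() before doing arithmetic. Grouping cells per tr keeps each team's numbers together, which the two flat lists of td elements above do not.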







