I've gone through some tutorials and read a book on the basics of BeautifulSoup, and I wrote this scraper, but I can't get it to loop through the URLs a-z or page through the results. For this project I'm scraping one site, and I want to grab the results for A-Z, not just the results on page A. How do I generate the URL strings for BeautifulSoup below?
The code works until I try to get it to run through the last of the letter strings -
Below is my code that isn't working - I'm trying to build the URL strings. Ideally I'd love to pull the letters from a file or a predefined list (see the sketch after this code block), but baby steps.
import urllib
import urllib.request
from bs4 import BeautifulSoup
import os
from string import ascii_lowercase

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

playerdatasaved = ""
for letter in ascii_lowercase:
    soup = make_soup("http://www.basketball-reference.com/players/" + letter + "/")
    for record in soup.find_all("tr"):
        playerdata = ""
        for data in record.findAll("td"):
            playerdata = playerdata + "," + data.text
        if len(playerdata) != 0:
            playerdatasaved = playerdatasaved + "\n" + playerdata[1:]

header = "Player,From,To,Pos,Ht,Wt,Birth Date,College"
file = open(os.path.expanduser("Basketball.csv"), "wb")
file.write(bytes(header, encoding="ascii", errors="ignore"))
file.write(bytes(playerdatasaved, encoding="ascii", errors="ignore"))
print(letter)
print(playerdatasaved)
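(On the wish above to pull the letters from a file or a predefined list: here is a minimal sketch of that idea. The base URL matches the question's code; the file name "letters.txt" and its one-letter-per-line format are assumptions for illustration, not part of the original setup.)

# Minimal sketch: build the URLs from a predefined list instead of ascii_lowercase.
base = "http://www.basketball-reference.com/players/"

letters = ["a", "b", "c"]  # predefined list; swap in whatever letters you need
urls = [base + letter + "/" for letter in letters]

# Or, assuming a hypothetical "letters.txt" with one letter per line:
# with open("letters.txt") as f:
#     letters = [line.strip() for line in f if line.strip()]

for url in urls:
    print(url)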
My error is below ---------------------
Traceback (most recent call last):
File "C:/Python36/web_scraper_tutorial/multiple_url_2.py", line 15, in <module>
soup = make_soup("http://www.basketball-reference.com/players/" + letter + "/")
File "C:/Python36/web_scraper_tutorial/multiple_url_2.py", line 8, in make_soup
thepage = urllib.request.urlopen(url)
File "C:\Python36\lib\urllib\request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "C:\Python36\lib\urllib\request.py", line 532, in open
response = meth(req, response)
File "C:\Python36\lib\urllib\request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python36\lib\urllib\request.py", line 564, in error
result = self._call_chain(*args)
File "C:\Python36\lib\urllib\request.py", line 504, in _call_chain
result = func(*args)
File "C:\Python36\lib\urllib\request.py", line 756, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "C:\Python36\lib\urllib\request.py", line 532, in open
response = meth(req, response)
File "C:\Python36\lib\urllib\request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python36\lib\urllib\request.py", line 570, in error
return self._call_chain(*args)
File "C:\Python36\lib\urllib\request.py", line 504, in _call_chain
result = func(*args)
File "C:\Python36\lib\urllib\request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
Can anyone offer some help or advice?
Below is the working version that scrapes only a single page - I need it to crawl multiple.
import urllib
import urllib.request
from bs4 import BeautifulSoup
import os

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

playerdatasaved = ""
soup = make_soup("http://www.basketball-reference.com/players/a/")
for record in soup.find_all("tr"):
    playerdata = ""
    for data in record.findAll("td"):
        playerdata = playerdata + "," + data.text
    playerdatasaved = playerdatasaved + "\n" + playerdata[1:]

header = "Player,From,To,Pos,Ht,Wt,Birth Date,College" + "\n"
file = open(os.path.expanduser("Basketball.csv"), "wb")
file.write(bytes(header, encoding="ascii", errors="ignore"))
file.write(bytes(playerdatasaved, encoding="ascii", errors="ignore"))
print(playerdatasaved)
The error says that no response was found for that URL, so you are generating a URL that is not available on their server. – eLRuLL
I figured as much, but can you tell me why? Any URL using a-z should be valid - what's wrong with how I've set this up? I posted the first version that scrapes a single URL - I need multiple URLs. – Drazziwac
I can't tell you why the server decides to work the way its programmers intended. In your example, it looks like "x" is not available (404). – eLRuLL
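(Building on eLRuLL's diagnosis: the site apparently has no player index page for every letter - in this run, "x" came back 404 - so the loop dies on the first missing page. Below is a minimal sketch of one way around that, keeping the question's scraping logic but catching urllib.error.HTTPError and skipping letters whose page doesn't exist.)

import urllib.request
import urllib.error
from bs4 import BeautifulSoup
from string import ascii_lowercase

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    return BeautifulSoup(thepage, "html.parser")

playerdatasaved = ""
for letter in ascii_lowercase:
    url = "http://www.basketball-reference.com/players/" + letter + "/"
    try:
        soup = make_soup(url)
    except urllib.error.HTTPError as e:
        # This letter has no player index page (e.g. "x" -> 404); skip it.
        print("skipping", url, "-", e.code)
        continue
    for record in soup.find_all("tr"):
        playerdata = ",".join(td.text for td in record.find_all("td"))
        if playerdata:
            playerdatasaved += "\n" + playerdata

print(playerdatasaved)

Catching the HTTPError instead of hard-coding the missing letters keeps the loop working even if the site adds or drops letter pages later.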