
I've gone through some tutorials and read a book on the basics of BeautifulSoup and wrote this scraper, but I can't get it to loop over the URLs a-z or move through the pages. For this project I'm scraping one site, and I'd like to grab the results for A-Z, not just the page for A. Below is how I'm generating the URL string for BeautifulSoup.

The code was working until I tried to get it to generate the URL string for each letter.

Below is the code that isn't working, where I try to build the URL string. Ideally I'd love to pull the letters from a file or a predefined list, but baby steps.

import urllib
import urllib.request
from bs4 import BeautifulSoup
import os
from string import ascii_lowercase

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata


playerdatasaved = ""
for letter in ascii_lowercase:
    soup = make_soup("http://www.basketball-reference.com/players/" + letter + "/")
    for record in soup.find_all("tr"):
        playerdata = ""
        for data in record.findAll("td"):
            playerdata = playerdata + "," + data.text
        if len(playerdata) != 0:
            playerdatasaved = playerdatasaved + "\n" + playerdata[1:]

header = "Player,From,To,Pos,Ht,Wt,Birth Date,College"
file = open(os.path.expanduser("Basketball.csv"), "wb")
file.write(bytes(header, encoding="ascii", errors="ignore"))
file.write(bytes(playerdatasaved, encoding="ascii", errors="ignore"))

print(letter)
print(playerdatasaved)

The error I get is below:

Traceback (most recent call last):
  File "C:/Python36/web_scraper_tutorial/multiple_url_2.py", line 15, in <module>
    soup = make_soup("http://www.basketball-reference.com/players/" + letter + "/")
  File "C:/Python36/web_scraper_tutorial/multiple_url_2.py", line 8, in make_soup
    thepage = urllib.request.urlopen(url)
  File "C:\Python36\lib\urllib\request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python36\lib\urllib\request.py", line 532, in open
    response = meth(req, response)
  File "C:\Python36\lib\urllib\request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python36\lib\urllib\request.py", line 564, in error
    result = self._call_chain(*args)
  File "C:\Python36\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "C:\Python36\lib\urllib\request.py", line 756, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "C:\Python36\lib\urllib\request.py", line 532, in open
    response = meth(req, response)
  File "C:\Python36\lib\urllib\request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python36\lib\urllib\request.py", line 570, in error
    return self._call_chain(*args)
  File "C:\Python36\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "C:\Python36\lib\urllib\request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

Can anyone give me some help or advice?

Below is the working version that scrapes only one page; I need it to crawl multiple.

import urllib
import urllib.request
from bs4 import BeautifulSoup
import os

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

playerdatasaved = ""
soup = make_soup("http://www.basketball-reference.com/players/a/")
for record in soup.find_all("tr"):
    playerdata = ""
    for data in record.findAll("td"):
        playerdata = playerdata + "," + data.text
    playerdatasaved = playerdatasaved + "\n" + playerdata[1:]

header = "Player,From,To,Pos,Ht,Wt,Birth Date,College" + "\n"
file = open(os.path.expanduser("Basketball.csv"), "wb")
file.write(bytes(header, encoding="ascii", errors="ignore"))
file.write(bytes(playerdatasaved, encoding="ascii", errors="ignore"))


print(playerdatasaved)

The error says no response was found for that URL, so you are generating a URL that doesn't exist on the server. – eLRuLL


I figured as much, but can you tell me why? Any URL using a-z should be valid. What's wrong with how I've set this up? I've posted the first version that scrapes a single URL; I need it to handle multiple. – Drazziwac


I can't tell you why the server works the way its programmers intended. In your case, it looks like 'x' isn't available (404). – eLRuLL

Answer


That particular site has no 'x' page, which is why you get a 404. Wrap the request in a try/except so it skips the pages that return 404, and it should work.

playerdatasaved = ""
for letter in ascii_lowercase:
    try:
        soup = make_soup("http://www.basketball-reference.com/players/" + letter + "/")
        for record in soup.find_all("tr"):
            playerdata = ""
            for data in record.findAll("td"):
                playerdata = playerdata + "," + data.text
            if len(playerdata) != 0:
                playerdatasaved = playerdatasaved + "\n" + playerdata[1:]
    except:
        pass
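
The bare except above will also hide unrelated problems (network errors, typos in the parsing code). A slightly safer variant, just as a sketch that reuses the make_soup() helper from the question, is to catch urllib.error.HTTPError specifically and skip only the letters whose pages return 404:

import urllib.error
from string import ascii_lowercase

playerdatasaved = ""
for letter in ascii_lowercase:
    url = "http://www.basketball-reference.com/players/" + letter + "/"
    try:
        soup = make_soup(url)  # make_soup() as defined in the question
    except urllib.error.HTTPError as err:
        # basketball-reference has no page for some letters (e.g. 'x'), so urlopen raises 404
        print("skipping", url, "-", err)
        continue
    for record in soup.find_all("tr"):
        playerdata = ""
        for data in record.findAll("td"):
            playerdata = playerdata + "," + data.text
        if len(playerdata) != 0:
            playerdatasaved = playerdatasaved + "\n" + playerdata[1:]

This way a missing letter page is logged and skipped, while any other exception still surfaces so you can see what actually went wrong.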