2016-06-30 30 views
1

BeautifulSoup timeout on instantiation?

I'm just doing some web scraping with BeautifulSoup, and I've run into a strange error. The code:

print "Running urllib2" 
g = urllib2.urlopen(link + "about", timeout=5) 
print "Finished urllib2" 
about_soup = BeautifulSoup(g, 'lxml') 

Here is the output:

Running urllib2 
Finished urllib2 

Error 
    Traceback (most recent call last): 
     File "/Users/pspieker/Documents/projects/ThePyStrikesBack/tests/TestSpringerOpenScraper.py", line 10, in test_strip_chars 
     for row in self.instance.get_entries(): 
     File "/Users/pspieker/Documents/projects/ThePyStrikesBack/src/JournalScrapers.py", line 304, in get_entries 
     about_soup = BeautifulSoup(g, 'lxml') 
     File "/Users/pspieker/.virtualenvs/thepystrikesback/lib/python2.7/site-packages/bs4/__init__.py", line 175, in __init__ 
     markup = markup.read() 
     File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 355, in read 
     data = self._sock.recv(rbufsize) 
     File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 588, in read 
     return self._read_chunked(amt) 
     File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 648, in _read_chunked 
     value.append(self._safe_read(amt)) 
     File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 703, in _safe_read 
     chunk = self.fp.read(min(amt, MAXAMOUNT)) 
     File "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 384, in read 
     data = self._sock.recv(left) 
    timeout: timed out 

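I understand that urllib2.urlopen could cause problems, but the exception is raised on the line that instantiates BeautifulSoup. I did some Googling but couldn't find anything about timeout problems with BeautifulSoup.

Any ideas about what is going on?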

Answer

2

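It is the urllib2 part that is causing the timeout.

The reason you see it fail on the BeautifulSoup instantiation line is that g, the file-like object, is being read inside BeautifulSoup. This part of the stack trace proves it: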

File "/Users/pspieker/.virtualenvs/thepystrikesback/lib/python2.7/site-packages/bs4/__init__.py", line 175, in __init__ 
    markup = markup.read() 
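
One way to make the failure show up where it actually happens is to read the body yourself before handing it to BeautifulSoup. The following is only a minimal sketch: `read_body`, `SlowResponse`, and `FastResponse` are hypothetical names used for illustration, and the two classes merely stand in for the object returned by `urlopen` so that the example runs without a network connection.

```python
import socket


def read_body(response):
    """Read the whole body up front; return None if the socket times out.

    `response` is any file-like object with a read() method, such as the
    object returned by urlopen.
    """
    try:
        return response.read()
    except socket.timeout:
        # The timeout is raised by the read, not by the parser.
        return None


class SlowResponse(object):
    """Hypothetical stand-in: a response whose body never finishes arriving."""

    def read(self):
        raise socket.timeout("timed out")


class FastResponse(object):
    """Hypothetical stand-in: a response whose body arrives normally."""

    def read(self):
        return b"<html><body>about</body></html>"


stalled = read_body(SlowResponse())   # the timeout is caught here
html = read_body(FastResponse())      # bytes that can be passed to the parser
```

With the code from the question, this would mean calling `read_body(g)` first and running `BeautifulSoup(html, 'lxml')` only when the result is not `None`, so a timeout is handled at the read step instead of surfacing from inside the BeautifulSoup constructor.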

Huh, so the 'urllib2.urlopen' object doesn't throw an exception when it is instantiated? – user2740614

@user2740614 nope, it is triggered when 'read()' is called. – alecxe

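Okay, that explains why it wasn't working. If you don't mind my asking, why would 'urllib2.urlopen' be designed like this? Wouldn't you want it to fail faster (e.g. on instantiation)? Just curious :) – user2740614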
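
As background for the exchange above: a socket timeout applies to each blocking operation separately, so an early read can succeed while a later one times out. The sketch below illustrates only that general socket behaviour with a local socket pair. It is not how urllib2 is implemented, and `demo_stalled_body` is a hypothetical helper name.

```python
import socket


def demo_stalled_body(timeout=0.2):
    """Return (first_chunk, timed_out) for a peer that sends once, then stalls."""
    reader, writer = socket.socketpair()
    try:
        reader.settimeout(timeout)
        # Stands in for the part of the response that arrives promptly.
        writer.sendall(b"headers")
        # Data is already waiting, so this recv returns without timing out.
        first_chunk = reader.recv(1024)
        try:
            # Nothing else is ever sent, so this recv blocks until the timeout.
            reader.recv(1024)
            timed_out = False
        except socket.timeout:
            timed_out = True
        return first_chunk, timed_out
    finally:
        reader.close()
        writer.close()


first_chunk, timed_out = demo_stalled_body()
```

The same pattern fits the traceback in the question: the call to `urlopen` returned, and the timeout was raised only later, during a `recv` made while the body was being read.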