Can you show me what I'm doing wrong with Twisted? I've been trying to build a fast web scraper using Twisted for quite a while now. Building a traditional threaded scraper with Queue was trivial and, so far, it has been faster. Still, I want to compare Twisted! The webscraper's goal is to recursively find <a> links to images from a gallery, connect to those links to scrape the images (<img>) and/or gather more links to parse later. The code is shown below. Most of the functions pass a dictionary around so that, conceptually, all of the information about each link travels together. I run the blocking code (the parsePage function) in a thread, and use "asynchronous code" (or so I believe) to retrieve the html pages, header information, and images. What am I doing wrong in building this Twisted web scraper?
My main problem so far is that my getLinkHTML and getImgHeader errbacks catch tons of "User timeout caused connection failure" errors. I've tried throttling the number of connections with a semaphore, and even put some of my code to sleep, in case I was flooding the connections. I also thought the problem might come from reactor.connectTCP, since the timeout errors appear about 30 seconds after the scraper starts and connectTCP has a 30-second timeout. However, after I changed the connectTCP timeout in the Twisted source to 60 seconds, the errors still showed up about 30 seconds into the run. Meanwhile, my traditional threaded scraper scrapes the same sites just fine, and much faster.
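For reference, my semaphore attempt looked roughly like the sketch below (a trimmed-down version, not the full scraper; MAX_CONCURRENT, fetch, and the example URLs are placeholders I made up for illustration):

from twisted.internet import defer, reactor
from twisted.web import client

MAX_CONCURRENT = 10  # placeholder cap on simultaneous fetches

sem = defer.DeferredSemaphore(MAX_CONCURRENT)

def fetch(url):
    # sem.run() waits for a free token, calls getPage, and releases the
    # token when the returned Deferred fires (on success or failure)
    d = sem.run(client.getPage, url)
    d.addErrback(lambda err: err.getErrorMessage())
    return d

urls = ['http://example.com/%d' % i for i in range(100)]  # placeholders
dl = defer.DeferredList([fetch(u) for u in urls])
dl.addCallback(lambda _: reactor.stop())
reactor.run()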
So what am I doing wrong? Also, please feel free to critique my code in general, since I'm self-taught and have scattered some random questions through the code as comments. Any advice is much appreciated!
from twisted.internet import defer
from twisted.internet import reactor
from twisted.web import client
from lxml import html
from StringIO import StringIO
from os import path
import re
start_url = "http://www.thesupermodelsgallery.com/"
directory = "/home/z0e/Pictures/Pix/Twisted"
min_img_size = 100000
#maximum <a> links to get from main gallery
max_gallery_links = 500
#maximum <a> links to get from subsequent gallery/pages
max_picture_links = 35
def parsePage(info):

    def linkFilter(link):
        #filter unwanted <a> links
        if link is not None:
            trade_match = re.search(r'&trade=', link)
            href_split = link.split('=')
            for i in range(len(href_split)):
                if 'www' in href_split[i] and i > 0:
                    link = href_split[i]
            end_pattern = r'\.(com|com/|net|net/|pro|pro/)$'
            end_match = re.search(end_pattern, link)
            p_pattern = r'(.*)&p'
            p_match = re.search(p_pattern, link)
            if end_match or trade_match:
                return None
            elif p_match:
                link = p_match.group(1)
                return link
            else:
                return link
        else:
            return None

    # better to handle a link with 'None' value through TypeError
    # exception or through if else statements? Compare linkFilter
    # vs. imgFilter functions
    def imgFilter(link):
        #filter <img> links to retain only .jpg
        try:
            # escape the dot so it matches a literal '.' rather than any char
            jpg_match = re.search(r'\.jpg', link)
            if jpg_match is not None:
                return link
            else:
                return None
        except TypeError:
            return None

    link_num = 0
    gallery_flag = None
    info['level'] += 1
    if info['page'] == '':  # use '==', not 'is': equal strings need not be the same object
        return None
    # use lxml to parse and get document root
    tree = html.parse(StringIO(info['page']))
    root = tree.getroot()
    root.make_links_absolute(info['url'])
    # info['level'] = 1 corresponds to first recursive layer (i.e. main gallery page)
    # info['level'] > 1 will be all other <a> links from main gallery page
    if info['level'] == 1:
        link_cap = max_gallery_links
        gallery_flag = True
    else:
        link_cap = max_picture_links
        gallery_flag = False
    if info['level'] > 4:
        return None
    else:
        # get <img> links if page is not main gallery ('gallery_flag = False')
        # put <img> links back into main event loop to extract header information
        # to judge pictures by picture size (i.e. content-length)
        if not gallery_flag:
            for elem in root.iter('img'):
                # create copy of info so that dictionary no longer points to
                # previous dictionary, but new dictionary for each link
                info = info.copy()
                info['url'] = imgFilter(elem.get('src'))
                if info['url'] is not None:
                    reactor.callFromThread(getImgHeader, info)
        # get <a> link and put work back into main event loop (i.e. w/
        # reactor.callFromThread...) to getPage and then parse, continuing the
        # cycle of linking
        for elem in root.iter('a'):
            if link_num > link_cap:
                break
            else:
                img = elem.find('img')
                if img is not None:
                    link_num += 1
                    info = info.copy()
                    info['url'] = linkFilter(elem.get('href'))
                    if info['url'] is not None:
                        reactor.callFromThread(getLinkHTML, info)
def getLinkHTML(info):
    # get html from <a> link and then send page to be parsed in a thread
    d = client.getPage(info['url'])
    d.addCallback(parseThread, info)
    d.addErrback(failure, "getLink Failure: " + info['url'])

def parseThread(page, info):
    print 'parsethread:', info['url']
    info['page'] = page
    reactor.callInThread(parsePage, info)
def getImgHeader(info):
    # get <img> header information to filter images by image size
    agent = client.Agent(reactor)
    d = agent.request('HEAD', info['url'], None, None)
    d.addCallback(getImg, info)
    d.addErrback(failure, "getImgHeader Failure: " + info['url'])

def getImg(img_header, info):
    # download image only if image is above a certain threshold size
    img_size = img_header.headers.getRawHeaders('Content-Length')
    # check for None *before* indexing, otherwise int(img_size[0]) raises TypeError
    if img_size is not None and int(img_size[0]) > min_img_size:
        img_name = ''.join(map(urlToName, info['url']))
        client.downloadPage(info['url'], path.join(directory, img_name))
    else:
        img_header, link = None, None  # Does this help garbage collecting?
def urlToName(char):
    #convert all unwanted characters to '-' from url and use as file name
    if char in '/\\?|<>"':
        return '-'
    else:
        return char

def failure(error, url):
    print error
    print url

def main():
    info = dict()
    info['url'] = start_url
    info['level'] = 0
    reactor.callWhenRunning(getLinkHTML, info)
    reactor.suggestThreadPoolSize(2)
    reactor.run()

if __name__ == "__main__":
    main()
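For comparison, this is the kind of simplified, threadless structure I think I should be aiming for (a sketch under my own assumptions: that lxml parsing is cheap enough to run directly on the reactor thread, and that getPage forwards a timeout keyword on to HTTPClientFactory; link filtering, image handling, and de-duplication of visited links are omitted):

import sys
from StringIO import StringIO

from lxml import html
from twisted.internet import defer, reactor
from twisted.web import client

start_url = "http://www.thesupermodelsgallery.com/"
sem = defer.DeferredSemaphore(10)   # assumed concurrency cap

def crawl(url, level):
    # throttle fetches; 'timeout' is (I believe) passed to HTTPClientFactory
    d = sem.run(client.getPage, url, timeout=60)
    d.addCallback(parse, url, level)
    d.addErrback(lambda err: sys.stderr.write('%s: %s\n' % (url, err.getErrorMessage())))

def parse(page, url, level):
    # parse right here on the reactor thread instead of callInThread
    if level > 4 or not page:
        return
    root = html.parse(StringIO(page)).getroot()
    root.make_links_absolute(url)
    for elem in root.iter('a'):
        href = elem.get('href')
        if href is not None:
            crawl(href, level + 1)

reactor.callWhenRunning(crawl, start_url, 1)
reactor.run()   # stopping the reactor when crawling finishes is omitted for brevity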
Thanks for the response, Jean-Paul. I know of scrapy, but I wanted to use this scraping project to learn twisted and to compare it against other techniques. I'm confused about the identity you're referring to. Could you give an example? Since python uses indentation to define "blocks" of code rather than braces as in C, I'm not sure how to apply your suggestion. As for heavy network load, I don't think that's the case, at least not in any obvious way. – WacKaDoodle
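(My best guess at the identity point is the info['page'] is '' test in parsePage above; if so, here is a quick illustration of why that comparison is unreliable:)

# 'is' tests object identity, '==' tests equality. Two equal strings
# are not guaranteed to be the same object, so 'is' can be False even
# when the contents match.
a = 'gallery'
b = ''.join(['gal', 'lery'])   # builds a new, equal string at runtime
print a == b   # True: same contents
print a is b   # False (in CPython): different objects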
Also, I know the connection can handle many more requests than I'm giving it, because I've successfully run hundreds of requests at once across hundreds of threads with my traditional threaded scraper. Interesting that the main reactor loop is itself threaded(?). I always thought it was single-threaded. By calling reactor.suggestThreadPoolSize(), am I changing the number of threads outside the reactor loop, or does the reactor itself spawn extra threads outside its loop to serve DNS requests, hence the "reactor thread pool"? – WacKaDoodle
By using reactor.callInThread() I intend to create threads outside the main reactor loop/thread, while with reactor.callFromThread() I expect to insert my code's execution back into the main reactor loop/thread. Is that a correct understanding? I tried bumping suggestThreadPoolSize up to ~30 without any success :/. I was hoping something obviously wrong would jump out here, but it sounds like my code is too convoluted to understand at a glance. I'll try to make a simplified version of it while keeping the main twisted calls. Thanks again for your help! – WacKaDoodle
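To pin down my mental model, this is the minimal pattern I have in mind (a toy sketch, not from the scraper; blocking_work and handle_result are made-up names):

from twisted.internet import reactor

def blocking_work():
    # runs on a worker thread from the reactor's thread pool
    total = sum(range(10 ** 6))
    # hand the result back to the (single) reactor thread
    reactor.callFromThread(handle_result, total)

def handle_result(total):
    print 'back on the reactor thread:', total
    reactor.stop()

# resize the shared pool; the same pool serves callInThread and,
# as far as I understand, the default threaded DNS resolver
reactor.suggestThreadPoolSize(10)
reactor.callInThread(blocking_work)
reactor.run()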