
Lost links (Selenium / PhantomJS / ProcessPoolExecutor)

(Sorry for my English.) I want to write a parser for several sites that rely heavily on JavaScript. For this I use Selenium + PhantomJS + lxml. The parser needs to work fast, at least 1000 links per hour. To that end I use multiprocessing (not threading, because of the GIL) via the concurrent.futures module and ProcessPoolExecutor.

Here is the problem: when I feed an input list of 10 links to 5 workers, some links are lost after execution. It can be 1 link or more (up to 6 at most, but that is rare). This is obviously a bad result, and there is a clear dependency: increasing the number of processes increases the number of lost links. First I traced where the program breaks (assert does not work properly under multiprocessing) and found that it dies right after the line `browser.get(l)`. I then added time.sleep(x) to give the page some time to load; that gave no result. Next I tried to debug .get() in selenium.webdriver.remote.webdriver.py, but it just delegates to .execute(), and that function takes so many parameters that the call chain turned out to be too long and too difficult for me to follow. Meanwhile I also tried running the program with a single process, and I still lost 1 link, so I think the problem may not be Selenium/PhantomJS itself. I then replaced concurrent.futures ProcessPoolExecutor with multiprocessing.Pool: the problem went away and links are no longer lost, but only while the number of processes is <= 4. With more, this error appears:

""" 
multiprocessing.pool.RemoteTraceback: 
Traceback (most recent call last): 
File "/usr/lib/python3.4/multiprocessing/pool.py", line 119, in worker 
    result = (True, func(*args, **kwds)) 
File "/usr/lib/python3.4/multiprocessing/pool.py", line 44, in mapstar 
    return list(map(*args)) 
File "interface.py", line 34, in hotline_to_mysql 
    w = Parse_hotline().browser_manipulation(link) 
File "/home/water/work/parsing/class_parser/parsing_classes.py", line 352, in browser_manipulation 
    browser.get(l) 
File "/usr/local/lib/python3.4/dist-packages/selenium/webdriver/remote/webdriver.py", line 247, in get 
    self.execute(Command.GET, {'url': url}) 
File "/usr/local/lib/python3.4/dist-packages/selenium/webdriver/remote/webdriver.py", line 233, in execute 
    response = self.command_executor.execute(driver_command, params) 
File "/usr/local/lib/python3.4/dist-packages/selenium/webdriver/remote/remote_connection.py", line 401, in execute 
    return self._request(command_info[0], url, body=data) 
File "/usr/local/lib/python3.4/dist-packages/selenium/webdriver/remote/remote_connection.py", line 471, in _request 
    resp = opener.open(request, timeout=self._timeout) 
File "/usr/lib/python3.4/urllib/request.py", line 463, in open 
    response = self._open(req, data) 
File "/usr/lib/python3.4/urllib/request.py", line 481, in _open 
    '_open', req) 
File "/usr/lib/python3.4/urllib/request.py", line 441, in _call_chain 
    result = func(*args) 
File "/usr/lib/python3.4/urllib/request.py", line 1210, in http_open 
    return self.do_open(http.client.HTTPConnection, req) 
File "/usr/lib/python3.4/urllib/request.py", line 1185, in do_open 
    r = h.getresponse() 
File "/usr/lib/python3.4/http/client.py", line 1171, in getresponse 
    response.begin() 
File "/usr/lib/python3.4/http/client.py", line 351, in begin 
    version, status, reason = self._read_status() 
File "/usr/lib/python3.4/http/client.py", line 321, in _read_status 
    raise BadStatusLine(line) 
http.client.BadStatusLine: '' 

The above exception was the direct cause of the following exception: 

Traceback (most recent call last): 
File "interface.py", line 69, in <module> 
    main() 
File "interface.py", line 63, in main 
    executor.map(hotline_to_mysql, link_list) 
File "/usr/lib/python3.4/multiprocessing/pool.py", line 260, in map 
    return self._map_async(func, iterable, mapstar, chunksize).get() 
File "/usr/lib/python3.4/multiprocessing/pool.py", line 599, in get 
    raise self._value 
http.client.BadStatusLine: '' 
""" 


import sys
import random
import time

import lxml.html as lh
from multiprocessing import Pool
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

AMOUNT_PROCESS = 5

def parse(h)->list: 
    # h - str, html of page 
    lxml_ = lh.document_fromstring(h) 
    name = lxml_.xpath('/html/body/div[2]/div[7]/div[6]/ul/li[1]/a/@title') 
    prices_ = (price.text_content().strip().replace('\xa0', ' ') 
       for price in lxml_.xpath('//*[@id="gotoshop-price"]')) 
    markets_ =(market.text_content().strip() for market in 
      lxml_.find_class('cell shop-title')) 
    wares = [[name[0], market, price] for (market, price) 
      in zip(markets_, prices_)] 
    return wares 


def browser_manipulation(l):
    #options = []
    #options.append('--load-images=false')
    #options.append('--proxy={}:{}'.format(host, port))
    #options.append('--proxy-type=http')
    #options.append('--user-agent={}'.format(user_agent))  # headers are randomized here

    dcap = dict(DesiredCapabilities.PHANTOMJS)
    # USER_AGENT is a list of user-agent strings defined in my config.py
    dcap["phantomjs.page.settings.userAgent"] = random.choice(USER_AGENT)
    browser = webdriver.PhantomJS(desired_capabilities=dcap)
    #browser.implicitly_wait(20)
    #browser.set_page_load_timeout(80)
    browser.get(l)
    time.sleep(20)
    result = parse(browser.page_source)
    browser.quit()
    return result

def main():
    # read the input file: one link per line
    with open(sys.argv[1], 'r') as f:
        link_list = [line.strip() for line in f]
    with Pool(AMOUNT_PROCESS) as executor:
        executor.map(browser_manipulation, link_list)

if __name__ == '__main__':
    main()
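A side note on speed: the fixed `time.sleep(20)` in browser_manipulation by itself caps each worker at about 180 links per hour. Selenium's WebDriverWait waits for a condition instead of a fixed interval; the same idea in plain Python is a poll loop, sketched here with hypothetical names:

```python
import time

def wait_until(predicate, timeout=20.0, poll=0.5):
    """Poll predicate() until it returns a truthy value or timeout expires.
    This is the idea behind selenium's WebDriverWait: return as soon as
    the page is ready instead of always sleeping the full interval."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError('condition not met within %.1fs' % timeout)
```

With a real driver this would look like `wait_until(lambda: 'gotoshop-price' in browser.page_source)` in place of the 20-second sleep.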

So where is the problem (Selenium + PhantomJS, ProcessPoolExecutor, or my code)? With the number of processes set to <= 4 it almost works, but after a while the error above appears again. Why are links lost? How can I increase the parsing speed? And finally: is there a way to parse dynamic (JS-heavy) sites in Python without Selenium + PhantomJS? Speed matters, of course. Thanks for your answers.
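On that last question: many JS-heavy pages fetch their actual data from XHR endpoints that return JSON, and hitting such an endpoint directly with urllib is far faster than rendering the whole page in a browser. This is only a sketch: the endpoint URL, headers, and payload shape below are assumptions you would discover in the browser's network tab, not anything from the site in question.

```python
import json
from urllib.request import Request, urlopen

def fetch_json(url, timeout=30):
    """Fetch an XHR endpoint directly -- no browser involved."""
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode('utf-8'))

def extract_wares(payload):
    """Flatten a hypothetical JSON payload into [name, shop, price] rows,
    mirroring what parse() builds from the rendered HTML."""
    return [[item['name'], item['shop'], item['price']]
            for item in payload.get('items', [])]
```

Whether this works at all depends on the target site exposing such an endpoint; when it does, one worker can easily outrun a whole pool of PhantomJS instances.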

Answer


I tried ThreadPoolExecutor instead of ProcessPoolExecutor, and the links stopped being lost. With threads the speed is roughly equal to processes.
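For reference, the switch described above can be sketched like this; threads work here despite the GIL concern from the question, because each worker spends its time blocked on I/O to the external PhantomJS process, during which the GIL is released:

```python
from concurrent.futures import ThreadPoolExecutor

def run_all(links, worker, max_workers=5):
    """Map worker over links with a thread pool. The heavy lifting happens
    in the external PhantomJS process, so each Python thread mostly waits
    on I/O rather than holding the GIL."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(worker, links))
```

`results = run_all(link_list, browser_manipulation)` would then replace the Pool block in main(). This also avoids the pickling constraints of process pools.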

The underlying problem is still real; if you have any information about it, please write. Thank you.