2017-09-13 37 views

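Scraping a website that requires interaction

I'm working on a scraping project that looks at what recycling companies in the UK offer for different products.

The problem I've run into is with this website: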

http://www.musicmagpie.co.uk/entertainment/

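I have a list of barcodes and I want to find their buy prices (enter the barcode into the search box, then click the "Add" button). I've managed to get Selenium WebDriver working, but it is a very slow process, and I can't run a large number of barcodes without the site catching on to me and killing my process at some point.

My target is about 1 search per second; at the moment each search takes about 5 seconds or more on average. This is the code I'm running: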

import time
from selenium import webdriver

CHROMEDRIVER = r"C:\Users\leonK\Documents\Python Scripts\chromedriver.exe"
BASKET_URL = 'http://www.musicmagpie.co.uk/start-selling/basket-media'

# EANs is the list of barcodes to look up (defined elsewhere).
ProductName = []
BuyPrice = []
start = time.perf_counter()

driver = webdriver.Chrome(CHROMEDRIVER)
driver.get(BASKET_URL)

countx = 0  # barcodes processed in total
count = 0   # items expected in the basket of the current browser session
for EAN in EANs:
    countx += 1
    count += 1

    # Restart the browser every 200 searches.
    if count % 200 == 0:
        driver.close()
        driver = webdriver.Chrome(CHROMEDRIVER)
        driver.get(BASKET_URL)
        count = 1

    driver.find_element_by_xpath("""//*[@id="txtBarcode"]""").send_keys(str(EAN))

    # If a popup window appears, the first click fails and the handler closes the popup.
    try:
        driver.find_element_by_xpath("""//*[@id="getValSmall"]""").click()
    except Exception:
        driver.find_element_by_xpath("""//*[@id="gform_close"]""").click()

    # A new row in the basket means the barcode was recognised.
    prodnames = driver.find_elements_by_xpath("""//div[@class='col_Title']""")
    if len(prodnames) == count:
        ProductName.append(prodnames[0].text)
        BuyPrice.append(driver.find_elements_by_xpath("""//div[@class='col_Price']""")[0].text)
    else:
        ProductName.append('nan')
        BuyPrice.append('nan')
        count = len(prodnames)

    elapsed = time.perf_counter()
    print('MusicMagpieScraper:', EAN, '--', countx, '/', len(EANs), '--', (elapsed - start), 's')

driver.close()

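I have some experience using urllib and parsing with BeautifulSoup, and I would like to switch to that. However, I don't know how to extract the data without having the webdriver perform the click.

Any suggestions or tips would be greatly appreciated!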

Edit:

The link behind the Add button is:

__doPostBack('ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$tabbedMediaVal_10$getValSmall','') 
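For context: on ASP.NET Web Forms pages, a `__doPostBack(target, argument)` call typically copies its two arguments into the hidden `__EVENTTARGET` and `__EVENTARGUMENT` form fields and then submits the form. As a rough illustration, the sketch below pulls those two values out of the link text. The helper name `postback_fields` is hypothetical, and the regular expression assumes single-quoted arguments exactly as in the link above.

```python
import re

# Matches __doPostBack('target','argument') with single-quoted arguments.
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)'\s*,\s*'([^']*)'\)")


def postback_fields(js_call):
    """Map a __doPostBack(...) call to the two hidden form fields it sets."""
    match = POSTBACK_RE.search(js_call)
    if match is None:
        raise ValueError('not a __doPostBack call: %r' % js_call)
    target, argument = match.groups()
    return {'__EVENTTARGET': target, '__EVENTARGUMENT': argument}


link = ("__doPostBack('ctl00$ctl00$ctl00$ContentPlaceHolderDefault"
        "$mainContent$tabbedMediaVal_10$getValSmall','')")
fields = postback_fields(link)
print(fields['__EVENTTARGET'])
```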

This is what I found in the JS function:

{name: "__EVENTTARGET", value: ""} 
{name: "__EVENTARGUMENT", value: ""} 
{name: "__VIEWSTATE", value: "/wEPDwUENTM4MQ9kFgJmD2QWAmYPZBYCZg9kFgJmD2QWBGYPZB…uZSAhaW1wb3J0YW50O2RkQweS+jvDtjK8er7dCKBBRwOWWuE="} 
{name: "ctl00$ctl00$ctl00$ContentPlaceHolderDefault$signIn_8$hdn_BasketValue", value: "2"} 
{name: "ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$tabbedMediaVal_10$txtBarcode", value: "5051275026429"} 
{name: "ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$tabbedMediaVal_10$wtmBarcode_ClientState", value: ""} 
{name: "ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$tabbedTechVal_11$txtSearch", value: "Enter item (e.g. iPhone 5)"} 
{name: "ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$tabbedTechVal_11$wmSearch_ClientState", value: ""} 
{name: "ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$LegoVal_12$ddlLego", value: "-999"} 
{name: "ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$TotalValueBox_14$txtPromoVoucher_sm", value: ""} 
{name: "ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$TotalValueBox_14$txtPromoVoucher", value: ""} 
{name: "__SCROLLPOSITIONX", value: "0"} 
{name: "__SCROLLPOSITIONY", value: "0"} 
{name: "hiddenInputToUpdateATBuffer_CommonToolkitScripts", value: "1"} 

The txtBarcode entry is where the barcode is entered:

{name: "ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$tabbedMediaVal_10$txtBarcode", value: "5051275026429"} 

Hopefully this is useful information. I don't know where to go from here, and Google hasn't helped much.

Go through these tutorials; they will help you. https://www.youtube.com/playlist?list=PLQVvvaa0QuDfV1MIRBOcqClP6VZXsvyZS – babygame0ver

Example barcode? –

__doPostBack('ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$tabbedMediaVal_10$getValSmall','') –

Answers

1

I managed to find a solution to this using requests:

import requests
from bs4 import BeautifulSoup

# `ean` is the barcode being looked up (one item from the list of barcodes).
get_response = requests.get(url='http://www.musicmagpie.co.uk/start-selling/')
post_data = {
    '__EVENTTARGET': 'ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$tabbedMediaVal_10$getValSmall',
    '__EVENTARGUMENT': '',
    'ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$tabbedMediaVal_10$txtBarcode': ean,
}
# POST the form-encoded data, mimicking the click on the Add button:
post_response = requests.post(url='http://www.musicmagpie.co.uk/start-selling/', data=post_data)

soup = BeautifulSoup(post_response.text, "lxml")

BuyPrice = soup.find('div', class_='col_Price').text.rstrip()
ProductName = soup.find('div', class_='col_Title').text.rstrip()

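The code sends a dictionary of form field names and values (possibly not the correct terminology!), which triggers a response that is easy to parse and from which I extract the data I want.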
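To run this over a whole list of barcodes, the same request can be wrapped in a loop. The following is only a minimal sketch and has not been checked against the live site: the names `build_post_data`, `parse_result` and `lookup_all`, the `delay` and `timeout` values, and the use of `html.parser` instead of lxml are all assumptions. The `'nan'` placeholder mirrors the Selenium code in the question and covers the case where `soup.find` returns `None` because a barcode was not recognised.

```python
import time

import requests
from bs4 import BeautifulSoup

URL = 'http://www.musicmagpie.co.uk/start-selling/'
# Field-name prefix copied from the form data shown in the question.
PREFIX = 'ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$tabbedMediaVal_10$'


def build_post_data(ean):
    """Form fields that mimic clicking the Add button for one barcode."""
    return {
        '__EVENTTARGET': PREFIX + 'getValSmall',
        '__EVENTARGUMENT': '',
        PREFIX + 'txtBarcode': str(ean),
    }


def parse_result(html):
    """Return (product_name, buy_price), or ('nan', 'nan') if nothing matched."""
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.find('div', class_='col_Title')
    price = soup.find('div', class_='col_Price')
    if title is None or price is None:
        return 'nan', 'nan'
    return title.text.strip(), price.text.strip()


def lookup_all(eans, delay=1.0):
    """Hypothetical helper: look up each barcode, pausing `delay` seconds between requests."""
    results = {}
    for ean in eans:
        # Plain requests.post, as above, so no cookies carry over between lookups.
        response = requests.post(URL, data=build_post_data(ean), timeout=30)
        results[ean] = parse_result(response.text)
        time.sleep(delay)
    return results
```

Calling `lookup_all(EANs)` would then return a dictionary mapping each barcode to a `(ProductName, BuyPrice)` pair. The pause keeps the rate at roughly the one search per second mentioned in the question.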