2017-02-25 109 views

Unable to display the content between span tags

This is my code so far: http://pastebin.com/CdUiXpdf

import requests
from bs4 import BeautifulSoup


def web_crawler(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://www.kupindo.com/Knjige/artikli/1_strana_" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        print("PAGE: " + str(page))
        for link in soup.find_all("a", class_="item_link"):
            href = link.get("href")
            # title = link.string
            print(href)
            # print(title)
            extended_crawler(href)
        page += 1


def extended_crawler(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for view_counter in soup.find_all("span", id="BrojPregleda"):
        print("View Count: ", view_counter.text)


web_crawler(1)

The output is, for example:

PAGE: 1 
https://www.kupindo.com/showcontent/2143/Beletristika/37875219_VUK-DRASKOVIC-Izabrana-dela-1-7-Srpska-rec 
View Count: 

So the view count is empty: even though the extended_crawler function looks for the span with the id BrojPregleda, nothing is displayed.


@Arman What do you mean by the code being in PDF format? The pastebin link just happens to end in "pdf"; it is plain text. – dovla

1 Answer

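That is because the span with the id BrojPregleda is populated through an Ajax call, so it is still empty in the HTML that requests receives. Either use Selenium to get the value, or follow these steps:

1) Get the product ID from the URL

2) POST to http://www.kupindo.com/inc/ajx/Predmet/ajxGetBrojPregleda.php with one form-data key, IDPredmet, whose value is the ID from step 1

3) Read the view count from the response

Example: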

def extended_crawler(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    # The product ID sits between the last '/' and the last '_' of the URL.
    item_id = item_url[item_url.rfind('/') + 1:item_url.rfind('_')]
    view_count = requests.post(
        'http://www.kupindo.com/inc/ajx/Predmet/ajxGetBrojPregleda.php',
        data={'IDPredmet': item_id})
    print(view_count.text)
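
One caveat about step 1: the slice in the example relies on rfind('_'), so it would return too much if the title part of the URL ever contained an underscore. Below is a minimal sketch of a stricter extraction. It assumes item URLs always end in a segment of the form <digits>_<title>, which is based only on the single URL shown in the question, and extract_item_id is a hypothetical helper name, not something from the site or from requests.

```python
import re
from urllib.parse import urlparse


def extract_item_id(item_url):
    """Return the leading numeric ID of the last path segment, or None.

    Assumption: item URLs end in "<digits>_<title>", as in the question.
    """
    # Take only the path so a query string or fragment cannot interfere.
    last_segment = urlparse(item_url).path.rsplit("/", 1)[-1]
    # Anchor at the start of the segment and stop at the first underscore.
    match = re.match(r"(\d+)_", last_segment)
    return match.group(1) if match else None


url = ("https://www.kupindo.com/showcontent/2143/Beletristika/"
       "37875219_VUK-DRASKOVIC-Izabrana-dela-1-7-Srpska-rec")
print(extract_item_id(url))  # prints 37875219
```

The returned string can be passed as the IDPredmet value in the POST. A None result means the URL did not match the assumed pattern, so the request is better skipped for that item.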

This works, thank you very much. I never would have thought of this. – dovla