2016-05-17 57 views
import requests
from bs4 import BeautifulSoup

def parser(self):
    r = requests.get(self.url)
    self.soup = BeautifulSoup(r.content, "lxml")

But when I print the soup, it differs from the page source I actually want: the Python parsing library does not seem to return the correct page source.

For example, this is the page source as I see it in the browser:

<div class="zh-question-followers-sidebar">
<div class="zg-gray-normal">

<a href="/question/24269892/followers"><strong>109141</strong></a>
people focus on the questions

</div>

But when I use BeautifulSoup to fetch and parse the page, the markup does not look like that. Instead, it shows this:

<div class="zm-side-section">
<div class="zm-side-section-inner zg-gray-normal" id="zh-question-side-header-wrap">
<button class="follow-button zg-follow zg-btn-green" data-follow="q:m:button" data-id="1889792">focus question</button>

109143
people focus on the questions

</div>
</div>

Can anyone tell me why this happens and how to get the correct source?

Answers

1

Not all clients are served the same page. You should set the User-Agent of your request to that of a popular desktop browser by adding a header:

# Adjacent string literals are joined, so the header value stays on one line.
headers = {'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/39.0.2171.95 Safari/537.36')}

response = requests.get(url, headers=headers)
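
To see why the header matters, here is a minimal sketch that prepares a request without sending it and prints the User-Agent it would carry. The helper names `make_session` and `user_agent_sent` and the `https://example.com/` URL are made up for illustration; they are not part of the question's code. By default requests identifies itself as `python-requests/<version>`, which a site can tell apart from a browser and may answer with different markup.

```python
import requests

# Browser-like User-Agent string, the same value used in the answer.
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/39.0.2171.95 Safari/537.36"
)


def make_session(user_agent=BROWSER_UA):
    """Return a Session that sends `user_agent` on every request (hypothetical helper)."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def user_agent_sent(session, url):
    """Prepare (but do not send) a GET request and report its User-Agent header."""
    prepared = session.prepare_request(requests.Request("GET", url))
    return prepared.headers["User-Agent"]


if __name__ == "__main__":
    url = "https://example.com/"  # placeholder URL, not the site from the question
    # A plain Session announces itself as "python-requests/<version>".
    print(user_agent_sent(requests.Session(), url))
    # The customised Session announces itself as a desktop browser.
    print(user_agent_sent(make_session(), url))
```

Using a Session is only one option; passing `headers=` to each `requests.get` call, as in the answer, has the same effect for a single request.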

Now I can get the correct page source, thanks!