BeautifulSoup does not work on this website

import csv
import requests
from bs4 import BeautifulSoup
import lxml
import urllib2

# urllib2 opener with a browser-like User-agent
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]

f = open('ala2009link.csv', 'r')   # input: article URLs in the first column
s = open('2009alanews.csv', 'w')   # output: extracted headlines and paragraphs
for row in csv.reader(f):
    url = row[0]
    print url
    res = requests.get(url)
    print res.content
    soup = BeautifulSoup(res.content)
    print soup
    data = soup.find_all("article", {"class": "article-wrapper news"})
    #data = soup.find_all("main", {"class": "main-content"})
    for item in data:
        # headline of each article block
        title = item.find_all("h2", {"class": "article-headline"})[0].text
        s.write("%s \n" % title.encode('utf-8'))
    content = soup.find_all("p")
    for main in content:
        # write every paragraph as UTF-8 text
        k = main.text.encode('utf-8')
        s.write("%s \n" % k)
        #k = csv.writer(s)
        #k.writerow('%s\n' % (main))
s.close()
f.close()

This is my code for extracting data from web pages, but I don't know why it extracts nothing. Is an ad-blocker warning stopping BeautifulSoup? Here is an example link: http://www.rolltide.com/news/2009/6/23/Bert_Bank_Passes_Away.aspx?path=football


2016-08-01 Leong Wun Meng

Which line is the last one inside the "for row in csv.reader(f)" loop? Any chance you could provide sample HTML or a link? – pawelty

Can you provide the link? –

Sample: http://www.rolltide.com/news/2009/6/23/Bert_Bank_Passes_Away.aspx?path=football @pawelty –

No results are returned because the site requires a User-Agent header in your request.

To fix this, pass a headers argument containing a User-Agent to requests.get().

import requests

url = 'http://www.rolltide.com/news/2009/6/23/Bert_Bank_Passes_Away.aspx?path=football'
# Send a browser-like User-Agent so the site serves the real page
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/29.0.1547.65 Chrome/29.0.1547.65 Safari/537.36',
    }
res = requests.get(url, headers=headers)
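
As a rough sketch of how this fix might be combined with the parsing from the question, the Python 3 code below separates fetching from extraction. The helper names fetch_html and extract_article, the shortened User-Agent string, and the sample HTML fragment are all assumptions made for illustration. The class name article-headline is taken from the question's code and has not been checked against the real page.

```python
import requests
from bs4 import BeautifulSoup

# Browser-like User-Agent; the exact string is an assumption, other browser values may work too.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}


def fetch_html(url, timeout=10):
    """Download a page while sending the User-Agent header (hypothetical helper)."""
    res = requests.get(url, headers=HEADERS, timeout=timeout)
    res.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return res.text


def extract_article(html):
    """Return (headline, paragraphs) from an article page (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    headline_tag = soup.find("h2", {"class": "article-headline"})
    headline = headline_tag.get_text(strip=True) if headline_tag else ""
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    return headline, paragraphs


# Offline check on a made-up HTML fragment (not the real page markup).
SAMPLE_HTML = """
<article class="article-wrapper news">
  <h2 class="article-headline">Sample headline</h2>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</article>
"""
headline, paragraphs = extract_article(SAMPLE_HTML)
print(headline)
print(paragraphs)
```

For the real site, extract_article(fetch_html(url)) would be called once per URL read from the link CSV.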


2016-08-01 12:00:50 Mono

This is very useful! How did you know you also needed to pass the headers parameter? I haven't run into a case like this so far. – pawelty

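Just a quick test and experience. To learn, though, you can use Fiddler (or similar software) to capture the requests your browser makes and the requests your code makes, and see how they differ. – Mono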
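
One way to see part of that difference without a capture tool is to let requests build the request and then print its headers without sending anything. This is only a sketch: it shows what the code would send, not what the browser sends, and the short "Mozilla/5.0" value is an assumed placeholder.

```python
import requests

url = "http://www.rolltide.com/news/2009/6/23/Bert_Bank_Passes_Away.aspx?path=football"
session = requests.Session()

# Prepare (but do not send) a request with the library's default headers.
default_req = session.prepare_request(requests.Request("GET", url))
default_ua = default_req.headers["User-Agent"]

# Prepare the same request with an explicit, browser-like User-Agent.
custom_req = session.prepare_request(
    requests.Request("GET", url, headers={"User-Agent": "Mozilla/5.0"})
)
custom_ua = custom_req.headers["User-Agent"]

print("default:", default_ua)  # a python-requests/<version> string
print("custom: ", custom_ua)
```

The default value identifies the client as the requests library, which is the kind of difference a tool like Fiddler would reveal when compared with a browser capture.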

Thanks! I'll run some tests. – pawelty
