Web crawler does not scrape all URLs

2015-05-19 40 views 1 likes

I wrote the program below to extract the URLs of all the profiles from this search results page: https://www.ohiobar.org/Pages/Find-a-Lawyer.aspx?sFN=&sLN=&sPA=&sCI=&sST=OH&sZC=

There are roughly 18,400+ links to extract.

However, when I run the code, it does not get past URL #1623; it simply stops without raising an error or printing anything else.

Here is my code:

from bs4 import BeautifulSoup 
import requests 

url = 'https://www.ohiobar.org/Pages/Find-a-Lawyer.aspx?sFN=&sLN=&sPA=&sCI=&sST=OH&sZC=' 

with requests.Session() as session: 
    session.headers = {'User-Agent': 'Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30'} 

    response = session.get(url) 
    soup = BeautifulSoup(response.content, "lxml") 

    for link in soup.select("div#content_findResults div#content_column1 ul li a[href*=MemberProfile]"):
        print 'https://www.ohiobar.org' + link.get("href")

Please advise: what am I doing wrong?

Thanks

Source

2015-05-19 pb_ng

Answer

Since I can't comment yet, I'm adding this as an answer. I tried running your code with Python 3.4, and this is what I got:

[Image: Good Results!]

If possible, you could simply update your Python version.

I made only one small change, to this line:

soup = BeautifulSoup(response.content)
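Dropping the explicit "lxml" argument means Beautiful Soup chooses a parser itself from whatever is installed, and different parsers can build different trees from malformed HTML. Whether that explains the cut-off at #1623 is not confirmed here. As a rough sketch for checking it, the hypothetical helper below (count_links_per_parser is not part of bs4) counts the matches each installed parser yields for the same content. The sample bytes are a made-up stand-in for response.content.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_links_per_parser(html, selector,
                           parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of elements matching selector}.

    Parsers that are not installed are skipped.
    """
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            continue  # this parser is not available on this machine
        counts[name] = len(soup.select(selector))
    return counts


# Hypothetical sample standing in for response.content
sample = (b'<ul><li><a href="/MemberProfile?id=1">A</a></li>'
          b'<li><a href="/MemberProfile?id=2">B</a></li></ul>')

print(count_links_per_parser(sample, "a[href*=MemberProfile]"))
```

If the counts differ between parsers on the real page, the parser is the likely culprit. If they all agree, the missing links are probably not in the response at all.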

Code:

from bs4 import BeautifulSoup 
import requests 

url = 'https://www.ohiobar.org/Pages/Find-a-Lawyer.aspx?sFN=&sLN=&sPA=&sCI=&sST=OH&sZC=' 

with requests.Session() as session: 
    session.headers = {'User-Agent': 'Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30'} 

    response = session.get(url) 
    soup = BeautifulSoup(response.content) 
    counter = 0 

    for link in soup.select("div#content_findResults div#content_column1 ul li a[href*=MemberProfile]"):
        # Concatenate host and path so the printed URL contains no stray space
        print(counter, ":", 'https://www.ohiobar.org' + link.get("href"))
        counter += 1
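The counter only tells you how many links the parser found. To see whether the remaining links were in the response in the first place, one option is to compare that number with a crude regex count over the raw HTML. This is a minimal sketch using only the standard library: diagnose and PROFILE_HREF are hypothetical names, the regex assumes double-quoted href attributes, and the sample string is invented for illustration.

```python
import re

# Assumption: hrefs are double-quoted and contain "MemberProfile",
# the same substring the CSS selector looks for.
PROFILE_HREF = re.compile(r'href="([^"]*MemberProfile[^"]*)"')


def diagnose(raw_html, parsed_count):
    """Compare a regex count over the raw HTML with the parser's count."""
    raw_count = len(PROFILE_HREF.findall(raw_html))
    if raw_count > parsed_count:
        verdict = "raw HTML has more profile links than the parser returned"
    else:
        verdict = "parser returned every profile link found in the raw HTML"
    return raw_count, verdict


# Hypothetical sample standing in for response.text
sample_html = ('<a href="/MemberProfile?id=1">A</a>'
               '<a href="/MemberProfile?id=2">B</a>'
               '<a href="/Pages/About.aspx">C</a>')

print(diagnose(sample_html, 1))
```

In the real script you would pass response.text and the final value of counter. The regex counts links anywhere on the page, while the selector only looks inside div#content_findResults, so a small surplus in the raw count does not by itself prove the parser dropped anything.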

Regards, Alex

Source

2015-05-21 16:14:14 Alex
