2015-05-19

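Web crawler not scraping all URLs

I wrote the program below to collect the URLs of all the profiles from this search results page: https://www.ohiobar.org/Pages/Find-a-Lawyer.aspx?sFN=&sLN=&sPA=&sCI=&sST=OH&sZC=

There are about 18,400+ links to extract.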

However, when I run the code, it does not get past URL #1623; it stops without raising an error or printing anything else.

Here is my code:

from bs4 import BeautifulSoup 
import requests 

url = 'https://www.ohiobar.org/Pages/Find-a-Lawyer.aspx?sFN=&sLN=&sPA=&sCI=&sST=OH&sZC=' 

with requests.Session() as session: 
    session.headers = {'User-Agent': 'Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30'} 

    response = session.get(url) 
    soup = BeautifulSoup(response.content, "lxml") 

    for link in soup.select("div#content_findResults div#content_column1 ul li a[href*=MemberProfile]"): 
        print 'https://www.ohiobar.org' + link.get("href") 

Please advise: what am I doing wrong?

Thanks

Answer


Since I can't comment yet, I'll post this as an answer. I ran your code under Python 3.4, and this is what I got:

Good Results!

If possible, you could simply update your Python version.

I made only one small change, to this line:

soup = BeautifulSoup(response.content) 

Code:

from bs4 import BeautifulSoup 
import requests 

url = 'https://www.ohiobar.org/Pages/Find-a-Lawyer.aspx?sFN=&sLN=&sPA=&sCI=&sST=OH&sZC=' 

with requests.Session() as session: 
    session.headers = {'User-Agent': 'Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30'} 

    response = session.get(url) 
    soup = BeautifulSoup(response.content) 
    counter = 0 

    for link in soup.select("div#content_findResults div#content_column1 ul li a[href*=MemberProfile]"): 
        # Concatenate host and path so the printed URL contains no stray spaces 
        print(counter, ":", 'https://www.ohiobar.org' + link.get("href")) 
        counter += 1 
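
When no parser is named, BeautifulSoup picks whichever parser is installed, so the number of links found can differ between setups. One way to cross-check the count is to extract the links with Python's built-in html.parser alone. The following is only a rough sketch under some assumptions: the class ProfileLinkCollector, the function extract_profile_links, and the sample markup (including its made-up profile paths) are hypothetical, and unlike the CSS selector above it matches every a tag on the page rather than only those inside div#content_findResults.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.ohiobar.org"  # site root taken from the question


class ProfileLinkCollector(HTMLParser):
    """Hypothetical helper: collect every <a href> containing 'MemberProfile'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and "MemberProfile" in href:
            # urljoin builds the absolute URL from the site root and the path
            self.links.append(urljoin(BASE_URL, href))


def extract_profile_links(html):
    """Return absolute profile URLs found in an HTML string."""
    parser = ProfileLinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample markup; in practice, pass response.text from the code above.
sample_html = """
<div id="content_findResults"><div id="content_column1"><ul>
<li><a href="/Pages/MemberProfile.aspx?id=1">A</a></li>
<li><a href="/Pages/MemberProfile.aspx?id=2">B</a></li>
<li><a href="/Pages/Other.aspx">C</a></li>
</ul></div></div>
"""

links = extract_profile_links(sample_html)
print(len(links))
for number, link in enumerate(links, start=1):
    print(number, link)
```

If this count and the BeautifulSoup count disagree on the real page, the parser is a likely suspect; if both stop at the same number, the page itself may simply contain fewer links than expected.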

Regards, Alex