
I am trying to scrape a web page with the code below, looping over multiple URLs:

import requests 
from bs4 import BeautifulSoup 

page = requests.get("http://www.realcommercial.com.au/sold/property-offices-retail-showrooms+bulky+goods-land+development-hotel+leisure-medical+consulting-other-in-vic/list-1?includePropertiesWithin=includesurrounding&activeSort=list-date&autoSuggest=true") 

soup = BeautifulSoup(page.content, 'html.parser') 
links = soup.find_all('a', attrs={'class': 'details-panel'})
hrefs = [link['href'] for link in links] 

for urls in hrefs: 
    pages = requests.get(urls) 
    soup_2 = BeautifulSoup(pages.content, 'html.parser')

    Date = soup_2.find_all('li', attrs={'class': 'sold-date'})
    Sold_Date = [Sold_Date.text.strip() for Sold_Date in Date]
    Address_1 = soup_2.find_all('p', attrs={'class': 'full-address'})
    Address = [Address.text.strip() for Address in Address_1]

The code above only returns the details for the first URL in hrefs:

['Mon 05-Jun-17'] ['261 Keilor Road, Essendon, Vic 3040'] 

I need it to loop through every URL in hrefs and return the same details from each one. Please suggest what I should add or edit in the code above. Any help would be greatly appreciated.

Thanks

Answers


It is behaving correctly. You need to collect the information in a list that lives outside the loop and then return it from a function:

import requests
from bs4 import BeautifulSoup

def scrape_details():
    page = requests.get("http://www.realcommercial.com.au/sold/property-offices-retail-showrooms+bulky+goods-land+development-hotel+leisure-medical+consulting-other-in-vic/list-1?includePropertiesWithin=includesurrounding&activeSort=list-date&autoSuggest=true")

    soup = BeautifulSoup(page.content, 'html.parser')
    links = soup.find_all('a', attrs={'class': 'details-panel'})
    hrefs = [link['href'] for link in links]

    Data = []  # the outside list: it survives every loop iteration
    for url in hrefs:
        pages = requests.get(url)
        soup_2 = BeautifulSoup(pages.content, 'html.parser')

        date_tags = soup_2.find_all('li', attrs={'class': 'sold-date'})
        Sold_Date = [tag.text.strip() for tag in date_tags]
        address_tags = soup_2.find_all('p', attrs={'class': 'full-address'})
        Address = [tag.text.strip() for tag in address_tags]
        Data.append(Sold_Date + Address)
    return Data
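
Calling the function defined above then gives you every listing's details in one list, for example:

Data = scrape_details()
for row in Data:
    print(row)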

Thanks a lot Anubhav, it works for me now.


Could you also please guide me on how to run the same code over, say, 10 or 20 pages of the same site, without having to supply the link for each new page every time?
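
A minimal sketch for that, assuming the site paginates through the list-N segment visible in the URL above (the URL pattern and the page count here are assumptions, not verified against the site):

import requests
from bs4 import BeautifulSoup

# Assumed URL template: only the trailing list-N number changes per page.
BASE = ("http://www.realcommercial.com.au/sold/"
        "property-offices-retail-showrooms+bulky+goods-land+development-"
        "hotel+leisure-medical+consulting-other-in-vic/list-{}"
        "?includePropertiesWithin=includesurrounding"
        "&activeSort=list-date&autoSuggest=true")

all_hrefs = []
for page_no in range(1, 21):  # pages list-1 through list-20
    page = requests.get(BASE.format(page_no))
    soup = BeautifulSoup(page.content, 'html.parser')
    links = soup.find_all('a', attrs={'class': 'details-panel'})
    all_hrefs += [link['href'] for link in links]
# all_hrefs can then be fed into the same per-listing loop as above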


If it is working, please accept the answer to close the question.


You are overwriting the Address and Sold_Date objects on every iteration:

# each new assignment discards the data from the previous iteration
Sold_Date = [Sold_Date.text.strip() for Sold_Date in Date] 
Address = [Address.text.strip() for Address in Address_1] 

Try initializing empty lists outside the loop and extending them:

import requests 
from bs4 import BeautifulSoup 

page = requests.get("http://www.realcommercial.com.au/sold/property-offices-retail-showrooms+bulky+goods-land+development-hotel+leisure-medical+consulting-other-in-vic/list-1?includePropertiesWithin=includesurrounding&activeSort=list-date&autoSuggest=true") 

soup = BeautifulSoup(page.content, 'html.parser') 
links = soup.find_all('a', attrs={'class': 'details-panel'}) 
hrefs = [link['href'] for link in links] 

addresses = [] 
sold_dates = [] 
for url in hrefs:
    pages = requests.get(url)
    soup_2 = BeautifulSoup(pages.content, 'html.parser') 

    dates_tags = soup_2.find_all('li', attrs={'class': 'sold-date'}) 
    sold_dates += [date_tag.text.strip() for date_tag in dates_tags] 
    addresses_tags = soup_2.find_all('p', attrs={'class': 'full-address'}) 
    addresses += [address_tag.text.strip() for address_tag in addresses_tags] 

which gives us:

>>> sold_dates
[u'Tue 06-Jun-17', 
u'Tue 06-Jun-17', 
u'Tue 06-Jun-17', 
u'Tue 06-Jun-17', 
u'Tue 06-Jun-17', 
u'Tue 06-Jun-17', 
u'Tue 06-Jun-17', 
u'Mon 05-Jun-17', 
u'Mon 05-Jun-17', 
u'Mon 05-Jun-17'] 
>>> addresses
[u'141 Napier Street, Essendon, Vic 3040', 
u'5 Loupe Crescent, Leopold, Vic 3224', 
u'80 Ryrie Street, Geelong, Vic 3220', 
u'18 Boase Street, Brunswick, Vic 3056', 
u'130-186 Buckley Street, West Footscray, Vic 3012', 
u'223 Park Street, South Melbourne, Vic 3205', 
u'48-50 The Centreway, Lara, Vic 3212', 
u'14 Webster Street, Ballarat, Vic 3350', 
u'323 Nepean Highway, Frankston, Vic 3199', 
u'341 Buckley Street, Aberfeldie, Vic 3040'] 
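
Since both lists are extended once per listing page, they can be paired into (date, address) records with zip; this small follow-up sketch assumes each page yields exactly one date and one address, so the two lists stay aligned:

# Pair each sold date with its address; assumes the lists stayed aligned.
records = list(zip(sold_dates, addresses))
print(records[0])  # ('Tue 06-Jun-17', '141 Napier Street, Essendon, Vic 3040')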

Thanks a lot for your reply Azat!!


@Renusharma: did it work?