<div class="columns small-5 medium-4 cell header">Ref No.</div> 
<div class="columns small-7 medium-8 cell">110B60329</div>               

find_all with Beautiful Soup yields a blank return for the div tags on https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results

I want to run a loop and get back "110B60329". I ran Beautiful Soup, did a find_all on the divs, and defined two different sets of tags as head and data based on their class. I then iterate over the 'head' tags, hoping that returns the information held in the div tags I defined as data.

Python returns nothing (the cmd prompt just reprints the file path).

Would anyone know how I can fix this? My full code is below. Thanks.

import requests 
from bs4 import BeautifulSoup as soup 
import csv 


url = 'https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results' 

baseurl = 'https://www.saa.gov.uk' 

session = requests.session() 

response = session.get(url) 

# content of search page in soup 
html= soup(response.content,"lxml") 
properties_col = html.find_all('div') 



for col in properties_col: 
    ref = 'n/a' 
    des = 'n/a' 

    head = col.find_all("div", {"class": "columns small-5 medium-4 cell header"}) 

    data = col.find_all("div", {"class": "columns small-7 medium-8 cell"}) 

    for i, elem in enumerate(head): 
        # for i in range(elems): 
        if head[i].text == "Ref No.": 
            ref = data[i].text 
            print ref 

Answers


You can do this in two ways.

1) If you are sure the site you are scraping will not change its content, you can find all the divs of that class and pick out the content by index.

2) Find all the left-hand divs (the headers) and, when one matches the label you want, get the text of its next sibling.

Example:

import requests 
from bs4 import BeautifulSoup as soup 

url = 'https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results' 

baseurl = 'https://www.saa.gov.uk' 

session = requests.session() 

response = session.get(url) 

# content of search page in soup 
html = soup(response.content,"lxml") 

#Method 1 
LeftBlockData = html.find_all("div", class_="columns small-7 medium-8 cell") 
Reference = LeftBlockData[0].get_text().strip() 
Description = LeftBlockData[2].get_text().strip() 
print(Reference) 
print(Description) 

#Method 2 
for column in html.find_all("div", class_="columns small-5 medium-4 cell header"): 
    RightColumn = column.next_sibling.next_sibling.get_text().strip() 
    if "Ref No." in column.get_text().strip(): 
        print(RightColumn) 
    if "Description" in column.get_text().strip(): 
        print(RightColumn) 

The prints will output (in order):

110B60329

STORE

110B60329

STORE

Your problem is that you are matching node text that contains a lot of surrounding whitespace and tabs against a string with no whitespace.

For example, your head[i].text variable contains "Ref No." padded with whitespace, so comparing it with "Ref No." gives a false result. Stripping it will solve the problem.
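As a minimal sketch of that fix applied to the loop from the question (assuming head and data are gathered exactly as in the original code):

for i, elem in enumerate(head): 
    # strip() removes the padding whitespace so the comparison can succeed 
    if head[i].text.strip() == "Ref No.": 
        ref = data[i].text.strip() 
        print(ref) 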

Thanks very much, that works –

Hi, I followed the same logic and tried to pull the 'Rateable Value' from the same page by adding the lines RightBlockData = html.find_all("div", class_="columns small-12 medium-5") and Rateable_Value = RightBlockData[2].get_text().strip() –

But I get an error with RightBlockData = html.find_all("div", class_="columns small-12 medium-5") Rateable_Value = RightBlockData[2].get_text().strip() –

import requests 
from bs4 import BeautifulSoup 

r = requests.get("https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results") 
soup = BeautifulSoup(r.text, 'lxml') 
for row in soup.find_all(class_='table-row'): 
    print(row.get_text(strip=True, separator='|').split('|')) 

Output:

['Ref No.', '110B60329'] 
['Office', 'LOTHIAN VJB'] 
['Description', 'STORE'] 
['Property Address', '29 BOSWALL PARKWAY', 'EDINBURGH', 'EH5 2BR'] 
['Proprietor', 'SCOTTISH MIDLAND CO-OP SOCIETY LTD.'] 
['Tenant', 'PROPRIETOR'] 
['Occupier'] 
['Net Annual Value', '£1,750'] 
['Marker'] 
['Rateable Value', '£1,750'] 
['Effective Date', '01-APR-10'] 
['Other Appeal', 'NO'] 
['Reval Appeal', 'NO'] 

get_text() is a very powerful tool: you can strip out the whitespace in the text and put in a separator.

You can use this method to get clean data and filter it.
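For instance, here is a minimal sketch that turns those rows into a dict and pulls out individual fields (the pairing is an assumption based on the output above: the first piece of each row is treated as the label and the rest as its values):

import requests 
from bs4 import BeautifulSoup 

r = requests.get("https://www.saa.gov.uk/search/?SEARCHED=1&ST=&SEARCH_TERM=city+of+edinburgh%2C+BOSWALL+PARKWAY%2C+EDINBURGH&ASSESSOR_ID=&SEARCH_TABLE=valuation_roll_cpsplit&DISPLAY_COUNT=10&TYPE_FLAG=CP&ORDER_BY=PROPERTY_ADDRESS&H_ORDER_BY=SET+DESC&DRILL_SEARCH_TERM=BOSWALL+PARKWAY%2C+EDINBURGH&DD_TOWN=EDINBURGH&DD_STREET=BOSWALL+PARKWAY&UARN=110B60329&PPRN=000000000001745&ASSESSOR_IDX=10&DISPLAY_MODE=FULL#results") 
soup = BeautifulSoup(r.text, 'lxml') 

# build a {label: value} mapping from the table rows 
details = {} 
for row in soup.find_all(class_='table-row'): 
    parts = row.get_text(strip=True, separator='|').split('|') 
    if parts and parts[0]: 
        details[parts[0]] = ', '.join(parts[1:]) 

print(details.get('Ref No.'))          # 110B60329 
print(details.get('Rateable Value'))   # £1,750 

Rows such as 'Occupier' and 'Marker' have no value in the listing above, so they simply map to an empty string.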

Thanks for your method. –