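Python BeautifulSoup: extracting a value without an identifier

I'm facing a problem and don't know how to solve it properly. I want to extract the price (130 € in the first example and 130 € in the second).

The problem is that the attributes keep changing. I can't do something like this, because I scrape hundreds of sites and the first two characters of the "id" attribute may differ on each site: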

tag = soup_expose_html.find('span', attrs={'id' : re.compile(r'(07_content$)')})

Even if I used something like this, it wouldn't work, because nothing ties the match to the price, so I might get some other value:

tag = soup_expose_html.find('span', attrs={'id' : re.compile(r'([0-9]{2}_content$)')})
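To make the concern concrete, here is a minimal sketch. The markup is a made-up variation (an assumption, not taken from any real site) in which the "Value" row happens to come before the "Price" row; the generic pattern then returns the first `_content` span, whichever row that is.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page where "Value" is listed before "Price".
html = """
<span id="05_lbl" class="lbl">Value:</span>
<span id="05_content" class="content">90000 €</span>
<span id="06_lbl" class="lbl">Price:</span>
<span id="06_content" class="content">130 €</span>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first span whose id matches, in document order.
tag = soup.find("span", attrs={"id": re.compile(r"[0-9]{2}_content$")})
first_match = tag.get_text(strip=True)
print(first_match)  # 90000 € (the value, not the price)
```

The pattern itself works, but it says nothing about which label the matched span belongs to.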

Sample HTML:

<span id="07_lbl" class="lbl">Price:</span> 
<span id="07_content" class="content">130 €</span> 
<span id="08_lbl" class="lbl">Value:</span> 
<span id="08_content" class="content">90000 €</span> 


<span id="03_lbl" class="lbl">Price:</span> 
<span id="03_content" class="content">130 €</span> 
<span id="04_lbl" class="lbl">Value:</span> 
<span id="04_content" class="content">90000 €</span>

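The only thing I can think of at the moment is to identify the price label with "text='Price:'", then take its .next_sibling and extract the string. But I'm not sure whether there is a better way to do this. Any suggestions? :-)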

2014-09-29 user3740082

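Why isn't it 130 in the second example? – 2014-09-29 07:53:39

I think it would be very hard to write a generic crawler like this for hundreds of sites with BeautifulSoup. – WeaselFox 2014-09-29 07:54:53

Do you want to extract only the price, or both the price and the value? The answers posted so far all return both. – 2014-09-29 08:13:17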

Here is how the approach you had in mind in your original post could look. It makes it easy to extract only the price.

html = """ 
     <span id="07_lbl" class="lbl">Price:</span> 
     <span id="07_content" class="content">130 €</span> 
     <span id="08_lbl" class="lbl">Value:</span> 
     <span id="08_content" class="content">90000 €</span> 


     <span id="03_lbl" class="lbl">Price:</span> 
     <span id="03_content" class="content">130 €</span> 
     <span id="04_lbl" class="lbl">Value:</span> 
     <span id="04_content" class="content">90000 €</span> 
""" 

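from bs4 import BeautifulSoup 
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly 

# string= is the current name of the older text= argument 
price_texts = soup.find_all('span', string='Price:') 
for element in price_texts: 
    # .next_sibling would return the whitespace between the two spans here, 
    # so ask for the next <span> sibling instead 
    price_value = element.find_next_sibling('span') 
    print(price_value.get_text()) 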

# It prints: 
# 130 € 
# 130 €

This solution needs less code and is, IMO, clearer.
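Since label text tends to vary slightly from site to site, the same idea can be wrapped in a small helper. This is only a sketch: the function name `values_for_label` and the tolerance rules (surrounding whitespace, letter case, optional colon) are assumptions for illustration, not something taken from the question.

```python
import re
from bs4 import BeautifulSoup


def values_for_label(html, label):
    """Return the text of the <span> that follows each matching label span."""
    soup = BeautifulSoup(html, "html.parser")
    # Accept "Price:", " price : " and similar spellings of the label.
    pattern = re.compile(r"^\s*" + re.escape(label) + r"\s*:?\s*$", re.IGNORECASE)
    values = []
    for lbl in soup.find_all("span", string=pattern):
        value = lbl.find_next_sibling("span")
        if value is not None:  # skip a label with no following span
            values.append(value.get_text(strip=True))
    return values


sample = """
<span id="07_lbl" class="lbl"> price: </span>
<span id="07_content" class="content">130 €</span>
<span id="08_lbl" class="lbl">Value:</span>
<span id="08_content" class="content">90000 €</span>
"""
print(values_for_label(sample, "Price"))
```

Matching on the label text rather than on the id keeps the lookup independent of the changing id prefix.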

2014-09-29 11:59:50

Thanks a lot! – user3740082 2014-09-30 11:50:09

Try BeautifulSoup's select function. It uses CSS selectors:

for span in soup_expose_html.select('span[id$="_content"]'): 
    print(span.text)

The result is a list of all spans whose id ends with _content.
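To narrow that list down to the price only, one option is to select the label spans instead and follow the shared id prefix to the matching content span. A rough sketch, assuming every `NN_lbl` span has an `NN_content` counterpart as in the sample HTML:

```python
from bs4 import BeautifulSoup

html = """
<span id="07_lbl" class="lbl">Price:</span>
<span id="07_content" class="content">130 €</span>
<span id="08_lbl" class="lbl">Value:</span>
<span id="08_content" class="content">90000 €</span>
<span id="03_lbl" class="lbl">Price:</span>
<span id="03_content" class="content">130 €</span>
<span id="04_lbl" class="lbl">Value:</span>
<span id="04_content" class="content">90000 €</span>
"""

soup = BeautifulSoup(html, "html.parser")

prices = []
for label in soup.select('span[id$="_lbl"]'):
    if label.get_text(strip=True) != "Price:":
        continue
    # "07_lbl" -> "07_content"
    content_id = label["id"][: -len("_lbl")] + "_content"
    content = soup.find("span", id=content_id)
    if content is not None:
        prices.append(content.get_text(strip=True))

print(prices)  # ['130 €', '130 €']
```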

2014-09-29 07:53:35

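If I do it this way, how do I pick out the price? I get a list of different numbers, but can't tell which one is the price. – user3740082 2014-09-29 09:43:22

How about a findAll solution that lists all spans whose id ends with _content?
First collect all candidate id prefixes, then iterate over them and fetch the matching elements.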

>>> from bs4 import BeautifulSoup 
>>> import re 
>>> html = """ 
...   <span id="07_lbl" class="lbl">Price:</span> 
...   <span id="07_content" class="content">130 €</span> 
...   <span id="08_lbl" class="lbl">Value:</span> 
...   <span id="08_content" class="content">90000 €</span> 
... 
... 
...   <span id="03_lbl" class="lbl">Price:</span> 
...   <span id="03_content" class="content">130 €</span> 
...   <span id="04_lbl" class="lbl">Value:</span> 
...   <span id="04_content" class="content">90000 €</span> 
... """ 
>>> 
>>> soup = BeautifulSoup(html, 'html.parser') 
>>> span_id_prefixes = [ 
...  span['id'].replace("_content","") 
...  for span in soup.find_all('span', attrs={'id' : re.compile(r'(_content$)')}) 
... ] 
>>> for prefix in span_id_prefixes: 
...  lbl  = soup.find('span', attrs={'id' : '%s_lbl' % prefix}) 
...  content = soup.find('span', attrs={'id' : '%s_content' % prefix}) 
...  if lbl and content: 
...   print(lbl.text, content.text) 
... 
Price: 130 € 
Value: 90000 € 
Price: 130 € 
Value: 90000 €
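Once labels and contents are paired by prefix, picking out the price is a matter of filtering on the label. A possible variation, sketched under the assumption that ids are unique within the page; the helper name `label_value_pairs` is made up for this example. It indexes the spans by id once instead of calling soup.find for every prefix.

```python
from bs4 import BeautifulSoup


def label_value_pairs(html):
    """Pair each NN_lbl span with its NN_content span via the shared prefix."""
    soup = BeautifulSoup(html, "html.parser")
    # One pass over the document: id -> span (assumes ids are unique).
    spans_by_id = {span["id"]: span for span in soup.find_all("span", id=True)}
    pairs = []
    for span_id, span in spans_by_id.items():
        if not span_id.endswith("_content"):
            continue
        prefix = span_id[: -len("_content")]
        label = spans_by_id.get(prefix + "_lbl")
        if label is not None:
            pairs.append((label.get_text(strip=True), span.get_text(strip=True)))
    return pairs


sample = """
<span id="07_lbl" class="lbl">Price:</span>
<span id="07_content" class="content">130 €</span>
<span id="08_lbl" class="lbl">Value:</span>
<span id="08_content" class="content">90000 €</span>
"""

pairs = label_value_pairs(sample)
prices = [value for label, value in pairs if label == "Price:"]
print(prices)  # ['130 €']
```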

2014-09-29 08:07:07 xecgr

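If I do it this way, how do I pick out the price? I get a list of different numbers, but can't tell which one is the price. – user3740082 2014-09-29 09:44:09

Take a look at my latest edit; I think I've understood your comment correctly now. – xecgr 2014-09-29 09:51:48

Thank you very much, this works better than I had imagined. – user3740082 2014-09-29 11:06:48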
