How to get part of the text from a long tag using BeautifulSoup

I have been working on a shopping site, and I want to extract the brand name and the product name from HTML that looks like this:

<h1 class="product-name elim-suites">Chantecaille<span itemprop="name" >Limited Edition Protect the Lion Eye Palette</span></h1>

I tried: results = soup.findAll("h1", {"class" : "product-name elim-suites"})[0].text

and got: u'ChantecailleLimited Edition Protect the Lion Eye Palette'

As you can see, Chantecaille is the brand name and the rest is the product name, but the two are now stuck together. Any suggestions? Thanks!

Source

2016-09-14 user6606453

Try using '.contents' or '.strings' instead of '.text', then join the strings, as demonstrated [here](http://stackoverflow.com/questions/16121001) – bunji
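A minimal sketch of what this comment suggests, assuming the markup from the question. It uses `.stripped_strings`, a variant of `.strings` that also trims surrounding whitespace. It yields each text node under the tag separately, so the pieces can be joined with any separator. The variable names here are illustrative only.

```python
from bs4 import BeautifulSoup

html = ('<h1 class="product-name elim-suites">Chantecaille'
        '<span itemprop="name" >Limited Edition Protect the Lion Eye Palette</span></h1>')

soup = BeautifulSoup(html, "html.parser")
h1 = soup.find("h1", class_="product-name")

# .stripped_strings yields every text node under the tag, whitespace-stripped
parts = list(h1.stripped_strings)

# Join with an explicit separator so the pieces no longer run together
joined = " ".join(parts)

print(parts)
print(joined)
```

Keeping `parts` as a list is handy when the brand and the product name are needed separately rather than as one string.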

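You can use previous_sibling, which returns the previous node that has the same parent (that is, the node at the same level of the parse tree).

Also, use find instead of findAll when you are searching for a single element.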

item_span = soup.find("h1", {"class": "product-name elim-suites"}).find("span")

# The text node just before the <span> is the brand; the span holds the product name
brand_name = item_span.previous_sibling
product_name = item_span.text

print(brand_name)
print(product_name)

Output:

Chantecaille 
Limited Edition Protect the Lion Eye Palette
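For anyone who wants to run this approach end to end, here is a self-contained sketch. The HTML string is copied from the question; the type check and the `strip()` calls are additions for illustration and are not part of the original answer. They guard against `previous_sibling` being `None` (when the span is the first child) or being another tag rather than text.

```python
from bs4 import BeautifulSoup, NavigableString

html = ('<h1 class="product-name elim-suites">Chantecaille'
        '<span itemprop="name" >Limited Edition Protect the Lion Eye Palette</span></h1>')

soup = BeautifulSoup(html, "html.parser")
span = soup.find("h1", class_="product-name").find("span")

sibling = span.previous_sibling
# previous_sibling may be None or a Tag, so only treat it as text
# when it really is a text node
brand_name = sibling.strip() if isinstance(sibling, NavigableString) else None
product_name = span.get_text(strip=True)

print(brand_name)
print(product_name)
```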

Source

2016-09-14 18:58:16 Jarvis

You can use get_text and pass a separator character, or pull the h1's own text with h1.find(text=True, recursive=False) and pull the text from the span directly:

In [1]: h ="""<h1 class="product-name elim-suites">Chantecaille<span itemprop="name" >Limited Edition Protect the Lion Eye Palette 
    ...: </span></h1>""" 

In [2]: from bs4 import BeautifulSoup 

In [3]: soup = BeautifulSoup(h, "html.parser") 

In [4]: h1 = soup.select_one("h1.product-name.elim-suites") 

In [5]: print(h1.get_text("\n")) 
Chantecaille 
Limited Edition Protect the Lion Eye Palette 


In [6]: prod, desc = h1.find(text=True, recursive=False), h1.span.text 

In [7]: print(prod, desc) 
(u'Chantecaille', u'Limited Edition Protect the Lion Eye Palette\n')

Or, if text may also appear after the span, use find_all:

In [8]: h ="""<h1 class="product-name elim-suites">Chantecaille 
<span itemprop="name" >Limited Edition Protect the Lion Eye Palette</span>other text</h1>""" 


In [9]: from bs4 import BeautifulSoup 

In [10]: soup = BeautifulSoup(h, "html.parser") 

In [11]: h1 = soup.select_one("h1.product-name.elim-suites") 

In [12]: print(h1.get_text("\n")) 
Chantecaille 
Limited Edition Protect the Lion Eye Palette 
other text 

In [13]: prod, desc = " ".join(h1.find_all(text=True, recursive=False)), h1.span.text 

In [14]: 

In [14]: print(prod, desc) 
(u'Chantecaille other text', u'Limited Edition Protect the Lion Eye Palette')
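The two sessions above can be folded into one small helper. This is only a sketch: `split_heading` is a hypothetical name, and it uses the `string=` keyword, which Beautiful Soup 4.4 and later accept in place of `text=`. It also strips whitespace from each piece, which the sessions above do not do.

```python
from bs4 import BeautifulSoup


def split_heading(h1):
    """Return (own_text, span_text) for an <h1> that wraps a <span>."""
    # string=True matches text nodes; recursive=False keeps only direct children
    own = [s.strip() for s in h1.find_all(string=True, recursive=False)]
    # Drop pieces that were whitespace only, then join the rest
    own_text = " ".join(s for s in own if s)
    span_text = h1.span.get_text(strip=True)
    return own_text, span_text


html = """<h1 class="product-name elim-suites">Chantecaille
<span itemprop="name" >Limited Edition Protect the Lion Eye Palette</span>other text</h1>"""

h1 = BeautifulSoup(html, "html.parser").select_one("h1.product-name.elim-suites")
brand, name = split_heading(h1)

print(brand)
print(name)
```

Filtering out empty pieces means stray newlines between tags do not show up as extra spaces in the joined result.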

Source

2016-09-14 19:21:53
