獲取在Python

<h3> 
<a href="article.jsp?tp=&arnumber=16"> 
Granular computing based 
<span class="snippet">data</span> 
<span class="snippet">mining</span> 
in the views of rough set and fuzzy set 
</a> 
</h3>

使用Python兩個標記之間的數據我想從它應該是在粗糙集和模糊集獲取在Python

的意見粒度基於計算的數據挖掘我嘗試使用lxml的錨標記上的值

parser = etree.HTMLParser() 
tree = etree.parse(StringIO.StringIO(html), parser)     
xpath1 = "//h3/a/child::text() | //h3/a/span/child::text()" 
rawResponse = tree.xpath(xpath1)    
print rawResponse

，並得到以下輸出

['\r\n\t\t','\r\n\t\t\t\t\t\t\t\t\tgranular computing based','data','mining','in the view of roughset and fuzzyset\r\n\t\t\t\t\t\t\]

來源

2013-05-26 Jack

你是否必須使用'lxml'？因爲我大概可以想出一個解決方案，用'BeautifulSoup' – TerryA

我可以使用任何東西 – Jack

你可以使用text_content方法：

import lxml.html as LH 

html = '''<h3> 
<a href="article.jsp?tp=&arnumber=16"> 
Granular computing based 
<span class="snippet">data</span> 
<span class="snippet">mining</span> 
in the views of rough set and fuzzy set 
</a> 
</h3>''' 

root = LH.fromstring(html) 
for elt in root.xpath('//a'): 
    print(elt.text_content())

產生

Granular computing based 
data 
mining 
in the views of rough set and fuzzy set

，或者刪除空格，你可以使用

print(' '.join(elt.text_content().split()))

獲得

Granular computing based data mining in the views of rough set and fuzzy set

這裏，你可能會發現有用的另一種選擇：

print(' '.join([elt.strip() for elt in root.xpath('//a/descendant-or-self::text()')]))

產生

Granular computing based data mining in the views of rough set and fuzzy set

（請注意，離開data和mining之間的額外空間但是。）

'//a/descendant-or-self::text()'是一個比較廣義版本 "//a/child::text() | //a/span/child::text()"。它會遍歷所有的孩子和孫子等。

來源

2013-05-26 11:27:11 unutbu

感謝您的幫助 – Jack

隨着BeautifulSoup：

>>> from bs4 import BeautifulSoup 
>>> html = (the html you posted above) 
>>> soup = BeautifulSoup(html) 
>>> print " ".join(soup.h3.text.split()) 
Granular computing based data mining in the views of rough set and fuzzy set

說明：

BeautifulSoup解析HTML，使之方便。 soup.h3訪問HTML中的h3標籤。

.text，簡單地說就是從h3標記中獲得所有標記，但不包括所有其他標記，例如span s。

我在這裏使用split()來擺脫多餘的空格和換行符，然後" ".join()作爲split函數返回一個列表。

來源

2013-05-26 11:16:26 TerryA

謝謝。它的工作 – Jack

回答

相關問題