2013-05-26 66 views
2
<h3> 
<a href="article.jsp?tp=&arnumber=16"> 
Granular computing based 
<span class="snippet">data</span> 
<span class="snippet">mining</span> 
in the views of rough set and fuzzy set 
</a> 
</h3> 

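Get the data between two tags in Python

I want to get the text of the anchor tag above, which should be "Granular computing based data mining in the views of rough set and fuzzy set". I tried using lxml: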

from io import StringIO
from lxml import etree

# html holds the <h3> snippet shown above
parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)
xpath1 = "//h3/a/child::text() | //h3/a/span/child::text()"
rawResponse = tree.xpath(xpath1)
print(rawResponse)

and got the following output:

['\r\n\t\t','\r\n\t\t\t\t\t\t\t\t\tgranular computing based','data','mining','in the view of roughset and fuzzyset\r\n\t\t\t\t\t\t']
+0

Do you have to use 'lxml'? I could probably come up with a solution using 'BeautifulSoup'. – TerryA

+0

I can use anything. – Jack

Answers

3

You can use the text_content method:

import lxml.html as LH 

html = '''<h3> 
<a href="article.jsp?tp=&arnumber=16"> 
Granular computing based 
<span class="snippet">data</span> 
<span class="snippet">mining</span> 
in the views of rough set and fuzzy set 
</a> 
</h3>''' 

root = LH.fromstring(html) 
for elt in root.xpath('//a'): 
    print(elt.text_content()) 

which yields

Granular computing based 
data 
mining 
in the views of rough set and fuzzy set 

Or, to remove the extra whitespace, you can use

print(' '.join(elt.text_content().split())) 

to get

Granular computing based data mining in the views of rough set and fuzzy set 

Here is another option you might find useful:

print(' '.join([elt.strip() for elt in root.xpath('//a/descendant-or-self::text()')])) 

which yields

Granular computing based data  mining in the views of rough set and fuzzy set

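(Note, however, that this leaves an extra space between data and mining, because the whitespace-only text node between the two spans becomes an empty string after strip().)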

'//a/descendant-or-self::text()' is a more general version of "//a/child::text() | //a/span/child::text()". It visits the text of all children, grandchildren, and so on.
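To illustrate the difference between the two XPath expressions, here is a minimal sketch. The markup is hypothetical: it is the question's snippet with an extra <b> tag wrapped around "mining", which is an assumption made only to show a deeper level of nesting. The helper name collect is also made up for this example. It drops empty strings after stripping, which avoids the extra space mentioned above.

```python
import lxml.html as LH

# Hypothetical markup: the second span wraps its text in an extra <b> tag.
html = '''<h3><a href="#">Granular computing based
<span class="snippet">data</span>
<span class="snippet"><b>mining</b></span>
in the views of rough set and fuzzy set</a></h3>'''

root = LH.fromstring(html)


def collect(xpath):
    """Join the non-empty, stripped text nodes selected by xpath."""
    return ' '.join(t.strip() for t in root.xpath(xpath) if t.strip())


# Only looks at text directly under <a> and directly under <a>/<span>.
union = collect('//a/child::text() | //a/span/child::text()')

# Looks at text at any depth below <a>.
general = collect('//a/descendant-or-self::text()')

print(union)    # "mining" is missing: it sits one level deeper, inside <b>
print(general)  # all of the text is present
```

The union expression has to list every nesting level explicitly, so it silently skips text that sits deeper than expected, while the descendant-or-self form does not depend on the depth.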

+0

Thanks for your help. – Jack

1

With BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> # html is the HTML string you posted above
>>> soup = BeautifulSoup(html, "html.parser")
>>> print(" ".join(soup.h3.text.split()))
Granular computing based data mining in the views of rough set and fuzzy set

Explanation:

BeautifulSoup parses the HTML and makes it easy to navigate. soup.h3 accesses the h3 tag in the HTML.

.text simply returns all of the text inside the h3 tag, without the markup of any nested tags such as the spans.

I use split() here to get rid of the excess whitespace and newlines, and then " ".join() because split() returns a list.
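The split-and-join step can be seen in isolation with plain strings. This is a small sketch; the raw string below is made up to resemble the kind of text an HTML parser hands back, with tabs and line breaks around the words.

```python
# Made-up sample resembling text extracted from indented HTML.
raw = "\r\n\t\tGranular computing based\r\n\t\tdata \n mining\r\n"

# With no arguments, split() breaks on any run of whitespace
# (spaces, tabs, \r, \n) and discards leading and trailing whitespace.
words = raw.split()

# join() glues the list back together with exactly one space between words.
normalized = " ".join(words)

print(words)
print(normalized)
```

Because split() without arguments never yields empty strings, this approach cannot produce doubled spaces, unlike calling strip() on each text node separately and joining the results.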

+0

Thanks. It works. – Jack