2011-11-30 209 views
1
<font class="detDesc">Uploaded 10-29&nbsp;18:50, Size 4.36&nbsp;GiB, ULed by <a class="detDesc" href="/user/NLUPPER002/" title="Browse NLUPPER002">NLUPPER002</a></font> 

From this I need to extract "Uploaded 10-29 18:50, Size 4.36 GiB" and "NLUPPER002" into two separate arrays. How can I do this with Beautiful Soup in Python?

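Edit:

This is part of an HTML page that contains many of these font tags, each with different values. I need a general solution, using Beautiful Soup if possible. Otherwise, as suggested, I will look into regular expressions.

Edit 2:

I have a doubt about this. If we use "class" as a key when searching the soup, won't it clash with the Python keyword class and throw an error?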
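On the `class` doubt, a minimal sketch (Python 3, standard library only) that asks the interpreter to parse both spellings without running them. The helper name `parses` is hypothetical, and `soup` is never evaluated. It suggests the clash only occurs when `class` is written as a keyword argument, not when it is a string key in a dict:

```python
def parses(expression):
    """Return True if `expression` is syntactically valid Python."""
    try:
        compile(expression, "<example>", "eval")
        return True
    except SyntaxError:
        return False

# `class` is a reserved word, so it cannot be used as a keyword argument name.
keyword_form = parses('soup.findAll("font", class="detDesc")')

# Inside a dict, "class" is an ordinary string key, so there is no clash.
dict_form = parses('soup.findAll("font", {"class": "detDesc"})')

print(keyword_form, dict_form)
```

Newer Beautiful Soup releases (the `bs4` package) also accept `class_` as a keyword argument for the same purpose.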

+0

Regular expressions could help you do this easily. –

+1

@JohnRiselvato No, regular expressions are almost never a good solution for parsing XML/HTML –

+0

I could dump it as JSON, but that still would not solve my problem, because the HTML of this page is poorly written. Or so I think! – Hick

Answers

2
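from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3 on Python 2

soup = BeautifulSoup(your_data)  # your_data holds the HTML to parse
uploaded = []   # text that precedes the <a> tag in each <font>
link_data = []  # text of the <a> tag itself
for f in soup.findAll("font", {"class":"detDesc"}):
    uploaded.append(f.contents[0])
    link_data.append(f.a.contents[0])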

For example, with the following data:

your_data = """ 
<font class="detDesc">Uploaded 10-29&nbsp;18:50, Size 4.36&nbsp;GiB, ULed by <a class="detDesc" href="/user/NLUPPER002/" title="Browse NLUPPER002">NLUPPER002</a></font> 
<div id="meow">test</div> 
<font class="detDesc">Uploaded 10-26&nbsp;19:23, Size 1.16&nbsp;GiB, ULed by <a class="detDesc" href="/user/NLUPPER002/" title="Browse NLUPPER002">NLUPPER003</a></font> 
""" 

Running the code above gives you:

>>> print uploaded 
[u'Uploaded 10-29&nbsp;18:50, Size 4.36&nbsp;GiB, ULed by ', u'Uploaded 10-26&nbsp;19:23, Size 1.16&nbsp;GiB, ULed by '] 
>>> print link_data 
[u'NLUPPER002', u'NLUPPER003'] 

To get the text in the exact form you described, you can either post-process the list or parse the data inside the loop itself. For example:

>>> [",".join(x.split(",")[:2]).replace("&nbsp;", " ") for x in uploaded] 
[u'Uploaded 10-29 18:50, Size 4.36 GiB', u'Uploaded 10-26 19:23, Size 1.16 GiB'] 
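If the upload time and size are wanted as separate values, the same post-processing idea can go one step further. This is a sketch for Python 3 using only the standard library. The helper name `split_fields` is hypothetical, and it assumes every string follows the exact "Uploaded ..., Size ..., ULed by" layout shown above:

```python
import html

raw = [
    "Uploaded 10-29&nbsp;18:50, Size 4.36&nbsp;GiB, ULed by ",
    "Uploaded 10-26&nbsp;19:23, Size 1.16&nbsp;GiB, ULed by ",
]

def split_fields(text):
    """Return (upload_time, size) from one 'Uploaded ..., Size ..., ULed by' string."""
    # Decode entities (&nbsp; becomes U+00A0), then switch to plain spaces.
    clean = html.unescape(text).replace("\xa0", " ")
    # Keep only the first two comma-separated parts and trim whitespace.
    uploaded_part, size_part = [p.strip() for p in clean.split(",")[:2]]
    # Drop the leading labels; this relies on the assumed layout.
    return uploaded_part[len("Uploaded "):], size_part[len("Size "):]

fields = [split_fields(t) for t in raw]
print(fields)
```

Using `html.unescape` rather than a literal `replace("&nbsp;", " ")` also covers any other entities that might appear in the text.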

P.S. If you are a fan of list comprehensions, the solution can be expressed as a one-liner:

output = [(f.contents[0], f.a.contents[0]) for f in soup.findAll("font", {"class":"detDesc"})] 

This gives you:

>>> output # list of tuples 
[(u'Uploaded 10-29&nbsp;18:50, Size 4.36&nbsp;GiB, ULed by ', u'NLUPPER002'), (u'Uploaded 10-26&nbsp;19:23, Size 1.16&nbsp;GiB, ULed by ', u'NLUPPER003')] 

>>> uploaded, link_data = zip(*output) # split into two separate lists 
>>> uploaded 
(u'Uploaded 10-29&nbsp;18:50, Size 4.36&nbsp;GiB, ULed by ', u'Uploaded 10-26&nbsp;19:23, Size 1.16&nbsp;GiB, ULed by ') 
>>> link_data 
(u'NLUPPER002', u'NLUPPER003') 
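Note that `zip(*output)` yields tuples, and the two-name unpacking fails when `output` is empty (for example, when no matching font tags are found). A small sketch of a guard, written for Python 3; the helper name `unzip_pairs` and the cleaned-up sample strings are assumptions for illustration:

```python
output = [
    ("Uploaded 10-29 18:50, Size 4.36 GiB", "NLUPPER002"),
    ("Uploaded 10-26 19:23, Size 1.16 GiB", "NLUPPER003"),
]

def unzip_pairs(pairs):
    """Split a list of 2-tuples into two lists; two empty lists if there are no pairs."""
    if not pairs:
        # zip(*[]) produces nothing, so unpacking into two names would raise ValueError.
        return [], []
    first, second = zip(*pairs)
    return list(first), list(second)

uploaded, link_data = unzip_pairs(output)
print(uploaded)
print(link_data)
```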
+0

Great solution. Thanks. I also like the lxml solution given by @acorn. Cheers! – Hick

1

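The expression you need to find the elements you are interested in depends on how unique those elements are compared with the other elements in the document. Without knowing the context the elements appear in, it is hard to help.

Are the elements you are interested in the only font elements in the document that have the class detDesc?

If so, here is a solution using lxml: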

import lxml.html as lh 

html = ''' 
<font class="detDesc">Uploaded 10-29&nbsp;18:50, Size 4.36&nbsp;GiB, ULed by <a class="detDesc" href="/user/NLUPPER002/" title="Browse NLUPPER002">NLUPPER002</a></font> 
''' 

tree = lh.fromstring(html) 

results = [] 

# iterate over all <font> elements whose class attribute is exactly "detDesc"
for el in tree.xpath("//font[@class='detDesc']"): 

    # extract text from the font element 
    first = el.text 

    # extract text from the first <a> within the font element 
    second = el.xpath("a")[0].text 

    results.append((first, second)) 

print results 

Result:

[(u'Uploaded 10-29\xa018:50, Size 4.36\xa0GiB, ULed by ', 'NLUPPER002')] 
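The `\xa0` characters in this result are non-breaking spaces, decoded from `&nbsp;`. One possible cleanup, sketched for Python 3 with the standard library, is Unicode NFKC normalization, which maps U+00A0 to an ordinary space. The helper name `tidy` is hypothetical, and dropping the trailing ", ULed by " assumes the layout shown above:

```python
import unicodedata

results = [("Uploaded 10-29\xa018:50, Size 4.36\xa0GiB, ULed by ", "NLUPPER002")]

def tidy(text):
    """Replace non-breaking spaces and drop the trailing ', ULed by ' part."""
    # NFKC maps U+00A0 (no-break space) to a regular space.
    normalized = unicodedata.normalize("NFKC", text)
    # Cut at the last comma, which precedes 'ULed by' in the assumed layout.
    return normalized.rsplit(",", 1)[0]

tidy_results = [(tidy(first), second) for first, second in results]
print(tidy_results)
```

NFKC also rewrites other compatibility characters (ligatures, full-width forms), so a plain `text.replace("\xa0", " ")` may be preferable when only the spaces should change.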
+0

Wow! Very interesting. – Hick

+0

Yes. Perfect solution. Can I use the same concept in the other places where I have used BeautifulSoup? Or would that be a bad idea? – Hick

+0

I mean, I am using soup.findAll('a', title='something') to replace all the links that share a specific, repeated title. – Hick