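from bs4 import BeautifulSoup  # BeautifulSoup 3: from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(your_data)
uploaded = []
link_data = []
for f in soup.findAll("font", {"class": "detDesc"}):
    uploaded.append(f.contents[0])     # text node before the <a> tag
    link_data.append(f.a.contents[0])  # text inside the <a> tag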
For example, with the following data:
your_data = """
<font class="detDesc">Uploaded 10-29 18:50, Size 4.36 GiB, ULed by <a class="detDesc" href="/user/NLUPPER002/" title="Browse NLUPPER002">NLUPPER002</a></font>
<div id="meow">test</div>
<font class="detDesc">Uploaded 10-26 19:23, Size 1.16 GiB, ULed by <a class="detDesc" href="/user/NLUPPER002/" title="Browse NLUPPER002">NLUPPER003</a></font>
"""
running the code above gives you:
>>> print uploaded
[u'Uploaded 10-29 18:50, Size 4.36 GiB, ULed by ', u'Uploaded 10-26 19:23, Size 1.16 GiB, ULed by ']
>>> print link_data
[u'NLUPPER002', u'NLUPPER003']
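To get the text in exactly the form you described, you can either post-process the lists or parse the data inside the loop itself. For example: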
>>> [",".join(x.split(",")[:2]).replace(" ", " ") for x in uploaded]
[u'Uploaded 10-29 18:50, Size 4.36 GiB', u'Uploaded 10-26 19:23, Size 1.16 GiB']
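If you would rather parse inside the loop than post-process, one option is a small helper that you call on each string as you collect it. This is only a sketch: `parse_description` is a hypothetical name, and it assumes every string has the "Uploaded ..., Size ..., ULed by" layout shown in the sample data (it raises an error if a string has fewer than two comma-separated fields).

```python
def parse_description(text):
    """Split 'Uploaded <when>, Size <size>, ULed by ' into (when, size)."""
    # Turn non-breaking spaces (if the parser produced any) into plain spaces.
    text = text.replace(u"\xa0", u" ")
    # Keep the first two comma-separated fields; the third is the 'ULed by' stub.
    when, size = [part.strip() for part in text.split(",")[:2]]
    # Drop the leading labels when they are present.
    if when.startswith("Uploaded "):
        when = when[len("Uploaded "):]
    if size.startswith("Size "):
        size = size[len("Size "):]
    return when, size


# Sample strings copied from the `uploaded` output shown earlier.
samples = [
    u"Uploaded 10-29 18:50, Size 4.36 GiB, ULed by ",
    u"Uploaded 10-26 19:23, Size 1.16 GiB, ULed by ",
]
parsed = [parse_description(s) for s in samples]
print(parsed)
```

Inside the loop you would then append parse_description(f.contents[0]) instead of the raw string.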
P.S. If you are a fan of list comprehensions, the solution can be expressed as a one-liner:
output = [(f.contents[0], f.a.contents[0]) for f in soup.findAll("font", {"class":"detDesc"})]
This gives you:
>>> output # list of tuples
[(u'Uploaded 10-29 18:50, Size 4.36 GiB, ULed by ', u'NLUPPER002'), (u'Uploaded 10-26 19:23, Size 1.16 GiB, ULed by ', u'NLUPPER003')]
>>> uploaded, link_data = zip(*output) # split into two separate tuples
>>> uploaded
(u'Uploaded 10-29 18:50, Size 4.36 GiB, ULed by ', u'Uploaded 10-26 19:23, Size 1.16 GiB, ULed by ')
>>> link_data
(u'NLUPPER002', u'NLUPPER003')
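Two things to keep in mind with the zip(*output) trick: it yields tuples rather than lists, and the two-name unpacking fails if no matching tags were found, because zip then produces nothing to unpack. A minimal sketch that handles both, using the sample `output` values from above as stand-in data:

```python
# Stand-in for the list of (description, username) tuples built by the one-liner.
output = [
    (u"Uploaded 10-29 18:50, Size 4.36 GiB, ULed by ", u"NLUPPER002"),
    (u"Uploaded 10-26 19:23, Size 1.16 GiB, ULed by ", u"NLUPPER003"),
]

if output:
    # Transpose the pairs, then convert each resulting tuple to a list.
    uploaded, link_data = [list(column) for column in zip(*output)]
else:
    # No matches: fall back to two empty lists instead of failing to unpack.
    uploaded, link_data = [], []

print(link_data)
```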
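Regular expressions can help you do this easily. –
@JohnRiselvato No, regular expressions are almost never a good solution for parsing XML/HTML –
I could dump it to JSON, but that still wouldn't solve my problem, because this page's HTML is badly written. Or so I think! – Hick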