BeautifulSoup bug?

I have the following code:

for table in soup.findAll("table", "tableData"):
    for row in table.findAll("tr"):
        data = row.findAll("td")
        url = data[0].a
        print type(url)

I get this output:

<class 'bs4.element.Tag'>

This means that url is an object of class Tag, so I should be able to get attributes from it. But if I replace print type(url) with print url['href'], I get the following traceback:

Traceback (most recent call last): 
File "baseCreator.py", line 57, in <module> 
    createStoresTable() 
File "baseCreator.py", line 46, in createStoresTable 
    print url['href'] 
TypeError: 'NoneType' object has no attribute '__getitem__'

What is wrong? And how can I get the value of the href attribute?

2012-07-26 KoirN

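You have a loop; are you sure *all* tr > td elements have an '<a>' tag? – 2012-07-26 17:58:00

This error means that the url it failed on is None. Try running it with "if url: print url['href']". – Lenna 2012-07-26 18:02:32

Thanks, you're right. The page contains a very large table, and every row is supposed to have a URL. But when I looked carefully, I found that the URL is missing in one row. – KoirN 2012-07-26 18:34:13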
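
The fix from the comments above can be sketched roughly as follows. This is only a minimal, self-contained illustration, written for Python 3 with the bs4 package: the inline HTML and the helper name extract_hrefs are made up for the example, and only the tableData class and the loop structure come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: the second row has no <a> in its first cell.
HTML = """
<table class="tableData">
  <tr><td><a href="/store/1">Store 1</a></td><td>Open</td></tr>
  <tr><td>Store 2 (no link)</td><td>Closed</td></tr>
  <tr><td><a href="/store/3">Store 3</a></td><td>Open</td></tr>
</table>
"""


def extract_hrefs(html):
    """Collect the href of the first cell's link in each row, skipping rows without one."""
    soup = BeautifulSoup(html, "html.parser")
    hrefs = []
    for table in soup.find_all("table", class_="tableData"):
        for row in table.find_all("tr"):
            cells = row.find_all("td")
            if not cells:
                continue  # e.g. a header row that only contains <th> cells
            link = cells[0].a  # None when the cell contains no <a> tag
            if link is None:
                continue
            href = link.get("href")  # .get() returns None instead of raising
            if href is not None:
                hrefs.append(href)
    return hrefs


print(extract_hrefs(HTML))
```

With the sample above, only the two rows that actually contain a link contribute a value; the row without an <a> is skipped instead of raising the TypeError.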

I like BeautifulSoup, but I personally prefer lxml.html (for HTML that isn't too quirky) because of the ability to use XPath.

import lxml.html 
page = lxml.html.parse('http://somesite.tld') 
print page.xpath('//tr/td/a/@href')

Depending on the structure, though, you might need to use some form of XPath "axes".
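
As a rough sketch of the XPath approach on an inline document rather than a live URL (Python 3 with lxml installed is assumed; the sample HTML and the tableData class filter are illustrative, not part of the answer):

```python
import lxml.html

# Hypothetical sample page: the second row has no link at all.
HTML = """
<html><body>
<table class="tableData">
  <tr><td><a href="/store/1">Store 1</a></td><td>Open</td></tr>
  <tr><td>Store 2 (no link)</td><td>Closed</td></tr>
  <tr><td><a href="/store/3">Store 3</a></td><td>Open</td></tr>
</table>
</body></html>
"""

root = lxml.html.fromstring(HTML)

# Rows without an <a> simply produce no match, so no None check is needed.
hrefs = root.xpath('//table[@class="tableData"]//tr/td[1]/a/@href')

# The following-sibling axis picks the cell right after each linked cell.
statuses = root.xpath('//td[a]/following-sibling::td[1]/text()')

print(hrefs)
print(statuses)
```

Here the td[1] step mirrors data[0].a from the question, and the following-sibling step is one example of the kind of axis that may be needed when the wanted data sits next to the link rather than inside it.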

You can also use ElementSoup as a parser - details at http://lxml.de/elementsoup.html
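
A minimal sketch of that route, assuming Python 3 with both lxml and bs4 installed: lxml.html.soupparser lets BeautifulSoup do the parsing and hands back an lxml tree, so XPath still works on tag soup. The unclosed sample markup is made up for illustration.

```python
from lxml.html import soupparser

# Hypothetical tag soup: the last row, the cell and the table are never closed.
BROKEN_HTML = (
    '<table><tr><td><a href="/store/1">Store 1</a></td></tr>'
    '<tr><td>no link'
)

# BeautifulSoup parses the markup; the result is converted to an lxml element tree.
root = soupparser.fromstring(BROKEN_HTML)

hrefs = root.xpath('//tr/td/a/@href')
print(hrefs)
```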

2012-07-26 18:16:47

lxml also has a BeautifulSoup backend. – Marcin 2012-07-26 18:17:45

@Marcin Good point, I forgot to mention soupparser - updated, thanks. – 2012-07-26 18:20:48
