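Below are two examples and a possible solution:
- Example 1 shows a working case.
- Example 2 shows a non-working case that raises the error you reported.
- The solution shows one possible way to handle it.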
Example 1: the HTML has the expected div
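from bs4 import BeautifulSoup  # assumes BeautifulSoup 4; findAll is its legacy alias for find_all

doc = ['<html><head><title>Page title</title></head>',
       '<body><div class="entry-content"><div>http://teste.com</div>',
       '<div>http://teste2.com</div></div></body>',
       '</html>']
soup = BeautifulSoup(''.join(doc), 'html.parser')
# Find the container div, then every nested div that has no class attribute
url = soup.find('div', attrs={"class": "entry-content"}).findAll('div', attrs={"class": None})
# Binary mode, because encode() returns bytes
with open(r'.\parveen_urls.txt', 'wb') as fobj:
    for getting in url:
        fobj.write(getting.string.encode('utf8'))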
Example 2: the HTML does not have the expected div
from bs4 import BeautifulSoup  # assumes BeautifulSoup 4

doc = ['<html><head><title>Page title</title></head>',
       '<body><div class="entry"><div>http://teste.com</div>',
       '<div>http://teste2.com</div></div></body>',
       '</html>']
soup = BeautifulSoup(''.join(doc), 'html.parser')
# The error is raised on the next line: no div has class "entry-content",
# so find() returns None, and calling findAll on None raises
# AttributeError: 'NoneType' object has no attribute 'findAll'
url = soup.find('div', attrs={"class": "entry-content"}).findAll('div', attrs={"class": None})
with open(r'.\parveen_urls2.txt', 'wb') as fobj:
    for getting in url:
        fobj.write(getting.string.encode('utf8'))
Possible solution:
from bs4 import BeautifulSoup  # assumes BeautifulSoup 4

doc = ['<html><head><title>Page title</title></head>',
       '<body><div class="entry"><div>http://teste.com</div>',
       '<div>http://teste2.com</div></div></body>',
       '</html>']
soup = BeautifulSoup(''.join(doc), 'html.parser')
url = soup.find('div', attrs={"class": "entry-content"})
# Deal with documents that do not have the expected HTML structure
if url is not None:
    url = url.findAll('div', attrs={"class": None})
    with open(r'.\parveen_urls2.txt', 'wb') as fobj:
        for getting in url:
            fobj.write(getting.string.encode('utf8'))
else:
    print("The HTML source does not comply with the expected structure")
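If you need this check in more than one place, the guard can be wrapped in a small function so the missing-container case is handled once. This is only a sketch, assuming BeautifulSoup 4 (the bs4 package) on Python 3; the name extract_urls and the sample strings are hypothetical and not part of the original code.

```python
from bs4 import BeautifulSoup


def extract_urls(html):
    """Return the text of each class-less <div> inside div.entry-content.

    Returns an empty list when the container div is missing.
    extract_urls is a hypothetical helper name used for illustration.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", attrs={"class": "entry-content"})
    if container is None:
        # find() found no match, so there is nothing to search inside
        return []
    # Keep only the nested divs that carry no class attribute at all
    inner = [d for d in container.find_all("div") if not d.has_attr("class")]
    return [d.get_text() for d in inner]


good = ('<body><div class="entry-content"><div>http://teste.com</div>'
        '<div>http://teste2.com</div></div></body>')
bad = '<body><div class="entry"><div>http://teste.com</div></div></body>'

print(extract_urls(good))
print(extract_urls(bad))
```

Returning an empty list instead of None lets the caller loop over the result without a second check.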
What if you use '.text' instead of '.string'? – alecxe
@alecxe Yes, that works, but can you tell me why? –
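One likely reason '.text' works where '.string' does not: '.string' only has a value when a tag has exactly one child (a single string, or a single child tag that itself has a '.string'). When a tag holds several children, '.string' is None, and calling encode on None raises an AttributeError. '.text' (the same as get_text()) joins every string found inside the tag. A small sketch, assuming BeautifulSoup 4 on Python 3; the markup is made up for illustration and is not the page from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second inner div mixes text with a <b> tag
html = ('<div class="entry-content"><div>http://teste.com</div>'
        '<div>see <b>http://teste2.com</b> here</div></div>')
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find("div", attrs={"class": "entry-content"}).find_all("div")

single_child_string = first.string    # exactly one text child, so a string
mixed_child_string = second.string    # three children, so None
mixed_child_text = second.get_text()  # all nested strings joined together

print(repr(single_child_string))
print(repr(mixed_child_string))
print(repr(mixed_child_text))
```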