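Writing BeautifulSoup content to a file

I recently asked this question about encoding Hindi characters in BeautifulSoup. The answer to that question did solve the problem, but I now have another one.

My code is: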

import urllib2 
from bs4 import BeautifulSoup 

htmlUrl = "http://archives.ndtv.com/articles/2012-01.html" 
FileName = "NDTV_2012_01.txt" 

fptr = open(FileName, "w") 
fptr.seek(0) 

page = urllib2.urlopen(htmlUrl) 
soup = BeautifulSoup(page, from_encoding="UTF-8") 

li = soup.findAll('li') 
for link_tag in li: 
    hypref = link_tag.find('a').contents[0] 
    strhyp = hypref.encode('utf-8') 
    fptr.write(strhyp) 
    fptr.write("\n")

I get this error:

Traceback (most recent call last):
  File "./ScrapeTemplate.py", line 29, in <module>
    hypref = link_tag.find('a').contents[0]
IndexError: list index out of range

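It seems to work when I substitute print strhyp for fptr.write(). How can I fix this?

Edit: The code had a mistake that I hadn't spotted. I fixed it, but I still get the same error.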


2013-01-19 Kitchi

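I tried your code and didn't get any errors. What are you trying to achieve? Do you want to get the href of the links? Can you post your expected output? Thanks.

@AnneLagang - I've changed the code. The output should be a list of the titles in the HTML page, except that I get this error. – Kitchi

Your code is tripping over the links at the bottom of the page. Skip those: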

for link_tag in li:
    contents = link_tag.find('a').contents
    # Only write the title when the <a> tag actually has content
    if len(contents) > 0:
        hypref = contents[0]
        strhyp = hypref.encode('utf-8')
        fptr.write(strhyp)
        fptr.write("\n")


2013-01-19 10:54:09 oefe

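Oh, right! I didn't look through the whole terminal output. This was helpful, thanks! – Kitchi

The cause of the error has nothing to do with writing to the file. It looks like link_tag.find('a').contents sometimes returns an empty list, and taking the first item of an empty list raises the IndexError. You could try something like this: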

for link_tag in li:
    try:
        hypref = link_tag.find('a').contents[0]
    except IndexError:
        print link_tag  # print the link tag whose 'a' had no contents
        continue
    strhyp = hypref.encode('utf-8')
    fptr.write(strhyp)
    fptr.write("\n")
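For anyone on Python 3 with a current bs4, here is a rough sketch of the same idea, not the code from the question. It assumes the built-in "html.parser" backend, uses get_text() instead of contents[0] (a swap on my part, so that nested tags inside the link are handled too), and also skips li items that have no a tag at all. SAMPLE_HTML, extract_titles and write_titles are made-up names for illustration; the sample markup only stands in for the real page.

```python
import io

from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page: one Hindi title,
# one empty link (like the footer links), one <li> with no link at all.
SAMPLE_HTML = u"""
<ul>
  <li><a href="/a">पहला शीर्षक</a></li>
  <li><a href="/rss"></a></li>
  <li>no link here</li>
  <li><a href="/b">Second title</a></li>
</ul>
"""


def extract_titles(html):
    """Return the link text of every <li>, skipping items without usable text."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for li_tag in soup.find_all("li"):
        anchor = li_tag.find("a")
        if anchor is None:
            # No <a> inside this <li>; nothing to extract.
            continue
        text = anchor.get_text(strip=True)
        if not text:
            # Empty <a>: this is the case that raised IndexError with contents[0].
            continue
        titles.append(text)
    return titles


def write_titles(titles, path):
    """Write one title per line, encoded as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as fptr:
        for title in titles:
            fptr.write(title + u"\n")
```

Opening the file with an explicit encoding means the strings can be written directly, so the separate encode('utf-8') step is not needed in this version.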


2013-01-19 10:26:36 root

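But when I print the same output directly to the terminal there is no error, which is why I suspected the problem was in writing to the file. – Kitchi

@Kitchi It also happens if you print to the terminal (I tried). Your code is tripping over the links at the bottom of the page (RSS, news alerts, mobile, etc.) – oefe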
