How to "clean" all the entries in a feedparser feed

I backed up my blog in Google's XML format. It is quite long. So far, I have done this:

>>> import feedparser 
>>> blogxml = feedparser.parse('blog.xml') 
>>> type(blogxml) 
<class 'feedparser.FeedParserDict'>

In the book I am reading, the author does this:

>>> import feedparser
>>> llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
>>> llog['feed']['title']
u'Language Log'
>>> len(llog.entries)
15
>>> post = llog.entries[2]
>>> post.title
u"He's My BF"
>>> content = post.content[0].value
>>> content[:70]
u'<p>Today I was chatting with three of our visiting graduate students f'
>>> nltk.word_tokenize(nltk.html_clean(content))

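That works for me on an entry-by-entry basis. As you can see, I already have a way of cleaning the HTML using NLTK. What I really want is to grab all of the entries, clean the HTML out of them (which I already know how to do and am not asking about; please read the question carefully), and then write them to a file as plaintext strings. So this is really about using feedparser correctly. Is there an easy way to do this?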

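Update:

As it turns out, I am still no closer to finding an easy way to do this. Because of my ineptitude with Python, I was forced to do something a bit ugly.

This is what I thought I would do: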

import feedparser
import nltk

blog = feedparser.parse('myblog.xml')

with open('myblog','w') as outfile:
    for itemnumber in range(0, len(blog.entries)):
        conts = blog.entries[itemnumber].content
        cleanconts = nltk.word_tokenize(nltk.html_clean(conts))
        outfile.write(cleanconts)

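So, thank you very much, @Rob Cowie, but your version (which looks great) didn't work. I feel bad for not pointing that out sooner and for not accepting the answer, but I haven't had much time to work on this project. What I have written below is all I could get working, but I am leaving this question open in case someone has something more elegant.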

import feedparser 
import sys 

blog = feedparser.parse('myblog.xml') 
sys.stdout = open('blog','w') 

for itemnumber in range(0, len(blog.entries)): 
    print blog.entries[itemnumber].content 

sys.stdout.close()

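Then I CTRL-D'ed out of the interpreter, because I couldn't figure out how to close the opened file without closing Python's standard output. Then I re-entered the interpreter, opened the file, read it, and cleaned the HTML from there. (nltk.html_clean is a typo in the online version of the NLTK book itself, by the way... it is actually nltk.clean_html.) I ended up with almost, but not quite, plaintext.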
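For reference, the entries can be written to a file without replacing sys.stdout at all, which avoids the problem of closing standard output. The following is a minimal sketch in Python 3 syntax, assuming each entry exposes a content list whose items carry a value attribute, as in the book example above. The function name dump_entries and the stand-in entries are hypothetical and exist only so the sketch runs without a real feed.

```python
import io
from types import SimpleNamespace


def dump_entries(entries, outfile):
    """Write the raw content of every entry to an already-open file object."""
    count = 0
    for entry in entries:
        # entry.content is a list of content objects; the markup is in .value
        for part in entry.content:
            outfile.write(part.value)
            outfile.write("\n")
        count += 1
    return count


# Stand-in entries shaped like feedparser's (hypothetical data for illustration).
fake_entries = [
    SimpleNamespace(content=[SimpleNamespace(value="<p>first post</p>")]),
    SimpleNamespace(content=[SimpleNamespace(value="<p>second post</p>")]),
]

# An in-memory buffer stands in for a real file opened with open('blog', 'w').
buffer = io.StringIO()
written = dump_entries(fake_entries, buffer)
```

With a real feed this would presumably become `with open('blog', 'w') as outfile: dump_entries(blog.entries, outfile)`; the with block closes the file on exit and standard output is never touched. If sys.stdout has already been reassigned, `sys.stdout = sys.__stdout__` restores the original stream.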

Source

2011-06-29 magnetar

Possible duplicate of [Extracting text from HTML file using Python](http://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python)

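@Sentinel It is not a duplicate... my question has more to do with feedparser. I know how to clean HTML, and I have shown that I can do it. I just don't know how to use feedparser to do this for every entry. – magnetar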

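import feedparser
import nltk  # clean_html exists in NLTK 2.x only; NLTK 3 raises NotImplementedError

llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")

with open('myblog.txt', 'w') as outfile:
    for entry in llog.entries:
        ## Do your processing here
        content = entry.content[0].value
        # clean_html is the real name; html_clean is a typo in the NLTK book
        tokens = nltk.word_tokenize(nltk.clean_html(content))
        # word_tokenize returns a list, so join it before writing
        outfile.write(' '.join(tokens) + '\n')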

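Fundamentally, you need to open a file, iterate over the entries (feed.entries), process each entry as required, and write an appropriate representation of it to the file.

I make no assumptions about how you want to delimit the post contents in the text file. The snippet also does not write the post title or any metadata to the file.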
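One possible way to fill in those two gaps is sketched below in Python 3 syntax. It is an illustration under stated assumptions, not the approach used in the question: it swaps NLTK's cleaner for the standard library's html.parser, writes each title on its own line, and separates posts with a blank line. The names TextExtractor, strip_tags, and write_posts are hypothetical, as are the stand-in entries, which only mimic the title and content[0].value attributes seen above.

```python
import io
from html.parser import HTMLParser
from types import SimpleNamespace


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    """Return the text of an HTML fragment with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Text nodes are concatenated as-is, then whitespace runs are collapsed.
    return " ".join("".join(parser.chunks).split())


def write_posts(entries, outfile, separator="\n\n"):
    """Write each entry as its title, a newline, its plain text, then a separator."""
    for entry in entries:
        text = strip_tags(entry.content[0].value)
        outfile.write(entry.title + "\n" + text + separator)


# Stand-in entries (hypothetical data) so the sketch runs without a feed.
sample = [
    SimpleNamespace(title="He's My BF",
                    content=[SimpleNamespace(value="<p>Today I was <b>chatting</b></p>")]),
    SimpleNamespace(title="Second",
                    content=[SimpleNamespace(value="<p>A &amp; B</p>")]),
]
out = io.StringIO()
write_posts(sample, out)
```

Because text nodes are joined without a space, adjacent block elements such as two consecutive paragraphs would run together; a real cleanup pass may want to treat block-level tags as line breaks.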

Source

2011-07-03 09:08:13

I believe you have to iterate over the entries by doing something like what is in this blog post: http://frizzletech.blogspot.com/2011/02/how-i-created-my-weekly-feed-digest.html ... you can't just write post.content[0] ... – magnetar

@magnetar; you found an error in my example. I _am_ iterating over the entries, but referencing 'post' would raise a NameError. A copy/paste error, I guess.

See Extracting text from HTML file using Python

Source

2011-06-29 19:01:55
