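Python: scraping a blog RSS feed with BeautifulSoup and writing the output to .txt files

Apologies in advance for the long block of code below. I'm new to BeautifulSoup, but I found some useful tutorials that use it to scrape the RSS feeds of blogs. Full disclosure: this code is adapted from this video tutorial, which was hugely helpful in getting this far: http://www.youtube.com/watch?v=Ap_DlSrT-iE.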
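Here is my problem: the video does a great job of showing how to print the relevant content to the console. I need to write each article's text to a separate .txt file and save it to some directory (for now I'm just trying to save to my desktop). I know the problem lies in the scope of the two for-loops near the end of the code (I commented that spot so people can find it quickly; it is the last comment, the one that begins "# Here's where I'm lost..."), but I can't seem to work it out on my own.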
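What the program currently does is take the text from the last article it reads in and write it to as many .txt files as the variable listIterator has entries. So in this case I believe 20 .txt files get written, but they all contain the text of the last article. What I want the program to do is loop over the articles and write each article's text to its own .txt file. Sorry for the verbosity, but any insight would be greatly appreciated.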
from urllib import urlopen
from bs4 import BeautifulSoup
import re
# Read in webpage.
webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read()
# On RSS Feed site, find tags for title of articles and
# tags for article links to be downloaded.
patFinderTitle = re.compile('<title>(.*)</title>')
patFinderLink = re.compile('<link rel.*href="(.*)"/>')
# Find the tags listed in variables above in the articles.
findPatTitle = re.findall(patFinderTitle, webpage)
findPatLink = re.findall(patFinderLink, webpage)
# Create a list that is the length of the number of links
# from the RSS feed page. Use this to iterate over each article,
# read it in, and find relevant text or <p> tags.
listIterator = []
listIterator[:] = range(len(findPatTitle))
for i in listIterator:
    # Print each title to console to ensure program is working.
    print findPatTitle[i]
    # Read in the linked-to article.
    articlePage = urlopen(findPatLink[i]).read()
    # Find the beginning and end of articles using tags listed below.
    divBegin = articlePage.find("<div class='story-teaser'>")
    divEnd = articlePage.find("<footer class='article-footer'>")
    # Define article variable that will contain all the content between the
    # beginning of the article to the end as indicated by variables above.
    article = articlePage[divBegin:divEnd]
    # Parse the page using BeautifulSoup
    soup = BeautifulSoup(article)
    # Compile list of all <p> tags for each article and store in paragList
    paragList = soup.findAll('p')
    # Create empty string to eventually convert items in paragList to string to
    # be written to .txt files.
    para_string = ''
    # Here's where I'm lost and have some sort of scope issue with my for-loops.
    for i in paragList:
        para_string = para_string + str(i)
    newlist = range(len(findPatTitle))
    for i in newlist:
        ofile = open(str(listIterator[i])+'.txt', 'w')
        ofile.write(para_string)
        ofile.close()
For a first question, this is great. Thanks.
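One way to get one file per article is to write the file as soon as that article's text has been built, instead of looping over every filename afterwards. The sketch below is only an illustration under some assumptions: it targets Python 3, the function name `write_articles` and the `out_dir` parameter are hypothetical, and the input is a list of already-extracted paragraph strings (a stand-in for `str(p)` of each tag in `paragList`) so that it runs without network access or BeautifulSoup.

```python
import os


def write_articles(articles, out_dir):
    """Write each article's text to its own numbered .txt file.

    `articles` is a list in which each item is the list of paragraph
    strings for one article (hypothetical stand-in for the str() of
    each <p> tag in paragList). Returns the paths that were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for index, paragraphs in enumerate(articles):
        # Build the text for this article only.
        para_string = ''.join(paragraphs)
        # Write it right away, before the next article replaces para_string.
        path = os.path.join(out_dir, str(index) + '.txt')
        with open(path, 'w') as ofile:
            ofile.write(para_string)
        paths.append(path)
    return paths
```

Applied to the script in the question, this would roughly mean dropping the final `for i in newlist:` loop and opening a single file per pass of the outer loop. The inner loop variable would also need a different name (for example `p` instead of `i`), because a Python loop variable keeps its last value after the loop ends, so reusing `i` would leave the article index pointing at the last paragraph tag.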