2013-10-27 94 views
4

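Python blog RSS feed scraping with BeautifulSoup: output to .txt files

Apologies in advance for the long block of code that follows. I'm new to BeautifulSoup, but I found some useful tutorials that use it to scrape the RSS feeds of blogs. Full disclosure: this code is adapted from the following video tutorial, which helped enormously in getting this far: http://www.youtube.com/watch?v=Ap_DlSrT-iE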

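Here's my problem: the video does a great job of showing how to print the relevant content to the console. I need to write each article's text to a separate .txt file and save it to some directory (for now I'm just trying to save to my Desktop). I know the problem lies in the scope of the two for-loops near the end of the code (I've commented this so people can spot it quickly; it's the last comment, beginning "# Here's where I'm lost..."), but I can't seem to figure it out on my own.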

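What the program currently does is take the text of the last article it reads in and write it to as many .txt files as the variable listIterator indicates. So in this case I believe 20 .txt files get written, but they all contain the text of the last article. What I want the program to do is loop through each article and output each article's text to its own .txt file. Sorry for the verbosity, but any insight would be much appreciated.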

from urllib import urlopen 
from bs4 import BeautifulSoup 
import re 

# Read in webpage. 
webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read() 

# On RSS Feed site, find tags for title of articles and 
# tags for article links to be downloaded. 

patFinderTitle = re.compile('<title>(.*)</title>') 
patFinderLink = re.compile('<link rel.*href="(.*)"/>') 

# Find the tags listed in variables above in the articles. 
findPatTitle = re.findall(patFinderTitle, webpage) 
findPatLink = re.findall(patFinderLink, webpage) 

# Create a list that is the length of the number of links 
# from the RSS feed page. Use this to iterate over each article, 
# read it in, and find relevant text or <p> tags. 
listIterator = [] 
listIterator[:] = range(len(findPatTitle)) 

for i in listIterator: 
    # Print each title to console to ensure program is working. 
    print findPatTitle[i] 

    # Read in the linked-to article. 
    articlePage = urlopen(findPatLink[i]).read() 

    # Find the beginning and end of articles using tags listed below. 
    divBegin = articlePage.find("<div class='story-teaser'>") 
    divEnd = articlePage.find("<footer class='article-footer'>") 

    # Define article variable that will contain all the content between the 
    # beginning of the article to the end as indicated by variables above. 
    article = articlePage[divBegin:divEnd] 

    # Parse the page using BeautifulSoup 
    soup = BeautifulSoup(article) 

    # Compile list of all <p> tags for each article and store in paragList 
    paragList = soup.findAll('p') 

    # Create empty string to eventually convert items in paragList to string to 
    # be written to .txt files. 
    para_string = '' 

    # Here's where I'm lost and have some sort of scope issue with my for-loops. 
    for i in paragList:
        para_string = para_string + str(i)
        newlist = range(len(findPatTitle))
        for i in newlist:
            ofile = open(str(listIterator[i])+'.txt', 'w')
            ofile.write(para_string)
            ofile.close()

For a first question, this is great. Thanks.

Answers

3

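The reason it looks as though only the last article is written is that every article is written, over and over, to all 20 files, so the last one written wins. Let's look at this part of the code: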

for i in paragList:
    para_string = para_string + str(i)
    newlist = range(len(findPatTitle))
    for i in newlist:
        ofile = open(str(listIterator[i])+'.txt', 'w')
        ofile.write(para_string)
        ofile.close()
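
The effect can be reproduced without any network access. The following is a minimal sketch, not the original program: it uses Python 3 syntax (the code in this thread is Python 2), it collapses the paragraph loop into a single write per article, and the sample strings, the `write_buggy` and `read_all` helpers, and the temporary directory are all made up for illustration.

```python
import os
import tempfile


def write_buggy(articles, directory):
    """Mimic the nested loops above: each article overwrites every file."""
    for text in articles:                 # outer loop: one pass per article
        for n in range(len(articles)):    # inner loop: rewrites ALL the files
            path = os.path.join(directory, "{0}.txt".format(n))
            with open(path, "w") as ofile:
                ofile.write(text)


def read_all(directory, count):
    """Return the contents of 0.txt .. (count-1).txt as a list."""
    contents = []
    for n in range(count):
        with open(os.path.join(directory, "{0}.txt".format(n))) as f:
            contents.append(f.read())
    return contents


articles = ["first article", "second article", "third article"]  # hypothetical data
demo_dir = tempfile.mkdtemp()  # throwaway directory for the demo
write_buggy(articles, demo_dir)
result = read_all(demo_dir, len(articles))
print(result)  # every file ends up holding the text of the last article
```

Because mode 'w' truncates a file each time it is opened, whatever the final pass of the outer loop writes is the only text left in each file.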

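You are writing para_string to the same 20 files again and again on every iteration. What you need to do instead is append each para_string to a separate list, say paraStringList, and then write its contents out to separate files, like so: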

for i, var in enumerate(paraStringList):  # enumerate yields (index, item) pairs
    with open("{0}.txt".format(i), 'w') as writer:
        writer.write(var)
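
For readers who have not met enumerate before: it pairs each item of a sequence with its position, starting at 0, and the loop above unpacks each pair into i and var. A small sketch with made-up strings (Python 3 syntax; the `para_strings` list is hypothetical and stands in for paraStringList):

```python
para_strings = ["<p>alpha</p>", "<p>beta</p>", "<p>gamma</p>"]  # hypothetical data

# enumerate pairs each item with its index, counting from 0.
pairs = list(enumerate(para_strings))

# The filename each item would be written to in the loop above.
filenames = ["{0}.txt".format(i) for i, _ in pairs]

for i, var in pairs:
    print(i, var)
```

The index is what keeps the filenames distinct, so each string ends up in a file of its own.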

Now, this needs to go outside your main loop, i.e. outside for i in listIterator:(...). Here is a working version of the program:

from urllib import urlopen 
from bs4 import BeautifulSoup 
import re 


webpage = urlopen('http://talkingpointsmemo.com/feed/livewire').read() 

patFinderTitle = re.compile('<title>(.*)</title>') 
patFinderLink = re.compile('<link rel.*href="(.*)"/>') 

# The [0:4] slices keep only the first four matches; drop them to process every entry.
findPatTitle = re.findall(patFinderTitle, webpage)[0:4]
findPatLink = re.findall(patFinderLink, webpage)[0:4]

listIterator = [] 
listIterator[:] = range(len(findPatTitle)) 
paraStringList = [] 

for i in listIterator: 

    print findPatTitle[i] 

    articlePage = urlopen(findPatLink[i]).read() 

    divBegin = articlePage.find("<div class='story-teaser'>") 
    divEnd = articlePage.find("<footer class='article-footer'>") 

    article = articlePage[divBegin:divEnd] 

    soup = BeautifulSoup(article) 

    paragList = soup.findAll('p') 

    para_string = '' 

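    # Use a separate name here so the outer loop variable i is not shadowed.
    for p in paragList:
        para_string += str(p)

    paraStringList.append(para_string)

# Write each article's text to its own file once the main loop has finished.
for i, var in enumerate(paraStringList):
    with open("{0}.txt".format(i), 'w') as writer:
        writer.write(var)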
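
The question also asks for the files to land in a particular directory, such as the desktop. One possible way to do that, sketched here under assumptions, is to build each path with os.path.join. The `save_texts` helper and the sample strings are hypothetical and not part of the answer above; the sketch uses Python 3 syntax and writes to a temporary directory rather than a real desktop.

```python
import os
import tempfile


def save_texts(texts, directory):
    """Write each string in texts to its own numbered .txt file in directory."""
    if not os.path.isdir(directory):
        os.makedirs(directory)  # create the target folder if it is missing
    paths = []
    for i, text in enumerate(texts):
        path = os.path.join(directory, "{0}.txt".format(i))
        with open(path, "w") as writer:
            writer.write(text)
        paths.append(path)
    return paths


# Demo with made-up data in a throwaway location.
target = os.path.join(tempfile.mkdtemp(), "articles")
written = save_texts(["<p>one</p>", "<p>two</p>"], target)
print(written)
```

To target the desktop, a path such as os.path.join(os.path.expanduser('~'), 'Desktop') could be passed as directory, assuming that folder exists on the machine in question.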

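I can't thank you enough for taking a look at this. I've been banging my head against the wall over it for a week. Thank you! One more question: the syntax in the last for loop is really foreign to me. Just to be clear, you're looping over paraStringList and then using enumerate to create a text file for each index in that list? Thanks again. – kylerthecreator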

Awesome. You can upvote and accept the answer.

@user2925607 You forgot to click the up-arrow button.