
Scraping multiple pages with Beautiful Soup

I want to scrape multiple pages of a single website and parse them with Beautiful Soup. So far I have tried to do this with urllib2, but I keep running into problems. What I have tried is:

import urllib2,sys 
from BeautifulSoup import BeautifulSoup 

for numb in ('85753', '87433'): 
    address = ('http://www.presidency.ucsb.edu/ws/index.php?pid=' + numb) 
html = urllib2.urlopen(address).read() 
soup = BeautifulSoup(html) 

title = soup.find("span", {"class":"paperstitle"}) 
date = soup.find("span", {"class":"docdate"}) 
span = soup.find("span", {"class":"displaytext"}) # span.string gives you the first bit 
paras = [x for x in span.findAllNext("p")] 

first = title.string 
second = date.string 
start = span.string 
middle = "\n\n".join(["".join(x.findAll(text=True)) for x in paras[:-1]]) 
last = paras[-1].contents[0] 

print "%s\n\n%s\n\n%s\n\n%s\n\n%s" % (first, second, start, middle, last) 

This only gives me the result for the second number in the numb sequence, i.e. http://www.presidency.ucsb.edu/ws/index.php?pid=87433. I have also tried mechanize, without success. Ideally, what I would like to be able to do is have a page with a list of links, automatically pick a link, pass its HTML to BeautifulSoup, and then move on to the next link in the list.
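The sort of thing I have in mind (a rough sketch only, with a placeholder index URL rather than a real one):

import urllib2 
from BeautifulSoup import BeautifulSoup 

# Placeholder index page that lists links to the documents of interest. 
index_url = 'http://www.presidency.ucsb.edu/ws/' 

index_soup = BeautifulSoup(urllib2.urlopen(index_url).read()) 

# Pull the href out of every anchor on the index page. 
links = [a['href'] for a in index_soup.findAll('a', href=True)] 

for link in links: 
    soup = BeautifulSoup(urllib2.urlopen(link).read()) 
    # ... parse title/date/text here, as in the single-page code above 

(If the hrefs turn out to be relative, they would need to be joined to the base URL with urlparse.urljoin before being opened.)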

Answers


You need to put the rest of the code inside the loop. Right now you iterate over both elements of the tuple, but at the end of the iteration only the last one is still assigned to address, and that is what gets parsed outside the loop.
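For example, here is a stripped-down version of what the posted code actually does:

for numb in ('85753', '87433'): 
    address = ('http://www.presidency.ucsb.edu/ws/index.php?pid=' + numb) 

# Only the binding from the last iteration survives the loop, so all of the 
# fetching and parsing below the loop only ever sees pid=87433. 
print address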


That was the problem. Thank you very much. – user1074057


Here is a neater solution (using lxml):

import lxml.html as lh 

root_url = 'http://www.presidency.ucsb.edu/ws/index.php?pid=' 
page_ids = ['85753', '87433'] 

def scrape_page(page_id): 
    url = root_url + page_id 
    # lxml.html.parse accepts a URL directly and fetches the page itself 
    tree = lh.parse(url) 

    title = tree.xpath("//span[@class='paperstitle']")[0].text 
    date = tree.xpath("//span[@class='docdate']")[0].text 
    text = tree.xpath("//span[@class='displaytext']")[0].text_content() 

    return title, date, text 

if __name__ == '__main__': 
    for page_id in page_ids: 
        title, date, text = scrape_page(page_id) 
        # do something with the results, e.g. print them 
        print '%s\n%s\n\n%s' % (title, date, text) 

Thanks. I actually like this better than the BeautifulSoup approach. – user1074057


I like this solution. How would you go about saving the pages you are scraping? – Joe


@Joe It should be as simple as the third example here: http://docs.python.org/2/library/csv.html#examples – Acorn
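A rough sketch along those lines (the output filename is arbitrary, and it assumes the scrape_page function and page_ids list from the answer above):

import csv 

with open('speeches.csv', 'wb') as f: 
    writer = csv.writer(f) 
    writer.writerow(['title', 'date', 'text']) 
    for page_id in page_ids: 
        title, date, text = scrape_page(page_id) 
        # the csv module in Python 2 expects byte strings, so encode the unicode 
        writer.writerow([title.encode('utf-8'), date.encode('utf-8'), text.encode('utf-8')])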


I think you are missing the indentation inside the loop:

import urllib2,sys 
from BeautifulSoup import BeautifulSoup 

for numb in ('85753', '87433'): 
    address = ('http://www.presidency.ucsb.edu/ws/index.php?pid=' + numb) 
    html = urllib2.urlopen(address).read() 
    soup = BeautifulSoup(html) 

    title = soup.find("span", {"class":"paperstitle"}) 
    date = soup.find("span", {"class":"docdate"}) 
    span = soup.find("span", {"class":"displaytext"}) # span.string gives you the first bit 
    paras = [x for x in span.findAllNext("p")] 

    first = title.string 
    second = date.string 
    start = span.string 
    middle = "\n\n".join(["".join(x.findAll(text=True)) for x in paras[:-1]]) 
    last = paras[-1].contents[0] 

    print "%s\n\n%s\n\n%s\n\n%s\n\n%s" % (first, second, start, middle, last) 

I think this should fix the problem..


That was the problem. The answer above pointed it out as well. Thanks very much for your help. – user1074057


Thumbs up.. :) –