
I want to scrape multiple pages of a single site and parse them with Beautiful Soup. So far I have tried to do this with urllib2, but I keep running into problems. Here is what I have tried:

import urllib2,sys 
from BeautifulSoup import BeautifulSoup 

for numb in ('85753', '87433'): 
    address = ('http://www.presidency.ucsb.edu/ws/index.php?pid=' + numb) 
html = urllib2.urlopen(address).read() 
soup = BeautifulSoup(html) 

title = soup.find("span", {"class":"paperstitle"}) 
date = soup.find("span", {"class":"docdate"}) 
span = soup.find("span", {"class":"displaytext"}) # span.string gives you the first bit 
paras = [x for x in span.findAllNext("p")] 

first = title.string 
second = date.string 
start = span.string 
middle = "\n\n".join(["".join(x.findAll(text=True)) for x in paras[:-1]]) 
last = paras[-1].contents[0] 

print "%s\n\n%s\n\n%s\n\n%s\n\n%s" % (first, second, start, middle, last) 

This only gives me the result for the second number in the numb sequence, i.e. http://www.presidency.ucsb.edu/ws/index.php?pid=87433. I have also tried mechanize, without success. Ideally, what I would like to do is have a page with a list of links, automatically select one, pass its HTML to BeautifulSoup, and then move on to the next link in the list.

Answers


You need to put the rest of the code inside the loop. Right now you iterate over both elements of the tuple, but at the end of the iteration only the last one is still assigned to address, which is then parsed outside the loop.
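A minimal sketch of the scoping issue, using the same variable names as the question (illustration only):

# Only the loop body runs once per iteration; everything after it runs
# a single time, after the loop, with address bound to the last value.
for numb in ('85753', '87433'):
    address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=' + numb

print address  # -> ...pid=87433, so only the final page was ever parsed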

That was the problem. Thank you very much. – user1074057


Here is a neater solution, using lxml:

import lxml.html as lh 

root_url = 'http://www.presidency.ucsb.edu/ws/index.php?pid=' 
page_ids = ['85753', '87433'] 

def scrape_page(page_id): 
    url = root_url + page_id 
    tree = lh.parse(url) 

    title = tree.xpath("//span[@class='paperstitle']")[0].text 
    date = tree.xpath("//span[@class='docdate']")[0].text 
    text = tree.xpath("//span[@class='displaytext']")[0].text_content() 

    return title, date, text 

if __name__ == '__main__': 
    for page_id in page_ids: 
     title, date, text = scrape_page(page_id) 
Thanks. I actually like this better than the BeautifulSoup approach. – user1074057

I like this solution. How would you go about saving the pages you are scraping? – Joe

@Joe It should be as simple as the third example here: http://docs.python.org/2/library/csv.html#examples – Acorn
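Following up on that pointer, a minimal sketch of saving the scraped fields with the csv module; it reuses scrape_page and page_ids from the answer above, and the filename and column order are just illustrative:

import csv

# Assumes scrape_page() and page_ids from the lxml answer above.
with open('papers.csv', 'wb') as f:  # 'wb' for Python 2's csv module
    writer = csv.writer(f)
    writer.writerow(['title', 'date', 'text'])
    for page_id in page_ids:
        title, date, text = scrape_page(page_id)
        writer.writerow([title.encode('utf-8'),
                         date.encode('utf-8'),
                         text.encode('utf-8')])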


I think you are missing the indentation inside the loop:

import urllib2,sys 
from BeautifulSoup import BeautifulSoup 

for numb in ('85753', '87433'): 
    address = ('http://www.presidency.ucsb.edu/ws/index.php?pid=' + numb) 
    html = urllib2.urlopen(address).read() 
    soup = BeautifulSoup(html) 

    title = soup.find("span", {"class":"paperstitle"}) 
    date = soup.find("span", {"class":"docdate"}) 
    span = soup.find("span", {"class":"displaytext"}) # span.string gives you the first bit 
    paras = [x for x in span.findAllNext("p")] 

    first = title.string 
    second = date.string 
    start = span.string 
    middle = "\n\n".join(["".join(x.findAll(text=True)) for x in paras[:-1]]) 
    last = paras[-1].contents[0] 

    print "%s\n\n%s\n\n%s\n\n%s\n\n%s" % (first, second, start, middle, last) 

I think this should fix the problem.
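The question also mentions wanting to start from a page that lists links and follow each one; a rough, untested sketch of that idea with BeautifulSoup (the index URL and the link filtering are placeholders):

import urllib2
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup

index_url = 'http://www.presidency.ucsb.edu/ws/index.php'  # hypothetical index page

index_soup = BeautifulSoup(urllib2.urlopen(index_url).read())

# Collect every link on the index page and make it absolute;
# filter this list down to the documents you actually want.
links = [urljoin(index_url, a['href'])
         for a in index_soup.findAll('a', href=True)]

for link in links:
    soup = BeautifulSoup(urllib2.urlopen(link).read())
    # ... parse title, date and text here, as in the code above ...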

That was the problem; the answer above pointed it out as well. Thanks very much for your help. – user1074057

Thumbs up.. :) –