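Scraping and parsing multiple pages with Beautiful Soup
I want to scrape multiple pages of a single website and parse them with BeautifulSoup. So far I have tried using urllib2 to do this, but I keep running into problems. What I have tried is: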
import urllib2,sys
from BeautifulSoup import BeautifulSoup
for numb in ('85753', '87433'):
    address = ('http://www.presidency.ucsb.edu/ws/index.php?pid=' + numb)
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)
title = soup.find("span", {"class":"paperstitle"})
date = soup.find("span", {"class":"docdate"})
span = soup.find("span", {"class":"displaytext"}) # span.string gives you the first bit
paras = [x for x in span.findAllNext("p")]
first = title.string
second = date.string
start = span.string
middle = "\n\n".join(["".join(x.findAll(text=True)) for x in paras[:-1]])
last = paras[-1].contents[0]
print "%s\n\n%s\n\n%s\n\n%s\n\n%s" % (first, second, start, middle, last)
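This only gives me results for the second number in the numb sequence, i.e. http://www.presidency.ucsb.edu/ws/index.php?pid=87433. I have also tried using mechanize, but without success. Ideally, what I would like to be able to do is have a page with a list of links, then automatically select a link, pass its HTML to BeautifulSoup, and then move on to the next link in the list.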
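The workflow described above (take a list of links, fetch each one, hand the HTML to a parser, move on to the next) might be sketched roughly as follows. This is only an illustration under some assumptions: it uses Python 3's urllib.request rather than urllib2, and the names BASE_URL, fetch_html, scrape_pages and the parse argument are hypothetical, not part of the original script. The point it tries to show is that every per-page step sits inside the loop body; anything placed after the loop would run only once, on the last address.

```python
from urllib.request import urlopen

# Assumption: the same URL pattern as in the question.
BASE_URL = "http://www.presidency.ucsb.edu/ws/index.php?pid="


def fetch_html(url):
    """Download one page and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def scrape_pages(pids, parse, fetch=fetch_html):
    """Fetch and parse each page in turn, returning one result per pid.

    `parse` is any function that takes HTML and returns the extracted data,
    for example the BeautifulSoup calls from the question wrapped in a function.
    `fetch` can be swapped out, e.g. for a stub that reads saved pages.
    """
    results = []
    for pid in pids:
        address = BASE_URL + pid
        html = fetch(address)        # inside the loop: runs once per pid
        results.append(parse(html))  # inside the loop: one result per page
    return results
```

A call such as scrape_pages(['85753', '87433'], parse=my_parser) would then return a list with one entry per page, where my_parser stands for whatever function builds the soup and pulls out the title, date and paragraphs.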
That was the problem. Thank you very much. – user1074057