BeautifulSoup does not parse the whole page content
I want to get a set of URLs (web pages) from The New York Times, but I get a different answer. I believe I gave the correct class, yet it extracts a different class. My ny_url.txt contains "http://query.nytimes.com/search/sitesearch/?action=click&region=Masthead&pgtype=SectionFront&module=SearchSubmit&contentCollection=us&t=qry900#/isis; http://query.nytimes.com/search/sitesearch/?action=click&region=Masthead&pgtype=SectionFront&module=SearchSubmit&contentCollection=us&t=qry900#/isis/since1851/allresults/2/".
Here is my code:
import urllib2
import urllib
from cookielib import CookieJar
from bs4 import BeautifulSoup
cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
text_file = open('ny_url.txt', 'r')
for line in text_file:
    print line
    soup = BeautifulSoup(opener.open(line))
    links = soup.find_all('div', attrs={'class': 'element2'})
    for href in links:
        print href
I am expecting this result: "http://topics.nytimes.com/top/reference/timestopics/organizations/i/isis/index.html?8qa, http://www.nytimes.com/2014/08/25/world/middleeast/isis-militants-capture-air-base-from-syrian-government-forces.html" – 2014-09-26 20:35:43
The line may contain a '\n' character. Try 'opener.open(line[:-1])' – user3557327 2014-09-26 20:35:53
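A minimal sketch of the newline fix suggested above. It swaps `line[:-1]` for `line.strip()`, which also handles '\r\n' endings and a last line with no trailing newline (where `line[:-1]` would cut off the final character of the URL). `clean_urls` and the example.com URLs are hypothetical names used only for illustration.

```python
def clean_urls(lines):
    """Return the non-empty lines with surrounding whitespace removed."""
    cleaned = []
    for line in lines:
        url = line.strip()  # drops '\n', '\r\n', and stray spaces
        if url:  # skip blank lines
            cleaned.append(url)
    return cleaned


# Hypothetical sample input standing in for the lines of ny_url.txt.
sample = ["http://example.com/a\n", "\n", "http://example.com/b"]
print(clean_urls(sample))
```

In the loop from the question, this would amount to calling `opener.open(line.strip())` instead of `opener.open(line)`.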
This is the result I get after editing as you suggested... "http://query.nytimes.com/search/sitesearch/?action=click&region=Masthead&pgtype=SectionFront&module=SearchSubmit&contentCollection=us&t=qry900#/isis
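One thing that may be worth checking, offered as a guess rather than a confirmed diagnosis: everything after `#` in these URLs is a fragment, and a fragment is normally not sent to the server with the HTTP request. If that is what is happening here, both lines in ny_url.txt would fetch the same base page, and the search results (assumed to be filled in by JavaScript in the browser) would never appear in the HTML that BeautifulSoup receives. The sketch below only splits one of the URLs from the question to show which part is the fragment; it makes no network request.

```python
try:
    from urllib.parse import urldefrag  # Python 3
except ImportError:
    from urlparse import urldefrag  # Python 2, as used in the question

url = (
    "http://query.nytimes.com/search/sitesearch/?action=click&region=Masthead"
    "&pgtype=SectionFront&module=SearchSubmit&contentCollection=us&t=qry900"
    "#/isis/since1851/allresults/2/"
)

# urldefrag returns the URL without its fragment, plus the fragment itself.
base, fragment = urldefrag(url)
print(base)      # the part that identifies the resource on the server
print(fragment)  # the part that is interpreted on the client side
```

If the printed base is identical for both URLs in the file, a `find_all` on the fetched HTML cannot return different results for each of them.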