2016-12-14 67 views
0

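Python - Scraping a paginated site and writing the results to a file

I am a complete programming beginner, so please forgive me if I do not express my problem very well. I am trying to write a script that goes through a series of news pages and records the article titles and their links. I have managed to do that for the first page; the problem is getting the content of the subsequent pages. By searching Stack Overflow, I think I found a way to make the script visit more than one URL, but it seems to overwrite the content extracted from each page it visits, so I always end up with the same number of recorded articles in the final file. Something that might help: I know the URLs follow the pattern "/ultimas/?page=1", "/ultimas/?page=2", etc., and the site appears to use AJAX to request new articles.

Here is my code: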

import csv 
import requests 
from bs4 import BeautifulSoup as Soup 
import urllib 
r = base_url = "http://agenciabrasil.ebc.com.br/" 
program_url = base_url + "/ultimas/?page=" 

for page in range(1, 4): 
    url = "%s%d" % (program_url, page) 
    soup = Soup(urllib.urlopen(url)) 



letters = soup.find_all("div", class_="titulo-noticia") 

letters[0] 

lobbying = {} 
for element in letters: 
    lobbying[element.a.get_text()] = {} 

letters[0].a["href"] 
prefix = "http://agenciabrasil.ebc.com.br" 

for element in letters: 
    lobbying[element.a.get_text()]["link"] = prefix + element.a["href"] 



for item in lobbying.keys(): 
    print item + ": " + "\n\t" + "link: " + lobbying[item]["link"] + "\n\t" 

import os, csv 
os.chdir("...") 

with open("lobbying.csv", "w") as toWrite: 
    writer = csv.writer(toWrite, delimiter=",") 
    writer.writerow(["name", "link",]) 
    for a in lobbying.keys(): 
     writer.writerow([a.encode("utf-8"), lobbying[a]["link"]]) 

     import json 

with open("lobbying.json", "w") as writeJSON: 
    json.dump(lobbying, writeJSON) 

print "Fim" 

How can I add the content of each page to the final file? Any help would be greatly appreciated. Thank you!


It might also be a good idea to take a look at tools like scrapy (https://scrapy.org/) – intelis


My problem has been solved by another poster, but I will look into it anyway. Thanks for the suggestion! – Maldoror

Answers

0

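It looks like the code that should run inside your loop (for page in range(1, 4):) is not being executed for each page, because your file is not indented properly: only the two indented lines belong to the loop, so everything after it sees just the last page.

If you tidy up your code, it works: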

# Python 2 script, like the code in the question.
import csv, requests, os, json, urllib
from bs4 import BeautifulSoup as Soup

base_url = "http://agenciabrasil.ebc.com.br"
program_url = base_url + "/ultimas/?page="
prefix = "http://agenciabrasil.ebc.com.br"

# Create the dictionary once, before the loop. Creating it inside the loop
# would discard earlier pages and leave only the last page in the files.
lobbying = {}

for page in range(1, 4):
    url = "%s%d" % (program_url, page)
    soup = Soup(urllib.urlopen(url))

    # Everything indented here runs once per page.
    letters = soup.find_all("div", class_="titulo-noticia")

    for element in letters:
        lobbying[element.a.get_text()] = {}

    for element in letters:
        lobbying[element.a.get_text()]["link"] = prefix + element.a["href"]

# Print and write only after every page has been collected.
for item in lobbying.keys():
    print item + ": " + "\n\t" + "link: " + lobbying[item]["link"] + "\n\t"

#os.chdir("...")

with open("lobbying.csv", "w") as toWrite:
    writer = csv.writer(toWrite, delimiter=",")
    writer.writerow(["name", "link",])
    for a in lobbying.keys():
        writer.writerow([a.encode("utf-8"), lobbying[a]["link"]])

with open("lobbying.json", "w") as writeJSON:
    json.dump(lobbying, writeJSON)

print "Fim"
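
The point about collecting results across pages can be checked without touching the network. The following is only a minimal Python 3 sketch: scrape_all, fake_fetch and FAKE_SITE are hypothetical names made up for illustration, and fake_fetch stands in for the real download-and-parse step. It shows the container being created once before the loop, so each page adds to it instead of replacing it.

```python
def scrape_all(pages, fetch_titles):
    """Collect {title: {"link": url}} across every page number in `pages`."""
    collected = {}  # created once, before the loop, so nothing is overwritten
    for page in pages:
        for title, link in fetch_titles(page):
            collected[title] = {"link": link}
    return collected


# Hypothetical stand-in for the real site: page number -> (title, href) pairs.
FAKE_SITE = {
    1: [("Article A", "/a")],
    2: [("Article B", "/b")],
    3: [("Article C", "/c")],
}


def fake_fetch(page):
    """Pretend to download and parse one listing page."""
    return FAKE_SITE[page]


result = scrape_all(range(1, 4), fake_fetch)
print(len(result))  # one entry per article, from all three pages
```

In a real script, fake_fetch would be replaced by a function that downloads the page and returns the (title, href) pairs found in it.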

Man, thank you so much, it works just like I wanted. I am still getting used to the discipline you need when coding. – Maldoror


Practice makes perfect! Python is a great scripting language. Happy coding! – Ryan

1

How about this, if it serves the same purpose:

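# Python 3 script (open() with newline='' is Python 3 only).
import csv, requests
from lxml import html

base_url = "http://agenciabrasil.ebc.com.br"
program_url = base_url + "/ultimas/?page={0}"

# Open the output file once, outside the loop, so rows from every page end up
# in the same file; the with block closes it when the loop is done.
with open('scraped_data.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["Caption", "Link"])
    for url in [program_url.format(page) for page in range(1, 4)]:
        response = requests.get(url)
        tree = html.fromstring(response.text)
        # Each article is a <div class="noticia">; its title link sits inside
        # a <span class="field-content">.
        for title in tree.xpath("//div[@class='noticia']"):
            caption = title.xpath('.//span[@class="field-content"]/a/text()')[0]
            link = title.xpath('.//span[@class="field-content"]/a/@href')[0]
            writer.writerow([caption, base_url + link])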
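
The scripts in this thread build their URLs by gluing strings together, which makes it easy to end up with a doubled or missing slash. As a possible alternative, here is a small Python 3 sketch using the standard library's urllib.parse. The helper names page_url and absolute_link, and the example href, are assumptions made up for illustration; only the base URL and the "/ultimas/?page=N" pattern come from the question.

```python
from urllib.parse import urljoin, urlencode

BASE_URL = "http://agenciabrasil.ebc.com.br/"


def page_url(page):
    """Return the listing URL for a given page number."""
    return urljoin(BASE_URL, "/ultimas/?" + urlencode({"page": page}))


def absolute_link(href):
    """Turn a site-relative href taken from the listing into a full URL."""
    return urljoin(BASE_URL, href)


page_urls = [page_url(n) for n in range(1, 4)]
example = absolute_link("/geral/noticia/exemplo")  # hypothetical href

print(page_urls[0])
print(example)
```

urljoin copes with a trailing slash on the base and a leading slash on the path, so the result does not depend on remembering which side carries the slash.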

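Oh, I ended up making so many changes to this script that by now it is almost unrecognizable. But thanks for your input! Your approach does seem more efficient. I will try it later. – Maldoror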