Automatically scraping multiple web pages with a known URL format

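I'm having trouble scraping hit-parade lists. A website has one hit list per year, each at its own URL. The URL contains the year, so I would like to create one CSV file per year containing that year's hit list.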

Unfortunately I can't get it to loop through the years, and I get the following error:

ValueError: unknown url type: 'h'

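Here is the code I'm trying to use. I apologize if there are simple mistakes, but I'm new to Python and couldn't find anything on the forums that I could adapt to this case.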

import urllib 
import urllib.request 
from bs4 import BeautifulSoup 
from urllib.request import urlopen as uReq 
years = list(range(1947,2016)) 

for year in years: 
    my_urls = ('http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm') 
    my_url = my_urls[0] 
    for my_url in my_urls: 
     uClient = uReq(my_url) 
     html_input = uClient.read() 
     uClient.close() 
     page_soup = BeautifulSoup(html_input, "html.parser") 
     container = page_soup.findAll("li") 
     filename = "singoli" + str(year) + ".csv" 
     f = open(singoli + str(year), "w") 
     headers = "lista" 
     f.write(headers) 
     lista = container.text 
     print("lista: " + lista) 
     f.write(lista + "\n") 
     f.close()


2017-09-07 Davide Rossi

Sorry. I just noticed that I pasted an old version of the code, which contained a simple mistake: I wrote lista = container.text instead of lista = container[0].text.

You can use the 'edit' button (https://stackoverflow.com/posts/46100207/edit) to change your question.

Thanks. I couldn't find it on the question, only on the comments.

You think you are defining a tuple with ('http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm'), but you are only defining a plain string; the parentheses alone do not create a tuple.

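So you are looping over a string, and the loop goes letter by letter instead of URL by URL. The first "URL" passed to urlopen is therefore the single character 'h', which is exactly what the ValueError reports.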

When you want to define a single-element tuple, you have to make that explicit with a trailing comma, for example: ("foo",).

Fix:

my_urls = ('http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm',)

Reference:

A special problem is the construction of tuples containing 0 or 1 items: the syntax has some extra quirks to accommodate these. Empty tuples are constructed by an empty pair of parentheses; a tuple with one item is constructed by following a value with a comma (it is not sufficient to enclose a single value in parentheses). Ugly, but effective.
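The difference is easy to check without making any request. The following is a minimal sketch, not part of the original answer; the variable names as_string and as_tuple are made up for illustration, and the URL pieces are copied from the question.

```python
# Sketch only: no network access, just a look at what each loop would see.
year = 1947

# Parentheses alone: this is still a plain str.
as_string = ('http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm')

# Trailing comma: this is a tuple holding one str.
as_tuple = ('http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm',)

# The first item a for loop would receive in each case.
first_from_string = next(iter(as_string))
first_from_tuple = next(iter(as_tuple))

print(type(as_string).__name__, repr(first_from_string))
print(type(as_tuple).__name__, repr(first_from_tuple))
```

Since there is only one URL per year, the inner loop could arguably be dropped altogether and the string passed to urlopen directly.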


2017-09-07 15:37:34 Arount

Thank you!!!!! It's working now! I just have one problem: I don't understand why it only works for 1947-1954 and not for the later years...

It looks like it can't encode the \x85 character.... What can I do?

Try this. Hopefully it solves the problem:

import csv
import urllib.request
from bs4 import BeautifulSoup

# One CSV for all years, explicitly encoded as UTF-8
with open("hitparade.csv", "w", newline='', encoding='utf8') as outfile:
    writer = csv.writer(outfile)

    for year in range(1947, 2016):
        url = 'http://www.hitparadeitalia.it/hp_yends/hpe' + str(year) + '.htm'
        html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(html, "lxml")  # requires the lxml package
        # Remove <script> elements so their contents are not part of the text
        for scr in soup('script'):
            scr.extract()
        # One CSV row per list item
        for container in soup.select(".li1,.liy,li"):
            writer.writerow([container.text.strip()])
            print("lista: " + container.text.strip())
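The thread does not say why the \x85 character failed, so the following is only a sketch under an assumption: that the original script wrote its file with a Windows default encoding such as cp1252, which has no mapping for that character. The sample string and file name are made up for illustration.

```python
import os
import tempfile

# Made-up sample text containing the character named in the error message.
sample = "titolo\x85"

# Assumption: the failing run used a legacy default encoding like cp1252.
try:
    sample.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Writing with an explicit UTF-8 encoding, as the script above does.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "w", newline="", encoding="utf8") as f:
    f.write(sample + "\n")

# Read the file back to confirm the character survived the round trip.
with open(path, newline="", encoding="utf8") as f:
    round_trip = f.read()

print("cp1252 can encode it:", cp1252_ok)
print("UTF-8 round trip intact:", round_trip == sample + "\n")
```

If that assumption holds, passing encoding='utf8' to open is the part of the script above that makes the later years work.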


2017-09-08 09:27:23 SIM

Have you tried this script? – SIM
