
Scraping multiple URLs with Beautiful Soup

I'm trying to extract a specific class from multiple URLs. The tag and class stay the same, but I need my Python program to scrape all of them as I feed in my links.

Here is a sample of what I'm working with:

from bs4 import BeautifulSoup 
import requests 
import pprint 
import re 
import pyperclip 

url = input('insert URL here: ') 
#scrape elements 
response = requests.get(url) 
soup = BeautifulSoup(response.content, "html.parser") 

#print titles only 
h1 = soup.find("h1", class_= "class-headline") 
print(h1.get_text()) 

This works for a single URL, but not for a batch. Thanks for any help. I've learned a lot from this community.

Answers


Keep a list of URLs and iterate over it.

from bs4 import BeautifulSoup 
import requests 
import pprint 
import re 
import pyperclip 

urls = ['www.website1.com', 'www.website2.com', 'www.website3.com', .....] 
#scrape elements 
for url in urls: 
    response = requests.get(url) 
    soup = BeautifulSoup(response.content, "html.parser") 

    #print titles only 
    h1 = soup.find("h1", class_= "class-headline") 
    print(h1.get_text()) 

If you want to scrape links in batches, prompting the user for input for each website, it can be done like this:

from bs4 import BeautifulSoup 
import requests 
import pprint 
import re 
import pyperclip 

msg = 'Enter URL, or type q and hit enter to exit: '
url = input(msg)
while url != 'q':
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    #print titles only
    h1 = soup.find("h1", class_="class-headline")
    print(h1.get_text())
    url = input(msg)  # re-prompt; without this assignment the loop never sees a new URL

I get this error: Traceback (most recent call last): File "/Users/Computer/Desktop/test.py", line 7, in url = input['https://website.com/link1', 'https://website.com/link2'] TypeError: 'builtin_function_or_method' object is not subscriptable –


Do you intend to take each URL as input from the user? If not, simply put all the URLs in a list, as shown in my answer. Don't put a list inside the input method. – falloutcoder


I was thinking of user input separated by lines? –
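If the idea is to let the user paste several URLs into a single prompt (my reading of the comment above), the input string can simply be split. `parse_urls` below is a hypothetical helper, not from either answer:

```python
def parse_urls(raw):
    """Split pasted input into a list of URLs (whitespace- or comma-separated)."""
    # Turn commas into spaces, then split on any whitespace (spaces or newlines)
    return raw.replace(',', ' ').split()

print(parse_urls('www.website1.com, www.website2.com www.website3.com'))
# ['www.website1.com', 'www.website2.com', 'www.website3.com']
```

You would feed it the string returned by `input(msg)` and then loop over the resulting list exactly as in the first answer.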


Specify a batch size and iterate over the chunks.

from bs4 import BeautifulSoup 
import requests 
import pprint 
import re 
import pyperclip 

batch_size = 5 
urllist = ["url1", "url2", "url3", .....] 
url_chunks = [urllist[x:x+batch_size] for x in range(0, len(urllist), batch_size)] 

def scrape_url(url): 
    response = requests.get(url) 
    soup = BeautifulSoup(response.content, "html.parser") 
    h1 = soup.find("h1", class_= "class-headline") 
    return h1.get_text() 

def scrape_batch(url_chunk): 
    chunk_resp = [] 
    for url in url_chunk: 
        chunk_resp.append(scrape_url(url)) 
    return chunk_resp 

for url_chunk in url_chunks: 
    print(scrape_batch(url_chunk)) 

If I wanted to space the requests 10 seconds apart for each URL, how could I do that? Also, I'm not familiar with URL chunks; what is their purpose? – ColeWorld


To space out the requests, import time and use time.sleep(10) in the scrape_url function. url_chunks is a variable: a Python list containing lists of URLs. For example: [['www.website1.com', 'www.website2.com'], ['www.website3.com', 'www.website3.com']] –
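The spacing trick from the comment can be sketched as a small wrapper. `scrape_with_delay` and its `fetch` parameter are assumed names (you would pass the answer's `scrape_url` as `fetch`):

```python
import time

def scrape_with_delay(urls, fetch, delay=10):
    """Call fetch(url) for each URL, sleeping `delay` seconds between requests."""
    results = []
    for i, url in enumerate(urls):
        if i:  # no need to sleep before the very first request
            time.sleep(delay)
        results.append(fetch(url))
    return results
```

With `delay=10` this matches the 10-second spacing asked about; it sits one level above `scrape_url`, so the scraping logic itself stays unchanged.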