0

How can I reduce the time spent making requests when scraping web pages?

link=['http://youtube.com/watch?v=JfLt7ia_mLg',
'http://youtube.com/watch?v=RiYRxPWQnbE',
'http://youtube.com/watch?v=tC7pBOPgqic',
'http://youtube.com/watch?v=3EXl9xl8yOk',
'http://youtube.com/watch?v=3vb1yIBXjlM',
'http://youtube.com/watch?v=8UBY0N9fWtk',
'http://youtube.com/watch?v=uRPf9uDplD8',
'http://youtube.com/watch?v=Coattwt5iyg',
'http://youtube.com/watch?v=WaprDDYFpjE',
'http://youtube.com/watch?v=Pm5B-iRlZfI',
'http://youtube.com/watch?v=op3hW7tSYCE',
'http://youtube.com/watch?v=ogYN9bbU8bs',
'http://youtube.com/watch?v=ObF8Wz4X4Jg',
'http://youtube.com/watch?v=x1el0wiePt4',
'http://youtube.com/watch?v=kkeMYeAIcXg',
'http://youtube.com/watch?v=zUdfNvqmTOY',
'http://youtube.com/watch?v=0ONtIsEaTGE',
'http://youtube.com/watch?v=7QedW6FcHgQ',
'http://youtube.com/watch?v=Sb33c9e1XbY']

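I have a list of the 15-20 links from the first page of YouTube search results. My task now is to get the likes, dislikes, and view count from each video URL, and this is what I have done so far: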

import threading
import time

import bs4
import requests

def parse(url, i, arr):
    req = requests.get(url)
    soup = bs4.BeautifulSoup(req.text, "lxml")  # or 'html5lib'
    # Each lookup falls back to 0 when the element is missing or not numeric
    try:
        likes = int(soup.find("button", attrs={"title": "I like this"}).getText().replace(",", ""))
    except (AttributeError, ValueError):
        likes = 0
    try:
        dislikes = int(soup.find("button", attrs={"title": "I dislike this"}).getText().replace(",", ""))
    except (AttributeError, ValueError):
        dislikes = 0
    try:
        view = int(soup.find("div", attrs={"class": "watch-view-count"}).getText().split()[0].replace(",", ""))
    except (AttributeError, ValueError, IndexError):
        view = 0
    arr[i] = (likes, dislikes, view, url)
    time.sleep(0.3)

def parse_list(link):
    arr = len(link) * [0]
    threadarr = len(link) * [0]
    a = time.perf_counter()  # time.clock() was removed in Python 3.8
    # Start one thread per URL, then wait for all of them to finish
    for i in range(len(link)):
        threadarr[i] = threading.Thread(target=parse, args=(link[i], i, arr))
        threadarr[i].start()
    for i in range(len(link)):
        threadarr[i].join()
    print(time.perf_counter() - a)
    return arr

arr = parse_list(link)

Right now this takes about 6 seconds. Is there a faster way to fill my results array (arr), so that it takes less than 6 seconds?

The first 4 elements of my array look like this, to give you a rough idea:

[(105, 11, 2836, 'http://youtube.com/watch?v=JfLt7ia_mLg'), 
(32, 18, 5420, 'http://youtube.com/watch?v=RiYRxPWQnbE'), 
(45, 3, 7988, 'http://youtube.com/watch?v=tC7pBOPgqic'), 
(106, 38, 4968, 'http://youtube.com/watch?v=3EXl9xl8yOk')] 

Thanks in advance :) 
+2

If your code works but you are looking for improvements, you should ask your question on Code Review (https://codereview.stackexchange.com/) – Andersson

Answers

1

I would use a multiprocessing Pool object for this particular case.

import requests
import bs4
from multiprocessing import Pool, cpu_count


links = [
    'http://youtube.com/watch?v=JfLt7ia_mLg',
    'http://youtube.com/watch?v=RiYRxPWQnbE',
    'http://youtube.com/watch?v=tC7pBOPgqic',
    'http://youtube.com/watch?v=3EXl9xl8yOk'
]

def parse_url(url):
    req = requests.get(url)
    soup = bs4.BeautifulSoup(req.text, "lxml")  # or 'html5lib'
    # Each lookup falls back to 0 when the element is missing or not numeric
    try:
        likes = int(soup.find("button", attrs={"title": "I like this"}).getText().replace(",", ""))
    except (AttributeError, ValueError):
        likes = 0
    try:
        dislikes = int(soup.find("button", attrs={"title": "I dislike this"}).getText().replace(",", ""))
    except (AttributeError, ValueError):
        dislikes = 0
    try:
        view = int(soup.find("div", attrs={"class": "watch-view-count"}).getText().split()[0].replace(",", ""))
    except (AttributeError, ValueError, IndexError):
        view = 0
    return (likes, dislikes, view, url)

if __name__ == "__main__":  # guard needed where worker processes are spawned (e.g. Windows)
    # cpu_count must be called; passing the function itself raises a TypeError
    with Pool(cpu_count()) as pool:  # number of processes
        data = pool.map(parse_url, links)  # this is where your results are

This is cleaner, since there is only one function to write, and the results are exactly the same.
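
Since the work here is mostly waiting on the network rather than using the CPU, a thread pool might serve about as well as a process pool, with less start-up overhead. The following is only a minimal sketch of that alternative, not the approach used above: `fetch_all` and `fake_parse` are hypothetical names introduced for illustration, and `fake_parse` stands in for `parse_url` so the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=20):
    """Apply fetch to every URL concurrently; results keep the input order."""
    if not urls:
        return []
    # Never start more threads than there are URLs to process
    workers = min(max_workers, len(urls))
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # executor.map yields results in the same order as urls, like pool.map
        return list(executor.map(fetch, urls))


def fake_parse(url):
    # Hypothetical stand-in for parse_url so the sketch runs offline
    return (0, 0, 0, url)


if __name__ == "__main__":
    sample = [
        'http://youtube.com/watch?v=JfLt7ia_mLg',
        'http://youtube.com/watch?v=RiYRxPWQnbE',
    ]
    print(fetch_all(sample, fake_parse))
```

In real use you would pass `parse_url` instead of `fake_parse`. Whether this actually beats 6 seconds depends on the network and on how long the HTML parsing takes, so it is worth timing both variants.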

+0

Error: TypeError: '<' not supported between instances of 'method' and 'int' –

0

This is not a full solution, but it spares your script the try/except blocks, which certainly play some part in slowing the operation down.

import requests
from bs4 import BeautifulSoup

# links: the list of video URLs (named link in the question)
for url in links:
    response = requests.get(url).text
    soup = BeautifulSoup(response, "html.parser")
    for item in soup.select("div#watch-header"):
        view = item.select("div.watch-view-count")[0].text
        likes = item.select("button[title~='like'] span.yt-uix-button-content")[0].text
        dislikes = item.select("button[title~='dislike'] span.yt-uix-button-content")[0].text
        print(view, likes, dislikes)
+0

The try/except is somewhat necessary in my program, because some videos have the display of likes and dislikes disabled, etc. –

+0

But the links you provided above work fine without them. I tested it.. – SIM

+0

Hmm, try this one: "https://www.youtube.com/watch?v=frw6uu3nonQ" –
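
One way to reconcile the two positions in this thread (no try/except, yet tolerant of videos that hide their counts) could be a small helper that converts the text of an element to an integer and falls back to a default when the element is absent. This is only a sketch: `parse_count` is a hypothetical name, and it assumes the counts appear as text such as "2,836 views".

```python
import re


def parse_count(text, default=0):
    """Return the first integer found in text such as '2,836 views'.

    Falls back to default when text is None, empty, or has no digits.
    """
    if not text:
        return default
    # Match a run of digits that may contain thousands separators
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return default
    return int(match.group().replace(",", ""))


if __name__ == "__main__":
    print(parse_count("2,836 views"))
    print(parse_count(None))
```

With BeautifulSoup this could be combined with `select_one`, which returns `None` when nothing matches, for example `node = item.select_one("div.watch-view-count")` followed by `parse_count(node.get_text() if node else None)`.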