
I just wrote some code that scrapes, one by one, the page of each GSOC organization mentioned on the site. How can I scrape multiple pages faster and more efficiently in Python?

Currently this works fine, but it is quite slow. Is there any way to make it faster? Also, please give any other suggestions to improve this code.

from bs4 import BeautifulSoup 
import requests, sys, os 

f = open('GSOC-Organizations.txt', 'w') 

# Fetch the archive page that lists every organization
r = requests.get("https://summerofcode.withgoogle.com/archive/2016/organizations/") 
soup = BeautifulSoup(r.content, "html.parser") 
a_tags = soup.find_all("a", {"class": "organization-card__link"}) 
title_heads = soup.find_all("h4", {"class": "organization-card__name"}) 

# Collect each organization's page URL and name
links, titles = [], [] 
for tag in a_tags: 
    links.append("https://summerofcode.withgoogle.com" + tag.get('href')) 
for title in title_heads: 
    titles.append(title.getText()) 

# Visit each organization page and write out its technologies
for i in range(0, len(links)): 
    ct = 1 
    print "Currently Scraping : ", 
    print titles[i] 
    name = titles[i] + "\n" + "\tTechnologies: \n" 
    name = name.encode('utf-8') 
    f.write(str(name)) 
    req = requests.get(links[i]) 
    page = BeautifulSoup(req.content, "html.parser") 
    techs = page.find_all("li", {"class": "organization__tag--technology"}) 
    for item in techs: 
        text, ct = ("\t" + str(ct) + ".) " + item.getText() + "\n").encode('utf-8'), ct + 1 
        f.write(str(text)) 
    f.write(("\n\n").encode('utf-8'))

Since you are asking for code improvements, this should be posted on [Code Review](http://codereview.stackexchange.com/) instead. The folks there are very helpful. For now, all the outer `for` loops could really be ported into one loop using `zip`. And what does *pretty slow* mean? For me, all 178 links were scraped in about 4 minutes. Is that too slow? – Parfait
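For reference, a minimal sketch of the `zip` merge the comment suggests (the list comprehensions and the loop body here are illustrative, not part of the original post):

# Build both lists with comprehensions, then walk them in lockstep with zip()
links = ["https://summerofcode.withgoogle.com" + tag.get('href') for tag in a_tags]
titles = [title.getText() for title in title_heads]

for title, link in zip(titles, links):
    print "Currently Scraping : ", title
    # ... fetch and parse `link` exactly as in the original inner loop ...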

Answer


Instead of scraping all the `links[i]` sequentially, you can scrape them in parallel with grequests:

from bs4 import BeautifulSoup 
import requests, sys, os 
import grequests 

f = open('GSOC-Organizations.txt', 'w') 
r = requests.get("https://summerofcode.withgoogle.com/archive/2016/organizations/") 
soup = BeautifulSoup(r.content, "html.parser") 
a_tags = soup.find_all("a", {"class": "organization-card__link"}) 
title_heads = soup.find_all("h4", {"class": "organization-card__name"}) 
links,titles = [],[] 
for tag in a_tags: 
    links.append("https://summerofcode.withgoogle.com"+tag.get('href')) 
for title in title_heads: 
    titles.append(title.getText()) 

rs = (grequests.get(u) for u in links) 

for i, resp in enumerate(grequests.map(rs)): 
    print resp, resp.url 
    # ... continue parsing ...
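
The parsing step is left elided above; here is a sketch of one way it could continue, reusing the tag class and output format from the question (the `None` check is an assumption on my part: `grequests.map()` yields `None` for requests that failed):

for i, resp in enumerate(grequests.map(rs)):
    if resp is None:  # failed requests come back as None
        continue
    # grequests.map() preserves input order, so resp lines up with titles[i]
    f.write((titles[i] + "\n" + "\tTechnologies: \n").encode('utf-8'))
    page = BeautifulSoup(resp.content, "html.parser")
    techs = page.find_all("li", {"class": "organization__tag--technology"})
    for ct, item in enumerate(techs, start=1):
        f.write(("\t" + str(ct) + ".) " + item.getText() + "\n").encode('utf-8'))
    f.write("\n\n")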