
How to scrape multiple pages/cities from one website (BeautifulSoup, Requests, Python3)

I would like to know how to scrape several different pages/cities from one website using Beautiful Soup/Requests without having to repeat my code over and over.

Here is my current code:

import json
import urllib.request

from bs4 import BeautifulSoup

Region = "Marrakech"
Spider = 20

def trade_spider(max_pages):
    page = -1
    partner_ID = 2
    location_ID = 25
    already_printed = set()

    while page <= max_pages:
        page += 1
        response = urllib.request.urlopen(
            "http://www.jsox.com/s/search.json?q=" + str(Region) + "&page=" + str(page))
        jsondata = json.loads(response.read().decode("utf-8"))
        # The 'activities' field holds an HTML fragment; clean it up before parsing.
        activities = jsondata['activities']
        g_data = activities.strip("'<>()[]\"` ").replace('\'', '\"')
        soup = BeautifulSoup(g_data, "html.parser")

        hallo = soup.find_all("article", {"class": "activity-card"})

        for item in hallo:
            headers = item.find_all("h3", {"class": "activity-card"})
            for header in headers:
                header_final = header.text.strip()
                if header_final not in already_printed:
                    already_printed.add(header_final)

            deeplinks = item.find_all("a", {"class": "activity"})
            for t in set(t.get("href") for t in deeplinks):
                deeplink_final = t
                if deeplink_final not in already_printed:
                    already_printed.add(deeplink_final)

            # header_final / deeplink_final carry over from the loops above
            end_final = "Header: " + header_final + " | " + "Deeplink: " + deeplink_final
            print(end_final)

trade_spider(int(Spider))

My goal is to scrape multiple cities/regions from one particular website.

Right now I could repeat my code over and over, scrape each city separately, and then concatenate the results together, but that seems very repetitive and inelegant. I wonder whether anyone has a faster way or any suggestions?

I tried adding a second city to my Region variable, but that does not work:

Region = "Marrakech","London" 
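The comma in that assignment builds a Python tuple, so str(Region) puts the tuple's text form into the query URL instead of a single city name. A minimal sketch of what actually happens:

Region = "Marrakech", "London"    # the comma creates a tuple
print(type(Region))               # <class 'tuple'>
print("q=" + str(Region))         # q=('Marrakech', 'London') -- not a usable query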

Can anyone help me? Any feedback is appreciated.


Have you tried a for loop outside the while loop to iterate over the multiple regions? –

Answer

Region = ["Marrakech","London"] 

Define Region as a list, put the while loop inside a for loop, and reset page to -1 for each region:

for reg in Region: 
    page = -1 

Then use reg instead of Region when building the request URL. The full version:

import json
import urllib.request

from bs4 import BeautifulSoup

Region = ["Marrakech", "London"]
Spider = 20

def trade_spider(max_pages):
    partner_ID = 2
    location_ID = 25
    already_printed = set()

    for reg in Region:
        page = -1
        while page <= max_pages:
            page += 1
            response = urllib.request.urlopen(
                "http://www.jsox.com/s/search.json?q=" + str(reg) + "&page=" + str(page))
            jsondata = json.loads(response.read().decode("utf-8"))
            # The 'activities' field holds an HTML fragment; clean it up before parsing.
            activities = jsondata['activities']
            g_data = activities.strip("'<>()[]\"` ").replace('\'', '\"')
            soup = BeautifulSoup(g_data, "html.parser")

            hallo = soup.find_all("article", {"class": "activity-card"})

            for item in hallo:
                headers = item.find_all("h3", {"class": "activity-card"})
                for header in headers:
                    header_final = header.text.strip()
                    if header_final not in already_printed:
                        already_printed.add(header_final)

                deeplinks = item.find_all("a", {"class": "activity"})
                for t in set(t.get("href") for t in deeplinks):
                    deeplink_final = t
                    if deeplink_final not in already_printed:
                        already_printed.add(deeplink_final)

                end_final = "Header: " + header_final + " | " + "Deeplink: " + deeplink_final
                print(end_final)

trade_spider(int(Spider))
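
As an aside, the same region loop can be written with the requests library from the question title; a minimal sketch, assuming the same jsox.com endpoint and JSON shape as above, with requests handling the URL encoding of the query parameters:

import requests
from bs4 import BeautifulSoup

regions = ["Marrakech", "London"]

for reg in regions:
    for page in range(21):  # pages 0..20, matching Spider = 20 above
        resp = requests.get("http://www.jsox.com/s/search.json",
                            params={"q": reg, "page": page})
        soup = BeautifulSoup(resp.json()["activities"], "html.parser")
        for card in soup.find_all("article", {"class": "activity-card"}):
            h3 = card.find("h3", {"class": "activity-card"})
            if h3:
                print(reg, "|", h3.text.strip())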

Thanks for the feedback, but I don't quite understand. Could you elaborate, if possible? Where exactly should I replace Region with reg when requesting the URL? Thanks again, and sorry if I'm getting on your nerves, but I'm still a beginner –


@SeriousRuffy In this line 'response = urllib.request.urlopen("http://www.jsox.com/s/search.json?q=" + str(Region) + "&page=" + str(page))' – sinisteraadi
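
For clarity, a minimal before/after of that single line, taken from the loop body in the answer above:

# before: always queries the fixed Region value
response = urllib.request.urlopen("http://www.jsox.com/s/search.json?q=" + str(Region) + "&page=" + str(page))
# after: queries the region currently being iterated by the for loop
response = urllib.request.urlopen("http://www.jsox.com/s/search.json?q=" + str(reg) + "&page=" + str(page))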


Thank you very much. Appreciate it. –