
How to scrape multiple pages/cities from one website (BeautifulSoup, Requests, Python3)

I would like to know how to scrape several different pages/cities from one website using Beautiful Soup/Requests without having to repeat my code over and over.

Here is my current code:

import json
import urllib.request

from bs4 import BeautifulSoup

Region = "Marrakech"
Spider = 20

def trade_spider(max_pages):
    page = -1
    partner_ID = 2
    location_ID = 25
    already_printed = set()

    while page <= max_pages:
        page += 1
        response = urllib.request.urlopen(
            "http://www.jsox.com/s/search.json?q=" + str(Region) + "&page=" + str(page))
        jsondata = json.loads(response.read().decode("utf-8"))
        # The 'activities' field holds an HTML fragment; clean it up before parsing.
        activities = jsondata['activities']
        g_data = activities.strip("'<>()[]\"` ").replace('\'', '\"')
        soup = BeautifulSoup(g_data, "html.parser")

        hallo = soup.find_all("article", {"class": "activity-card"})

        for item in hallo:
            headers = item.find_all("h3", {"class": "activity-card"})
            for header in headers:
                header_final = header.text.strip()
                if header_final not in already_printed:
                    already_printed.add(header_final)

            deeplinks = item.find_all("a", {"class": "activity"})
            for t in set(t.get("href") for t in deeplinks):
                deeplink_final = t
                if deeplink_final not in already_printed:
                    already_printed.add(deeplink_final)

            # header_final / deeplink_final carry over from the loops above
            end_final = "Header: " + header_final + " | " + "Deeplink: " + deeplink_final
            print(end_final)

trade_spider(int(Spider))

My goal is to scrape multiple cities/regions from one particular website.

Right now I could repeat my code over and over, scrape each city separately, and then concatenate the results together, but that seems very repetitive and inelegant. I wonder whether anyone has a faster way or any suggestions?

I tried adding a second city to my Region variable, but that does not work:

Region = "Marrakech","London" 
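The comma in that assignment builds a Python tuple, so str(Region) puts the tuple's text form into the query URL instead of a single city name. A minimal sketch of what actually happens:

Region = "Marrakech", "London"    # the comma creates a tuple
print(type(Region))               # <class 'tuple'>
print("q=" + str(Region))         # q=('Marrakech', 'London') -- not a usable query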

Can anyone help me? Any feedback is appreciated.


Have you tried a for loop outside the while loop to iterate over the multiple regions? –

Answer

Region = ["Marrakech","London"] 

Define Region as a list, put the while loop inside a for loop, and reset page to -1 for each region:

for reg in Region: 
    page = -1 

Then use reg instead of Region when building the request URL. The full version:

import json
import urllib.request

from bs4 import BeautifulSoup

Region = ["Marrakech", "London"]
Spider = 20

def trade_spider(max_pages):
    partner_ID = 2
    location_ID = 25
    already_printed = set()

    for reg in Region:
        page = -1
        while page <= max_pages:
            page += 1
            response = urllib.request.urlopen(
                "http://www.jsox.com/s/search.json?q=" + str(reg) + "&page=" + str(page))
            jsondata = json.loads(response.read().decode("utf-8"))
            # The 'activities' field holds an HTML fragment; clean it up before parsing.
            activities = jsondata['activities']
            g_data = activities.strip("'<>()[]\"` ").replace('\'', '\"')
            soup = BeautifulSoup(g_data, "html.parser")

            hallo = soup.find_all("article", {"class": "activity-card"})

            for item in hallo:
                headers = item.find_all("h3", {"class": "activity-card"})
                for header in headers:
                    header_final = header.text.strip()
                    if header_final not in already_printed:
                        already_printed.add(header_final)

                deeplinks = item.find_all("a", {"class": "activity"})
                for t in set(t.get("href") for t in deeplinks):
                    deeplink_final = t
                    if deeplink_final not in already_printed:
                        already_printed.add(deeplink_final)

                end_final = "Header: " + header_final + " | " + "Deeplink: " + deeplink_final
                print(end_final)

trade_spider(int(Spider))
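
As an aside, the same region loop can be written with the requests library from the question title; a minimal sketch, assuming the same jsox.com endpoint and JSON shape as above, with requests handling the URL encoding of the query parameters:

import requests
from bs4 import BeautifulSoup

regions = ["Marrakech", "London"]

for reg in regions:
    for page in range(21):  # pages 0..20, matching Spider = 20 above
        resp = requests.get("http://www.jsox.com/s/search.json",
                            params={"q": reg, "page": page})
        soup = BeautifulSoup(resp.json()["activities"], "html.parser")
        for card in soup.find_all("article", {"class": "activity-card"}):
            h3 = card.find("h3", {"class": "activity-card"})
            if h3:
                print(reg, "|", h3.text.strip())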

Thanks for the feedback, but I don't quite understand. Could you elaborate, if possible? Where exactly should I replace Region with reg when requesting the URL? Thanks again, and sorry if I'm getting on your nerves, but I'm still a beginner –


@SeriousRuffy In this line 'response = urllib.request.urlopen("http://www.jsox.com/s/search.json?q=" + str(Region) + "&page=" + str(page))' – sinisteraadi
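
For clarity, a minimal before/after of that single line, taken from the loop body in the answer above:

# before: always queries the fixed Region value
response = urllib.request.urlopen("http://www.jsox.com/s/search.json?q=" + str(Region) + "&page=" + str(page))
# after: queries the region currently being iterated by the for loop
response = urllib.request.urlopen("http://www.jsox.com/s/search.json?q=" + str(reg) + "&page=" + str(page))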


Thank you very much. Appreciate it. –