2017-09-16

I want to scrape location details from an Instagram URL, but I can't use the "load more" option to scrape additional locations from the page. How can I use the load-more option with a non-headless web scraper [Instagram]?

I would appreciate suggestions on how to modify the code, or what new code block is needed, to fetch all the locations available at the given URL.
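For context, the page embeds its data as a JSON blob assigned to `window._sharedData` inside a `<script>` tag, which is what the regex in my code pulls out. A minimal sketch of that extraction step, run against a hypothetical page snippet (the HTML string below is made up for illustration):

```python
import json
import re

def extract_shared_data(html):
    """Pull the window._sharedData JSON blob out of an Instagram page.

    Non-greedy match so we stop at the first ';</script>' after the blob.
    Returns None if the marker is not found.
    """
    match = re.search(r'window\._sharedData = (.*?);</script>', html)
    if match is None:
        return None
    return json.loads(match.group(1))

# Hypothetical minimal page, mimicking the real structure:
html = ('<script>window._sharedData = {"entry_data": {"LocationsDirectoryPage": '
        '[{"location_list": [{"id": "1", "name": "Charminar"}]}]}};</script>')
data = extract_shared_data(html)
print(data['entry_data']['LocationsDirectoryPage'][0]['location_list'][0]['name'])
# Charminar
```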

Code:

import re
import requests
import json
import pandas as pd
from geopy.geocoders import Nominatim

def Location_city(F_name):
    path = "D:\\Everyday_around_world\\instagram\\"
    filename = path + F_name
    url1 = "https://www.instagram.com/explore/locations/c1027234/hyderabad-india/"
    r = requests.get(url1)
    df3 = pd.DataFrame()
    # The page embeds its data as JSON assigned to window._sharedData
    match = re.search('window._sharedData = (.*);</script>', r.text)
    a = json.loads(match.group(1))
    b = a['entry_data']['LocationsDirectoryPage'][0]['location_list']
    geolocator = Nominatim()  # create once, outside the loop
    for z in b:
        # keep only names that are pure ASCII
        if all(ord(char) < 128 for char in z['name']):
            x = str(z['name'])
            print(x)
            location = geolocator.geocode(x, timeout=10000)
            if location is not None:
                df3 = df3.append(pd.DataFrame({'name': z['name'], 'id': z['id'],
                                               'latitude': location.latitude,
                                               'longitude': location.longitude},
                                              index=[0]), ignore_index=True)
    df3.to_csv(filename, header=True, index=False)

Location_city("Hyderabad_locations.csv")

Thanks in advance for your help.

Answer


For Instagram URLs, I think the "load more" button you describe adds a page number to your URL, like this: https://www.instagram.com/explore/locations/c1027234/hyderabad-india/?page=2

You can add an iteration counter to mimic incrementing the page number, and loop for as long as you keep getting results. I added a try/except to watch for the KeyError that is raised when no more results come back, which sets the condition to exit the loop and write the dataframe to csv.
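The loop described above can be sketched independently of the network calls. The `fetch_page` function below is a stand-in for `requests.get` plus the `window._sharedData` JSON parsing, and the stub page data is made up for illustration:

```python
def scrape_all_pages(fetch_page):
    """Generic 'load more' loop: keep incrementing ?page=N and collecting
    results until the page no longer contains a location_list, which
    surfaces as a KeyError."""
    results = []
    page = 1
    while True:
        data = fetch_page(page)
        try:
            batch = data['entry_data']['LocationsDirectoryPage'][0]['location_list']
        except KeyError:
            break  # no more results; stop paging
        results.extend(batch)
        page += 1
    return results

# Stub standing in for the real fetch + parse, for illustration only:
pages = {
    1: {'entry_data': {'LocationsDirectoryPage': [{'location_list': [{'id': '1'}]}]}},
    2: {'entry_data': {'LocationsDirectoryPage': [{'location_list': [{'id': '2'}]}]}},
}
fetch = lambda n: pages.get(n, {'entry_data': {}})  # page 3+ has no location_list
print(len(scrape_all_pages(fetch)))  # 2
```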

Modified code:

import re
import requests
import json
import pandas as pd
from geopy.geocoders import Nominatim

def Location_city(F_name):
    path = "D:\\Everyday_around_world\\instagram\\"
    filename = path + F_name
    url1 = "https://www.instagram.com/explore/locations/c1027234/hyderabad-india/?page="
    pageNumber = 1
    r = requests.get(url1 + str(pageNumber))  # grabs page 1
    df3 = pd.DataFrame()
    geolocator = Nominatim()  # create once, outside the loop
    searching = True
    while searching:
        match = re.search('window._sharedData = (.*);</script>', r.text)
        a = json.loads(match.group(1))
        try:
            b = a['entry_data']['LocationsDirectoryPage'][0]['location_list']
        except KeyError:
            print("No more locations returned")
            searching = False  # will exit while loop
            b = []  # avoids duplicates from previous results
        for z in b:  # skipped when there are no results
            if all(ord(char) < 128 for char in z['name']):
                x = str(z['name'])
                print(x)
                location = geolocator.geocode(x, timeout=10000)
                if location is not None:
                    df3 = df3.append(pd.DataFrame({'name': z['name'], 'id': z['id'],
                                                   'latitude': location.latitude,
                                                   'longitude': location.longitude},
                                                  index=[0]), ignore_index=True)
        pageNumber += 1
        next_url = url1 + str(pageNumber)  # increments url
        r = requests.get(next_url)  # gets results for next url
    # when finished looping through pages, write dataframe to csv
    df3.to_csv(filename, header=True, index=False)

Location_city("Hyderabad_locations.csv")

Thank you so much. The solution works like a charm. I can only award the bounty points within the next 24 hours; will do so as soon as I can. Thanks again. –