
Extracting Twitter follower data using the Selenium Chrome webdriver in Python - not able to load all followers

I want to pull the follower data for every follower of an account that has 80K followers, using the Selenium Chrome webdriver and BeautifulSoup. I am running into two problems with my script:

1) When scrolling to the bottom of the page to get the full page source after all followers have loaded, my script does not scroll all the way down. It stops scrolling after a random number of followers has loaded and then starts iterating over each follower profile to collect their data. I want it to load every follower on the page first and only then start iterating over the profiles.

2) My second problem is that on every run the script scrolls from one batch to the next until all the followers are loaded, and only then starts extracting data by parsing the follower profiles one at a time. In my case (80K followers) it takes 4 to 5 days to fetch all the follower data. Is there a better way to do this?

Here is my script:

from bs4 import BeautifulSoup 
 
import sys 
 
import os,re 
 
import time 
 
from selenium import webdriver 
 
from selenium.webdriver.support.ui import Select 
 
from selenium.webdriver.common.keys import Keys 
 
from os import listdir 
 
from os.path import isfile, join 
 

 
print "Running for chrome." 
 

 
chromedriver=sys.argv[1] 
 
download_path=sys.argv[2] 
 
os.system('killall -9 "Google Chrome"') 
 
try: 
    os.environ["webdriver.chrome.driver"]=chromedriver 
    chromeOptions = webdriver.ChromeOptions() 
    prefs = {"download.default_directory" : download_path} 
    chromeOptions.add_experimental_option("prefs",prefs) 
    driver = webdriver.Chrome(executable_path=chromedriver, chrome_options=chromeOptions) 
    driver.implicitly_wait(20) 
    driver.maximize_window() 
except Exception as err: 
    print "Error:Failed to open chrome." 
    print "Error: ",err 
    driver.stop_client() 
    driver.close() 
 
#opening the web page 
 
try: 
    driver.get('https://twitter.com/login') 
except Exception as err: 
    print "Error:Failed to open url." 
    print "Error: ",err 
    driver.stop_client() 
    driver.close() 
 

 
username = driver.find_element_by_xpath("//input[@name='session[username_or_email]' and @class='js-username-field email-input js-initial-focus']") 
 
password = driver.find_element_by_xpath("//input[@name='session[password]' and @class='js-password-field']") 
 

 
username.send_keys("###########") 
 
password.send_keys("###########") 
 
driver.find_element_by_xpath("//button[@type='submit']").click() 
 
#os.system('killall -9 "Google Chrome"') 
 
driver.get('https://twitter.com/sadserver/followers') 
 

 

 

 
followers_link=driver.page_source #follower page, 18 at a time 
 
soup=BeautifulSoup(followers_link,'html.parser') 
 

 
output=open('twitter_follower_sadoperator.csv','a') 
 
output.write('Name,Twitter_Handle,Location,Bio,Join_Date,Link'+'\n') 
 
div = soup.find('div',{'class':'GridTimeline-items has-items'}) 
 
bref = div.findAll('a',{'class':'ProfileCard-bg js-nav'}) 
 
name_list=[] 
 
lastHeight = driver.execute_script("return document.body.scrollHeight") 
 

 

 
followers_per_page = 18    # followers load roughly 18 at a time (see comment above) 
followers_count = 80000    # total followers of the account (80K per the question) 

for _ in xrange(0, followers_count/followers_per_page + 1): 
 
     driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") 
 
     time.sleep(5) 
 
     newHeight = driver.execute_script("return document.body.scrollHeight") 
 
     if newHeight == lastHeight: 
 
       followers_link=driver.page_source #follower page, 18 at a time 
 
       soup=BeautifulSoup(followers_link,'html.parser') 
 
       div = soup.find('div',{'class':'GridTimeline-items has-items'}) 
 
       bref = div.findAll('a',{'class':'ProfileCard-bg js-nav'}) 
 
       for name in bref: 
 
         name_list.append(name['href']) 
 
       break 
 
     lastHeight = newHeight 
 
     followers_link='' 
 

 
print len(name_list) 
 

 

 
for x in range(0,len(name_list)): 
 
     #print name['href'] 
 
     #print name.text 
 
     driver.stop_client() 
 
     driver.get('https://twitter.com'+name_list[x]) 
 
     page_source=driver.page_source 
 
     each_soup=BeautifulSoup(page_source,'html.parser') 
 
     profile=each_soup.find('div',{'class':'ProfileHeaderCard'}) 
 
          
 
     try: 
 
       name = profile.find('h1',{'class':'ProfileHeaderCard-name'}).find('a').text 
 
       if name: 
 
         output.write('"'+name.strip().encode('utf-8')+'"'+',') 
 
       else: 
 
         output.write(' '+',') 
 
     except Exception as e: 
 
       output.write(' '+',') 
 
       print 'Error in name:',e 
 

 
     try: 
 
       handle=profile.find('h2',{'class':'ProfileHeaderCard-screenname u-inlineBlock u-dir'}).text 
 
       if handle: 
 
         output.write('"'+handle.strip().encode('utf-8')+'"'+',') 
 
       else: 
 
         output.write(' '+',') 
 
     except Exception as e: 
 
       output.write(' '+',') 
 
       print 'Error in handle:',e 
 

 
     try: 
 
       location = profile.find('div',{'class':'ProfileHeaderCard-location'}).text 
 
       if location: 
 
         output.write('"'+location.strip().encode('utf-8')+'"'+',') 
 
       else: 
 
         output.write(' '+',') 
 
     except Exception as e: 
 
       output.write(' '+',') 
 
       print 'Error in location:',e 
 

 
     try: 
 
       bio=profile.find('p',{'class':'ProfileHeaderCard-bio u-dir'}).text 
 
       if bio: 
 
         output.write('"'+bio.strip().encode('utf-8')+'"'+',') 
 
       else: 
 
         output.write(' '+',') 
 
     except Exception as e: 
 
       output.write(' '+',') 
 
       print 'Error in bio:',e 
 
         
 
     try: 
 
       joinDate = profile.find('div',{'class':'ProfileHeaderCard-joinDate'}).text 
 
       if joinDate: 
 
         output.write('"'+joinDate.strip().encode('utf-8')+'"'+',') 
 
       else: 
 
         output.write(' '+',') 
 
     except Exception as e: 
 
       output.write(' '+',') 
 
       print 'Error in joindate:',e 
 
     
 
     try: 
 
       url = [check.find('a') for check in profile.find('div',{'class':'ProfileHeaderCard-url'}).findAll('span')][1] 
 
       if url: 
 
         output.write('"'+url['href'].strip().encode('utf-8')+'"'+'\n') 
 
       else: 
 
         output.write(' '+'\n') 
 
     except Exception as e: 
 
       output.write(' '+'\n') 
 
       print 'Error in url:',e 
 
     
 

 

 
     
 
output.close() 
 

 

 
os.system("kill -9 `ps -deaf | grep chrome | awk '{print $2}'`")

Answers


There is a better way. Use Twitter's API - here is a quick GitHub script I found: Github Script. Sorry, you may feel you have already spent a great deal of time on Selenium (there are pros to not using the API). A great post on automating this and on how it all works: Twitter API
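For reference - this is not the GitHub script linked above - a minimal sketch of the API route, assuming the tweepy 3.x library, placeholder credentials, and the sadoperator account from the question; followers/list is rate-limited, so wait_on_rate_limit lets the loop sleep through the limit windows:

import tweepy

# Placeholder credentials - substitute your own Twitter app keys.
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# Page through followers/list 200 profiles at a time, sleeping through rate limits.
for follower in tweepy.Cursor(api.followers, screen_name='sadoperator', count=200).items():
    print('%s,%s,%s,%s' % (follower.name, follower.screen_name,
                           follower.location, follower.created_at))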

There are also ways to keep scrolling, but you have to do some math or set a condition to stop it:

driver.execute_script("window.scrollTo(0, 10000);") 

Let's say you have 10K followers, an initial batch of followers is displayed, and after that about 10 more followers load per scroll - you would then have to scroll roughly another thousand times.
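As a minimal sketch of such a stop condition (assuming driver is the logged-in webdriver from the question's script; the pause length and retry count are arbitrary), the loop below keeps scrolling until the page height stops growing for several consecutive attempts rather than stopping on the first unchanged height:

import time

SCROLL_PAUSE = 5    # seconds to wait for the next batch of followers (arbitrary)
MAX_RETRIES = 3     # unchanged heights to tolerate before giving up (arbitrary)

last_height = driver.execute_script("return document.body.scrollHeight")
retries = 0
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(SCROLL_PAUSE)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        retries += 1
        if retries >= MAX_RETRIES:
            break           # height stopped growing repeatedly - assume everything loaded
    else:
        retries = 0
    last_height = new_height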

Below is the exact usage for your case, courtesy of alecxe :D (Quora answer by alecxe):

html = driver.page_source 

Grab .page_source once you have revealed all the followers (by scrolling), then parse it with something like BeautifulSoup.


Could we load all the followers by scrolling manually, save the page source to a text file, and then iterate over all the follower data from that text file instead of going to the Twitter site? I don't know whether that would work. If it would, could you please provide the code to do it? I have been trying to do this without success. Thanks.


Yes, Selenium has a function for that: .page_source, e.g. html = driver.page_source
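To illustrate the idea from the comment above, a minimal sketch (the file name is arbitrary and driver is the logged-in webdriver from the question): scroll until all followers are visible, write the page source to a file once, then parse the saved file later without hitting Twitter again.

from bs4 import BeautifulSoup

# Dump the fully scrolled followers page once (page_source is unicode in Python 2).
with open('followers_page.html', 'w') as f:
    f.write(driver.page_source.encode('utf-8'))

# Later, parse the saved file instead of reloading the Twitter page.
with open('followers_page.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

profile_links = soup.findAll('a', {'class': 'ProfileCard-bg js-nav'})
name_list = [a['href'] for a in profile_links]
print(len(name_list))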


I made the changes alecxe suggested in his answer, but my script still does not parse all of the followers. It still loads only a random number of them, and I can't seem to get to the bottom of this. Could someone try running it and see whether they can load all the followers? Here is the modified script:

from bs4 import BeautifulSoup 
 
import sys 
 
import os,re 
 
import time 
 
from selenium import webdriver 
 
from selenium.webdriver.support.ui import Select 
 
from selenium.webdriver.common.keys import Keys 
 
from os import listdir 
 
from os.path import isfile, join 
 

 
print "Running for chrome." 
 

 
chromedriver=sys.argv[1] 
 
download_path=sys.argv[2] 
 
os.system('killall -9 "Google Chrome"') 
 
try: 
    os.environ["webdriver.chrome.driver"]=chromedriver 
    chromeOptions = webdriver.ChromeOptions() 
    prefs = {"download.default_directory" : download_path} 
    chromeOptions.add_experimental_option("prefs",prefs) 
    driver = webdriver.Chrome(executable_path=chromedriver, chrome_options=chromeOptions) 
    driver.implicitly_wait(20) 
    driver.maximize_window() 
except Exception as err: 
    print "Error:Failed to open chrome." 
    print "Error: ",err 
    driver.stop_client() 
    driver.close() 
 
#opening the web page 
 
try: 
    driver.get('https://twitter.com/login') 
except Exception as err: 
    print "Error:Failed to open url." 
    print "Error: ",err 
    driver.stop_client() 
    driver.close() 
 

 
username = driver.find_element_by_xpath("//input[@name='session[username_or_email]' and @class='js-username-field email-input js-initial-focus']") 
 
password = driver.find_element_by_xpath("//input[@name='session[password]' and @class='js-password-field']") 
 

 
username.send_keys("*****************") 
 
password.send_keys("*****************") 
 
driver.find_element_by_xpath("//button[@type='submit']").click() 
 
#os.system('killall -9 "Google Chrome"') 
 
driver.get('https://twitter.com/sadoperator/followers') 
 

 

 

 
followers_link=driver.page_source #follower page, 18 at a time 
 
soup=BeautifulSoup(followers_link,'html.parser') 
 

 
output=open('twitter_follower_sadoperator.csv','a') 
 
output.write('Name,Twitter_Handle,Location,Bio,Join_Date,Link'+'\n') 
 
div = soup.find('div',{'class':'GridTimeline-items has-items'}) 
 
bref = div.findAll('a',{'class':'ProfileCard-bg js-nav'}) 
 
name_list=[] 
 
lastHeight = driver.execute_script("return document.body.scrollHeight") 
 

 
followers_link=driver.page_source #follower page, 18 at a time 
 
soup=BeautifulSoup(followers_link,'html.parser') 
 

 
followers_per_page = 18 
 
followers_count = 15777 
 

 

 
for _ in xrange(0, followers_count/followers_per_page + 1): 
 
     driver.execute_script("window.scrollTo(0, 7755000);") 
 
     time.sleep(2) 
 
     newHeight = driver.execute_script("return document.body.scrollHeight") 
 
     if newHeight == lastHeight: 
 
       followers_link=driver.page_source #follower page, 18 at a time 
 
       soup=BeautifulSoup(followers_link,'html.parser') 
 
       div = soup.find('div',{'class':'GridTimeline-items has-items'}) 
 
       bref = div.findAll('a',{'class':'ProfileCard-bg js-nav'}) 
 
       for name in bref: 
 
         name_list.append(name['href']) 
 
       break 
 
     lastHeight = newHeight 
 
     followers_link='' 
 

 
print len(name_list) 
 

 
''' 
 
for x in range(0,len(name_list)): 
 
     #print name['href'] 
 
     #print name.text 
 
     driver.stop_client() 
 
     driver.get('https://twitter.com'+name_list[x]) 
 
     page_source=driver.page_source 
 
     each_soup=BeautifulSoup(page_source,'html.parser') 
 
     profile=each_soup.find('div',{'class':'ProfileHeaderCard'}) 
 
          
 
     try: 
 
       name = profile.find('h1',{'class':'ProfileHeaderCard-name'}).find('a').text 
 
       if name: 
 
         output.write('"'+name.strip().encode('utf-8')+'"'+',') 
 
       else: 
 
         output.write(' '+',') 
 
     except Exception as e: 
 
       output.write(' '+',') 
 
       print 'Error in name:',e 
 

 
     try: 
 
       handle=profile.find('h2',{'class':'ProfileHeaderCard-screenname u-inlineBlock u-dir'}).text 
 
       if handle: 
 
         output.write('"'+handle.strip().encode('utf-8')+'"'+',') 
 
       else: 
 
         output.write(' '+',') 
 
     except Exception as e: 
 
       output.write(' '+',') 
 
       print 'Error in handle:',e 
 

 
     try: 
 
       location = profile.find('div',{'class':'ProfileHeaderCard-location'}).text 
 
       if location: 
 
         output.write('"'+location.strip().encode('utf-8')+'"'+',') 
 
       else: 
 
         output.write(' '+',') 
 
     except Exception as e: 
 
       output.write(' '+',') 
 
       print 'Error in location:',e 
 

 
     try: 
 
       bio=profile.find('p',{'class':'ProfileHeaderCard-bio u-dir'}).text 
 
       if bio: 
 
         output.write('"'+bio.strip().encode('utf-8')+'"'+',') 
 
       else: 
 
         output.write(' '+',') 
 
     except Exception as e: 
 
       output.write(' '+',') 
 
       print 'Error in bio:',e 
 
         
 
     try: 
 
       joinDate = profile.find('div',{'class':'ProfileHeaderCard-joinDate'}).text 
 
       if joinDate: 
 
         output.write('"'+joinDate.strip().encode('utf-8')+'"'+',') 
 
       else: 
 
         output.write(' '+',') 
 
     except Exception as e: 
 
       output.write(' '+',') 
 
       print 'Error in joindate:',e 
 
     
 
     try: 
 
       url = [check.find('a') for check in profile.find('div',{'class':'ProfileHeaderCard-url'}).findAll('span')][1] 
 
       if url: 
 
         output.write('"'+url['href'].strip().encode('utf-8')+'"'+'\n') 
 
       else: 
 
         output.write(' '+'\n') 
 
     except Exception as e: 
 
       output.write(' '+'\n') 
 
       print 'Error in url:',e 
 
     
 

 

 
     
 
output.close() 
 
''' 
 

 
os.system("kill -9 `ps -deaf | grep chrome | awk '{print $2}'`")

  1. Open the developer console in Firefox or another browser and write down (copy) the requests that are sent while you scroll through the follower pages - you will use them to build your own requests. A request looks something like this - https://twitter.com/DiaryofaMadeMan/followers/users?include_available_features=1&include_entities=1&max_position=1584951385597824282&reset_error_state=false - and search the HTML source for the data-min-position value, which looks like this - data-min-position="1584938620170076301".
  2. Load the HTML (using PhantomJS, for example) and parse it with BeautifulSoup. You need to get the first batch of followers and the "data-min-position" value. Save the followers to a list and the "data-min-position" to a variable.
  3. Using the request saved in step 1 and the saved "data-min-position", build a new request - just replace the max_position number in the request with the saved data-min-position.
  4. Use Python requests (no webdriver any more) to send the request and receive the JSON response.
  5. Get the new followers and the new data-min-position from the response JSON.
  6. Repeat steps 3, 4 and 5 until data-min-position = 0.

This approach is much better than the API because you can load a large amount of data without any restrictions. A rough sketch of this request loop is shown below.
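For illustration, a rough sketch of that loop with requests. The URL and the max_position/data-min-position parameters come from the steps above; the JSON field names ("items_html", "min_position"), the headers, the cookie handling, and the initial cursor value are assumptions and should be checked against what the developer console actually shows.

import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0',
    'X-Requested-With': 'XMLHttpRequest',   # assumed: mark the request as AJAX
})
# Assumed: copy the cookies from the logged-in browser session into session.cookies here.

URL_TEMPLATE = ('https://twitter.com/DiaryofaMadeMan/followers/users'
                '?include_available_features=1&include_entities=1'
                '&max_position={cursor}&reset_error_state=false')

cursor = '1584938620170076301'   # initial data-min-position scraped from the followers page HTML
name_list = []

while cursor and cursor != '0':
    response = session.get(URL_TEMPLATE.format(cursor=cursor))
    data = response.json()
    # "items_html" (HTML fragment with the next profile cards) and "min_position"
    # (the next cursor) are assumed field names - verify them in the real response.
    soup = BeautifulSoup(data.get('items_html', ''), 'html.parser')
    for card in soup.findAll('a', {'class': 'ProfileCard-bg js-nav'}):
        name_list.append(card['href'])
    cursor = data.get('min_position')

print(len(name_list))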
