Extracting Twitter follower data using the Selenium Chrome webdriver in Python? I want to pull follower data for all followers of an account that has 80K followers, using the Selenium Chrome webdriver and BeautifulSoup. I am running into two problems with my script:
1) When scrolling to the bottom of the page to get the full page source after all followers have loaded, my script does not scroll all the way down. It stops scrolling after a random number of followers has loaded, and then starts iterating through each follower profile to get their data. I want it to load all the followers on the page first and only then start iterating through the profiles.
2) My second problem is that every time I run the script, it tries to scroll page by page until all the followers are loaded, and only then starts extracting data by parsing one follower at a time. In my case (80K followers) this takes 4 to 5 days to fetch all the follower data. Is there a better way to do this?
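One likely cause of problem 1 is treating a single unchanged `scrollHeight` reading as "done", even though Twitter may simply be slow to deliver the next batch of followers. A minimal sketch of a more tolerant stop condition (the `patience` value and the function itself are illustrative assumptions, not part of the original script):

```python
def scroll_finished(heights, patience=3):
    """Return True only when the last `patience` scrollHeight readings
    are all identical, i.e. the page has genuinely stopped growing."""
    if len(heights) < patience:
        return False
    return len(set(heights[-patience:])) == 1

# In the scroll loop you would append each new scrollHeight reading to a
# list and break only once scroll_finished(heights) returns True, instead
# of breaking on the first reading that equals the previous one.
```

With `patience=3`, one slow batch no longer ends the loop early; the script keeps scrolling until three consecutive readings agree.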
Here is my script:
from bs4 import BeautifulSoup
import sys
import os,re
import time
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys
from os import listdir
from os.path import isfile, join
print "Running for chrome."
chromedriver=sys.argv[1]
download_path=sys.argv[2]
os.system('killall -9 "Google Chrome"')
try:
    os.environ["webdriver.chrome.driver"] = chromedriver
    chromeOptions = webdriver.ChromeOptions()
    prefs = {"download.default_directory": download_path}
    chromeOptions.add_experimental_option("prefs", prefs)
    driver = webdriver.Chrome(executable_path=chromedriver, chrome_options=chromeOptions)
    driver.implicitly_wait(20)
    driver.maximize_window()
except Exception as err:
    print "Error: Failed to open chrome."
    print "Error: ", err
    driver.stop_client()
    driver.close()
#opening the web page
#opening the web page
try:
    driver.get('https://twitter.com/login')
except Exception as err:
    print "Error: Failed to open url."
    print "Error: ", err
    driver.stop_client()
    driver.close()
username = driver.find_element_by_xpath("//input[@name='session[username_or_email]' and @class='js-username-field email-input js-initial-focus']")
password = driver.find_element_by_xpath("//input[@name='session[password]' and @class='js-password-field']")
username.send_keys("###########")
password.send_keys("###########")
driver.find_element_by_xpath("//button[@type='submit']").click()
#os.system('killall -9 "Google Chrome"')
driver.get('https://twitter.com/sadserver/followers')
followers_link = driver.page_source  # follower page source; Twitter loads 18 followers at a time
soup=BeautifulSoup(followers_link,'html.parser')
output=open('twitter_follower_sadoperator.csv','a')
output.write('Name,Twitter_Handle,Location,Bio,Join_Date,Link'+'\n')
div = soup.find('div',{'class':'GridTimeline-items has-items'})
bref = div.findAll('a',{'class':'ProfileCard-bg js-nav'})
name_list=[]
lastHeight = driver.execute_script("return document.body.scrollHeight")
followers_per_page = 18    # Twitter loads followers 18 at a time
followers_count = 80000    # approximate follower count of the target account
for _ in xrange(0, followers_count / followers_per_page + 1):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    newHeight = driver.execute_script("return document.body.scrollHeight")
    if newHeight == lastHeight:
        followers_link = driver.page_source  # full page source after scrolling
        soup = BeautifulSoup(followers_link, 'html.parser')
        div = soup.find('div', {'class': 'GridTimeline-items has-items'})
        bref = div.findAll('a', {'class': 'ProfileCard-bg js-nav'})
        for name in bref:
            name_list.append(name['href'])
        break
    lastHeight = newHeight
followers_link=''
print len(name_list)
for x in range(0, len(name_list)):
    #print name['href']
    #print name.text
    driver.stop_client()
    driver.get('https://twitter.com' + name_list[x])
    page_source = driver.page_source
    each_soup = BeautifulSoup(page_source, 'html.parser')
    profile = each_soup.find('div', {'class': 'ProfileHeaderCard'})
    try:
        name = profile.find('h1', {'class': 'ProfileHeaderCard-name'}).find('a').text
        if name:
            output.write('"' + name.strip().encode('utf-8') + '"' + ',')
        else:
            output.write(' ' + ',')
    except Exception as e:
        output.write(' ' + ',')
        print 'Error in name:', e
    try:
        handle = profile.find('h2', {'class': 'ProfileHeaderCard-screenname u-inlineBlock u-dir'}).text
        if handle:
            output.write('"' + handle.strip().encode('utf-8') + '"' + ',')
        else:
            output.write(' ' + ',')
    except Exception as e:
        output.write(' ' + ',')
        print 'Error in handle:', e
    try:
        location = profile.find('div', {'class': 'ProfileHeaderCard-location'}).text
        if location:
            output.write('"' + location.strip().encode('utf-8') + '"' + ',')
        else:
            output.write(' ' + ',')
    except Exception as e:
        output.write(' ' + ',')
        print 'Error in location:', e
    try:
        bio = profile.find('p', {'class': 'ProfileHeaderCard-bio u-dir'}).text
        if bio:
            output.write('"' + bio.strip().encode('utf-8') + '"' + ',')
        else:
            output.write(' ' + ',')
    except Exception as e:
        output.write(' ' + ',')
        print 'Error in bio:', e
    try:
        joinDate = profile.find('div', {'class': 'ProfileHeaderCard-joinDate'}).text
        if joinDate:
            output.write('"' + joinDate.strip().encode('utf-8') + '"' + ',')
        else:
            output.write(' ' + ',')
    except Exception as e:
        output.write(' ' + ',')
        print 'Error in joindate:', e
    try:
        url = [check.find('a') for check in profile.find('div', {'class': 'ProfileHeaderCard-url'}).findAll('span')][1]
        if url:
            output.write('"' + url['href'].strip().encode('utf-8') + '"' + '\n')
        else:
            output.write(' ' + '\n')
    except Exception as e:
        output.write(' ' + '\n')
        print 'Error in url:', e
output.close()
os.system("kill -9 `ps -deaf | grep chrome | awk '{print $2}'`")
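A side note on the output file: the manual `'"' + field + '"'` quoting in the script breaks as soon as a name or bio itself contains a double quote, comma inside quotes is fine, but an embedded `"` or newline will corrupt the row. Python's `csv` module handles the escaping. A minimal sketch (Python 3 syntax; the field names are taken from the header the script writes, the sample row is made up):

```python
import csv
import io

# Build the same header the script uses, but via csv.writer so that
# embedded quotes, commas, and newlines are escaped correctly.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['Name', 'Twitter_Handle', 'Location', 'Bio',
                 'Join_Date', 'Link'])
writer.writerow(['Ann "Sad" Server', '@sadserver', 'Oslo',
                 'a bio, with commas', 'Jan 2010', ''])
csv_text = buf.getvalue()
```

In the real script you would pass the open `output` file to `csv.writer` instead of a `StringIO` buffer and drop all the hand-rolled quoting.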
Could we load all the followers by scrolling manually, save the page source to a text file, and then iterate over all the followers' data from that text file instead of going to the Twitter site? I don't know whether that would work. If it would, could you please provide the code to do it? I have been trying to do this without success. Thanks. –
Yes, Selenium has the function `.page_source` for that, e.g. `html = driver.page_source` –
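Following up on that comment: once the fully scrolled page source is saved to a file, the profile links can be pulled out offline without hitting Twitter again. A minimal sketch using a regex over the `ProfileCard-bg js-nav` anchors (the helper name and HTML snippet are illustrative assumptions; it also assumes `href` appears after `class` in each tag — feeding the saved file to BeautifulSoup, as the script already does, avoids that fragility):

```python
import re

def extract_profile_links(page_source):
    """Pull the href of every ProfileCard anchor out of saved page source."""
    pattern = r'<a[^>]*class="ProfileCard-bg js-nav"[^>]*href="([^"]+)"'
    return re.findall(pattern, page_source)

# Example against a saved-source snippet:
sample = ('<div class="GridTimeline-items has-items">'
          '<a class="ProfileCard-bg js-nav" href="/alice"></a>'
          '<a class="ProfileCard-bg js-nav" href="/bob"></a>'
          '</div>')
links = extract_profile_links(sample)  # -> ['/alice', '/bob']
```

You would read the saved file once with `open('followers.html').read()` and feed the result to this function (or to BeautifulSoup) in place of the live scroll loop.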