2015-12-10

Scraping a website with Selenium and BeautifulSoup

So I'm trying to load a website that fills in some of its content dynamically with JS. My goal is to build a quick Python script that loads the site, checks whether a certain word is there, and then emails me if it is.

I'm fairly new to coding, so if there's a better way to do this I'd be happy to hear it.

I'm currently loading the page with Selenium and then trying to scrape the rendered page with BeautifulSoup, and that's where I'm running into trouble. How do I get BeautifulSoup to scrape the site I just opened in Selenium?

from __future__ import print_function 
from bs4 import BeautifulSoup 
from selenium import webdriver 
import requests 
import urllib, urllib2 
import time 


url = 'http://www.somesite.com/' 

path_to_chromedriver = '/Users/admin/Downloads/chromedriver' 
browser = webdriver.Chrome(executable_path = path_to_chromedriver) 

site = browser.get(url) 

html = urllib.urlopen(site).read() 
soup = BeautifulSoup(html, "lxml") 
print(soup.prettify()) 

I'm getting an error that says

Traceback (most recent call last): 
    File "probation color.py", line 16, in <module> 
    html = urllib.urlopen(site).read() 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 87, in urlopen 
    return opener.open(url) 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 185, in open 
    fullurl = unwrap(toBytes(fullurl)) 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 1075, in unwrap 
    url = url.strip() 
AttributeError: 'NoneType' object has no attribute 'strip' 

and I don't really understand why the error is happening. Does it have something to do with urllib internals? How do I fix it? I think fixing this would solve my problem.

Answers


You can get the HTML from the browser with its "page_source" attribute. This should work:

browser = webdriver.Chrome(executable_path = path_to_chromedriver) 
browser.get(url) 

html = browser.page_source 
soup = BeautifulSoup(html, "lxml") 
print(soup.prettify()) 
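From there, the word check the question describes could be done directly on the parsed text. A minimal sketch, where word is a hypothetical placeholder for whatever string you're watching for:

word = 'sale'  # hypothetical: the word you are looking for
if word in soup.get_text():
    print('"%s" is on the page' % word)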

Thanks! Exactly what I needed. – Josh

from __future__ import print_function 
from bs4 import BeautifulSoup 
from selenium import webdriver 
import requests 
#import urllib, urllib2 
import time 


url = 'http://www.somesite.com/' 

path_to_chromedriver = '/Users/admin/Downloads/chromedriver' 
browser = webdriver.Chrome(executable_path = path_to_chromedriver) 

browser.get(url)  # browser.get() returns None, so don't assign its result 
html = browser.page_source  # you should have used this: the HTML after the JS has run 

#html = urllib.urlopen(site).read()  # this was the mistake: `site` is None, so urllib had nothing to open 
soup = BeautifulSoup(html, "lxml") 
print(soup.prettify())
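To round out the original goal (load the page, check for a word, email if it's found), a minimal sketch might look like the following. The target word, SMTP host, addresses, and password are all placeholders you'd replace with your own, and smtplib is just one standard-library way to send the mail:

from __future__ import print_function 
import smtplib 
from email.mime.text import MIMEText 
from bs4 import BeautifulSoup 
from selenium import webdriver 

url = 'http://www.somesite.com/' 
word = 'sale'  # hypothetical: the word to watch for 

path_to_chromedriver = '/Users/admin/Downloads/chromedriver' 
browser = webdriver.Chrome(executable_path=path_to_chromedriver) 
browser.get(url) 
soup = BeautifulSoup(browser.page_source, "lxml")  # HTML after the JS has run 
browser.quit() 

if word in soup.get_text(): 
    # placeholder SMTP settings -- use your own server, address and password 
    msg = MIMEText('Found "%s" on %s' % (word, url)) 
    msg['Subject'] = 'Word found' 
    msg['From'] = 'me@example.com' 
    msg['To'] = 'me@example.com' 

    server = smtplib.SMTP('smtp.example.com', 587) 
    server.starttls() 
    server.login('me@example.com', 'password') 
    server.sendmail(msg['From'], [msg['To']], msg.as_string()) 
    server.quit()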