BS fails to get a section by ID from a page retrieved with Selenium


import re 
from lxml import html 
from bs4 import BeautifulSoup as BS 
from selenium import webdriver 
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary 
import requests 
import sys 
import datetime 

print ('start!') 
print(datetime.datetime.now()) 

list_file = 'list2.csv' 
#This should be the regular input list 

url_list=["http://www.genecards.org/cgi-bin/carddisp.pl?gene=ENO3&keywords=ENO3"] 
#This is an example input instead 

binary = FirefoxBinary('C:/Program Files (x86)/Mozilla Firefox/firefox.exe') 
#Read somewhere it could be a variable useful to supply but anyway, the program fails randomly at time with [WinError 6] Invalid Descriptor while having nothing different from when it is able to at least get the webpage; even when not able to perform further operation. 

for page in url_list: 
    print(page) 
    browser = webdriver.Firefox(firefox_binary=binary) 
    #I tried this too to solve the [WinError 6] but it is not working 
    browser.get(page) 
    print ("TEST BEGINS") 
    soup=BS(browser.page_source,"lxml") 
    soup=soup.find("summaries") 
    # This fails here. It finds nothing, while there is a section id termed summaries. soup.find_all("p") works but i don't want all the p's outside of summaries 
    print(soup) #It prints "None" indeed. 
    print ("TEST ENDS")

The page source does contain "summaries". It first appears as

<li> <a href="#summaries" ng-click="scrollTo('summaries')">Summaries</a></li>

and then as

<section id="summaries" data-ga-label="Summaries" data-section="Summaries">

As suggested here (Webscraping in python: BS, selenium, and None error) by @alexce, I tried

summary = soup.find('section', attrs={'id':'summaries'})

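(Edit: the suggestion was _summaries, but I did test summaries as well.)

But it does not work either. So my questions are: why can't BS find summaries, and why does Selenium keep breaking when I run the script too many times in a row (restarting the console works, but that is tedious), or when the list contains more than four entries? Thanks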

Source

2016-03-16 Ando Jurai

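I tested many of the solutions proposed [here](http://stackoverflow.com/questions/2136267/beautiful-soup-and-extracting-a-div-and-its-contents-by-id) and they don't work. So I guess this has to do with my specific page... I also tried using things other than Selenium (robobrowser, mechanical soup), but I could not get those packages to work under Windows... –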

This:

summary = soup.find('section', attrs={'id':'_summaries'})

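searches for a section element whose id attribute is set to _summaries:

<section id="_summaries" />

There is no element with that attribute in the page.
The one you want is probably <section id="summaries" data-ga-label="Summaries" data-section="Summaries">. It can be matched with: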

results = soup.find('section', id='summaries')
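
A minimal sketch for checking how BeautifulSoup treats each form of the lookup, run on a small inline HTML fragment rather than the real page. The fragment and the variable names are made up for illustration, and html.parser is used only so that the sketch has no lxml dependency. The first positional argument of find() is a tag name, and only class_ gets special underscore handling, so a keyword spelled id_ is taken as an attribute literally named id_.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the two places "summaries" appears in the page.
HTML = """
<ul><li><a href="#summaries">Summaries</a></li></ul>
<section id="summaries" data-section="Summaries">
  <p>First summary paragraph.</p>
  <p>Second summary paragraph.</p>
</section>
<p>Paragraph outside the section.</p>
"""

soup = BeautifulSoup(HTML, "html.parser")

# A bare string is a tag name: this looks for a <summaries> element.
by_tag_name = soup.find("summaries")

# Two equivalent ways to match on the id attribute.
by_keyword = soup.find("section", id="summaries")
by_attrs = soup.find("section", attrs={"id": "summaries"})

# id_ is not rewritten to id; it filters on an attribute named "id_".
by_underscore = soup.find("section", id_="summaries")

# Only the paragraphs inside the matched section.
paragraphs = [p.get_text() for p in by_keyword.find_all("p")]
print(by_tag_name, by_underscore, paragraphs)
```

Calling find_all("p") on the matched section, rather than on the whole soup, is what keeps the paragraphs outside the section out of the result.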

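Also, a side note on why Selenium is needed. The page returns an error if you do not forward cookies, so in order to use requests you need to send cookies.

My full code: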

from __future__ import unicode_literals

import re
import requests
from bs4 import BeautifulSoup as BS

# The page returns an error unless these cookies are sent.
data = requests.get(
    'http://www.genecards.org/cgi-bin/carddisp.pl?gene=ENO3&keywords=ENO3',
    cookies={
        'nlbi_146342': '+fhjaf6NSntlOWmvFHlFeAAAAAAwHqv5tJUsy3kqgNQOt77C',
        'visid_incap_146342': 'tEumui9aQoue4yMuu9tuUcly6VYAAAAAQUIPAAAAAABcQsCGxBC1gj0OdNFoMEx+',
        'incap_ses_189_146342': 'bNY8PNPZJzroIFLs6nefAspy6VYAAAAAYlWrxz2UrYFlrqgcQY9AuQ=='
    }).content

# Name the parser explicitly, as the question does.
soup = BS(data, 'lxml')

# Every text node that contains "summary", case-insensitively.
summary_re = re.compile('summary', re.I)
results = soup.find_all(string=summary_re)
print(results)

# The section itself, matched on its id attribute.
results = soup.find('section', id='summaries')
print(results)

Source

2016-03-16 15:25:55 Cyrbil

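Sorry, I wrote what the suggestion said, but I had also tried summary = soup.find('section', attrs={'id':'summaries'}). Thanks for your suggestion, but why do I have to use "id_" with an underscore? Why is it not treated as an attribute (as it is with attrs={'id':'summaries'})? There is a subtlety I am missing, but I guess that is because I mostly know basic HTML: I know how to read it, but I don't know its "grammar". So maybe I am confusing an attribute with something else. –

About cookies: I think I cannot use requests, because the page is protected against bots and blocks my scraper (all I am doing is what I would otherwise do by hand, copy-pasting things at human speed, except that I can do something else in the meantime). Also, are these cookies standard, or where should I find a list of them? As I said, I know a few things, but this situation is well beyond my skill level and knowledge. –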

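'id' is not a reserved Python keyword (it is only the name of a built-in function), so it can be passed as a keyword argument directly: soup.find('section', id='summaries'). The trailing underscore is needed only for 'class_', because 'class' is reserved. The second option is to write attrs={'id': 'summaries'}, where 'id' is a plain string. To get around the cookies, you need to fetch the page once first (with the OPTIONS verb) to get the cookies, then resend the request to get the page. That is basically what your browser and Selenium's Firefox do. – Cyrbil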
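
The two-step fetch described in this comment could be sketched with a requests.Session, which stores cookies from one response and sends them with the next request. The function name fetch_after_warmup is hypothetical, and the sketch uses a plain GET for the first request instead of OPTIONS. Nothing in this thread confirms that the site accepts this; if its bot protection sets cookies from JavaScript, a plain HTTP client will not receive them.

```python
import requests


def fetch_after_warmup(url, session=None):
    """Request url twice on one session and return the second response body.

    The first request only collects cookies; the session stores them and
    sends them automatically with the second request.
    """
    if session is None:
        session = requests.Session()
    session.get(url)             # warm-up request, response body ignored
    response = session.get(url)  # cookies from the warm-up are sent here
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text
```

The returned text can then be handed to BeautifulSoup in the same way as browser.page_source in the question.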

The element is perhaps not on the page yet. I would wait for the element before parsing the page source with BS:

from bs4 import BeautifulSoup as BS
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://www.genecards.org/cgi-bin/carddisp.pl?gene=ENO3&keywords=ENO3")
# Wait up to 10 seconds for the section to become visible before reading the source.
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.ID, "summaries")))
soup = BS(driver.page_source, "lxml")

I notice that you never call driver.quit(), which may be the cause of your latter problem. So be sure to call it, or try to reuse the same session.
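
A minimal sketch of both suggestions together: one browser session reused for every URL, with quit() in a finally block so the browser is closed even when a page fails to load. The names scrape_pages, make_driver and handle_source are hypothetical. The driver is created by a factory passed in by the caller, so the function relies only on the get(), page_source and quit() members that Selenium drivers provide.

```python
def scrape_pages(urls, make_driver, handle_source):
    """Load each URL in one shared browser session and always quit it."""
    driver = make_driver()  # one browser for the whole list, not one per URL
    try:
        for url in urls:
            driver.get(url)
            handle_source(url, driver.page_source)
    finally:
        driver.quit()  # runs on success and on error, so no browser is left behind
```

With the question's variables this could be called as scrape_pages(url_list, webdriver.Firefox, lambda url, source: print(url, len(source))). Whether this also removes the [WinError 6] failures is not established in this thread.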

To make it more stable and performant, I would work with the Selenium API as much as possible, since pulling and parsing the page source is expensive.

Source

2016-03-16 16:33:06

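I had a driver.quit() in my loop but dropped it for some reason. I also tried driver.close(), but it breaks as well. –

Thanks. I tried both a fixed waiting time and WebDriverWait, and I had a driver.quit() in my loop but dropped it because it did not solve the problem. I also tried driver.close(), but it breaks as well. I know this is expensive, but I was only trying to reproduce what I had seen before. Actually I could not get Selenium's driver.find_element_by_id calls to work at all, so I tried to use what is most commonly used on the internet (I am not good enough to reinvent the wheel; I can't even make a single round one...) –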
