Python BeautifulSoup does not scrape the whole table

I don't understand why the whole table is not scraped when I fetch the page with mechanize.

This works:

from bs4 import BeautifulSoup
import requests

page = 'http://www.airchina.com.cn/www/jsp/airlines_operating_data/exlshow_en.jsp'
r = requests.get(page)

r.encoding = 'utf-8'
soup = BeautifulSoup(r.text)

# The data table is in the second <div> found inside div.mainRight.
div = soup.find('div', class_='mainRight').find_all('div')[1]
table = div.find('table', recursive=False)

# Walk only the direct rows and cells, so nested tables are skipped.
for row in table.find_all('tr', recursive=False):
    for cell in row('td', recursive=False):
        print(cell.text.split())

But this does not:

import mechanize
from bs4 import BeautifulSoup

URL = 'http://www.airchina.com.cn/www/jsp/airlines_operating_data/exlshow_en.jsp'
control_year = ['2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014']
control_month = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']

br = mechanize.Browser()
r = br.open(URL)

# Select the form named "exl" and look up its month and year controls.
br.select_form("exl")
control_m = br.form.find_control('month')
control_y = br.form.find_control('year')

# Request the data for June 2012.
br[control_m.name] = ['06']
br[control_y.name] = ['2012']
response = br.submit()
soup = BeautifulSoup(response, 'html.parser')
#div = soup.find('div', class_='mainRight')

div = soup.find('div', class_='mainRight').find_all('div')[1]
table = div.find('table', recursive=False)
for row in table.find_all('tr', recursive=False):
    for cell in row('td', recursive=False):
        print(cell.text.strip())

With mechanize I only get the following output, even though Firebug shows all the tr and td elements:

Jun 2012 
% change vs Jun 2011 
% change vs May 2012 
Cumulative Jun 2012 
% cumulative change

2014-04-22 jason

Most likely a 'tbody' element is being added to the table automatically. Loop over all 'tbody' elements in the 'table' before looking for the 'tr' rows. – Wolph

@Wolph. I tried 'table.find_all('tbody')' but it returns '[]' – jason
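
The 'tbody' suggestion can be sketched as a small helper that collects the rows whether or not a 'tbody' wrapper is present. This is only an illustration: the helper names direct_rows and cell_texts and the two sample tables are made up here, not taken from the Air China page.

```python
from bs4 import BeautifulSoup


def direct_rows(table):
    """Return the <tr> rows of this table, ignoring rows of nested tables."""
    rows = []
    # Look only at direct children: either bare rows or row-group wrappers.
    for child in table.find_all(['tr', 'thead', 'tbody', 'tfoot'], recursive=False):
        if child.name == 'tr':
            rows.append(child)
        else:
            rows.extend(child.find_all('tr', recursive=False))
    return rows


def cell_texts(html):
    """Parse html with html.parser and return the cell text of each row."""
    soup = BeautifulSoup(html, 'html.parser')
    return [[td.get_text(strip=True) for td in row.find_all('td', recursive=False)]
            for row in direct_rows(soup.find('table'))]


# Hypothetical sample markup, with and without a <tbody> wrapper.
WITH_TBODY = "<table><tbody><tr><td>a</td></tr><tr><td>b</td></tr></tbody></table>"
WITHOUT_TBODY = "<table><tr><td>a</td></tr><tr><td>b</td></tr></table>"

if __name__ == '__main__':
    print(cell_texts(WITH_TBODY))
    print(cell_texts(WITHOUT_TBODY))
```

Both sample tables yield the same rows. When 'find_all('tbody')' returns an empty list, as reported above, the rows are missing for some other reason.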

I believe it may be related to the 'html.parser' you are using; see my answer – Wolph

Combining the two works without any problem for me, so it is probably related to the html.parser you are using.

import mechanize
from bs4 import BeautifulSoup

URL = ('http://www.airchina.com.cn/www/jsp/airlines_operating_data/'
       'exlshow_en.jsp')
control_year = ['2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
                '2014']
control_month = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10',
                 '11', '12']

br = mechanize.Browser()
r = br.open(URL)

br.select_form("exl")
control_m = br.form.find_control('month')
control_y = br.form.find_control('year')

br[control_m.name] = ['06']
br[control_y.name] = ['2012']
response = br.submit()

# No parser is named here, so BeautifulSoup picks one itself.
soup = BeautifulSoup(response)

div = soup.find('div', class_='mainRight').find_all('div')[1]
table = div.find('table', recursive=False)

for row in table.find_all('tr', recursive=False):
    for cell in row('td', recursive=False):
        print(cell.text.split())
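
Because the code above does not name a parser, the result depends on what is installed: according to the Beautiful Soup documentation, lxml is preferred, then html5lib, then html.parser. A small sketch for checking which one was picked on a given machine follows; the helper name chosen_parser and the sample markup are made up for illustration.

```python
import warnings

from bs4 import BeautifulSoup


def chosen_parser(markup="<p>hello</p>"):
    """Return the name of the parser BeautifulSoup picks when none is given."""
    with warnings.catch_warnings():
        # Recent bs4 versions warn when no parser is named; hide that here.
        warnings.simplefilter("ignore")
        soup = BeautifulSoup(markup)
    return soup.builder.NAME


if __name__ == '__main__':
    print(chosen_parser())
```

If this prints html.parser, the code above uses the same parser as the failing version in the question and will probably behave the same way.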

2014-04-22 10:38:40 Wolph

Works fine! Thanks for your help. I would never have thought of that. – jason

I don't understand 'html.parser'. Sometimes it is the only parser that works, and sometimes it does not work at all. – jason

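'html.parser' is the parser that ships with the Python distribution, so it is always available. But it is not the best one. 'lxml' is much faster and very effective, but it is a separate dependency, which means you need to install it yourself. You can find the list of parsers here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser – Wolph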
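
To see how the parsers can disagree, here is a rough sketch that feeds the same sloppy markup to each installed parser and counts the rows. The markup is invented (unclosed 'tr' and 'td' tags), since the actual cause on the Air China page is not known; rows_per_parser is a made-up helper name.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical, deliberately sloppy markup: <tr> and <td> are never closed.
BROKEN_HTML = "<table><tr><td>Jun 2012<tr><td>1,234<tr><td>5,678</table>"


def rows_per_parser(html, parsers=('html.parser', 'lxml', 'html5lib')):
    """Map each parser name to (direct rows, all rows), or None if missing."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # This parser is not installed on the current machine.
            counts[name] = None
            continue
        table = soup.find('table')
        direct = len(table.find_all('tr', recursive=False))
        total = len(table.find_all('tr'))
        counts[name] = (direct, total)
    return counts


if __name__ == '__main__':
    for name, count in rows_per_parser(BROKEN_HTML).items():
        print(name, count)
```

With 'html.parser' the unclosed rows end up nested inside one another, so a search with 'recursive=False' sees only the first row. That may be what happened in the question, though this is a guess.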
