2015-06-02 42 views
0

BeautifulSoup: trouble getting the correct table

I'm using BeautifulSoup4 to scrape a page, and the function below is giving me two problems:

import urllib.request

from bs4 import BeautifulSoup

def getTeamRoster(teamURL):
    html = urllib.request.urlopen(teamURL).read()
    soup = BeautifulSoup(html)
    teamPlayers = []
    # second table
    corebody = soup.find(id="corebody")
    teamTable = corebody.table.next_sibling.next_sibling.next_sibling.next_sibling
    print(teamTable)
    tableBody = teamTable.find('tbody')
    print(tableBody)
    tableRows = tableBody.findAll('tr')

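1) When I call .next_sibling only 4 times (as above), I seem to get the correct table. However, the table tag I'm trying to access is the 6th table inside the #corebody element. When I call .next_sibling 5 times, I get -1 from BeautifulSoup, which suggests the table I asked for doesn't exist? I thought you would normally get None in that case. Any idea why calling .next_sibling 5 times doesn't work as expected?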

The URL is http://modules.ussquash.com/ssm/pages/leagues/Team_Information.asp?id=11325

2) tableBody = teamTable.find('tbody') is giving me some trouble. When I print tableBody I get None, and I don't know why (the table I'm accessing definitely has a tbody tag).

Any ideas?

Thanks for the help, bclayman
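
One likely explanation for the -1 in problem 1, sketched on a small made-up snippet rather than the real page: .next_sibling steps over every node, including the whitespace text between tags, so a given number of hops can land on a string instead of a table. A plain string search for 'tbody' in that whitespace returns -1, which would match what was printed. The markup and the names SAMPLE_HTML and nth_table are assumptions for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Made-up markup standing in for the real page (an assumption for illustration).
SAMPLE_HTML = """<div id="corebody">
<table id="t1"><tr><td>one</td></tr></table>
<table id="t2"><tr><td>two</td></tr></table>
<table id="t3"><tr><td>three</td></tr></table>
</div>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
corebody = soup.find(id="corebody")

# One hop from the first table lands on the newline between the tags.
first_hop = corebody.table.next_sibling
hop_is_text = isinstance(first_hop, NavigableString)

# Searching that text as a plain string for 'tbody' gives -1 (not found).
text_search = str(first_hop).find("tbody")

# find_next_sibling skips text nodes and returns the next matching tag.
second_table = corebody.table.find_next_sibling("table")


def nth_table(container, n):
    """Return the n-th (1-based) direct child <table> of container, or None."""
    tables = container.find_all("table", recursive=False)
    return tables[n - 1] if 0 < n <= len(tables) else None


third_table = nth_table(corebody, 3)
print(hop_is_text, text_search, second_table["id"], third_table["id"])
```

Selecting tables by tag, either with find_next_sibling or by indexing the result of find_all, avoids counting whitespace nodes by hand.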


'tbody' may be generated by the browser. Try saving the actual HTML file instead of inspecting it in the browser. – tdihp
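
To illustrate the comment: browsers add tbody to the DOM they display, but the raw HTML a server sends often has none, and Python's built-in html.parser does not insert it. Below is a minimal sketch on made-up markup (the snippet and the helper name table_rows are assumptions, not taken from the real page) that falls back to the table itself when tbody is missing.

```python
from bs4 import BeautifulSoup

# Made-up raw HTML without <tbody>, as a server might send it (an assumption).
RAW_HTML = """<table id="roster">
<tr><th>Players</th><th>City</th></tr>
<tr><td>Browne,Noah</td><td>Taunton</td></tr>
</table>"""


def table_rows(table):
    """Return the <tr> rows from <tbody> if present, else from the table itself."""
    body = table.find("tbody")
    # With no <tbody> in the parsed tree, search the table directly.
    container = body if body is not None else table
    return container.find_all("tr")


soup = BeautifulSoup(RAW_HTML, "html.parser")
table = soup.find("table", id="roster")
has_tbody = table.find("tbody") is not None
rows = table_rows(table)
cells = [td.get_text(strip=True) for td in rows[1].find_all("td")]
print(has_tbody, len(rows), cells)
```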

Answers

2

I was able to get the players table using pandas.read_html:

import requests 
import pandas as pd 

url = 'http://modules.ussquash.com/ssm/pages/leagues/Team_Information.asp?id=11325' 
tables = pd.read_html(requests.get(url).content) 
tables[4] 
          \n\t\t\t\tPlayers\n\t\t\t   City Gender SinglesRating TeamPosition Expiration Win/Loss P Registered Code Ref. Exam 
0           Browne,Noah  Taunton  M   5.56   1 02/29/2016 14/4 - 08/28/14 -  NaN 
1          Ellis,Thornton   rye  M   4.27   10 02/29/2016 0/9 - 08/28/14 -  pass 
2           Line,James Glastonbury  M   4.25   10 02/29/2016 2/7 - 08/28/14 -  NaN 
3         Desantis,Scott J.  Sudbury  M   5.08   2 02/29/2016 9/10 - 08/28/14 -  pass 
4         Bahadori,Cameron Great Falls  M   4.97   3 01/12/2016 3/10 - 11/05/14 -  pass 
5          Groot,Michael  Victoria  M   4.76   4 02/29/2016 5/11 - 08/28/14 -  NaN 
6          Ehsani,Darian  Greenwich  M   4.76   5 02/29/2016 6/13 - 08/28/14 -  pass 
7           Kardon,Max   Weston  M   4.83   6 02/29/2016 5/14 - 08/28/14 -  pass 
8           Van,Jeremy   NaN  M   4.66   7 02/29/2016 5/13 - 08/28/14 -  NaN 
9        Southmayd,Alexander T.   Boston  M   4.91   8 02/29/2016 13/6 - 08/28/14 -  pass 
10         Cacouris,Stephen A   Alpine  M   4.68   9 02/29/2016 9/10 - 08/28/14 -  pass 
11         Groot,Christopher  Edmonton  M   4.62   - 02/29/2016 0/2 - 08/28/14 -  NaN 
12        Mack,Peter D. (sub)  N. Eastham  M   3.94   - 02/29/2016 0/1 - 11/23/14 -  NaN 
13        Shrager,Nathaniel O.  Stanford  M   0.00   - 02/29/2016 0/0 - 08/28/14 -  NaN 
14        Woolverton,Peter C. Chestnut Hill  M   4.06   - 02/29/2016 1/0 - 08/28/14 -  NaN 
15 Total Players: 15 Average singles rating: 4.36...   NaN NaN   NaN   NaN   NaN  NaN NaN  NaN NaN  NaN 
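
Hard-coding tables[4] depends on the page layout staying the same. As a hedged alternative, read_html accepts a match argument that keeps only the tables containing the given text. The sketch below runs on made-up inline HTML (an assumption, so that it works offline) and needs an HTML parser such as lxml installed.

```python
import io

import pandas as pd

# Made-up page with two tables; only the second one mentions "Players".
SAMPLE_HTML = """
<table>
  <tr><th>Match</th><th>Result</th></tr>
  <tr><td>Week 1</td><td>Won 7-2</td></tr>
</table>
<table>
  <tr><th>Players</th><th>City</th></tr>
  <tr><td>Browne,Noah</td><td>Taunton</td></tr>
  <tr><td>Line,James</td><td>Glastonbury</td></tr>
</table>
"""

# match keeps only the tables whose text matches the given string or regex.
tables = pd.read_html(io.StringIO(SAMPLE_HTML), match="Players")
roster = tables[0]
print(roster)
```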

'pandas.read_html' is always a breeze :-D – tdihp


tbh I had never tried it before answering this question! – maxymoo

1

Use soup.select

One-liner:

[i.get_text() for i in soup.select('#corebody table tr td') if 'Won' in i.get_text() or 'Lost' in i.get_text()]

Long version:

for i in soup.select('#corebody table tr td'):
    if 'Won' in i.get_text() or 'Lost' in i.get_text():
        print(i.get_text())

[u'Won 7-2', u'Won 5-4', u'Lost 1-8', u'Lost 1-8', u'Won 8-1', u'Lost 3-6', u'Won 7-2', u'Lost 0-9', u'Lost 1-8', u'Won 5-4', u'Lost 1-8', u'Lost 2-7', u'Won 8-1', u'Lost 3-6', u'Lost 4-5', u'Lost 4-5', u'Lost 1-8', u'Lost 4-5', u'Won 6-3']
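
For anyone on Python 3, here is a self-contained sketch of the same select approach on made-up markup (the snippet is an assumption; the real page is not reproduced here). It uses startswith rather than a substring test so that a cell that merely contains "Won" somewhere in its text is not picked up; that is a choice of this sketch, not part of the answer above.

```python
from bs4 import BeautifulSoup

# Made-up results markup (an assumption for illustration).
SAMPLE_HTML = """<div id="corebody">
<table>
<tr><td>Week 1</td><td>Won 7-2</td></tr>
<tr><td>Week 2</td><td>Lost 1-8</td></tr>
</table>
</div>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Collect the stripped text of every cell under #corebody.
texts = [td.get_text(strip=True) for td in soup.select("#corebody table tr td")]

# Keep only the cells that start with "Won" or "Lost".
results = [t for t in texts if t.startswith(("Won", "Lost"))]
print(results)
```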