2016-07-28

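I want to scrape data such as the position and the player's name from this web page. My code is below. I can't pull only the visible text out of the table using BS4.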

import requests
from bs4 import BeautifulSoup

# URL of the Wikipedia page we are going to scrape
wikiURL = "https://en.wikipedia.org/wiki/2012_NFL_Draft"

#create array to store player info in 
teams_players = [] 

# request and parse wikiURL 
r = requests.get(wikiURL) 
soup = BeautifulSoup(r.content, "html.parser") 

#find table in wikipedia 
playerData = soup.find('table', {"class": "wikitable sortable"}) 

for row in playerData.find_all('tr')[1:]: 
    cols = row.find_all(['td', 'th']) 
    if len(cols) < 6:
        continue
    teams_players.append((cols[5].text.strip(), cols[4].text.strip())) 

for team, player in teams_players: 
    print('{:35} {}'.format(team, player)) 

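The problem is that the name field contains a span tag holding "sort key" text in addition to the text that is actually displayed, so the name ends up doubled in the output, and the symbol is printed as well.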

QB         Luck, AndrewAndrew Luck † 
QB         Griffin III, RobertRobert Griffin III † 
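
The doubling can be reproduced without a network request. The sketch below uses a hand-written, simplified row; the markup in `HTML` is an assumption about how the name cell is built (a hidden sort-key span followed by the displayed name), not a copy of the live page. It only shows that `.text` joins every string inside the cell, whether or not a browser would display it.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for one row of the draft table.
HTML = """
<table class="wikitable sortable">
<tr><th>Player</th><th>Pos.</th></tr>
<tr>
<td><span class="sortkey" style="display:none">Luck, Andrew</span><span class="vcard"><span class="fn"><a href="/wiki/Andrew_Luck">Andrew Luck</a></span></span> †</td>
<td><a href="/wiki/Quarterback">QB</a></td>
</tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
name_cell = soup.find("table").find_all("tr")[1].find("td")

# .text concatenates every descendant string, including the hidden sort key.
doubled = name_cell.text.strip()

# stripped_strings shows the separate pieces that were glued together.
pieces = list(name_cell.stripped_strings)

print(doubled)
print(pieces)
```

The second print lists the sort key, the displayed name, and the symbol as three separate strings, which is why the combined text looks doubled.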

I tried searching for {"class": "fn"}, but that just returned an empty list.

How can I pull only the displayed text and ignore the symbol?

Answer


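If you only want the name and the position, you can simplify the code: find each span with the class fn inside the td cells of the table and take its text, then find the next td and extract the text from that td's anchor.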

from bs4 import BeautifulSoup 
import requests 
soup = BeautifulSoup(requests.get("https://en.wikipedia.org/wiki/2012_NFL_Draft").content,"lxml") 
table = soup.select_one("table.wikitable.sortable") 

for name_tag in table.select("tr + tr td span.fn"): 
    print(name_tag.text, name_tag.find_next("td").a.text) 
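
To try the selector without hitting Wikipedia, it can be run against a small inline table. This is a sketch under assumptions: the markup in `SAMPLE_HTML` and the helper name `names_and_positions` are made up for illustration, and `html.parser` is used so that lxml is not required. It reads the whole position cell with `get_text` instead of `.a.text`, so a position cell without a link would not raise `AttributeError`.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row sample that imitates the structure of the draft table.
SAMPLE_HTML = """
<table class="wikitable sortable">
<tr><th>Player</th><th>Pos.</th></tr>
<tr>
<td><span class="sortkey">Luck, Andrew</span><span class="vcard"><span class="fn"><a href="#">Andrew Luck</a></span></span> †</td>
<td><a href="#">QB</a></td>
</tr>
<tr>
<td><span class="sortkey">Griffin III, Robert</span><span class="vcard"><span class="fn"><a href="#">Robert Griffin III</a></span></span> †</td>
<td><a href="#">QB</a></td>
</tr>
</table>
"""


def names_and_positions(html):
    """Return (name, position) tuples for every row that has a span.fn."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one("table.wikitable.sortable")
    pairs = []
    # "tr + tr" skips the header row: it matches only rows preceded by a row.
    for name_tag in table.select("tr + tr td span.fn"):
        position_cell = name_tag.find_next("td")
        if position_cell is None:
            continue  # the name was in the last cell of the table
        pairs.append((name_tag.get_text(strip=True),
                      position_cell.get_text(strip=True)))
    return pairs


pairs = names_and_positions(SAMPLE_HTML)
for name, position in pairs:
    print(name, position)
```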

If we run the code, you can see that we get all the data we want, without any symbols:

In [1]: from bs4 import BeautifulSoup
   ...: import requests
   ...: soup = BeautifulSoup(requests.get("https://en.wikipedia.org/wiki/2012_NFL_Draft").content, "lxml")
   ...: table = soup.select_one("table.wikitable.sortable")
   ...: for name_tag in table.select("tr + tr td span.fn"):
   ...:     print(name_tag.text, name_tag.find_next("td").a.text)
   ...:

Andrew Luck QB 
Robert Griffin III QB 
Trent Richardson RB 
Matt Kalil OT 
Justin Blackmon WR 
Morris Claiborne CB 
Mark Barron S 
Ryan Tannehill QB 
Luke Kuechly LB 
Stephon Gilmore CB 
Dontari Poe NT 
Fletcher Cox DT 
Michael Floyd WR 
Michael Brockers DT 
Bruce Irvin DE 
Quinton Coples DE 
Dre Kirkpatrick CB 
Melvin Ingram LB 
Shea McClellin DE 
Kendall Wright WR 
Chandler Jones DE 
Brandon Weeden QB 
Riley Reiff OT 
David DeCastro G 
Dont'a Hightower LB 
Whitney Mercilus DE 
Kevin Zeitler G 
Nick Perry LB 
Harrison Smith S 
A. J. Jenkins WR 
Doug Martin RB 
David Wilson RB 
Brian Quick WR 
Coby Fleener TE 
Courtney Upshaw LB 
Derek Wolfe DT 
Mitchell Schwartz OT 
Andre Branch DE 
Janoris Jenkins CB 
Amini Silatolu G 
Cordy Glenn OT 
Jonathan Martin OT 
Stephen Hill WR 
Jeff Allen G 
Alshon Jeffery WR 
Mychal Kendricks LB 
Bobby Wagner LB 
Tavon Wilson S 
Kendall Reyes DT 
Isaiah Pead RB 
Jerel Worthy DT 
Zach Brown LB 
Devon Still DT 
Ryan Broyles WR 
Peter Konz C 
Mike Adams OT 
Brock Osweiler QB 
Lavonte David LB 
Vinny Curry DE 
Kelechi Osemele G 
LaMichael James RB 
Casey Hayward CB 
Rueben Randle WR 
Dwayne Allen TE 
Trumaine Johnson CB 
Josh Robinson CB 
Ronnie Hillman RB 
DeVier Posey WR 
T. J. Graham WR 
Bryan Anger P 
Josh LeRibeus G 
Olivier Vernon DE 
Brandon Taylor S 
Donald Stephenson OT 
Russell Wilson QB 
Brandon Brooks G 
Demario Davis LB 
Michael Egnew TE 
Brandon Hardin S 
Jamell Fleming CB 
Tyrone Crawford DE 
Mike Martin DT 
Mohamed Sanu WR 
Bernard Pierce RB 
Dwight Bentley CB 
Sean Spence LB 
John Hughes DT 
Nick Foles QB 
Akiem Hicks DT 
Jake Bequette DE 
Lamar Holmes OT 
T. Y. Hilton WR 
Brandon Thompson DT 
Jayron Hosley CB 
Tony Bergstrom G 
Chris Givens WR 
Lamar Miller RB 
Gino Gradkowski G 
Ben Jones C 
Travis Benjamin WR 
Omar Bolden CB 
Kirk Cousins QB 
Frank Alexander DE 
Joe Adams WR 
Nigel Bradham LB 
Robert Turbin RB 
Devon Wylie WR 
Philip Blake C 
Alameda Ta'amu DT 
Ladarius Green TE 
Evan Rodriguez TE 
Bobby Massie OT 
Kyle Wilber LB 
Jaye Howard DT 
Coty Sensabaugh CB 
Orson Charles TE 
Joe Looney G 
Jarius Wright WR 
Keenan Robinson LB 
James-Michael Johnson LB 
Keshawn Martin WR 
Nick Toon WR 
Brandon Boykin CB 
Ron Brooks CB 
Ronnell Lewis LB 
Jared Crick DE 
Adrien Robinson TE 
Rhett Ellison FB 
Miles Burris LB 
Christian Thompson S 
Brandon Mosley OT 
Mike Daniels DT 
Jerron McMillian S 
Greg Childs WR 
Matt Johnson S 
Josh Chapman DT 
Malik Jackson DE 
Tahir Whitehead LB 
Robert Blanton S 
Najee Goode LB 
Adam Gettis G 
Brandon Marshall LB 
Josh Norman CB 
Zebrie Sanders OT 
Taylor Thompson DE 
DeQuan Menzie CB 
Tank Carder LB 
Chris Greenwood CB 
Johnnie Troutman G 
Rokevious Watkins G 
Senio Kelemete G 
Danny Coale WR 
Dennis Kelly OT 
Korey Toomer LB 
Josh Kaddu LB 
Shaun Prater CB 
Bradie Ewing FB 
Jack Crawford DE 
Chris Rainey RB 
Ryan Miller G 
Randy Bullock K 
Corey White S 
Terrell Manning LB 
Jonathan Massaquoi DE 
Darius Fleming LB 
Marvin Jones WR 
George Iloka S 
Juron Criner WR 
Asa Jackson CB 
Vick Ballard RB 
Greg Zuerlein K 
Jeremy Lane CB 
Alfred Morris RB 
Keith Tandy CB 
Blair Walsh K 
Mike Harris CB 
Justin Bethel S 
Mark Asper G 
Andrew Tiller G 
Trenton Robinson S 
Winston Guy S 
Cyrus Gray RB 
B.J. Cunningham WR 
Isaiah Frey CB 
Ryan Lindley QB 
James Hanna TE 
Josh Bush S 
Danny Trevathan LB 
Christo Bilukidi DT 
Markelle Martin S 
Dan Herron RB 
Charles Mitchell S 
Tom Compton OT 
Marvin McNutt WR 
Nick Mondek OT 
Jonte Green CB 
Nate Ebner CB 
Tommy Streeter WR 
Jason Slowey OT 
Brandon Washington G 
Matt McCants OT 
Terrance Ganaway RB 
Robert Griffin G 
Emmanuel Acho LB 
Billy Winn DT 
LaVon Brazill WR 
Brad Nortman P 
Justin Anderson G 
Audie Cole LB 
Scott Solomon DE 
Michael Smith RB 
Richard Crawford CB 
Kheeston Randall DT 
D. J. Campbell S 
Jordan Bernstine CB 
Jerome Long DT 
Trevor Guyton DE 
Greg McCoy CB 
Nate Potter OT 
Caleb McSurdy ILB 
Travis Lewis OLB 
Alfonzo Dennard CB 
J. R. Sweezy G 
David Molk C 
Rishard Matthews WR 
Jeris Pendleton DT 
Bryce Brown RB 
Nathan Stupar OLB 
Toney Clemons WR 
Greg Scruggs DE 
Drake Dunsmore TE 
Marcel Jones OT 
Jeremy Ebert WR 
DeAngelo Tyson DT 
Cam Johnson DE 
Junior Hemingway WR 
Markus Kuhn DT 
David Paulson TE 
Andrew Datko OT 
Antonio Allen S 
B. J. Coleman QB 
Jordan White WR 
Trevin Wade CB 
Terrence Frederick CB 
Brad Smelley TE 
Kelvin Beachum G 
Travian Robertson DT 
Edwin Baker RB 
John Potter K 
Daryl Richardson RB 
Chandler Harnish QB 
When I try to run this code I get a syntax error: `print name_tag.text, name_tag.find_next("td").a.text ^ SyntaxError: invalid syntax`. I'm running Python 3.5.2. – Michael

@Michael, use parentheses, as in the sample code that was run above. –
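
As a small illustration of the point about parentheses: in Python 3, `print` is a function, so the Python 2 statement form raises `SyntaxError`. The values below are placeholders rather than scraped data.

```python
name, position = "Andrew Luck", "QB"

# Python 3: print is a function, so the arguments go inside parentheses.
print(name, position)

# Building the string first produces the same line and also behaves
# identically on Python 2, where print(a, b) would print a tuple.
line = "{} {}".format(name, position)
print(line)
```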

That did it. If I want to pull more columns from the data, can that be done by finding the next or the previous "td"? If not, what is the best way to do it? – Michael
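
One possible approach to the follow-up question, offered as a sketch rather than the answerer's method: instead of chaining `find_next` and `find_previous` calls, climb from the name span to its row with `find_parent("tr")` and read all cells of that row at once. The sample markup, the header labels, and the function name `rows_as_dicts` are assumptions made for illustration; the real table may have different columns.

```python
from bs4 import BeautifulSoup

# Hypothetical sample; the column set is assumed, not taken from the live page.
SAMPLE_HTML = """
<table class="wikitable sortable">
<tr><th>Rnd.</th><th>Pick #</th><th>NFL team</th><th>Player</th><th>Pos.</th><th>College</th></tr>
<tr>
<th>1</th><td>1</td><td><a href="#">Indianapolis Colts</a></td>
<td><span class="sortkey">Luck, Andrew</span><span class="vcard"><span class="fn"><a href="#">Andrew Luck</a></span></span> †</td>
<td><a href="#">QB</a></td><td><a href="#">Stanford</a></td>
</tr>
</table>
"""


def rows_as_dicts(html):
    """Map each data row to {header: cell text}, using only the visible name."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one("table.wikitable.sortable")
    headers = [th.get_text(strip=True) for th in table.find("tr").find_all("th")]
    records = []
    for name_tag in table.select("tr + tr td span.fn"):
        row = name_tag.find_parent("tr")        # the row that holds this player
        name_cell = name_tag.find_parent("td")  # the cell with the sort key
        record = {}
        for header, cell in zip(headers, row.find_all(["td", "th"])):
            if cell is name_cell:
                # Take the displayed name only, skipping sort key and symbol.
                record[header] = name_tag.get_text(strip=True)
            else:
                record[header] = cell.get_text(strip=True)
        records.append(record)
    return records


records = rows_as_dicts(SAMPLE_HTML)
print(records[0])
```

Keying each value by its header text means any column can be picked by name afterwards, without counting how many `td` elements lie between two cells.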