2017-07-07 49 views

Strange XPath behavior when parsing HTML in Python

I have an HTML page, which I quote at the end. I prepare the page like this:

from lxml import html

page = page.content
res = html.fromstring(page)

Then I run an XPath query on it:

list_of_names = res.xpath(u'//li/a/text()')

but it does not return the names in the list.

When I try:

list_of_names = res.xpath(u"//div[@id='rosterlists']/div/li/a/text()")

the same XPath in the browser returns what I want: the list of names. In Python, however, I get ['A', 'B', 'C', 'D', 'E', 'f'].

What is going wrong? Is the HTML broken? If so, how can I fix it?

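On the same machine, XPath works fine on all the other pages (thousands of them).

Please do not suggest Beautiful Soup. That module is not appropriate for this project under any circumstances.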
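One way to check whether the markup is being read differently from the browser is to look at where lxml puts the text around each link. The sketch below is only an illustration: `SAMPLE` and `describe_links` are hypothetical names, and the two list items are copied by hand from the anchor patterns in the page quoted below (one written as `<a href="..." />Name</a>`, one as a normal `<a href="...">A</a>`). Depending on the parser version, the name may show up as the link's `text` or as its `tail` (the text that follows an empty element).

```python
from lxml import html

# Hypothetical two-item sample that copies the anchor patterns used in the roster markup.
SAMPLE = (
    '<div>'
    '<li><a href="/2shy/" />2Shy</a></li>'
    '<li class="initial"><a href="/roster/a/">A</a></li>'
    '</div>'
)


def describe_links(markup):
    """Return (href, text, tail) for the first <a> child of every <li> that lxml finds."""
    root = html.fromstring(markup)
    rows = []
    for li in root.xpath("//li"):
        a = li.find("a")
        if a is not None:
            rows.append((a.get("href"), a.text, a.tail))
    return rows


if __name__ == "__main__":
    for href, text, tail in describe_links(SAMPLE):
        print(href, "text=%r" % text, "tail=%r" % tail)
```

If a name appears in `tail` rather than `text`, an expression ending in `/a/text()` would not see it, which would be consistent with getting only the initials back.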

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> 
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> 
<head profile="http://www.w3.org/2005/10/profile"> 
<link rel="icon" type="image/ico" href="/favicon.ico" /> 
<title>Roster | Primary Talent International</title> 
<meta name="description" content="Roster - Primary Talent International" /> 
<meta name="keywords" content="" /> 
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" /> 
<meta http-equiv="content-language" content="en" /> 
<meta name="language" content="en" /> 
<meta name="robots" content="index, follow" /> 
<meta name="author" content="Inogen Web Design Nottingham" /> 

<link href="/css/style.css" rel="stylesheet" type="text/css" media="screen"/> 
<link href="/css/scrollerstyle.css" rel="stylesheet" type="text/css" media="screen"/> 

<link href="/css/roster.css" rel="stylesheet" type="text/css" media="screen"/> 

<script type="text/javascript" src="/scripts/ddwindowlinks.js"></script> 
<script type="text/javascript" src="/scripts/scroller-settings.js"></script> 


<script type="text/javascript"> 

//<![CDATA[ 

var sglm=new Array(); 

sglm[0]='<a href="/news/jul2017#wolf-alice-visions-of-a-life-album">Wolf Alice: &#039;Visions Of A Life&#039; Album</a>'; 
sglm[1]='<a href="/news/jul2017#zombie-nation-new-video-for-knockout">Zombie Nation: New Video For &#039;Knockout&#039;</a>'; 
sglm[2]='<a href="/news/jul2017#sextile-albeit-living-review">Sextile: &#039;Albeit Living&#039; Review</a>'; 
sglm[3]='<a href="/news/jul2017#noisia-at-glastonbury-festival-2017">Noisia: At Glastonbury Festival 2017</a>'; 
sglm[4]='<a href="/news/jul2017#joe-ford-new-track-make-a-threat-ft-maluk">Joe Ford: New Track &#039;Make A Threat&#039; Ft. Maluk</a>'; 
sglm[5]='<a href="/news/jul2017#moscoman-obscure-cuts-on-xlr8r">Moscoman: &#039;Obscure Cuts&#039; On XLR8R</a>'; 
sglm[6]='<a href="/news/jul2017#james-welsh-new-thread-north-ep">James Welsh: New &#039;Thread/North&#039; EP</a>'; 
sglm[7]='<a href="/news/jul2017#steve-lamacq-going-deaf-for-a-living-tour">Steve Lamacq: &#039;Going Deaf For A Living&#039; Tour</a>'; 
sglm[8]='<a href="/news/jul2017#moscoman-remixes-cristobal-and-the-sea">Moscoman: Remixes Cristobal &amp; The Sea</a>'; 
sglm[9]='<a href="/news/jul2017#nadine-khouri-at-the-lexington-london">Nadine Khouri: At The Lexington, London</a>'; 
sglm[10]='<a href="/news/jul2017#faze-miyake-new-infamous-ep">Faze Miyake: New &#039;Infamous&#039; EP</a>'; 
sglm[11]='<a href="/news/jul2017#noisia-beyond-the-outer-edges-featured-in-skiddle">Noisia: &#039;Beyond The Outer Edges&#039; Featured In Skiddle</a>'; 

//]]> 

</script> 


<script type="text/javascript"> 

    var _gaq = _gaq || []; 
    _gaq.push(['_setAccount', 'UA-17266356-2']); 
    _gaq.push(['_trackPageview']); 

    (function() { 
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true; 
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js'; 
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s); 
    })(); 

</script> 



</head> 


<body onload="startbcscroll();"> 
<div id="wrapper"> 

<div id="masthead"> 

<a href="/"><img src="/images/primary-talent-international.png" alt="Primary Talent International" width="330" height="40" id="logo" /></a> 



<form action="/search/" method="get"> 
<fieldset id="search"> 
<input type="text" name="find" /> 
<input type="submit" value="search" class="submit" /> 
</fieldset> 
</form> 


<div id="masthead-images"> 
<a href="/counterfeit/"><img src="/artists/counterfeit/images/127x127/1.jpg" width="127" height="127" alt="Counterfeit" title="Counterfeit" /></a> 
<a href="/fredo/"><img src="/artists/fredo/images/127x127/1.jpg" width="127" height="127" alt="Fredo" title="Fredo" /></a> 
<a href="/shabazz-palaces/"><img src="/artists/shabazz-palaces/images/127x127/1.jpg" width="127" height="127" alt="Shabazz Palaces" title="Shabazz Palaces" /></a> 
<a href="/tom-demac/"><img src="/artists/tom-demac/images/127x127/1.jpg" width="127" height="127" alt="Tom Demac" title="Tom Demac" /></a> 
<a href="/kwamz-and-flava/"><img src="/artists/kwamz-and-flava/images/127x127/1.jpg" width="127" height="127" alt="Kwamz & Flava" title="Kwamz & Flava" /></a> 
<a href="/jelani-blackman/"><img src="/artists/jelani-blackman/images/127x127/1.jpg" width="127" height="127" alt="Jelani Blackman" title="Jelani Blackman" /></a> 
<a href="/blinkie/"><img src="/artists/blinkie/images/127x127/1.jpg" width="127" height="127" alt="Blinkie" title="Blinkie" /></a> 
</div> 

<div id="ticker"> 
<script type="text/javascript" src="/scripts/scroller.js"></script> 
</div> 

<div id="menu"> 
<div id="topmenu"> 

<ul> 

<li class="open"><a href="/roster/">LIVE ROSTER</a></li> 
<li><a href="/dj-roster/">DJ ROSTER</a><li> 

<li><a href="/news/">NEWS</a><li> 
<li><a href="/on-tour/">ON TOUR</a></li> 
<li><a href="/about-us/">ABOUT US</a></li> 
<li><a href="/new-signings/">NEW SIGNINGS</a></li> 

</ul> 

</div></div> 

</div> 





<div id="content"> 

<div id="rosterlists"> 

<div> 

<ul> 
</ul> 
</div> 
<div> 
</ul> 
<li><a href="/sandy-alex-g/" />(Sandy) Alex G</a></li> 
<li><a href="/andyouwillknowusbythetrailofdead/" />...And You Will Know Us By The Trail Of Dead</a></li> 
<li><a href="/2shy/" />2Shy</a></li> 
<li><a href="/808ink/" />808INK</a></li> 
<li>&nbsp;</li> 
<li class="initial"><a href="/roster/a/">A</a></li> 
<li><a href="/a-tribe-called-red/" />A Tribe Called Red</a></li> 
<li><a href="/abattoir-blues/" />Abattoir Blues</a></li> 
<li><a href="/aeroplane/" />Aeroplane</a></li> 
<li><a href="/agar-agar/" />Agar Agar</a></li> 
<li><a href="/airways/" />Airways</a></li> 
<li><a href="/alaskalaska/" />ALASKALASKA</a></li> 
<li><a href="/alex-izenberg/" />Alex Izenberg</a></li> 
<li><a href="/all-get-out/" />All Get Out</a></li> 
<li><a href="/all-the-people/" />All The People</a></li> 
<li><a href="/allison-weiss/" />Allison Weiss</a></li> 
<li><a href="/alpines/" />Alpines</a></li> 
<li><a href="/alt-j/" />Alt-J</a></li> 
<li><a href="/alvvays/" />Alvvays</a></li> 
<li><a href="/ama-lou/" />Ama Lou</a></li> 
<li><a href="/amaroun/" />Amaroun</a></li> 
<li><a href="/andrea/" />AndreaLo</a></li> 
<li><a href="/andy-cooper/" />Andy Cooper (Ugly Duckling)</a></li> 
<li><a href="/anna-calvi/" />Anna Calvi</a></li> 
<li><a href="/anteros/" />Anteros</a></li> 
<li><a href="/apes-and-horses/" />Apes & Horses</a></li> 
<li><a href="/ara/" />ArA Harmonic</a></li> 
<li><a href="/araabmuzik/" />Araabmuzik</a></li> 
<li><a href="/archive/" />Archive</a></li> 
<li><a href="/aristophanes/" />Aristophanes</a></li> 
<li><a href="/ash-koosha/" />Ash Koosha</a></li> 
<li><a href="/atlas-genius/" />Atlas Genius</a></li> 
<li><a href="/augustines/" />Augustines</a></li> 
</ul> 
</div> 
<div> 
</ul> 
</ul> 
</div> 
<div> 
</ul> 
<li><a href="/avelino/" />Avelino</a></li> 
<li><a href="/awate/" />Awate</a></li> 
<li><a href="/azad/" />Azad</a></li> 
<li><a href="/azusena/" />Azusena</a></li> 
<li>&nbsp;</li> 
<li class="initial"><a href="/roster/b/">B</a></li> 
<li><a href="/baba-naga/" />Baba Naga</a></li> 
<li><a href="/babeheaven/" />Babeheaven</a></li> 
<li><a href="/babyshambles/" />Babyshambles</a></li> 
<li><a href="/bad-gyal/" />Bad Gyal</a></li> 
<li><a href="/bad-kid/" />Bad Kid</a></li> 
<li><a href="/bad-nerves/" />Bad Nerves</a></li> 
<li><a href="/bad-pop/" />Bad Pop</a></li> 
<li><a href="/bad-sounds/" />Bad Sounds</a></li> 
<li><a href="/banners/" />Banners</a></li> 
<li><a href="/basement-jaxx/" />Basement Jaxx</a></li> 
<li><a href="/bash-and-pop/" />Bash & Pop</a></li> 
<li><a href="/bay/" />BAY</a></li> 
<li><a href="/bayside/" />Bayside</a></li> 
<li><a href="/be-charlotte/" />Be Charlotte</a></li> 
<li><a href="/beach-baby/" />Beach Baby</a></li> 
<li><a href="/beach-slang/" />Beach Slang</a></li> 
<li><a href="/beardyman/" />Beardyman</a></li> 
<li><a href="/bellevue-days/" />Bellevue Days</a></li> 
<li><a href="/ben-hobbs/" />Ben Hobbs</a></li> 
<li><a href="/ben-khan/" />Ben Khan</a></li> 
<li><a href="/ben-watt/" />Ben Watt</a></li> 
<li><a href="/benny-mails/" />Benny Mails</a></li> 
<li><a href="/bettens/" />Bettens</a></li> 
<li><a href="/big-ups/" />Big Ups</a></li> 
<li><a href="/bipolar-sunshine/" />Bipolar Sunshine</a></li> 
<li><a href="/birds-of-tokyo/" />Birds Of Tokyo</a></li> 
<li><a href="/blaenavon/" />Blaenavon</a></li> 
</ul> 
</div> 
<div> 
</ul> 
</ul> 
</div> 
<div> 
</ul> 
<li><a href="/bloodhound-gang/" />Bloodhound Gang</a></li> 
<li><a href="/bloxx/" />BLOXX</a></li> 
<li><a href="/blue-daisy/" />Blue Daisy</a></li> 
<li><a href="/bowling-for-soup/" />Bowling For Soup</a></li> 
<li><a href="/boys-noize/" />Boys Noize</a></li> 
<li><a href="/broadway-sounds/" />Broadway Sounds</a></li> 
<li><a href="/brooke-candy/" />Brooke Candy</a></li> 
<li><a href="/bryde/" />Bryde</a></li> 
<li><a href="/buraka-som-sistema/" />Buraka Som Sistema</a></li> 
<li>&nbsp;</li> 
<li class="initial"><a href="/roster/c/">C</a></li> 
<li><a href="/cadet/" />Cadet</a></li> 
<li><a href="/cant-swim/" />Can't Swim</a></li> 
<li><a href="/candy-hearts/" />Candy Hearts</a></li> 
<li><a href="/cardiknox/" />Cardiknox</a></li> 
<li><a href="/carmody/" />Carmody</a></li> 
<li><a href="/catfish-and-the-bottlemen/" />Catfish and the Bottlemen</a></li> 
<li><a href="/cattle-and-cane/" />Cattle &amp; Cane</a></li> 
<li><a href="/central-cee/" />Central Cee</a></li> 
<li><a href="/cerrone/" />Cerrone</a></li> 
<li><a href="/chairlift/" />Chairlift</a></li> 
<li><a href="/champs/" />Champs</a></li> 
<li><a href="/charlotte-oc/" />Charlotte OC</a></li> 
<li><a href="/charly-bliss/" />Charly Bliss</a></li> 
<li><a href="/children-collide/" />Children Collide</a></li> 
<li><a href="/cigarettes-after-sex/" />Cigarettes After Sex</a></li> 
<li><a href="/circawaves/" />Circa Waves</a></li> 
<li><a href="/clairy-browne/" />Clairy Browne</a></li> 
<li><a href="/clean-spill/" />Clean Spill</a></li> 
<li><a href="/coco/" />Coco</a></li> 
<li><a href="/cold-specks/" />Cold Specks</a></li> 
<li><a href="/cole/" />Cole</a></li> 
<li><a href="/connan-mockasin/" />Connan Mockasin</a></li> 
</ul> 
</div> 
<div> 
</ul> 
</ul> 
</div> 
<div> 
</ul> 
<li><a href="/cosmo-pyke/" />Cosmo Pyke</a></li> 
<li><a href="/count-counsellor/" />Count Counsellor</a></li> 
<li><a href="/counterfeit/" />Counterfeit</a></li> 
<li><a href="/crossfaith/" />Crossfaith</a></li> 
<li><a href="/crows/" />Crows</a></li> 
<li><a href="/cuckoolander/" />CuckooLander</a></li> 
<li>&nbsp;</li> 
<li class="initial"><a href="/roster/d/">D</a></li> 
<li><a href="/d-double-e/" />D Double E</a></li> 
<li><a href="/damfunk/" />D&#257;M-FunK</a></li> 
<li><a href="/daft-punk/" />Daft Punk</a></li> 
<li><a href="/daisy-victoria/" />Daisy Victoria</a></li> 
<li><a href="/daniel-og/" />Daniel OG</a></li> 
<li><a href="/darkstar/" />Darkstar</a></li> 
<li><a href="/darlia/" />Darlia</a></li> 
<li><a href="/dave/" />Dave</a></li> 
<li><a href="/day-wave/" />Day Wave</a></li> 
<li><a href="/decade/" />Decade</a></li> 
<li><a href="/delta-rae/" />Delta Rae</a></li> 
<li><a href="/denzel-himself/" />Denzel Himself</a></li> 
<li><a href="/desert-planes/" />Desert Planes</a></li> 
<li><a href="/digable-planets/" />Digable Planets </a></li> 
<li><a href="/digitalism/" />Digitalism</a></li> 
<li><a href="/DIIV/" />DIIV</a></li> 
<li><a href="/dilly-dally/" />Dilly Dally</a></li> 
<li><a href="/diztortion/" />Diztortion</a></li> 
<li><a href="/dizzee-rascal/" />Dizzee Rascal</a></li> 
<li><a href="/dj-cassidy/" />DJ Cassidy</a></li> 
<li><a href="/django-django/" />Django Django</a></li> 
<li><a href="/dmas/" />DMA's</a></li> 
<li><a href="/dominique-young-unique/" />Dominique Young Unique</a></li> 
<li><a href="/dropkick-murphys/" />Dropkick Murphys</a></li> 
<li><a href="/drowners/" />Drowners</a></li> 
</ul> 
</div> 
<div> 
</ul> 
</ul> 
</div> 
<div> 
</ul> 
<li><a href="/dub-pistols/" />Dub Pistols</a></li> 
<li><a href="/dutch-mob/" />Dutch Mob</a></li> 
<li><a href="/zappa-plays-zappa/" />Dweezil Zappa</a></li> 
<li>&nbsp;</li> 
<li class="initial"><a href="/roster/e/">E</a></li> 
<li><a href="/eat-fast/" />Eat Fast</a></li> 
<li><a href="/eera/" />EERA</a></li> 
<li><a href="/emily-capell/" />Emily Capell</a></li> 
<li><a href="/emmy-the-great/" />Emmy The Great</a></li> 
<li><a href="/eprom/" />Eprom</a></li> 
<li><a href="/esther-joy/" />Esther Joy</a></li> 
<li><a href="/etienne-de-crecy-presents-super-discount/" />Etienne de Cr&#233;cy Presents Super Discount 3</a></li> 
<li>&nbsp;</li> 
<li class="initial"><a href="/roster/f/">f</a></li> 
<li><a href="/franskild-live/" />f r a n s k i l d (Live)</a></li> 
<li><a href="/fang-night/" />Fang Night</a></li> 
<li><a href="/fangclub/" />Fangclub</a></li> 
<li><a href="/felix-riebl/" />Felix Riebl</a></li> 
<li><a href="/fine-print/" />Fine Print</a></li> 
<li><a href="/first-hate/" />First Hate</a></li> 
<li><a href="/forever-came-calling/" />Forever Came Calling</a></li> 
<li><a href="/fours/" />FOURS</a></li> 
<li><a href="/foxygen/" />Foxygen</a></li> 
<li><a href="/freak/" />FREAK</a></li> 
<li><a href="/fredo/" />Fredo</a></li> 
<li><a href="/fun-lovin-criminals/" />Fun Lovin' Criminals</a></li> 

</ul> 

</div> 

<div class="clear">&nbsp;</div> 

</div> 





<div id="alphabetmenu"> 
<ul> 


<li class="active"><a href="/roster/">#<a></li> 
<li class="active"><a href="/roster/a/">A</a></li> 
<li class="active"><a href="/roster/b/">B</a></li> 
<li class="active"><a href="/roster/c/">C</a></li> 
<li class="active"><a href="/roster/d/">D</a></li> 
<li class="active"><a href="/roster/e/">E</a></li> 
<li class="active"><a href="/roster/f/">F</a></li> 
<li class="active"><a href="/roster/g/">G</a></li> 
<li><a href="/roster/h/">H</a></li> 
<li><a href="/roster/i/">I</a></li> 
<li><a href="/roster/j/">J</a></li> 
<li><a href="/roster/k/">K</a></li> 
<li><a href="/roster/l/">L</a></li> 
<li><a href="/roster/m/">M</a></li> 
<li><a href="/roster/n/">N</a></li> 
<li><a href="/roster/o/">O</a></li> 
<li><a href="/roster/p/">P</a></li> 
<li><a href="/roster/q/">Q</a></li> 
<li><a href="/roster/r/">R</a></li> 
<li><a href="/roster/s/">S</a></li> 
<li><a href="/roster/t/">T</a></li> 
<li><a href="/roster/u/">U</a></li> 
<li><a href="/roster/v/">V</a></li> 
<li><a href="/roster/w/">W</a></li> 
<li><a href="/roster/x/">X</a></li> 
<li><a href="/roster/y/">Y</a></li> 
<li><a href="/roster/z/">Z</a></li> 

</ul> 

</div> 
<div class="clear">&nbsp;</div> 

<div id="agentlist"> 

<h1>Contact</h1> 

<ul> 
<li><a href="http://decked-out.co.uk/alessia-avallone/">Alessia Avallone</a></li> 
<li><a href="/andy-duggan/">Andy Duggan</a></li> 
<li><a href="/andy-woolliscroft/">Andy Woolliscroft</a></li> 
<li><a href="/ben-winchester/">Ben Winchester</a></li> 
<li><a href="/charlie-renton/">Charlie Renton</a></li> 
<li><a href="/chris-smyth/">Chris Smyth</a></li> 
<li><a href="/cils-fyne-williams/">Cils Fyne-Williams</a></li> 
<li><a href="/claire-reilly/">Claire Reilly</a></li> 
<li><a href="/craig-dsouza/">Craig D'Souza</a></li> 
<li><a href="/dave-chumbley/">Dave Chumbley</a></li> 
<li><a href="/ed-sellers/">Ed Sellers</a></li> 
<li><a href="/eileen-mulligan/">Eileen Mulligan</a></li> 
<li><a href="/ellen-trickey/">Ellen Trickey</a></li> 
<li><a href="http://decked-out.co.uk/faye-adams/">Faye Adams</a></li> 
<li><a href="/francesco-caccamo/">Francesco Caccamo</a></li> 
<li><a href="/jack-herron/">Jack Herron</a></li> 
<li><a href="/kata-farkas/">Kata Farkas</a></li> 
<li><a href="http://decked-out.co.uk/laetitia-descouens/">Laetitia Descouens</a></li> 
<li><a href="http://decked-out.co.uk/lucinda-runham/">Lucinda Runham</a></li> 
<li><a href="/martin-hopewell/">Martin Hopewell</a></li> 
<li><a href="/martin-mackay/">Martin Mackay</a></li> 
<li><a href="http://decked-out.co.uk/martje-kremers/">Martje Kremers</a></li> 
<li><a href="/matt-bates/">Matt Bates</a></li> 
<li><a href="/matt-pickering-copley/">Matt Pickering-Copley</a></li> 
<li><a href="/moshope-osinibe/">Moshope Osinibi </a></li> 
<li><a href="/nick-holroyd/">Nick Holroyd</a></li> 
<li><a href="/nick-reddick/">Nick Reddick</a></li> 
<li><a href="/paul-mcqueen/">Paul McQueen</a></li> 
<li><a href="/peter-elliott/">Peter Elliott</a></li> 
<li><a href="/sally-gavaghan/">Sally Gavaghan</a></li> 
<li><a href="/scarlet-millar/">Scarlet Millar</a></li> 
<li><a href="/serena-parsons/">Serena Parsons</a></li> 
<li><a href="/stacey-owen/">Stacey Owen</a></li> 
<li><a href="/steve-backman/">Steve Backman</a></li> 
<li><a href="/tabbie-burleton/">Tabbie Burleton</a></li> 
<li><a href="/tom-permaul-baker/">Tom Permaul-Baker</a></li> 
<li><a href="/tracey-roper/">Tracey Roper</a></li> 
<li><a href="/wesley-doogan/">Wesley Doogan</a></li> 
<li><a href="/will-marshall/">Will Marshall</a></li> 
</ul> 

</div> 



</div> 



<div class="clear">&nbsp;</div> 


<div id="footer"> 
<ul> 
<li>&copy; 2017 Primary Talent International</li><li>|<a href="/tncs-of-use/">Terms &amp; Conditions of Use</a></li><li>| <a href="/privacy/">Privacy Policy</a></li><li>|<a href="/terms-of-business/">Terms of Business</a></li><li>|<a href="https://primarytalent.com">Contact Blog</a></li> 
</ul> 
</div> 


</div> 

</body> 
</html> 
The URL is needed.

@SimakisPanagiotis http://primarytalent.com/roster/

Answer


This code works. The text_content() method gives you the clean text of an element that contains other elements.

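from lxml import html
import requests

# Fetch the roster page and parse it with lxml.
req = requests.get('http://primarytalent.com/roster/')
tree = html.fromstring(req.content)
# text_content() returns all the text inside each <li>, including text outside the <a>.
list_of_names = [li.text_content() for li in tree.xpath("//*[@id='rosterlists']/div/li")]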
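The same idea can be tried offline on a small excerpt. This is a minimal sketch, not part of the original answer: `SNIPPET` and `roster_names` are hypothetical names, the excerpt is hand-copied from the page in the question rather than fetched, and the filtering of the letter headings (`class="initial"`) and the `&nbsp;` spacer items is an added assumption about what counts as a name.

```python
from lxml import html

# Hypothetical excerpt mirroring the roster markup; not fetched from the site.
SNIPPET = """
<div id="rosterlists"><div>
<li><a href="/2shy/" />2Shy</a></li>
<li>&nbsp;</li>
<li class="initial"><a href="/roster/a/">A</a></li>
<li><a href="/aeroplane/" />Aeroplane</a></li>
</div></div>
"""


def roster_names(markup):
    """Collect the text of each roster <li>, skipping letter headings and blank spacers."""
    tree = html.fromstring(markup)
    # Skip the <li class="initial"> items that only hold the letter headings.
    items = tree.xpath("//*[@id='rosterlists']/div/li[not(@class='initial')]")
    # text_content() gathers text from the <li> and everything nested in it.
    names = (li.text_content().strip() for li in items)
    # Drop items that are empty after stripping, such as the &nbsp; spacers.
    return [name for name in names if name]


if __name__ == "__main__":
    print(roster_names(SNIPPET))
```

Because text_content() is called on the `<li>` rather than on the `<a>`, it does not matter whether the parser placed the name inside the link or after it.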
Thanks, Simakis. Well done!!! (I had found a solution of sorts, but yours is more compact and elegant.)