2016-06-10 61 views
-1

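Python: extracting bold text and the text that follows it

Using the HTML below, I want to pull out two pieces of data and add them to a Python list. Each piece of bold text is a horse's name, and the text that follows it is the comment on that horse.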

<div id="ANALYSIS" class="tabContent tabSelected">A weak handicap that looked wide open. 
 
    <br> 
 
    <br> <b class="black">LADY MAKFI</b> showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh. 
 
    She saw it out well and it´ll be interesting to see how she copes with a rise. 
 
    <br> 
 
    <br> <b class="black">Weardiditallgorong</b> went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW. 
 
    <br> 
 
    <br> <b class="black">Chauvelin</b>, in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form. 
 
    <br> 
 
    <br> <b class="black">Happy Jack</b> not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton] 
 
    <br> 
 
    <br> 
 
    <div id="resultRaceReport" class="hide"></div> 
 
</div>

From the HTML above, I would like the output to look like this:

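[LADY MAKFI, showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh. She saw it out well and it´ll be interesting to see how she copes with a rise.]

[Weardiditallgorong, went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.]

[Chauvelin, in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.]

[Happy Jack, not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]]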

but I just don't know how to get the desired output (it is more the logic behind it that I am stuck on).

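I am currently using lxml to scrape the content. I need to match the bold text (the horse name) against my table so that I can add the text after it (the comment) to my database.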

+3

(http://stackoverflow.com/questions/11709079/parsing-html-using-python) –

+0

@emma perkins Possible duplicate of [Parsing HTML using Python]. I take it you are using lxml, as in your previous question?

+0

Apologies, yes I am (I will add that to the question). This is more about the logic than the how.

Answers

2

h = """<div id="ANALYSIS" class="tabContent tabSelected">A weak handicap that looked wide open.<br><br> <b class="black">LADY MAKFI</b> showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh. She saw it out well and it´ll be interesting to see how she copes with a rise.<br><br> <b class="black">Weardiditallgorong</b> went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.<br><br> <b class="black">Chauvelin</b>, in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.<br><br> <b class="black">Happy Jack</b> not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]<br><br> <div id="resultRaceReport" class="hide"></div></div>""" 

from lxml import html 

x = html.fromstring(h) 

div = x.xpath("//*[@id='ANALYSIS']")[0] 

# find bold tags by class name 
for b in div.xpath(".//b[@class='black']"): 
    # get bold text 
    print(b.text) 
    # get text between current bold up to next br tag. 
    print(b.xpath("./following::text()[1]")) 

This gives you:

LADY MAKFI 
[u' showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh. She saw it out well and it\xc2\xb4ll be interesting to see how she copes with a rise.'] 
Weardiditallgorong 
[' went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.'] 
Chauvelin 
[', in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.'] 
Happy Jack 
[' not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]'] 

If you want it all in a single list, exactly as posted:

from lxml import html 

x = html.fromstring(h) 
div = x.xpath("//*[@id='ANALYSIS']")[0] 
out = [b.text + "," + b.xpath("./following::text()[1]")[0].lstrip(",") for b in div.xpath(".//b[@class='black']")] 

Which gives you:

[u'LADY MAKFI, showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh. She saw it out well and it\xc2\xb4ll be interesting to see how she copes with a rise.', 
'Weardiditallgorong, went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.', 
'Chauvelin, in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.', 
'Happy Jack, not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]'] 
+0

Perfect, thanks again.

+0

No problem. We can actually simplify the xpath to get the first following text node after each bold tag. Are you doing data analysis?
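The commenter does not show the simplified version. One possible reading, offered only as a sketch, is to skip the second XPath call and read lxml's `.tail` attribute, which holds the text node directly after an element's closing tag. The `snippet` string below is a shortened, hypothetical stand-in for the question's markup, and the leading comma is stripped so the name and comment come back as a separate pair.

```python
from lxml import html

# Shortened stand-in for the question's HTML (same structure, abbreviated text).
snippet = """<div id="ANALYSIS">A weak handicap.<br><br>
<b class="black">LADY MAKFI</b> showed vastly improved form.<br><br>
<b class="black">Chauvelin</b>, in second-time blinkers, ran well.<br><br>
<div id="resultRaceReport" class="hide"></div></div>"""

root = html.fromstring(snippet)

pairs = []
for b in root.xpath(".//b[@class='black']"):
    name = b.text_content().strip()
    # .tail is the text that follows </b>, up to the next tag; it may be None.
    comment = (b.tail or "").lstrip(" ,").strip()
    pairs.append((name, comment))

print(pairs)
```

Keeping the pair as a tuple rather than one joined string makes it easier to look the horse up by name later.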

+0

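Yes, I'm collecting the results of past horse races and then analysing them for betting :) so every comment on a horse needs to be entered into my database and matched to that horse.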
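The asker's database is not shown, so the following is only a sketch of the matching step, using an in-memory SQLite database. The `horses` table, its columns, and the sample rows are all hypothetical. Names are compared case-insensitively because the report prints some names in capitals (for example LADY MAKFI).

```python
import sqlite3

# Hypothetical (name, comment) pairs as produced by the scraping step.
pairs = [
    ("LADY MAKFI", "showed vastly improved form."),
    ("Happy Jack", "travelled easily until asked for his effort."),
    ("Unknown Horse", "not present in the table."),
]

conn = sqlite3.connect(":memory:")
# Hypothetical schema; the real table layout is an assumption.
conn.execute(
    "CREATE TABLE horses (id INTEGER PRIMARY KEY, name TEXT UNIQUE, comment TEXT)"
)
conn.executemany(
    "INSERT INTO horses (name) VALUES (?)",
    [("Lady Makfi",), ("Happy Jack",), ("Chauvelin",)],
)

unmatched = []
for name, comment in pairs:
    # Parameterised query; lower() gives a case-insensitive match for ASCII names.
    cur = conn.execute(
        "UPDATE horses SET comment = ? WHERE lower(name) = lower(?)",
        (comment, name.strip()),
    )
    if cur.rowcount == 0:
        unmatched.append(name)  # keep for manual review

rows = conn.execute("SELECT name, comment FROM horses ORDER BY id").fetchall()
print(rows)
print(unmatched)
```

Collecting the unmatched names instead of silently skipping them makes spelling differences between the report and the table easy to spot.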

1

I prefer Beautiful Soup's API to using lxml directly. I can avoid xpath entirely and just write Python.

import bs4

# document holds the HTML string from the question
soup = bs4.BeautifulSoup(document, 'lxml')
[b.text + b.next_sibling.rstrip() for b in soup.find_all('b')]

Output:

['LADY MAKFI showed vastly improved form to shed her maiden tag on this seasonal debut for a new yard. The filly offered little for Tony Martin last year, but did show some ability on her debut and is evidently capable when fresh.\n She saw it out well and it´ll be interesting to see how she copes with a rise.', 
'Weardiditallgorong went down fighting over this longer trip and probably improved again on her last-time-out second at Bath. This was her best effort yet on the AW.', 
'Chauvelin, in second-time blinkers, turned in his most encouraging effort for some time and is certainly well treated on his best form.', 
'Happy Jack not for the first time travelled easily until making heavy weather of it when asked for his effort. [David Orton]']
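The first entry above still carries a newline and a run of spaces from the source HTML. If that matters for storage, the whitespace could be collapsed afterwards. A minimal sketch, where `normalise` is a hypothetical helper name and `raw` is a shortened sample:

```python
def normalise(text):
    # str.split() with no arguments splits on any run of whitespace,
    # so joining with single spaces collapses newlines and indentation.
    return " ".join(text.split())


raw = "capable when fresh.\n    She saw it out well"
print(normalise(raw))
```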