2015-11-06

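lxml xpath - get all text inside a span tag

I am trying to scrape pages that look like the one below; each page has three or more span tags. The goal is a list of dictionaries of this form: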

{'ctl02_lblAppearanceInfo1': 'Text', 
'ctl02_lblAppearanceInfo2': 'Text'} 

HTML:

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1" class="ParamText"> TEXT HERE.............. </span> 

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo2" class="ParamText"> TEXT HERE.............</span> 

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppSpace" class="ParamText">TEXT HERE..........</span> 


<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo1" class="ParamText"> TEXT HERE..............</span> 

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo2" class="ParamText"> TEXT HERE.............</span> 

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppSpace" class="ParamText">TEXT HERE..........</span> 


<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo1" class="ParamText"> TEXT HERE..............</span> 

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo2" class="ParamText"> TEXT HERE.............</span> 

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppSpace" class="ParamText">TEXT HERE..........</span> 

I used

tree.xpath('//span[starts-with(@id, "ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl")]') 

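with success, since it returns Element objects from which I can read the id and the text attribute, but if I run into something like this: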

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1" class="ParamText"> 
TEXT LINE 1 
<br>TEXT LINE 2 
<br>TEXT LINE 3 
<br>TEXT LINE 4</span> 

it only returns "TEXT LINE 1".
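The behaviour can be reproduced in isolation. In lxml, an element's .text holds only the text that comes before its first child element; the text following each <br> is stored as that <br> element's .tail. The snippet below is a minimal sketch that uses a shortened, hypothetical id ("demo") instead of the real one:

```python
from lxml import html

# Hypothetical, shortened markup; the id "demo" is only for illustration.
span = html.fromstring(
    '<span id="demo">TEXT LINE 1<br>TEXT LINE 2<br>TEXT LINE 3</span>'
)

first_only = span.text                         # text before the first <br>
tails = [br.tail for br in span.iter('br')]    # text stored after each <br>
all_nodes = span.xpath('.//text()')            # every text node under the span

print(first_only)
print(tails)
print(all_nodes)
```

Here first_only contains only the first line, while the XPath query collects all three.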

Answers

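Use text() in the XPath. Querying .//text() on each span returns every text node beneath it, including the text that follows each <br>.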

Here is the code:

from lxml import html 

HTML = """<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1" class="ParamText"> TEXT HERE 1.............. </span> 
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo2" class="ParamText"> TEXT HERE 2..............</span> 
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppSpace" class="ParamText">TEXT HERE 3..............</span> 
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo1" class="ParamText"> TEXT HERE 4..............</span> 
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo2" class="ParamText"> TEXT HERE 5..............</span> 
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppSpace" class="ParamText">TEXT HERE 6..............</span> 
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo1" class="ParamText"> TEXT HERE 7..............</span> 
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo2" class="ParamText"> TEXT HERE 8..............</span> 
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppSpace" class="ParamText">TEXT HERE 9..............</span> 
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1" class="ParamText"> 
TEXT LINE 10............. 
<br>TEXT LINE 11............. 
<br>TEXT LINE 12............. 
<br>TEXT LINE 13.............</span> 
""" 

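tree = html.fromstring(HTML) 
# Select every span whose id contains the shared prefix.
text_lines = tree.xpath('//span[contains(@id, "ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl")]') 

results = dict() 

for i, text_line in enumerate(text_lines): 
    # Read the id attribute directly from the span element.
    span_id = text_line.get('id') 
    # .//text() yields every text node under the span, including the text after each <br>.
    span_text = [x.strip() for x in text_line.xpath('.//text()')] 
    # Key by position, because the same id can occur more than once.
    results[i] = dict(id=span_id, texts=span_text) 

print(results) 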

Output:

{ 
    0: { 
     'texts': ['TEXT HERE 1..............'], 
     'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1' 
    }, 
    1: { 
     'texts': ['TEXT HERE 2..............'], 
     'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo2' 
    }, 
    2: { 
     'texts': ['TEXT HERE 3..............'], 
     'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppSpace' 
    }, 
    3: { 
     'texts': ['TEXT HERE 4..............'], 
     'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo1' 
    }, 
    4: { 
     'texts': ['TEXT HERE 5..............'], 
     'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo2' 
    }, 
    5: { 
     'texts': ['TEXT HERE 6..............'], 
     'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppSpace' 
    }, 
    6: { 
     'texts': ['TEXT HERE 7..............'], 
     'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo1' 
    }, 
    7: { 
     'texts': ['TEXT HERE 8..............'], 
     'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo2' 
    }, 
    8: { 
     'texts': ['TEXT HERE 9..............'], 
     'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppSpace' 
    }, 
    9: { 
     'texts': ['TEXT LINE 10.............', 'TEXT LINE 11.............', 'TEXT LINE 12.............', 'TEXT LINE 13.............'], 
     'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1' 
    } 
} 
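
The question asked for a dictionary keyed by the short id suffix, such as 'ctl02_lblAppearanceInfo1'. One possible way to get that shape is sketched below. The names PREFIX, spans_to_dict and sample are hypothetical, and the sketch assumes the part of the id before the suffix is constant, which may not hold on the real pages. It joins the text nodes of each span with newlines:

```python
from lxml import html

# Assumed constant part of the id; adjust it if the real pages differ.
PREFIX = "ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_"


def spans_to_dict(markup, prefix=PREFIX):
    """Map each matching span's id (minus the prefix) to its joined text."""
    tree = html.fromstring(markup)
    result = {}
    # lxml lets XPath variables be passed as keyword arguments.
    for span in tree.xpath('//span[starts-with(@id, $prefix)]', prefix=prefix):
        lines = [t.strip() for t in span.xpath('.//text()')]
        key = span.get('id')[len(prefix):]
        # Drop empty fragments and keep one line per text node.
        result[key] = "\n".join(line for line in lines if line)
    return result


# Small made-up sample in the same shape as the spans above.
sample = (
    '<span id="' + PREFIX + 'ctl02_lblAppearanceInfo1" class="ParamText"> A <br>B</span>'
    '<span id="' + PREFIX + 'ctl02_lblAppearanceInfo2" class="ParamText">C</span>'
)
print(spans_to_dict(sample))
```

Note that a dictionary keyed by id keeps only the last span when the same id occurs more than once, as it does in the sample HTML of the answer above; keying by position avoids that.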