2015-10-29 44 views
from lxml import html

def parse_header(table):
    ths = table.xpath('//tr/th')
    if not ths:
        # Here is the problem: this finds tr[1]/td in the whole HTML file instead of only in this table
        ths = table.xpath('//tr[1]/td')

    # ... something else ...

doc = html.fromstring(html_string)
table = doc.xpath("//div[@id='divGridData']/div[2]/table")[0]
parse_header(table)

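I want to find all tr[1]/td within my table, but table.xpath("//tr[1]/td") still matches elements across the whole HTML file. How can I search only within this element instead of the entire document? In other words, how do I use XPath to find all tr elements inside a given table element?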


Edit:

content = '''
<root>
    <table id="table-one">
        <tr>
            <td>content from table 1</td>
        </tr>
        <table>
            <tr>
                <!-- this is content I do not want to get -->
                <td>content from embedded table</td>
            </tr>
        </table>
    </table>
</root>'''

root = etree.fromstring(content)
table_one = root.xpath('table[@id="table-one"]')[0]  # xpath() returns a list
all_td_elements = table_one.xpath('//td')  # this gives me too much!

Now I don't want the content of the embedded table. What should I do?

Answer


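To find elements that are descendants of your context node, prepend the period (.) operator to your XPath. A path that starts with // is absolute and always searches from the document root, no matter which element you call xpath() on. So I think the XPath you are looking for is: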

.//tr[1]/td 

This selects only the td elements that are descendants of the current table, rather than those in the entire HTML file.
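Applied to the parse_header function from the question, the relative path might look like the sketch below. The html_string markup is made up for illustration (it is not the asker's real page), and returning the cell texts is an assumption about what the elided part of the function does.

```python
from lxml import html

# Hypothetical markup, invented for this sketch: two tables, of which only
# the second sits inside the div that the question selects.
html_string = """
<html><body>
  <table><tr><td>other A</td><td>other B</td></tr></table>
  <div id="divGridData">
    <div>toolbar</div>
    <div>
      <table>
        <tr><td>Name</td><td>Price</td></tr>
        <tr><td>apple</td><td>3</td></tr>
      </table>
    </div>
  </div>
</body></html>
"""


def parse_header(table):
    # The leading "." anchors the search at `table`, not at the document root.
    ths = table.xpath('.//tr/th')
    if not ths:
        # No <th> cells: fall back to the cells of the first row of this table.
        ths = table.xpath('.//tr[1]/td')
    # Assumption: the caller wants the header texts.
    return [cell.text_content().strip() for cell in ths]


doc = html.fromstring(html_string)
table = doc.xpath("//div[@id='divGridData']/div[2]/table")[0]
header = parse_header(table)
print(header)
```

With the original '//tr[1]/td' the cells of the unrelated first table would be returned as well, because that path ignores the element it is called on.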

For example:

from lxml import etree

content = '''
<root>
    <table id="table-one">
        <tr>
            <td>content from table 1</td>
        </tr>
    </table>
    <table id="table-two">
        <tr>
            <td>content from table 2</td>
        </tr>
    </table>
</root>'''

root = etree.fromstring(content)
# xpath() returns a list, so take the first (and only) match
table_one = root.xpath('table[@id="table-one"]')[0]

# this selects all td elements in the entire XML document (so two elements)
all_td_elements = table_one.xpath('//td')

# this selects only the single td inside table_one, because of the period
just_sub_td_elements = table_one.xpath('.//td')
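
For the nested-table case from the question's edit, .//td is still too broad, because the embedded table's cells are also descendants of table_one. One possible approach, offered as a sketch rather than a definitive answer, is to use child steps only, or to filter cells by their nearest enclosing table. The names direct_cells and own_cells are made up for this example.

```python
from lxml import etree

content = '''
<root>
    <table id="table-one">
        <tr>
            <td>content from table 1</td>
        </tr>
        <table>
            <tr>
                <td>content from embedded table</td>
            </tr>
        </table>
    </table>
</root>'''

root = etree.fromstring(content)
table_one = root.xpath('table[@id="table-one"]')[0]

# Option 1: child steps only. Rows directly under table_one, then their
# cells. Nothing deeper in the tree is visited.
direct_cells = table_one.xpath('./tr/td')

# Option 2: search all descendants, but keep only the cells whose nearest
# enclosing <table> is table_one itself.
own_cells = [
    td for td in table_one.xpath('.//td')
    if td.xpath('ancestor::table[1]')[0] is table_one
]

print([td.text for td in direct_cells])
print([td.text for td in own_cells])
```

Option 1 depends on the exact nesting: if the rows are wrapped in a tbody element, as is common in real HTML, the path would need to be ./tbody/tr/td instead. Option 2 does not depend on the depth of the rows, at the cost of one extra XPath call per cell.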

I have one more question (I have updated my question): how can I ignore the embedded table? – roger

I don't understand the update? – gtlambert

I don't want to use `table_one.xpath('//td')` – roger
