Going deeper with XPath node()

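I am trying to find the text surrounding every hyperlink within the paragraphs of Wikipedia pages, and the way I am doing this involves the xpath tree.xpath("//p/node()"). This works fine for most links, and I am able to find most of the things that are <Element a at $mem_location$>. However, if a hyperlink is italicized (see the example below), node() only sees it as <Element i at $mem_location$> and does not look any deeper.

This causes my code to miss hyperlinks and messes up the indexing for the rest of the page.

Example: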

<p>The closely related term, <a href="/wiki/Mange" title="Mange">mange</a>, 
is commonly used with <a href="/wiki/Domestic_animal" title="Domestic animal" class="mw-redirect">domestic animals</a> 
(pets) and also livestock and wild mammals, whenever hair-loss is involved. 

<i><a href="/wiki/Sarcoptes" title="Sarcoptes">Sarcoptes</a></i> 
and <i><a href="/wiki/Demodex" title="Demodex">Demodex</a></i> 
species are involved in mange, both of these genera are also involved in human skin diseases (by 
convention only, not called mange). <i>Sarcoptes</i> in humans is especially 
severe symptomatically, and causes the condition known as 
<a href="/wiki/Scabies" title="Scabies">scabies</a>.</p>

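node() picks up "mange", "domestic animals", and "scabies" correctly, but it skips "Sarcoptes" and "Demodex" and throws off the indexing, because I am filtering for nodes that are <Element a at $mem_location$> and not <Element i at $mem_location$>.

Is there a way to look deeper with node()? I couldn't find anything in the documentation.

Edit: My xpath is currently "//p/node()", but it only grabs the outermost layer of elements. Most of the time that layer is an <a>, which is great, but if the link is wrapped in an <i>, node() only grabs the <i>. I am asking whether there is a way to check further down so that I can find the <a> inside the <i> wrapper.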

The relevant code is below:

import re
from lxml import etree

tree = etree.HTML(read)  # read holds the HTML source of the page

titles = list(tree.xpath('//p//a[contains(@href,"/wiki/")]/@title'))  # extracts the titles of all hyperlinks in section paragraphs
hyperlinks = list(tree.xpath('//p//a[contains(@href,"/wiki/")]/text()'))
b = list(tree.xpath("//p/b/text()"))  # extracts all bolded words in section paragraphs
t = list(tree.xpath("//p/node()"))

b_count = 0
a_count = 0
test = []
for items in t:
    print(items)
    items = str(items)
    if "<Element b" in items:
        test.append(b[b_count])
        b_count += 1
        continue
    if "<Element a" in items:
        test.append((hyperlinks[a_count], titles[a_count]))
        a_count += 1
        continue

    if "<Element " not in items:
        pattern = re.compile('(\t(.*?)\n)')
        look = pattern.search(items)

        if look is not None:  # if there is a match
            test.append(look.group().partition("\t")[2].partition("\n")[0])

        period_pattern = re.compile(r"(\t(.*?)\.)")
        look_period = period_pattern.search(items)
        if look_period is not None:
            test.append(look_period.group().partition("\t")[2])

Source

2015-06-18 MIT_noob

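What is the code you have used so far?

In the 't' variable, do you want all the b and a items? What exactly is the 't' variable?

't' contains all of the elements parsed by the xpath, so it is a list of everything in the paragraphs. Here is the output of print t[:15]: [<Element b at 0x7f59228cf248>, ' is ', <Element a at 0x7f5922947368>, ' with ', <Element a at 0x7f59229473b0>, '.', <Element sup at 0x7f59228cf2d8>, '\n', 'The terminology has the following complications:\n', 'Acariasis is a term for a ', <Element a at 0x7f5922947440>, ', caused by mites, sometimes with papillae (', <Element a at 0x7f59228cf3b0>, '), and usually accompanied by severe ', <Element a at 0x7f5922947488>]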

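I can't think of a direct XPath that would do the trick, but you can always loop through the contents and expand the wrapping elements like this: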

for i, x in enumerate(t):
    # text nodes are plain strings without a .tag, so guard the lookup
    if getattr(x, 'tag', None) == 'i':
        if x.find('a') is not None:  # the <i> wraps at least one <a>
            del t[i]
            # x.xpath('node()') returns text nodes as well as <a> elements
            for j, y in enumerate(x.xpath('node()')):
                t.insert(i + j, y)

This also handles multiple <a> elements inside a single <i>, such as <i><a>something</a><a>blah</a></i>
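As an alternative to patching the list in place, the paragraph can be walked recursively so that links are found at any depth. The following is only a minimal sketch: the flatten helper, its keep parameter, and the shortened snippet are made up for illustration. It uses the standard library's xml.etree.ElementTree on a well-formed snippet so that it runs without extra dependencies. lxml elements expose the same .text, .tail, and child iteration, so the same idea should carry over, although comment nodes in a real page would need to be filtered out first.

```python
import xml.etree.ElementTree as ET


def flatten(elem, keep=("a", "b")):
    """Yield text strings and kept elements under elem, in document order.

    Hypothetical helper: elements whose tag is in `keep` are yielded as
    elements; any other wrapper (such as <i>) is descended into.
    """
    if elem.text:
        yield elem.text
    for child in elem:
        if child.tag in keep:
            yield child
        else:
            yield from flatten(child, keep)
        # text that follows the child's closing tag belongs to the parent
        if child.tail:
            yield child.tail


# Shortened, well-formed stand-in for the paragraph in the question.
snippet = (
    '<p>The term <a href="/wiki/Mange" title="Mange">mange</a> and '
    '<i><a href="/wiki/Sarcoptes" title="Sarcoptes">Sarcoptes</a></i> mites.</p>'
)
p = ET.fromstring(snippet)

# Links become (text, title) tuples; plain text stays as strings.
result = [
    (node.text, node.get("title")) if isinstance(node, ET.Element) else node
    for node in flatten(p)
]
print(result)
```

Because text and links come out in one ordered list, there is no need to keep separate counters in sync, which is where the indexing in the question went wrong.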

Source

2015-06-18 15:08:20
