2015-06-18

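Going deeper with xpath node()

I'm trying to find the text surrounding every hyperlink in the paragraphs of a Wikipedia page, and my approach uses the XPath tree.xpath("//p/node()"). This works for most links, and most of what I get back is <Element a at $mem_location$>. However, if a hyperlink is italicized (see the example below), node() returns only <Element i at $mem_location$> and does not appear to look any deeper.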

This causes my code to miss hyperlinks and throws off the indexing for the rest of the page.

Example:

<p>The closely related term, <a href="/wiki/Mange" title="Mange">mange</a>, 
is commonly used with <a href="/wiki/Domestic_animal" title="Domestic animal" class="mw-redirect">domestic animals</a> 
(pets) and also livestock and wild mammals, whenever hair-loss is involved. 

<i><a href="/wiki/Sarcoptes" title="Sarcoptes">Sarcoptes</a></i> 
and <i><a href="/wiki/Demodex" title="Demodex">Demodex</a></i> 
species are involved in mange, both of these genera are also involved in human skin diseases (by 
convention only, not called mange). <i>Sarcoptes</i> in humans is especially 
severe symptomatically, and causes the condition known as 
<a href="/wiki/Scabies" title="Scabies">scabies</a>.</p> 

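node() picks up "mange", "domestic animals", and "scabies" correctly, but it skips "Sarcoptes" and "Demodex" and throws off the indexing, because I filter for nodes that are <Element a at $mem_location$> and not <Element i at $mem_location$>.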
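The behaviour can be reproduced with a minimal sketch like the one below. The shortened HTML string is an assumption, trimmed down from the example paragraph, and `top_level` is just an illustrative name.

```python
from lxml import etree

# Shortened stand-in for the Wikipedia paragraph (assumed sample input).
html = ('<p>See <a href="/wiki/Mange" title="Mange">mange</a> and '
        '<i><a href="/wiki/Sarcoptes" title="Sarcoptes">Sarcoptes</a></i>.</p>')
tree = etree.HTML(html)

# "//p/node()" selects only the direct children of <p>.
# Text nodes come back as strings, elements as Element objects.
top_level = [n if isinstance(n, str) else n.tag
             for n in tree.xpath("//p/node()")]
print(top_level)
```

The italic link appears only as its `i` wrapper in this list, so a filter that looks for `a` elements never sees it.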

Is there a way to look deeper with node()? I couldn't find anything in the documentation.

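Edit: my XPath is currently "//p/node()", but it only grabs the outermost layer of elements. Most of the time that is an <a>, which is great, but if the link is wrapped in an <i>, it only grabs the <i>. I'm asking whether there is a way to look further in, so that I can find the <a> inside the <i> wrapper.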

The relevant code is below:

import re
from lxml import etree

tree = etree.HTML(read)  # `read` holds the HTML source of the page

# Titles and link text of all wiki hyperlinks in section paragraphs
titles = list(tree.xpath('//p//a[contains(@href,"/wiki/")]/@title'))
hyperlinks = list(tree.xpath('//p//a[contains(@href,"/wiki/")]/text()'))
b = list(tree.xpath("//p/b/text()"))  # all bolded words in section paragraphs
t = list(tree.xpath("//p/node()"))    # direct children of each paragraph only

b_count = 0
a_count = 0
test = []
for items in t:
    print(items)
    items = str(items)
    if "<Element b" in items:
        test.append(b[b_count])
        b_count += 1
        continue
    if "<Element a" in items:
        test.append((hyperlinks[a_count], titles[a_count]))
        a_count += 1
        continue

    if "<Element " not in items:
        # Plain text node: keep the text between a tab and a newline
        look = re.search(r"(\t(.*?)\n)", items)
        if look is not None:  # if there is a match
            test.append(look.group().partition("\t")[2].partition("\n")[0])

        # ...and the text between a tab and the next period
        look_period = re.search(r"(\t(.*?)\.)", items)
        if look_period is not None:
            test.append(look_period.group().partition("\t")[2])

What code do you have so far?

In the 't' variable, do you want all of the b and a items? What exactly is the 't' variable?

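't' contains everything the XPath matched, so it is a list of everything in the paragraphs. Here is 'print t[:15]': '[<Element b at 0x7f59228cf248>, ' is ', <Element a at 0x7f5922947368>, ' with ', <Element a at 0x7f59229473b0>, '.', <Element sup at 0x7f59228cf2d8>, '\n', 'There are several complications with the terminology:\n', 'Acariasis is a term for a ', <Element a at 0x7f5922947440>, ', caused by mites, sometimes with papillae (', <Element a at 0x7f59228cf3b0>, '), and usually accompanied by severe ', <Element a at 0x7f5922947488>]'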

1 Answer

I can't think of a direct XPath that would do the trick, but you can always loop over the contents and expand the <i> elements like this:

for i, x in enumerate(t):
    # Text nodes are plain strings; only elements have a .tag attribute.
    if getattr(x, 'tag', None) == 'i':
        if x.find('a') is not None:  # the <i> wraps at least one <a>
            del t[i]
            # x.xpath('node()') returns text nodes as well as <a> elements.
            for j, y in enumerate(x.xpath('node()')):
                t.insert(i + j, y)

This also handles multiple <a> elements inside a single <i>, such as <i><a>something</a><a>blah</a></i>.
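An alternative that may be worth trying, assuming lxml: the `//` abbreviation also works in front of `node()`, so `.//node()` walks every descendant of a paragraph in document order, including an <a> nested inside an <i>. The sketch below is only an illustration under that assumption. The helper name `flatten_paragraph` and the sample HTML are hypothetical, and it does not handle formatting nested inside a link (for example <a><b>x</b></a>).

```python
from lxml import etree

# Assumed sample input, shortened from the paragraph in the question.
html = ('<p>See <a href="/wiki/Mange" title="Mange">mange</a> and '
        '<i><a href="/wiki/Sarcoptes" title="Sarcoptes">Sarcoptes</a></i> '
        'species.</p>')
tree = etree.HTML(html)


def flatten_paragraph(p):
    """Hypothetical helper: list a paragraph's text and links in order.

    Plain text is returned as strings, links as (text, title) tuples.
    """
    out = []
    for node in p.xpath(".//node()"):
        if isinstance(node, str):
            # lxml text results know where they came from. Skip the text
            # directly inside an <a>, because it is reported with the link.
            if node.is_text and node.getparent().tag == "a":
                continue
            out.append(str(node))
        elif node.tag == "a":
            out.append(("".join(node.itertext()), node.get("title")))
        # Other elements such as <i> are wrappers; their contents are
        # visited separately because .//node() descends into them.
    return out


result = flatten_paragraph(tree.xpath("//p")[0])
print(result)
```

Because links and surrounding text come out of a single pass, there is no need to keep separate counters in step with parallel lists of titles and link texts.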