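Going deeper with XPath node()
I'm trying to find the text surrounding every hyperlink in the paragraphs of a Wikipedia page, and my approach uses the XPath tree.xpath("//p/node()"). This works for most links, and most of the nodes I get back are <Element a at $mem_location$>. However, if the hyperlink is italicized (see the example below), node() only sees it as <Element i at $mem_location$> and does not look any deeper. This causes my code to miss the hyperlink and throws off the indexing for the rest of the page.
Example: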
<p>The closely related term, <a href="/wiki/Mange" title="Mange">mange</a>,
is commonly used with <a href="/wiki/Domestic_animal" title="Domestic animal" class="mw-redirect">domestic animals</a>
(pets) and also livestock and wild mammals, whenever hair-loss is involved.
<i><a href="/wiki/Sarcoptes" title="Sarcoptes">Sarcoptes</a></i>
and <i><a href="/wiki/Demodex" title="Demodex">Demodex</a></i>
species are involved in mange, both of these genera are also involved in human skin diseases (by
convention only, not called mange). <i>Sarcoptes</i> in humans is especially
severe symptomatically, and causes the condition known as
<a href="/wiki/Scabies" title="Scabies">scabies</a>.</p>
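node() picks up "mange", "domestic animals", and "scabies" correctly, but it essentially skips "Sarcoptes" and "Demodex" and screws up the indexing, because I keep only the nodes that are <Element a at $mem_location$> and not the ones that are <Element i at $mem_location$>.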
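To make the index mismatch concrete, here is a minimal sketch. It assumes a trimmed, well-formed version of the paragraph above and uses the standard library's xml.etree.ElementTree instead of lxml purely so that it runs without extra installs. The names SNIPPET, direct_links, and all_links are illustrative only.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the example paragraph (an assumption;
# the real page would be parsed with lxml's HTML parser).
SNIPPET = (
    '<p>The closely related term, <a title="Mange">mange</a>, is common. '
    '<i><a title="Sarcoptes">Sarcoptes</a></i> and '
    '<i><a title="Demodex">Demodex</a></i> species are involved in '
    '<a title="Scabies">scabies</a>.</p>'
)

p = ET.fromstring(SNIPPET)

# Direct children only: the links a child-axis query like //p/node() can see.
direct_links = [a.get("title") for a in p.findall("a")]

# All descendants: the links a query like //p//a sees, including those in <i>.
all_links = [a.get("title") for a in p.findall(".//a")]

print(direct_links)  # ['Mange', 'Scabies']
print(all_links)     # ['Mange', 'Sarcoptes', 'Demodex', 'Scabies']
```

The loop only ever encounters two links, but a counter that advances through the four-item list would pair the second link it sees ("scabies") with the second title in the full list ("Sarcoptes").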
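Is there a way to look deeper with node()? I couldn't find anything about it in the documentation.
Edit: My XPath is currently "//p/node()", but it only grabs the outermost layer of elements. Most of the time that is an <a>, which is great, but if the link is wrapped in an <i> layer, it only grabs the <i>. I'm asking whether there is a way to check further down so that I can find the <a> inside the <i> wrapper.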
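One possible way to look deeper, sketched under assumptions rather than offered as a definitive fix, is to walk each paragraph recursively and emit text and links in document order, so no separate counters are needed. The helper name flatten is hypothetical. The sketch uses the standard library's xml.etree.ElementTree; it relies only on .tag, .text, .tail, .get(), .itertext(), and child iteration, which lxml elements also provide, though comment nodes in real HTML would need to be skipped.

```python
import xml.etree.ElementTree as ET


def flatten(element):
    """Yield text chunks and (link_text, title) tuples in document order."""
    # Text that appears before the first child element.
    if element.text and element.text.strip():
        yield element.text.strip()
    for child in element:
        if child.tag == "a":
            # A link: emit its full text together with its title attribute.
            yield ("".join(child.itertext()), child.get("title"))
        else:
            # A wrapper such as <i> or <b>: recurse to reach nested links.
            yield from flatten(child)
        # Text after the child's closing tag belongs to the parent element.
        if child.tail and child.tail.strip():
            yield child.tail.strip()


# Hypothetical, well-formed sample paragraph for illustration.
p = ET.fromstring(
    '<p>See <a title="Mange">mange</a> and '
    '<i><a title="Sarcoptes">Sarcoptes</a></i> species.</p>'
)
result = list(flatten(p))
print(result)
# ['See', ('mange', 'Mange'), 'and', ('Sarcoptes', 'Sarcoptes'), 'species.']
```

Within XPath itself, the descendant form "//p//node()" returns nested nodes as well, but it lists both the <i> wrapper and the <a> inside it, so the results would still need filtering.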
The relevant code is as follows:

    import re
    from lxml import etree

    # read is assumed to hold the page's HTML source
    tree = etree.HTML(read)
    titles = list(tree.xpath('//p//a[contains(@href, "/wiki/")]/@title'))  # titles of all hyperlinks in section paragraphs
    hyperlinks = list(tree.xpath('//p//a[contains(@href, "/wiki/")]/text()'))  # link text of those hyperlinks
    b = list(tree.xpath("//p/b/text()"))  # all bolded words in section paragraphs
    t = list(tree.xpath("//p/node()"))  # direct children of each <p> only
    b_count = 0
    a_count = 0
    test = []
    for items in t:
        print(items)
        items = str(items)
        if "<Element b at" in items:  # bold element
            test.append(b[b_count])
            b_count += 1
            continue
        if "<Element a at" in items:  # link element (a link inside <i> never reaches here)
            test.append((hyperlinks[a_count], titles[a_count]))
            a_count += 1
            continue
        if "<Element " not in items:  # plain text node
            pattern = re.compile(r'(\t(.*?)\n)')
            look = pattern.search(items)
            if look is not None:  # if there is a match
                test.append(look.group().partition("\t")[2].partition("\n")[0])
            period_pattern = re.compile(r"(\t(.*?)\.)")
            look_period = period_pattern.search(items)
            if look_period is not None:
                test.append(look_period.group().partition("\t")[2])
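What code have you used so far?
In the 't' variable, do you want all the b and a items? What exactly is the 't' variable?
't' contains all the nodes returned by the XPath, so it is a list of everything in the paragraphs. Here is the output of 'print t[:15]': [<Element b at 0x7f59228cf248>, ' is ', <Element a at 0x7f5922947368>, ' with ', <Element a at 0x7f59229473b0>, '.', <Element sup at 0x7f59228cf2d8>, '\n', 'The terminology has the following complications:\n', 'Acariasis is a term for a ', <Element a at 0x7f5922947440>, ', caused by mites, sometimes with a papillae (', <Element a at 0x7f59228cf3b0>, '), and usually accompanied by severe ', <Element a at 0x7f5922947488>]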