2016-05-24 67 views

Parsing XML in Python without newline characters

Here is the XML file: http://www.diveintopython3.net/examples/feed.xml

My Python code:

from lxml import etree

def lxml():
    tree = etree.parse('feed.xml')
    NSMAP = {"nn": "http://www.w3.org/2005/Atom"}
    test = tree.xpath('//nn:category[@term="html"]/..', namespaces=NSMAP)
    # dump every element's tag and attributes
    for elem in tree.iter():
        print(elem.tag, '\t', elem.attrib)
    print('-------------------------------')
    test1 = tree.xpath('//nn:category', namespaces=NSMAP)
    print('++++++++++++++++++++++++++++++++')
    for node in test1:
        test2 = node.xpath('./../nn:summary', namespaces=NSMAP)  # returns a list
        # test2 is a list, so this call fails; .xpath() must be called on an element
        print(test2.xpath('normalize-space(.)'))
    print('*****************************************')
    test3 = tree.xpath('//text()[normalize-space(.)]')  # [normalize-space()] only strips leading and trailing whitespace
    print(test3)

The output is: ..

++++++++++++++++++++++++++++++++ 
['Putting an entire chapter on one page sounds\n bloated, but consider this — my longest chapter so far\n would be 75 printed pages, and it loads in under 5 seconds…\n On dialup.'] 
['Putting an entire chapter on one page sounds\n bloated, but consider this — my longest chapter so far\n would be 75 printed pages, and it loads in under 5 seconds…\n On dialup.'] 
['Putting an entire chapter on one page sounds\n bloated, but consider this — my longest chapter so far\n would be 75 printed pages, and it loads in under 5 seconds…\n On dialup.'] 
['The accessibility orthodoxy does not permit people to\n  question the value of features that are rarely useful and rarely used.'] 
['These notes will eventually become part of a\n  tech talk on video encoding.'] 
['These notes will eventually become part of a\n  tech talk on video encoding.'] 
['These notes will eventually become part of a\n  tech talk on video encoding.'] 
['These notes will eventually become part of a\n  tech talk on video encoding.'] 
['These notes will eventually become part of a\n  tech talk on video encoding.'] 
['These notes will eventually become part of a\n  tech talk on video encoding.'] 
['These notes will eventually become part of a\n  tech talk on video encoding.'] 
['These notes will eventually become part of a\n  tech talk on video encoding.'] 
***************************************** 
['\n ', 'dive into mark', '\n ', 'currently between addictions', '\n ', 'tag:diveintomark.org,2001-07-29:/', '\n ', '2009-03-27T21:56:07Z', '\n ', '\n ', '\n ', '\n ', '\n  ', 'Mark', '\n  ', 'http://diveintomark.org/', '\n ', '\n ', 'Dive into history, 2009 edition', '\n ', '\n ', 'tag:diveintomark.org,2009-03-27:/archives/20090327172042', '\n ', '2009-03-27T21:56:07Z', '\n ', '2009-03-27T17:20:42Z', '\n ', '\n ', '\n ', '\n ', 'Putting an entire chapter on one page sounds\n bloated, but consider this — my longest chapter so far\n would be 75 printed pages, and it loads in under 5 seconds…\n On dialup.', '\n ', '\n ', '\n ', '\n  ', 'Mark', '\n  ', 'http://diveintomark.org/', '\n ', '\n ', 'Accessibility is a harsh mistress', '\n ', '\n ', 'tag:diveintomark.org,2009-03-21:/archives/20090321200928', '\n ', '2009-03-22T01:05:37Z', '\n ', '2009-03-21T20:09:28Z', '\n ', '\n ', 'The accessibility orthodoxy does not permit people to\n  question the value of features that are rarely useful and rarely used.', '\n ', '\n ', '\n ', '\n  ', 'Mark', '\n ', '\n ', 'A gentle introduction to video encoding, part 1: container formats', '\n ', '\n ', 'tag:diveintomark.org,2008-12-18:/archives/20081218155422', '\n ', '2009-01-11T19:39:22Z', '\n ', '2008-12-18T15:54:22Z', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', '\n ', 'These notes will eventually become part of a\n  tech talk on video encoding.', '\n ', '\n'].. 

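My question is: why are there so many '\n' strings, and how do I remove them?

A further question: how can I directly query the tag that a piece of text belongs to, for example get the node that the text 'Mark' is a child of?

Thanks a lot.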

+2

Please don't post code as a picture. Post it as text, then format it properly (highlight/select the text -> click "{}"). Thanks – har07

+0

I fixed it. Sorry, I'm a beginner. Thanks – jason

Answers

1

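My question is: why are there so many '\n' strings, and how do I remove them?

Every run of whitespace between tags in the XML is a text node, and your XPath selects all of them. Nicely formatted XML usually contains a lot of newlines and spaces for indentation. For example, in the following XML, //text() selects two whitespace-only text nodes: one between <root> and <foo>, and another between </foo> and </root>: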

<root> 
    <foo>bar</foo> 
</root> 

You can use //text()[normalize-space()] to avoid selecting whitespace-only text nodes in the first place.
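As a minimal sketch of the difference, the snippet below parses the small sample above from a string and compares the two expressions. The variable names and the shortened <summary> element are assumptions made for illustration, not part of feed.xml. Note that the predicate only filters out whitespace-only nodes; newlines inside a kept node are still there, so the sketch also calls the XPath function normalize-space(.) on a single element (not on a list) to collapse them.

```python
from lxml import etree

# The small sample document from the answer, parsed from a string.
root = etree.fromstring("<root>\n    <foo>bar</foo>\n</root>")

# Every text node, including the whitespace-only ones used for indentation.
all_text = root.xpath("//text()")

# Only text nodes that are non-empty after whitespace normalization.
non_blank = root.xpath("//text()[normalize-space()]")

# Hypothetical element whose text contains an internal line break.
summary = etree.fromstring("<summary>Putting an entire chapter\n  on one page</summary>")

# normalize-space(.) evaluated on one element returns a single string with
# leading/trailing whitespace stripped and internal runs collapsed to one space.
collapsed = summary.xpath("normalize-space(.)")

print(all_text)
print(non_blank)
print(collapsed)
```

If the text nodes are already in a Python list, " ".join(s.split()) applied to each string should have a similar collapsing effect.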

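A further question: how can I directly query the tag that a piece of text belongs to, for example get the node that the text 'Mark' is a child of?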

your_text_node.getparent().tag 

The above should get the parent element of the text node referenced by the variable your_text_node, and then return that element's tag name.
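A short sketch of that lookup follows, using a tiny stand-in document instead of the real feed.xml (the XML string and variable names are assumptions for illustration). It relies on lxml returning XPath text results as strings that carry a getparent() method. Because the feed declares a default namespace, the tag comes back in namespace-qualified form; etree.QName is used here to extract just the local name.

```python
from lxml import etree

# A tiny stand-in for the feed (an assumption, not the real feed.xml).
xml = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    "<author><name>Mark</name></author>"
    "</feed>"
)
root = etree.fromstring(xml)

# Select the text node whose normalized content is exactly "Mark".
text_nodes = root.xpath('//text()[normalize-space() = "Mark"]')

# lxml's XPath string results remember the element they came from.
parent = text_nodes[0].getparent()

qualified_tag = parent.tag                       # includes the namespace URI
local_name = etree.QName(parent.tag).localname   # tag name without namespace

print(qualified_tag)
print(local_name)
```

One caveat worth keeping in mind: for text that follows a closing tag (tail text), getparent() returns the element the tail is attached to, which is the preceding sibling rather than the enclosing element.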