iterparse fails to parse a field, while other similar fields are parsed correctly

I am parsing the XML results of a Nessus scan (a .nessus file) with Python's iterparse. Parsing fails on unexpected records, while similar ones are parsed correctly.

The general structure of the XML file is a large number of records like the following:

<ReportHost>
    <ReportItem>
        <foo>9.3</foo>
        <bar>hello</bar>
    </ReportItem>
    <ReportItem>
        <foo>10.0</foo>
        <bar>world</bar>
    </ReportItem>
</ReportHost>
<ReportHost>
    ...
</ReportHost>

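In other words, there are many hosts (ReportHost), each with many report items (ReportItem), and each item has several characteristics (foo, bar). I want to generate one line per item, containing its characteristics.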

Parsing fails in the middle of the file on a simple line (foo in this case is cvss_base_score)

<cvss_base_score>9.3</cvss_base_score>

while about 200 similar lines before it were parsed without problems.

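The relevant piece of code is below. It sets context flags (inReportHost and inReportItem, which tell me where in the structure of the XML file I currently am) and either assigns or prints a value, depending on the context.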

import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

inReportHost = False
inReportItem = False

for event, elem in ET.iterparse("test2.nessus", events=("start", "end")):
    if event == 'start' and elem.tag == "ReportHost":
        inReportHost = True
    if event == 'end' and elem.tag == "ReportHost":
        inReportHost = False
        elem.clear()
    if inReportHost:
        if event == 'start' and elem.tag == 'ReportItem':
            inReportItem = True
            cvss = ''
        if event == 'start' and inReportItem:
            if event == 'start' and elem.tag == 'cvss_base_score':
                cvss = elem.text
        if event == 'end' and elem.tag == 'ReportItem':
            print(cvss)
            inReportItem = False

cvss is sometimes None (right after the cvss = elem.text assignment), even though identical entries were parsed properly earlier in the file.

If I add something along the lines of

if cvss is None: cvss = "0"

below the assignment, then many of the subsequent cvss values are parsed correctly (and some others are still None).

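When I take the <ReportHost>...</ReportHost> block that is parsed wrongly and run it through the program on its own, it works fine (i.e. cvss is assigned 9.3 as expected).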

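I am lost as to where the mistake in my code is, since among a whole bunch of similar records some are processed correctly and some are not (some records are identical and yet are processed differently). I also cannot find anything special about the failing records: identical records earlier and later in the file are fine.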

2013-02-03 WoJ

From the iterparse() docs:

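Note: iterparse() only guarantees that it has seen the ">" character of a starting tag when it emits a "start" event, so the attributes are defined, but the contents of the text and tail attributes are undefined at that point. The same applies to the element children; they may or may not be present. If you need a fully populated element, look for "end" events instead.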
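
A small sketch of what this note means in practice. The in-memory sample document and the helper name texts_by_event are made up for illustration and are not part of the question. It records elem.text at both events: the values seen at "end" are reliable, while the values seen at "start" may or may not be populated, depending on how much input the parser had buffered at that moment.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data, shaped like the example in the question.
SAMPLE = b"""<Report>
  <ReportHost>
    <ReportItem><cvss_base_score>9.3</cvss_base_score></ReportItem>
    <ReportItem><cvss_base_score>10.0</cvss_base_score></ReportItem>
  </ReportHost>
</Report>"""


def texts_by_event(source, tag):
    """Collect elem.text for `tag`, separately for "start" and "end" events."""
    seen = {"start": [], "end": []}
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if elem.tag == tag:
            seen[event].append(elem.text)
    return seen


seen = texts_by_event(io.BytesIO(SAMPLE), "cvss_base_score")
print(seen["end"])    # text is fully populated at "end"
print(seen["start"])  # undefined at "start": may hold the text or None
```

On a small input the "start" values will often look correct, which would explain why a single extracted ReportHost parses fine while the same record in the middle of a large file does not.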

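Drop the inReport* variables and process ReportHost only on "end" events, when it has been parsed completely. Use the ElementTree API to get the necessary information, such as cvss_base_score, from the current ReportHost element.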

To preserve memory, do:

import xml.etree.ElementTree as etree  # cElementTree was removed in Python 3.9

def getelements(filename_or_file, tag):
    """Yield each fully parsed element named `tag`, freeing memory as it goes."""
    context = iter(etree.iterparse(filename_or_file, events=('start', 'end')))
    _, root = next(context)  # get root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear()  # preserve memory

for host in getelements("test2.nessus", "ReportHost"):
    for cvss_el in host.iter("cvss_base_score"):
        print(cvss_el.text)

2013-02-03 10:23:58 jfs

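Thank you, this is really helpful. I updated the example XML file to better reflect reality (each host has several items). I will try to build my code around your idea, probably with two loops (one over the hosts, then one over the items within a host), but I first have to understand clearly how the iteration works. – WoJ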

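@WoJ: the '.iter()' method is recursive, i.e. no matter how many 'ReportItem' elements there are, all 'cvss_base_score' elements are found in whichever 'ReportItem' they belong to (or even if a 'cvss_base_score' element is outside any 'ReportItem'). – jfs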

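@j-f-sebastian: I see, but I also need to know which ReportItem they belong to. I am trying to output lines of the form (taking the XML above as an example) 'host1, foo="9.3", bar="hello"' and 'host2, foo="10.0", bar="world"'. It is now clear to me (from your example) how to extract the fields, but I need to keep them in context (link them to the item they belong to). – WoJ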
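
One possible way to keep each value tied to its item, sketched under assumptions: the host label is read from a "name" attribute on ReportHost (the question's sample shows no attributes, so that is a guess), and the function name iter_rows, the field list, and the in-memory sample are all hypothetical. It still processes each ReportHost only at its "end" event, but walks the direct ReportItem children with findall() and findtext() instead of the recursive iter(), so every field stays linked to the item it came from.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; the "name" attribute on ReportHost is an assumption.
SAMPLE = b"""<Report>
  <ReportHost name="host1">
    <ReportItem><foo>9.3</foo><bar>hello</bar></ReportItem>
    <ReportItem><foo>10.0</foo><bar>world</bar></ReportItem>
  </ReportHost>
  <ReportHost name="host2">
    <ReportItem><foo>5.0</foo></ReportItem>
  </ReportHost>
</Report>"""


def iter_rows(source, fields=("foo", "bar")):
    """Yield (host_name, {field: text}) once per ReportItem."""
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)  # first event is the "start" of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "ReportHost":
            host = elem.get("name")
            # Direct children only, so each field stays tied to its own item.
            for item in elem.findall("ReportItem"):
                yield host, {f: item.findtext(f) for f in fields}
            root.clear()  # drop processed hosts to keep memory bounded


rows = list(iter_rows(io.BytesIO(SAMPLE)))
for host, values in rows:
    print(host, values)
```

A field that is missing from an item comes back as None from findtext(), so the caller can decide on a default instead of having one baked into the parsing loop.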
