2012-06-21 40 views

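Python: I don't understand how XML iteration works

Thanks to the brilliant help on my XML parsing problem, I have reached a point where I am lost about how the XML elements are actually processed (using lxml).

My data is the output of an nmap scan, made up of many records like the ones below: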

<?xml version="1.0"?> 
<?xml-stylesheet href="file:///usr/share/nmap/nmap.xsl" type="text/xsl"?> 
<nmaprun scanner="nmap" args="nmap -sV -p135,12345 -oX 10.232.0.0.16.xml 10.232.0.0/16" start="1340201347" startstr="Wed Jun 20 16:09:07 2012" version="5.21" xmloutputversion="1.03"> 
    <host> 
    <status state="down" reason="no-response"/> 
    <address addr="10.232.0.1" addrtype="ipv4"/> 
    </host> 
    <host starttime="1340201455" endtime="1340201930"> 
    <status state="up" reason="echo-reply"/> 
    <address addr="10.232.49.2" addrtype="ipv4"/> 
    <hostnames> 
     <hostname name="host1.example.com" type="PTR"/> 
    </hostnames> 
    <ports> 
     <port protocol="tcp" portid="135"> 
     <state state="open" reason="syn-ack" reason_ttl="123"/> 
     <service name="msrpc" product="Microsoft Windows RPC" ostype="Windows" method="probed" conf="10"/> 
     </port> 
     <port protocol="tcp" portid="12345"> 
     <state state="open" reason="syn-ack" reason_ttl="123"/> 
     <service name="http" product="Trend Micro OfficeScan Antivirus http config" method="probed" conf="10"/> 
     </port> 
    </ports> 
    <times srtt="890" rttvar="2835" to="100000"/> 
    </host> 
</nmaprun> 

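I want to generate a line of output when

  • port 12345 is open
  • port 135 is open but port 12345 is not

I use the code below; the comments describe how I understand things to work: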

from lxml import etree 
import time 

scanTime = str(int(time.time())) 
d = etree.parse("10.233.85.0.22.xml") 

# find all hosts records 
for el_host in d.findall("host"): 
    # only process hosts UP 
    if el_host.find("status").attrib["state"] =="up": 

     # here comes a piece of code which sets the variable hostname 
     # used later - that part works fine (removed for clarity) 

     # get the status of port 135 and 12345 
     Open12345 = Open135 = False 
     for el_port in el_host.findall("ports/port"): 
      # we are now looping through the <port> records for a given <host> 
      if el_port.attrib["portid"] == "135": 
       Open135 = el_host.find("ports/port/state").attrib["state"] == "open" 
      if el_port.attrib["portid"] == "12345": 
       Open12345 = el_host.find("ports/port/state").attrib["state"] == "open" 
       # I want to get for port 12345 the description, so I search 
       # for <service> within a given port - only 12345 in my case 
       # I just search the first one as there is only one 
       # this is the place I am not sure I get right 
       el_service = el_host.find("ports/port/service") 
       if el_service.get("product") is not None: 
        Type12345 = el_host.find("ports/port/service").attrib["product"] 

     if Open12345: 
      print "%s %s \"%s\"\n" % (scanTime,hostname,Type12345) 
     if not Open12345 and Open135: 
      print "%s %s \"%s\"\n" % (scanTime,hostname,"NO_OfficeScan") 

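The place I am unsure about is marked in the comments. With this code I always match Microsoft Windows RPC, as if I were inside the record for port 135 (which comes first in the XML file, before port 12345).

I believe the problem lies somewhere in my understanding of the find function. It probably matches everything, regardless of where I currently am. In other words, there is no recursion (as far as I can tell).

In that case, how can I code the idea of "get the service name when you are inside the record for port 12345"?

Thank you.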


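EDIT & SOLUTION:

I found the problem in my code. I am reposting the whole script in case someone stumbles upon this question one day (the input is nmap output, so it may be worth reusing for someone, which explains the large chunk of code below):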

#!/usr/bin/python 

from lxml import etree 
import time 
import argparse 

parser = argparse.ArgumentParser() 
parser.add_argument("file", help="XML file to parse") 
args = parser.parse_args() 


scanTime = str(int(time.time())) 
d = etree.parse(args.file) 

f = open("OfficeScanComplianceDSCampus."+scanTime,"w") 
print "Parsing "+ args.file 

# find all hosts records 
for el_host in d.findall("host"): 
    # only process hosts UP 
    if el_host.find("status").attrib["state"] =="up": 
     # get the first hostname if it exists, otherwise IP 
     el_hostname = el_host.find("hostnames/hostname") 
     if el_hostname is not None: 
      hostname = el_hostname.attrib["name"] 
     else: 
       hostname = el_host.find("address").attrib["addr"] 

     # get the status of port 135 and 12345 
     Open12345 = Open135 = False 
     for el_port in el_host.findall("ports/port"): 
      # we are now looping through the <port> records for a given <host> 
      if el_port.attrib["portid"] == "135": 
       Open135 = el_port.find("state").attrib["state"] == "open" 
      if el_port.attrib["portid"] == "12345": 
       Open12345 = el_port.find("state").attrib["state"] == "open" 
       # if port open get info about service 
       if Open12345: 
        el_service = el_port.find("service") 
        if el_service is None: 
         Type12345 = "UNKNOWN" 
        elif el_service.get("method") == "probed": 
         Type12345 = el_service.get("product") 
        else: 
         Type12345 = "UNKNOWN" 


     if Open12345: 
      f.write("%s %s \"%s\"\n" % (scanTime,hostname,Type12345)) 
     if not Open12345 and Open135: 
      f.write("%s %s \"%s\"\n" % (scanTime,hostname,"NO_OfficeScan")) 
     if Open12345 and not Open135: 
      f.write("%s %s \"%s\"\n" % (scanTime,hostname,"Non-Windows with 12345")) 

f.close() 
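
The change that matters in the script above is that the lookups inside the loop start from `el_port` instead of `el_host`. A path passed to `find` is resolved relative to the element it is called on, so `el_host.find("ports/port/service")` always returns the first `<service>` of the host, whichever `<port>` the loop is currently on. Here is a minimal sketch of the difference, using a trimmed copy of the sample record. The helper names `products_from_host` and `products_from_port` are made up for illustration, and the fallback to the standard library's `xml.etree.ElementTree` is an assumption added so the snippet runs without lxml installed:

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption: the standard-library module is enough for these simple paths.
    import xml.etree.ElementTree as etree

# Trimmed copy of the second <host> record from the sample scan.
SAMPLE = """
<host>
  <ports>
    <port protocol="tcp" portid="135">
      <state state="open"/>
      <service name="msrpc" product="Microsoft Windows RPC"/>
    </port>
    <port protocol="tcp" portid="12345">
      <state state="open"/>
      <service name="http" product="Trend Micro OfficeScan Antivirus http config"/>
    </port>
  </ports>
</host>
""".strip()

el_host = etree.fromstring(SAMPLE)


def products_from_host(el_host):
    """First version: the path starts at <host> on every iteration."""
    return [el_host.find("ports/port/service").get("product")
            for el_port in el_host.findall("ports/port")]


def products_from_port(el_host):
    """Fixed version: the path starts at the <port> being looped over."""
    return [el_port.find("service").get("product")
            for el_port in el_host.findall("ports/port")]


print(products_from_host(el_host))  # the msrpc product, twice
print(products_from_port(el_host))  # one product per port
```

The first list repeats "Microsoft Windows RPC" for both ports, which is the symptom described in the question. The second list gives each port its own product.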

I will also explore the XPath idea suggested by Dikei and Ignacio Vazquez-Abrams.

Thank you all!


Why not use an XPath expression to check whether the nodes you care about exist? –

Answers


This should be quite easy with XPath:

from lxml import etree 
d = etree.parse("10.233.85.0.22.xml") 

d.xpath('//port[@portid="12345"]/service/@name') # returns the service names for ports with portid=12345 
d.xpath('//port[@portid="12345"]/service/@product') # returns the products for ports with portid=12345 

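I just had a look at the [lxml](http://lxml.de/xpathxslt.html) documentation to understand this concept, which is new to me. It looks interesting, but what comes back is a list of all the service names (or products), regardless of where they come from, the host in particular. I will dig into relative addressing in XPath and see whether that helps. Thanks for the pointer. – WoJ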
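
One way to keep each result tied to its host is to run the query relative to each `<host>` element instead of against the whole document. This is a sketch under assumptions, not something either commenter proposed. The helper name `product_by_host` and the second host's address (10.232.49.3) are made up for illustration. It uses the attribute-predicate form of `find`, which both lxml and the standard library's `xml.etree.ElementTree` accept:

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption: the standard-library module is enough for these simple paths.
    import xml.etree.ElementTree as etree

# Two trimmed hosts; the second one is hypothetical and has no port 12345.
SAMPLE = """
<nmaprun>
  <host>
    <address addr="10.232.49.2" addrtype="ipv4"/>
    <ports>
      <port protocol="tcp" portid="135">
        <service name="msrpc" product="Microsoft Windows RPC"/>
      </port>
      <port protocol="tcp" portid="12345">
        <service name="http" product="Trend Micro OfficeScan Antivirus http config"/>
      </port>
    </ports>
  </host>
  <host>
    <address addr="10.232.49.3" addrtype="ipv4"/>
    <ports>
      <port protocol="tcp" portid="135">
        <service name="msrpc" product="Microsoft Windows RPC"/>
      </port>
    </ports>
  </host>
</nmaprun>
""".strip()


def product_by_host(root, portid="12345"):
    """Map each host address to the product on the given port, or None."""
    result = {}
    for el_host in root.findall("host"):
        addr = el_host.find("address").get("addr")
        # The path is relative to el_host, so other hosts cannot match.
        el_service = el_host.find('ports/port[@portid="%s"]/service' % portid)
        result[addr] = None if el_service is None else el_service.get("product")
    return result


root = etree.fromstring(SAMPLE)
print(product_by_host(root))
```

With lxml, the same relative path can also be passed to the element's own `xpath()` method, for example `el_host.xpath('ports/port[@portid="12345"]/service/@product')`, which returns a list containing only that host's matches.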
