2017-04-25 128 views
0

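Parsing KML with Beautiful Soup

I am unable to parse a KML file (XML) with Beautiful Soup. This snippet should iterate a non-zero number of times at every level. With my sample, the lxml parser returns 2 tables and the xml parser returns 0, but the count should be 3.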

from bs4 import BeautifulSoup

url = "sample.kml"

with open(url, 'r') as page:
    soup = BeautifulSoup(page, "lxml")

    tables = soup.find_all('table')
    print(len(tables))

    for table in tables:
        rows = table.find_all('tr')

        for row in rows:
            cols = row.find_all('td')

This first sample script returns 2 tables instead of 3 with lxml, and 0 with the xml parser.

from bs4 import BeautifulSoup

with open("sample.kml", 'r') as page:
    soup = BeautifulSoup(page, "xml")

    # Step through the tree from the Placemark tags instead
    placemark = soup.find_all('Placemark')
    print(len(placemark))

    for place in placemark:
        tables = place.find_all('table')
        print(len(tables))

        for table in tables:
            rows = table.find_all('tr')

            for row in rows:
                cols = row.find_all('td')

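When I first went through the file I started by looking for tables. len(tables) returned 2, which I know is wrong: the sample should have 3 table trees, and the full file about 92,000. That led me to start stepping through the tree from another tag, Placemark (which returns the correct count), and then to look for the rows and columns inside each one. To my surprise, they all returned zero.

I played around with different parsers and eventually decided that xml is the appropriate one, but it still cannot find the right number of tables, even though I can find them with re.search or a Sublime Text search. That in turn led me to check whether the tables might be encapsulated in something, but to no avail. I am stuck and cannot find a way to reach the 92,000 tables with the find_all("TAG") method. Any suggestions?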

Sample KML

<?xml version="1.0" encoding="UTF-8"?> 
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:kml="http://www.opengis.net/kml/2.2" xmlns:atom="http://www.w3.org/2005/Atom"> 
<Document id="laaSECS" xsi:schemaLocation="http://www.opengis.net/kml/2.2 http://schemas.opengis.net/kml/2.2.0/ogckml22.xsd http://www.google.com/kml/ext/2.2 http://code.google.com/apis/kml/schema/kml22gx.xsd"> 
    <name>laaSECS</name> 
    <Snippet maxLines="0"></Snippet> 
    <Style id="PolyStyle00"> 
     <LabelStyle> 
      <color>00000000</color> 
      <scale>0</scale> 
     </LabelStyle> 
     <LineStyle> 
      <color>ff7f5555</color> 
      <width>0.2</width> 
     </LineStyle> 
     <PolyStyle> 
      <color>ffc5d9fa</color> 
      <fill>0</fill> 
     </PolyStyle> 
    </Style> 
    <Style id="PolyStyle000"> 
     <LabelStyle> 
      <color>00000000</color> 
      <scale>0</scale> 
     </LabelStyle> 
     <LineStyle> 
      <color>ff7f5555</color> 
      <width>0.2</width> 
     </LineStyle> 
     <PolyStyle> 
      <color>ffc5d9fa</color> 
      <fill>0</fill> 
     </PolyStyle> 
    </Style> 
    <StyleMap id="PolyStyle001"> 
     <Pair> 
      <key>normal</key> 
      <styleUrl>#PolyStyle00</styleUrl> 
     </Pair> 
     <Pair> 
      <key>highlight</key> 
      <styleUrl>#PolyStyle000</styleUrl> 
     </Pair> 
    </StyleMap> 
    <Folder id="FeatureLayer0"> 
     <name>laaSECS</name> 
     <Snippet maxLines="0"></Snippet> 
     <Placemark id="ID_00000"> 
      <name>AL</name> 
      <Snippet maxLines="0"></Snippet> 
      <description><![CDATA[<html xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:msxsl="urn:schemas-microsoft-com:xslt"> 

<head> 

<META http-equiv="Content-Type" content="text/html"> 

<meta http-equiv="content-type" content="text/html; charset=UTF-8"> 

</head> 

<body style="margin:0px 0px 0px 0px;overflow:auto;background:#FFFFFF;"> 

<table style="font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-collapse:collapse;padding:3px 3px 3px 3px"> 

<tr style="text-align:center;font-weight:bold;background:#9CBCE2"> 

<td>AL</td> 

</tr> 

<tr> 

<td> 

<table style="font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-spacing:0px; padding:3px 3px 3px 3px"> 

<tr> 

<td>FID</td> 

<td>0</td> 

</tr> 

<tr bgcolor="#D4E4F3"> 

<td>STATE</td> 

<td>AL</td> 

</tr> 

<tr> 

<td>MER</td> 

<td>25</td> 

</tr> 

<tr bgcolor="#D4E4F3"> 

<td>TWP</td> 

<td>22</td> 

</tr> 

<tr> 

<td>TDIR</td> 

<td>N</td> 

</tr> 

<tr bgcolor="#D4E4F3"> 

<td>RNG</td> 

<td>4</td> 

</tr> 

<tr> 

<td>RDIR</td> 

<td>W</td> 

</tr> 

<tr bgcolor="#D4E4F3"> 

<td>SEC</td> 

<td>24</td> 

</tr> 

<tr> 

<td>MODDATE</td> 

<td>20050311</td> 

</tr> 

<tr bgcolor="#D4E4F3"> 

<td>DATUM</td> 

<td>NAD27</td> 

</tr> 

<tr> 

<td>SOURCE</td> 

<td>WhiteStar</td> 

</tr> 

<tr bgcolor="#D4E4F3"> 

<td>MTR</td> 

<td>25 22.0N 4.0W</td> 

</tr> 

</table> 

</td> 

</tr> 

</table> 

</body> 

</html>]]></description> 
      <styleUrl>#PolyStyle001</styleUrl> 
      <MultiGeometry> 
       <Polygon> 
        <outerBoundaryIs> 
         <LinearRing> 
          <coordinates> 
           -88.35570867858526,32.86011073571817,0 -88.35570870147141,32.86253443065814,0 -88.35597594524225,32.86011537400984,0 -88.35570867858526,32.86011073571817,0 
          </coordinates> 
         </LinearRing> 
        </outerBoundaryIs> 
       </Polygon> 
      </MultiGeometry> 
     </Placemark> 
     <Placemark id="ID_00001"> 
      <name>AL</name> 
      <Snippet maxLines="0"></Snippet> 
      <description><![CDATA[<html xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:msxsl="urn:schemas-microsoft-com:xslt"> 

<head> 

<META http-equiv="Content-Type" content="text/html"> 

<meta http-equiv="content-type" content="text/html; charset=UTF-8"> 

</head> 

<body style="margin:0px 0px 0px 0px;overflow:auto;background:#FFFFFF;"> 

<table style="font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-collapse:collapse;padding:3px 3px 3px 3px"> 

<tr style="text-align:center;font-weight:bold;background:#9CBCE2"> 

<td>AL</td> 

</tr> 

<tr> 

<td> 

<table style="font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-spacing:0px; padding:3px 3px 3px 3px"> 

<tr> 

<td>FID</td> 

<td>1</td> 

</tr> 

<tr bgcolor="#D4E4F3"> 

<td>STATE</td> 

<td>AL</td> 

</tr> 

<tr> 

<td>MER</td> 

<td>25</td> 

</tr> 

<tr bgcolor="#D4E4F3"> 

<td>TWP</td> 

<td>22</td> 

</tr> 

<tr> 

<td>TDIR</td> 

<td>N</td> 

</tr> 

<tr bgcolor="#D4E4F3"> 

<td>RNG</td> 

<td>4</td> 

</tr> 

<tr> 

<td>RDIR</td> 

<td>W</td> 

</tr> 

<tr bgcolor="#D4E4F3"> 

<td>SEC</td> 

<td>25</td> 

</tr> 

<tr> 

<td>MODDATE</td> 

<td>20050311</td> 

</tr> 

<tr bgcolor="#D4E4F3"> 

<td>DATUM</td> 

<td>NAD27</td> 

</tr> 

<tr> 

<td>SOURCE</td> 

<td>WhiteStar</td> 

</tr> 

<tr bgcolor="#D4E4F3"> 

<td>MTR</td> 

<td>25 22.0N 4.0W</td> 

</tr> 

</table> 

</td> 

</tr> 

</table> 

</body> 

</html>]]></description> 
      <styleUrl>#PolyStyle001</styleUrl> 
      <MultiGeometry> 
       <Polygon> 
        <outerBoundaryIs> 
         <LinearRing> 
          <coordinates> 
           -88.35597594524225,32.86011537400984,0 -88.3567389068841,32.85292852502473,0 -88.35768486975799,32.84508568993779,0 -88.35570853700197,32.84511675513796,0 -88.35570867858526,32.86011073571817,0 -88.35597594524225,32.86011537400984,0 
          </coordinates> 
         </LinearRing> 
        </outerBoundaryIs> 
       </Polygon> 
      </MultiGeometry> 
     </Placemark> 
     <Placemark id="ID_00002"> 
      <name>AL</name> 
      <Snippet maxLines="0"></Snippet> 
      <description><![CDATA[<html xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:msxsl="urn:schemas-microsoft-com:xslt"> 

<head> 

<META http-equiv="Content-Type" content="text/html"> 

<meta http-equiv="content-type" content="text/html; charset=UTF-8"> 

</head> 

<body style="margin:0px 0px 0px 0px;overflow:auto;background:#FFFFFF;"> 

<table style="font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-collapse:collapse;padding:3px 3px 3px 3px"> 

<tr style="text-align:center;font-weight:bold;background:#9CBCE2"> 

<td>AL</td> 

</tr> 

<tr> 

<td> 

<table style="font-family:Arial,Verdana,Times;font-size:12px;text-align:left;width:100%;border-spacing:0px; padding:3px 3px 3px 3px"> 

<tr> 

<td>FID</td> 

<td>2</td> 

</tr> 

<tr bgcolor="#D4E4F3"> 

<td>STATE</td> 

<td>AL</td> 

</tr> 

<tr> 

<td>MER</td> 

<td>25</td> 

</tr> 

<tr bgcolor="#D4E4F3"> 

<td>TWP</td> 

<td>22</td> 

</tr> 

<tr> 

<td>TDIR</td> 

<td>N</td> 

</tr> 

<tr bgcolor="#D4E4F3"> 

<td>RNG</td> 

<td>4</td> 

</tr> 

<tr> 

<td>RDIR</td> 

<td>W</td> 

</tr> 

<tr bgcolor="#D4E4F3"> 

<td>SEC</td> 

<td>36</td> 

</tr> 

<tr> 

<td>MODDATE</td> 

<td>20050311</td> 

</tr> 

<tr bgcolor="#D4E4F3"> 

<td>DATUM</td> 

<td>NAD27</td> 

</tr> 

<tr> 

<td>SOURCE</td> 

<td>WhiteStar</td> 

</tr> 

<tr bgcolor="#D4E4F3"> 

<td>MTR</td> 

<td>25 22.0N 4.0W</td> 

</tr> 

</table> 

</td> 

</tr> 

</table> 

</body> 

</html>]]></description> 
      <styleUrl>#PolyStyle001</styleUrl> 
      <MultiGeometry> 
       <Polygon> 
        <outerBoundaryIs> 
         <LinearRing> 
          <coordinates> 
           -88.35768486975799,32.84508568993779,0 -88.35843183642189,32.83843382961495,0 -88.35914980106479,32.83165897171819,0 -88.35908878782671,32.83049899571662,0 -88.35570839957039,32.83056244880483,0 -88.35570853700197,32.84511675513796,0 -88.35768486975799,32.84508568993779,0 
          </coordinates> 
         </LinearRing> 
        </outerBoundaryIs> 
       </Polygon> 
      </MultiGeometry> 
     </Placemark> 
     <Placemark id="ID_00003"> 
      <name>AL</name> 
      <Snippet maxLines="0"></Snippet> 
      <description><![CDATA[<html xmlns:fo="http://www.w3.org/1999/XSL/Format" xmlns:msxsl="urn:schemas-microsoft-com:xslt"> 

Link to the original file: KML FILE

+3

Please provide a small portion of the XML that demonstrates the problem. A 175 MB file is not part of a [mcve]! – miken32

+0

OK @miken32, I cut it down to 500 lines –

Answers

2

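The root of your problem is that you have an XML document with HTML documents nested inside it. Trying to parse the whole thing as HTML does not work, because the HTML documents are stored in CDATA sections. So while this is valid XML, it is not even remotely valid HTML.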
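To see why the tables are invisible to an XML parser, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than Beautiful Soup, chosen only because it needs no extra packages, and the one-Placemark KML string is a hypothetical cut-down of the sample, not the real file. The description element comes back with no child elements, and the table markup is simply its text.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for sample.kml (not the real file).
KML = (
    '<kml xmlns="http://www.opengis.net/kml/2.2">'
    "<Placemark><description>"
    "<![CDATA[<table><tr><td>FID</td></tr></table>]]>"
    "</description></Placemark></kml>"
)

root = ET.fromstring(KML)
ns = {"kml": "http://www.opengis.net/kml/2.2"}
description = root.find(".//kml:description", ns)

# The CDATA section contributes no elements to the XML tree...
child_count = len(list(description))
print(child_count)

# ...its content is available only as a plain string.
print(description.text)
```

Since the markup is only character data at this level, a search for table elements in the XML tree has nothing to match.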

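To get around this, I parse the whole document as XML, extract each HTML section (as a string), and then parse that section as HTML. Note that, somewhat confusingly, lxml is an HTML parser but lxml-xml is an XML parser.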

from bs4 import BeautifulSoup as Soup

with open('sample.kml') as data:
    kml_soup = Soup(data, 'lxml-xml')  # Parse as XML

descriptions = kml_soup.find_all('description')
for description in descriptions:
    html_soup = Soup(description.text, 'lxml')  # Parse as HTML
    tables = html_soup.find_all('table')
    print(len(tables))
    for table in tables:
        rows = table.find_all('tr')

        for row in rows:
            cols = row.find_all('td')
            ...

For the sample you provided, there are six tables. The code above prints "2" three times, so it finds all six.
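As a follow-on sketch, the same two-step parse can turn each Placemark's inner table into a dictionary. This assumes every data row has exactly two cells (field name, then value), as in the sample. The function name placemark_attributes and the trimmed KML string are made up for illustration, and both bs4 and lxml need to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for sample.kml: one Placemark whose
# description holds an HTML document inside a CDATA section.
KML = """<kml xmlns="http://www.opengis.net/kml/2.2">
<Document>
  <Placemark id="ID_00000">
    <name>AL</name>
    <description><![CDATA[<html><body>
<table>
<tr><td>AL</td></tr>
<tr><td>
<table>
<tr><td>FID</td><td>0</td></tr>
<tr><td>STATE</td><td>AL</td></tr>
<tr><td>SEC</td><td>24</td></tr>
</table>
</td></tr>
</table>
</body></html>]]></description>
  </Placemark>
</Document>
</kml>"""


def placemark_attributes(kml_text):
    """Return one dict of field name -> value per Placemark (assumed layout)."""
    kml_soup = BeautifulSoup(kml_text, "lxml-xml")  # outer document: XML
    records = []
    for placemark in kml_soup.find_all("Placemark"):
        description = placemark.find("description")
        if description is None:
            continue
        html_soup = BeautifulSoup(description.text, "lxml")  # inner part: HTML
        record = {"id": placemark.get("id")}
        for row in html_soup.find_all("tr"):
            # recursive=False counts only this row's own cells, so the outer
            # row that wraps the nested table contributes a single cell.
            cells = row.find_all("td", recursive=False)
            if len(cells) == 2:
                key = cells[0].get_text(strip=True)
                record[key] = cells[1].get_text(strip=True)
        records.append(record)
    return records


records = placemark_attributes(KML)
print(records)
```

Rows with one cell (the title row and the row wrapping the nested table) are skipped, so only the name/value pairs end up in each record.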

+0

Wow, thank you. What tipped you off to this nested structure? I don't think I would have figured this out @supersam654 –

+1

@TylerCowan You can see the HTML is stored in CDATA sections; as far as XML is concerned, it is just text. That is why the HTML needs to be parsed separately. – miken32

+0

@miken32 Thanks, this has been very helpful –