2015-12-30 45 views
1

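Split an XML file on the command line based on an element attribute

I have a huge XML file with various nodes and attributes. I use grep -c to count the products of a particular type. Here is what I have done so far: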

grep -c "</products>" products.xml # output: 200023

grep -c '<product type="cloths"' products.xml # output: 8039

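So I need to extract all products of type cloths, without any of the other product types, into a new.xml file with the tree shown here, so that I can import new.xml into a database: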

<?xml version="1.0"?> 
<!DOCTYPE catalog SYSTEM "catalog.dtd"> 
<catalog> 
    <product type="cloths" product_image="cardigan.jpg"> 
     <catalog_item gender="Men's"> 
     <item_number>QWZ5671</item_number> 
     <price>39.95</price> 
     <size description="Medium"> 
      <color_swatch image="red_cardigan.jpg">Red</color_swatch> 
      <color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch> 
     </size> 
     <size description="Large"> 
      <color_swatch image="red_cardigan.jpg">Red</color_swatch> 
      <color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch> 
     </size> 
     </catalog_item> 
     <catalog_item gender="Women's"> 
     <item_number>RRX9856</item_number> 
     <price>42.50</price> 
     <size description="Small"> 
      <color_swatch image="red_cardigan.jpg">Red</color_swatch> 
      <color_swatch image="navy_cardigan.jpg">Navy</color_swatch> 
      <color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch> 
     </size> 
     <size description="Medium"> 
      <color_swatch image="red_cardigan.jpg">Red</color_swatch> 
      <color_swatch image="navy_cardigan.jpg">Navy</color_swatch> 
      <color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch> 
      <color_swatch image="black_cardigan.jpg">Black</color_swatch> 
     </size> 
     <size description="Large"> 
      <color_swatch image="navy_cardigan.jpg">Navy</color_swatch> 
      <color_swatch image="black_cardigan.jpg">Black</color_swatch> 
     </size> 
     <size description="Extra Large"> 
      <color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch> 
      <color_swatch image="black_cardigan.jpg">Black</color_swatch> 
     </size> 
     </catalog_item> 
    </product> 
</catalog> 
+1

XSLT would be ideal for this. Is it an option for you? – kjhughes

+0

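Unfortunately, I don't have an XSLT for the huge file I have. I'm not sure whether there is any way to generate such a file! Sorry, I'm new to the XML world. Thanks – Mtaly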

+0

Writing the XSLT for your task would be simple. If you had the XSLT code, would you be able to run XSLT? – kjhughes

Answers

0

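From the sample shown and the command-line tag applied to the question, it looks like you want a pure command-line solution. One option is to use xmllint to run an XPath query that selects only the products of type "cloths". The query is single-quoted so that the shell does not interpret the square brackets.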

> xmllint --xpath '/catalog/product[@type="cloths"]' products.xml
<product type="cloths" product_image="cardigan.jpg"> 
     <catalog_item gender="Men's"> 
     <item_number>QWZ5671</item_number> 
     <price>39.95</price> 
     <size description="Medium"> 
      <color_swatch image="red_cardigan.jpg">Red</color_swatch> 
      <color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch> 
     </size> 
     <size description="Large"> 
      <color_swatch image="red_cardigan.jpg">Red</color_swatch> 
      <color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch> 
     </size> 
     </catalog_item> 
     <catalog_item gender="Women's"> 
     <item_number>RRX9856</item_number> 
     <price>42.50</price> 
     <size description="Small"> 
      <color_swatch image="red_cardigan.jpg">Red</color_swatch> 
      <color_swatch image="navy_cardigan.jpg">Navy</color_swatch> 
      <color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch> 
     </size> 
     <size description="Medium"> 
      <color_swatch image="red_cardigan.jpg">Red</color_swatch> 
      <color_swatch image="navy_cardigan.jpg">Navy</color_swatch> 
      <color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch> 
      <color_swatch image="black_cardigan.jpg">Black</color_swatch> 
     </size> 
     <size description="Large"> 
      <color_swatch image="navy_cardigan.jpg">Navy</color_swatch> 
      <color_swatch image="black_cardigan.jpg">Black</color_swatch> 
     </size> 
     <size description="Extra Large"> 
      <color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch> 
      <color_swatch image="black_cardigan.jpg">Black</color_swatch> 
     </size> 
     </catalog_item> 
    </product> 

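Note, however, that this does not produce a well-formed XML document. The output is just the node set selected by the XPath query. You can wrap it in a little extra scripting to produce a complete XML document.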

printf '<?xml version="1.0"?>\n' > cloths.xml 
printf '<!DOCTYPE catalog SYSTEM "catalog.dtd">\n' >> cloths.xml 
printf '<catalog>\n' >> cloths.xml 
xmllint --xpath '/catalog/product[@type="cloths"]' products.xml >> cloths.xml
printf '\n</catalog>\n' >> cloths.xml 

I omitted error checking between these commands to keep the code sample simple.

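You also mentioned that the input XML file is huge. Depending on how big it is, this approach may not scale well in terms of memory consumption. If that is a problem, you may need a streaming approach that reads a small part of the input document at a time and processes it incrementally. That would probably take you out of the realm of the command line and into custom coding. One example of a streaming XML API is StAX in Java. Here is a tutorial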
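
The streaming idea can also be tried without StAX. What follows is a minimal sketch, not part of the original answer, that swaps in Python's xml.etree.ElementTree.iterparse as the streaming parser. It assumes that product elements are direct children of the root element, as in the sample from the question. The function name extract_products and the HEADER and FOOTER constants are made up for illustration; the header simply mirrors what the printf commands write.

```python
import io
import xml.etree.ElementTree as ET

# Wrapper text mirroring the printf commands in the answer.
HEADER = '<?xml version="1.0"?>\n<!DOCTYPE catalog SYSTEM "catalog.dtd">\n<catalog>\n'
FOOTER = "</catalog>\n"


def extract_products(source, dest, wanted_type="cloths"):
    """Copy <product type=wanted_type> elements from source to dest.

    source: path or binary file object holding the input XML.
    dest:   text file object that receives the new document.
    Returns the number of products written.
    """
    written = 0
    dest.write(HEADER)
    events = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(events)  # the first event is the start of the root element
    for event, elem in events:
        if event != "end" or elem.tag != "product":
            continue
        if elem.get("type") == wanted_type:
            elem.tail = None  # ignore whitespace that followed the element
            dest.write(ET.tostring(elem, encoding="unicode"))
            dest.write("\n")
            written += 1
        # Assumption: products are direct children of the root, so clearing
        # the root releases every product parsed so far.
        root.clear()
    dest.write(FOOTER)
    return written


if __name__ == "__main__":
    sample = b"""<catalog>
  <product type="cloths"><price>39.95</price></product>
  <product type="shoes"><price>59.00</price></product>
</catalog>"""
    out = io.StringIO()
    print(extract_products(io.BytesIO(sample), out))
    print(out.getvalue())
```

For a real file, the call would look something like extract_products("products.xml", open("cloths.xml", "w", encoding="utf-8")). Only one product subtree is held in memory at a time, which is the point of the streaming approach.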

+0

Thanks. But xmllint doesn't work: it shows 'XPath set is empty'. I have tried to find out what the problem is, with no luck. – Mtaly

+0

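@Mtaly, 'XPath set is empty' means that the XPath query executed but matched nothing, so the result set is empty. That would not happen if your XML were exactly what you gave in the question. See this [GitHub gist](https://gist.github.com/cnauroth/ffc95773560612e5bcb1) for a successful demonstration of the command. The script embeds the XML I'm expecting as a heredoc, so you will get predictable results. If your XML actually differs from what you described, you may need to adjust the XPath query. –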
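
One way to see how the real file differs from the sample is to list where elements with type="cloths" actually occur. The following is a rough diagnostic sketch in Python, not something from the thread; the function name paths_with_attribute is hypothetical. It streams the file and counts the element paths whose attribute matches.

```python
import collections
import io
import xml.etree.ElementTree as ET


def paths_with_attribute(source, name="type", value="cloths"):
    """Count the element paths whose attribute `name` equals `value`.

    source: path or binary file object holding the input XML.
    Returns a Counter mapping a path such as "/catalog/product" to a count.
    """
    counts = collections.Counter()
    stack = []  # tags of the currently open elements, root first
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            # Attributes are already available when the start event fires.
            if elem.get(name) == value:
                counts["/" + "/".join(stack)] += 1
        else:
            stack.pop()
            elem.clear()  # discard children that are no longer needed


    return counts


if __name__ == "__main__":
    sample = b'<catalog><products><product type="cloths"/></products></catalog>'
    for path, count in paths_with_attribute(io.BytesIO(sample)).items():
        print(count, path)
```

If this reported /catalog/products/product, for example, the query would become /catalog/products/product[@type="cloths"]. ElementTree writes namespaced tags as {uri}name, so a path such as /{uri}catalog/{uri}product would suggest that the file declares a default namespace, which an XPath query with plain element names does not match.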