
I have a large file that I cannot open. I want to extract attributes and data from it with Nokogiri:

... more here 

<my_element attr1='123'> 
... a lot of text and elements here 
</my_element> 

<my_element attr1='33'> 
... a lot of text and elements here 
</my_element> 

... more here

Update:

I tried this:

#!/usr/bin/ruby
require "rubygems"
require "nokogiri"
require "debugger"
require "awesome_print"

file = ARGV[0]
reader = Nokogiri::XML::Reader(File.open(file))
reader.each do |node|
    if node.name == "PATDOC"
        debugger
        break
    end
end

But node.attributes returns {}. How can I extract the attributes and the inner text from the element?


2013-03-08 juanpastas

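Use an XML parser instead. It will make your life easier. – squiguy 2013-03-08 17:05:56

I have a very long file that I can't even open. Which parser can I use? I'm on OS X. – juanpastas 2013-03-08 17:09:50

Define "large" and "long". About 60 MB compressed. – 2013-03-10 16:49:58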

Well, you can do it with awk... but the recommended approach is an XML parser (XPath, whatever). Anyway:

awk 'BEGIN {FS="</*my_element[^>]+>"} {print $2, $3}' INPUTFILE

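Note: this is not a perfect solution; how well it works really depends on your whole input file. What it does is set the field separator to the tag and print the second and third "columns" of each input line. You may need to modify it.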
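
To make the field-splitting idea concrete, here is a rough Ruby equivalent. It is only a sketch: the sample line is made up, and the pattern uses `[^>]*` instead of the `[^>]+` in the awk command, so that a bare closing tag such as `</my_element>` is also treated as a separator.

```ruby
# Hypothetical sample line; the real file's layout may differ.
line = "<my_element attr1='123'> some text </my_element>"

# Treat both the opening and the closing my_element tag as separators.
TAG = %r{</?my_element[^>]*>}

fields = line.split(TAG)

# fields[0] is whatever precedes the opening tag (empty here),
# fields[1] is the element body. awk numbers fields from 1,
# so fields[1] corresponds to awk's $2.
puts fields[1]
```

Like the awk version, this works line by line, so an element whose body spans several lines is not captured as one unit.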


2013-03-08 17:09:58

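I don't understand this; it seems to extract all the 'my_element' elements. – juanpastas 2013-03-08 17:17:07

Normally we use Nokogiri to read the whole file and process it as a DOM. I wrapped the sample XML in another node to make it valid XML, and used a CSS accessor simply because CSS selectors are easier to read: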

require 'nokogiri' 

doc = Nokogiri::XML(<<EOT) 
<xml> 
    <my_element attr1='123'> a lot of text and elements here </my_element> 
    <my_element attr1='33'> a lot of text and elements here </my_element> 
</xml> 
EOT 

doc.search('my_element').map{ |n| 
    [ n['attr1'], n.children.text ] 
}

The result looks like:

[ 
    [0] [ 
     [0] "123", 
     [1] " a lot of text and elements here " 
    ], 
    [1] [ 
     [0] "33", 
     [1] " a lot of text and elements here " 
    ] 
]

If you can't use it that way:


2013-03-10 16:47:36
