2012-01-04

I want to parse a very large XML file (240 MB), and I have to use SAX so that the whole file is not loaded into memory. How do I use SAX with Nokogiri?

My XML looks like this:

<?xml version="1.0" encoding="utf-8"?> 
<hotels> 
    <hotel> 
    <hotelId>1568054</hotelId> 
    <hotelFileName>Der_Obere_Wirt_zum_Queri</hotelFileName> 
    <hotelName>"Der Obere Wirt" zum Queri</hotelName> 
    <rating>3</rating> 
    <cityId>34633</cityId> 
    <cityFileName>Andechs</cityFileName> 
    <cityName>Andechs</cityName> 
    <stateId>212</stateId> 
    <stateFileName>Bavaria</stateFileName> 
    <stateName>Bavaria</stateName> 
    <countryCode>DE</countryCode> 
    <countryFileName>Germany</countryFileName> 
    <countryName>Germany</countryName> 
    <imageId>51498149</imageId> 
    <Address>Georg Queri Ring 9</Address> 
    <minRate>85.9800</minRate> 
    <currencyCode>EUR</currencyCode> 
    <Latitude>48.009423000000</Latitude> 
    <Longitude>11.214504000000</Longitude> 
    <NumberOfReviews>16</NumberOfReviews> 
    <ConsumerRating>4.25</ConsumerRating> 
    <PropertyType>0</PropertyType> 
    <ChainID>0</ChainID> 
    <Facilities>1|3|5|8|22|27|45|49|53|56|64|66|67|139|202|209|213|256|</Facilities> 
    </hotel> 
    <hotel> 
    <hotelId>1658359</hotelId> 
    <hotelFileName>Seclusions_of_Yallingup</hotelFileName> 
    <hotelName>"Seclusions" of Yallingup</hotelName> 
    <rating>4</rating> 
    <cityId>72257</cityId> 
    <cityFileName>Yallingup</cityFileName> 
    <cityName>Yallingup</cityName> 
    <stateId>172</stateId> 
    <stateFileName>Western_Australia</stateFileName> 
    <stateName>Western Australia</stateName> 
    <countryCode>AU</countryCode> 
    <countryFileName>Australia</countryFileName> 
    <countryName>Australia</countryName> 
    <imageId>53234107</imageId> 
    <Address>58 Zamia Grove</Address> 
    <minRate>218.1825</minRate> 
    <currencyCode>AUD</currencyCode> 
    <Latitude>-33.691192000000</Latitude> 
    <Longitude>115.061938999999</Longitude> 
    <NumberOfReviews>0</NumberOfReviews> 
    <ConsumerRating>0</ConsumerRating> 
    <PropertyType>3</PropertyType> 
    <ChainID>0</ChainID> 
    <Facilities>3|6|13|14|21|22|28|39|40|41|51|53|54|56|57|58|65|66|141|191|202|204|209|210|211|292|</Facilities> 
    </hotel> 
    <hotel> 
    <hotelId>1491947</hotelId> 
    <hotelFileName>1_Melrose_Blvd</hotelFileName> 
    <hotelName>#1 Melrose Blvd</hotelName> 
    <rating>5</rating> 
    <cityId>964</cityId> 
    <cityFileName>Johannesburg</cityFileName> 
    <cityName>Johannesburg</cityName> 
    <stateId/> 
    <stateFileName/> 
    <stateName/> 
    <countryCode>ZA</countryCode> 
    <countryFileName>South_Africa</countryFileName> 
    <countryName>South Africa</countryName> 
    <imageId>46777171</imageId> 
    <Address>1 Melrose Boulevard Melrose Arch</Address> 
    <minRate/> 
    <currencyCode>ZAR</currencyCode> 
    <Latitude>-26.135656000000</Latitude> 
    <Longitude>28.067751000000</Longitude> 
    <NumberOfReviews>0</NumberOfReviews> 
    <ConsumerRating>0</ConsumerRating> 
    <PropertyType>9</PropertyType> 
    <ChainID>0</ChainID> 
    <Facilities>6|7|9|11|12|15|17|18|21|32|34|39|41|42|50|51|56|58|60|140|173|202|293|296|</Facilities> 
    </hotel> 
    <hotel> 
    <hotelId>1726938</hotelId> 
    <hotelFileName>1_Value_Inn_Clovis</hotelFileName> 
    <hotelName>#1 Value Inn Clovis</hotelName> 
    <rating>2</rating> 
    <cityId>28538</cityId> 
    <cityFileName>Clovis_New_Mexico</cityFileName> 
    <cityName>Clovis (New Mexico)</cityName> 
    <stateId>32</stateId> 
    <stateFileName>New_Mexico</stateFileName> 
    <stateName>New Mexico</stateName> 
    <countryCode>US</countryCode> 
    <countryFileName>United_States</countryFileName> 
    <countryName>United States</countryName> 
    <imageId/> 
    <Address>1720 Mabry</Address> 
    <minRate/> 
    <currencyCode>USD</currencyCode> 
    <Latitude>34.396549224853</Latitude> 
    <Longitude>-103.182769775390</Longitude> 
    <NumberOfReviews>0</NumberOfReviews> 
    <ConsumerRating>0</ConsumerRating> 
    <PropertyType>2</PropertyType> 
    <ChainID>0</ChainID> 
    <Facilities>6|7|8|18|21|22|27|41|50|52|56|222|281|292|</Facilities> 
    </hotel> 
</hotels> 

I tried this code:

require 'nokogiri'

class Wikihandler < Nokogiri::XML::SAX::Document

    def initialize
        # do one-time setup here, called as part of Class.new
    end

    def start_element(name, attributes = [])
        # check the element name here and create an active record object if appropriate
        if name == 'hotel'
            # attributes arrive as an array of [name, value] pairs
            a = Hash[attributes]
            puts attributes
            # more business...
        end
    end

    def characters(s)
        # save the characters that appear here and possibly use them in the current tag object
    end

    def end_element(name)
        # check the tag name and possibly use the characters you've collected
        # and save your activerecord object now
    end

end

parser = Nokogiri::XML::SAX::Parser.new(Wikihandler.new)
parser.parse_file('HotelCombinedXml/Hotels_All.xml')

I can access the tags themselves, but how can I access their content?

Answer

The content is delivered to Wikihandler#characters. You can do something like this:

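require 'nokogiri'

class MyDocument < Nokogiri::XML::SAX::Document
    attr_accessor :is_name

    def initialize
        @is_name = false
        @buffer = ""
    end

    def end_document
        puts "the document has ended"
    end

    def start_element(name, attributes = [])
        # Start collecting text only when a <hotelName> element opens.
        @is_name = name.eql?("hotelName")
        @buffer = ""
    end

    def characters(string)
        # The parser may deliver one text node in several chunks, so accumulate them.
        @buffer += string if @is_name
    end

    def end_element(name)
        # The element is complete: print the collected text and stop collecting.
        return unless name.eql?("hotelName")
        text = @buffer.strip
        puts "Name: #{text}" unless text.empty?
        @is_name = false
    end
end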

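However, if you want to make your life easier, I suggest checking out sax-machine. It adds some nice features and an (IMHO) friendlier interface on top of Nokogiri's SAX parser. Here is some sample code with specs: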

require "sax-machine" 
require "rspec" 

XML = <<XML 
<?xml version="1.0" encoding="utf-8"?> 
<hotels> 
    <hotel> 
    <hotelId>1568054</hotelId> 
    <hotelFileName>Der_Obere_Wirt_zum_Queri</hotelFileName> 
    <hotelName>"Der Obere Wirt" zum Queri</hotelName> 
    <rating>3</rating> 
    <cityId>34633</cityId> 
    <cityFileName>Andechs</cityFileName> 
    <cityName>Andechs</cityName> 
    <stateId>212</stateId> 
    <stateFileName>Bavaria</stateFileName> 
    <stateName>Bavaria</stateName> 
    <countryCode>DE</countryCode> 
    <countryFileName>Germany</countryFileName> 
    <countryName>Germany</countryName> 
    <imageId>51498149</imageId> 
    <Address>Georg Queri Ring 9</Address> 
    <minRate>85.9800</minRate> 
    <currencyCode>EUR</currencyCode> 
    <Latitude>48.009423000000</Latitude> 
    <Longitude>11.214504000000</Longitude> 
    <NumberOfReviews>16</NumberOfReviews> 
    <ConsumerRating>4.25</ConsumerRating> 
    <PropertyType>0</PropertyType> 
    <ChainID>0</ChainID> 
    <Facilities>1|3|5|8|22|27|45|49|53|56|64|66|67|139|202|209|213|256|</Facilities> 
    </hotel> 
    <hotel> 
    <hotelId>1658359</hotelId> 
    <hotelFileName>Seclusions_of_Yallingup</hotelFileName> 
    <hotelName>"Seclusions" of Yallingup</hotelName> 
    <rating>4</rating> 
    <cityId>72257</cityId> 
    <cityFileName>Yallingup</cityFileName> 
    <cityName>Yallingup</cityName> 
    <stateId>172</stateId> 
    <stateFileName>Western_Australia</stateFileName> 
    <stateName>Western Australia</stateName> 
    <countryCode>AU</countryCode> 
    <countryFileName>Australia</countryFileName> 
    <countryName>Australia</countryName> 
    <imageId>53234107</imageId> 
    <Address>58 Zamia Grove</Address> 
    <minRate>218.1825</minRate> 
    <currencyCode>AUD</currencyCode> 
    <Latitude>-33.691192000000</Latitude> 
    <Longitude>115.061938999999</Longitude> 
    <NumberOfReviews>0</NumberOfReviews> 
    <ConsumerRating>0</ConsumerRating> 
    <PropertyType>3</PropertyType> 
    <ChainID>0</ChainID> 
    <Facilities>3|6|13|14|21|22|28|39|40|41|51|53|54|56|57|58|65|66|141|191|202|204|209|210|211|292|</Facilities> 
    </hotel> 
    <hotel> 
    <hotelId>1491947</hotelId> 
    <hotelFileName>1_Melrose_Blvd</hotelFileName> 
    <hotelName>#1 Melrose Blvd</hotelName> 
    <rating>5</rating> 
    <cityId>964</cityId> 
    <cityFileName>Johannesburg</cityFileName> 
    <cityName>Johannesburg</cityName> 
    <stateId/> 
    <stateFileName/> 
    <stateName/> 
    <countryCode>ZA</countryCode> 
    <countryFileName>South_Africa</countryFileName> 
    <countryName>South Africa</countryName> 
    <imageId>46777171</imageId> 
    <Address>1 Melrose Boulevard Melrose Arch</Address> 
    <minRate/> 
    <currencyCode>ZAR</currencyCode> 
    <Latitude>-26.135656000000</Latitude> 
    <Longitude>28.067751000000</Longitude> 
    <NumberOfReviews>0</NumberOfReviews> 
    <ConsumerRating>0</ConsumerRating> 
    <PropertyType>9</PropertyType> 
    <ChainID>0</ChainID> 
    <Facilities>6|7|9|11|12|15|17|18|21|32|34|39|41|42|50|51|56|58|60|140|173|202|293|296|</Facilities> 
    </hotel> 
    <hotel> 
    <hotelId>1726938</hotelId> 
    <hotelFileName>1_Value_Inn_Clovis</hotelFileName> 
    <hotelName>#1 Value Inn Clovis</hotelName> 
    <rating>2</rating> 
    <cityId>28538</cityId> 
    <cityFileName>Clovis_New_Mexico</cityFileName> 
    <cityName>Clovis (New Mexico)</cityName> 
    <stateId>32</stateId> 
    <stateFileName>New_Mexico</stateFileName> 
    <stateName>New Mexico</stateName> 
    <countryCode>US</countryCode> 
    <countryFileName>United_States</countryFileName> 
    <countryName>United States</countryName> 
    <imageId/> 
    <Address>1720 Mabry</Address> 
    <minRate/> 
    <currencyCode>USD</currencyCode> 
    <Latitude>34.396549224853</Latitude> 
    <Longitude>-103.182769775390</Longitude> 
    <NumberOfReviews>0</NumberOfReviews> 
    <ConsumerRating>0</ConsumerRating> 
    <PropertyType>2</PropertyType> 
    <ChainID>0</ChainID> 
    <Facilities>6|7|8|18|21|22|27|41|50|52|56|222|281|292|</Facilities> 
    </hotel> 
</hotels> 
XML 

class Hotel 
    include SAXMachine 
    element :hotelId, :as => :id 
    element :hotelName, :as => :name 
end 

class Wikihandler 
    include SAXMachine 
    elements :hotel, :as => :hotels, :class => Hotel 
end 

describe Wikihandler do
    before(:all) do
        @parser = Wikihandler.new
        @parser.parse XML
    end

    it "should parse the proper number of hotels" do
        expect(@parser.hotels.count).to eq 4
    end

    it "should parse the hotel id of each entry" do
        expect(@parser.hotels[0].id).to eq "1568054"
    end

    it "should parse the hotel name of each entry" do
        expect(@parser.hotels[0].name).to eq '"Der Obere Wirt" zum Queri'
    end
end

Thanks for your help! – Sebastien 2012-01-11 10:11:21

sax-machine will still try to read the whole document first, which doesn't work for larger files. :( – unflores 2013-12-03 14:30:22

You saved my day! Thanks a lot! – sadfuzzy 2015-03-26 09:05:21