2014-01-13 104 views
0

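Grabbing only the entries with a specific attribute value on a specific node

I'm using an Xml::Parser wrapper on top of Nokogiri::XML::Reader to extract entries from an XML file. I want to grab only the Property entries whose Property/PropertyID/Identification tag has OrganizationName == 'northsteppe', but I can't figure out the correct syntax to do this. Below is the whole rake task I have been building, followed by a sample node containing all of its information and tags. Any guidance would be greatly appreciated.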

================ UPDATE ===============

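In production the file I'm parsing is pulled with open-uri, because it comes from an external source. On my local machine I'm using a hard copy of an older version to speed things up during development, since the file is 300MB+. I tried using a SAX parser, but the logic was a bit too complicated for me to really grasp what was going on, and I ran into the same problem: restricting the results to the entries that have 'northsteppe' as the OrganizationName in the Identification tag. That said, I chose to attempt the same task with the current approach, and I'm able to grab almost all of the information I need. I'm only missing the last piece mentioned above.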

=============== AS SPECIFIC AS POSSIBLE =============

I feel that describing the exact task I'm trying to perform will help fill in any gaps. The task is as follows.

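Grab every Property in the XML file whose <Identification> tag has OrganizationName = 'northsteppe', then collect all of the corresponding information for each property individually and insert it into a hash. After all the information for a single property has been collected into that hash, it needs to be uploaded to the database as a single entry; the database is already structured the way it needs to be. Once that property has been inserted, the rake task moves on to the next Property entry whose <Identification> tag has OrganizationName = 'northsteppe' and repeats the process, until every property meeting that specification has been inserted into the database. The purpose is to let me quickly search the data for Northsteppe properties without bogging the system down with every property in the XML file.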

Eventually, I will use open-uri to pull the file from its external source, and run a cron job that executes this rake task every 6 hours and replaces the database.

================= CODE =================

namespace :db do 

# RAKE TASK DESCRIPTION 
desc "Fetch property information and insert it into the database" 

# RAKE TASK NAME  
task :print_properties => :environment do 

    require 'rubygems' 
    require 'nokogiri' 

    module Xml 
     class Parser 
     def initialize(node, &block) 
      @node = node 
      @node.each do 
      self.instance_eval &block 
      end 
     end 

     def name 
      @node.name 
     end 

     def inner_xml 
      @node.inner_xml.strip 
     end 

     def is_start? 
      @node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT 
     end 

     def is_end? 
      @node.node_type == Nokogiri::XML::Reader::TYPE_END_ELEMENT 
     end 

     def attribute(attribute) 
      @node.attribute(attribute) 
     end 

     def for_element(name, &block) 
      return unless self.name == name and is_start? 
      self.instance_eval &block 
     end 

     def inside_element(name=nil, &block) 
      return if @node.self_closing? 
      return unless name.nil? or (self.name == name and is_start?) 

      name = @node.name 
      depth = @node.depth 

      @node.each do 
      return if self.name == name and is_end? and @node.depth == depth 
      self.instance_eval &block 
      end 
     end 
     end 
    end 


    Xml::Parser.new(Nokogiri::XML::Reader(open("app/assets/xml/mits.xml"))) do 
     inside_element 'Property' do 

      # OPEN AND PARSE THE <PropertyID> TAG 
      inside_element 'PropertyID' do 

       inside_element 'Identification' do 
        puts attribute_nodes() 
       end 

       # OPEN AND PARSE THE <Address> TAG 
       inside_element 'Address' do 
        for_element 'AddressLine1' do puts "Street Address: #{inner_xml}" end 
        for_element 'City' do puts "City: #{inner_xml}" end 
        for_element 'PostalCode' do puts "Zipcode: #{inner_xml}" end 
       end 

      for_element 'MarketingName' do puts "Short Description: #{inner_xml}" end 
      end 

      # OPEN AND PARSE THE <Information> TAG 
      inside_element 'Information' do 
       for_element 'LongDescription' do puts "Long Description: #{inner_xml}" end 
       inside_element 'Rents' do 
        for_element 'StandardRent' do puts "Rent: #{inner_xml}" end 
       end 
      end 

      inside_element 'Fee' do 
       for_element 'ApplicationFee' do puts "Application Fee: #{inner_xml}" end 
      end 

      inside_element 'ILS_Identification' do 
       for_element 'Latitude' do puts "Latitude: #{inner_xml}" end 
       for_element 'Longitude' do puts "Longitude: #{inner_xml}" end 
      end 

     end 
    end 

end # END print_properties TASK 

end #END NAMESPACE 

And a sample of the XML:

<Property IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8"> 
<PropertyID> 
    <Identification IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8" OrganizationName="northsteppe" IDType="property"/> 
    <Identification IDValue="6e1e61523972d5f0e260e3d38eb488337424f21e" OrganizationName="northsteppe" IDType="Company"/> 
    <MarketingName>Spacious House Central Campus OSU, available fall</MarketingName> 
    <WebSite>http://northsteppe.appfolio.com/listings/listings/642da00e-9be3-4a7c-bd50-66a4f0d70af8</WebSite> 
    <Address AddressType="property"> 
    <Description>Address of Available Listing</Description> 
    <AddressLine1>1689 N 4th St </AddressLine1> 
    <City>Columbus</City> 
    <State>OH</State> 
    <PostalCode>43201</PostalCode> 
    <Country>US</Country> 
    </Address> 
    <Phone PhoneType="office"> 
    <PhoneNumber>(614) 299-4110</PhoneNumber> 
    </Phone> 
    <Email>[email protected]</Email> 
</PropertyID> 
<ILS_Identification ILS_IdentificationType="Apartment" RentalType="Market Rate"> 
    <Latitude>39.997694</Latitude> 
    <Longitude>-82.99903</Longitude> 
    <LastUpdate Month="11" Day="11" Year="2013"/> 
</ILS_Identification> 
<Information> 
    <StructureType>Standard</StructureType> 
    <UnitCount>1</UnitCount> 
    <ShortDescription>Spacious House Central Campus OSU, available fall</ShortDescription> 
    <LongDescription>One of our favorites! This great house is perfect for students or a single family. With huge living and sleeping rooms, there is plenty of space. The kitchen is totally modernized with new appliances, and the bathroom has been updated. Natural woodwork and brick accents are seen within the house, and the decorative mantles. Ceiling fans and mini-blinds are included, as well as a FREE stack washer and dryer. The front and side deck. On site parking available.</LongDescription> 
    <Rents> 
    <StandardRent>2000.00</StandardRent> 
    </Rents> 
    <PropertyAvailabilityURL>http://northsteppe.appfolio.com/listings/listings/642da00e-9be3-4a7c-bd50-66a4f0d70af8</PropertyAvailabilityURL> 
</Information> 
<Fee> 
    <ProrateType>Standard</ProrateType> 
    <LateType>Standard</LateType> 
    <LatePercent>0</LatePercent> 
    <LateMinFee>0</LateMinFee> 
    <LateFeePerDay>0</LateFeePerDay> 
    <NonRefundableHoldFee>0</NonRefundableHoldFee> 
    <AdminFee>0</AdminFee> 
    <ApplicationFee>30.00</ApplicationFee> 
    <BrokerFee>0</BrokerFee> 
</Fee> 
<Deposit DepositType="Security Deposit"> 
    <Amount AmountType="Actual"> 
    <ValueRange Exact="2000.00" Currency="USD"/> 
    </Amount> 
</Deposit> 
<Policy> 
    <Pet Allowed="false"/> 
</Policy> 
<Phase IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8"> 
    <Name/> 
    <Description/> 
    <UnitCount>1</UnitCount> 
    <RentableUnits>1</RentableUnits> 
    <TotalSquareFeet>0</TotalSquareFeet> 
    <RentableSquareFeet>0</RentableSquareFeet> 
</Phase> 
<Building IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8"> 
    <Name/> 
    <Description/> 
    <UnitCount>1</UnitCount> 
    <SquareFeet>0</SquareFeet> 
</Building> 
<Floorplan IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8"> 
    <Name/> 
    <UnitCount>1</UnitCount> 
    <Room RoomType="Bedroom"> 
    <Count>4</Count> 
    <Comment/> 
    </Room> 
    <Room RoomType="Bathroom"> 
    <Count>1</Count> 
    <Comment/> 
    </Room> 
    <SquareFeet Min="0" Max="0"/> 
    <MarketRent Min="2000" Max="2000"/> 
    <EffectiveRent Min="2000" Max="2000"/> 
</Floorplan> 
<ILS_Unit IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8"> 
    <Units> 
    <Unit> 
     <Identification IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8" OrganizationName="UL Portfolio"/> 
     <MarketingName>Spacious House Central Campus OSU, available fall</MarketingName> 
     <UnitBedrooms>4</UnitBedrooms> 
     <UnitBathrooms>1.0</UnitBathrooms> 
     <MinSquareFeet>0</MinSquareFeet> 
     <MaxSquareFeet>0</MaxSquareFeet> 
     <SquareFootType>internal</SquareFootType> 
     <UnitRent>2000.00</UnitRent> 
     <MarketRent>2000.00</MarketRent> 
     <Address AddressType="property"> 
     <AddressLine1>1689 N 4th St </AddressLine1> 
     <City>Columbus</City> 
     <PostalCode>43201</PostalCode> 
     <Country>US</Country> 
     </Address> 
    </Unit> 
    </Units> 
    <Availability> 
    <VacateDate Month="7" Day="23" Year="2014"/> 
    <VacancyClass>Unoccupied</VacancyClass> 
    <MadeReadyDate Month="7" Day="23" Year="2014"/> 
    </Availability> 
    <Amenity AmenityType="Other"> 
    <Description>All new stainless steel appliances! Refinished hardwood floors</Description> 
    </Amenity> 
    <Amenity AmenityType="Other"> 
    <Description>Ceramic tile</Description> 
    </Amenity> 
    <Amenity AmenityType="Other"> 
    <Description>Ceiling fans</Description> 
    </Amenity> 
    <Amenity AmenityType="Other"> 
    <Description>Wrap-around porch</Description> 
    </Amenity> 
    <Amenity AmenityType="Dryer"> 
    <Description>Free Washer and Dryer</Description> 
    </Amenity> 
    <Amenity AmenityType="Washer"> 
    <Description>Free Washer and Dryer</Description> 
    </Amenity> 
    <Amenity AmenityType="Other"> 
    <Description>off-street parking available</Description> 
    </Amenity> 
</ILS_Unit> 
<File Active="true" FileID="820982141"> 
    <FileType>Photo</FileType> 
    <Description>Unit Photo</Description> 
    <Name/> 
    <Caption/> 
    <Format>image/jpeg</Format> 
    <Src>http://pa.cdn.appfolio.com/northsteppe/images/31077069-6e81-4373-8a89-508c57585543/medium.jpg</Src> 
    <Width>360</Width> 
    <Height>300</Height> 
    <Rank>1</Rank> 
</File> 
<File Active="true" FileID="820982145"> 
    <FileType>Photo</FileType> 
    <Description>Unit Photo</Description> 
    <Name/> 
    <Caption/> 
    <Format>image/jpeg</Format> 
    <Src>http://pa.cdn.appfolio.com/northsteppe/images/84e1be40-96fd-4717-b75d-09b39231a762/medium.jpg</Src> 
    <Width>350</Width> 
    <Height>265</Height> 
    <Rank>2</Rank> 
</File> 
<File Active="true" FileID="820982149"> 
    <FileType>Photo</FileType> 
    <Description>Unit Photo</Description> 
    <Name/> 
    <Caption/> 
    <Format>image/jpeg</Format> 
    <Src>http://pa.cdn.appfolio.com/northsteppe/images/cd419635-c37f-4676-a43e-c72671a2a748/medium.jpg</Src> 
    <Width>350</Width> 
    <Height>265</Height> 
    <Rank>3</Rank> 
</File> 
<File Active="true" FileID="820982152"> 
    <FileType>Photo</FileType> 
    <Description>Unit Photo</Description> 
    <Name/> 
    <Caption/> 
    <Format>image/jpeg</Format> 
    <Src>http://pa.cdn.appfolio.com/northsteppe/images/6b68dbd5-2cde-477c-99d7-3ca33f03cce8/medium.jpg</Src> 
    <Width>350</Width> 
    <Height>265</Height> 
    <Rank>4</Rank> 
</File> 
<File Active="true" FileID="820982155"> 
    <FileType>Photo</FileType> 
    <Description>Unit Photo</Description> 
    <Name/> 
    <Caption/> 
    <Format>image/jpeg</Format> 
    <Src>http://pa.cdn.appfolio.com/northsteppe/images/17b6c7c0-686c-4e46-865b-11d80744354a/medium.jpg</Src> 
    <Width>350</Width> 
    <Height>265</Height> 
    <Rank>5</Rank> 
</File> 
<File Active="true" FileID="820982157"> 
    <FileType>Photo</FileType> 
    <Description>Unit Photo</Description> 
    <Name/> 
    <Caption/> 
    <Format>image/jpeg</Format> 
    <Src>http://pa.cdn.appfolio.com/northsteppe/images/3545ac8b-471f-404a-94b2-fcd00dd16e25/medium.jpg</Src> 
    <Width>350</Width> 
    <Height>265</Height> 
    <Rank>6</Rank> 
</File> 
<File Active="true" FileID="820982160"> 
    <FileType>Photo</FileType> 
    <Description>Unit Photo</Description> 
    <Name/> 
    <Caption/> 
    <Format>image/jpeg</Format> 
    <Src>http://pa.cdn.appfolio.com/northsteppe/images/02471172-2183-4bf1-a3d7-33415f902c1c/medium.jpg</Src> 
    <Width>350</Width> 
    <Height>265</Height> 
    <Rank>7</Rank> 
</File> 
    </Property> 
+0

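http://amolnpujari.wordpress.com/2012/03/31/reading_huge_xml-rb/ I also found Ox to be about 5 times faster than Nokogiri when reading large XML. In addition, I have a wrapper that lets you search large XML with Ox and iterate over the elements you specify. https://gist.github.com/amolpujari/5966431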

Answers

0

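So I found that the solution was a little gem called Saxerator (https://github.com/soulcutter/saxerator). It spares you from working with Nokogiri directly (thank you), does SAX parsing, has excellent documentation, and runs super fast. I encourage anyone who needs a SAX parser in the future to look into this little gem (pun intended) and save themselves the burden of dealing with all the dreadful Nokogiri documentation. The solution to my problem is below; it lives in my seeds.rb file.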

require 'saxerator' 

parser = Saxerator.parser(File.new("app/assets/xml/mits_snip.xml")) do |config| 
    config.put_attributes_in_hash! 
    config.symbolize_keys! 
end 


parser.for_tag(:Property).each do |property| 
    if property[:PropertyID][:Identification][1][:OrganizationName] == 'northsteppe' 
     property_attributes = { 
      street_address:  property[:PropertyID][:Address][:AddressLine1], 
      city:    property[:PropertyID][:Address][:City], 
      zipcode:   property[:PropertyID][:Address][:PostalCode], 
      short_description: property[:PropertyID][:MarketingName], 
      long_description: property[:Information][:LongDescription], 
      rent:    property[:Information][:Rents][:StandardRent], 
      application_fee: property[:Fee][:ApplicationFee], 
      vacancy_status:  property[:ILS_Unit][:Availability][:VacancyClass], 
      month_available: property[:ILS_Unit][:Availability][:MadeReadyDate][:Month], 
      latitude:   property[:ILS_Identification][:Latitude], 
      longitude:   property[:ILS_Identification][:Longitude] 

     } 

     # create! raises if the record is invalid, so reaching the next line means success 
     Property.create! property_attributes 
     puts "wahoo" 
    end 
end 
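
The `[1]` index in the check above assumes the matching Identification is always the second one listed under PropertyID. A minimal sketch of a more tolerant check follows. It assumes the parsed elements behave like ordinary Ruby hashes and arrays (plain hashes stand in for parser output here), and `organization_match?` is a hypothetical helper name, not part of Saxerator.

```ruby
# Hypothetical helper: true when any <Identification> under <PropertyID>
# carries the wanted OrganizationName, whether one element or several came back.
def organization_match?(property, organization)
  identifications = property.dig(:PropertyID, :Identification)
  # Wrap a single hash in an array by hand; Array(hash) would split it into pairs.
  list = identifications.is_a?(Array) ? identifications : [identifications].compact
  list.any? { |identification| identification[:OrganizationName] == organization }
end

# Plain hashes standing in for parsed <Property> elements (made-up sample data).
two_ids = {
  PropertyID: {
    Identification: [
      { IDType: 'property', OrganizationName: 'northsteppe' },
      { IDType: 'Company',  OrganizationName: 'northsteppe' }
    ]
  }
}
one_id = { PropertyID: { Identification: { OrganizationName: 'UL Portfolio' } } }

puts organization_match?(two_ids, 'northsteppe')
puts organization_match?(one_id, 'northsteppe')
```

With a helper like this, the condition in the loop would read `if organization_match?(property, 'northsteppe')` instead of indexing a fixed position.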

============== UPDATE =================

So I actually rewrote this task to work much better, and wanted to share it in case anyone else runs into this problem. This is my seeds.rb file.

require 'saxerator' 
require 'nokogiri' 
require 'open-uri' 
@company_name = 'northsteppe' 
parser = Saxerator.parser(File.new("../../shared/assets/xml/mits.xml")) do |config| 
    config.put_attributes_in_hash! 
    config.symbolize_keys! 
end 
puts "DELETED ALL EXISTING PROPERTIES" if Property.delete_all 
puts "PULLING RELEVANT XML ENTRIES" 
# Local counter; a top-level @@count raises a RuntimeError on Ruby 3.0+ 
count = 0 
file = File.new("../../shared/assets/xml/nsr_properties.xml",'w') 
properties = [] 
parser.for_tag(:Property).each do |property| 
    print '*' 
    if property[:PropertyID][:Identification][1][:OrganizationName] == @company_name 
     properties << property 
     count += 1 
    end 
    # break if count == 417 
end 
file.write(properties.to_xml) 
file.close 
puts "ADDING PROPERTIES TO THE DATABASE" 
nsr_properties = File.open("../../shared/assets/xml/nsr_properties.xml") 
doc = Nokogiri::XML(nsr_properties) 
doc.xpath("//saxerator-builder-hash-elements/saxerator-builder-hash-element").each do |property| 
    print '.' 
    @images =[] 
    property.xpath("File/File").each do |image| 
     @images << image.at_xpath("Src/text()").to_s 
    end 
    @amenities = [] 
    property.xpath("ILS-Unit/Amenity/Amenity").each do |amenity| 
     @amenities << amenity.at_xpath("Description/text()").to_s 
    end 
    information = { 
     "street_address" => property.at_xpath("PropertyID/Address/AddressLine1/text()").to_s, 
     "city" => property.at_xpath("PropertyID/Address/City/text()").to_s, 
     "zipcode" => property.at_xpath("PropertyID/Address/PostalCode/text()").to_s, 
     "short_description" => property.at_xpath("PropertyID/MarketingName/text()").to_s, 
     "long_description" => property.at_xpath("Information/LongDescription/text()").to_s, 
     "rent" => property.at_xpath("Information/Rents/StandardRent/text()").to_s, 
     "application_fee" => property.at_xpath("Fee/ApplicationFee/text()").to_s, 
     "bedrooms" => property.at_xpath("ILS-Unit/Units/Unit/UnitBedrooms/text()").to_s, 
     "bathrooms" => property.at_xpath("ILS-Unit/Units/Unit/UnitBathrooms/text()").to_s, 
     "vacancy_status" => property.at_xpath("ILS-Unit/Availability/VacancyClass/text()").to_s, 
     "month_available" => property.at_xpath("ILS-Unit/Availability/MadeReadyDate/@Month").to_s, 
     "latitude" => property.at_xpath("ILS-Identification/Latitude/text()").to_s, 
     "longitude" => property.at_xpath("ILS-Identification/Longitude/text()").to_s, 
     "images" => @images, 
     "amenities" => @amenities 
    } 
    Property.create!(information) 
end 
puts "DONE, WAHOO" 
1

Try this to start:

require 'nokogiri' 

doc = Nokogiri::XML(File.read('test.xml')) 
doc.search('*[OrganizationName="northsteppe"]') 
# => [#<Nokogiri::XML::Element:0x3fd82499131c name="Identification" attributes=[#<Nokogiri::XML::Attr:0x3fd8249912b8 name="IDValue" value="642da00e-9be3-4a7c-bd50-66a4f0d70af8">, #<Nokogiri::XML::Attr:0x3fd8249912a4 name="OrganizationName" value="northsteppe">, #<Nokogiri::XML::Attr:0x3fd824991290 name="IDType" value="property">]>, #<Nokogiri::XML::Element:0x3fd824990a70 name="Identification" attributes=[#<Nokogiri::XML::Attr:0x3fd824990a0c name="IDValue" value="6e1e61523972d5f0e260e3d38eb488337424f21e">, #<Nokogiri::XML::Attr:0x3fd8249909f8 name="OrganizationName" value="northsteppe">, #<Nokogiri::XML::Attr:0x3fd8249909e4 name="IDType" value="Company">]>] 

To make what Nokogiri found a bit more readable:

puts doc.search('*[OrganizationName="northsteppe"]').map{ |n| n.to_xml } 
# >> <Identification IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8" OrganizationName="northsteppe" IDType="property"/> 
# >> <Identification IDValue="6e1e61523972d5f0e260e3d38eb488337424f21e" OrganizationName="northsteppe" IDType="Company"/> 

I find that CSS is usually more readable than XPath. In this case it's a toss-up.


...the actual file is 300MB and loading it into a DOM crashes the server.

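If your server can't handle the file size, then your best bet is a SAX parser, which is about as memory-efficient as you can get. Here's a simple example using the sample XML: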

require 'nokogiri' 

class MyDocument < Nokogiri::XML::SAX::Document 
    @@tags = [] 

    def start_element name, attributes = [] 

    attribute_hash = Hash[attributes] 
    if (name == 'Identification') && (attribute_hash['OrganizationName'] == 'northsteppe') 
     @@tags << { 
     name: name, 
     attributes: attribute_hash 
     } 
    end 
    end 

    def tags 
    @@tags 
    end 
end 

doc = MyDocument.new 

# Create a new parser 
parser = Nokogiri::XML::SAX::Parser.new(doc) 

# Feed the parser some XML 
parser.parse(File.open('test.xml')) 

doc.tags 
# => [{:name=>"Identification", 
#  :attributes=> 
#  {"IDValue"=>"642da00e-9be3-4a7c-bd50-66a4f0d70af8", 
#  "OrganizationName"=>"northsteppe", 
#  "IDType"=>"property"}}, 
#  {:name=>"Identification", 
#  :attributes=> 
#  {"IDValue"=>"6e1e61523972d5f0e260e3d38eb488337424f21e", 
#  "OrganizationName"=>"northsteppe", 
#  "IDType"=>"Company"}}] 
+0

Unfortunately this approach doesn't work, because the actual file is 300MB and loading it into a DOM crashes the server. :/

+0

You left out *very* important information. You need to put all of your constraints in the question. Don't make us figure them out piece by piece.

+0

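I do apologize, I didn't mean to leave that information out. I have added two updates to the question above to be as specific as possible. I ran your code, and it does pull all the Identification tags where OrganizationName = 'northsteppe', which I had already been able to do with SAX before. :) Hopefully the updates above clarify the exact process I'm trying to accomplish, rather than me just asking for pieces of the puzzle and trying to work out the rest (which has proven unsuccessful for this particular task).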
