2011-04-07 68 views

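Why does Nokogiri truncate this element?

I'm parsing XML files with Nokogiri and Ruby 1.9.2. Everything seems to work until I read the Descriptions (below): the text is being truncated. The input text is: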

<Value>The Copthorne Aberdeen enjoys a location proximate to several bars, restaurants and other diversions. This Aberdeen hotel is located on the city’s West End, roughly a mile from the many opportunities to engage in sightseeing or simply shopping the day away. The Aberdeen International Airport is approximately 10 miles from the Copthorne Hotel in Aberdeen. 

There are 89 rooms in total at the Copthorne Aberdeen Hotel. Each of the is provided with direct-dial telephone service, trouser presses, coffee and tea makers and a private bath with a bathrobe and toiletries courtesy of the hotel. The rooms are light in color. 

The Hotel Copthorne Aberdeen offers its guests a restaurant where they can enjoy their meals in a somewhat formal setting. For something more laid-back, guests may have a drink and a light meal in the hotel bar. This hotel does offer business services and there are rooms for meetings located onsite. The hotel also provides a secure parking facility for those who arrive by private car.</Value> 

But instead I'm getting:

g. For something more laid-back, guests may have a drink and a light meal in the hotel bar. This hotel does offer business services and there are rooms for meetings located onsite. The hotel also provides a secure parking facility for those who arrive by private car. 

Notice that it starts at "g." and that more than half of the text is missing.

Here's the full XML file:

<?xml version="1.0" encoding="utf-8"?> 
<Hotel> 
    <HotelID>1040900</HotelID> 
    <HotelFileName>Copthorne_Hotel_Aberdeen</HotelFileName> 
    <HotelName>Copthorne Hotel Aberdeen</HotelName> 
    <CityID>10</CityID> 
    <CityFileName>Aberdeen</CityFileName> 
    <CityName>Aberdeen</CityName> 
    <CountryCode>GB</CountryCode> 
    <CountryFileName>United_Kingdom</CountryFileName> 
    <CountryName>United Kingdom</CountryName> 
    <StarRating>4</StarRating> 
    <Latitude>57.146068572998</Latitude> 
    <Longitude>-2.111680030823</Longitude> 
    <Popularity>1</Popularity> 
    <Address>122 Huntly Street</Address> 
    <CurrencyCode>GBP</CurrencyCode> 
    <LowRate>36.8354</LowRate> 
    <Facilities>1|2|3|5|6|8|10|11|15|17|18|19|20|22|27|29|30|34|36|39|40|41|43|45|47|49|51|53|55|56|60|62|140|154|209</Facilities> 
    <NumberOfReviews>239</NumberOfReviews> 
    <OverallRating>3.95</OverallRating> 
    <CleanlinessRating>3.98</CleanlinessRating> 
    <ServiceRating>3.98</ServiceRating> 
    <FacilitiesRating>3.83</FacilitiesRating> 
    <LocationRating>4.06</LocationRating> 
    <DiningRating>3.93</DiningRating> 
    <RoomsRating>3.68</RoomsRating> 
    <PropertyType>0</PropertyType> 
    <ChainID>92</ChainID> 
    <Checkin>14</Checkin> 
    <Checkout>12</Checkout> 
    <Images> 
    <Image>19305754</Image> 
    <Image>19305755</Image> 
    <Image>19305756</Image> 
    <Image>19305757</Image> 
    <Image>19305758</Image> 
    <Image>19305759</Image> 
    <Image>19305760</Image> 
    <Image>19305761</Image> 
    <Image>19305762</Image> 
    <Image>19305763</Image> 
    <Image>19305764</Image> 
    <Image>19305765</Image> 
    <Image>19305766</Image> 
    <Image>19305767</Image> 
    <Image>37102984</Image> 
    </Images> 
    <Descriptions> 
    <Description> 
     <Name>General Description</Name> 
     <Value>The Copthorne Aberdeen enjoys a location proximate to several bars, restaurants and other diversions. This Aberdeen hotel is located on the city’s West End, roughly a mile from the many opportunities to engage in sightseeing or simply shopping the day away. The Aberdeen International Airport is approximately 10 miles from the Copthorne Hotel in Aberdeen. 

There are 89 rooms in total at the Copthorne Aberdeen Hotel. Each of the is provided with direct-dial telephone service, trouser presses, coffee and tea makers and a private bath with a bathrobe and toiletries courtesy of the hotel. The rooms are light in color. 

The Hotel Copthorne Aberdeen offers its guests a restaurant where they can enjoy their meals in a somewhat formal setting. For something more laid-back, guests may have a drink and a light meal in the hotel bar. This hotel does offer business services and there are rooms for meetings located onsite. The hotel also provides a secure parking facility for those who arrive by private car.</Value> 
    </Description> 
    <Description> 
     <Name>LocationDescription</Name> 
     <Value>Aberdeen's premier four star hotel located in the city centre just off Union Street and the main business and entertainment areas. Within 10 minutes journey of Aberdeen Railway Station and only 10-20 minutes journey from International Airport.</Value> 
    </Description> 
    </Descriptions> 
</Hotel> 

Here's my Ruby program:

require 'rubygems' 
require 'nokogiri' 
require 'ap' 
include Nokogiri 

class Hotel < Nokogiri::XML::SAX::Document

  def initialize
    @h = {}
    @h["Images"] = Array.new([])
    @h["Descriptions"] = Array.new([])
    @desc = {}
  end

  def end_document
    ap @h
    puts "Finished..."
  end

  def start_element(element, attributes = [])
    @element = element

    @desc = {} if element == "Description"
  end

  def end_element(element, attributes = [])
    @h["Images"] << @characters if element == "Image"
    @desc["Name"] = @characters if element == "Name"
    if element == "Value"
      @desc["Value"] = @characters
      @h["Descriptions"] << @desc
    end

    @h[element] = @characters unless %w(Images Image Descriptions Description Hotel Name Value).include? element
  end

  def characters(string)
    @characters = string
  end
end

# Create a new parser 
parser = Nokogiri::XML::SAX::Parser.new(Hotel.new) 

# Feed the parser some XML 
parser.parse(File.open("/Users/cbmeeks/Projects/shared/data/text/HotelDatabase_EN/00/1040900.xml", 'rb')) 

Thanks

Answers

0

I stripped down the XML, because it has a lot of nodes that aren't needed for this question. Here's how I'd go after the text:

#!/usr/bin/env ruby 
# encoding: UTF-8 

xml =<<EOT 
<?xml version="1.0" encoding="utf-8"?> 
<Hotel> 
    <Descriptions> 
    <Description> 
     <Name>General Description</Name> 
     <Value>The Copthorne Aberdeen enjoys a location proximate to several bars, restaurants and other diversions. This Aberdeen hotel is located on the city’s West End, roughly a mile from the many opportunities to engage in sightseeing or simply shopping the day away. The Aberdeen International Airport is approximately 10 miles from the Copthorne Hotel in Aberdeen. 

There are 89 rooms in total at the Copthorne Aberdeen Hotel. Each of the is provided with direct-dial telephone service, trouser presses, coffee and tea makers and a private bath with a bathrobe and toiletries courtesy of the hotel. The rooms are light in color. 

The Hotel Copthorne Aberdeen offers its guests a restaurant where they can enjoy their meals in a somewhat formal setting. For something more laid-back, guests may have a drink and a light meal in the hotel bar. This hotel does offer business services and there are rooms for meetings located onsite. The hotel also provides a secure parking facility for those who arrive by private car.</Value> 
    </Description> 
    <Description> 
     <Name>LocationDescription</Name> 
     <Value>Aberdeen's premier four star hotel located in the city centre just off Union Street and the main business and entertainment areas. Within 10 minutes journey of Aberdeen Railway Station and only 10-20 minutes journey from International Airport.</Value> 
    </Description> 
    </Descriptions> 
</Hotel> 
EOT 

require 'nokogiri' 

doc = Nokogiri::XML(xml) 
puts doc.search('Value').map{ |n| n.text } 

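With the output looking like:

The Copthorne Aberdeen enjoys a location proximate to several bars, restaurants and other diversions. This Aberdeen hotel is located on the city’s West End, roughly a mile from the many opportunities to engage in sightseeing or simply shopping the day away. The Aberdeen International Airport is approximately 10 miles from the Copthorne Hotel in Aberdeen.

There are 89 rooms in total at the Copthorne Aberdeen Hotel. Each of the is provided with direct-dial telephone service, trouser presses, coffee and tea makers and a private bath with a bathrobe and toiletries courtesy of the hotel. The rooms are light in color.

The Hotel Copthorne Aberdeen offers its guests a restaurant where they can enjoy their meals in a somewhat formal setting. For something more laid-back, guests may have a drink and a light meal in the hotel bar. This hotel does offer business services and there are rooms for meetings located onsite. The hotel also provides a secure parking facility for those who arrive by private car.
Aberdeen's premier four star hotel located in the city centre just off Union Street and the main business and entertainment areas. Within 10 minutes journey of Aberdeen Railway Station and only 10-20 minutes journey from International Airport.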

This deliberately goes after only the Value nodes. Modifying the sample to grab the Image nodes is simple too.

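Now, a couple of questions: why are you using SAX mode? Will the incoming XML fit comfortably in the host's RAM? If it will, use the DOM, because it's a lot easier to work with.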

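When I first ran it, Ruby told me "invalid multibyte char (US-ASCII)", which means it didn't like something in the XML (most likely the curly apostrophe in "city’s", which is not ASCII). I fixed that by adding the "# encoding" line. I'm using Ruby 1.9.2, which makes dealing with such things easier.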

I'm using CSS accessors for the search. Nokogiri supports both XPath and CSS, so you can indulge your XML-parsing heart's desire with whichever you prefer.


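No real reason why I'm using SAX, other than that I have 200k of these to parse. :-) I would love an example of using the DOM to convert an XML file into Ruby objects – cbmeeks 2011-04-07 00:47:17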


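@cbmeeks, "would love an example of using the DOM to convert an XML file into Ruby objects" Why? It's so easy to walk through a DOM and grab what I need that I've never wanted an XML-to-object converter. I used them in Perl, and I hear Rails can do it, but I just don't see the point; I've written some big apps that parse many RDF/RSS/Atom feeds with Nokogiri, and it handled the job effortlessly. – 2011-04-07 00:51:23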


I guess my last experience was with a Java DOM parser, which was a pain. I'll look for some Ruby DOM tutorials as well – cbmeeks 2011-04-07 00:57:39

0

I ran into a similar problem, and here is the real explanation. This:

def characters(string) 
    @characters = string 
end 

should actually be this:

def start_element(element, attributes = [])  
    #...(other stuff)... 

    # Reset/initialize @characters 
    @characters = "" 
end 

def characters(string) 
    @characters += string 
end 

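The reason is that the content of a tag can actually be split across multiple text nodes, as described here: http://nokogiri.org/Nokogiri/XML/SAX/Document.html

This method might be called multiple times given one contiguous string of characters.

Only the last segment of the text body was being captured, because each time a text node was encountered (i.e., each time the characters method was called) it replaced the contents of @characters instead of appending to them.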