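Alternative to URI.parse that allows hostnames to contain underscores
I'm using DMOZ's list of URL topics, which contains some URLs whose hostnames contain an underscore.
For example: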
<ExternalPage about="http://outer_heaven4.tripod.com/index2.htm">
<d:Title>The Outer Heaven</d:Title>
<d:Description>Information and image gallery of McFarlane's action figures for Trigun, Akira, Tenchi Muyo and other Japanese Sci-Fi animations.</d:Description>
<topic>Top/Arts/Animation/Anime/Collectibles/Models_and_Figures/Action_Figures</topic>
</ExternalPage>
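While this URL works in a web browser (or at least it does in mine :P), it's not legal according to the standard:
a hostname may not contain other characters, such as the underscore character (_),
which causes an error when trying to parse such a URL with URI.parse: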
[2] pry(main)> require 'uri'
=> true
[3] pry(main)> URI.parse "http://outer_heaven4.tripod.com/index2.htm"
URI::InvalidURIError: the scheme http does not accept registry part: outer_heaven4.tripod.com (or bad hostname?)
from ~/.rvm/rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/uri/generic.rb:213:in `initialize'
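To make the rule concrete, here is a minimal sketch of the strict letter-digit-hyphen check as I understand it. The helper `strict_hostname?` and the `LDH_LABEL` pattern are my own hypothetical names, not part of Ruby's uri library, and this is not necessarily the exact check that URI.parse performs:

```ruby
# Hypothetical helper (not part of Ruby's uri library): tests a hostname
# against the strict rule that each dot-separated label may contain only
# letters, digits, and hyphens, and may not start or end with a hyphen.
LDH_LABEL = /\A[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\z/i

def strict_hostname?(host)
  # The -1 limit keeps empty labels, so "a..b" is rejected rather than ignored.
  labels = host.split('.', -1)
  !labels.empty? && labels.all? { |label| label =~ LDH_LABEL }
end

puts strict_hostname?("outer_heaven4.tripod.com")  # underscore in the first label
puts strict_hostname?("outer-heaven4.tripod.com")  # same name with a hyphen instead
```

Under this assumed rule the DMOZ hostname fails only because of the underscore; the hyphenated variant passes.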
Is there an alternative to URI.parse that I can use that is less strict, without just rolling my own?