2012-11-12 87 views
-1

I need to find the distance between two websites using Ruby's open-uri. To do that, I am using a regular expression to find the href of each <a> tag.

def check(url) 
    site = open(url.base_url) 
    link = %r{^<([a])([^"]+)*([^>]+)*(?:>(.*)<\/\1>|\s+\/>)$} 
    site.each_line {|line| puts $&,$1,$2,$3,$4 if (line=~link)} 
    p url.links 
end 

Finding the links does not work properly. Any idea why?

+3

No idea, without knowing what kind of structure 'url' holds or what your error is. – Thilo

Answers

1

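I see several problems with this regular expression:

  • A space is not necessarily required before the trailing slash of an empty tag, but your regex demands one

  • Your regex is very verbose and redundant

Try the following instead; it will extract the URLs from the <a> tags: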

link = /<a \s # Start of tag 
    [^>]*  # Some whitespace, other attributes, ... 
    href="  # Start of URL 
    ([^"]*)  # The URL, everything up to the closing quote 
    "   # The closing quotes 
    /x   # We stop here, as regular expressions wouldn't be able to 
       # correctly match nested tags anyway 
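As a rough illustration of how the pattern above could be applied, the sketch below runs it over a small string with `String#scan`. The sample markup and the constant name `LINK` are made up for this example, and the pattern only handles double-quoted `href` values.

```ruby
# Same pattern as above, written with the /x (extended) flag
LINK = /<a \s      # start of tag
        [^>]*      # whitespace, other attributes, ...
        href="     # start of URL
        ([^"]*)    # the URL, everything up to the closing quote
        "          # the closing quote
       /x

# Hypothetical sample markup, invented for this illustration
html = <<~HTML
  <p><a href="/about/">About</a> and <a class="ext" href="http://www.example.org/">Example</a></p>
  <a name="top">no href here</a>
HTML

# scan returns one array of capture groups per match; flatten gives plain strings
hrefs = html.scan(LINK).flatten
p hrefs
```

Because `scan` collects every match in the string, there is no need to loop over the input line by line.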
3

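If you want to find the href attribute of <a> tags, use the right tool, which is rarely a regular expression. More likely you should use an HTML/XML parser.

Nokogiri is the parser of choice for Ruby: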

require 'nokogiri' 
require 'open-uri' 

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
doc = Nokogiri::HTML(URI.open('http://www.example.org/index.html'))
doc.search('a').map{ |a| a['href'] } 

pp doc.search('a').map{ |a| a['href'] } 
# => [ 
# => "/", 
# => "/domains/", 
# => "/numbers/", 
# => "/protocols/", 
# => "/about/", 
# => "/go/rfc2606", 
# => "/about/", 
# => "/about/presentations/", 
# => "/about/performance/", 
# => "/reports/", 
# => "/domains/", 
# => "/domains/root/", 
# => "/domains/int/", 
# => "/domains/arpa/", 
# => "/domains/idn-tables/", 
# => "/protocols/", 
# => "/numbers/", 
# => "/abuse/", 
# => "http://www.icann.org/", 
# => "mailto:[email protected]?subject=General%20website%20feedback" 
# => ]
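Most of the hrefs in that output are relative paths. If absolute URLs are needed for following links from one site to another, one possible approach is to resolve each href against the page's URL with `URI.join` from the standard library. This is only a sketch: the `base` value and the shortened `hrefs` list are assumptions chosen for illustration.

```ruby
require 'uri'

# Assumed base: the URL of the page the links were extracted from
base = 'http://www.example.org/index.html'

# A few of the hrefs from the output above (shortened for illustration)
hrefs = ['/', '/domains/', '/about/presentations/', 'http://www.icann.org/']

# URI.join resolves relative references and leaves absolute URLs unchanged
absolute = hrefs.map { |href| URI.join(base, href).to_s }
puts absolute
```

Non-HTTP schemes such as `mailto:` would probably need to be filtered out before crawling further.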