2016-10-09 41 views

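Using Ruby and regular expressions to scan a web page for URLs

I am trying to create an array of all the links found at the URL below. Both page.scan(URI.regexp) and URI.extract(page) return more than just the URLs.

How can I get only the URLs?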

require 'net/http' 
require 'uri' 

uri = URI("https://gist.github.com/JsWatt/59f4b8ce6bbf0c7e4dc7") 
page = Net::HTTP.get(uri) 
p page.scan(URI.regexp) 
p URI.extract(page) 

Answer


If you only want to extract the links (<a href="..."> elements) from the text file, it is better to parse it as real HTML with Nokogiri and then extract the links like this:

require 'nokogiri' 
require 'open-uri' 

# Fetch and parse the raw HTML text
# (URI.open is needed on Ruby 3.0+, where Kernel#open no longer opens URLs)
doc = Nokogiri::HTML(URI.open('https://gist.githubusercontent.com/JsWatt/59f4b8ce6bbf0c7e4dc7/raw/c340b3fbcab7923e52e5b50165432b6e5f2e3cf4/for_scraper.txt'))

# Extract all a-elements (HTML links) 
all_links = doc.css('a') 

# Sort + weed out duplicates and empty links 
links = all_links.map { |link| link['href'].to_s }.reject(&:empty?).uniq.sort

# Print out some of them 
puts links.grep(/store/) 

http://store.steampowered.com/app/214590/ 
http://store.steampowered.com/app/218090/ 
http://store.steampowered.com/app/220780/ 
http://store.steampowered.com/app/226720/ 
... 
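
If you would rather stay with the standard library, one reason URI.extract returns more than web addresses is that, without a scheme list, it matches anything shaped like "scheme:something". A minimal sketch of restricting it to http and https follows. The sample string is made up for illustration and stands in for the downloaded page, and note that URI.extract is marked obsolete in recent Ruby versions, so treat this as a quick workaround rather than a recommendation:

```ruby
require 'uri'

# Hypothetical sample text standing in for the downloaded page
sample = 'Buy <a href="http://store.steampowered.com/app/214590/">here</a> ' \
         'or write to mailto:someone@example.com and see https://example.com/page'

# No scheme list: every "scheme:..." match is returned, including mailto:
everything = URI.extract(sample)

# Scheme list given: only http and https URLs are returned
web_urls = URI.extract(sample, %w[http https])

puts web_urls
```

This still matches URLs anywhere in the text, not only inside href attributes, so the Nokogiri approach remains the more precise one when the input is HTML.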