2014-07-07 55 views
0

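Searching and scraping Hacker News - Ruby

I'm building a Rails application that searches and scrapes Reddit and Twitter to return relevant headlines for a query. I'm trying to add Hacker News as an additional source. I initially planned to use Mechanize to interact with the Hacker News search page, but I can't seem to make any progress. My first thought is to get this working in plain Ruby so I can better understand how to build it into my Rails app. Basically, I want to get the search result titles and URLs. This is what I have so far, but I'm not sure how to proceed in Ruby.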

    require 'mechanize'

    agent = Mechanize.new
    mech_page = agent.get('https://hn.algolia.com/')
    form = mech_page.forms.first
    form['q'] = "ruby"
    agent.submit(form)

Any ideas or direction would be greatly appreciated!

Update 7:30 pm EST =====================================================

This seems to return what I'm looking for when scraping Google:

    require 'mechanize'

    mechanize = Mechanize.new
    url = "https://www.google.com"
    page = mechanize.get(url)
    form = page.forms.first
    form['q'] = 'Ruby'
    page = form.submit
    page.search('.r a').each do |link|
      puts link.text.strip
    end

and it returns:

    "Ruby Programming Language"
    "Ruby (programming language) - Wikipedia, the free encyclopedia"
    "Ruby on Rails"
    "Ruby | Codecademy"
    "Ruby-Doc.org: Documenting the Ruby Language"
    "RubyInstaller for Windows"
    "Downloads - RubyInstaller"
    "Images for Ruby"
    "Learn Ruby with the Neo Ruby Koans"
    "Programming Ruby: The Pragmatic Programmer's ... - Ruby-Doc.org"

But similar code for scraping https://hn.algolia.com/ ...

    require 'mechanize'

    mechanize = Mechanize.new
    url = "https://hn.algolia.com/"
    page = mechanize.get(url)
    form = page.forms.first
    form['q'] = 'Ruby'
    page = form.submit
    page.search('.title a').each do |link|
      puts link.text.strip
    end

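... returns no results, even though results appear on the actual page after running the query. Any ideas how I can scrape the search results? Inspect Element reveals a class name of "title", which is the parent of an a tag.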

+0

You need to narrow this down to a specific problem. –

+0

What did you expect to see? What did you get instead? –

+0

Updated. Thanks for the insight. –

Answers

0

Try this:

page = form.submit 

Now inspect the page. Do this from IRB or Pry and work out how to get what you want.

+0

Clarified the question above. –

+0

Inspect the page. Check page.uri to make sure all the redirects happened. Check page.body to make sure you see what you expect. – pguardiario

2

You should rather give the API a try (http://hn.algolia.com/api), or use RSS (http://news.ycombinator.com/rss & https://news.ycombinator.com/bigrss).

Your code doesn't work because the hits are loaded by JavaScript, which Mechanize does not execute. You should use the API, without any HTML parsing, like this:

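require 'open-uri'
require 'json'

# URI.open replaces the bare open(url) form, which stopped working in Ruby 3.0.
response = URI.open("https://hn.algolia.com/api/v1/search_by_date?query=ruby&tags=story").read
JSON.parse(response)['hits'].map { |h| h['title'] }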

["Learning Ruby on Rails – the resources crossroads", "Rubygems dependency API is down", "Sr. UI Engineer", "Immutability in Ruby: Part 2", "Immutability in Ruby: Part 1", "A collection of awesome Ruby libraries, tools, frameworks and software", "Elixir vs. Ruby Showdown – Phoenix vs. Rails", "Ask HN: Website to trade programming skills?", "This Kid Made An App That Exposes Sellout Politicians", "Ruby Queue Pop with Timeout", "Exploring Ruby’s Regular Expression Algorithm", "What should you learn together with Ruby on Rails", "25 Great Talks from the Atlanta Ruby Users Group", "What's the best way to do Business Analytics for MongoDB data?", "Ruby on Rails Internship", "Will Ruby on Rails be better for fast deployment than Ruby?", "Ask HN: Making Front End Work Suck Less?", "AngularJS with Ruby on Rails by David Bryant Copeland", "Ask HN: Path to become a Product Manager?", "Awesome Ruby"] 
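
If the search term comes from user input (for example, a Rails query parameter), the same call can be split into small helpers that encode the term and return both titles and URLs. This is only a minimal sketch. The helper names (`hn_search_url`, `extract_stories`, `search_hn`) are hypothetical, and it assumes each hit carries a `url` field alongside `title`, which may be `nil` for text-only posts such as Ask HN.

```ruby
require 'open-uri'
require 'json'
require 'uri'

# Endpoint taken from the answer above.
HN_SEARCH_ENDPOINT = 'https://hn.algolia.com/api/v1/search_by_date'

# Hypothetical helper: build the search URL, form-encoding the query so
# spaces and punctuation in user input are safe to send.
def hn_search_url(query)
  params = URI.encode_www_form('query' => query, 'tags' => 'story')
  "#{HN_SEARCH_ENDPOINT}?#{params}"
end

# Hypothetical helper: turn the raw JSON body into an array of
# { title:, url: } hashes. Assumes each hit has 'title' and 'url' keys;
# 'url' may be nil for text-only posts.
def extract_stories(json_body)
  JSON.parse(json_body).fetch('hits', []).map do |hit|
    { title: hit['title'], url: hit['url'] }
  end
end

# Hypothetical helper: fetch and parse in one step.
# This performs a network request.
def search_hn(query)
  extract_stories(URI.open(hn_search_url(query)).read)
end

# Example usage (requires network access):
#   search_hn('ruby on rails').each { |s| puts "#{s[:title]} (#{s[:url]})" }
```

In a Rails controller the term would typically come from something like `params[:q]`. Keeping `extract_stories` separate from the network call makes the parsing easy to test against a saved response.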
+0

Added more details above. –

+0

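I'm not sure how to parse the data returned by the API search URL. For example, to query 'ruby' I would use the URL https://hn.algolia.com/api/v1/search_by_date?query=ruby, but I would replace ruby with an interpolated variable defined by my query params. Thoughts? –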