2012-08-10 22 views

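Ruby web-scraping script for GoDaddy

I am new to Ruby, and for my first scripting task I have been asked to write a web-scraping script that scrapes elements of GoDaddy's DNS list.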

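I am having trouble scraping the link, which I then need to follow. I need to get the link out of the "GoToSecondaryDNS" JavaScript element below. I am using Mechanize and Nokogiri: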

<td class="listCellBorder" align="left" style="width:170px;"> 
      <div style="padding-left:4px;"> 
      <div id="gvZones21divDynamicDNS"></div> 
      <div id="gvZones21divMasterSlave" cicode="41022" onclick="GoToSecondaryDNS('iwanttoscrapethislink.com',0)" class="listFeatureButton secondaryDNSNoPremium" onmouseover="ShowSecondaryDNSAd(this, event);" onmouseout="HideAdInList(event);"></div> 
      <div id="gvZones21divDNSSec" cicode="41023" class="listFeatureButton DNSSECButtonNoPremium" onmouseover="ShowDNSSecAd(this, event);" onmouseout="HideAdInList(event);" onclick="UpgradeLinkActionByID('gvZones21divDNSSec'); return false;" useClick="true" clickObj="aDNSSecUpgradeClicker"></div> 
      <div id="gvZones21divVanityNS" onclick="GoToVanityNS('iwanttoscrapethislink.com',0)" class="listFeatureButton vanityNameserversNoPremium" onmouseover="ShowVanityNSAd(this, event);" onmouseout="HideAdInList(event);"></div> 
      <div style="clear:both;"></div> 
      </div> 
     </td> 

How can I scrape the link "iwanttoscrapethislink.com", follow it the way the onclick handler would, and then scrape the content of the next page using Ruby?

So far I have some simple starter code:

require 'rubygems' 
require 'mechanize' 
require 'open-uri' 




def get_godaddy_data(url) 


     web_agent = Mechanize.new 

     result = nil 

     ### login to GoDaddy admin 


     page = web_agent.get('https://dns.godaddy.com/Default.aspx?sa=') 

     ## there is only one form and it is the first form on the page 
     form = page.forms.first 
     form.username = 'blank' 
     form.password = 'blank' 

     ## form.submit 
     web_agent.submit(form, form.buttons.first) 

    site_name = page.css('div.gvZones21divMasterSlave onclick td') 
     ### export dns zone data 

     page = web_agent.get('https://dns.godaddy.com/ZoneFile.aspx?zone=' + site_name + '&zoneType=0&refer=dcc') 
     form = page.forms[3] 
     web_agent.submit(form, form.buttons.first).save(uri.host + 'scrape.txt') 

     ## end 

    end 

    ### read export file 
    ##return File.open(uri.host + 'scrape.txt', 'rb') { |file| file.read } 
    end 


    def scrape_dns(url) 

    site_name = page.css('div.gvZones21divMasterSlave onclick td') 
    LIST_URL = "https://dns.godaddy.com/ZoneFile.aspx?zone=" + site_name + '&zoneType=0&refer=dcc" 
    page = Nokogiri::HTML(open(LIST_URL)) 

#not sure how to scrape onclick urls and then how to click through to continue scraping on the second page for each individual DNS 

end 

Answer


You can't interact with "onclick", because Nokogiri is not a JavaScript engine.

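You can, however, extract the attribute's contents and use them to build the URL for a subsequent web request. Assuming doc contains the parsed HTML: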

doc.at('div[onclick^="GoToSecondaryDNS"]')['onclick'] 

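gives you the value of the onclick attribute. ^= is the CSS attribute selector for "value starts with", so it excludes the other <div> tags whose onclick attributes call different functions, and the expression returns: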

"GoToSecondaryDNS('iwanttoscrapethislink.com',0)" 

A simple regular expression then gets you the hostname:

doc.at('div[onclick^="GoToSecondaryDNS"]')['onclick'][/'(.+)'/,1] 
=> "iwanttoscrapethislink.com" 
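
The greedy .+ works here because the attribute contains only one quoted string. A slightly stricter variant, sketched below in plain Ruby with no gems, stops at the first closing quote and returns nil when the attribute is missing or calls a different function. The helper name hostname_from_onclick is hypothetical; it is not part of Nokogiri or Mechanize.

```ruby
# Hypothetical helper: pulls the first single-quoted argument out of an
# onclick value such as "GoToSecondaryDNS('example.com',0)".
def hostname_from_onclick(onclick)
  # A <div> without an onclick attribute yields nil from Nokogiri.
  return nil if onclick.nil?

  # [^']+ stops at the first closing quote, so a second quoted argument
  # would not be swallowed the way a greedy .+ would swallow it.
  onclick[/\AGoToSecondaryDNS\('([^']+)'/, 1]
end

puts hostname_from_onclick("GoToSecondaryDNS('iwanttoscrapethislink.com',0)")
```

Anchoring the pattern to the function name means a GoToVanityNS handler passed in by mistake yields nil instead of a hostname.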

The rest, such as how to get at the Nokogiri document inside Mechanize and how to build the new URL, is left for you to work out.
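
As a starting point for the URL step, here is a minimal sketch using only Ruby's standard library. It assumes the ZoneFile.aspx URL pattern from the question is still what the site expects, which has not been verified. The constant ZONE_FILE_BASE and the helper zone_file_url are hypothetical names introduced for illustration.

```ruby
require 'uri'

# URL pattern copied from the question; whether GoDaddy still serves it
# is an assumption.
ZONE_FILE_BASE = 'https://dns.godaddy.com/ZoneFile.aspx'

# Hypothetical helper: builds the zone-file URL for one hostname,
# letting URI escape the query values instead of concatenating strings.
def zone_file_url(hostname)
  query = URI.encode_www_form('zone' => hostname, 'zoneType' => 0, 'refer' => 'dcc')
  "#{ZONE_FILE_BASE}?#{query}"
end

puts zone_file_url('iwanttoscrapethislink.com')
```

With Mechanize, the parsed Nokogiri document for a fetched page should be available as page.parser, and the resulting string can be passed to the agent's get method in place of the hand-concatenated URL in the question.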


Thanks for pointing me in the right direction. I'll see whether I can at least get the link returned first, and then update this thread. – Lynn 2012-08-14 17:39:15


@Lynn, see http://stackoverflow.com/a/2114744/128421 for more information. – 2012-08-14 18:59:43