2016-05-21

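Ruby - Scraper: concatenating strings

I'm building a Ruby web scraper to collect some information. In the HTML of the page I want to scrape, the spans come in groups of three per article, like this: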

<article> 
    <div class="item item_contains_branding" data-adid="1234567"> 
     <div class="clearfix" style="display: block;"> 
     <div class="item-multimedia "> 
      ... 
     </div> 
     <div class="item-info-container"> 
      <div class="logo-branding"> 
      ... 
      </div> 
        <a href="/link/1" class="item-link " title="title 1" data-xiti-click="listado::enlace">title 1</a> 
      <div class="row price-row clearfix"> <span class="item-price">200<span>€</span></span> </div> 
      <span class="item-detail">T2 <small></small></span> <span class="item-detail">20 <small>m²</small></span> <span class="item-detail"> <small> more details 1</small></span> 
       <p class="item-description">description...</p> 
      <div class="item-toolbar clearfix"> 
      ... 
      </div> 
     </div> 
     </div> 
    </div> 
</article> 
<article> 
    <div class="item item_contains_branding" data-adid="1234567"> 
     <div class="clearfix" style="display: block;"> 
     <div class="item-multimedia "> 
      ... 
     </div> 
     <div class="item-info-container"> 
      <div class="logo-branding"> 
      ... 
      </div> 
        <a href="/link/2" class="item-link " title="title 2" data-xiti-click="listado::enlace">title 2</a> 
      <div class="row price-row clearfix"> <span class="item-price">300<span>€</span></span> </div> 
      <span class="item-detail">T5 <small></small></span> <span class="item-detail">50 <small>m²</small></span> 
       <p class="item-description">description...</p> 
      <div class="item-toolbar clearfix"> 
      ... 
      </div> 
     </div> 
     </div> 
    </div> 
</article> 
<article> 
    <div class="item item_contains_branding" data-adid="1234567"> 
     <div class="clearfix" style="display: block;"> 
     <div class="item-multimedia "> 
      ... 
     </div> 
     <div class="item-info-container"> 
      <div class="logo-branding"> 
      ... 
      </div> 
        <a href="/link/3" class="item-link " title="title 3" data-xiti-click="listado::enlace">title 3</a> 
      <div class="row price-row clearfix"> <span class="item-price">500<span>€</span></span> </div> 
      <span class="item-detail">T1 <small></small></span> <span class="item-detail">100 <small>m²</small></span> <span class="item-detail"> <small> more details 3</small></span> 
       <p class="item-description">description...</p> 
      <div class="item-toolbar clearfix"> 
      ... 
      </div> 
     </div> 
     </div> 
    </div> 
</article> 

However, some articles do not have the last span (the one with "more details").

So far, I have been using this code:

# First loop: find the titles
page.css('a.item-link').each do |line|
  puts line.text
end
# Second loop: find the prices
page.css('span.item-price').each do |line|
  puts line.text
end
# Third loop: find the details
page.css('span.item-detail').each do |line|
  puts line.text
end

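I am using the nokogiri gem and open-uri to retrieve and parse the document.

How can I concatenate the 3 spans (some articles have only two spans with the "item-detail" class) and print them on the screen?

My desired output is: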

title 1 
title 2 
title 3 
200€ 
300€ 
500€ 
T2 
T5 
T1 
20 m² 
50 m² 
100 m² 
more details 1 
" " 
more details 3 

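Some articles don't have the third span (the one with "more details N"); in that case I want to print " ". My goal is to write the results to a .csv file.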

+1

Please edit your question to include your desired output. –

+0

Maybe take a look at `reduce` – Michael

+0

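OK, I see the question has been changed and now has expected output. The expected output is completely unrelated to the input HTML. There are no titles, prices, or details in this example, so there is no way to answer the question satisfactorily. Please provide suitable (realistic) input, and an example of which parts of the input should map to which parts of the output. As I said, the question as it stands is unanswerable. –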

Answers

1

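Here is code that works with the sample input, although I had to modify the input XML slightly, wrapping it in a single root node (<document>), so that it parses correctly: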

require "nokogiri" 

html = <<HTML 
<document> 
<article> 
    <div class="item item_contains_branding" data-adid="1234567"> 
     <div class="clearfix" style="display: block;"> 
     <div class="item-multimedia "> 
      ... 
     </div> 
     <div class="item-info-container"> 
      <div class="logo-branding"> 
      ... 
      </div> 
        <a href="/link/1" class="item-link " title="title 1" data-xiti-click="listado::enlace">title 1</a> 
      <div class="row price-row clearfix"> <span class="item-price">200<span>€</span></span> </div> 
      <span class="item-detail">T2 <small></small></span> <span class="item-detail">20 <small>m²</small></span> <span class="item-detail"> <small> more details 1</small></span> 
       <p class="item-description">description...</p> 
      <div class="item-toolbar clearfix"> 
      ... 
      </div> 
     </div> 
     </div> 
    </div> 
</article> 
<article> 
    <div class="item item_contains_branding" data-adid="1234567"> 
     <div class="clearfix" style="display: block;"> 
     <div class="item-multimedia "> 
      ... 
     </div> 
     <div class="item-info-container"> 
      <div class="logo-branding"> 
      ... 
      </div> 
        <a href="/link/2" class="item-link " title="title 2" data-xiti-click="listado::enlace">title 2</a> 
      <div class="row price-row clearfix"> <span class="item-price">300<span>€</span></span> </div> 
      <span class="item-detail">T5 <small></small></span> <span class="item-detail">50 <small>m²</small></span> 
       <p class="item-description">description...</p> 
      <div class="item-toolbar clearfix"> 
      ... 
      </div> 
     </div> 
     </div> 
    </div> 
</article> 
<article> 
    <div class="item item_contains_branding" data-adid="1234567"> 
     <div class="clearfix" style="display: block;"> 
     <div class="item-multimedia "> 
      ... 
     </div> 
     <div class="item-info-container"> 
      <div class="logo-branding"> 
      ... 
      </div> 
        <a href="/link/3" class="item-link " title="title 3" data-xiti-click="listado::enlace">title 3</a> 
      <div class="row price-row clearfix"> <span class="item-price">500<span>€</span></span> </div> 
      <span class="item-detail">T1 <small></small></span> <span class="item-detail">100 <small>m²</small></span> <span class="item-detail"> <small> more details 3</small></span> 
       <p class="item-description">description...</p> 
      <div class="item-toolbar clearfix"> 
      ... 
      </div> 
     </div> 
     </div> 
    </div> 
</article> 
</document> 
HTML 

# Parse as XML; the <document> wrapper gives the parser a single root node.
page = Nokogiri::XML(html)
articles = page.css('article')

# Titles of all articles.
articles.each do |article|
  article.css('a.item-link').each do |link|
    puts link[:title]
  end
end

# Prices of all articles.
articles.each do |article|
  article.css('span.item-price').each do |price|
    puts price.text
  end
end

# First item-detail span of every article.
articles.each do |article|
  detail_spans = article.css('span.item-detail')
  puts detail_spans[0].text
end

# Second item-detail span of every article.
articles.each do |article|
  detail_spans = article.css('span.item-detail')
  puts detail_spans[1].text
end

# Third item-detail span, which may be absent: print " " (quotes included) if so.
articles.each do |article|
  detail_spans = article.css('span.item-detail')
  puts detail_spans[2] ? detail_spans[2].text.strip : ' '.inspect
end

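This code retrieves an array of article elements, then runs further queries scoped to each article element to reach the elements it contains. This allows fine-grained reporting of individual element values.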

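The final item-detail query uses a presence check to decide what to output when the element may be absent. Other queries may need the same technique, depending on the actual HTML document content.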
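The same check can be applied to every position at once. The sketch below is plain Ruby and assumes the detail texts have already been collected into one array per article; `details_by_article` and the `column_major` helper are hypothetical names used only for this illustration.

```ruby
# Hypothetical detail texts, one inner array per article; the second
# article lacks its third detail, as in the question.
details_by_article = [
  ["T2", "20 m²", "more details 1"],
  ["T5", "50 m²"],
  ["T1", "100 m²", "more details 3"]
]

# Return the values column by column (all first details, then all second
# details, and so on), substituting `placeholder` wherever one is missing.
def column_major(rows, placeholder)
  width = rows.map(&:length).max || 0
  (0...width).flat_map do |index|
    rows.map { |row| row[index] || placeholder }
  end
end

puts column_major(details_by_article, '" "')
```

Because the missing value is replaced before anything calls a method on it, no position can trigger a `NoMethodError` on `nil`.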

This produces the following results:

title 1 
title 2 
title 3 
200€ 
300€ 
500€ 
T2 
T5 
T1 
20 m² 
50 m² 
100 m² 
more details 1 
" " 
more details 3 
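
Since the stated goal is a .csv file, one row per article is probably more convenient than column-by-column printing. A rough sketch using Ruby's standard `csv` library follows; the `listings` data, the column names in `HEADERS` and the `to_csv_row` helper are assumptions made for illustration, not something taken from the original page.

```ruby
require "csv"

# Hypothetical rows; in a real scraper these would come from the
# per-article queries.
listings = [
  { title: "title 1", price: "200€", details: ["T2", "20 m²", "more details 1"] },
  { title: "title 2", price: "300€", details: ["T5", "50 m²"] }
]

# Assumed column names; adjust them to whatever the details really mean.
HEADERS = %w[title price type area extra].freeze

# Turn one listing into a fixed-width row, padding missing details with "".
def to_csv_row(listing, detail_count = 3)
  details = listing[:details].first(detail_count)
  padded = details + [""] * (detail_count - details.length)
  [listing[:title], listing[:price], *padded]
end

csv_text = CSV.generate do |csv|
  csv << HEADERS
  listings.each { |listing| csv << to_csv_row(listing) }
end

puts csv_text
# To write a file instead: File.write("listings.csv", csv_text)
```

Padding every row to the same width keeps the columns aligned even when an article has only two detail spans.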
+0

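With your approach the output is better, but I would like to separate the "item-detail" spans: print the first span of every article, then the second span of every article, and then the third span if the article has one, or " " if it doesn't. –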

+0

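@absint0o I was able to get the Nokogiri parser to work the way you described. The extra details were very helpful. Thanks! Take a look at the answer, and let me know if you have questions or need clarification. –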

+0

Thank you very much for your help! Now I get this error: imov_scrap.rb:28:in `block in <main>': undefined method `text' for nil:NilClass (NoMethodError) –