2016-01-14 76 views
9

How do I parse an HTML table with Nokogiri?

I want to parse a table, but I don't know how to save the data from it. I want to store each row like this:

['Raw name 1', 2,094, 0,017, 0,098, 0,113, 0,452] 

Sample table:

html = <<EOT 
    <table class="open"> 
     <tr> 
      <th>Table name</th> 
      <th>Column name 1</th> 
      <th>Column name 2</th> 
      <th>Column name 3</th> 
      <th>Column name 4</th> 
      <th>Column name 5</th> 
     </tr> 
     <tr> 
      <th>Raw name 1</th> 
      <td>2,094</td> 
      <td>0,017</td> 
      <td>0,098</td> 
      <td>0,113</td> 
      <td>0,452</td>   
     </tr> 
     . 
     . 
     . 
     <tr> 
      <th>Raw name 5</th> 
      <td>2,094</td> 
      <td>0,017</td> 
      <td>0,098</td> 
      <td>0,113</td> 
      <td>0,452</td>   
     </tr> 
    </table> 
EOT 

My scraping code is:

doc = Nokogiri::HTML(open(html), nil, 'UTF-8') 
    tables = doc.css('div.open') 

    @tablesArray = [] 

    tables.each do |table| 
    title = table.css('tr[1] > th').text 
    cell_data = table.css('tr > td').text 
    raw_name = table.css('tr > th').text 
    @tablesArray << Table.new(cell_data, raw_name) 
    end 

    render template: 'scrape_krasecology' 
    end 
    end 

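When I try to display the data in the HTML page, it looks like all the column names are stored together in a single array element, and all the cell data is stored the same way.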

+1

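Please reduce your code to the minimum needed to demonstrate the problem. Provide a minimal HTML example *in the question itself* that also demonstrates the problem. Don't ask us to go to a page to extract the HTML or to build the surrounding code needed to test yours. Read "[ask]", "[mcve]" and http://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question/ –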

+0

@TinMan Thanks. I've updated my code. I believe it looks much better now? – verrom

+0

General information for anyone looking into this topic: http://ruby.bastardsbook.com/chapters/web-crawling/ – benjamin

Answers

11

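The crux of the problem is that calling #text on multiple results returns the concatenation of the #text of each individual element.

Let's examine what each step does: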

# Finds all <table>s with class open 
# I'm assuming you have only one <table> so 
# you don't actually have to loop through 
# all tables, instead you can just operate 
# on the first one. If that is not the case, 
# you can use a loop the way you did 
tables = doc.css('table.open') 

# The text of all <th>s in <tr> one in the table 
title = table.css('tr[1] > th').text 

# The text of all <td>s in all <tr>s in the table 
# You obviously wanted just the <td>s in one <tr> 
cell_data = table.css('tr > td').text 

# The text of all <th>s in all <tr>s in the table 
# You obviously wanted just the <th>s in one <tr> 
raw_name = table.css('tr > th').text 

Now that we know what is wrong, here is a possible solution:

html = <<EOT 
    <table class="open"> 
     <tr> 
      <th>Table name</th> 
      <th>Column name 1</th> 
      <th>Column name 2</th> 
      <th>Column name 3</th> 
      <th>Column name 4</th> 
      <th>Column name 5</th> 
     </tr> 
     <tr> 
      <th>Raw name 1</th> 
      <td>1001</td> 
      <td>1002</td> 
      <td>1003</td> 
      <td>1004</td> 
      <td>1005</td>   
     </tr> 
     <tr> 
      <th>Raw name 2</th> 
      <td>2001</td> 
      <td>2002</td> 
      <td>2003</td> 
      <td>2004</td> 
      <td>2005</td>   
     </tr> 
     <tr> 
      <th>Raw name 3</th> 
      <td>3001</td> 
      <td>3002</td> 
      <td>3003</td> 
      <td>3004</td> 
      <td>3005</td>   
     </tr> 
    </table> 
EOT 

doc = Nokogiri::HTML(html, nil, 'UTF-8') 

# Fetches only the first <table>. If you have 
# more than one, you can loop the way you 
# originally did. 
table = doc.css('table.open').first 

# Fetches all rows (<tr>s) 
rows = table.css('tr') 

# The column names are the first row (shift returns 
# the first element and removes it from the array). 
# On that row we get the text of each individual <th> 
# This will be Table name, Column name 1, Column name 2... 
column_names = rows.shift.css('th').map(&:text) 

# On each of the remaining rows 
text_all_rows = rows.map do |row| 

    # We get the name (<th>) 
    # On the first row this will be Raw name 1 
    # on the second - Raw name 2, etc. 
    row_name = row.css('th').text 

    # We get the text of each individual value (<td>) 
    # On the first row this will be 1001, 1002, 1003... 
    # on the second - 2001, 2002, 2003... etc 
    row_values = row.css('td').map(&:text) 

    # We map the name, followed by all the values 
    [row_name, *row_values] 
end 

p column_names # => ["Table name", "Column name 1", "Column name 2", 
       #  "Column name 3", "Column name 4", "Column name 5"] 
p text_all_rows # => [["Raw name 1", "1001", "1002", "1003", "1004", "1005"], 
       #  ["Raw name 2", "2001", "2002", "2003", "2004", "2005"], 
       #  ["Raw name 3", "3001", "3002", "3003", "3004", "3005"]] 

# If you want to combine them 
text_all_rows.each do |row_as_text| 
    p column_names.zip(row_as_text).to_h 
end # => 
    # {"Table name"=>"Raw name 1", "Column name 1"=>"1001", "Column name 2"=>"1002", "Column name 3"=>"1003", "Column name 4"=>"1004", "Column name 5"=>"1005"} 
    # {"Table name"=>"Raw name 2", "Column name 1"=>"2001", "Column name 2"=>"2002", "Column name 3"=>"2003", "Column name 4"=>"2004", "Column name 5"=>"2005"} 
    # {"Table name"=>"Raw name 3", "Column name 1"=>"3001", "Column name 2"=>"3002", "Column name 3"=>"3003", "Column name 4"=>"3004", "Column name 5"=>"3005"} 
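
One caveat about the zip step, in case real tables are less regular than the sample: Array#zip always produces as many pairs as the receiver has elements, so a row with fewer cells than there are column names gets nil for the missing ones, and extra cells are silently dropped. A minimal sketch, where row_to_hash is a hypothetical helper name that is not part of the code above:

```ruby
# Hypothetical helper (not from the answer): pairs column names with one
# row's cell texts, exactly like column_names.zip(row_as_text).to_h above.
def row_to_hash(column_names, row)
  column_names.zip(row).to_h
end

column_names = ["Table name", "Column name 1", "Column name 2"]

# A row with a missing cell: the unmatched column name maps to nil.
row_to_hash(column_names, ["Raw name 1", "1001"])
# => {"Table name"=>"Raw name 1", "Column name 1"=>"1001", "Column name 2"=>nil}

# A row with an extra cell: the cell without a column name is dropped.
row_to_hash(column_names, ["Raw name 2", "2001", "2002", "2003"])
# => {"Table name"=>"Raw name 2", "Column name 1"=>"2001", "Column name 2"=>"2002"}
```

If ragged rows are possible in your data, comparing row.size with column_names.size before zipping is one way to notice them.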
+0

Thanks, this helps! – verrom

2

Your desired output is nonsense:

['Raw name 1', 2,094, 0,017, 0,098, 0,113, 0,452] 
# ~> -:1: Invalid octal digit 
# ~> ['Raw name 1', 2,094, 0,017, 0,098, 0,113, 0,452] 

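I assume you want the numbers quoted.

After stripping out the parts that kept the code from working, reducing the HTML to a more manageable example, and running it: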

require 'nokogiri' 

html = <<EOT 
    <table class="open"> 
     <tr> 
      <th>Table name</th> 
      <th>Column name 1</th> 
      <th>Column name 2</th> 
     </tr> 
     <tr> 
      <th>Raw name 1</th> 
      <td>2,094</td> 
      <td>0,017</td> 
     </tr> 
     <tr> 
      <th>Raw name 5</th> 
      <td>2,094</td> 
      <td>0,017</td> 
     </tr> 
    </table> 
EOT 


doc = Nokogiri::HTML(html) 
tables = doc.css('table.open') 

tables_data = [] 

tables.each do |table| 
    title = table.css('tr[1] > th').text # !> assigned but unused variable - title 
    cell_data = table.css('tr > td').text 
    raw_name = table.css('tr > th').text 
    tables_data << [cell_data, raw_name] 
end 

Results in:

tables_data 
# => [["2,0940,0172,0940,017", 
#  "Table nameColumn name 1Column name 2Raw name 1Raw name 5"]] 

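The first thing to notice is that you're not using title, even though you assign it. That can happen, for example, when you trim code down for a question.

css, like search and xpath, returns a NodeSet, which is similar to an array of Nodes. When you call text or inner_text on a NodeSet, it returns the text of every node concatenated into a single string:

Get the inner text of all contained Node objects.

This is how it behaves: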

require 'nokogiri' 

doc = Nokogiri::HTML('<html><body><p>foo</p><p>bar</p></body></html>') 

doc.css('p').text # => "foobar" 

Instead, you should iterate over each node found and extract its text individually. This has been covered many times here on SO:

doc.css('p').map{ |node| node.text } # => ["foo", "bar"] 

That can be simplified to:

doc.css('p').map(&:text) # => ["foo", "bar"] 

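See also "How to avoid joining all text from Nodes when scraping".

The documentation says this about content, text and inner_text when called on a single Node:

Returns the content for this Node.

So you need to go after the text of the individual nodes: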

require 'nokogiri' 

html = <<EOT 
    <table class="open"> 
     <tr> 
      <th>Table name</th> 
      <th>Column name 1</th> 
      <th>Column name 2</th> 
      <th>Column name 3</th> 
      <th>Column name 4</th> 
      <th>Column name 5</th> 
     </tr> 
     <tr> 
      <th>Raw name 1</th> 
      <td>2,094</td> 
      <td>0,017</td> 
      <td>0,098</td> 
      <td>0,113</td> 
      <td>0,452</td>   
     </tr> 
     <tr> 
      <th>Raw name 5</th> 
      <td>2,094</td> 
      <td>0,017</td> 
      <td>0,098</td> 
      <td>0,113</td> 
      <td>0,452</td>   
     </tr> 
    </table> 
EOT 


tables_data = [] 

doc = Nokogiri::HTML(html) 

doc.css('table.open').each do |table| 

    # Find all rows in the current table, then iterate from the second row through the last one...
    table.css('tr')[1..-1].each do |tr|

        # Collect the raw name and the cell data from this row's cells...
        raw_name = tr.at('th').text
        cell_data = tr.css('td').map(&:text)

        # Aggregate it...
        tables_data += [raw_name, cell_data]
    end
end 

Which now results in:

tables_data 
# => ["Raw name 1", 
#  ["2,094", "0,017", "0,098", "0,113", "0,452"], 
#  "Raw name 5", 
#  ["2,094", "0,017", "0,098", "0,113", "0,452"]] 
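
The result above is a flat array that alternates a row name with that row's array of values. If you would rather have one entry per row, each_slice(2) can regroup it. This is only a sketch that assumes the alternating layout shown above; group_by_row_name and flatten_rows are hypothetical helper names:

```ruby
# Hypothetical helpers; both assume the flat array alternates
# [name, values, name, values, ...] as in the result above.

# Builds a hash keyed by row name.
def group_by_row_name(flat)
  flat.each_slice(2).to_h
end

# Builds one flat array per row: [name, value, value, ...].
def flatten_rows(flat)
  flat.each_slice(2).map { |name, values| [name, *values] }
end

tables_data = [
  "Raw name 1", ["2,094", "0,017"],
  "Raw name 5", ["2,094", "0,017"]
]

group_by_row_name(tables_data)
# => {"Raw name 1"=>["2,094", "0,017"], "Raw name 5"=>["2,094", "0,017"]}

flatten_rows(tables_data)
# => [["Raw name 1", "2,094", "0,017"], ["Raw name 5", "2,094", "0,017"]]
```

Note that the hash form keeps only the last row if two rows share the same name.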

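You can figure out how to coerce the quoted numbers into decimals that Ruby accepts, or how to manipulate the inner arrays however you want.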
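
As a starting point for that coercion, here is a minimal sketch. It assumes the comma in values like "2,094" is a decimal separator and not a thousands separator, which the question does not actually state; parse_decimal is a hypothetical helper name:

```ruby
# Hypothetical helper. ASSUMPTION: the comma is a decimal separator,
# so "2,094" means 2.094 (not two thousand ninety-four).
def parse_decimal(text)
  # Kernel#Float raises ArgumentError on malformed input instead of
  # silently returning 0.0 the way String#to_f does.
  Float(text.strip.tr(',', '.'))
end

cell_data = ["2,094", "0,017", "0,098", "0,113", "0,452"]
cell_data.map { |cell| parse_decimal(cell) }
# => [2.094, 0.017, 0.098, 0.113, 0.452]
```

If exact decimal arithmetic matters more than speed, BigDecimal from the standard library could replace Float in the same place.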

+0

Thank you very much for your answer and the explanation! The answer was very useful and helped me! – verrom

0

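I assume you borrowed some code from here or from some other related reference (apologies if I've added the wrong reference) - http://quabr.com/34781600/ruby-nokogiri-parse-html-table

However, if you want to capture all the rows, you can change the code as follows -

Hope this helps you solve your problem.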

# html is the HTML string from the question, so it is passed directly;
# open(html) would try to open a file or URL named by that string.
doc = Nokogiri::HTML(html, nil, 'UTF-8')

# We select 'table.open tr' because we want every row of that specific table

@tablesArray = doc.css('table.open tr').reduce([]) do |array, row|
    # Each row becomes an array of cell texts, close to the illustrated result,
    # e.g. ['Raw name 1', '2,094', '0,017', '0,098', '0,113', '0,452']
    array << row.css('th, td').map(&:text)
end 

render template: 'scrape_krasecology' 

Best wishes