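How do I handle blank array elements in a loop?
I have a Ruby script that loops through a list of projects. For each project, it loops through an HTML table, collects the td text from each row, and adds it to an array.
The problem is that when the table is empty for a particular project, the script adds an empty array to my two-dimensional array, which then causes an error when I try to use that array to insert data into the SQL database. How can I prevent the empty array from being appended to my array in the first place?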
projects.each do |project_id|
  url = "http://myurl.com/InventoryMaster.aspx?Qtr=%s&Client=%s" % [qtr, project_id[1]]
  page = Nokogiri::HTML(open(url))
  table = page.at('my_table')
  rows = Array.new
  table.search('tr').each do |tr|
    cells = Array.new
    tr.search('td').each do |cell|
      cells.push(cell.text.gsub(/\r\n?/, "").strip)
    end
    # add the project id to the cells array, and get rid of other array elements I don't need.
    cells.insert(1, project_id[0])
    cells.slice!(11, 6)
    cells.delete_at(8)
    cells.delete_at(2)
    cells.delete_at(0)
    rows.push(cells)
  end
  # first row in the html table is the headers. get rid of those.
  rows.shift
  # last row in the html table is the footers. get rid of those too.
  rows.pop
  p rows
end
Here is the HTML I'm parsing, as requested:
<table id="ctl00_MainContent_gvSearchResults" cellspacing="1" cellpadding="1"
border="1" style="color:Black;background-color:LightGoldenrodYellow;border-color:Tan;
border-width:1px;border-style:solid;" rules="cols">
<caption></caption>
<tbody>
<tr style="background-color:Tan;font-weight:bold;">
#I don't need the headers.
<th scope="col"></th>
<th scope="col"></th>
<th scope="col"></th>
<th scope="col"></th>
<th scope="col"></th>
<th scope="col"></th>
<th scope="col"></th>
<th scope="col"></th>
<th scope="col"></th>
<th scope="col"></th>
<th scope="col"></th>
<th scope="col"></th>
<th scope="col"></th>
<th scope="col"></th>
<th scope="col"></th>
<th scope="col"></th>
</tr>
<tr style="font-family:arial,tahoma;font-size:Smaller;">
<td>not needed</td>
<td>not needed</td>
<td>needed</td>
<td align="right">needed</td>
<td>needed</td>
<td>needed</td>
<td>needed</td>
<td>needed</td>
<td>not needed</td>
<td>needed</td>
#I don't need any of the remaining td's in this row either.
<td align="right"></td>
<td align="right"></td>
<td align="right"></td>
<td align="right"></td>
<td align="right"></td>
<td></td>
</tr>
#this row is the footer, and it isn't needed either.
<tr style="background-color:Tan;">
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
Once I've parsed the table, I need to add in the project ID, which is part of a key-value pair contained in the projects array.
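One possible way to keep empty rows out of the result is to filter on cell content instead of relying on shift and pop. The sketch below is only an illustration under stated assumptions: it works on plain arrays of cell text (what the inner td loop collects), data_rows and NEEDED_COLUMNS are hypothetical names, and the column positions are read from the "needed" markers in the sample HTML, which may not match the delete_at calls in the loop above, so check them against the real table.

```ruby
# ASSUMPTION: zero-based positions of the <td> cells marked "needed" in the sample row.
NEEDED_COLUMNS = [2, 3, 4, 5, 6, 7, 9].freeze

# raw_rows holds one array of stripped cell strings per <tr>.
# Rows with no <td> text at all (the <th> header row, the blank footer,
# or every row of an empty table) are dropped before the project id is added.
def data_rows(raw_rows, project_id)
  raw_rows
    .reject { |cells| cells.all? { |text| text.strip.empty? } }
    .map { |cells| [project_id] + cells.values_at(*NEEDED_COLUMNS) }
end

# Hypothetical sample input: project id => rows scraped for that project.
scraped = {
  101 => [[], %w[a b c d e f g h i j] + [""] * 6, [""] * 16],
  102 => [[], [""] * 16] # table with only a header and a footer
}

all_rows = []
scraped.each do |project_id, raw_rows|
  # concat appends nothing when data_rows returns [], so no empty array is stored.
  all_rows.concat(data_rows(raw_rows, project_id))
end
```

Because the header row contains only th elements, its td list is empty and it is rejected along with the blank footer, so a project whose table has no data simply contributes no rows.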
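Show some of the HTML to make your question complete. With that, we can easily show you how to parse it correctly, rather than trying to scan for the data afterwards. –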
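After 'table = page.at('my_table')', test 'if table.children.size <= 1' (something that checks whether my_table is blank) and skip the project when it is; that should skip empty tables. – bjhaid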
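@Tian Man - I've added my HTML table. I should have mentioned that the last 3 td's I need to parse are dates, which need to be parsed as mm-dd-yyyy. I just realized that I'm also having a problem with this script when the day part of the date is a single digit. – hyphen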
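For the single-digit day issue mentioned in the last comment, one option is to parse the cell text into a Date and format it again, rather than slicing the string. This is a minimal sketch under an assumption the thread does not confirm: that the cells look like "m/d/yyyy". normalize_date and INPUT_FORMAT are hypothetical names; adjust the format string to the real data.

```ruby
require "date"

# ASSUMPTION: the date cells look like "3/7/2013"; change this if the site differs.
INPUT_FORMAT = "%m/%d/%Y"

# Returns the date as mm-dd-yyyy, or nil for blank or unparseable text.
def normalize_date(text)
  cleaned = text.to_s.strip
  return nil if cleaned.empty?

  # strptime accepts one- or two-digit months and days; strftime zero-pads them.
  Date.strptime(cleaned, INPUT_FORMAT).strftime("%m-%d-%Y")
rescue ArgumentError
  # Date.strptime raises ArgumentError (or a subclass) on text it cannot parse.
  nil
end
```

Returning nil for blank cells leaves it up to the insert code to decide whether to store NULL or skip the row.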