2013-11-20 61 views
0

Ruby: problem parsing a CSV and looping through the rows

I have a CSV with some file names and dates:

"doc_1.doc", "date1" 
"doc_2.doc", "date2" 
"doc_5.doc", "date5" 

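The problem is that there are many gaps in the file numbers, for example between doc_2 and doc_5.

I am trying to write a script that parses the CSV, compares each row with the previous one, and fills in the missing rows.

In this example, it would add: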

"doc_3.doc", "date copied from date2" 
"doc_4.doc", "date copied from date2" 

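I want to write this script in Ruby because it is a language I am trying to learn, and I have clearly misunderstood how Ruby's loops work, since they are not the typical "for" loops that people often use in PHP and similar languages.

Here is my code so far; any help with the loop itself would be greatly appreciated!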

#!/usr/bin/env ruby 

require 'csv' 

# Load file 
csv_fname = './upload-list-docs.csv' 

# Parsing function 
def parse_csv(csv_fname) 
    uploads = [] 
    last_number = 0 

    # Regex to find number in doc_XXX.YYY 
    regex_find_number = /(?<=\_)(.*?)(?=\.)/ 

    csv_content = CSV.read(csv_fname) 

    # Skip header row 
    csv_content.shift 

    csv_content.each do |row| 
     current_number = row[0].match regex_find_number 
     current_date = row[1] 
     last_date = current_date 

     until last_number == current_number do 
      uploads << [last_number, last_date] 
      last_number += 1 
     end 
    end 

    return uploads 
end 

puts parse_csv(csv_fname) 

And some sample CSV:

"file_name","date" 
"doc_1.jpg","2011-05-11 09:16:05.000000000" 
"doc_3.doc","2011-05-11 10:10:36.000000000" 
"doc_4.doc","2011-05-11 10:17:19.000000000" 
"doc_6.doc","2011-05-11 10:58:35.000000000" 
"doc_7.pdf","2011-05-11 11:16:22.000000000" 
"doc_8.pdf","2011-05-11 11:19:29.000000000" 
"doc_9.docx","2011-05-11 11:40:03.000000000" 
"doc_13.pdf","2011-05-11 12:26:32.000000000" 
"doc_14.docx","2011-05-11 12:34:50.000000000" 
"doc_15.doc","2011-05-11 12:40:12.000000000" 
"doc_16.doc","2011-05-11 13:03:11.000000000" 
"doc_17.doc","2011-05-11 13:03:58.000000000" 
"doc_19.pdf","2011-05-11 13:25:07.000000000" 
"doc_20.rtf","2011-05-11 13:34:26.000000000" 
"doc_21.rtf","2011-05-11 13:35:25.000000000" 
"doc_24.doc","2011-05-11 13:49:02.000000000" 
"doc_25.doc","2011-05-11 14:05:04.000000000" 
"doc_26.pdf","2011-05-11 14:18:26.000000000" 
"doc_27.rtf","2011-05-11 14:30:19.000000000" 
"doc_28.doc","2011-05-11 14:33:13.000000000" 
"doc_29.jpg","2011-05-11 15:07:27.000000000" 
"doc_30.doc","2011-05-11 15:22:30.000000000" 
"doc_31.doc","2011-05-11 15:31:07.000000000" 
"doc_34.doc","2011-05-11 15:51:56.000000000" 
"doc_35.doc","2011-05-11 15:55:15.000000000" 
"doc_36.doc","2011-05-11 16:06:46.000000000" 
"doc_38.wps","2011-05-11 16:21:08.000000000" 
"doc_39.doc","2011-05-11 16:30:57.000000000" 
"doc_40.doc","2011-05-11 16:41:55.000000000" 
"doc_43.JPG","2011-05-11 17:03:40.000000000" 
"doc_46.doc","2011-05-11 17:28:13.000000000" 
"doc_51.doc","2011-05-11 17:50:34.000000000" 
"doc_52.doc","2011-05-11 18:03:13.000000000" 
"doc_53.doc","2011-05-11 18:43:48.000000000" 
"doc_54.doc","2011-05-11 18:54:45.000000000" 
"doc_55.doc","2011-05-11 19:31:03.000000000" 
"doc_56.doc","2011-05-11 19:31:23.000000000" 
"doc_57.doc","2011-05-11 20:17:38.000000000" 
"doc_59.jpg","2011-05-11 20:22:55.000000000" 
"doc_61.pdf","2011-05-11 21:14:52.000000000" 
+0

What happens when you run this code? –

+0

You get an infinite loop, right? –

+0

Yes, an infinite loop, because `current_number` never changes. – waffl

Answers

1

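An object-oriented approach. Note that I wrote this when I thought you wanted to fill the gaps with [doc_X.doc, date] rather than [X, date]; this approach suits that case better, because it needs more regex work on @file_name. It may be a bit verbose now, but it still works and is very readable.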

require 'csv'

class Upload
  attr_reader :file_number, :date

  def initialize(file_name_or_number, date)
    @date = date
    @file_number = if file_name_or_number.is_a?(String)
                     # Extract the digits between "_" and "." ("doc_12.pdf" -> 12)
                     file_name_or_number[/_(\d+)\./, 1].to_i
                   else
                     file_name_or_number
                   end
  end

  def to_a
    [@file_number, @date]
  end
end

class UploadCollection
  attr_reader :uploads

  def initialize(input_file)
    # Drop the header row (keep everything after the first element)
    input_data = CSV.read(input_file)[1..-1]
    # Create an array of Upload objects and sort by file number
    @uploads = input_data
      .map { |row| Upload.new(row[0], row[1]) }
      .sort_by(&:file_number)
  end

  def fill_blanks!
    # Get the smallest and largest file number
    # (they're sorted this way, remember)
    min, max = @uploads.first.file_number, @uploads.last.file_number
    # Create an array of all numbers between min and max, and
    # remove those elements already representing a file number
    missing = (min..max).to_a - @uploads.map(&:file_number)
    missing.each do |num|
      # Gaps are filled in ascending order, so every number below num is
      # already present: num belongs at index num - min, and the entry just
      # before it supplies the date. (Assumes file numbers are unique.)
      index = num - min
      @uploads.insert(index, Upload.new(num, @uploads[index - 1].date))
    end

    # Non-ambiguous return value
    true
  end

  def to_a
    @uploads.map(&:to_a)
  end

  def write_csv(file_path)
    CSV.open(file_path, 'wb') do |csv|
      csv << ['file_number', 'date'] # Headers
      to_a.each { |u| csv << u }
    end
  end
end

file = 'fnames.csv'
collection = UploadCollection.new(file)
collection.fill_blanks!
puts collection.to_a
collection.write_csv('out.csv')
+0

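This works perfectly; I just changed `def row [@file_name, @date] end` to `def row [@file_number, @date] end` to get a list of the numbers only. My one remaining question, if you can help, is how to output the array to a CSV file? – waffl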

+0

Edited the code to add a new method. – SLD

+0

Wow, simple. Thanks :) – waffl

0

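The problem is not with the loop (although the risky == should be changed to >=, as already mentioned above), but with extracting an integer from the regex match.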

current_number = row[0].match(regex_find_number)[0].to_i 
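
The conversion matters because `String#match` returns a `MatchData` object, and an Integer never compares equal to one, so `until last_number == current_number` cannot terminate. Below is a minimal sketch of the loop with the conversion applied. The names `file_number` and `fill_gaps` are hypothetical, the rows are inlined as arrays in the shape `CSV.read` would return them (header already removed), and the sketch assumes the rows are sorted by file number.

```ruby
# Why the original comparison never succeeds:
match = "doc_3.doc".match(/(?<=_)(.*?)(?=\.)/)
puts match.class      # a MatchData object, not an Integer
puts match[0].to_i    # the matched text converted to an Integer

# Hypothetical helper: pull the integer out of names like "doc_12.pdf".
def file_number(file_name)
  file_name[/_(\d+)\./, 1].to_i
end

# Returns [number, date] pairs with the gaps filled in. Each missing number
# takes the date of the row just before the gap.
def fill_gaps(rows)
  uploads = []
  last_date = nil

  rows.each do |name, date|
    current_number = file_number(name)

    # Emit the numbers missing between the previous row and this one.
    unless uploads.empty?
      ((uploads.last[0] + 1)...current_number).each do |missing|
        uploads << [missing, last_date]
      end
    end

    uploads << [current_number, date]
    last_date = date
  end

  uploads
end

sample = [
  ["doc_1.jpg", "2011-05-11 09:16:05.000000000"],
  ["doc_3.doc", "2011-05-11 10:10:36.000000000"],
  ["doc_4.doc", "2011-05-11 10:17:19.000000000"],
  ["doc_6.doc", "2011-05-11 10:58:35.000000000"]
]

filled = fill_gaps(sample)
filled.each { |number, date| puts "#{number},#{date}" }
```

Using a range between the previous number and the current one avoids a hand-maintained counter, so there is no equality test that could be skipped over.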
+0

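Unfortunately this doesn't seem to work, as I think the problem really is in the loop. I suspect the `each` function doesn't iterate over the CSV sequentially, which is why the loop is infinite. – waffl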

1

Here is how I would write the code:

require 'csv'
csv_fname = './upload-list-docs.csv'

# Create a structure to get some easy methods:
Myfile = Struct.new(:name, :date) do
  # The number between "_" and "." in the file name, as an integer
  def number
    name[/(?<=\_)(.*?)(?=\.)/].to_i
  end

  # A copy of this file with the number in its name incremented, same date
  def next_file
    Myfile.new(name.gsub(/(?<=\_)(.*?)(?=\.)/) { |num| num.next }, date)
  end
end

# Read the content (minus the header row) and add it to an array:
content = CSV.read(csv_fname)[1..-1].map { |data| Myfile.new(*data) }

# Add the first entry to a result array:
result = [content.shift]

until content.empty?
  # Get new file:
  new_file = content.shift

  # Fill up with new files until we hit next file:
  files_between = new_file.number - result.last.number
  unless files_between == 1
    (files_between - 1).times do
      result << result.last.next_file
    end
  end

  # Add next file:
  result << new_file
end

# Map result back to array:
result.map!(&:to_a)
+0

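This works well. Can you explain how I can get the output to be just the number (the regex result, without the prefix and suffix)? I tried modifying the next_file constructor `Myfile.new(name.gsub(/(?<=\_)(.*?)(?=\.)/){|num|num.next}, date)` but had no luck. – waffl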

+0

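Not sure what you mean. Maybe instead of `string.match(regexp)` you should use `string[regexp]`. It returns a string rather than a match object. If that is not what you want, you need to rephrase your question. – hirolau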

+0

Hmm, instead of `doc_1.jpg,date`, I want it to be just `1,date` – waffl
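
For the number-only output requested in the last comment, one option (a sketch, not something confirmed in the thread) is to leave `next_file` untouched and change only the final mapping step, so each entry becomes `[number, date]` instead of `[name, date]`. The struct below mirrors the `Myfile` struct from the answer but is renamed `NumberedFile`, a hypothetical name, so that the snippet runs on its own; the sample entries are illustrative.

```ruby
# Same idea as the Myfile struct in the answer, redefined here so the
# snippet is self-contained. NumberedFile is a hypothetical name.
NumberedFile = Struct.new(:name, :date) do
  # String#[] with a regex returns the matched text (a String), so to_i
  # yields the file number directly.
  def number
    name[/(?<=_)\d+(?=\.)/].to_i
  end
end

# Stand-in for the gap-filled result array built by the answer's loop.
files = [
  NumberedFile.new("doc_1.jpg", "2011-05-11 09:16:05.000000000"),
  NumberedFile.new("doc_2.jpg", "2011-05-11 09:16:05.000000000"),
  NumberedFile.new("doc_3.doc", "2011-05-11 10:10:36.000000000")
]

# Map to [number, date] rather than calling to_a, which would keep the name.
rows = files.map { |f| [f.number, f.date] }
rows.each { |row| puts row.join(",") }
```

In the answer's script this would correspond to replacing `result.map!(&:to_a)` with `result.map! { |f| [f.number, f.date] }`.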