2017-02-12 28 views
0

Here is a small example of the multiline text I need to parse, with a recurring pattern:

02-09-17 1:01 PM - Some User (Add comments) 
Hello, 

How are you? 

Regards, 

02-09-17 3:29 PM - Another User (Add comments) 
Hey, 

Thanks, all is fine. 

Some another text here. 

02-09-17 4:30 AM - Just a User (Add comments) 
some text 
with 
multiline 

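I would like to parse and process these three comments. What is the best way to do this?

I tried a regular expression like this one - http://www.rubular.com/r/k1CHJ1STTD but ran into a problem with the /m flag. Without the multiline flag, the regex cannot capture the "body" of a comment.

I also tried splitting by a regular expression: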

text_above.split(/^(\d{1,2}-\d{1,2}-\d{2} \d{1,2}:\d{1,2} [AP]M - .+ \(Add comments\))/) 
=> ["", 
"02-09-17 1:01 PM - Some User (Add comments)", 
"\n" + "Hello,\n" + "\n" + "How are you?\n" + "\n" + "Regards,\n" + "\n", 
"02-09-17 3:29 PM - Another User (Add comments)", 
"\n" + "Hey,\n" + "\n" + "Thanks, all is fine.\n" + "\n" + "Some another text  here.\n" + "\n", 
"02-09-17 4:30 AM - Just a User (Add comments)", 
"\n" + "some text\n" + "with\n" + "multiline\n" + "\n", 
"02-09-17 5:29 PM - Another User (Add comments)", 
"\n" + "Hey,\n" + "\n" + "Thanks, all is fine.\n" + "\n" + "Some another text here.\n" + "\n", 
"02-09-17 6:30 AM - Just a User (Add comments)", 
"\n" + "some text\n" + "with\n" + "multiline\n"] 

But this is not a comfortable solution.

Ideally I would like a regex with three captures, or two groups per match, for example:

1. 02-09-17 1:01 PM 
2. Some User (Add comments) 
3. Hello, 

How are you? 

Regards, 

for each comment, or an array of comments:

[['02-09-17 1:01 PM - Some User (Add comments) Hello, 

How are you? 

Regards,'],[...]] 

Any ideas? Thanks.

Answers

2

You can keep it simple with two splits (one for the whole string, and one for each block):

text.split(/\n\n(?=\d\d-)/).map { |m| m.split(/ - |\n/, 3) } 

You can also use the scan method, but it is a bit more finicky:

text.scan(/([\d-]+[^-]+) - (.*)\n(.*(?>\n.*)*?(?=\n\n\d\d-|\z))/) 
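
As a quick check, both one-liners can be run against the sample from the question. This is only a sketch: the `text` heredoc retypes the sample without trailing whitespace and without a final newline (an assumption; with a final newline the last body would keep it), and `by_split` and `by_scan` are names made up for illustration.

```ruby
# Sample text from the question (assumed: no trailing spaces, no final newline).
text = <<~TEXT.chomp
  02-09-17 1:01 PM - Some User (Add comments)
  Hello,

  How are you?

  Regards,

  02-09-17 3:29 PM - Another User (Add comments)
  Hey,

  Thanks, all is fine.

  Some another text here.

  02-09-17 4:30 AM - Just a User (Add comments)
  some text
  with
  multiline
TEXT

# Cut the text into comment blocks at every blank line that is followed
# by a date, then cut each block into at most three parts: time, user, body.
by_split = text.split(/\n\n(?=\d\d-)/).map { |m| m.split(/ - |\n/, 3) }

# Same triples in a single pass, using three capture groups.
by_scan = text.scan(/([\d-]+[^-]+) - (.*)\n(.*(?>\n.*)*?(?=\n\n\d\d-|\z))/)

p by_split == by_scan # => true
p by_split.first
# => ["02-09-17 1:01 PM", "Some User (Add comments)", "Hello,\n\nHow are you?\n\nRegards,"]
```

The limit of 3 in the inner split is what keeps the body in one piece: after the time and the user name have been cut off, the remainder is returned untouched, line breaks included.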

Thanks a lot, nice solution!

0

You can use this regular expression:

(\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM)) - (.*?)\r?\n((?:.|\r?\n)+?)(?=\r?\n\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM) - |$) 
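  • (\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM)) is the first group and matches the date and time. The date must consist of three two-digit numbers separated by hyphens, followed by a time with AM/PM.
  • (.*?)\r?\n((?:.|\r?\n)+?) matches the user name, up to the first line break (\r?\n), as the second group. After that, anything, including line breaks, is matched and forms the third group, the comment.
  • On its own this would not work, because it would treat everything from the start of a comment to the end of the file as the comment. So you need to look for the next date/time so that the match stops there. You could do this by repeating the date/time pattern after the comment and making the match non-greedy, but that would pull the next date/time into the current match and therefore exclude it from the next one (which would cause every second comment to be skipped). To get around this, use a positive lookahead: (?=\r?\n\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM) - |$). It checks that a date/time follows without including it in the match. The last comment has to end at the end of the string, $.
  • You need the global flag /g but must not use the multiline flag /m, because a comment spans multiple lines: with /m, $ would also match at the end of every line and each comment would be cut off after its first line.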

Here is a live example: https://regex101.com/r/o63GQE/2
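
Since the question is about Ruby, here is a hedged sketch of how this pattern might be carried over. Two things differ from the regex101 setup: Ruby has no /g flag (String#scan already returns every match), and in Ruby $ matches at the end of every line, so \z is swapped in for the end of the string. The names `stamp`, `comment` and `comments` are made up for illustration, the sample is a shortened copy of the question's text without a final newline, and the trailing line break of each body is stripped afterwards.

```ruby
# Shortened sample (assumed: no trailing spaces, no final newline).
text = <<~TEXT.chomp
  02-09-17 1:01 PM - Some User (Add comments)
  Hello,

  How are you?

  Regards,

  02-09-17 3:29 PM - Another User (Add comments)
  Hey,

  Thanks, all is fine.
TEXT

# Date/time pattern, reused for the first group and inside the lookahead.
stamp = /\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM)/

# Same structure as the pattern above, with \z instead of $ and no /m flag.
comment = /(#{stamp}) - (.*?)\r?\n((?:.|\r?\n)+?)(?=\r?\n#{stamp} - |\z)/

# scan yields one [time, user, body] triple per comment.
comments = text.scan(comment).map { |time, user, body| [time, user, body.strip] }

p comments
# => [["02-09-17 1:01 PM", "Some User (Add comments)", "Hello,\n\nHow are you?\n\nRegards,"],
#     ["02-09-17 3:29 PM", "Another User (Add comments)", "Hey,\n\nThanks, all is fine."]]
```

Interpolating `stamp` wraps it in a non-capturing group, so the three capture groups keep their numbering.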


Thank you very much for your help!

1

slice_before might be easier to understand than one huge scan, and it has the advantage of keeping the pattern (split removes it):

data = text.each_line.slice_before(/^\d\d\-\d\d\-\d\d/).map do |block| 
    time, user = block.shift.strip.split(' - ') 
    [time, user, block.join.strip] 
end 

p data 
# [["02-09-17 1:01 PM", 
# "Some User (Add comments)", 
# "Hello,\n\nHow are you?\n\nRegards,"], 
# ["02-09-17 3:29 PM", 
# "Another User (Add comments)", 
# "Hey,\n\nThanks, all is fine.\n\nSome another text here."], 
# ["02-09-17 4:30 AM", 
# "Just a User (Add comments)", 
# "some text\nwith\nmultiline"]]
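
To illustrate the remark about split removing the pattern, here is a small sketch on a shortened copy of the sample. The names `date`, `pieces` and `blocks` are made up for illustration. Note that split only drops the delimiter when the pattern has no capture group; the question's own attempt wraps the header in one, which is why the headers show up in its result.

```ruby
# Shortened sample with two comments.
text = <<~TEXT
  02-09-17 1:01 PM - Some User (Add comments)
  Hello,

  02-09-17 3:29 PM - Another User (Add comments)
  Hey,
TEXT

date = /^\d\d-\d\d-\d\d/

# split consumes the delimiter: the dates are gone from the pieces,
# and an empty string is left in front of the first match.
pieces = text.split(date)

# slice_before starts a new chunk at each matching line and keeps that line.
blocks = text.each_line.slice_before(date).to_a

p pieces.map { |piece| piece[0, 10] }
p blocks.map(&:first)
```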