Parse multiline text with a pattern

Here is a small example of the multiline text I need to parse with a pattern:

02-09-17 1:01 PM - Some User (Add comments) 
Hello, 

How are you? 

Regards, 

02-09-17 3:29 PM - Another User (Add comments) 
Hey, 

Thanks, all is fine. 

Some another text here. 

02-09-17 4:30 AM - Just a User (Add comments) 
some text 
with 
multiline

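I want to parse and process these three comments. What is the best way to do it?

I tried a regular expression like this one - http://www.rubular.com/r/k1CHJ1STTD - but I ran into a problem with the /m flag. Without the multiline flag, the regex cannot capture the "body" of the comment.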
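For context, here is a minimal sketch of how Ruby treats the flag mentioned above. In Ruby, /m only makes the dot match newlines, while $ matches at the end of every line regardless of flags and \z matches only at the end of the string. The sample string is made up for illustration.

```ruby
sample = "02-09-17 1:01 PM - Some User (Add comments)\nHello,\n\nHow are you?\n"

# Without /m, "." stops at the first newline, so only the header line matches.
without_m = sample[/.*/]

# With /m, "." also matches newlines, so a greedy ".*" swallows the whole string.
with_m = sample[/.*/m]

# "$" already matches at the end of the first line, with or without /m.
first_line_end = sample =~ /$/

# "\z" matches only at the very end of the string.
string_end = sample =~ /\z/

p without_m
p with_m == sample
p first_line_end == sample.index("\n")
p string_end == sample.length
```

This is why a lazy body followed by $ stops after one line in Ruby, and why a greedy body with /m runs through to the last comment.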

I also tried splitting with a regular expression:

text_above.split(/^(\d{1,2}-\d{1,2}-\d{2} \d{1,2}:\d{1,2} [AP]M - .+ \(Add comments\))/) 
=> ["", 
"02-09-17 1:01 PM - Some User (Add comments)", 
"\n" + "Hello,\n" + "\n" + "How are you?\n" + "\n" + "Regards,\n" + "\n", 
"02-09-17 3:29 PM - Another User (Add comments)", 
"\n" + "Hey,\n" + "\n" + "Thanks, all is fine.\n" + "\n" + "Some another text  here.\n" + "\n", 
"02-09-17 4:30 AM - Just a User (Add comments)", 
"\n" + "some text\n" + "with\n" + "multiline\n" + "\n", 
"02-09-17 5:29 PM - Another User (Add comments)", 
"\n" + "Hey,\n" + "\n" + "Thanks, all is fine.\n" + "\n" + "Some another text here.\n" + "\n", 
"02-09-17 6:30 AM - Just a User (Add comments)", 
"\n" + "some text\n" + "with\n" + "multiline\n"]

But this is not a comfortable solution.

Ideally I would like a regex with three captures, or a match with two groups, for example:

1. 02-09-17 1:01 PM 
2. Some User (Add comments) 
3. Hello, 

How are you? 

Regards,

for each comment, or an array of comments:

[['02-09-17 1:01 PM - Some User (Add comments) Hello, 

How are you? 

Regards,'],[...]]

Any ideas? Thanks.

Source

2017-02-12 prosto.vint

You can keep it simple with two splits (one for the whole string, and one for each block):

text.split(/\n\n(?=\d\d-)/).map { |m| m.split(/ - |\n/, 3) }
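To see what the two splits produce, here is a small runnable sketch. The text variable holds a shortened copy of the sample from the question; everything else is the one-liner above, unchanged.

```ruby
text = <<~TEXT
  02-09-17 1:01 PM - Some User (Add comments)
  Hello,

  How are you?

  02-09-17 4:30 AM - Just a User (Add comments)
  some text
  with
  multiline
TEXT

# Outer split: cut at a blank line that is followed by a new date.
# The lookahead keeps the date itself in the next block.
# Inner split: at most 3 pieces, so the body keeps its own newlines.
comments = text.split(/\n\n(?=\d\d-)/).map { |m| m.split(/ - |\n/, 3) }

comments.each { |time, user, body| p [time, user, body] }
```

The last body keeps the trailing newline of the input; calling strip on each body (or chomp on the text first) removes it.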

You can also use the scan method, but it is a bit more finicky:

text.scan(/([\d-]+[^-]+) - (.*)\n(.*(?>\n.*)*?(?=\n\n\d\d-|\z))/)
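The scan pattern is dense, so here is the same expression spelled out in extended mode (/x) with a comment on each part. The constant name COMMENT_RE and the shortened sample text are only for illustration.

```ruby
# Same pattern as the one-liner; literal spaces are escaped because
# extended mode ignores unescaped whitespace.
COMMENT_RE = /
  ([\d-]+[^-]+)        # group 1: date and time, everything before the separator
  \ -\                 # the literal space-dash-space separator
  (.*)\n               # group 2: rest of the header line
  (                    # group 3: the comment body
    .*                 #   first line of the body
    (?>\n.*)*?         #   more lines, added one at a time only when needed
    (?=\n\n\d\d-|\z)   #   stop before a blank line plus a new date, or at the end
  )
/x

text = "02-09-17 1:01 PM - Some User (Add comments)\nHello,\n\nHow are you?\n\n" \
       "02-09-17 4:30 AM - Just a User (Add comments)\nsome text\nwith\nmultiline"

rows = text.scan(COMMENT_RE)
rows.each { |row| p row }
```

Because the pattern has capture groups, scan returns one three-element array per comment.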

Source

2017-02-12 13:53:09

Thanks a lot, nice solution!

You can use this regex:

(\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM)) - (.*?)\r?\n((?:.|\r?\n)+?)(?=\r?\n\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM) - |$)

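(\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM)) is the first group and matches the date and time. The date must consist of three two-digit numbers separated by dashes, followed by a time with AM/PM.
(.*?)\r?\n((?:.|\r?\n)+?) matches the user name up to the first line break (\r?\n) as the second group. After that, anything, including line breaks, is matched and forms the third group, the comment.
On its own this would not work, because it would treat everything from the start of the first comment to the end of the file as one comment. So you need to look for the next date/time so that the match stops there. You could do that by repeating the date/time format after the comment and making the comment match non-greedy, but then the next date/time would be consumed by the current match and be missing from the next one (which would cause every second comment to be skipped). To get around this, you can use a positive lookahead: (?=\r?\n\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM) - |$). It requires that a date/time follows, but does not include it in the match. The last comment has to end at the end of the string, $.
You need the global flag /g, but you must not use the multiline flag /m, because the comment match spans multiple lines and the final $ has to mean the end of the string, not the end of a line.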

Here is a live example: https://regex101.com/r/o63GQE/2
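The expression above is written for the regex101 flavor. A hedged sketch of a Ruby adaptation, assuming Ruby is the target (the question uses split and Rubular): Ruby has no /g flag, so scan collects all matches, and since $ matches at every line end in Ruby, the final alternative is swapped for \z. The names HEADER, WITH_DOLLAR and WITH_Z and the shortened sample are only for illustration.

```ruby
# Date and time part, reused for the capture and for the lookahead.
HEADER = /\d{2}-\d{2}-\d{2} \d{1,2}:\d{2} (?:AM|PM)/

# Literal port: in Ruby, "$" matches at every line end, so the body stops early.
WITH_DOLLAR = /(#{HEADER}) - (.*?)\r?\n((?:.|\r?\n)+?)(?=\r?\n#{HEADER} - |$)/

# Adapted port: "\z" matches only at the end of the string.
WITH_Z = /(#{HEADER}) - (.*?)\r?\n((?:.|\r?\n)+?)(?=\r?\n#{HEADER} - |\z)/

text = "02-09-17 1:01 PM - Some User (Add comments)\nHello,\n\nHow are you?\n\n" \
       "02-09-17 4:30 AM - Just a User (Add comments)\nsome text\nwith\nmultiline"

truncated = text.scan(WITH_DOLLAR)
complete = text.scan(WITH_Z)

p truncated.map(&:last)
p complete.map(&:last)
```

With \z the first body still ends with one newline, because only a single line break before the next date is left out of the match; strip the body if that matters.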

Source

2017-02-12 13:55:43

Thank you very much for your help!

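slice_before may be easier to understand than one huge scan, and it has the advantage of keeping the pattern (a plain split drops the delimiter it matches):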

data = text.each_line.slice_before(/^\d\d\-\d\d\-\d\d/).map do |block| 
    time, user = block.shift.strip.split(' - ') 
    [time, user, block.join.strip] 
end 

p data 
# [["02-09-17 1:01 PM", 
# "Some User (Add comments)", 
# "Hello,\n\nHow are you?\n\nRegards,"], 
# ["02-09-17 3:29 PM", 
# "Another User (Add comments)", 
# "Hey,\n\nThanks, all is fine.\n\nSome another text here."], 
# ["02-09-17 4:30 AM", 
# "Just a User (Add comments)", 
# "some text\nwith\nmultiline"]]
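Since the goal is to process the comments, one possible next step is to turn each triple into an object with a real timestamp. This is only a sketch: the Comment struct is a made-up name, the sample text is shortened, and the format string assumes the dates are month-day-year, which the question does not state.

```ruby
require 'time'

# Hypothetical container for one parsed comment.
Comment = Struct.new(:time, :user, :body)

text = "02-09-17 1:01 PM - Some User (Add comments)\nHello,\n\nHow are you?\n\n" \
       "02-09-17 4:30 AM - Just a User (Add comments)\nsome text\nwith\nmultiline\n"

comments = text.each_line.slice_before(/^\d\d-\d\d-\d\d/).map do |block|
  stamp, user = block.shift.strip.split(' - ', 2)
  # Assumption: month-day-year; swap %m and %d if the data is day-month-year.
  Comment.new(Time.strptime(stamp, '%m-%d-%y %I:%M %p'), user, block.join.strip)
end

# Example of processing: find the earliest comment.
earliest = comments.min_by(&:time)
puts earliest.user
```

Limiting the header split to two pieces keeps a user name that itself contains " - " in one piece.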

Source

2017-02-12 14:28:31
