Why does this regex named group capture the wrong text?

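Does anyone have some insight into why the named group ref_id in regex1 contains Some address: loststreet 4 in the capture below?

I would expect it to be just loststreet 4, and I can't understand why it isn't. The code below is from an IRB session.

I have already taken string encoding into account: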

str1 = <<eos
Burp
FirstName: Al Bundy
Ref person:
Some address: loststreet 4
Some other address: loststreet 4
Zip code:
eos
# => "Burp\nFirstName: Al Bundy\nRef person:\nSome address: loststreet 4\nSome other address: loststreet 4\nZip code:\n" 

regex1 = /FirstName:\s?(?<name>[^\n]*).*Ref person:\s?(?<ref_id>[^\n]*).*Some other address: (?<other>[^\n]*)/mi 
# => /FirstName:\s?(?<name>[^\n]*).*Ref person:\s?(?<ref_id>[^\n]*).*Some other address: (?<other>[^\n]*)/mi 

str1.match(regex1) 
# => #<MatchData "FirstName: Al Bundy\nRef person:\nSome address: loststreet 4\nSome other address: loststreet 4" name:"Al Bundy" ref_id:"Some address: loststreet 4" other:"loststreet 4"> 

str1.encoding 
# => #<Encoding:UTF-8> 

regex1 = /FirstName:\s?(?<name>[^\n]*).*Ref person:\s?(?<ref_id>[^\n]*).*Some other address: (?<other>[^\n]*)/miu 
# => /FirstName:\s?(?<name>[^\n]*).*Ref person:\s?(?<ref_id>[^\n]*).*Some other address: (?<other>[^\n]*)/mi 

str1.match(regex1) 
# => #<MatchData "FirstName: Al Bundy\nRef person:\nSome address: loststreet 4\nSome other address: loststreet 4" name:"Al Bundy" ref_id:"Some address: loststreet 4" other:"loststreet 4">

Source

2013-08-19 pbb2

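Because you wrote an optional \s? in your regex (after "Ref person:"), and it can match a newline \n (when the parameter is empty). Replace it with [^\S\n]? (you must do the same for every \s? that must not match a newline.)

(Note that after each parameter you use .* to get to the next one. Replace it with .*?, which is lazy, to avoid too much backtracking.)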
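A minimal sketch of that change, assuming the same input as in the question (written here as a plain string literal so it can be pasted anywhere). The name fixed_regex is made up for this example. It swaps [^\S\n]? in for \s? after the first two labels and uses lazy .*? between the fields:

```ruby
# Same text as str1 in the question.
str1 = "Burp\nFirstName: Al Bundy\nRef person:\n" \
       "Some address: loststreet 4\nSome other address: loststreet 4\nZip code:\n"

# \s matches a newline, which is how the original pattern slipped onto the next line.
"\n".match?(/\s/)       # => true
# [^\S\n] reads as "whitespace, but not a newline".
"\n".match?(/[^\S\n]/)  # => false
" ".match?(/[^\S\n]/)   # => true

# fixed_regex is a name chosen for this sketch, not something from the question.
fixed_regex = /FirstName:[^\S\n]?(?<name>[^\n]*).*?Ref person:[^\S\n]?(?<ref_id>[^\n]*).*?Some other address: (?<other>[^\n]*)/mi

m = str1.match(fixed_regex)
m[:name]    # => "Al Bundy"
m[:ref_id]  # => ""
m[:other]   # => "loststreet 4"
```

With this pattern ref_id comes back as an empty string, because nothing follows "Ref person:" on its own line.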

Source

2013-08-19 15:02:22

I missed the fact that \s is the same as [ \t\r\n\f]. – pbb2

Use MatchData#[] to get the string for a specific group:

str1 = <<eos
Burp
FirstName: Al Bundy
Ref person:
Some address: loststreet 4
Some other address: loststreet 4
Zip code:
eos

regex1 = /FirstName:\s?(?<name>[^\n]*).*Ref person:\s?(?<ref_id>[^\n]*).*Some other address: (?<other>[^\n]*)/mi 
matched = str1.match(regex1) 

matched['name'] # => "Al Bundy" 
matched['other'] # => "loststreet 4"
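For what it's worth, MatchData#[] also accepts a symbol, and MatchData#names and MatchData#named_captures (the latter needs Ruby 2.4 or newer) list every group at once. A small sketch using the regex from the question unchanged, which makes it easy to see what each group actually grabbed:

```ruby
# Same text and pattern as in the question.
str1 = "Burp\nFirstName: Al Bundy\nRef person:\n" \
       "Some address: loststreet 4\nSome other address: loststreet 4\nZip code:\n"
regex1 = /FirstName:\s?(?<name>[^\n]*).*Ref person:\s?(?<ref_id>[^\n]*).*Some other address: (?<other>[^\n]*)/mi

matched = str1.match(regex1)

matched[:ref_id]        # => "Some address: loststreet 4"
matched.names           # => ["name", "ref_id", "other"]
matched.named_captures  # => {"name"=>"Al Bundy", "ref_id"=>"Some address: loststreet 4", "other"=>"loststreet 4"}
```

Note that this only reads the groups back. ref_id still holds the following line, since the pattern itself has not changed.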

Source

2013-08-19 15:00:35 falsetru

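One of the goals of writing code is to make it maintainable. Making it maintainable means making it easy to read and understand for those who look after the code later.

Regular expressions are often a maintenance nightmare and, in my experience, it's often possible to reduce their complexity, or replace them entirely, and end up with code that is just as useful. Parsing this kind of text is a good example of when not to use a complex pattern.

I'd do it this way: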

str1 = <<eos
Burp
FirstName: Al Bundy
Ref person:
Some address: loststreet 4
Some other address: loststreet 4
Zip code:
eos

def get_value(s)
    _, value = s.split(':')
    value.strip if value
end

rows = str1.split("\n")
firstname          = get_value(rows[1]) # => "Al Bundy"
ref_person         = get_value(rows[2]) # => nil
some_address       = get_value(rows[3]) # => "loststreet 4"
some_other_address = get_value(rows[4]) # => "loststreet 4"
zip_code           = get_value(rows[5]) # => nil

That splits the text into lines and picks out the data needed.

It can be reduced to something more succinct using map:

firstname, ref_person, some_address, some_other_address, zip_code = rows[1..-1].map{ |s| get_value(s) }
firstname          # => "Al Bundy"
ref_person         # => nil
some_address       # => "loststreet 4"
some_other_address # => "loststreet 4"
zip_code           # => nil
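One caveat with splitting on ':' is that a value which itself contains a colon gets cut short. A possible variant, sketched here under the assumption that such values can show up, passes a limit of 2 to String#split. The method name get_value_keeping_colons and the "Opened: 09:30" line are invented for illustration and are not part of the question's data:

```ruby
# Hypothetical variant of get_value: split at the first colon only,
# so colons inside the value are preserved.
def get_value_keeping_colons(line)
  _, value = line.split(':', 2)
  value = value.to_s.strip        # nil (no colon at all) becomes ""
  value.empty? ? nil : value      # treat a blank value as nil, like get_value
end

get_value_keeping_colons("FirstName: Al Bundy")  # => "Al Bundy"
get_value_keeping_colons("Ref person:")          # => nil
get_value_keeping_colons("Opened: 09:30")        # => "09:30"

# For comparison, an unlimited split keeps only the piece before the second colon.
"Opened: 09:30".split(':')[1].strip              # => "09"
```

With a positive limit, split keeps a trailing empty field, so "Ref person:" yields an empty value rather than a missing one. The explicit empty? check maps that back to nil.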

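If you absolutely must have a regular expression, just for the sake of having one, then simplify it and isolate its task. While it's possible to write a regex that spans multiple lines, skipping and capturing text as it goes, getting there is painful, the pattern becomes more and more fragile as the text grows, and it is likely to break if the incoming text changes. By reducing its complexity you're more likely to avoid that fragility and make your code more robust: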

def get_value(s) 
    s[/^([^:]+):(.*)/] 
    name, value = $1, $2 
    value.strip! if value 

    [name.downcase.tr(' ', '_'), value] 
end 

data_hash = Hash[ 
    str1.split("\n").select{ |s| s[':'] }.map{ |s| get_value(s) } 
] 
data_hash # => {"firstname"=>"Al Bundy", "ref_person"=>"", "some_address"=>"loststreet 4", "some_other_address"=>"loststreet 4", "zip_code"=>""}

Source

2013-08-19 16:11:03

It looks like your regex is missing some parts. Try:

regex1 = /FirstName:\s?(?<name>[^\n]*).*Ref person:\s?(?<ref_id>[^\n]*).*Some address:\s?(?<address>[^\n]*).*Some other address:\s?(?<other>[^\n]*)/mi

Using extended mode makes it easier to read:

regex1 = %r{ 
    FirstName:\s?(?<name>[^\n]*).* 
    Ref\ person:\s?(?<ref_id>[^\n]*).* 
    Some\ address:\s?(?<address>[^\n]*).* 
    Some\ other\ address:\s?(?<other>[^\n]*) 
}xmi

Just make sure to escape literal spaces.
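A short sketch of why the escaping matters, followed by the extended-mode pattern run against the text from the question (the string literal below is assumed to be the same input). In x mode, unescaped whitespace in the pattern is ignored:

```ruby
# In extended mode an unescaped space in the pattern is dropped.
/a b/x.match?("ab")    # => true
/a b/x.match?("a b")   # => false
/a\ b/x.match?("a b")  # => true

# Same text as str1 in the question.
str1 = "Burp\nFirstName: Al Bundy\nRef person:\n" \
       "Some address: loststreet 4\nSome other address: loststreet 4\nZip code:\n"

regex1 = %r{
  FirstName:\s?(?<name>[^\n]*).*
  Ref\ person:\s?(?<ref_id>[^\n]*).*
  Some\ address:\s?(?<address>[^\n]*).*
  Some\ other\ address:\s?(?<other>[^\n]*)
}xmi

m = str1.match(regex1)
m[:name]     # => "Al Bundy"
m[:ref_id]   # => ""
m[:address]  # => "loststreet 4"
m[:other]    # => "loststreet 4"
```

Here ref_id ends up empty because the pattern now has to find "Some address:" after it, which forces the engine to back off from swallowing that line.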

Source

2013-08-19 20:04:50 jaeheung
