2013-06-12 177 views
4

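Splitting a string into substrings in Ruby using a regular expression

I am trying to split a string in Ruby into smaller substrings or phrases based on a list of stop words. The split method works when I define the regex pattern directly; however, it does not work when I try to build the pattern by evaluating it inside the split call itself.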

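In practice, I want to read an external file of stop words and use it to split my sentences. So I want to be able to build the pattern from an external file rather than specify it directly. I also noticed that I get very different behavior when I use 'pp' versus 'puts', and I do not know why. I am using Ruby 2.0 and Notepad++ on Windows.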

require 'pp' 
str = "The force be with you."  
pp str.split(/(?:\bthe\b|\bwith\b)/i) 
=> ["", " force be ", " you."] 
pp str.split(/(?:\bthe\b|\bwith\b)/i).collect(&:strip).reject(&:empty?) 
=> ["force be", "you."] 

The last array above is the result I want. However, the following does not work:

require 'pp' 
stop_array = ["the", "with"] 
str = "The force be with you." 
pattern = "(?:" + stop_array.map{|i| "\b#{i}\b" }.join("|") + ")" 
puts pattern 
=> (?thwit) 
puts str.split(/#{pattern}/i) 
=> The force be with you. 
pp pattern 
=> "(?:\bthe\b|\bwith\b)" 
pp str.split(/#{pattern}/i) 
=> ["The force be with you."] 

Update: Using the comments below, I revised my original script. I also created a method to split the string.

require 'pp' 

class String
    def splitstop(stopwords=[])
        stopwords_regex = /\b(?:#{ Regexp.union(*stopwords).source })\b/i
        return split(stopwords_regex).collect(&:strip).reject(&:empty?)
    end
end

stop_array = ["the", "with", "over"] 

pp "The force be with you.".splitstop stop_array 
=> ["force be", "you."] 
pp "The quick brown fox jumps over the lazy dog.".splitstop stop_array 
=> ["quick brown fox jumps", "lazy dog."] 
+1

'/(?:\bthe\b|\bwith\b)/' is better written as '/\b(?:the|with)\b/'.

Answers

3

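I would do it this way:

str = "The force be with you."
stop_array = %w[the with]
stopwords_regex = /(?:#{ Regexp.union(stop_array).source })/i
str.split(stopwords_regex).map(&:strip) # => ["", "force be", "you."]

When using Regexp.union, it is important to watch the actual pattern that is generated:

/(?:#{ Regexp.union(stop_array) })/i
=> /(?:(?-mix:the|with))/i

The embedded (?-mix: turns off the case-insensitive flag inside the pattern, which can break the pattern and cause it to match the wrong things. Instead, you have to tell the engine to return only the pattern source, without flags: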

/(?:#{ Regexp.union(stop_array).source })/i 
=> /(?:the|with)/i 
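
To make the effect of the embedded flags concrete, here is a small sketch that splits the same sentence with both variants. The helper name stop_regex and its keep_flags parameter are made up for this illustration and are not part of the answer above.

```ruby
# Hypothetical helper (not from the original answer): builds the split
# regex either from the union's to_s form (which embeds its own flags)
# or from its bare source.
def stop_regex(words, keep_flags)
  union = Regexp.union(words)
  inner = keep_flags ? union.to_s : union.source
  /(?:#{inner})/i
end

str = "The force be with you."
stop_array = %w[the with]

with_flags  = stop_regex(stop_array, true)
source_only = stop_regex(stop_array, false)

p with_flags              # /(?:(?-mix:the|with))/i
p str.split(with_flags)   # ["The force be ", " you."]  ("The" is no longer matched)
p str.split(source_only)  # ["", " force be ", " you."]
```

With the flags embedded, the outer /i is cancelled inside the group, so the capitalized "The" is not treated as a stop word.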

This is also why pattern = "(?:\bthe\b|\bwith\b)" does not work:

/#{pattern}/i # => /(?:\x08the\x08|\x08with\x08)/i 

Ruby sees "\b" in a double-quoted string as a backspace character. Instead, use:

pattern = "(?:\\bthe\\b|\\bwith\\b)" 
/#{pattern}/i # => /(?:\bthe\b|\bwith\b)/i 
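
This likely also explains the pp versus puts difference from the question: puts writes the raw backspace characters to the console, which erase neighboring characters on screen, while pp prints the inspected string, where a backspace is shown as "\b". The sketch below compares the three ways of writing the sequence; boundary_pattern is a made-up helper name used only for illustration.

```ruby
double_quoted = "\b"    # escape sequence: a single backspace character (ASCII 8)
single_quoted = '\b'    # two characters: a backslash followed by the letter b
escaped       = "\\b"   # the same two characters, written in a double-quoted string

p double_quoted.bytes        # [8]
p single_quoted.bytes        # [92, 98]
p single_quoted == escaped   # true

# Hypothetical helper: single-quoted literals keep the backslashes intact,
# so the regex engine later sees \b as a word boundary.
def boundary_pattern(words)
  '\b(?:' + words.join('|') + ')\b'
end

pattern = boundary_pattern(%w[the with])
puts pattern   # prints \b(?:the|with)\b
p pattern      # prints "\\b(?:the|with)\\b"
p "The force be with you.".split(/#{pattern}/i)   # ["", " force be ", " you."]
```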
0

You have to escape the backslashes:

"\\b#{i}\\b" 

pattern = "(?:" + stop_array.map{|i| "\\b#{i}\\b" }.join("|") + ")" 

And a minor improvement/simplification:

pattern = "\\b(?:" + stop_array.join("|") + ")\\b" 

Then:

str.split(/#{pattern}/i) # => ["", " force be ", " you."] 

If your stop list is short, I think this is the right approach.
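
Since the question mentions loading the stop words from an external file, here is a minimal sketch of that step. It assumes one stop word per line; the helper name stopword_regex_from_file is invented for this example, and a temporary file stands in for the real list. It uses Regexp.union rather than a plain join so that any regex metacharacters in the words are escaped.

```ruby
require 'tempfile'

# Hypothetical helper: reads one stop word per line (an assumed file
# layout) and builds a case-insensitive, whole-word regex from them.
def stopword_regex_from_file(path)
  words = File.readlines(path).map(&:strip).reject(&:empty?)
  /\b(?:#{Regexp.union(words).source})\b/i
end

# Demo: a temporary file stands in for the real stop-word list.
Tempfile.create('stopwords') do |file|
  file.puts %w[the with over]   # writes one word per line
  file.flush
  regex = stopword_regex_from_file(file.path)
  sentence = "The quick brown fox jumps over the lazy dog."
  p sentence.split(regex).map(&:strip).reject(&:empty?)
  # => ["quick brown fox jumps", "lazy dog."]
end
```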

+0

Use the generated pattern to show how this would solve the OP's problem.

0
stop_array = ["the", "with"] 
re = Regexp.union(stop_array.map{|w| /\s*\b#{Regexp.escape(w)}\b\s*/i}) 

"The force be with you.".split(re) # => 
[ 
    "", 
    "force be", 
    "you." 
] 
0
require 'pp'

s = "the force be with you."
stop_words = %w|the with is|
# dynamically create a case-insensitive regexp
regexp = Regexp.new(stop_words.join('|'), Regexp::IGNORECASE)
result = []
while (match = regexp.match(s))
    # keep the text before the stop word, skipping empty fragments
    result << match.pre_match unless match.pre_match.empty?
    s = match.post_match
end
# the last unmatched content, if any
result << s
# non-bang methods: compact! and strip! return nil when nothing changes
result = result.map(&:strip).reject(&:empty?)

pp result
=> ["force be", "you."]