2014-09-21 36 views

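I'm currently working on a Markov chain text generator in Ruby. It takes a body of text (the "corpus") and generates new text based on it. The problem I need to solve right now is writing a Regexp that returns an array of matches, each containing the number of words I specify. What I want is to grab a certain number of words (specified by the user), repeatedly across the whole string. How can I use a Regexp to grab a specified number of words that may contain special characters?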

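Going off another application I've seen, I'm using something like /(([.,?"();\-!':—^\w]+ ){#{depth}})/, where #{depth} interpolates the number of words I need at a time. This is supposed to grab two words at a time while also allowing a subset of special characters, and that is the piece that's getting me. So the overall question is: how do I dynamically specify the number of words I want (separated by spaces) while also allowing a set of special characters within those words?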

Here is what I have so far:

# Regex 
@match_regex = /(([.,?"();\-!':—^\w]+ ){2})/
s = input.scan(@match_regex).to_a 
puts s.inspect 

# Input 
Within weeks they planned a meeting. She sent him poetry along with her itinerary, 
having worked in a business meeting to excuse the opportunity. He prepared flowers 
and a banner of welcome on his hearth. 

# Output - seems to be grabbing last word again for some reason 
[["Within weeks ", "weeks "], ["they planned ", "planned "], ["a meeting. ", "meeting. "], 
["She sent ", "sent "], ["him poetry ", "poetry "], ["along with ", "with "], 
["her itinerary, ", "itinerary, "], ["having worked ", "worked "], ["in a ", "a "], 
["business meeting ", "meeting "], ["to excuse ", "excuse "], 
["the opportunity. ", "opportunity. "], ["He prepared ", "prepared "], ["flowers and ", "and "], 
["a banner ", "banner "], ["of welcome ", "welcome "], ["on his ", "his "]] 

# Desired output. I'm not picky if it has trailing spaces or not as I can always trim that 
["Within weeks", "they planned", "a meeting.", "She sent", "him poetry", "along with", 
"her itinerary,", "having worked", "in a", "business meeting", "to excuse", "the opportunity.", 
"He prepared", "flowers and", "a banner", "of welcome", "on his"] 

Any help would be greatly appreciated. Thanks!

Answers

0

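In a regular expression, each pair of parentheses creates a capture group. When the pattern contains groups, scan returns, for each match found in your input, an array of those groups rather than the matched string itself.

You have two sets of parentheses: the first surrounds the whole expression and the second surrounds each word. (Note that for a repeated capture group such as (foo){x}, only the last repetition is returned.) That is why each match gives you a two-item list, with the last word appearing a second time.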
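A minimal sketch of that behaviour, using a simplified pattern (word characters only, followed by a space) rather than the full character class from the question. The `sample` string and variable names are assumptions made for illustration:

```ruby
# With capture groups in the pattern, String#scan returns one array of
# groups per match instead of the matched string.
sample = "Within weeks they planned "

with_groups = sample.scan(/((\w+ ){2})/)

# Group 1 is the whole two-word chunk. Group 2 only keeps the LAST
# repetition of the inner group, so the second word shows up again.
p with_groups
# => [["Within weeks ", "weeks "], ["they planned ", "planned "]]
```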

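To get what you want, you need to get rid of these capture groups. The first one can simply be removed. The second one can be turned into a non-capturing group by starting the parentheses with ?:. The expression you want is therefore: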

@match_regex = /(?:[.,?"();\-!':—^\w]+ ){2}/
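As a quick check, here is a sketch of the non-capturing version run on a shortened input, with `strip` applied to drop the trailing space. The names `sample` and `chunks` are assumed for illustration, and the space after `+` inside the group is the one implied by the trailing spaces in the question's output:

```ruby
sample = "Within weeks they planned a meeting. She sent"

# No capture groups, so scan returns a flat array of matched strings.
match_regex = /(?:[.,?"();\-!':—^\w]+ ){2}/

# Each match ends with the space that follows its second word; strip it.
chunks = sample.scan(match_regex).map(&:strip)
p chunks
# => ["Within weeks", "they planned", "a meeting."]
```

The final pair "She sent" is not returned here because "sent" is not followed by a space. The second answer's `(?:\s+|$)` alternative is one way to cover the end of the input.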

0

If I understand your question correctly, I think this should work for you:

def split_it(text, num_words, special_chars) 
    text.scan(/(?:[\w#{special_chars}]+(?:\s+|$)){#{num_words}}/) 
end 

text =<<_ 
Within weeks they planned a meeting. She sent him poetry along with her itinerary, 
having worked in a business meeting to excuse the opportunity. He prepared flowers 
and a banner of welcome on his hearth. 
_ 

special_chars = ".,?\"();\\-!':" 

split_it(text, 2, special_chars) 
    #=> ["Within weeks ", "they planned ", "a meeting. ", "She sent ", "him poetry ", 
    # "along with ", "her itinerary,\n", "having worked ", "in a ", 
    # "business meeting ", "to excuse ", "the opportunity. ", "He prepared ", 
    # "flowers\nand ", "a banner ", "of welcome ", "on his "] 
split_it(text, 3, special_chars) 
    #=> ["Within weeks they ", "planned a meeting. ", "She sent him ", 
    # "poetry along with ", "her itinerary,\nhaving ", "worked in a ", 
    # "business meeting to ", "excuse the opportunity. ", "He prepared flowers\n", 
    # "and a banner ", "of welcome on "] 

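Note the \\- in special_chars. If you had \- instead, it would appear inside the regex's brackets as a bare -, Ruby would expect you to be defining a range, and an exception would be raised. The extra backslash makes \- appear between the brackets, which tells Ruby it is a literal -. @Amadan points out that the escape is not needed if - is at the beginning or end of the string.

Markov chains? Hmm.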
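The escaping point can be checked directly. The sketch below builds the same kind of character class from an interpolated string; `word_class` is a hypothetical helper introduced only for this example:

```ruby
# Hypothetical helper: compile a "word" pattern from a list of extra characters.
def word_class(special_chars)
  Regexp.new("[\\w#{special_chars}]+")
end

# In a double-quoted string, "\\-" is the two characters \ and -,
# so the compiled class contains an escaped, literal hyphen.
escaped = ".,?\"();\\-!':"
p "so-called".scan(word_class(escaped))   # => ["so-called"]

# With a bare hyphen (which is also what "\-" produces in a double-quoted
# string) the class contains ";-!", and Ruby reads that as a range.
# ';' sorts after '!', so the range is empty and compilation fails.
unescaped = ".,?\"();-!':"
begin
  word_class(unescaped)
rescue RegexpError => e
  puts "RegexpError: #{e.message}"
end
```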


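Another way to handle '-' is to make sure it is the first or last character inside the square brackets; that way it denotes a literal dash rather than a range, even without escaping. – Amadan 2014-09-22 00:48:24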
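A sketch of that suggestion, reusing split_it from the answer above with the hyphen moved to the end of special_chars. The sample sentence is an assumption chosen to include a hyphenated word:

```ruby
def split_it(text, num_words, special_chars)
  text.scan(/(?:[\w#{special_chars}]+(?:\s+|$)){#{num_words}}/)
end

# Hyphen moved to the end of the list, with no backslash. After
# interpolation it sits immediately before the closing ] of the
# character class, so it is read as a literal hyphen.
special_chars = ".,?\"();!':-"

p split_it("a well-known fact, really", 2, special_chars)
# => ["a well-known ", "fact, really"]
```

This only works while the hyphen stays last in the string, so the escaped form is arguably the safer choice if special_chars comes from user input.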
