2016-04-29 77 views
-1

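Regex to select sentences of a specific length

I need to extract sentences that contain a specific word from a block of text. This is what I have so far: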

[A-Z][^\\.;\\?\\!]*(word)[^\\.;\\?\\!]* 

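But I also need the sentence to be of a specific length, say 30 to 250 characters. I know this looks easy, but I can't figure out how to do it.

So the input could be: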

Welcome to RegExr v2.1 by gskinner.com, proudly **hosted** by Media Temple! A full Reference & Help is available in the Library, or watch the video Tutorial hosted by Media Temple which are so amazingly awesome that just looking at the name I get a boner instantly, and I am really serious right now, it's that exciting if you didn't get it. 

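The text above contains two sentences: one is 76 characters long and the other is 266. Both contain the word "hosted", which is the word we are selecting on, so the regex should match only the first sentence. The output should be: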

Welcome to RegExr v2.1 by gskinner.com, proudly **hosted** by Media Temple 

Thanks in advance.

+1

What is the input? Is it just your own data, for testing? – sweaver2112

+0

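This is a fairly hard problem, especially since we have no context. Please show what your block of text looks like. One example of the difficulty: abbreviations such as "U.S." for the United States. – lmo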

+0

It is also difficult because the regex capabilities in R are quite limited. You may be better off checking the length of the matches it finds. – 4castle

Answers

1

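I assume you are parsing English text.

You can use an NLP library to split the text into sentences, and then keep only those that contain the word and have a specific length. I took a Hemingway-related excerpt from Wikipedia, used the word "1940" to select sentences, and then applied a second grep to restrict their length.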

> require(tm) 
> require(openNLP) 
> text <- as.String("Ernest Hemingway wrote For Whom the Bell Tolls in Havana, Cuba; Key West, Florida; and Sun Valley, Idaho in 1939. In Cuba, he lived in the Hotel Ambos-Mundos where he worked on the manuscript. The novel was finished in July 1940 and published in October.It is based on Hemingway's experiences during the Spanish Civil War and features an American protagonist, named Robert Jordan, who fights with Spanish soldiers for the Republicans. The characters in the novel include those who are purely fictional, those based on real people but fictionalized, and those who were actual figures in the war. Set in the Sierra de Guadarrama mountain range between Madrid and Segovia, the action takes place during four days and three nights. For Whom the Bell Tolls became a Book of the Month Club choice, sold half a million copies within months, was nominated for a Pulitzer Prize, and became a literary triumph for Hemingway. Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75.") 
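> # Assumption: the annotator is not defined in the original transcript; openNLP's Maxent_Sent_Token_Annotator() is the usual choice
> sentence_token_annotator <- Maxent_Sent_Token_Annotator() 
> sentence.boundaries <- annotate(text, sentence_token_annotator) 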
> sentences <- text[sentence.boundaries] 
> sentences 
[1] "Ernest Hemingway wrote For Whom the Bell Tolls in Havana, Cuba; Key West, Florida; and Sun Valley, Idaho in 1939."                                 
[2] "In Cuba, he lived in the Hotel Ambos-Mundos where he worked on the manuscript."                                          
[3] "The novel was finished in July 1940 and published in October.It is based on Hemingway's experiences during the Spanish Civil War and features an American protagonist, named Robert Jordan, who fights with Spanish soldiers for the Republicans.[8]" 
[4] "The characters in the novel include those who are purely fictional, those based on real people but fictionalized, and those who were actual figures in the war."                      
[5] "Set in the Sierra de Guadarrama mountain range between Madrid and Segovia, the action takes place during four days and three nights."                             
[6] "For Whom the Bell Tolls became a Book of the Month Club choice, sold half a million copies within months, was nominated for a Pulitzer Prize, and became a literary triumph for Hemingway."               
[7] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75."                                       
> with_word = grep("1940", sentences, fixed = TRUE, value = TRUE) 
> with_word 
[1] "The novel was finished in July 1940 and published in October.It is based on Hemingway's experiences during the Spanish Civil War and features an American protagonist, named Robert Jordan, who fights with Spanish soldiers for the Republicans.[8]" 
[2] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75."                                       
> with_word[grep("^.{30,100}$", with_word)] 
[1] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75." 

In your case, use your own text and the {30,250} limiting quantifier to get just the sentences you need.
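If installing an NLP library is not an option, the same "match first, check the length afterwards" idea can be sketched in base R. This is only a rough illustration: `extract_by_length` is a hypothetical helper, the sample text is made up, and the pattern reuses the naive sentence regex from the question, so it inherits that regex's weaknesses.

```r
# Hypothetical helper: pull candidate sentences with the question's pattern,
# then keep only those whose length falls in [min_len, max_len].
extract_by_length <- function(text, word, min_len = 30, max_len = 250) {
  # Same shape as the pattern in the question; `word` is pasted in as-is,
  # so it must not contain unescaped regex metacharacters
  pattern <- paste0("[A-Z][^.;?!]*", word, "[^.;?!]*")
  hits <- regmatches(text, gregexpr(pattern, text))[[1]]
  n <- nchar(hits)
  hits[n >= min_len & n <= max_len]
}

# Made-up sample text: the first sentence is too short, the third lacks the word
txt <- paste("Short hosted one.",
             "This sentence is hosted somewhere and is comfortably longer than thirty characters.",
             "Nothing to see here.")
result <- extract_by_length(txt, "hosted", 30, 250)
print(result)
```

Because the pattern treats every period as a sentence end, tokens such as "v2.1" or "gskinner.com" cut a match short, which is one reason a sentence tokenizer is the more robust choice.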

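Note that it is possible to grep the sentences you need in one operation, but you will need a more complex PCRE regex with a lookahead:

> my_sent <- grep("(?s)^(?=.{30,100}$).*1940.*$", sentences, value = TRUE, perl = TRUE) 
> my_sent 
[1] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75." 

The "(?s)^(?=.{30,100}$).*1940.*$" regex requires the string to be 30 to 100 characters long from start to end (set your own limits), and the string must contain 1940. The leading ^ matters: without it the lookahead could succeed partway through a longer string. Note that if your word contains special regex metacharacters, they must be escaped with \\.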

> with_word = grep("(?s)^(?=.{30,250}$).*\\bhosted\\b.*$", sentences, perl = TRUE, value = TRUE) 
> with_word 
[1] "proudly hosted by Media Temple!" 
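
When the search word may contain regex metacharacters, one way to avoid escaping by hand is to wrap it in PCRE's \Q...\E literal quoting. The sketch below is an assumption-laden illustration: `length_word_pattern` is a hypothetical helper, and the sample strings are invented for the demonstration.

```r
# Hypothetical helper: build the one-pass pattern for an arbitrary literal word.
# \Q...\E makes PCRE treat the word literally (assumes the word itself
# does not contain the two characters backslash-E).
length_word_pattern <- function(word, min_len, max_len) {
  paste0("(?s)^(?=.{", min_len, ",", max_len, "}$).*\\Q", word, "\\E.*$")
}

# Invented sample strings
x <- c("Priced at $2.75 in the first edition print run.",      # literal match, length in range
       "Priced at $2x75, which is not the same literal text.", # "." must not act as a wildcard
       "Only $2.75.")                                          # literal match, but too short
matched <- grep(length_word_pattern("$2.75", 30, 100), x, perl = TRUE, value = TRUE)
print(matched)
```

Only the first string should be returned: the second lacks the literal "$2.75" and the third is shorter than 30 characters.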
+1

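Dear Wiktor. This is really great! A few comments on this post had led me to believe that R might not be the best tool for working with natural-language text. But ka-boom! You showed me a suitable add-on, and a complete solution to my problem. It shows I wasn't even close to the result. Thank you very much! I beg your pardon for the vague explanation in this post, but you understood it exactly. By the way, it took me 5 hours of workarounds to install openNLP; I had to downgrade Java and do a lot of other things, which is why I am late with this reply. Thank you, and have a good life :P – Denis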

0

You can use a positive lookahead:

(?=[\p{Any}]{30,250}.*) 
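
A small sketch of how such a lookahead behaves in R, and why it needs anchors to act as a length limit. It uses `.` rather than `\p{Any}` and toy strings with a 3 to 5 character limit, all of which are assumptions made for illustration.

```r
# Toy strings of length 2, 4 and 7
x <- c("ab", "abcd", "abcdefg")

# Anchored lookahead: the whole string must be 3 to 5 characters long
anchored <- grepl("^(?=.{3,5}$)", x, perl = TRUE)

# Unanchored lookahead: only asks for at least 3 characters somewhere ahead,
# so the 7-character string passes as well
unanchored <- grepl("(?=.{3,5})", x, perl = TRUE)

print(anchored)
print(unanchored)
```

The lookahead itself consumes no characters; it only checks what follows the current position, so the upper bound is enforced only when the pattern is tied to both the start and the end of the string.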
+0

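I beg your pardon, but as you may have noticed, I am not good at regex. I don't quite understand how I can use a positive lookahead in this particular example. What would it look like in our case? – Denis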

+0

A positive lookahead ensures that whatever comes next in the input must match the lookahead group. Let me look at your updated question. –

+0

How do we know where the first sentence ends? Are the sentences separated, or are they on the same line? –