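I assume you are parsing English text.
You can use an NLP library to split the text into sentences and then keep only those that contain the word
and have a specific length. I used an excerpt of the Hemingway biography from Wikipedia and extracted sentences with the word "1940", then applied a second grep
to limit their length.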
> require(tm)
> require(NLP)      # provides as.String() and annotate()
> require(openNLP)  # provides Maxent_Sent_Token_Annotator()
> text <- as.String("Ernest Hemingway wrote For Whom the Bell Tolls in Havana, Cuba; Key West, Florida; and Sun Valley, Idaho in 1939. In Cuba, he lived in the Hotel Ambos-Mundos where he worked on the manuscript. The novel was finished in July 1940 and published in October.It is based on Hemingway's experiences during the Spanish Civil War and features an American protagonist, named Robert Jordan, who fights with Spanish soldiers for the Republicans. The characters in the novel include those who are purely fictional, those based on real people but fictionalized, and those who were actual figures in the war. Set in the Sierra de Guadarrama mountain range between Madrid and Segovia, the action takes place during four days and three nights. For Whom the Bell Tolls became a Book of the Month Club choice, sold half a million copies within months, was nominated for a Pulitzer Prize, and became a literary triumph for Hemingway. Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75.")
> sentence_token_annotator <- Maxent_Sent_Token_Annotator()
> sentence.boundaries <- annotate(text, sentence_token_annotator)
> sentences <- text[sentence.boundaries]
> sentences
[1] "Ernest Hemingway wrote For Whom the Bell Tolls in Havana, Cuba; Key West, Florida; and Sun Valley, Idaho in 1939."
[2] "In Cuba, he lived in the Hotel Ambos-Mundos where he worked on the manuscript."
[3] "The novel was finished in July 1940 and published in October.It is based on Hemingway's experiences during the Spanish Civil War and features an American protagonist, named Robert Jordan, who fights with Spanish soldiers for the Republicans."
[4] "The characters in the novel include those who are purely fictional, those based on real people but fictionalized, and those who were actual figures in the war."
[5] "Set in the Sierra de Guadarrama mountain range between Madrid and Segovia, the action takes place during four days and three nights."
[6] "For Whom the Bell Tolls became a Book of the Month Club choice, sold half a million copies within months, was nominated for a Pulitzer Prize, and became a literary triumph for Hemingway."
[7] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75."
> with_word = grep("1940", sentences, fixed = TRUE, value = TRUE)
> with_word
[1] "The novel was finished in July 1940 and published in October.It is based on Hemingway's experiences during the Spanish Civil War and features an American protagonist, named Robert Jordan, who fights with Spanish soldiers for the Republicans."
[2] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75."
> with_word[grep("^.{30,100}$", with_word)]
[1] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75."
In your case, use your own text and the {30,250}
limiting quantifier to get just the sentences you need.
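If the length regex feels awkward, the same two-step filter can probably be written with nchar() instead. This is only a minimal sketch in base R: sample_sentences is a hypothetical stand-in for the output of the sentence splitter, and the 30 and 100 limits are the ones used above.

```r
# Hypothetical sample data standing in for the splitter output.
sample_sentences <- c(
  "In 1940 it sold.",
  "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75.",
  "The novel was finished in July 1940 and published in October, and it is based on experiences during the Spanish Civil War.",
  "In Cuba, he lived in the Hotel Ambos-Mundos where he worked on the manuscript."
)

# Step 1: literal (non-regex) test for the word.
has_word <- grepl("1940", sample_sentences, fixed = TRUE)

# Step 2: length test on the character count of each sentence.
len <- nchar(sample_sentences)
kept <- sample_sentences[has_word & len >= 30 & len <= 100]

print(kept)
```

Only the "Published on 21 October 1940..." sentence should pass both tests here: the first sentence is too short, the third is too long, and the last one does not contain the word.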
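Note that it is possible to grep the sentences you need in a single operation, but that requires a more complex PCRE regex with a lookahead:
> my_sent <- grep("(?s)^(?=.{30,100}$).*1940.*$", sentences, value = TRUE, perl = TRUE)
> my_sent
[1] "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75."
The "(?s)^(?=.{30,100}$).*1940.*$"
regex requires the string to be 30 to 100 characters long (set your own limits) from start to end, and the string must contain the word 1940
(note that if your word contains special regex metacharacters, they must be escaped with \\
). The ^ anchor matters: without it the lookahead could succeed partway through a longer sentence.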
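As an alternative to escaping each metacharacter by hand, PCRE's \Q...\E quoting can wrap the word so it is matched literally. This is a sketch under assumptions: word and sample_sentences are hypothetical names, and it will not work if the word itself contains \E.

```r
# Hypothetical sample data; the search word contains the metacharacters $ and .
sample_sentences <- c(
  "Published on 21 October 1940, the first edition print run was 75,000 copies priced at $2.75.",
  "In Cuba, he lived in the Hotel Ambos-Mundos where he worked on the manuscript."
)
word <- "$2.75"

# \Q...\E tells PCRE to treat everything in between as literal text.
pattern <- paste0("(?s)^(?=.{30,100}$).*\\Q", word, "\\E.*$")

matched <- grep(pattern, sample_sentences, value = TRUE, perl = TRUE)
print(matched)
```

This only applies with perl = TRUE, since \Q...\E is a PCRE feature.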
> with_word = grep("(?s)^(?=.{30,250}$).*\\bhosted\\b.*$", sentences, perl = TRUE, value = TRUE)
> with_word
[1] "proudly hosted by Media Temple!"
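What is the input:
Just a test with your data? – sweaver2112
This is a fairly hard problem, especially since we have no context. Please show what your block of text looks like. One example of the difficulty: abbreviations such as U.S. – lmo
It is also difficult because regex capabilities in R are quite limited. You might be better off checking the length of the matches it finds. – 4castle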