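Split a string into lines and sentences, but ignore abbreviations

I have some string content that I need to split. First, I need to split the content into lines.

This is how I do that: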

str.split('\n').forEach((item) => {
    if (item) {
        // TODO: split also each line into sentences

        let data = {
            type: 'item',
            content: [{
                content: item,
                timestamp: Math.floor(Date.now() / 1000)
            }]
        };

        // Save `data` to DB
    }
});

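But now I also need to split each line into sentences. My difficulty is splitting it correctly. I would split the lines at `. ` (a dot followed by a space), but there is also an array of abbreviations that must not split the line:

const abbr = ['vs.', 'min.', 'max.']; // Just an example; there are 70 abbreviations in that array

...and there are a few rules:

Any number followed by a dot, or a single letter followed by a dot, must also be ignored when splitting: 1., 2., 30., A., b.
Upper and lower case must be ignored: "Max. Lorem ipsum" must not be split, and neither must "Lorem max. ipsum".

Example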

const str = 'Just some examples:\nThis example has min. 2 lines. Max. 10 lines. There are some words: 1. Foo and 2. bar.';

The result should be four data objects:

{ type: 'item', content: [{ content: 'Just some examples:', timestamp: 123 }] } 
{ type: 'item', content: [{ content: 'This example has min. 2 lines.', timestamp: 123 }] } 
{ type: 'item', content: [{ content: 'Max. 10 lines.', timestamp: 123 }] } 
{ type: 'item', content: [{ content: 'There are some words: 1. Foo and 2. bar.', timestamp: 123 }] }

Source

2016-10-06 user3142695

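You might, just might, be able to do this with a single regular expression (I couldn't, but that doesn't mean it's impossible), but it would be a beast to write and maintain. I'd suggest using a very loose regular expression to scan the string for potential matches, and then evaluating each match in context against a set of rules like the ones you describe. It is still complicated, but it should at least be easier to read and troubleshoot. Also, if you are splitting natural-language text, don't overlook cases like '"Hello, I'm Sue," she said. "Is this a string?" she asked. "It is."' and 'I like units such as "strings."' – Palpatim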
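
The "loose scan, then evaluate in context" idea from this comment could look roughly like the sketch below. It is only an illustration under assumptions: the names `ABBREVIATIONS`, `endsSentence` and `splitSentences` are hypothetical, the abbreviation list is the three-entry example from the question, and quoted speech or tokens with leading punctuation (such as "(min.") are not handled.

```javascript
// Hypothetical names; the abbreviation list is the example from the question.
const ABBREVIATIONS = new Set(['vs.', 'min.', 'max.']);

// Rule check: does this dot-terminated token really end a sentence?
function endsSentence(token) {
  if (ABBREVIATIONS.has(token.toLowerCase())) return false; // abbreviation, any case
  if (/^\d+\.$/.test(token)) return false;                  // numbering: "1.", "30."
  if (/^[a-z]\.$/i.test(token)) return false;               // single letter: "A.", "b."
  return true;
}

function splitSentences(line) {
  const sentences = [];
  // Loose scan: every run of non-space characters that ends in a dot and is
  // followed by whitespace or the end of the line is a candidate boundary.
  const candidate = /(\S+\.)(?=\s|$)/g;
  let start = 0;
  let match;
  while ((match = candidate.exec(line)) !== null) {
    if (endsSentence(match[1])) {
      const end = match.index + match[1].length;
      sentences.push(line.slice(start, end).trim());
      start = end;
    }
  }
  // Keep trailing text that has no closing dot, e.g. "Just some examples:"
  const rest = line.slice(start).trim();
  if (rest) sentences.push(rest);
  return sentences;
}

const example = 'Just some examples:\nThis example has min. 2 lines. Max. 10 lines. There are some words: 1. Foo and 2. bar.';
console.log(example.split('\n').flatMap(splitSentences));
```

Keeping the rules in a separate function means each new rule is one extra line in `endsSentence`, rather than another branch inside a single large pattern.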

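You can first detect the abbreviations and numberings in the string and replace the dot in each of them with a dummy string. After splitting the string at the remaining dots, you can restore the original dots. Once you have the sentences, you can split each one at the newline characters, as in your original code.

The updated code allows several dots in an abbreviation (as shown with p.o. and s.v.p.).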

var i, j, strRegex, regex, abbrParts;

// Placeholder that temporarily stands in for dots that must not split
const DOT = "_DOT_";

const abbr = ["p.o.", "s.v.p.", "vs.", "min.", "max."];

var str = 'Just some examples:\nThis example s.v.p. has min. 2 lines. Max. 10 lines. There are some words: 1. Foo and 2. bar. And also A. p.o. professional letters.';

console.log("String: " + str);

// Replace dots in abbreviations found in the string
for (i = 0; i < abbr.length; i++) {
    abbrParts = abbr[i].split(".");
    // \b anchors the abbreviation at a word boundary, so that "min." is not
    // matched inside a longer word such as "vitamin."
    strRegex = "(\\b" + abbrParts[0] + ")";
    for (j = 1; j < abbrParts.length - 1; j++) {
        strRegex += "(\\.)(" + abbrParts[j] + ")";
    }
    strRegex += "(\\.)(" + abbrParts[abbrParts.length - 1] + "\\W*)";
    regex = new RegExp(strRegex, "gi");
    str = str.replace(regex, function() {
        // arguments: full match, capture groups, offset, whole string
        var groups = arguments;
        var result = groups[1];
        for (j = 2; j < groups.length; j += 2) {
            result += (groups[j] === "." ? DOT + groups[j + 1] : "");
        }
        return result;
    });
}

// Replace dots after numbers found in the string
str = str.replace(/(\W*\d+)(\.)/gi, "$1" + DOT);

// Replace dots after single-letter numbering found in the string
str = str.replace(/(\W+[a-zA-Z])(\.)/gi, "$1" + DOT);

// Split the string at the remaining dots
var parts = str.split(".");

// Restore the protected dots in each sentence
var sentences = [];
regex = new RegExp(DOT, "g");
for (i = 0; i < parts.length; i++) {
    if (parts[i].trim().length > 0) {
        sentences.push(parts[i].replace(regex, ".").trim() + ".");
        // Index by sentences.length: empty parts are skipped, so i may run ahead
        console.log("Sentence " + sentences.length + ": " + sentences[sentences.length - 1]);
    }
}
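
The code above stops at the list of sentences. The sketch below is one possible way to combine the same placeholder technique with the line split and the data objects from the question. It is not the answer's own code: `escapeRegExp`, `protectDots` and `toItems` are hypothetical names, the placeholder is assumed never to occur in the input, and every abbreviation is assumed to start with a letter or digit.

```javascript
const DOT = '_DOT_'; // placeholder, assumed never to occur in the input text

// Escape regex metacharacters so that an abbreviation is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Replace every dot that must not split a sentence with the placeholder.
function protectDots(line, abbreviations) {
  let out = line;
  for (const abbr of abbreviations) {
    // \b keeps "min." from matching inside a longer word such as "vitamin."
    const regex = new RegExp('\\b' + escapeRegExp(abbr), 'gi');
    out = out.replace(regex, (found) => found.split('.').join(DOT));
  }
  // Numberings ("1.", "30.") and single letters ("A.", "b.")
  return out.replace(/\b(\d+|[a-z])\./gi, '$1' + DOT);
}

// Build one data object per sentence, processing the input line by line.
function toItems(str, abbreviations) {
  const timestamp = Math.floor(Date.now() / 1000);
  const items = [];
  for (const line of str.split('\n')) {
    // Each chunk keeps its closing dot, so text without one stays unchanged.
    const chunks = protectDots(line, abbreviations).match(/[^.]+\.?/g) || [];
    for (const chunk of chunks) {
      const text = chunk.split(DOT).join('.').trim(); // restore protected dots
      if (text) {
        items.push({ type: 'item', content: [{ content: text, timestamp }] });
      }
    }
  }
  return items;
}

const input = 'Just some examples:\nThis example has min. 2 lines. Max. 10 lines. There are some words: 1. Foo and 2. bar.';
console.log(JSON.stringify(toItems(input, ['vs.', 'min.', 'max.']), null, 2));
```

Matching chunks with `/[^.]+\.?/g` instead of calling `split(".")` avoids appending a dot to text that never had one, such as the line "Just some examples:".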

Source

2016-10-06 20:45:27 ConnorsFan

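I forgot one kind of abbreviation: they can have two dots, like 'p.o.'. With your code this turns the string 'professional' into a new string 'p_DOT_ofession', which should not happen. How can I improve your code? – user3142695

I will get back to you later today (after work). – ConnorsFan

'new RegExp("(\W*" + abbrParts[0] + ")(\.)(" + abbrParts[1] + "\W*)", "gi")' contains a common mistake with backslashes. "\." is an invalid escape sequence. –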
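
To make the backslash point concrete: in an ordinary JavaScript string literal, a sequence such as "\." or "\W" does not raise an error. The backslash is silently dropped, so the `RegExp` constructor receives a different pattern than intended. The short demonstration below uses hypothetical variable names and a simplified pattern.

```javascript
// One backslash: the string literal drops it before RegExp ever sees it.
const wrong = "\W*max\.";   // the string is: W*max.
// Two backslashes: the string keeps one, which is what the pattern needs.
const right = "\\W*max\\."; // the string is: \W*max\.

console.log(wrong);
console.log(right);

const wrongRe = new RegExp(wrong, "i"); // same as /W*max./i, "." matches any character
const rightRe = new RegExp(right, "i"); // same as /\W*max\./i, "\." matches a literal dot

console.log(wrongRe.test("maxi")); // the unintended pattern accepts "maxi"
console.log(rightRe.test("maxi")); // the intended pattern rejects it
console.log(rightRe.test("Max.")); // and accepts the abbreviation
```

A regex literal such as `/\W*max\./i` needs no doubling; the extra backslash is only required when the pattern is built from a string.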
