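Matching two strings where some of the text is optional?

I want to write a simple Java function that takes a list of language inputs and checks whether they match what I get back from a database query. All the strings in my database have been normalized to make searching easier. Here is an example.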

Research lab A wants participants with any of the following language inputs (separated by the pipe character |):

{English | English, Spanish | Spanish}

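In other words, this lab can take participants who are monolingual English, monolingual Spanish, or bilingual English and Spanish. This is pretty straightforward - if the database result returns "English" or "English, Spanish" or "Spanish", my function will find a match.

However, my database also marks whether a participant has only minimal input in a given language (using the ~ character).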

"English, ~Spanish" = participant hears English and a little Spanish
"English, ~Spanish, Russian" = participant hears English, Russian, and a little Spanish

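This is where I run into trouble. I want "English, ~Spanish" to match both "English" and "English, Spanish".

I was thinking of removing/hiding the languages marked with ~, but then if a research lab only wants {English, Spanish}, "English, ~Spanish" would not match even though it should.

I also can't figure out how to accomplish this with a regular expression. Any help would be greatly appreciated!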

Source

2012-06-15 LeoPardus

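So your problem is that you don't know what you should match "English, Spanish" against? – xvatar

No, the function needs to take an arbitrary list of language inputs and determine whether the query result matches. I only used English and Spanish as an example. If I got the input {Russian | English}, then possible matches would be: "Russian", "English", "Russian, ~German", "Russian, ~Spanish, ~Italian", etc. – LeoPardus

The problem is that this is a bad place to use regular expressions in the first place. Your database is not properly normalized. You should not store comma-separated lists of multiple values; you should have multiple single-value records instead. A regex solution to the above would be a) very complex, b) therefore hard to maintain, and c) slow. Try fixing your database, and you can solve this with a basic SELECT statement. – Tomalak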

Try this

\b(English[, ~]+Spanish|Spanish|English)\b

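Code

import java.util.regex.PatternSyntaxException;

try {
    // String.matches() succeeds only if the whole string matches the pattern
    if (subjectString.matches("(?im)\\b(English[, ~]+Spanish|Spanish|English)\\b")) {
        // String matched entirely
    } else {
        // Match attempt failed
    }
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}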

Explanation

"\\b" +              // Assert position at a word boundary
"(" +                // Match the regular expression below and capture its match into backreference number 1
                     // Match either the regular expression below (attempting the next alternative only if this one fails)
    "English" +      // Match the characters "English" literally
    "[, ~]" +        // Match a single character present in the list ", ~"
    "+" +            // Between one and unlimited times, as many times as possible, giving back as needed (greedy)
    "Spanish" +      // Match the characters "Spanish" literally
    "|" +            // Or match regular expression number 2 below (attempting the next alternative only if this one fails)
    "Spanish" +      // Match the characters "Spanish" literally
    "|" +            // Or match regular expression number 3 below (the entire group fails if this one fails to match)
    "English" +      // Match the characters "English" literally
")" +
"\\b"                // Assert position at a word boundary
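To see how this pattern behaves on the sample strings from the question, here is a minimal sketch. The class and method names (WholeStringMatchDemo, matchesWhole) are made up for illustration; the regex is the one from the answer.

```java
class WholeStringMatchDemo {
    // Same pattern as in the answer above.
    static final String REGEX = "(?im)\\b(English[, ~]+Spanish|Spanish|English)\\b";

    // String.matches() returns true only if the ENTIRE input matches the pattern.
    static boolean matchesWhole(String input) {
        return input.matches(REGEX);
    }

    public static void main(String[] args) {
        String[] samples = {
            "English",
            "Spanish",
            "English, Spanish",
            "English, ~Spanish",
            "English, ~Spanish, Russian",
            "~English, Spanish"
        };
        for (String sample : samples) {
            System.out.println("\"" + sample + "\" -> " + matchesWhole(sample));
        }
    }
}
```

Because String.matches() needs the whole string to match, the first four samples match, while "English, ~Spanish, Russian" and "~English, Spanish" do not.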

UPDATE

A more generalized form would be:

(?-i)\b([A-Z][a-z]+[, ~]+[a-z]+|[A-Z][a-z]+)\b

By the way, this can go wrong, because the pattern will match any capitalized word. A better option may be to generate the regex pattern using this syntax:

(A[, ~]+B|A|B)

where A and B are the names of the languages. I think this would be a better approach.
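A rough sketch of generating that (A[, ~]+B|A|B) pattern for two given language names might look like this. PairPatternBuilder and forPair are hypothetical names; Pattern.quote is used as an extra precaution so the names are treated as literal text, and the \b anchors are carried over from the first pattern.

```java
import java.util.regex.Pattern;

class PairPatternBuilder {
    // Builds \b(A[, ~]+B|A|B)\b for the two language names passed in.
    static Pattern forPair(String a, String b) {
        String qa = Pattern.quote(a); // quote so the names are matched literally
        String qb = Pattern.quote(b);
        return Pattern.compile("\\b(" + qa + "[, ~]+" + qb + "|" + qa + "|" + qb + ")\\b");
    }

    public static void main(String[] args) {
        Pattern p = forPair("Russian", "German");
        String[] samples = { "Russian", "German", "Russian, ~German", "German, Russian" };
        for (String sample : samples) {
            // matcher(...).matches() requires the whole string to match
            System.out.println("\"" + sample + "\" -> " + p.matcher(sample).matches());
        }
    }
}
```

Note that the order is fixed: with A = Russian and B = German, "Russian, ~German" matches as a whole string but "German, Russian" does not.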

Source

2012-06-15 04:13:21 Cylian

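Thanks, this looks promising. I forgot to explain that we are not always matching just English and Spanish. I'm trying to figure out how to do this without hard-coding "English" and "Spanish". If you have any further insights, please let me know! – LeoPardus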

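Also, in the example above, "~English, Spanish" would need to match {English, Spanish} as well, but if you add an optional [~]+ at the beginning, it will also match "~English, Spanish", which it should not! – LeoPardus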

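@Cylian This is a problem that cannot be solved with any *sensible* regular expression. Even your "generalized approach" is wrong. What about four languages? What about a different order? What about whitespace? The OP's whole approach is doomed. Fixing the database is the way to solve this problem; nothing else will help. – Tomalak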
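For the points raised here (any number of languages, any order, stray whitespace), one alternative that leaves the stored strings untouched is to compare sets instead of using a regex. The sketch below is only an illustration under an assumed rule: a participant matches an accepted combination when every full (unmarked) language is in that combination and every language in the combination is one the participant hears, fully or minimally. The class name LanguageMatcher, the method names, and the accepted-list format ("English | English, Spanish | Spanish", without braces) are all assumptions.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class LanguageMatcher {
    // participant: e.g. "English, ~Spanish"; accepted: e.g. "English | English, Spanish | Spanish"
    static boolean matches(String participant, String accepted) {
        Set<String> full = new HashSet<>();    // languages without ~
        Set<String> minimal = new HashSet<>(); // languages marked with ~
        for (String token : participant.split(",")) {
            String name = token.trim();
            if (name.isEmpty()) continue;
            if (name.startsWith("~")) minimal.add(normalize(name.substring(1)));
            else full.add(normalize(name));
        }
        Set<String> heard = new HashSet<>(full);
        heard.addAll(minimal);

        for (String option : accepted.split("\\|")) {
            Set<String> wanted = new HashSet<>();
            for (String token : option.split(",")) {
                String name = token.trim();
                if (!name.isEmpty()) wanted.add(normalize(name));
            }
            if (wanted.isEmpty()) continue;
            // Assumed rule: full languages must all be wanted, and nothing wanted may be unheard.
            if (wanted.containsAll(full) && heard.containsAll(wanted)) return true;
        }
        return false;
    }

    // Trim and lower-case so "English" and " english" compare equal.
    static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(matches("English, ~Spanish", "English | English, Spanish | Spanish"));
        System.out.println(matches("Russian, ~Spanish, ~Italian", "Russian | English"));
        System.out.println(matches("English, ~Spanish, Russian", "English, Spanish"));
    }
}
```

Under this assumed rule, "English, ~Spanish" matches both "English" and "English, Spanish" but not "Spanish" alone. A participant with only ~ languages has no full languages, so whether such a record should match anything would need an extra check that the thread does not specify.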
