How to replace a list of strings in a text when some of the strings are substrings of others?

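I have a text containing some words that I want to tag, and the words to tag are held in a list. The problem is that some of these words are substrings of others, and I want to tag the longest string from the list that matches.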

For example, if my text is "foo and bar are different from foo bar." and my list contains "foo", "bar" and "foo bar", the result should be "[tag]foo[/tag] and [tag]bar[/tag] are different from [tag]foo bar[/tag].".

String text = "foo and bar are different from foo bar."; 

List<String> words = new ArrayList<>();
words.add("foo"); 
words.add("bar"); 
words.add("foo bar"); 

String tagged = someFunction(text, words);

What should the code of someFunction be so that the value of the string tagged is <tag>foo</tag> and <tag>bar</tag> are different from <tag>foo bar</tag>.?


2016-08-25 KevinJio

Sort by length.

...You can use [Collections.sort(List, Comparator)](https://docs.oracle.com/javase/7/docs/api/java/util/Collections.html#sort(java.util.List,%20java.util.Comparator)).

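Replace every matching word with a marker first (in my example I use |i| as the marker, where i is the index of the word being tagged), then replace the markers with the tagged words. Try this method: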

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

private static String someFunction(String text, List<String> words) {
    // Sort a copy, longest first, so the caller's list is left untouched
    List<String> sorted = new ArrayList<>(words);
    Collections.sort(sorted, new Comparator<String>() {
        @Override
        public int compare(String s1, String s2) {
            return Integer.compare(s2.length(), s1.length());
        }
    });

    // Replace every word in the text that matches a word in the list.
    // Note that we replace the matching word with a marker |0|, |1|, etc.
    // String.replace treats both arguments literally, so regex
    // metacharacters in a word cannot change what gets matched.
    for (int i = 0; i < sorted.size(); i++) {
        text = text.replace(sorted.get(i), "|" + i + "|");
    }

    // Replace every marker with the matching word wrapped in tags
    for (int i = 0; i < sorted.size(); i++) {
        text = text.replace("|" + i + "|", "<tag>" + sorted.get(i) + "</tag>");
    }

    return text;
}

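Warning: I am assuming that my marker '|i|' will never appear in the text. Replace my marker with any symbol that you do not expect to appear in your text. This is just an idea, not a perfect answer.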


2016-08-25 15:42:02 Aaron

Use String's split method and compare each word against the List.

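String somefunction(String text, List<String> words) {
    StringBuilder res = new StringBuilder();
    // Splitting on spaces means a multi-word entry such as "foo bar"
    // can never match, and neither can a word followed by punctuation.
    for (String st : text.split(" ")) {
        if (res.length() > 0) {
            res.append(' ');
        }
        if (words.contains(st)) {
            res.append("<tag>").append(st).append("</tag>");
        } else {
            res.append(st); // keep words that are not in the list
        }
    }
    return res.toString();
}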


2016-08-25 15:10:15 Darpan27

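You will want to use a regular expression that contains every possible word, with a greedy match of one or more of them. You can then use the regex match results to get each match, and because it is greedy, each match will be of maximal length. The regex itself will depend on your text and on how you want to treat whitespace, and on whether foobar should count as a match for "foo" and "bar".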
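Here is a minimal sketch of one way to read this suggestion, assuming plain substring matching (so foobar would count as containing both words). One caveat: Java's regex alternation tries the alternatives from left to right rather than picking the longest one, so the sketch sorts the words longest first before joining them with |. The class name AlternationTagger and the method tag are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class AlternationTagger {
    // Hypothetical helper: wraps every occurrence of a listed word in <tag>,
    // preferring the longest word when several could match at one position.
    static String tag(String text, List<String> words) {
        if (words.isEmpty()) {
            return text; // an empty alternation would match the empty string everywhere
        }
        // Sort a copy longest first: Java tries alternatives in order, not by length.
        List<String> sorted = new ArrayList<>(words);
        sorted.sort((a, b) -> Integer.compare(b.length(), a.length()));

        StringBuilder regex = new StringBuilder();
        for (String word : sorted) {
            if (regex.length() > 0) {
                regex.append('|');
            }
            regex.append(Pattern.quote(word)); // treat each word literally
        }
        // $0 refers to the whole match, so every hit is wrapped in a single pass.
        return Pattern.compile(regex.toString()).matcher(text).replaceAll("<tag>$0</tag>");
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("foo", "bar", "foo bar");
        System.out.println(tag("foo and bar are different from foo bar.", words));
    }
}
```

Because the text is scanned only once, text that has already been wrapped is never rescanned, so nested tags cannot occur with this approach.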


2016-08-25 15:22:13 user1676075

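This sounds like homework, but I will give you some pointers.

If B is a substring of A and B is not equal to A, then B must be shorter than A. You also said so yourself:

[...] but I want to tag the longest recognized string from the list.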

So we have to sort the list of words by length, longest first. I will let you figure out how to do that. You will be using Collections.sort(List<T>, Comparator<? super T>).
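For reference, the sort can be sketched as follows. The class name LengthSortDemo and the method longestFirst are hypothetical, and sorting a copy instead of the caller's list is a choice made here, not something the answer prescribes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class LengthSortDemo {
    // Hypothetical helper: returns a copy of the list sorted longest first.
    static List<String> longestFirst(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        Collections.sort(sorted, new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                // Arguments swapped so that longer strings sort earlier
                return Integer.compare(b.length(), a.length());
            }
        });
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(longestFirst(Arrays.asList("foo", "bar", "foo bar")));
    }
}
```

Collections.sort is stable, so words of equal length keep their original relative order.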

The next problem is the actual replacement. If you simply loop over all your words and use String.replaceAll(String, String), your example will end up like this:

<tag>foo</tag> and <tag>bar</tag> are different from <tag><tag>foo</tag> <tag>bar</tag></tag>.

This is because we first wrap "foo bar", and then we wrap both foo and bar again. Thankfully, the first parameter of String.replaceAll(String, String) is a regular expression.
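The nesting is easy to reproduce. The sketch below runs the naive loop over a list that is already sorted longest first. NaiveReplaceDemo and naiveTag are hypothetical names, and the Pattern.quote and Matcher.quoteReplacement calls are added here only so that each word is handled literally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveReplaceDemo {
    // Hypothetical helper: wraps each word in turn, with no protection
    // against re-wrapping text that an earlier pass already tagged.
    static String naiveTag(String text, List<String> longestFirst) {
        for (String word : longestFirst) {
            text = text.replaceAll(
                    Pattern.quote(word),
                    Matcher.quoteReplacement("<tag>" + word + "</tag>"));
        }
        return text;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("foo bar", "foo", "bar");
        // The "foo" and "bar" passes also hit the inside of <tag>foo bar</tag>
        System.out.println(naiveTag("foo and bar are different from foo bar.", words));
    }
}
```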

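The trick is to match the word, but only if it has not been wrapped already. Checking for a complete wrapper is not enough, though: you have to look at the leading and trailing sides separately, because the word could be the foo inside an already tagged <tag>foo bar</tag>. Something like "(?<!(\\w|>))+" + word + "(?!(\\w|<))+" will only match when word is not preceded by > or followed by <, and is not in the middle of another word. (I admit I am not great with regular expressions, so I am sure this could be better.)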
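A possible cleanup of that idea, offered as a sketch and not as the answer's exact pattern: the lookarounds are written as character classes without the trailing +, and each word is quoted so it is matched literally. LookaroundTagger and tag are hypothetical names. Note the limitation that follows from the approach: a word sitting in the interior of an already tagged phrase of three or more words touches neither > nor <, so it would still be wrapped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundTagger {
    // Hypothetical helper: tags words longest first and skips occurrences
    // that touch an existing tag or sit inside a longer word.
    static String tag(String text, List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort((a, b) -> Integer.compare(b.length(), a.length()));
        for (String word : sorted) {
            // Not preceded by a word character or '>', not followed by a word character or '<'
            String regex = "(?<![\\w>])" + Pattern.quote(word) + "(?![\\w<])";
            text = text.replaceAll(regex, Matcher.quoteReplacement("<tag>" + word + "</tag>"));
        }
        return text;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("foo", "bar", "foo bar");
        System.out.println(tag("foo and bar are different from foo bar.", words));
    }
}
```

With these lookarounds, foobar is left untagged, since each candidate match is adjacent to another word character.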


2016-08-25 16:04:59 bnorm
