2016-11-18 48 views
0

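Removing duplicate key-value pairs where the value is a list

Below is my code to detect abbreviations and their long forms. The code loops over each line of the document, loops over every word in that line, and identifies abbreviation candidates. It then loops over every line of the document again to find an appropriate long form for the abbreviation. My problem is that if an acronym occurs more than once in the document, my output contains multiple instances of it. I want to print each abbreviation only once, with all of its possible long forms. Here is my code: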

public static void main(String[] args) throws FileNotFoundException 
    { 
     BufferedReader in = new BufferedReader(new FileReader("D:\\Workspace\\resource\\SampleSentences.txt")); 
     String str=null; 
     ArrayList<String> lines = new ArrayList<String>(); 
     String matchingLongForm; 
     List <String> matchingLongForms = new ArrayList<String>() ; 
     List <String> shortForm = new ArrayList<String>() ; 
     Map<String, List<String>> abbreviationPairs = new HashMap<String, List<String>>(); 


     try 
     { 
      while((str = in.readLine()) != null){ 
       lines.add(str); 
      } 
     } 
     catch (IOException e) 
     { 
      // TODO Auto-generated catch block 
      e.printStackTrace(); 
     } 
     String[] linesArray = lines.toArray(new String[lines.size()]); 




     // document wide search for abbreviation long form and identifying several appropriate matches 
     for (String line : linesArray){ 
      for (String word : (Tokenizer.getTokenizer().tokenize(line))){ 
       if (isValidShortForm(word)){ 
        for (int i = 0; i < linesArray.length; i++){ 
         matchingLongForm = extractBestLongForm(word, linesArray[i]); 
         //shortForm.add(word); 
         if (matchingLongForm != null && !(matchingLongForms.contains(matchingLongForm))){ 
          matchingLongForms.add(matchingLongForm); 

          //System.out.println(matchingLongForm); 
          abbreviationPairs.put(word, matchingLongForms); 
          //matchingLongForms.clear(); 
         } 
        } 

        if (abbreviationPairs != null){ 
         //for(abbreviationPairs.) 
         System.out.println("Abbreviation Pair:" + "\t" + abbreviationPairs); 
         abbreviationPairs.clear(); 
         matchingLongForms.clear(); 
         //System.out.println("Abbreviation Pair:" + "\t" + abbreviationPairsNew); 
        } 


        else 
         continue; 
       } 
      } 
     } 
    } 

Below is the current output:

Abbreviation Pair: {GLBA=[Gramm Leach Bliley act]} 
Abbreviation Pair: {NCUA=[National credit union administration]} 
Abbreviation Pair: {FFIEC=[Federal Financial Institutions Examination Council]} 
Abbreviation Pair: {CFR=[comments for the Report]} 
Abbreviation Pair: {CFR=[comments for the Report]} 
Abbreviation Pair: {CFR=[comments for the Report]} 
Abbreviation Pair: {CFR=[comments for the Report]} 
Abbreviation Pair: {OFAC=[Office of Foreign Assets Control]} 
+0

Is 'Map<String, Set<String>> abbreviationPairs' an option? – bradimus

+0

Note the existence of ['Files.readAllLines'](https://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html#readAllLines(java.nio.file.Path,%20java.nio.charset.Charset)). By reinventing the wheel, you are wasting your time... Besides that, you can simply write 'for(String line: lines) {...' without copying the List's contents into an array. – Holger
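
As a rough illustration of Holger's comment, the sketch below reads the file with 'Files.readAllLines' and iterates over the returned list directly. The class name ReadLinesSketch, the helper readLines, the fallback file name, and the UTF-8 charset are assumptions made for this example; the original code does not state the file's encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

class ReadLinesSketch {

    // Reads the whole file into a List<String>, one element per line.
    // Assumption: the file is UTF-8 encoded.
    static List<String> readLines(Path file) throws IOException {
        return Files.readAllLines(file, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default path; pass the real file as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "SampleSentences.txt");
        // The list can be iterated directly; no copy into a String[] is needed.
        for (String line : readLines(file)) {
            System.out.println(line);
        }
    }
}
```

This replaces the BufferedReader loop, the try/catch around it, and the 'toArray' call in the question with a single library call.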

Answers

1

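You want key-value pairs of abbreviation and text, so you should use a Map. A map cannot contain duplicate keys; each key can map to at most one value.

The problem is where you print the output, not the map itself. You print inside the loop, so the map is displayed multiple times.

Move this code outside the loop: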

if (abbreviationPairs != null){ 
    //for(abbreviationPairs.) 
    System.out.println("Abbreviation Pair:" + "\t" + abbreviationPairs); 
    abbreviationPairs.clear(); 
    matchingLongForms.clear(); 
    //System.out.println("Abbreviation Pair:" + "\t" + abbreviationPairsNew); 
} 
+2

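More importantly, the map is cleared in every loop iteration, which makes it impossible to detect duplicate keys. Either way, moving the printing code out of the loop is the right solution. Care must be taken to create a new 'matchingLongForms' list for each mapping. The 'clear()' calls then become obsolete. – Holger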
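
The two points above (print once after the loops, and give each abbreviation its own list) can be sketched roughly as follows. This is not the asker's code: Tokenizer, isValidShortForm and extractBestLongForm are not shown in the question, so the sketch takes the validity check and the extractor as parameters and uses a plain regex split in place of the tokenizer. The class AbbreviationPairsSketch, the method toyLongForm and the sample sentences are hypothetical and exist only to make the example runnable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Predicate;

class AbbreviationPairsSketch {

    // Collects every distinct long form per abbreviation across the whole document.
    static Map<String, List<String>> collect(List<String> lines,
                                             Predicate<String> isShortForm,
                                             BiFunction<String, String, String> extractLongForm) {
        Map<String, List<String>> pairs = new LinkedHashMap<>();
        for (String line : lines) {
            // Assumption: split on non-word characters instead of the question's Tokenizer.
            for (String word : line.split("\\W+")) {
                if (!isShortForm.test(word)) {
                    continue;
                }
                for (String candidateLine : lines) {
                    String longForm = extractLongForm.apply(word, candidateLine);
                    if (longForm == null) {
                        continue;
                    }
                    // One list per key, created on first use; never shared and never cleared.
                    List<String> forms = pairs.computeIfAbsent(word, k -> new ArrayList<>());
                    if (!forms.contains(longForm)) {
                        forms.add(longForm);
                    }
                }
            }
        }
        return pairs;
    }

    // Hypothetical stand-in for extractBestLongForm: if the line contains "(ABBR)",
    // return the N words before it, where N is the abbreviation's length.
    static String toyLongForm(String abbr, String line) {
        int at = line.indexOf("(" + abbr + ")");
        if (at < 0) {
            return null;
        }
        String[] words = line.substring(0, at).trim().split("\\s+");
        int n = abbr.length();
        if (words.length < n) {
            return null;
        }
        return String.join(" ", Arrays.copyOfRange(words, words.length - n, words.length));
    }

    public static void main(String[] args) {
        // Made-up sample sentences, not taken from the asker's document.
        List<String> lines = Arrays.asList(
                "National credit union administration (NCUA) issues rules.",
                "The NCUA revised them.",
                "Use Information Systems (IS) wisely.",
                "Some read it as Internal Security (IS) instead.");
        Map<String, List<String>> pairs =
                collect(lines, w -> w.matches("[A-Z]{2,6}"), AbbreviationPairsSketch::toyLongForm);
        // Printed exactly once, after all loops have finished.
        System.out.println("Abbreviation Pair:\t" + pairs);
    }
}
```

Because the list for each key is looked up in the map rather than shared across keys, a repeated abbreviation adds to its existing entry instead of producing another printed line.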

+0

Thank you so much! I used a combination of your answers: I moved the printing code outside the loop and create a new list for matchingLongForms every time. – serendipity

4

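Try using java.util.Set to store your matched short forms and long forms. From the javadoc of that class:

... If this set already contains the element, the call leaves the set unchanged and returns false. In combination with the restriction on constructors, this ensures that sets never contain duplicate elements ...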
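
A minimal sketch of that behaviour, combined with the 'Map<String, Set<String>>' shape suggested in the comments on the question. The class SetDedupSketch and the helper record are names made up for this example; LinkedHashSet is chosen here only to keep insertion order when printing.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class SetDedupSketch {

    // Adds longForm under shortForm; returns false if that pair was already recorded.
    static boolean record(Map<String, Set<String>> pairs, String shortForm, String longForm) {
        return pairs.computeIfAbsent(shortForm, k -> new LinkedHashSet<>()).add(longForm);
    }

    public static void main(String[] args) {
        Map<String, Set<String>> pairs = new LinkedHashMap<>();
        boolean first = record(pairs, "CFR", "comments for the Report");
        boolean second = record(pairs, "CFR", "comments for the Report"); // duplicate, ignored by the set
        System.out.println(first + " " + second);
        System.out.println("Abbreviation Pair:\t" + pairs);
    }
}
```

With a Set as the value type, the explicit 'contains' check before adding a long form is no longer needed.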

0

Here is the solution

Thanks to code_angel and Holger

Move the printing code outside the loop and create a new list for each matching long form.

for (String line : linesArray){ 
     for (String word : (Tokenizer.getTokenizer().tokenize(line))){ 
      if (isValidShortForm(word)){ 
       for (int i = 0; i < linesArray.length; i++){ 
        matchingLongForm = extractBestLongForm(word, linesArray[i]); 
        List <String> matchingLongForms = new ArrayList<String>() ; 
        if (matchingLongForm != null && !(matchingLongForms.contains(matchingLongForm))&& !(abbreviationPairs.containsKey(word))){ 
         matchingLongForms.add(matchingLongForm); 
         //System.out.println(matchingLongForm); 
         abbreviationPairs.put(word, matchingLongForms); 
         //matchingLongForms.clear(); 
        } 
       } 

      } 
     } 
    } 
    if (abbreviationPairs != null){ 
     System.out.println("Abbreviation Pair:" + "\t" + abbreviationPairs); 
     //abbreviationPairs.clear(); 
     //matchingLongForms.clear(); 

    } 

} 

New output:

Abbreviation Pair: {NCUA=[National credit union administration], FFIEC=[Federal Financial Institutions Examination Council], OFAC=[Office of Foreign Assets Control], MSSP=[Managed Security Service Providers], IS=[Information Systems], SLA=[Service level agreements], CFR=[comments for the Report], MIS=[Management Information Systems], IDS=[Intrusion detection systems], TSP=[Technology Service Providers], RFI=[risk that FIs], EIC=[Examples of in the cloud], TIER=[The institution should ensure], BCP=[Business continuity planning], GLBA=[Gramm Leach Bliley act], III=[It is important], FI=[Financial Institutions], RFP=[Request for proposal]}