2014-12-05 88 views
0

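Determining whether a string in a text is a proper noun

I am trying to parse a text (http://pastebin.com/raw.php?i=0wD91r2i) and retrieve each word along with its number of occurrences. However, the final output must not include proper nouns, and I am not sure how to accomplish that.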

Here is my attempt:

import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Scanner;

// Note: the Word class is not shown in the question. Collections.sort(words)
// below requires Word to implement Comparable<Word>.
public class TextAnalysis
{
    public static void main(String[] args)
    {
        ArrayList<Word> words = new ArrayList<Word>(); // list of Word objects, one per distinct word
        try
        {
            int lineCount = 0;
            int wordCount = 0;
            int specialWord = 0;
            URL reader = new URL("http://pastebin.com/raw.php?i=0wD91r2i");
            Scanner in = new Scanner(reader.openStream());
            while (in.hasNextLine()) // parse the text line by line
            {
                lineCount++;
                // strip everything except letters and spaces, then split on whitespace
                String textInfo[] = in.nextLine().replaceAll("[^a-zA-Z ]", "").split("\\s+");
                wordCount += textInfo.length;
                for (int i = 0; i < textInfo.length; i++)
                {
                    // connecting words are only counted, then skipped
                    if (textInfo[i].toLowerCase().matches("the|a|an|and|but|or|by|to|for|of|with|without|chapter|[0-9]+"))
                    {
                        specialWord++;
                        continue;
                    }
                    if (!textInfo[i].matches(".*\\w.*")) continue; // skip tokens with no word characters (e.g. empty strings)
                    boolean found = false;
                    for (Word word : words) // if the word is already in the list, update its count
                    {
                        if (word.getWord().equals(textInfo[i]))
                        {
                            word.addOccurence(1);
                            word.addLine(lineCount);
                            found = true;
                        }
                    }
                    if (!found) // otherwise add a new entry
                    {
                        words.add(new Word(textInfo[i], lineCount, 1));
                    }
                }
            }
            // ATTEMPT AT PROPER NOUNS: merge the data of a capitalized word into its lowercase counterpart
            for (Word word : words)
            {
                for (int i = 0; i < words.size(); i++)
                {
                    if (Character.isUpperCase(word.getWord().charAt(0)) && word.getWord().toLowerCase().equals(words.get(i).getWord()))
                    {
                        words.get(i).addOccurence(word.getOccurence());
                        words.get(i).addLine(word.getLine());
                    }
                }
            }

            Comparator<Word> occurenceComparator = new Comparator<Word>() // orders words by descending number of occurrences
            {
                public int compare(Word n1, Word n2)
                {
                    if (n1.getOccurence() < n2.getOccurence()) return 1;
                    else if (n1.getOccurence() == n2.getOccurence()) return 0;
                    else return -1;
                }
            };
            Collections.sort(words);
            // Collections.sort(words, occurenceComparator);
            // ArrayList<Word> top_words = new ArrayList<Word>(words.subList(0,100));
            // Collections.sort(top_words);
            System.out.printf("%-15s%-15s%s\n", "Word", "Occurences", "Word Distribution Index");
            for (Word word : words)
            {
                word.setTotalLine(lineCount);
                System.out.println(word);
            }
            System.out.println(wordCount);
            System.out.printf("%s%.3f\n", "The connecting word index is ", specialWord * 100.0 / wordCount);
        }
        catch (IOException ex)
        {
            System.out.println("WEB URL NOT FOUND");
        }
    }
}

The formatting came out a bit off; I am not sure how to do that properly.

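The code checks whether a word is capitalized and, if a lowercase version of the same word exists, adds its data to the lowercase entry. However, this does not handle words that never appear in lowercase in the text, such as "Four" or "Now". How can I solve this without cross-referencing a dictionary?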

Edit: I have solved the problem myself.

Thanks to Wes for trying to answer, though.

+0

There is no way to do this other than using some kind of dictionary. – 2014-12-05 23:39:16

+0

I don't think a single piece of logic can decide whether a word is a proper noun. – khelwood 2014-12-05 23:40:04

+0

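Well, I don't think I have to cover every case, but I believe a word that follows sentence-ending punctuation (. ! ?) should generally be treated as a non-proper noun, although this may produce some false positives. I only need a solution that works for this specific text file. – 2014-12-05 23:50:14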
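The heuristic described in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not the asker's final solution (which was never posted): the class name `SentenceStartFilter` and the method `properNounCandidates` are hypothetical, and the text is assumed to be available as a single string with its punctuation intact. A capitalized word counts as a proper-noun candidate only when it does not directly follow `.`, `!` or `?`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of the program in the question.
class SentenceStartFilter {
    /** Returns capitalized words that do not start a sentence, in order of first appearance. */
    static List<String> properNounCandidates(String text) {
        List<String> candidates = new ArrayList<>();
        boolean sentenceStart = true; // the very first word starts a sentence
        for (String token : text.split("\\s+")) {
            // Does the raw token end a sentence (optionally followed by a quote or bracket)?
            boolean endsSentence = token.matches(".*[.!?][\"')\\]]*");
            String word = token.replaceAll("[^a-zA-Z]", "");
            if (!word.isEmpty()) {
                if (!sentenceStart
                        && Character.isUpperCase(word.charAt(0))
                        && !candidates.contains(word)) {
                    candidates.add(word);
                }
                sentenceStart = endsSentence;
            } else if (endsSentence) {
                sentenceStart = true; // a bare "..." or "!" still ends the sentence
            }
        }
        return candidates;
    }
}
```

Known limitations of this sketch: the pronoun "I" is reported as a candidate, a proper noun that opens a sentence is missed, and an abbreviation such as "Mr." makes the following name look like a sentence start.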

Answers

1

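It looks like your algorithm assumes that any word that appears capitalized but never appears uncapitalized is a proper noun. If that is the case, you can use the following algorithm to find the proper nouns.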

// Assume the whole file has been tokenized into a Collection<String> called allWords.
HashSet<String> lowercaseWords = new HashSet<>();
HashMap<String, String> lowerToCap = new HashMap<>();
for (String word : allWords) {
    if (word.isEmpty()) {
        continue; // skip empty tokens so charAt(0) cannot throw
    }
    if (Character.isUpperCase(word.charAt(0))) {
        lowerToCap.put(word.toLowerCase(), word);
    }
    else {
        lowercaseWords.add(word.toLowerCase());
    }
}

// Drop every capitalized word that also occurs in lowercase; only proper-noun candidates are left.
lowerToCap.keySet().removeAll(lowercaseWords);
for (String properNoun : lowerToCap.values()) {
    System.out.println("Proper Noun: " + properNoun);
}
+1

You could also use a TreeMap constructed with String.CASE_INSENSITIVE_ORDER, which would eliminate the need for the lowercase-to-capitalized map. – 2014-12-05 23:58:24
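One possible reading of this suggestion is sketched below; the commenter did not post code, so the class `CaseInsensitiveProperNouns`, the method `capitalizedOnly`, and the use of a Boolean flag as the map value are all assumptions made for illustration. Because the `TreeMap` compares keys case-insensitively, "Four" and "four" share one entry, and the value records whether a lowercase-initial spelling was ever seen.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper illustrating the TreeMap idea from the comment.
class CaseInsensitiveProperNouns {
    /** Returns words that only ever appear with a capital first letter, sorted case-insensitively. */
    static List<String> capitalizedOnly(List<String> allWords) {
        // Key: the word (first spelling encountered). Value: true once a lowercase-initial form was seen.
        Map<String, Boolean> seenLowercase = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        for (String word : allWords) {
            if (word.isEmpty()) {
                continue; // nothing to inspect
            }
            boolean startsLowercase = !Character.isUpperCase(word.charAt(0));
            // merge() ORs the new flag into the existing one; the stored key spelling is unchanged
            seenLowercase.merge(word, startsLowercase, Boolean::logicalOr);
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Boolean> entry : seenLowercase.entrySet()) {
            if (!entry.getValue()) {
                result.add(entry.getKey()); // never seen in lowercase
            }
        }
        return result;
    }
}
```

An entry whose flag is still false can only have been inserted from a capitalized spelling, so the stored key is already the capitalized form and no second map is needed.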

+0

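I haven't learned HashSet or HashMap in class yet, so I am not sure whether I can use them. Also, I am wondering whether this would account for words that only appear capitalized, such as "Four" in the text file. – 2014-12-06 00:06:58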

+0

It should print only the words that appear exclusively capitalized, so "Four" should show up. – 2014-12-06 00:12:20
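For readers who, like the asker, want to avoid HashSet and HashMap, the same rule can be expressed with ArrayList and plain loops. This is a minimal sketch under the same assumption as the answer (the text is already tokenized into a list of words); the names `ListOnlyProperNouns` and `capitalizedOnly` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; uses only ArrayList and loops.
class ListOnlyProperNouns {
    /** Returns words that only ever appear with a capital first letter, in order of first appearance. */
    static List<String> capitalizedOnly(List<String> allWords) {
        List<String> result = new ArrayList<>();
        for (String word : allWords) {
            if (word.isEmpty() || !Character.isUpperCase(word.charAt(0))) {
                continue; // only capitalized words can be candidates
            }
            if (result.contains(word)) {
                continue; // already reported
            }
            boolean hasLowercaseForm = false;
            for (String other : allWords) {
                if (!other.isEmpty()
                        && !Character.isUpperCase(other.charAt(0))
                        && other.equalsIgnoreCase(word)) {
                    hasLowercaseForm = true;
                    break;
                }
            }
            if (!hasLowercaseForm) {
                result.add(word);
            }
        }
        return result;
    }
}
```

The nested loops make this quadratic in the number of words, which should be acceptable for a single text file but is slower than the hash-based version. As noted in the question, words such as "Four" or "Now" are still reported whenever the text never uses them in lowercase.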
