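Determining whether a String is a proper noun in a text
I'm trying to parse a text (http://pastebin.com/raw.php?i=0wD91r2i) and retrieve the words and their number of occurrences. However, I must not include proper nouns in the final output, and I'm not sure how to accomplish that.
My attempt so far: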
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Scanner;

// The Word class (getWord, addOccurence, addLine, getOccurence, getLine, setTotalLine) is not shown here.
public class TextAnalysis
{
    public static void main(String[] args)
    {
        ArrayList<Word> words = new ArrayList<Word>(); // one Word object per distinct token
        try
        {
            int lineCount = 0;
            int wordCount = 0;
            int specialWord = 0;
            URL reader = new URL("http://pastebin.com/raw.php?i=0wD91r2i");
            Scanner in = new Scanner(reader.openStream());
            while(in.hasNextLine()) // parse the text line by line
            {
                lineCount++;
                // remove everything except letters and spaces, then split on whitespace
                String[] textInfo = in.nextLine().replaceAll("[^a-zA-Z ]", "").split("\\s+");
                wordCount += textInfo.length;
                for(int i=0; i<textInfo.length; i++)
                {
                    // connecting words are only counted, then skipped
                    if(textInfo[i].toLowerCase().matches("the|a|an|and|but|or|by|to|for|of|with|without|chapter|[0-9]+"))
                    {
                        specialWord++;
                        continue;
                    }
                    if(!textInfo[i].matches(".*\\w.*")) continue; // skip tokens with no word characters (e.g. empty strings)
                    boolean found = false;
                    for(Word word: words) // if the word is already in the list, update its count
                    {
                        if(word.getWord().equals(textInfo[i]))
                        {
                            word.addOccurence(1);
                            word.addLine(lineCount);
                            found = true;
                        }
                    }
                    if(!found) // otherwise add a new entry
                    {
                        words.add(new Word(textInfo[i], lineCount, 1));
                    }
                }
            }
            in.close();
            // adds data from a capitalized word to its lowercase counterpart -- ATTEMPT AT PROPER NOUNS HERE
            for(Word word: words)
            {
                for(int i=0; i<words.size(); i++)
                {
                    if(Character.isUpperCase(word.getWord().charAt(0)) && word.getWord().toLowerCase().equals(words.get(i).getWord()))
                    {
                        words.get(i).addOccurence(word.getOccurence());
                        words.get(i).addLine(word.getLine());
                    }
                }
            }
            Comparator<Word> occurenceComparator = new Comparator<Word>() // compares words by number of occurrences, descending
            {
                public int compare(Word n1, Word n2)
                {
                    if(n1.getOccurence() < n2.getOccurence()) return 1;
                    else if (n1.getOccurence() == n2.getOccurence()) return 0;
                    else return -1;
                }
            };
            Collections.sort(words); // requires Word to implement Comparable<Word>
            // Collections.sort(words, occurenceComparator);
            // ArrayList<Word> top_words = new ArrayList<Word>(words.subList(0,100));
            // Collections.sort(top_words);
            System.out.printf("%-15s%-15s%s\n", "Word", "Occurences", "Word Distribution Index");
            for(Word word: words)
            {
                word.setTotalLine(lineCount);
                System.out.println(word);
            }
            System.out.println(wordCount);
            System.out.printf("%s%.3f\n","The connecting word index is ",specialWord*100.0/wordCount);
        }
        catch(IOException ex)
        {
            System.out.println("WEB URL NOT FOUND");
        }
    }
}
Sorry about the formatting of the code; I'm not sure how to do it properly.
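The code checks whether a word is capitalized and, if a lowercase version of the same word exists, adds its data to the lowercase entry. However, this does not account for words that never appear in lowercase in the text, such as "Four" or "Now". How can I handle those without cross-referencing a dictionary?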
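The merge step described here can also be sketched with a map instead of nested loops over a list. This is only an illustration, not the code from the post: the class `CaseMergeSketch` and its two methods are hypothetical names, it counts plain strings rather than `Word` objects, and it assumes the capitalized entry should be dropped once its count has been folded into the lowercase one.

```java
import java.util.ArrayList;
import java.util.Map;
import java.util.TreeMap;

final class CaseMergeSketch {

    /** Counts each token exactly as written (case-sensitive). */
    static Map<String, Integer> countTokens(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        // Similar cleanup to the loop in the post, but all whitespace is kept
        // so that multi-line input is still split into separate words.
        for (String token : text.replaceAll("[^a-zA-Z\\s]", "").split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Folds a capitalized entry into its lowercase form when that form exists. */
    static Map<String, Integer> mergeCapitalized(Map<String, Integer> counts) {
        Map<String, Integer> merged = new TreeMap<>(counts);
        // Iterate over a snapshot of the keys so entries can be removed from the copy.
        for (String word : new ArrayList<>(counts.keySet())) {
            String lower = word.toLowerCase();
            if (Character.isUpperCase(word.charAt(0)) && counts.containsKey(lower)) {
                merged.merge(lower, counts.get(word), Integer::sum);
                merged.remove(word); // assumption: the capitalized entry is no longer wanted
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countTokens("The cat saw the dog. Now the dog saw Alice.");
        System.out.println(mergeCapitalized(counts));
    }
}
```

In this sample "The" is folded into "the", while "Now" and "Alice" are left untouched because neither appears in lowercase, which is exactly the gap the question is about.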
EDIT: I have solved the problem myself.
However, thank you, Wes, for attempting to answer.
There is no way to do this other than using some kind of dictionary. – 2014-12-05 23:39:16
I don't think it is possible to write logic that determines whether a word is a proper noun. – khelwood 2014-12-05 23:40:04
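Well, I don't think I have to cover every case, but I believe words that follow sentence-ending punctuation (. ! ?) should generally be treated as non-proper nouns, though there may be some false positives. I just need a solution that works for this specific text file. – 2014-12-05 23:50:14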
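The punctuation idea in the last comment could be sketched roughly as follows. This is one possible reading of it, not the solution the asker ended up with (which is not shown): a word seen capitalized in the middle of a sentence is treated as a likely proper noun and excluded, while a word capitalized only at the start of a sentence is counted in lowercase. The class name `SentenceStartHeuristic` and its methods are hypothetical. Unlike the code in the question, punctuation has to be kept until after sentence boundaries are detected.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class SentenceStartHeuristic {

    /** True if the raw token ends with . ! or ?, optionally followed by closing quotes or brackets. */
    static boolean endsSentence(String raw) {
        return raw.matches(".*[.!?][\"')\\]]*");
    }

    /** Pass 1: collect words that appear capitalized somewhere other than a sentence start. */
    static Set<String> likelyProperNouns(String text) {
        Set<String> proper = new HashSet<>();
        boolean sentenceStart = true;
        for (String raw : text.trim().split("\\s+")) {
            String word = raw.replaceAll("[^a-zA-Z]", "");
            if (!word.isEmpty() && !sentenceStart && Character.isUpperCase(word.charAt(0))) {
                proper.add(word);
            }
            // A punctuation-only token does not consume the "sentence start" position.
            sentenceStart = endsSentence(raw) || (word.isEmpty() && sentenceStart);
        }
        return proper;
    }

    /** Pass 2: count every remaining word in lowercase, skipping the likely proper nouns. */
    static Map<String, Integer> countCommonWords(String text) {
        Set<String> proper = likelyProperNouns(text);
        Map<String, Integer> counts = new TreeMap<>();
        for (String raw : text.trim().split("\\s+")) {
            String word = raw.replaceAll("[^a-zA-Z]", "");
            if (word.isEmpty() || proper.contains(word)) {
                continue;
            }
            counts.merge(word.toLowerCase(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String sample = "Now the cat saw Alice. Four dogs saw the cat. Alice ran.";
        System.out.println(countCommonWords(sample));
    }
}
```

The heuristic has the false positives the comment anticipates: the pronoun "I" and words inside capitalized titles are excluded, abbreviations such as "Mr." are read as sentence ends, and a proper noun that only ever appears at the start of a sentence is counted as an ordinary word.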