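Classifying data with naive Bayes using LingPipe

I want to classify some data into different classes based on its content. I did this with a naive Bayes classifier, and as output I get the best-matching category for each item. Now I also want news that falls outside the training set to be classified as "others". I can't manually add every such item to a class in the training data, because the number of other categories is huge. Is there any way to classify this other data?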
import java.io.File;
import java.io.IOException;

import com.aliasi.classify.Classification;
import com.aliasi.classify.Classified;
import com.aliasi.classify.DynamicLMClassifier;
import com.aliasi.lm.NGramProcessLM;
import com.aliasi.util.Files;

// Class name is a placeholder; the original post shows only the fields and main().
public class ClassifyNews {

    private static File TRAINING_DIR = new File("4news-train");
    private static File TESTING_DIR = new File("4news-test"); // not used in the training step shown here
    private static String[] CATEGORIES = { "c1", "c2", "c3", "others" };
    private static int NGRAM_SIZE = 6;

    public static void main(String[] args) throws ClassNotFoundException, IOException {
        // One character n-gram language model per category.
        DynamicLMClassifier<NGramProcessLM> classifier =
                DynamicLMClassifier.createNGramProcess(CATEGORIES, NGRAM_SIZE);

        // Each category is expected as a subdirectory of TRAINING_DIR.
        for (int i = 0; i < CATEGORIES.length; ++i) {
            File classDir = new File(TRAINING_DIR, CATEGORIES[i]);
            if (!classDir.isDirectory()) {
                String msg = "Could not find training directory=" + classDir
                        + "\nTraining directory not found";
                System.out.println(msg); // in case exception gets lost in shell
                throw new IllegalArgumentException(msg);
            }

            // Train the category's model on every file in its directory.
            String[] trainingFiles = classDir.list();
            for (int j = 0; j < trainingFiles.length; ++j) {
                File file = new File(classDir, trainingFiles[j]);
                String text = Files.readFromFile(file, "ISO-8859-1");
                System.out.println("Training on " + CATEGORIES[i] + "/" + trainingFiles[j]);
                Classification classification = new Classification(CATEGORIES[i]);
                Classified<CharSequence> classified = new Classified<CharSequence>(text, classification);
                classifier.handle(classified);
            }
        }
    }
}
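Not sure what you are asking. Your training set is labeled only with the categories C1, C2, C3, and you want to classify into 4 classes: C1, C2, C3, others? – amit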
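I strongly recommend picking up a pencil and making sure you understand which calculations need to be done. The challenge you are facing has nothing to do with the code but with the calculations, so your question may be better suited to http://stats.stackexchange.com/. If you need help with the calculations, see these lecture notes: http://www.inf.ed.ac.uk/teaching/courses/inf2b/lectureSchedule.html – matcheek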
@matcheek I think this question is actually about the LingPipe library, not about naive Bayes itself. –
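One possible way to get an "others" bucket without collecting training data for it is to reject low-confidence predictions: train only on c1, c2, c3, and fall back to "others" when even the best category fits the text poorly. The sketch below is not LingPipe API. It assumes you can obtain some length-normalized log-probability score per category for a document (how to read that from the classifier is not shown here), and the class name `OthersFallback`, the method `classifyWithFallback`, the score values and the threshold are all made up for illustration. Note that conditional probabilities over the trained categories always sum to 1, so they cannot signal "none of these"; the threshold has to be applied to a joint or per-character score and tuned on held-out data.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OthersFallback {

    // Label returned when no trained category fits well enough (assumed name).
    static final String OTHERS = "others";

    /**
     * Returns the category with the highest score, or OTHERS when that
     * highest score is still below the threshold.
     * Scores are assumed to be length-normalized log probabilities,
     * so higher (closer to zero) means a better fit.
     */
    static String classifyWithFallback(Map<String, Double> scores, double threshold) {
        String best = OTHERS;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            if (entry.getValue() > bestScore) {
                bestScore = entry.getValue();
                best = entry.getKey();
            }
        }
        return bestScore >= threshold ? best : OTHERS;
    }

    public static void main(String[] args) {
        double threshold = -2.5; // hypothetical value; tune on held-out data

        // Hypothetical scores for a document that resembles c1.
        Map<String, Double> onTopic = new LinkedHashMap<>();
        onTopic.put("c1", -1.8);
        onTopic.put("c2", -2.9);
        onTopic.put("c3", -3.1);

        // Hypothetical scores for a document unlike any trained category.
        Map<String, Double> offTopic = new LinkedHashMap<>();
        offTopic.put("c1", -3.4);
        offTopic.put("c2", -3.6);
        offTopic.put("c3", -3.2);

        System.out.println(classifyWithFallback(onTopic, threshold));
        System.out.println(classifyWithFallback(offTopic, threshold));
    }
}
```

With this approach "others" would be dropped from CATEGORIES during training, since it is assigned by the threshold rather than learned. Whether a single global threshold separates in-domain from out-of-domain news well enough depends on the data, so treat it as a starting point rather than a finished solution.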