2012-12-06 89 views

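How to test Weka text classification (FilteredClassifier)

I have looked at a lot of examples for this and so far had no luck. I would like to classify free text.

  1. Configure a text classifier (a FilteredClassifier using StringToWordVector and LibSVM).
  2. Train the classifier (add a bunch of documents, train on the filtered text).
  3. Serialize the FilteredClassifier to disk and quit the program.

Then, later on:

  1. Load the serialized FilteredClassifier.
  2. Classify stuff!

Everything works until I load the classifier back from disk and try to classify something. All the documentation and examples show the training and test lists being built at the same time, whereas in my case I am trying to build the test list after the fact.

A FilteredClassifier alone does not seem to be enough to create a test instance with the same "dictionary" as the original training set, so how do I save everything I need in order to classify at a later date?

http://weka.wikispaces.com/Use+WEKA+in+your+Java+code just says "Instances loaded from somewhere" and says nothing about using a similar dictionary.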

ClassifierFramework cf = new WekaSVM(); 
if (!cf.isTrained()) { 
    train(cf); // Train, save to disk 
    cf = new WekaSVM(); // reloads from file 
} 
cf.test("this is a test"); 

This ends up throwing:

java.lang.ArrayIndexOutOfBoundsException: 2 
at weka.core.DenseInstance.value(DenseInstance.java:332) 
at weka.filters.unsupervised.attribute.StringToWordVector.convertInstancewoDocNorm(StringToWordVector.java:1587) 
at weka.filters.unsupervised.attribute.StringToWordVector.input(StringToWordVector.java:688) 
at weka.classifiers.meta.FilteredClassifier.filterInstance(FilteredClassifier.java:465) 
at weka.classifiers.meta.FilteredClassifier.distributionForInstance(FilteredClassifier.java:495) 
at weka.classifiers.AbstractClassifier.classifyInstance(AbstractClassifier.java:70) 
at ratchetclassify.lab.WekaSVM.test(WekaSVM.java:125) 

Answer


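When you serialize your classifier, also serialize the header of your training Instances. The header holds the attribute definitions of the training data (the "dictionary"-like structure you are asking about):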

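import java.io.FileOutputStream;
import java.io.ObjectOutputStream;
import weka.core.Instances;

Instances trainInstances = ... // your training data

// An empty copy (capacity 0) keeps only the attribute definitions, i.e. the header.
Instances trainHeader = new Instances(trainInstances, 0);
trainHeader.setClassIndex(trainInstances.classIndex());

// try-with-resources flushes and closes the stream even if a write fails.
try (ObjectOutputStream objectOutputStream =
        new ObjectOutputStream(new FileOutputStream(fileName))) {
    objectOutputStream.writeObject(classifier);  // first object: the trained classifier
    objectOutputStream.writeObject(trainHeader); // second object: the header
}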

To deserialize:

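import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.ObjectInputStream;
import weka.classifiers.Classifier;
import weka.core.Instances;

Classifier classifier = null;
Instances trainHeader = null;

try (ObjectInputStream objectInputStream = new ObjectInputStream(
        new BufferedInputStream(new FileInputStream(fileName)))) {
    // Read the objects back in the same order they were written.
    classifier = (Classifier) objectInputStream.readObject();
    try { // see if we can load the header
        trainHeader = (Instances) objectInputStream.readObject();
    } catch (Exception e) {
        // No header in this file; trainHeader stays null.
    }
}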
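The write-two-objects, read-two-objects pattern above can be tried out without Weka on the classpath. The following is only a minimal sketch under stated assumptions: plain Strings stand in for the classifier and the header, an in-memory byte array stands in for the file, and the class name `TwoObjectStreamDemo` and its methods are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical demo class: Strings stand in for the Weka classifier and header.
public class TwoObjectStreamDemo {

    /** Writes the model and, if present, the header into one stream, in that order. */
    public static byte[] save(Serializable model, Serializable header) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(model);
            if (header != null) {
                out.writeObject(header);
            }
        }
        return bytes.toByteArray();
    }

    /** Returns {model, header}; header is null when the stream held only the model. */
    public static Object[] load(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            Object model = in.readObject();
            Object header = null;
            try {
                header = in.readObject();
            } catch (EOFException e) {
                // The stream ended after the first object: no header was stored.
            }
            return new Object[] {model, header};
        }
    }

    public static void main(String[] args) throws Exception {
        Object[] both = load(save("model", "header"));
        System.out.println(both[0] + " / " + both[1]);
        Object[] modelOnly = load(save("model", null));
        System.out.println(modelOnly[0] + " / " + modelOnly[1]);
    }
}
```

The order of the `readObject` calls has to match the order of the `writeObject` calls, which is why the classifier is always read first.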

Use trainHeader to create the new Instance:

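import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Utils;

int numAttributes = trainHeader.numAttributes();
double[] vals = new double[numAttributes];

for (int i = 0; i < numAttributes; i++) {
    Attribute attribute = trainHeader.attribute(i);
    if (i == trainHeader.classIndex()) {
        vals[i] = Utils.missingValue(); // the class is unknown at prediction time
    } else if (attribute.isString()) {
        vals[i] = attribute.addStringValue(myStrVal); // get myStrVal from your source
    } else if (attribute.isNominal()) {
        vals[i] = attribute.indexOfValue(myStrVal); // -1 if the label is not in the header
    } else {
        vals[i] = myNumericVal; // numeric attribute: get myNumericVal from your source
    }
}

// In Weka 3.7+ (the version in the stack trace) Instance is an interface; use DenseInstance.
Instance instance = new DenseInstance(1.0, vals);
instance.setDataset(trainHeader);
return instance;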
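As background on why new instances have to follow the training-time structure: a word-vector filter fixes its dictionary when it is trained, and later text is mapped onto those same columns. The toy sketch below illustrates that idea in plain Java. It is not Weka's StringToWordVector implementation; the class `FixedDictionaryDemo`, its whitespace tokenizer and its raw word counts are assumptions made purely for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy vectorizer: the dictionary is built once, from the training texts only.
public class FixedDictionaryDemo {

    private final Map<String, Integer> dictionary = new LinkedHashMap<>();

    public FixedDictionaryDemo(List<String> trainingTexts) {
        for (String text : trainingTexts) {
            for (String word : tokenize(text)) {
                // A word seen for the first time gets the next free column index.
                dictionary.putIfAbsent(word, dictionary.size());
            }
        }
    }

    private static String[] tokenize(String text) {
        return text.toLowerCase().trim().split("\\s+");
    }

    /** Maps text onto the training-time columns; words outside the dictionary are ignored. */
    public double[] vectorize(String text) {
        double[] counts = new double[dictionary.size()];
        for (String word : tokenize(text)) {
            Integer column = dictionary.get(word);
            if (column != null) {
                counts[column]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        FixedDictionaryDemo demo =
                new FixedDictionaryDemo(Arrays.asList("this is spam", "this is ham"));
        System.out.println(Arrays.toString(demo.vectorize("this is a test")));
    }
}
```

The vector for a new text always has as many columns as the training dictionary, no matter which words the text contains. That is the property the saved header preserves for the raw attributes that are fed into the FilteredClassifier.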