Classifying blog author gender with Weka

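I want to use Weka from Java to classify blog posts as written by a male or a female author. I created a class named Weka that defines the attributes used in the training set and then calls a method to load all the known data from an Excel sheet. The data in the file is organised so that each row has the blog text in cell 0 and an M or F in cell 1: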

blog text    M
more text    F

I have also been loosely following this tutorial: Weka Java Tutorial

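When I run the program I start seeing text in the Eclipse console window, but then a red error suddenly appears saying "Value not defined for given nominal attribute!". I'm not sure why this happens. The text changes from row to row, so I don't think it is possible to define every nominal value up front. Can anyone see what I'm doing wrong or silly here? I would appreciate any help. I have been stuck on this for hours.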

CODE that raises "Value not defined for given nominal attribute!":

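// Imports assume Weka 3.6 (FastVector, concrete Instance class) and Apache POI.
import java.io.File;
import java.io.IOException;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Attribute;
import weka.core.FastVector;
import weka.core.Instance;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class Weka 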
{ 
    static FastVector fvWekaAttributes; 
    static Instances isTrainingSet; 
    static Classifier cModel; 

    public static void main(String[] args) throws Exception 
    { 



     // Declaring attributes 
     Attribute stringAttribute = new Attribute("text", (FastVector) null); 

     // Declaring a class attribute along with values 
     FastVector fastVClassVal = new FastVector(2); 
     fastVClassVal.addElement("M"); 
     fastVClassVal.addElement("F"); 

     Attribute classAttribute = new Attribute("theClass", fastVClassVal); 

     // Declaring the feature vector 
     fvWekaAttributes = new FastVector(2); 
     fvWekaAttributes.addElement(stringAttribute); 
     fvWekaAttributes.addElement(classAttribute); 

     // create the training set 
     isTrainingSet = new Instances("Rel", fvWekaAttributes, 10); 

     // set class index 
     isTrainingSet.setClassIndex(1); 

     // create however many instances is in my excel file 
     // and add it to the training set in a loop. 
     Weka.LoadExcelWorkBook(isTrainingSet); 
     Weka.TestSetWork(); 

    } 

    public static void TestSetWork() throws Exception 
    { 
     // test the model 
     Evaluation testing = new Evaluation(isTrainingSet); 
     testing.evaluateModel(cModel, isTrainingSet); 

     // printing the results.... 
     String strSummary = testing.toSummaryString(); 
     System.out.println(strSummary); 

     // get confusion matrix. 

     double[][] cmMatrix = testing.confusionMatrix(); 
     for (int i = 0; i < cmMatrix.length; i++) 
     { 
      for (int col = 0; col < cmMatrix.length; col++) 
      { 
       System.out.print(cmMatrix[i][col]); 
       System.out.print("|"); 
      } 
      System.out.println(); 
     } 

    } 

    public static void LoadExcelWorkBook(Instances trainingSet) 
      throws Exception 
    { 
     System.out.println("LOADING EXCEL WORKBOOK!!!"); 
     Workbook wb = null; 
     // opening excel file. 

     try 
     { 
      wb = WorkbookFactory 
        .create(new File("C://blog-gender-dataset.xlsx")); 

     } catch (IOException ieo) 
     { 
      ieo.printStackTrace(); 
     } 

     // opening worksheet. 
     Sheet sheet = wb.getSheetAt(0); 

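     // read every row first; filtering and training happen once, afterwards 
     for (Row row : sheet) 
     { 

      Cell textCell = row.getCell(0); 
      Cell MFCell = row.getCell(1); 

      String blogText = textCell.getStringCellValue(); 
      String MFIndicator = MFCell.getStringCellValue(); 
      System.out.println("TEXT FROM EXCEL " + blogText); 
      Instance iText = new Instance(2); 

      iText.setValue((Attribute) fvWekaAttributes.elementAt(0), blogText); 
      // the nominal class attribute only accepts its declared values, 
      // so MFIndicator must be exactly "M" or "F" here 
      iText.setValue((Attribute) fvWekaAttributes.elementAt(1), 
        MFIndicator); 

      isTrainingSet.add(iText); 

     } 

     // convert the text attribute into word attributes once the data is loaded 
     StringToWordVector filter = new StringToWordVector(); 
     filter.setInputFormat(isTrainingSet); 
     Instances dataFiltered = Filter.useFilter(isTrainingSet, filter); 

     // build the classifier once, on the filtered data 
     cModel = (Classifier) new J48(); 
     cModel.buildClassifier(dataFiltered); 

     // evaluate on the same filtered representation the model was built on 
     isTrainingSet = dataFiltered; 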
    } 

}

Source

2013-08-30 Tastybrownies

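Why use Excel to store the data rather than CSV or ARFF? Why write your own data-reading code instead of using Weka's file reader, ArffLoader? I don't think using the whole blog text as a single attribute is a good idea; first try splitting it into words and use the words as input attributes. You may also need to do attribute selection before feeding the data to the classifier. – criszhao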
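To make the ARFF suggestion concrete, here is a minimal sketch of writing text/label rows as ARFF so that a loader such as ArffLoader could read them. The class name ArffSketch, the relation name blogs, and the attribute names are assumptions for illustration, and the escaping covers only backslashes, single quotes, and line breaks, so treat it as a starting point rather than a complete ARFF writer.

```java
import java.util.List;

// Hypothetical helper, not part of Weka: builds ARFF text from {text, label} rows.
final class ArffSketch {
    // Wrap a string value in single quotes, backslash-escaping the characters
    // that would otherwise end the quoted value or break the line.
    static String quote(String s) {
        String escaped = s.replace("\\", "\\\\")
                          .replace("'", "\\'")
                          .replace("\n", "\\n")
                          .replace("\r", "\\r");
        return "'" + escaped + "'";
    }

    // rows: each element is {blogText, label}, where label is expected to be M or F.
    static String toArff(List<String[]> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation blogs\n\n");
        sb.append("@attribute text string\n");
        sb.append("@attribute theClass {M,F}\n\n");
        sb.append("@data\n");
        for (String[] row : rows) {
            sb.append(quote(row[0])).append(',').append(row[1]).append('\n');
        }
        return sb.toString();
    }
}
```

Because the text is quoted, commas inside a blog post do not split the row.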

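CSV doesn't feel like a good choice here, since the blog text itself contains commas, unless the text is properly "quoted" as it should be. I have tried saving the Excel file as CSV and opening it in Weka, and ran into problems. What do you mean by going through attribute selection before feeding it to the classifier? Thanks. – Tastybrownies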
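For reference, the usual way to keep commas inside a CSV field is the RFC 4180 convention: wrap the field in double quotes and double any embedded double quote. A minimal sketch follows; the class and method names are hypothetical, and whether a particular loader honours the quoting should be checked against that loader.

```java
// Hypothetical helper illustrating RFC 4180 style field quoting.
final class CsvQuoting {
    // Wrap the field in double quotes and double any embedded double quote.
    static String quoteField(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Build one CSV record from a blog text and its M/F label.
    static String toCsvLine(String blogText, String label) {
        return quoteField(blogText) + "," + quoteField(label);
    }

    public static void main(String[] args) {
        // The comma and the quotes inside the text stay within a single field.
        System.out.println(toCsvLine("Hello, world. She said \"hi\".", "F"));
    }
}
```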

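Sorry, attribute selection is really feature selection. You should split the blog text into words and use the words as input attributes. Commas won't be a problem. Suppose your blog text is "this is an example,"; after word splitting you might get "this,is,an,example\n1,1,1,2\n" as the CSV content. – criszhao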
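The word-splitting idea can be sketched in plain Java as follows. This is only an illustration of the technique, not the commenter's code: the class BagOfWords and its methods are hypothetical, tokens are lower-cased, and anything that is not a letter or digit is treated as a separator (so "don't" becomes two tokens).

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: turns one text into a word-count row.
final class BagOfWords {
    // Count lower-cased words, keeping the order in which they first appear.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Emit a header line of words followed by a line of their counts.
    static String toCsv(Map<String, Integer> counts) {
        StringJoiner header = new StringJoiner(",");
        StringJoiner row = new StringJoiner(",");
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            header.add(entry.getKey());
            row.add(String.valueOf(entry.getValue()));
        }
        return header + "\n" + row + "\n";
    }
}
```

Since punctuation is discarded during tokenisation, the commas in the original text never reach the CSV. In Weka itself, the StringToWordVector filter already used in the question performs this kind of conversion.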

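The data arriving in the instance you build happens to contain values other than those you defined for the given nominal attribute in the ARFF @attribute section (or, as in the question, in the FastVector passed to the Attribute constructor). For example, you defined the expected values as "M" or "F", but the value you read may be empty (N/A), and so on. The solution is to validate the data strictly: debug or trace where the error occurs while loading that attribute, and add the offending value to the attribute's possible values. Alternatively, if such values appear systematically in your case, define the attribute with a more general type (string, numeric, ...).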
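A minimal sketch of the validation step described above, assuming the only legal labels are "M" and "F". The class LabelCheck and the method normalize are hypothetical names; the idea is to call something like this on the cell value before passing it to setValue, and to log or skip rows for which it returns an empty result.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper: checks a raw cell value against the declared class values.
final class LabelCheck {
    // Must mirror the values declared for the nominal class attribute.
    static final Set<String> ALLOWED = new HashSet<>(Arrays.asList("M", "F"));

    // Trim and upper-case the raw value; return it only if it is a declared label.
    static Optional<String> normalize(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        String cleaned = raw.trim().toUpperCase(Locale.ROOT);
        return ALLOWED.contains(cleaned) ? Optional.of(cleaned) : Optional.empty();
    }
}
```

Rows such as a header line, an empty cell, or "N/A" then show up as empty results that can be reported with their row number instead of aborting the load.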

Source

2015-11-24 16:37:12
