
Answers

1

The missing piece you are looking for is called a "word vector". Basically you have to create a new example set in which the attributes represent the words. For a given example (i.e. a document), the (numerical) value of such an attribute shows the "importance" of that word for the document.

A naive approach would be to use the count of the word within the document, but typically you should use TF-IDF (term frequency–inverse document frequency), which also takes the whole document corpus into account.
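
To make the idea of a word vector concrete, here is a rough sketch of the same step outside RapidMiner, assuming Python with scikit-learn (purely illustrative; the RapidMiner operators below do the equivalent work):

# Build word vectors for two tiny documents: naive counts vs. TF-IDF.
# Assumes scikit-learn is installed; this only illustrates the concept.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "I want to classify text data using an SVM",
    "An SVM works with numeric data only",
]

# Naive approach: one attribute per word, value = raw count in the document.
counts = CountVectorizer().fit_transform(docs)
print(counts.toarray())

# TF-IDF: the same attributes, but each word is weighted by how rare it is
# across the whole document corpus.
tfidf = TfidfVectorizer()
vectors = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())  # the attributes (one per word)
print(vectors.toarray())              # one numeric "word vector" per document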

To do this in RapidMiner you have to install the Text Mining extension and use operators like "Process Documents from Data" or "Process Documents from Files". Keep in mind that text mining needs additional preprocessing steps, such as creating tokens, removing stop words (common words that appear in nearly all documents and are therefore not very helpful), and stemming the words (so that "word" and "words" are treated equally).

Here is a small example:

<?xml version="1.0" encoding="UTF-8" standalone="no"?> 
<process version="5.3.009"> 
    <context> 
    <input/> 
    <output/> 
    <macros/> 
    </context> 
    <operator activated="true" class="process" compatibility="5.3.009" expanded="true" name="Process"> 
    <process expanded="true"> 
     <operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document" width="90" x="45" y="75"> 
     <parameter key="text" value="I want to classify text data using classifier model SVM with Rapidminer tool. Classification would be of multilable type. Since my data is of text type, how SVM can be used for this classification. I know that SVM works with numeric data only."/> 
     </operator> 
     <operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document (2)" width="90" x="45" y="165"> 
     <parameter key="text" value="The missing piece you are looking for is called &quot;word vector&quot;. Basically you have to create a new example set for which the attributes will represent the words. For a given example (i.e. a document) the (numerical) value for this attribute will show the &quot;importance&quot; of this word for this document. &#10;&#10;A naive approach would be to use the count of the word within the document, but typically you should use TD-IDF (term frequency–inverse document frequency) which will take the whole document corpus into account as well.&#10;&#10;To do this in RapidMiner you have to install the text mining extension and use operators like &quot;Process Documents from Data&quot; or &quot;Process Documents from Files&quot;. Keep in mind that for text mining you will need to conduct more preprocessing steps like creating tokens, removing stop words (common words which you can find in nearly all documents and which are therefore not very helpful) and use the stem of the words (so &quot;word&quot; and &quot;words&quot; will be treated equally).&#10;&#10;Here is a small example:"/> 
     </operator> 
     <operator activated="true" class="text:process_documents" compatibility="5.3.000" expanded="true" height="112" name="Process Documents" width="90" x="179" y="75"> 
     <process expanded="true"> 
      <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/> 
      <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.000" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="179" y="30"/> 
      <operator activated="true" class="text:stem_porter" compatibility="5.3.000" expanded="true" height="60" name="Stem (Porter)" width="90" x="313" y="30"/> 
      <connect from_port="document" to_op="Tokenize" to_port="document"/> 
      <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/> 
      <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Porter)" to_port="document"/> 
      <connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/> 
      <portSpacing port="source_document" spacing="0"/> 
      <portSpacing port="sink_document 1" spacing="0"/> 
      <portSpacing port="sink_document 2" spacing="0"/> 
     </process> 
     </operator> 
     <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/> 
     <connect from_op="Create Document (2)" from_port="output" to_op="Process Documents" to_port="documents 2"/> 
     <connect from_op="Process Documents" from_port="example set" to_port="result 1"/> 
     <portSpacing port="source_input 1" spacing="0"/> 
     <portSpacing port="sink_result 1" spacing="0"/> 
     <portSpacing port="sink_result 2" spacing="0"/> 
    </process> 
    </operator> 
</process> 
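
For readers without RapidMiner at hand, the same preprocessing chain used above (Tokenize, Filter Stopwords (English), Stem (Porter)) can be sketched in Python with NLTK; this is only an illustration of the steps, not part of the original process:

# Tokenize -> remove English stop words -> Porter stemming.
# Assumes NLTK is installed; the required corpora are downloaded below.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

text = "The words of a document and each single word should be treated equally"

tokens = nltk.word_tokenize(text.lower())                          # create tokens
stop_set = set(stopwords.words("english"))
tokens = [t for t in tokens if t.isalpha() and t not in stop_set]  # drop stop words
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])  # "words" and "word" both end up as the stem "word"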

BTW: there are also several quite good text mining tutorials for RapidMiner on YouTube.


Thanks for your reply. Could you point me to some good text mining tutorials where I can find examples of how to use an SVM with word vectors for multi-label/multi-class classification? – kailash


Since an SVM produces a binomial output (two values), how can it provide multiple class values? – kailash

1

This question may be quite old, but maybe there are more people like me out there who are just experimenting with RapidMiner and hope to solve exactly the same problem.

I guess the first part, about processing text in RapidMiner in general with the Text Mining Extension plugin, was already explained correctly by maerch a while ago. But judging from kailash's comments, the main problem seems to be the incompatibility between the binominal SVM model and a polynominal input/label set.

Feeding the SVM is actually handled by adding the meta operator "Polynominal by Binominal Classification" as a wrapper around the SVM. It repeatedly merges the input classes (in a way selectable via the "classification strategies" parameter) so that there are always only two groups to feed to the SVM, until a combined result can be derived. The final model can then handle multiple classes.
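
Conceptually, the "1 against all" strategy trains one binary SVM per class and lets the most confident one win. Here is a rough Python sketch of that reduction, assuming scikit-learn (its OneVsRestClassifier plays the same role here as the RapidMiner meta operator; the actual RapidMiner setup follows below):

# One-vs-all reduction: wrap a binary (linear) SVM so it can handle a
# polynominal label with more than two values. Illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

docs   = ["cheap pills online", "meeting moved to noon",
          "match result tonight", "win a free prize now"]
labels = ["spam", "work", "sports", "spam"]  # three classes, so a plain binary SVM is not enough

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# One binary SVM is trained per class ("spam vs. rest", "work vs. rest", ...);
# at prediction time the class with the highest decision score is returned.
model = OneVsRestClassifier(LinearSVC())
model.fit(X, labels)
print(model.predict(vec.transform(["free prize pills"])))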

The following process fragment shows an SVM (default parameters) with its Poly2Bi wrapper:

<process expanded="true"> 
    <operator activated="true" class="polynomial_by_binomial_classification" compatibility="5.3.015" expanded="true" height="76" name="Polynominal by Binominal Classification" width="90" x="112" y="120"> 
     <parameter key="classification_strategies" value="1 against all"/> 
     <parameter key="random_code_multiplicator" value="2.0"/> 
     <parameter key="use_local_random_seed" value="false"/> 
     <parameter key="local_random_seed" value="1992"/> 
     <process expanded="true"> 
      <operator activated="true" class="support_vector_machine_linear" compatibility="5.3.015" expanded="true" height="76" name="SVM (Linear)" width="90" x="179" y="210"> 
       <parameter key="kernel_cache" value="200"/> 
       <parameter key="C" value="0.0"/> 
       <parameter key="convergence_epsilon" value="0.001"/> 
       <parameter key="max_iterations" value="100000"/> 
       <parameter key="scale" value="true"/> 
       <parameter key="L_pos" value="1.0"/> 
       <parameter key="L_neg" value="1.0"/> 
       <parameter key="epsilon" value="0.0"/> 
       <parameter key="epsilon_plus" value="0.0"/> 
       <parameter key="epsilon_minus" value="0.0"/> 
       <parameter key="balance_cost" value="false"/> 
       <parameter key="quadratic_loss_pos" value="false"/> 
       <parameter key="quadratic_loss_neg" value="false"/> 
      </operator> 
      <connect from_port="training set" to_op="SVM (Linear)" to_port="training set"/> 
      <connect from_op="SVM (Linear)" from_port="model" to_port="model"/> 
      <portSpacing port="source_training set" spacing="0"/> 
      <portSpacing port="sink_model" spacing="0"/> 
     </process> 
    </operator> 
    <connect from_port="training" to_op="Polynominal by Binominal Classification" to_port="training set"/> 
    <connect from_op="Polynominal by Binominal Classification" from_port="model" to_port="model"/> 
    <portSpacing port="source_training" spacing="0"/> 
    <portSpacing port="sink_model" spacing="0"/> 
    <portSpacing port="sink_through 1" spacing="0"/> 
</process> 

Note that RapidMiner (at least version 5.3.015) complains when the Poly2Bi operator is used this way inside the training part of a Validation operator with a Performance operator in the testing part. The Performance operator reports the error message:

The label and the prediction must be of the same type, but they are polynominal and nominal respectively.

But on the RapidMiner forum they point out that this seems to be a spurious warning that you can ignore. In my case, the process ran fine as well.