Creating a document-term matrix from a dictionary

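I am trying to preprocess a text file in which each line lists the bigrams of one document together with their frequencies in that document. Here is an example of one line: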

i_like 1 you_know 2 .... not_good 1

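I have managed to build a dictionary from the whole corpus. Now I want to read the corpus line by line and, using that dictionary, create the document-term matrix, so that each element (i, j) of the matrix is the frequency of term "j" in document "i".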

2012-06-05 Angel

I'm not sure I understand: where are the document names? Or is there one text file per document? – MiMo

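Each line of the text file represents one document (so the whole text file is the corpus), and each document has the format I showed in the example above. I hope it is clear now – Angel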

Create a function that uses a dictionary to generate an integer index for each word:

Dictionary<string, int> m_WordIndexes = new Dictionary<string, int>();

// Returns the column index of a word, assigning the next free index the first time it is seen.
int GetWordIndex(string word)
{
    int result;
    if (!m_WordIndexes.TryGetValue(word, out result)) {
        result = m_WordIndexes.Count;
        m_WordIndexes.Add(word, result);
    }
    return result;
}

The resulting matrix is:

List<List<int>> m_Matrix = new List<List<int>>();

Processing each line of the text file produces one row of the matrix:

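List<int> ProcessLine(string line)
{
    List<int> result = new List<int>();
    // Assumes whitespace-separated "word count" pairs, as in the example line above.
    string[] tokens = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
    for (int i = 0; i + 1 < tokens.Length; i += 2) {
        string word = tokens[i];
        int numberOfOccurrences = int.Parse(tokens[i + 1]);
        int index = GetWordIndex(word);
        // Pad with zeros until position `index` exists in this row.
        while (index >= result.Count) {
            result.Add(0);
        }
        // Assign in place; Insert() would shift counts that are already stored.
        result[index] = numberOfOccurrences;
    }
    return result;
}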

You read the text file one line at a time, call ProcessLine(), and add the resulting list to m_Matrix.
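The read loop described above could be put together roughly as follows. This is only a sketch under some assumptions: the class name `DocumentTermMatrixBuilder`, the `Build` method and its `TextReader` parameter are made up for illustration, the input is assumed to be whitespace-separated "word count" pairs, and the final padding step (so that every row has one column per dictionary word) is an addition that the answer does not mention.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical wrapper class; its name and members are illustrative assumptions.
public class DocumentTermMatrixBuilder
{
    private readonly Dictionary<string, int> m_WordIndexes = new Dictionary<string, int>();
    private readonly List<List<int>> m_Matrix = new List<List<int>>();

    public int VocabularySize => m_WordIndexes.Count;

    // Returns the column index of a word, assigning a new one on first sight.
    private int GetWordIndex(string word)
    {
        int result;
        if (!m_WordIndexes.TryGetValue(word, out result))
        {
            result = m_WordIndexes.Count;
            m_WordIndexes.Add(word, result);
        }
        return result;
    }

    // Turns one "word count word count ..." line into a (possibly short) row.
    private List<int> ProcessLine(string line)
    {
        var result = new List<int>();
        string[] tokens = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i + 1 < tokens.Length; i += 2)
        {
            int index = GetWordIndex(tokens[i]);
            while (index >= result.Count)
            {
                result.Add(0);
            }
            result[index] = int.Parse(tokens[i + 1]);
        }
        return result;
    }

    // Reads one document per line, then pads every row to the final vocabulary size.
    public List<List<int>> Build(TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            m_Matrix.Add(ProcessLine(line));
        }
        foreach (List<int> row in m_Matrix)
        {
            while (row.Count < m_WordIndexes.Count)
            {
                row.Add(0);
            }
        }
        return m_Matrix;
    }
}
```

For a real file you could pass a `StreamReader` opened on the corpus. For the two lines `i_like 1 you_know 2` and `you_know 1 not_good 1`, the rows should come out as 1 2 0 and 0 1 1.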

2012-06-05 14:11:20 MiMo

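Thanks MiMo. The dictionary actually turned out to be too large, so I decided to create a sparse matrix for efficiency, but I used the idea behind your solution. Thanks – Angel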

@Angel: You're welcome – MiMo
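The thread does not show the sparse version Angel ended up with. One possible sketch, assuming each row is stored as a `Dictionary<int, int>` from term index to count so that zero entries take no space, might look like this. The class `SparseDocumentTermMatrix` and its members `AddDocument` and `Get` are hypothetical names, not something from the thread.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sparse variant; all names here are illustrative assumptions.
public class SparseDocumentTermMatrix
{
    private readonly Dictionary<string, int> wordIndexes = new Dictionary<string, int>();
    private readonly List<Dictionary<int, int>> rows = new List<Dictionary<int, int>>();

    public int DocumentCount => rows.Count;
    public int VocabularySize => wordIndexes.Count;

    // Parses one "word count word count ..." line and stores only its non-missing entries.
    public void AddDocument(string line)
    {
        var row = new Dictionary<int, int>();
        string[] tokens = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i + 1 < tokens.Length; i += 2)
        {
            int index;
            if (!wordIndexes.TryGetValue(tokens[i], out index))
            {
                index = wordIndexes.Count;
                wordIndexes.Add(tokens[i], index);
            }
            row[index] = int.Parse(tokens[i + 1]);
        }
        rows.Add(row);
    }

    // Frequency of term j in document i; entries that were never stored count as zero.
    public int Get(int i, int j)
    {
        int count;
        return rows[i].TryGetValue(j, out count) ? count : 0;
    }
}
```

Memory then grows with the number of distinct terms per document rather than with the size of the whole dictionary, at the cost of a hash lookup for each element access.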
