Recreating lyrics (words) from term frequency counts (numbers)

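I am trying to "recreate" song lyrics from term frequency counts. I have two source data files. The first simply lists the 5000 most frequently used terms in the lyrics corpus I am working with, ordered from most used (1) to least used (5000). The second file is the lyrics corpus itself, made up of more than 200,000 songs.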

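Each "song" is a comma-separated string like this: "SONGID1,SONGID2,1:13,2:10,4:6,7:15,...", where the first two entries are ID tags for the song, followed by pairs made of a term ID (the number to the left of the colon) and the number of times that term is used in the song (the number to the right of the colon). In the example above, this means that in the given song "I" (entry "1", the first of the 5000 most common terms) occurs 13 times, "the" (the second most common term) occurs 10 times, and so on.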

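What I want to do is go from this "termID:termCount" format to actually "recreating" the original (albeit scrambled) lyrics: replace the number to the left of each colon with the actual term, then list that term as many times as the count to the right of the colon says. Again using the short example above, my preferred output would be: "SONGID1, SONGID2, I I I I I I I I I I I I I I I I I I I the the the the the and the and and and and and ..." and so on. Thanks!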

Source

2013-12-09 user3084485

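Perhaps the following (untested) will inspire you. You didn't say how you want the output, so you may want to change the print() calls to file writes or something similar.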

// Assumes that each word is on its own line, sorted from most to least common.
String[] words = loadStrings("words.txt");

// Two approaches for the songs file:
// loadStrings() again, but that uses a lot of memory for big files;
// a BufferedReader, which is more complicated but works well for large files.
BufferedReader reader = createReader("songs.txt");
try {
  String line = reader.readLine();
  while (line != null) {
    String[] data = line.split(",");
    print(data[0] + ", " + data[1] + ", "); // the two song IDs
    for (int i = 2; i < data.length; i++) {
      String[] pair = data[i].split(":");
      int termId = int(trim(pair[0]));
      int count = int(trim(pair[1]));
      // Inelegant, but clear. Term IDs in the question start at 1 while
      // the array is indexed from 0, hence the "- 1".
      for (int j = 0; j < count; j++) {
        print(words[termId - 1] + " ");
      }
    }
    println();
    line = reader.readLine();
  }
  reader.close();
} catch (IOException e) {
  // readLine() and close() can throw, so Processing requires handling this.
  e.printStackTrace();
}
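
If you would rather write the result to a file than print it, the same idea can be sketched in plain Java (not Processing). This is a minimal sketch under some assumptions: the class name LyricsExpander, the helper expandSong, and the file names words.txt, songs.txt and lyrics.txt are all placeholders of mine, term IDs are taken to be 1-based as described in the question, and every record is assumed to be well formed (two IDs followed by id:count pairs).

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

class LyricsExpander {

    // Expands one record such as "ID1,ID2,1:2,3:1" into "ID1, ID2, I I you".
    // Assumption: term IDs are 1-based, so ID n maps to words.get(n - 1).
    static String expandSong(String line, List<String> words) {
        String[] data = line.split(",");
        StringBuilder out = new StringBuilder();
        out.append(data[0].trim()).append(", ").append(data[1].trim()).append(",");
        for (int i = 2; i < data.length; i++) {
            String[] pair = data[i].trim().split(":");
            int termId = Integer.parseInt(pair[0].trim());
            int count = Integer.parseInt(pair[1].trim());
            String word = words.get(termId - 1);
            // Repeat the word as many times as it occurs in the song.
            for (int j = 0; j < count; j++) {
                out.append(' ').append(word);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder file names; real paths can be passed as arguments.
        Path wordsFile = Paths.get(args.length > 0 ? args[0] : "words.txt");
        Path songsFile = Paths.get(args.length > 1 ? args[1] : "songs.txt");
        Path outFile = Paths.get(args.length > 2 ? args[2] : "lyrics.txt");

        // The word list is small (5000 lines), so reading it whole is fine.
        List<String> words = Files.readAllLines(wordsFile, StandardCharsets.UTF_8);

        // Stream the songs file line by line so the corpus never sits in memory.
        try (BufferedReader reader = Files.newBufferedReader(songsFile, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(outFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                writer.write(expandSong(line, words));
                writer.newLine();
            }
        }
    }
}
```

Keeping the expansion in its own method makes it easy to check on a single record before running it over the whole corpus; the words come out grouped by term, so the lyrics are still scrambled, as the question expects.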

Source

2013-12-10 00:33:46 kevinsa5
