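Extract key phrases from text (1-4 words)
What is the best way to extract key phrases from a block of text? I'm writing a tool to do keyword extraction: something like this. I found several libraries for Python and Perl that extract n-grams, but I'm writing this in Node, so I need a JavaScript solution. If there aren't any existing JavaScript libraries, could someone explain how to do this so I can write it myself?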
9
Answers
15
I liked the idea, so I implemented it: see below (descriptive comments included).
Preview at: http://fiddle.jshell.net/WsKMx/
/*@author Rob W, created on 16-17 September 2011, on request for Stackoverflow (http://stackoverflow.com/q/7085454/938089)
* Modified on 17 juli 2012, fixed IE bug by replacing [,] with [null]
* This script counts words and multi-word phrases. For simplicity and efficiency,
* there's only one loop through the block of text.
* 100% accuracy would require much more computing power, which is usually unnecessary
**/
var text = "A quick brown fox jumps over the lazy old bartender who said 'Hi!' as a response to the visitor who presumably assaulted the maid's brother, because he didn't pay his debts in time. In time in time does really mean in time. Too late is too early? Nonsense! 'Too late is too early' does not make any sense.";
var atLeast = 2; // Show results with at least .. occurrences
var numWords = 5; // Show statistics for one to .. words
var ignoreCase = true; // Case-sensitivity
var REallowedChars = /[^a-zA-Z'\-]+/g;
// RE pattern to select valid characters. Invalid characters are replaced with a whitespace
var i, j, k, textlen, len, s;
// Prepare key hash
var keys = [null]; //"keys[0] = null", a word boundary with length zero is empty
var results = [];
numWords++; //for human logic, we start counting at 1 instead of 0
for (i=1; i<=numWords; i++) {
    keys.push({});
}
// Remove all irrelevant characters
text = text.replace(REallowedChars, " ").replace(/^\s+/,"").replace(/\s+$/,"");
// Create a hash
if (ignoreCase) text = text.toLowerCase();
text = text.split(/\s+/);
for (i=0, textlen=text.length; i<textlen; i++) {
    s = text[i];
    keys[1][s] = (keys[1][s] || 0) + 1;
    for (j=2; j<=numWords; j++) {
        if(i+j <= textlen) {
            s += " " + text[i+j-1];
            keys[j][s] = (keys[j][s] || 0) + 1;
        } else break;
    }
}
// Prepares results for advanced analysis
for (k=1; k<=numWords; k++) {
    results[k] = [];
    var key = keys[k];
    for (i in key) {
        if(key[i] >= atLeast) results[k].push({"word":i, "count":key[i]});
    }
}
// Result parsing
var outputHTML = []; // Buffer data. This data is used to create a table using `.innerHTML`
var f_sortDescending = function(x,y) {return y.count - x.count;}; // highest count first
for (k=1; k<numWords; k++) {
    results[k].sort(f_sortDescending);//sorts results
    // Customize your output. For example:
    var words = results[k];
    if (words.length) outputHTML.push('<td colSpan="3" class="num-words-header">'+k+' word'+(k==1?"":"s")+'</td>');
    for (i=0,len=words.length; i<len; i++) {
        //Characters have been validated. No fear for XSS
        outputHTML.push("<td>" + words[i].word + "</td><td>" +
            words[i].count + "</td><td>" +
            Math.round(words[i].count/textlen*10000)/100 + "%</td>");
        // textlen defined at the top
        // The relative occurrence has a precision of 2 digits.
    }
}
outputHTML = '<table id="wordAnalysis"><thead><tr>' +
'<td>Phrase</td><td>Count</td><td>Relativity</td></tr>' +
'</thead><tbody><tr>' +outputHTML.join("</tr><tr>")+
"</tr></tbody></table>";
document.getElementById("RobW-sample").innerHTML = outputHTML;
/*
CSS:
#wordAnalysis td{padding:1px 3px 1px 5px}
.num-words-header{font-weight:bold;border-top:1px solid #000}
HTML:
<div id="RobW-sample"></div>
*/
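Since the question asks for something usable in Node, the same single-pass idea can be sketched without the DOM output. This is only an illustration under some assumptions: the function name `countPhrases` and its option names (`maxWords`, `atLeast`, `ignoreCase`) are hypothetical and not part of the answer above or of any library, and a `Map` is swapped in for the plain objects so that words such as "constructor" cannot collide with inherited object properties.

```javascript
// Sketch: count phrases of 1..maxWords words in one pass over the text.
// `countPhrases` and its options are hypothetical names used for illustration.
function countPhrases(text, { maxWords = 4, atLeast = 2, ignoreCase = true } = {}) {
  // Same whitelist as the answer above: letters, apostrophes and hyphens.
  let cleaned = text.replace(/[^a-zA-Z'\-]+/g, " ").trim();
  if (ignoreCase) cleaned = cleaned.toLowerCase();
  const words = cleaned === "" ? [] : cleaned.split(/\s+/);

  // counts[n] maps each n-word phrase to its number of occurrences.
  const counts = [];
  for (let n = 1; n <= maxWords; n++) counts[n] = new Map();

  for (let i = 0; i < words.length; i++) {
    let phrase = "";
    // Grow the phrase one word at a time, starting at position i.
    for (let n = 1; n <= maxWords && i + n <= words.length; n++) {
      phrase = n === 1 ? words[i] : phrase + " " + words[i + n - 1];
      counts[n].set(phrase, (counts[n].get(phrase) || 0) + 1);
    }
  }

  // Keep phrases seen at least `atLeast` times, most frequent first.
  const results = {};
  for (let n = 1; n <= maxWords; n++) {
    results[n] = [...counts[n]]
      .filter(([, count]) => count >= atLeast)
      .map(([phrase, count]) => ({ phrase, count }))
      .sort((a, b) => b.count - a.count);
  }
  return results;
}
```

Calling `countPhrases(someText)` returns an object keyed by phrase length, so `results[2]` would hold the repeated two-word phrases. Like the original, it treats every run of adjacent words as a candidate phrase and does no stop-word filtering.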
0
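I don't know of such a library in JavaScript, but the logic is:
- split the text into an array
- then sort and count
Alternatively:
- split the text into an array
- create a secondary array
- traverse each item of the first array
- check whether the current item exists in the secondary array
- if it does not exist, push it as an item's key
- otherwise, increase the value whose key equals the item being looked up. HTH
Ivo Stoykov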
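The second list of steps above can be sketched roughly as follows. The helper name `countWords` is hypothetical, and a `Map` stands in for the "secondary array"; this is one possible reading of the steps, not code from the answer. It counts single words only.

```javascript
// Sketch of the split-and-tally steps. `countWords` is a hypothetical name.
function countWords(text) {
  // Split into an array of lowercase words, dropping empty pieces.
  const words = text.toLowerCase().split(/[^a-z'\-]+/).filter(Boolean);

  // The "secondary array": word -> number of occurrences.
  const tally = new Map();
  for (const word of words) {
    if (!tally.has(word)) {
      tally.set(word, 1);                   // first time seen: add it as a key
    } else {
      tally.set(word, tally.get(word) + 1); // seen before: increase its value
    }
  }

  // Return [word, count] pairs, most frequent first.
  return [...tally].sort((a, b) => b[1] - a[1]);
}
```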
+0
This doesn't do what I'm hoping for, because it doesn't extract multi-word n-grams... it works for single words only
+1
Look here -> http://valuetype.wordpress.com/2011/08/24/keyword-density-with-javascript/ This is a sample that counts single words, but it can easily be extended to 3 or 4 words – i100
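I updated the code to fix a bug in IE8. The bug was reported by mail, and I have pasted the mail and my reply (which provides the fix and includes a detailed explanation) here: http://pastebin.com/7Edx88Gp.
Beautiful, years later you are still helping people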