2017-02-13 49 views
-2

How do I remove all stop words from a text? I want to use this JavaScript code:

var aStopWords = new Array ("a", "the", "blah"...); 

(code to make it run, full code can be found here: https://jsfiddle.net/j2kbpdjr/) 

// sText is the body of text that the keywords are being extracted from. 
// It's being separated into an array of words. 

// remove stop words 
for (var m = 0; m < aStopWords.length; m++) { 
    sText = sText.replace(' ' + aStopWords[m] + ' ', ' '); 
} 

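to get keywords from a body of text. It works well; however, the problem I'm having is that it seems to iterate through and ignore only one instance of each word in the aStopWords array.

So, if I have the following body of text: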

how are you today? Are you well?

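and I set var aStopWords = new Array("are","well"), it seems to ignore the first instance of are but still shows the second are as a keyword, whereas it completely removes/ignores well from the keywords.

If anyone could help me ignore all instances of the words in aStopWords, I would greatly appreciate it.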

+0

Is your goal to remove every occurrence of a list of words from the text? –

+0

@T.J.Crowder, apologies. I have updated the question. – Jack

+0

@ssc-hrep3 Yes, that is correct – Jack

Answers

1

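You can do this quite easily.

First, the text is split into keywords. Then the code iterates over all keywords and checks whether each one is a stop word. If it is, it is ignored. Otherwise, the occurrence count for that keyword in the counter object (keywordsAndCounter in the code) is incremented.

The keywords then end up in a JavaScript object of the following form: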

{ "this": 1, "that": 2 } 

Objects are not sortable in JavaScript, but arrays are. So the object needs to be remapped to the following structure:

[ 
    { "keyword": "this", "counter": 1 }, 
    { "keyword": "that", "counter": 2 } 
] 

The array can then be sorted by its counter property. With the slice() function, only the top X values are extracted from the sorted list.
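As a small illustration of the remap, sort, and slice steps on the example object shown above, here is a compact sketch. It assumes an environment where Object.entries is available (ES2017); the names counts, sorted, and topOne are illustrative and not part of the original answer.

```javascript
// Counter object in the form shown above.
var counts = { "this": 1, "that": 2 };

// Object.entries yields [key, value] pairs, which are remapped
// to { keyword, counter } objects so that the array can be sorted.
var sorted = Object.entries(counts)
    .map(function (entry) {
        return { keyword: entry[0], counter: entry[1] };
    })
    .sort(function (a, b) {
        return b.counter - a.counter; // descending by counter
    });

// Keep only the first X entries (here X = 1).
var topOne = sorted.slice(0, 1);
console.log(topOne);
```

The longer loop-based version does the same thing with Object.keys, which also works in older browsers.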

var stopwords = ["about", "all", "alone", "also", "am", "and", "as", "at", "because", "before", "beside", "besides", "between", "but", "by", "etc", "for", "i", "of", "on", "other", "others", "so", "than", "that", "though", "to", "too", "trough", "until"];

var text = document.getElementById("main").innerHTML;

var keywords = text.split(/[\s\.;:"]+/);
var keywordsAndCounter = {};

for (var i = 0; i < keywords.length; i++) {
    var keyword = keywords[i];

    // keyword is not a stopword and not empty
    if (stopwords.indexOf(keyword.toLowerCase()) === -1 && keyword !== "") {
        if (!keywordsAndCounter[keyword]) {
            keywordsAndCounter[keyword] = 0;
        }
        keywordsAndCounter[keyword]++;
    }
}

// remap from { keyword: counter, keyword2: counter2, ... } to [{ "keyword": keyword, "counter": counter }, {...} ] to make it sortable
var result = [];
var nonStopKeywords = Object.keys(keywordsAndCounter);

for (var i = 0; i < nonStopKeywords.length; i++) {
    var keyword = nonStopKeywords[i];
    result.push({ "keyword": keyword, "counter": keywordsAndCounter[keyword] });
}

// sort the values according to the number of the counter
result.sort(function(a, b) {
    return b.counter - a.counter;
});

var topFive = result.slice(0, 5);
console.log(topFive);
<div id="main">This is a test to show that it is all about being between others. I am there until 8 pm event though it will be late. Because it is "cold" outside even though it is besides me.</div>
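For reference, the loop in the question removes only the first occurrence because String.prototype.replace, when given a string pattern, replaces just the first match. It is also case-sensitive, so "Are" is not matched by "are". If the goal is only to strip stop words from the text without counting keywords, a regular expression with the g and i flags is one alternative. The following is a minimal sketch under those assumptions; removeStopWords is a hypothetical helper name, not something from the question or the answer.

```javascript
// Hypothetical helper: removes every occurrence of each stop word,
// ignoring case, and tidies up the leftover whitespace.
function removeStopWords(text, stopWords) {
    // Escape regex metacharacters so each stop word is matched literally.
    var escaped = stopWords.map(function (word) {
        return word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    });
    // \b = word boundary, g = replace all matches, i = ignore case
    var pattern = new RegExp("\\b(?:" + escaped.join("|") + ")\\b", "gi");
    // Collapse the whitespace left behind by the removed words.
    return text.replace(pattern, "").replace(/\s+/g, " ").trim();
}

var cleaned = removeStopWords("how are you today? Are you well?", ["are", "well"]);
console.log(cleaned);
```

Note that punctuation attached to a removed word stays in the text, so a further clean-up step may be needed depending on how the result is used.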

+0

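Thank you! This works perfectly for removing all instances of the stop words, which was one of the problems I was having (sorry for being a pain). The remaining problem is that it lists all non-stop words rather than only the top X most frequent ones. – Jack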

+0

@Jack, I have updated the answer: the problem is that an object cannot be sorted, so you need to convert it from an object into an array (of objects). –

+0

Thank you very much! – Jack
