MATLAB - How to get the number of occurrences of each word in a string?


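Suppose we want to use MATLAB to check how many times each word occurs in a particular text file. How should we go about it? I am checking whether a word is a SPAM word or a HAM word (I am doing content filtering), so I am looking for the probability that a word is spam, i.e. n(occurrences in spam) / n(total occurrences).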

Any hints?


2014-08-28 Priyam Soneji

Can we assume the text file has already been imported into a string? Or have the words already been separated into a cell array of strings? – 2014-08-28 19:45:23

Not a cell array of strings; assume it has already been imported from the text file. – 2014-08-28 19:47:32

Then you can import it either as a cell array or as a char array. – Divakar 2014-08-28 19:52:18

You can use a regular expression to find the occurrences of a word.

For example:

txt = fileread(fileName); 
tokens = regexp(txt, string, 'tokens');

Here string is the word you are looking for.
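The snippet above finds the matches but does not count them. A minimal sketch of the counting step follows, assuming whole-word, case-insensitive matching is what is wanted. The sample text and the variable name word are illustrative and not part of the original answer; word is used instead of string because string is also the name of a built-in data type in R2016b and later.

```matlab
% Hypothetical sample text; in practice this would come from fileread(fileName).
txt = 'Buy now! Buy cheap. Do not buy later.';
word = 'buy';   % the word to count (illustrative name)

% Escape any regex metacharacters in the word, then anchor the pattern
% to the start (\<) and end (\>) of a word.
pattern = ['\<' regexptranslate('escape', word) '\>'];

% 'match' returns a cell array with one entry per occurrence;
% regexpi ignores case.
matches = regexpi(txt, pattern, 'match');
n = numel(matches);   % number of occurrences of the word
```

Without the word-boundary anchors, a short word would also be counted inside longer words, which may or may not be desirable for a spam filter.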


2014-08-28 19:47:01 lakesh

Can the string be a whole cell array of strings at once? That is what I am looking for. – 2014-08-28 19:49:59

@PriyamSoneji - Yes, it can. 'regexp' works with either a single string or a cell array of strings. – rayryeng 2014-08-29 06:05:55
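As a rough illustration of the cell-array form mentioned in this comment, the sketch below searches one text for several words in a single call. The text, the word list and all variable names are made up for the example.

```matlab
% Hypothetical text and word list.
txt = 'Free offer: win a free prize. This offer ends soon.';
wordList = {'free', 'offer', 'prize', 'viagra'};

% Build one whole-word pattern per entry of the list.
patterns = cellfun(@(w) ['\<' regexptranslate('escape', w) '\>'], ...
                   wordList, 'UniformOutput', false);

% With a char vector as input and a cell array of patterns, regexpi
% returns a cell array holding one cell of matches per pattern.
matches = regexpi(txt, patterns, 'match');

% Number of occurrences of each word, in the order of wordList.
counts = cellfun(@numel, matches);
```

Words that never occur simply produce an empty cell of matches, so their count comes out as zero.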

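By the way, this is an answer. You have the right mechanism for searching for a particular pattern in a string. What you don't have is the logic to count how many times the word occurs. Still, it is heading in the right direction. – rayryeng 2014-08-29 06:07:14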

As an example, consider a text file named text.txt that contains the following text:

These two sentences, like all sentences, contain words. Some of those words are repeated; but not all.

One possible approach is the following:

s = importdata('text.txt'); % import text; gives a 1x1 cell containing a string
% Split into words. Appending '.' makes sure there is always at least a final
% punctuation sign. You may want to extend the list of separators (between
% the brackets). The hyphen is escaped so it is not read as a character range.
% "lower" makes the count case-insensitive.
words = regexp([lower(s{1}) '.'], '[\s\.,;:\-''"?!/()]+', 'split');
words = words(1:end-1); % remove last "word", which will always be empty
% This is the important part: get unique words and an integer label for each one
[uniqueWords, ~, intLabels] = unique(words);
count = histc(intLabels, 1:numel(uniqueWords)); % occurrences of each label

The result is in uniqueWords and count:

uniqueWords = 
    'all' 'are' 'but' 'contain' 'like' 'not' 'of' 'repeated' 
    'sentences' 'some' 'these' 'those' 'two' 'words'  

count = 
     2 1 1 1 1 1 1 1 2 1 1 1 1 2
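Counts like these can be turned into the ratio described in the question, n(occurrences in spam) / n(total occurrences). The following is only a sketch under simplifying assumptions: the two tiny corpora are invented, the words are split on whitespace only, and accumarray is used for counting (it plays the same role as histc above).

```matlab
% Hypothetical, tiny example corpora; real data would be read from files.
spamText = 'win money now win prize';
hamText  = 'meeting now about money';

spamWords = strsplit(spamText);   % split on whitespace
hamWords  = strsplit(hamText);

% Common vocabulary, and the position of every word within it.
vocab = unique([spamWords, hamWords]);
[~, spamIdx] = ismember(spamWords, vocab);
[~, hamIdx]  = ismember(hamWords, vocab);

% Occurrences of each vocabulary word in each class.
nSpam = accumarray(spamIdx(:), 1, [numel(vocab) 1]);
nHam  = accumarray(hamIdx(:),  1, [numel(vocab) 1]);

% n(occurrences in spam) / n(total occurrences) for every word.
pSpam = nSpam ./ (nSpam + nHam);

% Print each word next to its estimate.
for k = 1:numel(vocab)
    fprintf('%-8s %.2f\n', vocab{k}, pSpam(k));
end
```

Every vocabulary word occurs at least once in one of the two corpora, so the denominator is never zero here. With very small counts the ratio is a crude estimate, and some form of smoothing is commonly added in practice.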


2014-08-28 20:59:29

+1 - Very nice concrete example – rayryeng 2014-08-29 06:06:19
