2013-09-25
<s> an evolutionary immune network for data clustering </s> 
<s> an evolutionary immune network for data clustering </s> 
<s> inet an extensible framework for simulating immune network </s> 
<s> immunity based systems a survey </s> 
<s> a recommender system based on the immune network </s> 

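I am working in MATLAB. These sentences come from a text file. I want to read them line by line, extract each word, and count how often each word occurs. How can I use the regexp function to extract the words?

Write a regular expression to read sentences from a text file in MATLAB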

Have you looked at the documentation for regexp (http://www.mathworks.com/help/matlab/ref/regexp.html)? – Doresoom

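@Doresoom Yes, I have read it. To read the text file I wrote the following code: F = fread(fid, '*char')'; unigram = sort(unique(regexp(F, ' ', 'split'))); With this, "</s><s>" shows up as a single word, but these are two different tokens and I want to separate them. – user2753079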

Answers

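The reason </s><s> is treated as one word is that you read the whole file at once and split only on spaces, not on both newlines and spaces.

Instead, read the file line by line with fgets and split each line separately, incrementing the token counts as you go.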
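
The line-by-line approach can be sketched roughly as follows. This is only an illustration under some assumptions: the temporary sample file, the variable names, and the use of containers.Map for the running counts are not from the answer, and fgetl is used in place of fgets because it drops the trailing newline.

```matlab
% Write a small hypothetical sample file so the sketch is self-contained.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '<s> immunity based systems a survey </s>\n');
fprintf(fid, '<s> a recommender system based on the immune network </s>\n');
fclose(fid);

% Running word counts, keyed by token (assumed data structure).
counts = containers.Map('KeyType', 'char', 'ValueType', 'double');

fid = fopen(fname, 'r');
line = fgetl(fid);                 % returns -1 (not a char) at end of file
while ischar(line)
    % Split this line on runs of whitespace.
    tokens = regexp(strtrim(line), '\s+', 'split');
    for k = 1:numel(tokens)
        t = tokens{k};
        if isempty(t)
            continue;              % a blank line yields one empty token
        end
        if isKey(counts, t)
            counts(t) = counts(t) + 1;
        else
            counts(t) = 1;
        end
    end
    line = fgetl(fid);
end
fclose(fid);
delete(fname);                     % remove the temporary sample file
```

Because each line is split on its own, the closing </s> of one line and the opening <s> of the next can never end up fused into one token.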

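I assume the string '</s><s>' appears literally somewhere in your text file. If so, splitting on spaces is of course not enough; you have to return all occurrences of '<s>', '</s>', or runs of consecutive word characters (letters, digits, underscore):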

regexp(F, '<s>|\w*|</s>', 'match'); 
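
To see the difference on a small input, here is a rough sketch on a made-up string s (an assumption, imitating two lines of the file as fread would return them). It uses \w+ instead of \w* so that no alternative can match the empty string; that is a small variation on the pattern above, not the exact expression from the answer.

```matlab
% Hypothetical sample: two sentences joined by a newline, as one char row vector.
s = sprintf('<s> immunity based systems </s>\n<s> a survey </s>');

% Splitting on single spaces leaves the newline inside one token:
% '</s>', newline, '<s>' come back fused together.
bySpace = regexp(s, ' ', 'split');

% Matching either tag or a run of word characters keeps the tags separate.
byMatch = regexp(s, '<s>|</s>|\w+', 'match');

fprintf('split on space: %d tokens\n', numel(bySpace));
fprintf('match pattern:  %d tokens\n', numel(byMatch));
```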

Full code:

% Read file contents 
fid = fopen('test.txt','r'); 
F = fread(fid, '*char').'; 
fclose(fid); 

% Split all words 
C = regexp(F, '<s>|\w*|</s>', 'match'); 

% Find word frequencies 
words = unique(C); 
counts = cellfun(@(x)sum(strcmp(x,C)), words); 

% Group them together for display 
freq = [num2cell(counts.') words.'] 
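
For longer files, comparing every distinct word against the whole token list can get slow. One possible alternative, sketched here with a made-up token list standing in for C, lets unique return an index for every token and sums those indices with accumarray:

```matlab
% Hypothetical token list, standing in for the cell array C produced by regexp.
C = {'<s>', 'a', 'survey', '</s>', '<s>', 'a', 'system', '</s>'};

% unique returns the sorted distinct words and, for every token,
% the index of its word in that sorted list.
[words, ~, idx] = unique(C);

% accumarray adds 1 at each token's word index, giving per-word counts.
counts = accumarray(idx(:), 1);

% Pair counts with words for display (no semicolon, so the result is shown).
freq = [num2cell(counts) words(:)]
```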

Thanks @Rody Oldenhuis, this works for me. Could you give me a short explanation of the regular expression? – user2753079

This regular expression also splits out every word: regexp(F, ' |\n', 'split') – user2753079

I understand the regular expression now. How do I count the frequency of each word? – user2753079
