How do I get rid of punctuation and check for spelling errors?

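Eliminate the punctuation
Split the words on newlines and spaces, then store them in an array
Check whether the text file has errors using the function in checkSpelling.m
Sum up the total number of errors in the article
"No suggestion" is assumed to be no error, then return -1
If the sum of errors > 20, return 1
If the sum of errors <= 20, return -1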

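I want to check a paragraph for spelling errors, and I am having trouble getting rid of the punctuation. The problem may have some other cause as well. I get the following error: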

[screenshot of the error message]

My data2 file is:

[screenshot of the data file contents]

checkSpelling.m

function suggestion = checkSpelling(word) 

h = actxserver('word.application'); 
h.Document.Add; 
correct = h.CheckSpelling(word); 
if correct 
    suggestion = []; %return empty if spelled correctly 
else 
    %If incorrect and there are suggestions, return them in a cell array 
    if h.GetSpellingSuggestions(word).count > 0 
     count = h.GetSpellingSuggestions(word).count; 
     for i = 1:count 
      suggestion{i} = h.GetSpellingSuggestions(word).Item(i).get('name'); 
     end 
    else 
     %If incorrect but there are no suggestions, return this: 
     suggestion = 'no suggestion'; 
    end 

end 
%Quit Word to release the server 
h.Quit

f19.m

for i = 1:1 

data2=fopen(strcat('DATA\PRE-PROCESS_DATA\F19\',int2str(i),'.txt'),'r') 
CharData = fread(data2, '*char')'; %read text file and store data in CharData 
fclose(data2); 

word_punctuation=regexprep(CharData,'[`~!@#$%^&*()-_=+[{]}\|;:\''<,>.?/','') 

word_newLine = regexp(word_punctuation, '\n', 'split') 

word = regexp(word_newLine, ' ', 'split') 

[sizeData b] = size(word) 

suggestion = cellfun(@checkSpelling, word, 'UniformOutput', 0) 

A19(i)=sum(~cellfun(@isempty,suggestion)) 

feature19(A19(i)>=20)=1 
feature19(A19(i)<20)=-1 
end

Source

2014-05-06 user3340270

Replace your regexprep call with:

word_punctuation=regexprep(CharData,'\W','\n');

Here \W matches every non-alphanumeric character (including spaces), and each match is replaced with a newline.

Then:

word = regexp(word_punctuation, '\n', 'split');

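As you can see, you no longer need to split on spaces, because they were already turned into newlines. You can, however, remove the empty cells: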

word(cellfun(@isempty,word)) = [];
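
To illustrate, here is a minimal sketch of the three steps above applied to a sample string. The sample text is made up for the example; in the real script CharData comes from fread.

```matlab
% Sample text standing in for the file contents (an assumption for this demo)
CharData = 'Hello, world! This is   a test.';

% Replace every non-alphanumeric character, spaces included, with a newline
word_punctuation = regexprep(CharData, '\W', '\n');

% Split on newlines; runs of punctuation or spaces leave empty cells behind
word = regexp(word_punctuation, '\n', 'split');

% Drop the empty cells so that only real words remain
word(cellfun(@isempty, word)) = [];

disp(word)
```

The result is a 1-by-6 cell array of words, which can be passed straight to cellfun together with checkSpelling.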

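With these changes everything worked for me. I have to say, though, that your checkSpelling function is very slow. On every call it has to create an ActiveX server object, add a new document, and delete the object once the check is done. Consider rewriting the function to accept a cell array of strings.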
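
A rough sketch of what such a rewrite could look like is below. The name checkSpellingBatch and its logical output are assumptions made for illustration, not part of the original code. It uses only the COM calls that already appear in checkSpelling.m, starts Word once for the whole list, and needs Windows with Word installed for non-empty input.

```matlab
function isCorrect = checkSpellingBatch(words)
% checkSpellingBatch  Hypothetical batch variant of checkSpelling.
%   words     - cell array of character vectors
%   isCorrect - logical array, true where Word accepts the spelling

isCorrect = false(size(words));
if isempty(words)
    return  % nothing to check, so do not start Word at all
end

h = actxserver('word.application');  % start Word once for all words
h.Document.Add;                      % same call as in the original function
for k = 1:numel(words)
    isCorrect(k) = h.CheckSpelling(words{k});
end
h.Quit  % quit Word to release the server
end
```

With this version the error count would be sum(~checkSpellingBatch(word)). Note that it only reports whether each word is spelled correctly and does not collect suggestions.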

UPDATE

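The only problem I see is that this also removes the apostrophe character ' (as in I'm, don't, etc.). You can temporarily replace apostrophes with an underscore (yes, it counts as alphanumeric for \W) or with any sequence of characters that does not occur in the text. Alternatively, instead of \W you can list all the non-alphanumeric characters you want removed inside square brackets.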
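
A minimal sketch of the temporary-replacement idea follows, assuming the text contains no underscores of its own (any that exist would come back as apostrophes). The sample sentence is made up for the example.

```matlab
% Sample text with contractions (an assumption for this demo)
CharData = 'I''m sure it doesn''t work, does it?';

% Hide apostrophes behind an underscore, which \W does not match
protected = strrep(CharData, '''', '_');

% Split into words exactly as before
word_punctuation = regexprep(protected, '\W', '\n');
word = regexp(word_punctuation, '\n', 'split');
word(cellfun(@isempty, word)) = [];

% Restore the apostrophes in every cell
word = strrep(word, '_', '''');

disp(word)
```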

UPDATE 2

Another solution to the problem from the first update:

word_punctuation=regexprep(CharData,'[^A-Za-z0-9''_]','\n');

Source

2014-05-07 20:32:41 yuk
