2015-10-12 49 views
2

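Perl: remove words from file1 using file2

I am using a Perl script to remove all stop words from a text. The stop words are stored one per line. I am working on the Mac OS X command line, and Perl is installed correctly.

The script does not work properly: it has a word-boundary problem.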

#!/usr/bin/env perl -w 
# usage: script.pl words text >newfile 
use English; 

# poor man's argument handler 
open(WORDS, shift @ARGV) || die "failed to open words file: $!"; 
open(REPLACE, shift @ARGV) || die "failed to open replacement file: $!"; 

my @words; 
# get all words into an array 
while ($_=<WORDS>) { 
    chop; # strip eol 
    push @words, split; # break up words on line 
} 

# (optional) 
# sort by length (makes sure smaller words don't trump bigger ones); ie, "then" vs "the" 
@words=sort { length($b) <=> length($a) } @words; 

# slurp text file into one variable. 
undef $RS; 
$text = <REPLACE>; 

# now for each word, do a global search-and-replace; make sure only words are replaced; remove possible following space. 
foreach $word (@words) { 
    $text =~ s/\b\Q$word\E\s?//sg; 
} 

# output "fixed" text 
print $text; 

sample.txt

$ cat sample.txt 
how about i decide to look at it afterwards what 
across do you think is it a good idea to go out and about i 
think id rather go up and above 

stopWords.txt

I 
a 
about 
an 
are 
as 
at 
be 
by 
com 
for 
from 
how 
in 
is 
it 
.. 

Output:

$ ./remove.pl stopwords.txt sample.txt 
i decide look fterwards cross do you think good idea go out d i 
think id rather go up d bove 

As you can see, it turns afterwards into fterwards. I think it is a regex problem. Could someone please help me with a quick patch? Thanks for your help :J

+1

Shouldn't 's/\b\Q$word\E\s?//sg;' be 's/\b\Q$word\E\b//sg;' instead? – hjpotter92

+0

@hjpotter92 That's it, post it as an answer. :J – pbu

+0

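See also: turn on 'strict' and 'warnings'. Lexical filehandles with 3-arg open are a good idea, especially when you take file names from the argument list. You also don't need to assign '$_' explicitly in your while loop. If you do assign, use a different (meaningful) name; otherwise don't: 'while (<$file>)' does exactly the same. – Sobrique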
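A minimal sketch of what the comment above suggests, applied to reading the stop-word file. The helper name read_words is hypothetical and not part of the original script; chomp replaces chop here because chomp removes only a trailing newline, while chop removes the last character whatever it is.

```perl
use strict;
use warnings;

# Hypothetical helper: return all whitespace-separated words in a file.
sub read_words {
    my ($path) = @_;

    # 3-arg open with a lexical filehandle: the mode is explicit, so a
    # file name such as ">oops" cannot be misread as an open mode.
    open(my $fh, '<', $path) or die "failed to open $path: $!";

    my @words;
    while (my $line = <$fh>) {
        chomp $line;                    # strip the newline only
        push @words, split ' ', $line;  # break up words on the line
    }
    close $fh;
    return @words;
}
```

In the script it could be called as my @words = read_words(shift @ARGV);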

Answers

1

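Use word boundaries on both sides of $word. Currently, you are only checking for one at the start.

You won't need the \s? condition with \b in place: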

$text =~ s/\b\Q$word\E\b//sg; 
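A small side-by-side check of the two substitutions, using a made-up one-line input rather than the sample file. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $word = 'a';
my $text = 'a look afterwards';

# Boundary at the start only: the "a" of "afterwards" is removed too.
(my $one_sided = $text) =~ s/\b\Q$word\E\s?//g;

# Boundary on both sides: only the standalone "a" is removed.
(my $two_sided = $text) =~ s/\b\Q$word\E\b//g;

print "one-sided: [$one_sided]\n";
print "two-sided: [$two_sided]\n";
```

The two-sided version leaves afterwards intact, but the space that followed the removed word stays in the text, so some extra whitespace can remain.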
0

Your regex is not strict enough.

$text =~ s/\b\Q$word\E\s?//sg; 

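When $word is a, the command is effectively s/\ba\s?//sg. This means: remove every a that starts a new word, together with at most one whitespace character after it (\s? matches zero or one, not zero or more). In afterwards, this successfully matches the first a.

You can make the match stricter by ending the word with another \b, like this: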

$text =~ s/\b\Q$word\E\b\s?//sg;
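As a rough sketch, the same pattern can also be applied in a single pass by joining all stop words into one alternation instead of looping over them. The helper name remove_stopwords is hypothetical, and the single-pass approach is a variation, not something the answers above propose.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every listed stop word from $text in one pass.
sub remove_stopwords {
    my ($text, @words) = @_;
    return $text unless @words;

    # Escape each word (same effect as \Q...\E) and join into an alternation.
    my $alt = join '|', map { quotemeta($_) } @words;

    # \b on both sides keeps "a" from matching inside "afterwards";
    # \s? additionally drops one following whitespace character.
    $text =~ s/\b(?:$alt)\b\s?//g;
    return $text;
}

print remove_stopwords(
    "how about i decide to look at it afterwards what\n",
    qw(I a about an are as at be by com for from how in is it),
);
```

The match is case-sensitive, so the lowercase i in the sample survives even though I is in the list. Adding the /i modifier would remove it as well; whether that is wanted is not stated in the question.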