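I am using a Perl script to remove all stop words from a text. The stop words are stored one per line. I am working on the Mac OS X command line, and Perl is installed correctly. The script comes from "Perl: remove words from file1 using file2".
The script does not work properly and has a word-boundary problem.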
#!/usr/bin/env perl -w
# usage: script.pl words text >newfile
use English;
# poor man's argument handler
open(WORDS, shift @ARGV) || die "failed to open words file: $!";
open(REPLACE, shift @ARGV) || die "failed to open replacement file: $!";
my @words;
# get all words into an array
while ($_=<WORDS>) {
    chop; # strip eol
    push @words, split; # break up words on line
}
# (optional)
# sort by length (makes sure smaller words don't trump bigger ones); ie, "then" vs "the"
@words=sort { length($b) <=> length($a) } @words;
# slurp text file into one variable.
undef $RS;
$text = <REPLACE>;
# now for each word, do a global search-and-replace; make sure only words are replaced; remove possible following space.
foreach $word (@words) {
    $text =~ s/\b\Q$word\E\s?//sg;
}
# output "fixed" text
print $text;
sample.txt:
$ cat sample.txt
how about i decide to look at it afterwards what
across do you think is it a good idea to go out and about i
think id rather go up and above
stopWords.txt:
I
a
about
an
are
as
at
be
by
com
for
from
how
in
is
it
..
Output:
$ ./remove.pl stopwords.txt sample.txt
i decide look fterwards cross do you think good idea go out d i
think id rather go up d bove
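As you can see, the stop word "a" is stripped from the start of "afterwards", leaving "fterwards" (likewise "across" becomes "cross" and "and" becomes "d"). I think it is a regular expression problem. Could someone help me with a quick patch? Thanks for your help :J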
Shouldn't 's/\b\Q$word\E\s?//sg;' be written like this: 's/\b\Q$word\E\b//sg;'? – hjpotter92
@hjpotter92 That's it, post it as an answer. :J – pbu
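To make the difference between the two patterns in the comments above concrete, here is a minimal sketch. The helper name `strip_words` and the sample line are made up for illustration and are not part of the original script; only the two regular expressions come from the thread.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: remove each word from $text, using the pattern
# that $make_re builds for that word.
sub strip_words {
    my ($text, $make_re, @words) = @_;
    for my $word (@words) {
        my $re = $make_re->($word);
        $text =~ s/$re//g;
    }
    return $text;
}

my @words = qw(about a at);
my $line  = "look at it afterwards and about";

# Pattern from the question: word boundary on the left only, so the
# stop word "a" also matches the first letter of "afterwards".
my $loose = strip_words($line, sub { my $w = shift; qr/\b\Q$w\E\s?/ }, @words);

# Pattern from the comment: word boundary on both sides, so only
# whole words match.
my $tight = strip_words($line, sub { my $w = shift; qr/\b\Q$w\E\b/ }, @words);

print "left boundary only: $loose\n";
print "both boundaries:    $tight\n";
```

With the trailing `\b`, "afterwards" survives intact. Note that dropping `\s?` also means the space that followed each removed word is left in place; whether to collapse the extra whitespace afterwards is a separate choice.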
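See also: turn on 'strict' and 'warnings'. Lexical, 3-arg file opens are a good idea, especially when you are taking file names from the argument list. You also don't need to assign '$_' explicitly in your while loop. Either do the assignment and use a different (meaningful) name, or don't assign at all: 'while (<$file>)' does exactly the same thing. – Sobrique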
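Putting those suggestions together, one possible rework of the script looks like the sketch below. It is not the original author's code: the subroutine names `read_words` and `remove_words` are invented for illustration, and combining all stop words into a single alternation is a design choice swapped in here, not something proposed in the thread.

```perl
#!/usr/bin/env perl
# usage: script.pl words text >newfile
use strict;
use warnings;

# Hypothetical helper: read whitespace-separated stop words from a file.
sub read_words {
    my ($path) = @_;
    # Lexical filehandle and 3-arg open, as suggested in the comment.
    open(my $fh, '<', $path) or die "failed to open words file '$path': $!";
    my @words;
    while (my $line = <$fh>) {      # named variable instead of $_
        chomp $line;                # strip the line ending only
        push @words, split ' ', $line;
    }
    close $fh;
    return @words;
}

# Hypothetical helper: remove whole-word occurrences of @words from $text.
sub remove_words {
    my ($text, @words) = @_;
    return $text unless @words;
    # Longest words first, each one escaped, joined into one alternation.
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } @words;
    $text =~ s/\b(?:$alternation)\b//g;
    return $text;
}

# Run only when both file arguments are given.
if (@ARGV == 2) {
    my ($words_file, $text_file) = @ARGV;
    my @words = read_words($words_file);
    open(my $in, '<', $text_file) or die "failed to open text file '$text_file': $!";
    my $text = do { local $/; <$in> };   # slurp the whole file
    close $in;
    print remove_words($text, @words);
}
```

The match is case-sensitive, as in the original script, so a stop word listed as "I" would not remove a lowercase "i" in the text; adding the `/i` modifier would change that, if that behaviour is wanted.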