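Perl: remove words from file1 using file2

I am using a Perl script to remove all stop words from a text. The stop words are stored one per line. I am on the Mac OS X command line, and Perl is installed correctly.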

The script does not work properly: it has a word-boundary problem.

#!/usr/bin/env perl -w 
# usage: script.pl words text >newfile 
use English; 

# poor man's argument handler 
open(WORDS, shift @ARGV) || die "failed to open words file: $!"; 
open(REPLACE, shift @ARGV) || die "failed to open replacement file: $!"; 

my @words; 
# get all words into an array 
while ($_=<WORDS>) { 
    chop; # strip eol 
    push @words, split; # break up words on line 
} 

# (optional) 
# sort by length (makes sure smaller words don't trump bigger ones); ie, "then" vs "the" 
@words=sort { length($b) <=> length($a) } @words; 

# slurp text file into one variable. 
undef $RS; 
$text = <REPLACE>; 

# now for each word, do a global search-and-replace; make sure only words are replaced; remove possible following space. 
foreach $word (@words) { 
    $text =~ s/\b\Q$word\E\s?//sg; 
} 

# output "fixed" text 
print $text;

sample.txt

$ cat sample.txt 
how about i decide to look at it afterwards what 
across do you think is it a good idea to go out and about i 
think id rather go up and above

stopWords.txt

I 
a 
about 
an 
are 
as 
at 
be 
by 
com 
for 
from 
how 
in 
is 
it 
..

Output:

$ ./remove.pl stopwords.txt sample.txt 
i decide look fterwards cross do you think good idea go out d i 
think id rather go up d bove

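As you can see, it removes the leading a from afterwards, leaving fterwards. I think it is a regex problem. Could someone help me with a quick patch? Thanks for your help :J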

Source

2015-10-12 pbu

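Shouldn't 's/\b\Q$word\E\s?//sg;' be 's/\b\Q$word\E\b//sg;' instead? – hjpotter92

@hjpotter92 That's it, post it as an answer. :J – pbu

See also: turn on 'strict' and 'warnings'. Opening files with the 3-arg form and a lexical filehandle is a good idea, especially when you take the file names from the argument list. You also don't need to assign to '$_' explicitly in your while loop. Either use a different (meaningful) name, or don't assign at all: 'while (<$file>)' does exactly the same thing. – Sobrique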
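
A minimal sketch of what Sobrique's comment suggests for the word-reading part of the script. The sub name read_words is hypothetical, and chomp replaces the question's chop as an assumption (chop would eat the last letter of the final word if the file does not end in a newline):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: return every whitespace-separated word found in a file.
sub read_words {
    my ($path) = @_;
    # 3-arg open with a lexical filehandle instead of a bareword like WORDS
    open(my $fh, '<', $path) or die "failed to open $path: $!";
    my @words;
    while (my $line = <$fh>) {      # a named variable instead of $_
        chomp $line;                # removes only a trailing newline
        push @words, split ' ', $line;
    }
    close $fh;
    return @words;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    print "$_\n" for read_words($ARGV[0]);
}
```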

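Use word boundaries on both sides of $word. Currently, you are only checking for one at the beginning.

You won't need the \s? condition with the \b in place: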

$text =~ s/\b\Q$word\E\b//sg;
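
To see the effect on the first line of sample.txt, here is a small sketch that wraps the substitution in a sub. The name remove_stop_words and the short word list are illustrative assumptions, not part of the original script:

```perl
use strict;
use warnings;

# Hypothetical helper: delete each stop word only where it is a whole word.
sub remove_stop_words {
    my ($text, @words) = @_;
    for my $word (@words) {
        # \b on both sides, so "a" no longer matches inside "afterwards"
        $text =~ s/\b\Q$word\E\b//g;
    }
    return $text;
}

my $line = "how about i decide to look at it afterwards what";
print remove_stop_words($line, qw(a about at how it)), "\n";
```

Note that without \s? the spaces around each removed word stay in the text, so runs of spaces are left behind where stop words used to be.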

Source

2015-10-12 15:37:23 hjpotter92

Your regex is not strict enough.

$text =~ s/\b\Q$word\E\s?//sg;

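When $word is a, the command is effectively s/\ba\s?//sg. This means: remove every a that begins a new word, followed by zero or one whitespace character. In afterwards, this successfully matches the first a.

You can make the match stricter by ending the word with another \b, like this: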
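
The difference can be checked directly. This is only a sketch; first_match is a hypothetical helper that returns the matched text, or undef when nothing matches:

```perl
use strict;
use warnings;

# Hypothetical helper: return the text matched by $re in $str, or undef.
sub first_match {
    my ($re, $str) = @_;
    return $str =~ /($re)/ ? $1 : undef;
}

my @patterns = (
    [ 'one-sided \ba\s?', qr/\ba\s?/ ],   # the question's pattern with $word = "a"
    [ 'two-sided \ba\b',  qr/\ba\b/  ],   # the same word with a closing \b
);

for my $sample ("afterwards", "a good idea") {
    for my $pattern (@patterns) {
        my ($label, $re) = @$pattern;
        my $hit = first_match($re, $sample);
        print "$sample | $label | ",
            defined $hit ? "matched '$hit'" : "no match", "\n";
    }
}
```

The one-sided pattern matches the a at the start of afterwards, while the two-sided pattern only matches a when it stands alone.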

$text =~ s/\b\Q$word\E\b\s?//sg;
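
As a variation on the loop (not something the original script does), all stop words can be joined into one alternation so the text is scanned once. A sketch, with build_stop_regex and strip_stop_words as hypothetical names:

```perl
use strict;
use warnings;

# Hypothetical helper: one regex that matches any stop word as a whole word,
# plus at most one following whitespace character.
sub build_stop_regex {
    my @words = @_;
    my $alternation = join '|', map { quotemeta } @words;
    return qr/\b(?:$alternation)\b\s?/;
}

# Hypothetical helper: apply the combined regex to a piece of text.
sub strip_stop_words {
    my ($text, $re) = @_;
    $text =~ s/$re//g;
    return $text;
}

my $re = build_stop_regex(qw(I a about an are as at be by com for from how in is it));
print strip_stop_words("across do you think is it a good idea to go out and about i\n", $re);
```

Two things to keep in mind: \s also matches a newline, so a stop word at the end of a line pulls the next line up, and the match is case-sensitive, so the I in the list does not remove a lowercase i unless you add the /i flag.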

Source

2015-10-12 15:37:47 Gowtham
