Removing stop words from a string in Perl
I'm using this script to remove stop words in Perl. I'm running on Windows and couldn't find a compatible version of:
Lingua::EN::StopWordList
Lingua::StopWords qw(getStopWords)
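I have an array of stop words, but once I apply the regex below I lose the crucial whitespace, which makes neighbouring words run together. Note that every word in the stop-word array is padded with two spaces, one on the left and one on the right.
How can I remove stop words efficiently without losing the crucial whitespace?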
use strict;
use warnings;
use utf8;
use IO::File;
use String::Util 'trim';

my $inFile = "C:\\Users\\David\\Downloads\\InfoRet\\Explore the ways to get better grades.txt";
my $inFh = IO::File->new($inFile, "r") or die "Cannot open $inFile: $!";
my $lineNum = 0;
my $line = undef;
my $loc = undef;
my $str = undef;
# Each stop word is padded with one space on each side.
my @stopList = (" the ", " a ", " an ", " of ", " and ", " on ", " in ", " by ", " with ", " at ", " after ", " into ", " their ", " is ", " that ", " they ", " for ", " to ", " it ", " them ", " which ");

# Skip the first four lines of the file.
for (my $i = 1; $i <= 4; $i++) {
    <$inFh>;
}

while ($line = <$inFh>) {
    $lineNum++;
    chomp $line;
    # Strip punctuation.
    $line =~ s/[\$#@~!&*()\[\];.,:?^`\\\/]+//g;
    for my $planet (@stopList) {
        $loc = index($line, $planet);
        if ($loc != -1) {
            #$line =~ s/$str//g;
            # Removes the stop word together with both surrounding spaces,
            # which is what makes the neighbouring words run together.
            $line =~ s/$planet//g;
        }
    }
    print "$line\n";
}
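One idea is to not remove the whitespace at all. Instead of looping over the stop list, build a hash with the stop words as keys and "" as their values. Then run s#(\w+)#$hash{lc($1)} // $1#ge. Note that you have to use defined-or, //, because "" is a false value. Also note that you have to remove the spaces from your stop-word list. – TLP 2014-11-08 18:24:29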
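The hash-based idea from the comment could be sketched roughly as follows. This is only an illustration under a few assumptions: the helper name remove_stop_words and the hash name %replacement are made up for this example, the stop words are the ones from the question with their padding spaces removed, and Perl 5.10 or later is assumed for the // operator. The final whitespace cleanup is an optional extra that the comment does not mention.

```perl
use strict;
use warnings;

# Stop words from the question, without surrounding spaces,
# each mapped to the empty string.
my @stop_words = qw(the a an of and on in by with at after into
                    their is that they for to it them which);
my %replacement = map { $_ => '' } @stop_words;

# Hypothetical helper: blank out stop words, keep every other word.
sub remove_stop_words {
    my ($text) = @_;
    # /e evaluates the replacement as Perl code. The defined-or operator
    # (//) falls back to the original word when it is not a hash key;
    # a plain || would not work because '' is defined but false.
    $text =~ s#(\w+)#$replacement{lc($1)} // $1#ge;
    # Optional cleanup: squeeze runs of spaces and trim both ends.
    $text =~ tr/ //s;
    $text =~ s/^ | $//g;
    return $text;
}

print remove_stop_words("Explore the ways to get better grades in a class"), "\n";
# prints "Explore ways get better grades class"
```

Because only the word itself is replaced, the spaces on either side are never touched, so neighbouring words cannot merge. Lowercasing the lookup key also means "The" at the start of a line is matched, which the space-padded list in the question would miss.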