.i 1
.t
effici machineindepend procedur
garbag collect variou list structur
.w
method return regist free
list essenti part list process
system. paper past solut recoveri
problem review compar. new algorithm
present offer signific advantag speed
storag util. routin implement
algorithm written list languag
insur degre
machin independ. final applic
algorithm number differ list structur
appear literatur indic.
.b
cacm august 1967
.a
schorr h.
wait w. m.
.n
ca670806 jb februari 27 1978 428 pm
.x
1024 4 1549
1024 4 1549
1050 4 1549
.i 2
.t
comparison batch process instant turnaround
.w
studi program effort student
introductori program cours present
effect have instant turnaround minut
oppos convent batch process
turnaround time hour examin.
item compar number comput
run trip comput center program prepar
time keypunch time debug time
number run elaps time run
run problem.
result influenc fact bonu point
given complet program problem
specifi number run
evid support instant batch.
.b
cacm august 1967
.a
smith l. b.
.n
ca670805 jb februari 27 1978 432 pm
.x
1550 4 1550
1550 4 1550
1304 5 1550
1472 5 1550
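Now, the text above contains 2 documents, both of which have already had stop words removed and been stemmed. A new document starts at .i (followed by a number). What I need to do is index the text between .t & .b, .b & .a, .a & .n, and .n & .x, and ignore all text between .x and the start of the next document, i.e. .i (followed by a number). How can I index all the unique terms in the corpus using Perl?
The contents of all documents are stored in a single file, called the "corpus". I need to index the number of times each term appears in the corpus and in each document, and possibly at which positions in the document.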
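One way the splitting step described above could look is sketched below. This is only an illustration under some assumptions: the sub name parse_corpus and the sample text are made up for the example, markers are taken to sit alone on a line exactly as in the excerpt, and tokens are split on whitespace only, so trailing punctuation such as "system." stays attached.

```perl
use strict;
use warnings;

# Split a corpus into documents. Returns a hash ref mapping each document id
# to an array ref of its tokens, in order, taken from every field except .x.
# (parse_corpus is a hypothetical name, not part of the original code.)
sub parse_corpus {
    my ($fh) = @_;
    my (%docs, $id);
    my $skip = 1;    # true while outside a document or inside a .x block
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^\.i\s+(\d+)\s*$/i) {      # ".i N" starts a new document
            $id        = $1;
            $docs{$id} = [];
            $skip      = 1;                      # wait for the first field marker
        }
        elsif ($line =~ /^\.([a-z])\s*$/i) {     # field marker such as .t or .x
            $skip = (lc($1) eq 'x') if defined $id;
        }
        elsif (!$skip) {
            push @{ $docs{$id} }, split ' ', $line;
        }
    }
    return \%docs;
}

# Made-up sample in the same layout as the corpus excerpt.
my $sample = <<'END';
.i 1
.t
garbag collect list
.w
list process system
.b
cacm august 1967
.x
1024 4 1549
.i 2
.t
batch process
.x
1550 4 1550
END

# Read from an in-memory filehandle so the sketch needs no external file.
open my $in, '<', \$sample or die "Cannot open in-memory sample: $!";
my $docs = parse_corpus($in);
close $in;

for my $doc_id (sort { $a <=> $b } keys %$docs) {
    print "doc $doc_id: @{ $docs->{$doc_id} }\n";
}
```

Reading the corpus once and grouping tokens by document id keeps the later counting step independent of the file format.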
use strict;
use warnings;

my $file = 'sometext.txt';
open my $fh, '<', $file or die "Couldn't open $file: $!";
my @texts = <$fh>;
close($fh) or die "Couldn't close $file: $!\n\n";

foreach my $text (@texts) {
    my @lines = split /\n/, $text;
    foreach my $line (@lines) {
        my @words = split ' ', $line;    # split the current line, not $text
        foreach my $word (@words) {
            $word = trim($word);
            # \Q...\E keeps characters such as "." literal; the anchors make
            # this a whole-word match instead of a substring match
            my $match = qr/^\Q$word\E$/i;
            open my $st, '<', $file or die "Couldn't open $file: $!";
            my $count = 0;
            while (<$st>) {
                my @mword = split ' ', $_;
                for my $i (0 .. $#mword) {
                    if ($mword[$i] =~ $match) {
                        #print "match found on line $. word ", $i+1,"\n";
                        $count++;
                    }
                }
            }
            print "$word appears $count times \n";
            close($st) or die "Couldn't close $file: $!\n\n";
        }
    }
}

# Remove leading and trailing whitespace
sub trim {
    my $string = shift;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    return $string;
}
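The code above counts the occurrences of each word in the corpus. How can I change it so that it also counts the occurrences of words in each individual document?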
'++$count{$word}' – ikegami 2012-04-13 21:31:49
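The hash-counter idea in the comment can be extended to per-document counts and positions roughly as follows. This is a minimal sketch, not a definitive answer: the sub name build_index, the three index names, and the sample tokens are all assumptions made for illustration, and the input is assumed to be tokens already grouped by document id.

```perl
use strict;
use warnings;

# Build three indexes from tokens grouped by document (all names hypothetical):
#   $corpus{$word}          total occurrences in the whole corpus
#   $per_doc{$id}{$word}    occurrences within one document
#   $pos{$id}{$word}        1-based word positions within that document
sub build_index {
    my ($docs) = @_;
    my (%corpus, %per_doc, %pos);
    for my $id (keys %$docs) {
        my $n = 0;
        for my $word (@{ $docs->{$id} }) {
            $n++;
            ++$corpus{$word};                  # the counter from the comment
            ++$per_doc{$id}{$word};            # same idea, keyed by document
            push @{ $pos{$id}{$word} }, $n;    # remember where the word occurred
        }
    }
    return (\%corpus, \%per_doc, \%pos);
}

# Made-up sample data standing in for the parsed corpus.
my %docs = (
    1 => [qw(garbag collect list structur list process)],
    2 => [qw(batch process instant turnaround)],
);

my ($corpus, $per_doc, $pos) = build_index(\%docs);

for my $word (sort keys %$corpus) {
    print "$word appears $corpus->{$word} times in the corpus\n";
    for my $id (sort { $a <=> $b } keys %$per_doc) {
        my $count = $per_doc->{$id}{$word} or next;    # skip documents without the word
        print "  doc $id: $count at positions @{ $pos->{$id}{$word} }\n";
    }
}
```

Because every token is visited once, this avoids reopening the corpus file for each word, and each unique term is reported a single time.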