0
好的,這對我自己的學習來說比實際需要更多。從一批文件中刪除項目的更優雅的解決方案?
我有以下格式文件:
Loading parser from serialized file ./englishPCFG.ser.gz ... done [2.8 sec].
Parsing file: chpt1_1.txt
Parsing [sent. 1 len. 42]: [1.1, Organisms, Have, Changed, over, Billions, of, Years, 1, Long, before, the, mechanisms, of, biological, evolution, were, understood, ,, some, people, realized, that, organisms, had, changed, over, time, and, that, living, organisms, had, evolved, from, organisms, no, longer, alive, on, Earth, .]
(ROOT
(S
(S
(NP (CD 1.1) (NNS Organisms))
(VP (VBP Have)
(VP (VBN Changed)
(PP (IN over)
(NP
(NP (NNS Billions))
(PP (IN of)
(NP (NNP Years) (CD 1)))))
(SBAR
(ADVP (RB Long))
(IN before)
(S
(NP
(NP (DT the) (NNS mechanisms))
(PP (IN of)
(NP (JJ biological) (NN evolution))))
(VP (VBD were)
(VP (VBN understood))))))))
(, ,)
(NP (DT some) (NNS people))
(VP (VBD realized)
(SBAR
(SBAR (IN that)
(S
(NP (NNS organisms))
(VP (VBD had)
(VP (VBN changed)
(PP (IN over)
(NP (NN time)))))))
(CC and)
(SBAR (IN that)
(S
(NP (NN living) (NNS organisms))
(VP (VBD had)
(VP (VBN evolved)
(PP (IN from)
(NP
(NP (NNS organisms))
(ADJP
(ADVP (RB no) (RBR longer))
(JJ alive))))
(PP (IN on)
(NP (NNP Earth)))))))))
(. .)))
num(Organisms-2, 1.1-1)
nsubj(Changed-4, Organisms-2)
aux(Changed-4, Have-3)
ccomp(realized-22, Changed-4)
prep_over(Changed-4, Billions-6)
prep_of(Billions-6, Years-8)
num(Years-8, 1-9)
advmod(understood-18, Long-10)
dep(understood-18, before-11)
det(mechanisms-13, the-12)
nsubjpass(understood-18, mechanisms-13)
amod(evolution-16, biological-15)
prep_of(mechanisms-13, evolution-16)
auxpass(understood-18, were-17)
ccomp(Changed-4, understood-18)
det(people-21, some-20)
我需要刪除所有不屬於重要的依存關係(最後一節)。然後保存新文件。這是我的工作代碼:
#!usr/bin/perl
use strict;
use warnings;
##Call with *.txt on command line
##EDIT TO ONLY FIND FILES YOU WANT CHANGED:
my @files = glob("parsed"."*.txt");
foreach my $file (@files) {
my @newfile;
open(my $parse_corpus, '<', "$file") or die $!;
while (my $sentences = <$parse_corpus>) {
#print $sentences, "\n\n";
if ($sentences =~ /(\w+)\(\S+\-\d+\,\s\S+\-\d+\)/) {
if ($sentences =~ /subj\w*\(|obj\w*\(|prep\w*\(|xcomp\w*\(|agent\w*\(|purpcl\w*\(|conj_and\w*\(/) {
push (@newfile, $sentences);
}
}
else {
push (@newfile, $sentences);
}
}
open(FILE ,'>', "select$file");
print FILE @newfile;
close FILE
}
而改變的輸出文件的一部分:
nsubj(Changed-4, Organisms-2)
prep_over(Changed-4, Billions-6)
prep_of(Billions-6, Years-8)
nsubjpass(understood-18, mechanisms-13)
prep_of(mechanisms-13, evolution-16)
nsubj(realized-22, people-21)
nsubj(changed-26, organisms-24)
prep_over(changed-26, time-28)
nsubj(evolved-34, organisms-32)
conj_and(changed-26, evolved-34)
prep_from(evolved-34, organisms-36)
prep_on(evolved-34, Earth-41)
是否有顯著更好的方法,或者用一個更優雅的/聰明的解決方案?
感謝您的時間,這又純粹是爲了您的興趣,所以如果您沒有時間,請不要幫忙。
幹得好!我從中學到了,特別是outfile的效率部分。我喜歡「除非」更改,以及「autodie」。謝謝! p.s.將「my $ newfile」更改爲「my $ outfile」以獲得完美的代碼。 – Jon
有沒有辦法做一個「雙下一個」,所以你跳過(內部和外部)循環迭代(我不知道爲什麼,只是想知道) – Jon
你可以標記你的循環,然後'標籤下'。但是在這個代碼的情況下,它會跳轉到下一個文件。請參閱perlsyn文檔。 – DavidO