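sed script for language translation: improving efficiency on long texts
Here's my problem. I am a Spanish translator, and I have a very lengthy Spanish-English glossary file of 50K entries. In addition, I have a stoplist of over 1K entries. I want to strip these entries from the texts I intend to translate. So I built a script that in turn builds two sed scripts from the glossaries; these do the stripping and leave only the untranslated text (so I don't have to solve the same problem twice). This works well, but long texts take a long time, sometimes as much as 15 minutes. Is this unavoidable, or is there a more efficient way to do this?
Here is the main script: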
#!/bin/sh
before="$(date +%s)"
#wordstxt=$(wc -w < $1)
#mintime=$(expr "$wordstxt/200" |bc -l)
#maxtime=$(expr "$wordstxt/175" |bc -l)
#echo "Estimated time to process: between $mintime and $maxtime seconds."
sed '
s/\,/\n/g # strip all commas
s/\?/\n/g # strip question marks
s/\*/\n/g # strip asterisks
s/\!/\n/g # strip exclamation marks
s/:/\n/g # strip colons
s/\-/\n/g # strip hyphens
s/\./\n/g # strip periods
s/«/\n/g # strip left Euro-quotes
s/»/\n/g # strip right Euro-quotes
s/”/\n/g # strip slanted US quotes
s/\"/\n/g # strip straight double quotes
s/(/\n/g # strip left paren
s/)/\n/g # strip right paren
s/\[/\n/g # strip left bracket
s/\]/\n/g # strip right bracket
s/¿/\n/g # "¿"
s/—/\n/g # m-dash
s/\ –\ /\n/g # n-dash
s/…/\n/g # strip ellipsis as a single character, not three periods
s/;/\n/g # strip semicolon
s/[0-9]/\n/g # strip out all numbers, replace with returns
' $1 > $1.z.tmp
#echo "Punctuation eliminated."
#cp ../../Spanish\ to\ English\ projects/glossary/stoplist.txt .
sed '
s/^\ //g # strip leading spaces
s/\ $// # strip trailing spaces
/^$/d # delete blank lines
s/\./\n/g # strip periods
s/\ /\\ /g # make spaces into literals
s/^/s\// # begins the substitution
s/$/\/\\n\/g/ # concludes the substitution
1 s/^/#!\ \/bin\/sed\ \-f\n\ns\/\[0\-9\]\/\/g\ns\/\\\ \\\ \/\\\ \/g\ns\/\\\.\\\ \/\\n\/g\n\n/
' stoplist.txt > stoplist.sed
chmod +x stoplist.sed
echo "Eliminating stopwords."
./stoplist.sed $1.z.tmp > $1.0.tmp
sed 's/\([A-Za-z\ ]*\t\).*/\1/' SpanishGlossary.utf8 > tempgloss.2.txt
#echo "Target phrases stripped."
sort -u tempgloss.2.txt > tempgloss.3.txt
awk '{ print length(), $0 | "sort -rn" }' tempgloss.3.txt > tempgloss.4.txt
#echo "List ordered by length."
#echo "Now creating new sed script." # THIS AFFECTS THE SED SCRIPT, NOT THE OUTPUT FILE.
sed '
s/[0-9]//g # strip out all numbers
s/^\ //g # strip leading spaces -- all lines have them due to the sort
/^$/d # delete blank lines
s/\//\\\//g # make text slashes into literals
s/"/\n/g # strip quotes
s/\t//g # strip tabs
s/\./\n/g # strip periods
s/'\''/\\'\''/g # make straight apostrophes into literals
s/'\’'/\\'\’'/g # make curly apostrophes into literals
s/\ /\\ /g # make spaces into literals
/^.\{0,5\}$/d # delete lines with five or fewer characters
s/^/s\/\\b/ # begins the substitution
s/$/\\b\/\\n\/g/ # concludes the substitution
1 s/^/#!\ \/bin\/sed\ \-f\n\ns\/\[0\-9\]\/\/g\ns\/\\\ \\\ \/\\\ \/g\ns\/\\\.\\\ \/\\n\/g\n\n/
' tempgloss.4.txt > glossy.sed
#echo "glossy.sed created."
chmod +x glossy.sed
echo "Eliminating existing entries. This may take a while."
./glossy.sed $1.0.tmp > $1.1.tmp
echo "Now cleaning up lines."
sed -e '
s/\ $// # strip trailing spaces
s/^\ *//g # strip any and all leading spaces
s/\ el$//g # strip "el" from the end
s/\ la$//g # strip "la" from the end
s/\ los$//g # strip "los" from the end
s/\ las$//g # strip "las" from the end
s/\ o$//g # strip "o" from the end
s/\ y$//g # strip "y" from the end
s/\ $// # strip trailing spaces (yes, again)
' $1.1.tmp > $1.2.tmp
echo "Creating ngrams."
./ngrams 5 < $1.2.tmp > $1.3.tmp 2> /dev/null
linecount="$(wc -l < $1.3.tmp)"
#echo $linecount "lines."
if [ "$linecount" -gt "1000" ]
then
echo "Eliminating single instances."
sed '/^1\t/d' $1.3.tmp > $1.4.tmp
else
echo "Fewer than 1000 entries, so keeping all."
cp $1.3.tmp $1.4.tmp
fi
sed -e '
s/[0-9]//g # strip out all numbers
s/^\t//g # strip leading tab
s/^\ *//g # strip any and all leading spaces
/^.\{0,7\}$/d # delete lines with seven or fewer characters
s/\ $// # strip trailing spaces (yes, again)
#s/$/\t/ # add in the tab
' $1.4.tmp > $1.csv
echo "Looking for duplicates."
sh ./dedupe $1.csv
wordstxt=$(wc -w < $1)
#echo $wordstxt
wordslist=$(wc -w < $1.csv)
#echo $wordslist
wordspercent=$(echo "scale=4; $wordslist/$wordstxt" |bc -l)
wordspercentage=$(echo "$wordspercent * 100" |bc -l)
after="$(date +%s)"
elapsed_seconds="$(expr $after - $before)"
rate=$(echo "scale=3; $wordstxt/$elapsed_seconds" |bc -l)
echo "Created "$1.csv", with $wordspercentage% left, in" $elapsed_seconds "seconds." #, for an effective rate of" $rate "words per second."
rm tempgloss.*.txt
rm *.tmp
rm glossy.sed
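Interesting problem, but I don't have time to rewrite your script. Others may. You can combine word substitutions like 's/\ el$|\ los|\ la$//'. Using '/g' on strings that include the end-of-line marker '$' probably doesn't cost any extra time, but it will make your code harder for others to understand. You can also split on many single characters at once, like 's/[,?\*!:-\.]/\n/g', but using ranges in a '[character-class]' will cause confusion. Good luck. – shellter 2013-03-02 02:10:44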
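A minimal sketch of the two combinations suggested in this comment, assuming GNU sed (for `-E` and for `\n` in the replacement). The function names and the sample input are made up for illustration. In sed's default basic-regex mode a bare `|` is a literal character, so the sketch uses `-E`; the hyphen is placed last in the bracket expression so that it is read literally rather than as a range.

```shell
#!/bin/sh
# Assumes GNU sed: -E enables extended regexes, and \n in the
# replacement inserts a newline. The input line is made-up sample text.

# One bracket expression stands in for a dozen single-character rules.
# The hyphen comes last so it is literal rather than a range.
split_punct() {
    sed -E 's/[,?*!:.;()-]/\n/g'
}

# One alternation stands in for several trailing-word rules.
strip_trailing() {
    sed -E 's/ (el|la|los|las|o|y)$//'
}

printf '%s\n' 'hola, mundo! la casa de los' | split_punct | strip_trailing
```

Each function makes a single pass over a line where the original script makes one pass per rule. The bracket expression here covers only part of the punctuation list in the script; the remaining characters would need to be added.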
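Thanks for the tip. Since posting this, I have pulled the punctuation out of the top of the script and put it into the stoplist. Is there any advantage to the combining you describe? Having one super-giant line instead of thousands of small lines? – user1889034 2013-03-02 02:44:33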
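Yes, each scan of a line costs you x amount of time. Using a regex with, say, 5 ORed expressions (using '|') reduces the time to roughly x/5. I wouldn't try to spell out every possible word on one 's/wd1|wd2/' line; you'll reach a point of diminishing returns in the time needed to debug sed error messages. Combine related words into one substitution so that your code stays maintainable. There may be other tricks that reduce the overall run time. Sometimes more commands in a pipeline is better, but I can't say right now. Good luck. – shellter 2013-03-02 02:53:11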
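The batching idea could be sketched as follows: rather than emitting one `s/entry/\n/g` rule per glossary entry, the generator groups a fixed number of entries into each alternation. This is an illustration only. The helper name `make_rules`, the batch size, the file name `rules.sed`, and the sample entries are all assumptions. It relies on GNU sed for `\b`, `-E`, and `\n` in the replacement, and it assumes the entries contain no regex metacharacters or slashes (the original script escapes those in separate steps). Whether this is actually faster on a 50K-entry glossary would have to be measured.

```shell
#!/bin/sh
# Hypothetical helper: reads one entry per line on stdin and writes a sed
# script in which each rule covers up to $1 entries (default 5).
# Assumes entries are free of regex metacharacters and slashes.
make_rules() {
    awk -v n="${1:-5}" '
        { batch = (batch == "" ? $0 : batch "|" $0) }
        NR % n == 0 { print "s/\\b(" batch ")\\b/\\n/g"; batch = "" }
        END { if (batch != "") print "s/\\b(" batch ")\\b/\\n/g" }
    '
}

# Made-up sample entries, longest first as in the original script.
printf '%s\n' 'casa grande' 'perro' 'gato' | make_rules 2 > rules.sed

# GNU sed: -E for the alternation, -f to read the generated rules.
printf '%s\n' 'el perro y la casa grande' | sed -E -f rules.sed
rm -f rules.sed
```

With a batch size of 2, the three sample entries produce two rules instead of three, so the number of passes per input line shrinks in proportion to the batch size.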