2011-05-16

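How can I format natural-language text according to its punctuation? Vim's built-in gq command and command-line tools such as fmt or par do not take punctuation into account when breaking lines. Here is an example of what I mean by punctuation-aware formatting.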

fmt -w 40 gives a result that is not what I want:

we had everything before us, we had 
nothing before us, we were all going 
direct to Heaven, we were all going 
direct the other way 

smart_formatter -w 40 would give:

we had everything before us, 
we had nothing before us, 
we were all going direct to Heaven, 
we were all going direct the other way 

Of course, when no punctuation falls within the given text width, it can fall back to the standard text-formatting behavior.

The reason I want this is to get a meaningful diff of the text, one in which I can spot which sentences or clauses have changed.

Answer


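This is not very elegant, but I finally came up with a method that works. Suppose a line break at a punctuation mark is worth 6 characters. That means I will accept a more ragged result, one that has more lines ending in punctuation, as long as the extra "raggedness" is less than 6 characters. For example, this is OK (the "raggedness" is 3 characters):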

Wait! 
He said. 

This is not OK (the "raggedness" is more than 6 characters):

Wait! 
He said to them. 

The method is to add 6 dummy characters after each punctuation mark, format the text, and then remove the dummy characters.

Here is the command that does this:

sed -e 's/\([.?!,]\)/\1 _ _ _/g' | fmt -w 34 | sed -e 's/ _//g' -e 's/_ //g' 

I used " _" (space + underscore) as the dummy characters, assuming they do not occur in the text. The result looks pretty good:

we had everything before us, 
we had nothing before us, 
we were all going direct to 
Heaven, we were all going 
direct the other way
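
For repeated use, the same pipeline could be wrapped in a small shell function. This is only a sketch: the name punct_fmt and its optional width argument are hypothetical and not part of the command above, and it still assumes the input text contains no underscores.

```shell
#!/bin/sh
# punct_fmt [FMT_WIDTH]: reflow stdin, favouring line breaks after punctuation.
# Hypothetical helper name; assumes the input text contains no underscores.
punct_fmt() {
    # Width handed to fmt; the default of 34 mirrors the command above.
    fmt_width=${1:-34}
    # 1. Append the 6 dummy characters " _ _ _" after each . ? ! or ,
    # 2. Let fmt wrap the padded text.
    # 3. Strip the dummy characters, whether they ended up at the end
    #    or at the start of a line.
    sed -e 's/\([.?!,]\)/\1 _ _ _/g' |
        fmt -w "$fmt_width" |
        sed -e 's/ _//g' -e 's/_ //g'
}
```

It reads from standard input, so it could be called as punct_fmt 34 < input.txt, where input.txt stands for whatever file you want to reflow. Different fmt implementations may break the lines slightly differently.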