Formatting text with regard to punctuation

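How can I format natural-language text with regard to punctuation? Vim's built-in gq command, and command-line tools such as fmt or par, break lines without regard to punctuation. Let me give an example.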

fmt -w 40 gives me something I don't want:

we had everything before us, we had 
nothing before us, we were all going 
direct to Heaven, we were all going 
direct the other way

Something like smart_formatter -w 40 would give:

we had everything before us, 
we had nothing before us, 
we were all going direct to Heaven, 
we were all going direct the other way

Of course, when no punctuation is found within the given text width, it can fall back to the standard text-formatting behavior.
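To make the desired behavior concrete, here is a minimal sketch of one way to get it: a greedy wrapper that collects as many words as fit in the width, then breaks after the last word in that window that ends in punctuation, and falls back to a plain greedy break when the window contains none. The name smart_fmt, the punctuation set [.?!,;:], and the choice to treat all input as one paragraph are assumptions made for illustration. It does not weigh punctuation against raggedness; it always prefers the punctuation break.

```shell
#!/bin/sh
# smart_fmt WIDTH: read text on stdin, print it wrapped to WIDTH columns,
# preferring line breaks right after punctuation.
# Hypothetical helper; all input is treated as a single paragraph.
smart_fmt() {
    awk -v width="$1" '
    { for (f = 1; f <= NF; f++) words[++n] = $f }    # collect every word
    END {
        i = 1
        while (i <= n) {
            # Greedily extend the window i..j while the words still fit.
            # A single overlong word still gets a line of its own.
            len = length(words[i]); j = i
            while (j < n && len + 1 + length(words[j + 1]) <= width) {
                j++
                len += 1 + length(words[j])
            }
            # Prefer the last word in the window that ends in punctuation;
            # otherwise fall back to the plain greedy break at j.
            cut = j
            for (k = j; k >= i; k--)
                if (words[k] ~ /[.?!,;:]$/) { cut = k; break }
            line = words[i]
            for (k = i + 1; k <= cut; k++) line = line " " words[k]
            print line
            i = cut + 1
        }
    }'
}
```

Fed the sample sentence with `smart_fmt 40`, this prints the four lines shown above, each of the first three ending in a comma. Because it never trades punctuation against line length, it can produce very short lines in text with dense punctuation.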

The reason I want this is to get meaningful text diffs, where I can spot which sentences or clauses have changed.

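Here is a not very elegant, but working, method I finally came up with. Suppose a line break at a punctuation mark is worth 6 characters. That means I will accept a more ragged result that has more lines ending in punctuation, as long as the "raggedness" is less than 6 characters. For example, this is OK (the "raggedness" is 3 characters):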

Wait! 
He said.

This is not OK (the "raggedness" is more than 6 characters):

Wait! 
He said to them.

The method is to add 6 dummy characters after each punctuation mark, format the text, and then remove the dummy characters.

Here is the code for this:

sed -e 's/\([.?!,]\)/\1 _ _ _/g' | fmt -w 34 | sed -e 's/ _//g' -e 's/_ //g'

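I used " _" (space + underscore) as the dummy characters, three pairs for six characters in total, assuming they do not occur in the text. The result looks quite good: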

we had everything before us, 
we had nothing before us, 
we were all going direct to 
Heaven, we were all going 
direct the other way

2011-07-06 20:57:01

Answers