"T h e n as data we give the t r a j e c t o r i e s o f the particles between ..." 



$ perl -e '$t="The \ \ \ \ t h i n g w r o n g h e r e is we have a gap s."; print "$t\n"; 
$t=~s/(\s{2,})/$1 /g; print "$t\n"; 
$t=~s/(\w)\s?/$1/g;  print "$t\n"; 
$t=~s/\s+/ /g;   print "$t\n";' 

The  t h i n g w r o n g  h e r e is we have a gap s. 
The   t h i n g  w r o n g  h e r e  is we have  a gap s. 
The   t h i n g  w r o n g  h e r e  is we have  a gap  s. 
The   thing wrong here is we have  a gap  s. 
The thing wrong here is we have a gap s. 

The "gap s" at the end of the sentence is intentional; it should not be closed up.


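Question 2. What can be done about an OCR text dump that is entirely single-spaced? I think this can only be solved, in general, by cleaning up text of the form "T h e n as data we give the t r a j e c t o r i e s o f the particles between ...", where the word boundaries are unclear, with some heavy-duty module that looks for dictionary words in a string of single letters. Is there such a module? (I have searched but have not found one yet.)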


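You are trying to manipulate natural language with regular expressions. That is difficult in the best case, and in the space you are working in it may be impossible. Proceed with caution, here be dragons...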


http://stackoverflow.com/questions/1136990/how-can-i-extract-text-from-a-pdf-file-in-perl – xxfelixxx


http://search.cpan.org/~cdolan/CAM-PDF-1.60/bin/getpdftext.pl – xxfelixxx



For the first problem (too many spaces), you can easily solve it with s/\s+/ /g. As for the second problem, I'm not sure whether such a library exists.
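
For comparison, here is a minimal sketch of the same whitespace collapse in Python, the language used in the answer further down. The function name `collapse_spaces` is made up for this illustration. Like the Perl substitution, it only squeezes runs of whitespace; it does nothing for letters that are separated by single spaces.

```python
import re

def collapse_spaces(text):
    """Replace every run of whitespace with a single space."""
    return re.sub(r"\s+", " ", text)

print(collapse_spaces("The   thing  wrong   here is we have  a gap  s."))
# The thing wrong here is we have a gap s.
```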



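  • No regex-based approach will give you a good solution to the spacing problem.


  • Take a problem like "T h e n as data we give the t r a j e c t o r i e s o f the particles between".

  • First, remove all the spaces in the sentence, then use Norvig's work - Word Segmentation Solution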


from __future__ import division
from collections import Counter
import nltk

# Word counts come from NLTK's ABC corpus (fetch it once with nltk.download('abc')).
WORDS = nltk.corpus.abc.words()
COUNTS = Counter(WORDS) 

def pdist(counter): 
    "Make a probability distribution, given evidence from a Counter." 
    N = sum(counter.values()) 
    return lambda x: counter[x]/N 

P = pdist(COUNTS) 

def Pwords(words): 
    "Probability of words, assuming each word is independent of others." 
    return product(P(w) for w in words) 

def product(nums):
    "Multiply the numbers together. (Like `sum`, but with multiplication.)"
    result = 1
    for x in nums:
        result *= x
    return result

def memo(f):
    "Memoize function f, whose args must all be hashable."
    cache = {}
    def fmemo(*args):
        if args not in cache:
            cache[args] = f(*args)
        return cache[args]
    fmemo.cache = cache
    return fmemo

def splits(text, start=0, L=20):
    "Return a list of all (first, rest) pairs; start <= len(first) <= L."
    return [(text[:i], text[i:])
            for i in range(start, min(len(text), L)+1)]

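@memo  # without memoization the recursion is exponential in len(text)
def segment(text):
    "Return a list of words that is the most probable segmentation of text."
    if not text:
        return []
    # Try every (first, rest) split and keep the most probable one.
    candidates = ([first] + segment(rest)
                  for (first, rest) in splits(text, 1))
    return max(candidates, key=Pwords)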

text = "T h e n as data we give the t r a j e c t o r i e s o f the particles between" 
text = text.replace(" ", "") 
print(segment(text))
# ['Then', 'as', 'data', 'we', 'give', 'the', 'trajectories', 'of', 'the', 'particles', 'between'] 
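
If you want to try the idea without downloading a corpus, here is a rough self-contained sketch of the same approach. Everything specific in it is an assumption made for illustration: the tiny `COUNTS` table is invented, the input is lowercased so it matches that table, and unknown words are penalised by length so that known words win. It also sums log-probabilities instead of multiplying raw probabilities, which avoids underflow on longer inputs.

```python
from functools import lru_cache
from math import log

# Hypothetical toy word counts; a real run would use counts from a large corpus.
COUNTS = {"then": 5, "as": 8, "data": 4, "we": 9, "give": 3, "the": 20,
          "trajectories": 1, "of": 15, "particles": 2, "between": 3}
TOTAL = sum(COUNTS.values())

def log_p(word):
    """Log-probability of a word; unknown words get a length-based penalty."""
    if word in COUNTS:
        return log(COUNTS[word] / TOTAL)
    return log(1 / (TOTAL * 10 ** len(word)))

@lru_cache(maxsize=None)
def segment(text):
    """Return (score, words) for the most probable segmentation of text."""
    if not text:
        return 0.0, ()
    best = None
    # Try every first word of up to 20 characters, as in splits() above.
    for i in range(1, min(len(text), 20) + 1):
        first, rest = text[:i], text[i:]
        rest_score, rest_words = segment(rest)
        candidate = (log_p(first) + rest_score, (first,) + rest_words)
        if best is None or candidate[0] > best[0]:
            best = candidate
    return best

text = "T h e n as data we give the t r a j e c t o r i e s o f the particles between"
score, words = segment(text.replace(" ", "").lower())
print(" ".join(words))
# then as data we give the trajectories of the particles between
```

With a vocabulary this small, any word missing from `COUNTS` ends up as one penalised chunk, so treat the result as a demonstration of the mechanics rather than of real-world accuracy.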
