2013-07-27 25 views
0

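How to replace a specific pattern in Python

I want to replace every integer greater than 2147483647 that is followed by ^^<int> with the first 3 digits of that number. For example, my original data is: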

<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact". 
"Ask a Question" <at> "25500000000"^^<int> <stack_overflow> . 
<basic> "language" "89028899" <html>. 

I want that data replaced with the following:

<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact". 
"Ask a Question" <at> "255"^^<int> <stack_overflow> . 
<basic> "language" "89028899" <html>. 

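My current approach scans the data line by line. If I find a number greater than 2147483647, I replace it with its first 3 digits. However, I don't know how to check whether the next part of the string is ^^<int>.

What I want to do: for numbers greater than 2147483647, e.g. 25500000000, replace them with the first three digits of the number. Since my data is 1 TB in size, a faster solution would be greatly appreciated.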

Answers

3

Use the re module and build a regular expression:

import re

regex = r"""
(                 # Capture in group #1
    "[\w\s]+"     # A quoted sequence of letters and white space characters
    \s+           # followed by one or more white space characters
    <\w+>         # and an angle-bracketed term such as <at>, as in the sample line
    \s+
)
"(\d{10,})"       # Match a quoted run of at least 10 digits into group #2
(\^\^<int>)       # Match two literal circumflex characters and <int>
                  # into group #3 (the circumflexes must be escaped)
(.*)              # Followed by anything at all into group #4
"""

COMPILED_REGEX = re.compile(regex, re.VERBOSE)

Next, we need to define a callback function (a compiled pattern's sub method accepts a callable as the replacement) to handle the replacement:

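def replace_callback(matches):
    full_line = matches.group(0)
    number_text = matches.group(2)
    number_of_interest = int(number_text, base=10)
    if number_of_interest > 2147483647:
        # Offsets of the number (group 2) relative to the start of the match
        start = matches.start(2) - matches.start(0)
        end = matches.end(2) - matches.start(0)
        # Keep only the first three digits of the number
        return full_line[:start] + number_text[:3] + full_line[end:]
    else:
        return full_line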

Then find and replace:

fixed_data = COMPILED_REGEX.sub(replace_callback, YOUR_DATA) 
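
As a point of comparison, the same replacement can be sketched with a lookahead, so that the pattern does not depend on what precedes the number. This is an alternative to the regex above rather than part of the original answer. The names `INT_MAX`, `BIG_INT`, and `shorten` are made up for illustration, and the sketch assumes the number is always double-quoted and immediately followed by `^^<int>`.

```python
import re

INT_MAX = 2147483647

# A quoted run of 10+ digits, only when directly followed by ^^<int>.
# The lookahead (?=...) checks the suffix without consuming it.
BIG_INT = re.compile(r'"(\d{10,})"(?=\^\^<int>)')

def shorten(match):
    """Return the quoted number cut to its first 3 digits if it exceeds INT_MAX."""
    digits = match.group(1)
    if int(digits) > INT_MAX:
        return '"' + digits[:3] + '"'
    return match.group(0)

sample = (
    '<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".\n'
    '"Ask a Question" <at> "25500000000"^^<int> <stack_overflow> .\n'
    '<basic> "language" "89028899" <html>.\n'
)
print(BIG_INT.sub(shorten, sample))
```

Because nothing after the number is consumed, this variant would also handle several qualifying numbers on one line.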

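If you have a terabyte of data, you probably don't want to do this in memory. Instead, open the file, iterate over it line by line, replace the data, and write the result to another file. (There are undoubtedly ways to speed this up, but they would make the gist of the technique harder to follow.)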

# Given the above
def process_data():
    # Stream line by line so the whole file never has to fit in memory
    with open("path/to/your/file") as data_file, \
            open("path/to/output/file", "w") as output_file:
        for line in data_file:
            fixed_data = COMPILED_REGEX.sub(replace_callback, line)
            output_file.write(fixed_data)
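
One possible speed-up, sketched here under assumptions rather than measured: read and write in binary mode so that Python does not decode each line, and skip the regex entirely for lines that do not contain `^^<int>`. The function name `process_file` and the pattern below are illustrative and not part of the answer above. The sketch assumes an ASCII-compatible encoding such as UTF-8.

```python
import re

INT_MAX = 2147483647
MARKER = b'^^<int>'

# Bytes pattern: a quoted run of 10+ ASCII digits directly followed by ^^<int>.
BIG_INT = re.compile(rb'"([0-9]{10,})"(?=\^\^<int>)')

def shorten(match):
    """Cut the quoted number to its first 3 digits if it exceeds INT_MAX."""
    digits = match.group(1)
    if int(digits) > INT_MAX:
        return b'"' + digits[:3] + b'"'
    return match.group(0)

def process_file(src_path, dst_path):
    """Stream src_path to dst_path, rewriting oversized ^^<int> literals."""
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        for line in src:
            # Cheap substring test first; most lines may never need the regex.
            if MARKER in line:
                line = BIG_INT.sub(shorten, line)
            dst.write(line)
```

Whether this is actually faster depends on the data and the disk, so it would be worth timing on a small slice of the file first.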
1

If every line in the text file looks like your example, then you can do this:

In [2078]: line = '"QuestionAndAnsweringWebsite" "fact". "Ask a Question" "25500000000"^^ . "language" "89028899"' 

In [2079]: re.findall(r'\d+"\^\^', line) 
Out[2079]: ['25500000000"^^'] 

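import re

with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    for line in infile:
        for found in re.findall(r'\d+"\^\^', line):
            # found looks like 25500000000"^^, so found[:-3] is the digits alone
            if int(found[:-3]) > 2147483647:
                # Keep the first 3 digits and restore the trailing "^^
                line = line.replace(found, found[:3] + found[-3:])
        outfile.write(line)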

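Because of the inner for loop, this is potentially an inefficient solution. However, I can't think of a better regex at the moment, so this should at least get you started.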
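
One way to drop the inner loop, offered as a sketch rather than a benchmarked improvement: pass a function to `re.sub` so that each match is checked and rewritten in a single pass over the line. It reuses the `\d+"\^\^` idea from above, split into two groups. `PATTERN`, `truncate`, and `fix_line` are hypothetical names.

```python
import re

# Group 1: the digits; group 2: the closing quote plus the two carets.
PATTERN = re.compile(r'(\d+)("\^\^)')

def truncate(match):
    """Keep only the first 3 digits when the number exceeds 2147483647."""
    digits, suffix = match.groups()
    if int(digits) > 2147483647:
        return digits[:3] + suffix
    return match.group(0)

def fix_line(line):
    # A single sub call handles every match in the line.
    return PATTERN.sub(truncate, line)

print(fix_line('"Ask a Question" <at> "25500000000"^^<int> <stack_overflow> .'))
```

Inside the file loop, `outfile.write(fix_line(line))` would then replace the `findall` loop.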
