How to replace a specific pattern in Python

I want to replace every integer that is greater than 2147483647 and is followed by ^^<int> with the first 3 digits of that number. For example, my original data is:

<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact". 
"Ask a Question" <at> "25500000000"^^<int> <stack_overflow> . 
<basic> "language" "89028899" <html>.

I want the original data to be replaced with the following:

<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact". 
"Ask a Question" <at> "255"^^<int> <stack_overflow> . 
<basic> "language" "89028899" <html>.

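The way I have implemented this so far is by scanning the data line by line. If I find a number greater than 2147483647, I replace it with its first 3 digits. However, I don't know how to check whether the next part of the string is ^^<int>.

What I want to do is: for numbers greater than 2147483647, e.g. 25500000000, replace them with the first three digits of the number. Since my data is 1 TB in size, a faster solution would be much appreciated.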

Source

2013-07-27 Jannat Arora

Use the re module and build a regular expression:

import re

regex = r"""
(")            # Opening quote, captured in group #1
(\d{10,})      # A run of at least 10 digits, captured in group #2
("\^\^<int>)   # Closing quote, two escaped circumflex characters and the
               # <int> tag, captured in group #3
"""

COMPILED_REGEX = re.compile(regex, re.VERBOSE)

Next, we need to define a callback function (since the compiled pattern's sub method accepts a callback) to handle the replacement:

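def replace_callback(matches):
    full_match = matches.group(0)
    number_text = matches.group(2)
    number_of_interest = int(number_text, base=10)
    if number_of_interest > 2147483647:
        # Splice the first three digits in place of group #2, using
        # positions relative to the start of the match
        start = matches.start(2) - matches.start(0)
        end = matches.end(2) - matches.start(0)
        return full_match[:start] + number_text[:3] + full_match[end:]
    else:
        return full_match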

Then find and replace:

fixed_data = COMPILED_REGEX.sub(replace_callback, YOUR_DATA)
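
As a quick check of the callback approach against the sample data from the question, here is a compact, self-contained sketch. The names `INT_LITERAL`, `INT32_MAX` and `shorten` are made up for this illustration, and the pattern assumes every literal of interest looks exactly like `"digits"^^<int>`. It uses a lookahead so that only the quoted number is replaced:

```python
import re

# Illustrative names, not part of the answer above.
INT32_MAX = 2147483647
# A quoted run of 10+ digits, matched only when ^^<int> follows immediately.
INT_LITERAL = re.compile(r'"(\d{10,})"(?=\^\^<int>)')

def shorten(match):
    """Return the quoted number cut to three digits if it exceeds INT32_MAX."""
    digits = match.group(1)
    if int(digits) > INT32_MAX:
        return '"' + digits[:3] + '"'
    return match.group(0)  # leave smaller numbers untouched

sample = (
    '<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".\n'
    '"Ask a Question" <at> "25500000000"^^<int> <stack_overflow> .\n'
    '<basic> "language" "89028899" <html>.\n'
)

fixed = INT_LITERAL.sub(shorten, sample)
print(fixed)
```

Numbers that are not followed by `^^<int>`, such as "89028899" in the last line, are never looked at by the callback.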

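If you have a terabyte of data, you probably don't want to do this in memory. Instead, open the file, iterate over it, replace the data line by line, and write the result to another file (there are undoubtedly ways to speed this up, but they would make the gist of the technique harder to follow):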

# Given the above
def process_data():
    with open("path/to/your/file") as data_file, \
            open("path/to/output/file", "w") as output_file:
        for line in data_file:
            fixed_data = COMPILED_REGEX.sub(replace_callback, line)
            output_file.write(fixed_data)
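
Since the question asks for speed on 1 TB of input, one possible variation is to work on raw bytes (no text decoding) and to skip the regex entirely for lines that do not contain `^^<int>`. This is only a sketch and has not been benchmarked; `process_file`, `MARKER` and `BIG_INT` are hypothetical names, and the demo at the bottom writes a tiny temporary file just to show the behaviour:

```python
import os
import re
import tempfile

INT32_MAX = 2147483647
MARKER = b'^^<int>'
# Bytes pattern: a quoted run of 10+ digits, only when ^^<int> follows.
BIG_INT = re.compile(rb'"(\d{10,})"(?=\^\^<int>)')

def _shorten(match):
    digits = match.group(1)
    if int(digits) > INT32_MAX:       # int() accepts ASCII digit bytes
        return b'"' + digits[:3] + b'"'
    return match.group(0)

def process_file(src_path, dst_path):
    """Stream src_path to dst_path, rewriting oversized ^^<int> literals."""
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        for line in src:
            if MARKER in line:        # cheap substring test before the regex
                line = BIG_INT.sub(_shorten, line)
            dst.write(line)

# Small demonstration on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    src_name = os.path.join(tmp, 'in.txt')
    dst_name = os.path.join(tmp, 'out.txt')
    with open(src_name, 'wb') as f:
        f.write(b'"Ask a Question" <at> "25500000000"^^<int> <stack_overflow> .\n'
                b'<basic> "language" "89028899" <html>.\n')
    process_file(src_name, dst_name)
    with open(dst_name, 'rb') as f:
        result = f.read()

print(result.decode('ascii'))
```

Whether this is actually faster depends on the data and the machine, so it is worth timing on a sample before committing to it.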

Source

2013-07-27 01:26:02

If every line in the text file looks like your example, then you could do this:

In [2078]: line = '"QuestionAndAnsweringWebsite" "fact". "Ask a Question" "25500000000"^^ . "language" "89028899"' 

In [2079]: re.findall(r'\d+"\^\^', line)
Out[2079]: ['25500000000"^^'] 

import re

with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    for line in infile:
        for found in re.findall(r'\d+"\^\^', line):
            if int(found[:-3]) > 2147483647:
                # Keep the first three digits plus the trailing "^^
                line = line.replace(found, found[:3] + found[-3:])
        outfile.write(line)

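Because of the inner for loop, this has the potential to be an inefficient solution. However, I can't think of a better regex at the moment, so this should at least get you started.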
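
One way the inner loop could be avoided is a single `re.sub` call per line with a replacement function. The sketch below also skips the `int()` conversion by comparing digit strings: assuming the numbers have no leading zeros, a string longer than 10 digits is always larger than 2147483647, and two 10-digit strings order the same way as the numbers they spell. `LIMIT`, `PATTERN`, `exceeds_int32` and `shorten_line` are illustrative names only:

```python
import re

LIMIT = "2147483647"
# A quoted run of 10+ digits, only when ^^<int> follows immediately.
PATTERN = re.compile(r'"(\d{10,})"(?=\^\^<int>)')

def exceeds_int32(digits):
    """Compare as strings; assumes digits has no leading zeros."""
    return len(digits) > len(LIMIT) or (len(digits) == len(LIMIT) and digits > LIMIT)

def _replace(match):
    digits = match.group(1)
    if exceeds_int32(digits):
        return '"' + digits[:3] + '"'
    return match.group(0)

def shorten_line(line):
    """Rewrite every oversized "digits"^^<int> literal in one pass."""
    return PATTERN.sub(_replace, line)

print(shorten_line('"Ask a Question" <at> "25500000000"^^<int> <stack_overflow> .'))
```

If the data can contain leading zeros, falling back to `int(digits) > 2147483647` is the safer comparison.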

Source

2013-07-27 01:06:25 inspectorG4dget
