2017-04-25 95 views
5

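I want to split a string just before each whole two-digit number that is surrounded by spaces. Ultimately I want to do this in Python, but I have been trying with sed and can't figure it out. How can I split this string?

My test data looks like this: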

13 13 13 13 13 9:07.18 9:12.09 9:15.65 
14 14 14 2:04.86 2:05.99 2:06.87 14 4:21.51 4:23.51 4:25.00 14 8:56.28 9:01.09 9:04.58 
15 15 57.18 57.61 57.95 15 2:02.61 2:03.72 2:04.58 15 4:17.31 4:19.28 4:20.75 15 8:47.15 8:51.87 8:55.30 
16 16 56.34 56.76 57.09 16 2:00.69 2:01.78 2:02.63 16 4:13.75 4:15.69 4:17.14 16 8:39.71 8:44.37 8:47.75 
17 25.69 25.85 25.99 17 55.62 56.03 56.36 17 1:59.07 2:00.15 2:00.99 17 4:10.76 4:12.69 4:14.11 17 8:33.73 8:38.34 8:41.68 
18 25.43 25.59 25.73 18 55.01 55.42 55.74 18 1:57.74 1:58.81 1:59.63 18 4:08.34 4:10.24 4:11.66 18 8:33.73 8:37.04 
19 25.20 25.36 25.49 19 54.50 54.91 55.23 19 1:57.74 1:58.56 19 4:08.34 4:09.74 19 8:33.73 

And I want it split like this (note the positions of the commas ','):

13, 13, 13, 13, 13 9:07.18 9:12.09 9:15.65 
14, 14, 14 2:04.86 2:05.99 2:06.87, 14 4:21.51 4:23.51 4:25.00, 14 8:56.28 9:01.09 9:04.58 
15, 15 57.18 57.61 57.95, 15 2:02.61 2:03.72 2:04.58, 15 4:17.31 4:19.28 4:20.75, 15 8:47.15 8:51.87 8:55.30 
16, 16 56.34 56.76 57.09, 16 2:00.69 2:01.78 2:02.63, 16 4:13.75 4:15.69 4:17.14, 16 8:39.71 8:44.37 8:47.75 
17 25.69 25.85 25.99, 17 55.62 56.03 56.36, 17 1:59.07 2:00.15 2:00.99, 17 4:10.76 4:12.69 4:14.11, 17 8:33.73 8:38.34 8:41.68 
18 25.43 25.59 25.73, 18 55.01 55.42 55.74, 18 1:57.74 1:58.81 1:59.63, 18 4:08.34 4:10.24 4:11.66, 18 8:33.73 8:37.04 
19 25.20 25.36 25.49, 19 54.50 54.91 55.23, 19 1:57.74 1:58.56, 19 4:08.34 4:09.74, 19 8:33.73 

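The data above is fairly regular, since the two-digit integers all fall in the range [13,19], but the range I should expect is [10,99].

Can someone suggest a way to perform the transformation above? I have been working at it with regular expressions for a while, but I can't cover all the cases.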

+0

What is the structure of your data? If your data is a 'string', then 'mydata = mydata.split(' ')' – GiantsLoveDeathMetal

+1

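@GiantsLoveDeathMetal A simple split won't do what the OP wants. Look at the desired output for the first line: it looks like the time components need to stay in the same "element" as the preceding integer. – blacksite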

+0

@not_a_robot Yeah, tricky – GiantsLoveDeathMetal

Answers

7

A lookahead assertion (?=...) can solve this:

>>> a = """13 13 13 13 13 9:07.18 9:12.09 9:15.65 
14 14 14 2:04.86 2:05.99 2:06.87 14 4:21.51 4:23.51 4:25.00 14 8:56.28 9:01.09 9:04.58 
15 15 57.18 57.61 57.95 15 2:02.61 2:03.72 2:04.58 15 4:17.31 4:19.28 4:20.75 15 8:47.15 8:51.87 8:55.30 
16 16 56.34 56.76 57.09 16 2:00.69 2:01.78 2:02.63 16 4:13.75 4:15.69 4:17.14 16 8:39.71 8:44.37 8:47.75 
17 25.69 25.85 25.99 17 55.62 56.03 56.36 17 1:59.07 2:00.15 2:00.99 17 4:10.76 4:12.69 4:14.11 17 8:33.73 8:38.34 8:41.68 
18 25.43 25.59 25.73 18 55.01 55.42 55.74 18 1:57.74 1:58.81 1:59.63 18 4:08.34 4:10.24 4:11.66 18 8:33.73 8:37.04 
19 25.20 25.36 25.49 19 54.50 54.91 55.23 19 1:57.74 1:58.56 19 4:08.34 4:09.74 19 8:33.73""" 

>>> import re
>>> print(re.sub(r"(\d{2}) (?=\d{2}( |$))", r"\g<1>, ", a))
13, 13, 13, 13, 13 9:07.18 9:12.09 9:15.65 
14, 14, 14 2:04.86 2:05.99 2:06.87, 14 4:21.51 4:23.51 4:25.00, 14 8:56.28 9:01.09 9:04.58 
15, 15 57.18 57.61 57.95, 15 2:02.61 2:03.72 2:04.58, 15 4:17.31 4:19.28 4:20.75, 15 8:47.15 8:51.87 8:55.30 
16, 16 56.34 56.76 57.09, 16 2:00.69 2:01.78 2:02.63, 16 4:13.75 4:15.69 4:17.14, 16 8:39.71 8:44.37 8:47.75 
17 25.69 25.85 25.99, 17 55.62 56.03 56.36, 17 1:59.07 2:00.15 2:00.99, 17 4:10.76 4:12.69 4:14.11, 17 8:33.73 8:38.34 8:41.68 
18 25.43 25.59 25.73, 18 55.01 55.42 55.74, 18 1:57.74 1:58.81 1:59.63, 18 4:08.34 4:10.24 4:11.66, 18 8:33.73 8:37.04 
19 25.20 25.36 25.49, 19 54.50 54.91 55.23, 19 1:57.74 1:58.56, 19 4:08.34 4:09.74, 19 8:33.73 

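So the regular expression you need is (\d{2}) (?=\d{2}( |$)), which means:

  1. (\d{2}) followed by a space => stores the two digits in group 1 and matches the space after them.
  2. (?=\d{2}( |$)) => matches two digits followed by a space or the end of the string, but does not consume them.

The key point is that, because the lookahead does not consume those two digits, they are still available when the substitution looks for its next match. Finally, \g<1>, followed by a space replaces the match with the same two digits plus a comma and a space. Note that without re.MULTILINE, $ matches only at the end of the whole string, not at the end of every line.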
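To make the "does not consume" point concrete, here is a minimal sketch that compares the lookahead pattern with a variant that consumes the following digits. The variable names and the shortened sample line are made up for illustration; they are not part of the answer.

```python
import re

# Shortened sample line, loosely based on the "14" row of the question.
line = "14 14 14 2:04.86 2:05.99 14 4:21.51"

# Lookahead version: the next two digits are checked but not consumed,
# so they can serve as the start of the following match.
with_lookahead = re.sub(r"(\d{2}) (?=\d{2}( |$))", r"\g<1>, ", line)

# Hypothetical consuming version, for comparison only: the next two digits
# become part of the match, so they cannot start a new match.
consuming = re.sub(r"(\d{2}) (\d{2})( |$)", r"\g<1>, \g<2>\g<3>", line)

print(with_lookahead)  # 14, 14, 14 2:04.86 2:05.99, 14 4:21.51
print(consuming)       # 14, 14 14 2:04.86 2:05.99, 14 4:21.51
```

The consuming variant misses the comma between the second and third 14, because the second 14 was already swallowed by the first match.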

+2

Please explain your regex :) – GiantsLoveDeathMetal

+0

@GiantsLoveDeathMetal Let me know if you need any further explanation :) – VMRuiz

+0

That's a good one – GiantsLoveDeathMetal

0

For the fun of sed, and because you are interested in understanding sed:

sed ":a;s/\([^,]\)\(\s[0-9]\{2\}\s\)/\1,\2/;ta" 

sed -E ":a;s/([^,])(\s[0-9]{2}\s)/\1,\2/;ta" 
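  • start of the loop
    • look for
      • something other than a comma (this matters for the looping later)
      • a whitespace character, two digits and a whitespace character
    • replace it with the non-comma character, a comma, and the rest
  • loop again if the substitution changed something

Output (exactly the desired output):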

13, 13, 13, 13, 13 9:07.18 9:12.09 9:15.65 
14, 14, 14 2:04.86 2:05.99 2:06.87, 14 4:21.51 4:23.51 4:25.00, 14 8:56.28 9:01.09 9:04.58 
15, 15 57.18 57.61 57.95, 15 2:02.61 2:03.72 2:04.58, 15 4:17.31 4:19.28 4:20.75, 15 8:47.15 8:51.87 8:55.30 
16, 16 56.34 56.76 57.09, 16 2:00.69 2:01.78 2:02.63, 16 4:13.75 4:15.69 4:17.14, 16 8:39.71 8:44.37 8:47.75 
17 25.69 25.85 25.99, 17 55.62 56.03 56.36, 17 1:59.07 2:00.15 2:00.99, 17 4:10.76 4:12.69 4:14.11, 17 8:33.73 8:38.34 8:41.68 
18 25.43 25.59 25.73, 18 55.01 55.42 55.74, 18 1:57.74 1:58.81 1:59.63, 18 4:08.34 4:10.24 4:11.66, 18 8:33.73 8:37.04 
19 25.20 25.36 25.49, 19 54.50 54.91 55.23, 19 1:57.74 1:58.56, 19 4:08.34 4:09.74, 19 8:33.73 
+0

Very nice. Just wondering, could you use extended regex to get rid of some of the backslashes? – wjandrea

+0

@wjandrea Yes, see the edit. – Yunnosch

0

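Adding to VMRuiz's answer, this produces a list for each line instead of one big string. I had to change the regex to use re.split instead of re.sub, and I'm not sure it is equivalent.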

import re

for line in a.split('\n'):
    # Split on a space that follows two digits and precedes a two-digit integer
    print(re.split(r'(?<=\d{2}) (?=\d{2}(?: |$))', line))

Edit: this is definitely equivalent, but a bit awkward:

for line in re.sub(r'(\d{2}) (?=\d{2}( |$))', r'\g<1>,', a).split('\n'):
    print(line.split(','))
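One way to settle whether the two variants agree is to compare them directly. The sketch below assumes both patterns require the two digits to be followed by a space or the end of the string, and that the rows carry no trailing whitespace. The helper names `split_row` and `sub_row` are hypothetical, and the rows are shortened copies of rows from the question.

```python
import re

rows = [
    "13 13 13 13 13 9:07.18 9:12.09 9:15.65",
    "17 25.69 25.85 25.99 17 55.62 56.03 56.36",
]

def split_row(row):
    # Zero-width lookbehind and lookahead, so only the space itself is removed.
    return re.split(r"(?<=\d{2}) (?=\d{2}(?: |$))", row)

def sub_row(row):
    # Replace the separating space with a comma, then split on the comma.
    return re.sub(r"(\d{2}) (?=\d{2}( |$))", r"\g<1>,", row).split(",")

for row in rows:
    assert split_row(row) == sub_row(row)
    print(split_row(row))
```

Note the non-capturing group `(?: |$)` in the split pattern: `re.split` includes the text of capturing groups in its result, which would add unwanted elements to the lists.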
0

If you want a non-regex Python solution, you can do this:

s = """\ 
13 13 13 13 13 9:07.18 9:12.09 9:15.65 
14 14 14 2:04.86 2:05.99 2:06.87 14 4:21.51 4:23.51 4:25.00 14 8:56.28 9:01.09 9:04.58 
15 15 57.18 57.61 57.95 15 2:02.61 2:03.72 2:04.58 15 4:17.31 4:19.28 4:20.75 15 8:47.15 8:51.87 8:55.30 
16 16 56.34 56.76 57.09 16 2:00.69 2:01.78 2:02.63 16 4:13.75 4:15.69 4:17.14 16 8:39.71 8:44.37 8:47.75 
17 25.69 25.85 25.99 17 55.62 56.03 56.36 17 1:59.07 2:00.15 2:00.99 17 4:10.76 4:12.69 4:14.11 17 8:33.73 8:38.34 8:41.68 
18 25.43 25.59 25.73 18 55.01 55.42 55.74 18 1:57.74 1:58.81 1:59.63 18 4:08.34 4:10.24 4:11.66 18 8:33.73 8:37.04 
19 25.20 25.36 25.49 19 54.50 54.91 55.23 19 1:57.74 1:58.56 19 4:08.34 4:09.74 19 8:33.73""" 


res = ""
for line in s.splitlines():
    buf = line.split()
    # Append ", " to a token when the next token is a bare integer, otherwise " "
    for i, e in enumerate(buf[1:], 1):
        buf[i-1] += ", " if e.isdigit() else " "
    res += ''.join(buf) + "\n"

>>> print(res)
13, 13, 13, 13, 13 9:07.18 9:12.09 9:15.65 
14, 14, 14 2:04.86 2:05.99 2:06.87, 14 4:21.51 4:23.51 4:25.00, 14 8:56.28 9:01.09 9:04.58 
15, 15 57.18 57.61 57.95, 15 2:02.61 2:03.72 2:04.58, 15 4:17.31 4:19.28 4:20.75, 15 8:47.15 8:51.87 8:55.30 
16, 16 56.34 56.76 57.09, 16 2:00.69 2:01.78 2:02.63, 16 4:13.75 4:15.69 4:17.14, 16 8:39.71 8:44.37 8:47.75 
17 25.69 25.85 25.99, 17 55.62 56.03 56.36, 17 1:59.07 2:00.15 2:00.99, 17 4:10.76 4:12.69 4:14.11, 17 8:33.73 8:38.34 8:41.68 
18 25.43 25.59 25.73, 18 55.01 55.42 55.74, 18 1:57.74 1:58.81 1:59.63, 18 4:08.34 4:10.24 4:11.66, 18 8:33.73 8:37.04 
19 25.20 25.36 25.49, 19 54.50 54.91 55.23, 19 1:57.74 1:58.56, 19 4:08.34 4:09.74, 19 8:33.73 
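The `isdigit()` test above accepts integers of any length, while the question expects values in [10,99] only. A stricter variant might look like the following sketch; the helper names `is_two_digit_int` and `add_commas` are hypothetical and not part of the answer.

```python
def is_two_digit_int(token):
    """True for tokens "10" through "99" (two digits, no leading zero)."""
    return len(token) == 2 and token.isdigit() and token[0] != "0"

def add_commas(line):
    tokens = line.split()
    out = tokens[:1]
    for tok in tokens[1:]:
        # Start a new comma-separated group before every two-digit integer.
        out.append((", " if is_two_digit_int(tok) else " ") + tok)
    return "".join(out)

print(add_commas("17 25.69 25.85 25.99 17 55.62 56.03 56.36"))
# 17 25.69 25.85 25.99, 17 55.62 56.03 56.36
```

Because the line is tokenized with `split()`, leading and trailing whitespace is dropped and runs of spaces collapse to one.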

With awk you can do this:

awk '{n=split($0,a) 
     for (i=2;i<=n;i++) 
      printf "%s%s", a[i-1], a[i]~/^[[:digit:]]+$/ ? ", " : " " 
     print a[n] 
    }' file 
13, 13, 13, 13, 13 9:07.18 9:12.09 9:15.65 
14, 14, 14 2:04.86 2:05.99 2:06.87, 14 4:21.51 4:23.51 4:25.00, 14 8:56.28 9:01.09 9:04.58 
15, 15 57.18 57.61 57.95, 15 2:02.61 2:03.72 2:04.58, 15 4:17.31 4:19.28 4:20.75, 15 8:47.15 8:51.87 8:55.30 
16, 16 56.34 56.76 57.09, 16 2:00.69 2:01.78 2:02.63, 16 4:13.75 4:15.69 4:17.14, 16 8:39.71 8:44.37 8:47.75 
17 25.69 25.85 25.99, 17 55.62 56.03 56.36, 17 1:59.07 2:00.15 2:00.99, 17 4:10.76 4:12.69 4:14.11, 17 8:33.73 8:38.34 8:41.68 
18 25.43 25.59 25.73, 18 55.01 55.42 55.74, 18 1:57.74 1:58.81 1:59.63, 18 4:08.34 4:10.24 4:11.66, 18 8:33.73 8:37.04 
19 25.20 25.36 25.49, 19 54.50 54.91 55.23, 19 1:57.74 1:58.56, 19 4:08.34 4:09.74, 19 8:33.73