Matching more than 2 spaces with PyParsing

I have strings like the following:

date    Not Important       value NotImportant2 
11.11.13   useless . useless,21 useless 2  14.21 asmdakldm 
21.12.12   fmpaosmfpoamsp 4      41  ajfa9si90

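I only want to extract the date and the value near the end.

If I use the standard approach for matching multiple words, pyparsing matches the last "not important" column as the "value".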

anything = pp.Forward()
anything << anyword + (value | anything)
myParser = date + anything

I think the best approach would be to force pyparsing to match at least 2 spaces, but I really don't know how. Any suggestions?

Source

2013-07-29 rodi

Description

To match 2 or more spaces, you can use \s{2,}
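As a small illustration of \s{2,} on its own, the sketch below splits one row on runs of two or more whitespace characters using Python's re module. The row and its spacing are assumptions made for the example (columns at least two spaces apart, single spaces only inside a column).

```python
import re

# Hypothetical row: columns are separated by two or more spaces,
# while words inside a column are separated by a single space.
row = "11.11.13   useless . useless,21 useless 2   14.21   asmdakldm"

# Split only where at least two whitespace characters occur in a row.
columns = re.split(r"\s{2,}", row.strip())

date, value = columns[0], columns[-2]
print(columns)
print(date, value)
```

This only works while no single column contains a double space, so treat it as a quick check rather than a full parser.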

The following expression:

captures the date field
captures the second-to-last field

^(\d{2}\.\d{2}\.\d{2})[^\r\n]*\s(\S+)\s{2,}\S+\s*(?:[\r\n]|\Z)

Example

Sample text

date    Not Important       value NotImportant2 
11.11.13   useless . useless,21 useless 2  14.21 asmdakldm 
21.12.12   fmpaosmfpoamsp 4      41  ajfa9si90

Matches

[0][0] = 11.11.13   useless . useless,21 useless 2  14.21 asmdakldm
[0][1] = 11.11.13
[0][2] = 14.21

[1][0] = 21.12.12   fmpaosmfpoamsp 4      41  ajfa9si90
[1][1] = 21.12.12
[1][2] = 41
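
For anyone who wants to try the expression from Python, here is a minimal sketch using re with the MULTILINE flag. The sample rows are retyped with assumed spacing (two or more spaces between the value and the last column), and ROW_PATTERN is just a name picked for this example.

```python
import re

# The expression from this answer; MULTILINE lets ^ match at every line start.
ROW_PATTERN = re.compile(
    r"^(\d{2}\.\d{2}\.\d{2})[^\r\n]*\s(\S+)\s{2,}\S+\s*(?:[\r\n]|\Z)",
    re.MULTILINE,
)

# Assumed spacing: the value and the last column are at least two spaces apart.
sample = "\n".join([
    "date       Not Important                   value    NotImportant2",
    "11.11.13   useless . useless,21 useless 2  14.21    asmdakldm",
    "21.12.12   fmpaosmfpoamsp 4                41       ajfa9si90",
])

# Each match yields (date, second-to-last field); the header row is skipped
# because it does not start with a date.
rows = ROW_PATTERN.findall(sample)
print(rows)  # [('11.11.13', '14.21'), ('21.12.12', '41')]
```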

Source

2013-07-29 12:15:03

This sample text is columnar, so pyparsing is a bit of overkill here. You could just write:

# `sample` is the list of input lines (defined in the parser example further down)
fieldslices = [slice(0, 8),        # dateslice
               slice(58, 58 + 8),  # valueslice
               ]

for line in sample:
    date, value = (line[x] for x in fieldslices)
    print(date, value.strip())

and get:

date  value 
11.11.13 14.21 
21.12.12 41

But since you specifically asked for a pyparsing solution, for something as columnar as this you can use the GoToColumn class:

from pyparsing import * 

dateExpr = Regex(r'(\d\d\.){2}\d\d').setName("date") 
realNum = Regex(r'\d+\.\d*').setName("real").setParseAction(lambda t:float(t[0])) 
intNum = Regex(r'\d+').setName("integer").setParseAction(lambda t:int(t[0])) 
valueExpr = realNum | intNum 

patt = dateExpr("date") + GoToColumn(59) + valueExpr("value")

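GoToColumn is similar to SkipTo, but instead of advancing to the next instance of an expression, it advances to a specific column number (column numbers are 1-based, not 0-based like string slices).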
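
To make the 1-based versus 0-based point concrete, here is a small sketch that pads rows to fixed widths so the value starts at string index 58, which is column 59. The make_row helper and the widths 11, 47 and 8 are assumptions made for this illustration; value_expr is a simplified stand-in for the value expression.

```python
from pyparsing import GoToColumn, Regex

def make_row(date, filler, value, trailer):
    # Hypothetical layout: widths 11 + 47 put the value at index 58 (column 59).
    return date.ljust(11) + filler.ljust(47) + value.ljust(8) + trailer

rows = [
    make_row("11.11.13", "useless . useless,21 useless 2", "14.21", "asmdakldm"),
    make_row("21.12.12", "fmpaosmfpoamsp 4", "41", "ajfa9si90"),
]

date_expr = Regex(r"(\d\d\.){2}\d\d")
value_expr = Regex(r"\d+(\.\d*)?")  # simplified: keeps the value as a string

# GoToColumn counts columns from 1; string slices count from 0.
patt = date_expr("date") + GoToColumn(59) + value_expr("value")

for row in rows:
    result = patt.parseString(row)
    # The slice starting at index 58 and GoToColumn(59) land on the same field.
    print(result["date"], result["value"], row[58:66].strip())
```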

Now here is the parser applied to your sample text:

# Normally, input would be from some text file 
# infile = open(sourcefile) 
# but for this example, create iterator from the sample 
# text instead 
sample = """\
date    Not Important       value NotImportant2 
11.11.13   useless . useless,21 useless 2  14.21 asmdakldm 
21.12.12   fmpaosmfpoamsp 4      41  ajfa9si90 
""".splitlines() 

infile = iter(sample) 

# skip header line 
next(infile) 

for line in infile:
    result = patt.parseString(line)
    print(result.dump())
    print()

This prints:

['11.11.13', 'useless . useless,21 useless 2  ', 14.210000000000001] 
- date: 11.11.13 
- value: 14.21 

['21.12.12', 'fmpaosmfpoamsp 4      ', 41] 
- date: 21.12.12 
- value: 41

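Note that the values have already been converted from strings to int or float types; you can do the same yourself by writing a parse action that converts your dates to Python datetimes. Also note how the associated results names are defined; these let you access the individual fields by name, as in print(result.date).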
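
A parse action for the date could look roughly like the sketch below. The to_date helper is a name made up for this example, and the day.month.year reading of the fields is an assumption, since the sample data does not say which order is meant.

```python
from datetime import datetime
from pyparsing import Regex

def to_date(tokens):
    # Assumption: fields are day.month.two-digit-year, e.g. 21.12.12.
    return datetime.strptime(tokens[0], "%d.%m.%y").date()

date_expr = Regex(r"(\d\d\.){2}\d\d").setName("date").setParseAction(to_date)

# Attach a results name so the converted value can be read back by name.
result = date_expr("date").parseString("21.12.12")
print(result.date)       # a datetime.date rather than a string
print(result.date.year)
```

If the fields are actually year.month.day, only the format string passed to strptime needs to change.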

I also noticed that, to get a sequence of one or more elements, you assume you need this construct:

anything = pp.Forward() 
anything << anyword + (value | anything)

While this does work, it creates a recursive expression that is expensive at run time. pyparsing provides an iterative equivalent, OneOrMore:

anything = OneOrMore(anyword)

Or, if you prefer the newer '*' operator form:

anything = anyword*(1,)
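
A quick way to convince yourself that the two spellings behave the same is sketched below. Here anyword is assumed to be a plain run of letters, which may differ from the expression in the question.

```python
from pyparsing import OneOrMore, Word, alphas

# Stand-in for the question's `anyword`; assumed to be a run of letters.
anyword = Word(alphas)

iterative = OneOrMore(anyword)
star_form = anyword * (1,)  # "one or more", written with the * operator

text = "useless words before the value"
print(iterative.parseString(text).asList())
print(star_form.parseString(text).asList())
```

Both calls return the same list of five words, with no recursion involved.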

Please have a look through the pyparsing API documentation, which is included in pyparsing's source distribution, or online at http://packages.python.org/pyparsing/.

Welcome to Pyparsing!

Source

2013-07-30 07:25:04 PaulMcG
