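Pyparsing: extract variable length, variable content, variable whitespace substring
I need to extract Gleason scores from a flat file of prostatectomy final diagnostic write-ups. These scores always contain the word Gleason and two numbers that add up to a third number. Humans typed these in over more than two decades, so the records include various conventions for whitespace and modifiers. Below are my Backus-Naur form so far and two example records. For prostatectomies alone, we are looking at upwards of a thousand cases.
I am using pyparsing because I am learning Python, and I have no fond memories of my very limited exposure to writing regular expressions.
My question: how can I pluck out these Gleason scores without parsing every other optional piece of data that may or may not appear in these final diagnoses?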
num = Word(nums)
record ::= accessionDate + accessionNumber + patMedicalRecordNum + finalDxText
accessionDate ::= num + "/" + num + "/" + num
accessionNumber ::= "S" + num + "-" + num
patMedicalRecordNum ::= num + "/" + num + "-" + num + "-" + num
finalDxText ::= listOfParts + optionalComment + optionalpTNMStage
listOfParts ::= OneOrMore(part)
part ::= <multiline idiosyncratic freetext which may contain a Gleason score I want> + optionalpTNMStage
optionalComment ::= <multiline idiosyncratic freetext which may contain a Gleason score I don't want>
optionalpTNMStage ::= <multiline idiosyncratic freetext which may contain a Gleason score I don't want>
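The three header rules are the only fully regular part of this grammar, so they translate into pyparsing almost word for word. What follows is a minimal sketch rather than a finished parser: `recordHeader` and the result names `date`, `accession`, and `mrn` are labels I made up for illustration, and the free-text rules are deliberately left out. It uses the camelCase method names; newer pyparsing releases also offer snake_case equivalents such as `parse_string`.

```python
from pyparsing import Combine, Word, nums

num = Word(nums)

# Combine glues the matched pieces back into one string and requires
# that they be adjacent (no whitespace inside a date or an ID).
accessionDate = Combine(num + "/" + num + "/" + num)
accessionNumber = Combine("S" + num + "-" + num)
patMedicalRecordNum = Combine(num + "/" + num + "-" + num + "-" + num)

# Hypothetical helper: just the fixed-format start of each record.
recordHeader = (
    accessionDate("date")
    + accessionNumber("accession")
    + patMedicalRecordNum("mrn")
)

line = ("01/01/11 S11-55555 20/444-55-6666 "
        "A. PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:")

# parseString only has to match the start of the line; the trailing
# free text is ignored because parseAll is not requested.
header = recordHeader.parseString(line)
print(header["date"], header["accession"], header["mrn"])
```

Because the header is this regular, it could also serve as the marker for where one record ends and the next begins when splitting the flat file.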
01/01/11 S11-55555 20/444-55-6666 A. PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:
- ADENOCARCINOMA.
TOTAL GLEASON SCORE: GLEASON 5+4=9
TUMOR LOCATION: BILATERAL
TUMOR QUANTITATION: 15% OF PROSTATE INVOLVED BY TUMOR
EXTRAPROSTATIC EXTENSION: PRESENT AT RIGHT POSTERIOR
SEMINAL VESICLE INVASION: PRESENT
MARGINS: UNINVOLVED
LYMPHOVASCULAR INVASION: PRESENT
PERINEURAL INVASION: PRESENT
LYMPH NODES (SPECIMENS B AND C):
NUMBER EXAMINED: 25
NUMBER INVOLVED: 1
DIAMETER OF LARGEST METASTASIS: 1.7 mm
ADDITIONAL FINDINGS: HIGH-GRADE PROSTATIC INTRAEPITHELIAL NEOPLASIA,
ACUTE AND CHRONIC INFLAMMATION, INTRADUCTAL EXTENSION OF INVASIVE
CARCINOMA
PATHOLOGIC STAGE: pT3b N1 MX
B. LYMPH NODES, RIGHT PELVIC, EXCISION:
- ONE OF SEVENTEEN LYMPH NODES POSITIVE FOR METASTASIS (1/17).
C. LYMPH NODES, LEFT PELVIC, EXCISION:
- EIGHT LYMPH NODES NEGATIVE FOR METASTASIS (0/8).
01/02/11 S11-4444 20/111-22-3333 PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:
- ADENOCARCINOMA.
GLEASON SCORE: 3 + 3 = 6 WITH TERTIARY PATTERN OF 5.
TUMOR QUANTITATION: APPROXIMATELY 10% BY VOLUME.
TUMOR LOCATION: BILATERAL.
EXTRAPROSTATIC EXTENSION: NOT IDENTIFIED.
MARGINS: NEGATIVE.
PERINEURAL INVASION: IDENTIFIED.
LYMPH-VASCULAR INVASION: NOT IDENTIFIED.
SEMINAL VESICLE/VASA DEFERENTIA INVASION: NOT IDENTIFIED.
LYMPH NODES: NONE SUBMITTED.
OTHER: HIGH GRADE PROSTATIC INTRAEPITHELIAL NEOPLASIA.
PATHOLOGIC STAGE (pTNM): pT2c NX.
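Full disclosure: I am a physician doing research, and this is my first time using Python for real work. I have read Lutz's Learning Python and Shaw's Learn Python the Hard Way, and I have worked through various problem sets. I have looked through many pyparsing-related questions on this forum and the pyparsing wiki, and I bought and read Mr. McGuire's Getting Started with Pyparsing. Am I asking a question when I am really just standing in the "death spiral of frustration that is so common when you have to write parsers" (McGuire, 17)? I don't know. So far, I am just happy to be working on a real project.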
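Natural language processing is hard! Can you make some simplifying assumptions? (For example, that the score you care about is always the _first_ Gleason score, and that it always appears in the form 'Gleason i + j = k'.) – katrielalex
Yes, those are valid assumptions. – Niels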
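Under those two assumptions, one possible approach is to skip the full grammar and let pyparsing scan each record for just the score pattern. This is a rough sketch, not a tested solution for the whole file: `first_gleason` is a hypothetical helper name, the optional `SCORE` and `:` tokens are guesses based only on the two sample records, and the arithmetic check relies on the stated rule that the two numbers add up to the third.

```python
from pyparsing import CaselessKeyword, Literal, Optional, Suppress, Word, nums

# Convert each matched run of digits to an int as it is parsed.
integer = Word(nums).setParseAction(lambda t: int(t[0]))

# "i + j = k"; pyparsing skips whitespace between tokens by default,
# so "5+4=9" and "3 + 3 = 6" both match.
gleasonSum = (
    integer("primary") + Suppress("+")
    + integer("secondary") + Suppress("=")
    + integer("total")
)

# The word GLEASON, optionally followed by SCORE and/or a colon.
gleason = (
    CaselessKeyword("GLEASON")
    + Optional(CaselessKeyword("SCORE"))
    + Optional(Literal(":"))
    + gleasonSum
)

def first_gleason(record_text):
    """Return (primary, secondary, total) for the first score whose
    arithmetic checks out, or None if the record has no such score."""
    for tokens, start, end in gleason.scanString(record_text):
        if tokens["primary"] + tokens["secondary"] == tokens["total"]:
            return tokens["primary"], tokens["secondary"], tokens["total"]
    return None

rec1 = """A. PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:
- ADENOCARCINOMA.
TOTAL GLEASON SCORE: GLEASON 5+4=9
TUMOR LOCATION: BILATERAL"""

rec2 = """PROSTATE AND SEMINAL VESICLES, PROSTATECTOMY:
- ADENOCARCINOMA.
GLEASON SCORE: 3 + 3 = 6 WITH TERTIARY PATTERN OF 5.
TUMOR QUANTITATION: APPROXIMATELY 10% BY VOLUME."""

print(first_gleason(rec1))  # (5, 4, 9)
print(first_gleason(rec2))  # (3, 3, 6)
```

`scanString` tries the expression at every position and ignores everything that does not match, so none of the other optional fields need a grammar of their own. Records where the function returns None would be the ones worth reviewing by hand for conventions this pattern misses.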