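Regular expression (Python) data extraction - overlapping or incomplete results

I am trying to extract data from some WHO codebooks that I converted from PDF to text with the Python slate library.
I want to match 2 digits, a dash, 2 digits, followed by some text, and ending with "Q" + 1 or 2 digits and again "Q" + 1 or 2 digits: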
17-17How old are you?Q1Q1
31-31During the past 30 days, how many times per day did you usually eat fruit, such as bananas, apples, oranges, dates, or any other fruits?Q7Q11
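Sometimes these entries end with whitespace, and sometimes the next question starts immediately. The following line contains three questions; note Q4Q424-29 and Q5Q530-30: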
20-23How tall are you without your shoes on? (Note: Data are in meters.)Q4Q424-29How much do you weigh without your shoes on? (Note: Data are in kilograms.)Q5Q530-30During the past 30 days, how often did you go hungry because there was not enough food in your home?Q6Q7
With the pattern
\d{2}-\d{2}[a-zA-Z0-9 .()?:,]+Q\d{1,2}Q\d(\d)*?
I am very close, but when the second "Q" has two digits, I miss the second digit.
I tried adding a negative lookahead
\d{2}-\d{2}[a-zA-Z0-9 .()?:,]+Q\d{1,2}Q\d((\d)(?!\d\d-))
to exclude the start of the next two-digits-plus-dash pattern. The variant
\d{2}-\d{2}[a-zA-Z0-9 .()?:,]+Q\d{1,2}Q\d{1,2}
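includes the second digit after the second "Q" but produces overlapping results: for example, in Q4Q424-29 the first string ends with Q4Q42 and the second string would begin with 4-29.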
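Both failure modes can be reproduced with a short script. This is only a sketch: the variable and function names are made up for illustration, and the sample strings are copied from the lines above. It uses `re.finditer` and `group(0)` so that the capturing group in the first pattern does not change what is returned.

```python
import re

# Shared prefix of both attempts: "dd-dd", question text, first "Q" code.
BODY = r"\d{2}-\d{2}[a-zA-Z0-9 .()?:,]+Q\d{1,2}"

# Attempt 1: the lazy (\d)*? sits at the very end of the pattern,
# so it is satisfied with zero extra digits.
LAZY = re.compile(BODY + r"Q\d(\d)*?")

# Attempt 3: the greedy \d{1,2} grabs the first digit of the next range.
GREEDY = re.compile(BODY + r"Q\d{1,2}")

single = ("31-31During the past 30 days, how many times per day did you "
          "usually eat fruit, such as bananas, apples, oranges, dates, "
          "or any other fruits?Q7Q11")

run_on = ("20-23How tall are you without your shoes on? (Note: Data are in "
          "meters.)Q4Q424-29How much do you weigh without your shoes on? "
          "(Note: Data are in kilograms.)Q5Q530-30During the past 30 days, "
          "how often did you go hungry because there was not enough food "
          "in your home?Q6Q7")


def full_matches(pattern, text):
    """Return the complete matched substrings, ignoring capture groups."""
    return [m.group(0) for m in pattern.finditer(text)]


lazy_hits = full_matches(LAZY, single)
greedy_hits = full_matches(GREEDY, run_on)

print(lazy_hits[0][-4:])      # tail of the match: the final "1" is dropped
print(greedy_hits[0][-5:])    # tail of the first match: one digit too many
print(len(greedy_hits))       # the 24-29 question no longer matches at all
```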
The regex, together with part of the original sample text, is here: https://regex101.com/r/d9Dlga/2/
Any suggestions on how to extract the correct strings, like these?
17-17How old are you?Q1Q1
20-23How tall are you without your shoes on? (Note: Data are in meters.)Q4Q4
24-29How much do you weigh without your shoes on? (Note: Data are in kilograms.)Q5Q5
31-31During the past 30 days, how many times per day did you usually eat fruit, such as bananas, apples, oranges, dates, or any other fruits?Q7Q11
Thanks!
Your lookahead-based pattern is close, but you need to check only the single following digit and make the whole group optional, like 'Q\d((\d)(?!\d-))?'
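A minimal sketch of that suggestion applied to the sample lines from the question. The names are illustrative, and the group is written as non-capturing, `(?:...)`, which is a small deviation from the comment made only to keep the pattern free of capture groups. The lookahead assumes that every record starts with a two-digit column number followed by a dash, as in the samples; other layouts would need a different check.

```python
import re

# Second "Q" code: one digit, plus an optional second digit that is only
# accepted when it is NOT followed by "<digit>-", i.e. when it is not the
# first digit of the next "dd-dd" column range.
PATTERN = re.compile(
    r"\d{2}-\d{2}[a-zA-Z0-9 .()?:,]+Q\d{1,2}Q\d(?:\d(?!\d-))?"
)

text = "\n".join([
    "17-17How old are you?Q1Q1",
    "31-31During the past 30 days, how many times per day did you usually "
    "eat fruit, such as bananas, apples, oranges, dates, or any other "
    "fruits?Q7Q11",
    "20-23How tall are you without your shoes on? (Note: Data are in "
    "meters.)Q4Q424-29How much do you weigh without your shoes on? "
    "(Note: Data are in kilograms.)Q5Q530-30During the past 30 days, "
    "how often did you go hungry because there was not enough food in "
    "your home?Q6Q7",
])

records = [m.group(0) for m in PATTERN.finditer(text)]

for record in records:
    print(record)
```

On these samples the run-on line is split into three separate records, and the two-digit code in Q7Q11 is kept whole.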