2017-05-31 144 views

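Separating unstructured sentences from a text corpus

I am working on a project where I have to separate proper sentences out of a text corpus. I tried the NLTK sentence tokenizer, but it seems to mark sentence boundaries based only on the period (".").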

So I was wondering: is there a way to separate tabular data and phrases from the sentences in a text file?

Here is a sample text file. I am referring to the content under the TEXT tag.

<?xml version='1.0' encoding='UTF-8'?> 
<root> 
    <TEXT><![CDATA[ 


Record date: 2078-09-07 





RYBURY HOSPITAL INTERN ADMISSION NOTE 





Name: Goldberg, Joel 

MR #: 0370149 

Date of admission: 9/6/2078 

Resident: Lange/Bailey 

Attending: Schmidt MD 

PCP: Odom, Kacie MD 



CC: L foot pain 



HPI: The patient is a 48 yo gentleman with a hx of DM2, peripheral neuropathy and PVD with multiple admissions for LE cellulitis in the setting of gangrenous toes in the past 5 years, last one in July. He now presents with acute on chronic LLE sweeling that began this morning after he got up walked around his home for about 2-3 hrs and then suddenly felt an acute pain shooting up his leg, with a severity of 10/10, he knew right away this was similar to the pain he had felt before on prior admissions for cellulitis so he called 911. On arrival to the ED his temp was 98.1, 112, 145/79, 20, 99%RA and was started on antibiotic treatment with Unasyn for cellulitis.  



ROS: Per HPI. No F/C/NS. No CP/Palps. No Orthopnea. No SOB/cough/hemoptysis/wheezing/sore throat/. No hematochezia/melena. No delta MS/LOC. No slurring of speech, unilateral weakness. No dysuria. No chills or fevers, no lightheadedness. 



PMH: 

1. DM2 diagnosed in 2075, says peripheral neuropathy was diagnosed around the same time, denies any retinopathy or nephropathy. 

2. Peripheral vascular disease with the following surgeries performed: 

    Right 5th toe amputation 2/2 osteomyelitis 12/14/76 

Right 4th toe amputation 2/2 wet gangrene 9/03/76 

Angioplasty and stenting of the distal LEFT superficial femoral artery 11/6/76 

Angioplasty and stenting of the distal RIGHT superficial femoral artery 7/20/76 

I&D of right thigh abscess 4/75 



Medications on admission (confirmed with patient): 

1. Glyburide 2.5mg BID 

2. Glucopahge 500mg QD 

3. Zestril 2.5mg QD 

4. Percocet PRN 



ALL: Codeine upsets his stomach 

SH: Lives in Arroyo Grande apartment with friend, works occasionally as a copy editor but unemployed right now, has smoked 1/2ppd for 35 years, no ETOH, no drugs. Adequate diet. 

FH: Many family members with DM.  



Physical Exam: 

V: 98.5, 149/84, 98, 18, 99%RA 

Gen: NAD, conversant 

HEENT: PERRL, EOMI.  

Neck: Supple, no thyromegaly, no carotid bruits, JVP 

Nodes: No cervical or supraclavicular LAN 

Cor: RRR S1, S2 nl. No m/r/g. No S3, S4 

Chest: CTAB 

Abdomen: +BS Soft, NT, ND. No HSM, No CVA tenderness. 

Ext: LLE with dorsal and medial erythema, extending from L 5th toe that has eschar on its side and is mildly tender, no secretions. L toe also tender. Pulse on LLE + and RLE ++. 

Skin: No other rashes 

Neuro: AO X 3. CN II-XII intact. Decreased sensation from LT up to knee on R and 4cm above ankle on Left.. 



Labs and Studies: 

    RSC  

      09/06/78 

      10:17  



NA   137              

K   4.5(T)              

CL   106              

CO2   28.4              

BUN   25               

CRE   0.9              

GLU   266(H)              

CA   9.5              

PHOS  3.1              

MG   1.7              



CBC 

WBC   12.7(H)             

RBC   4.13(L)             

HGB   13.0(L)             

HCT   37.1(L)             

MCV   90               

MCH   31.5              

MCHC  35.0              

PLT   165              

RDW   13.3              

DIFFR  Received             

METHOD  Auto              

%NEUT  79(H)              

%LYMPH  17(L)              

%MONO  3(L)              

%EOS  1               

%BASO  0               

ANEUT  10.02(H)             

ALYMP  2.13              

AMONS  0.44(H)             

AEOSN  0.11              

ABASOP  0.03              

ANISO  None              

HYPO  None              

MACRO  None              

MICRO  None              



PT   11.9              

PTT   25.0              



LENIS: Negative for DVT, did not assess arteries. 

FOOT ANKLE XR: There is a lytic lesion in the distal lateral aspect of the proximal phalanx of the fifth toe. This can be consitent with an area of infection/osteomyelitis. 



Microbiology 

21-Jul-2076 09:41 

    Specimen Type:  WOUND 

    Specimen Comment: ULCER 4TH 5TH TOE 



    Wound Culture - Final Reported: 24-Jul-76 15:05 

    Moderate PROTEUS VULGARIS 

     RAPID METHOD 

     Antibiotic      Interpretation 

     ---------------------------------------------- 

     Amikacin      Susceptible 

     Ampicillin      Resistant  

     Aztreonam      Susceptible 

     Cefazolin      Resistant  

     Cefepime      Susceptible 

     Cefpodoxime      Susceptible 

     Ceftriaxone      Susceptible 

     Gentamicin      Susceptible 

     Levofloxacin     Susceptible 

     Piperacillin     Susceptible 

     Trimethoprim/Sulfamethoxazole Susceptible 







A/P: 48M with a hx of DM2, PVD and multiple admissions in the past for LE cellulitis in the setting of gangrene. 

1. ID: Patient is now presenting with appears to be another episode of cellulitis but now probably coming from his L 5th Toe lesion. Surgery has debrided the wound, sending wound cultures as well as blood cultures. Acute OM would not be visible on XR changes and clinical picture is more consistent with acute than Chronic OM. Will consider further work up for OM if symptoms do not respond to treatment. Levo and flagyl were added to unasyn in accord to previous culture data. 

2. PVD: Will need arterial LENIS to assess for vascular patency and flow. Continuing ACEI, and adding ASA and lipitor, will order lipid profile and smoking cessation consult. 

3. DM2: Very poor control last admission, eventhough patient now says he takes medications and checks it up to QID. Will order HgbA1C and glucose monitoring. 



_______________________________________________________________________ 

Name Ian Jurado MD        

Pager # 14558 

PGY-1 











]]></TEXT> 
    <TAGS> 
    <MEDICATION id="DOC0" time="during DCT" type1="ACE inhibitor" type2=""> 
     <MEDICATION id="M0" start="1782" end="1789" text="Zestril" time="during DCT" type1="ACE inhibitor" type2="" comment=""/> 
     <MEDICATION id="M1" start="1782" end="1789" text="Zestril" time="during DCT" type1="ACE inhibitor" type2="" comment=""/> 
     <MEDICATION id="M2" start="1782" end="1789" text="Zestril" time="during DCT" type1="ACE inhibitor" type2="" comment=""/> 
    </MEDICATION> 
    <MEDICATION id="DOC1" time="after DCT" type1="statin" type2=""> 
     <MEDICATION id="M3" start="7111" end="7133" text="adding ASA and lipitor" time="after DCT" type1="statin" type2="" comment=""/> 
     <MEDICATION id="M4" start="7126" end="7133" text="lipitor" time="after DCT" type1="statin" type2="" comment=""/> 
    </MEDICATION> 
    <DIABETES id="DOC2" time="before DCT" indicator="mention"> 
     <DIABETES id="D0" start="296" end="299" text="DM2" time="before DCT" indicator="mention" comment=""/> 
     <DIABETES id="D1" start="296" end="299" text="DM2" time="before DCT" indicator="mention" comment=""/> 
     <DIABETES id="D2" start="1180" end="1183" text="DM2" time="before DCT" indicator="mention" comment=""/> 
     <DIABETES id="D3" start="6444" end="6447" text="DM2" time="before DCT" indicator="mention" comment=""/> 
     <DIABETES id="D4" start="7195" end="7198" text="DM2" time="before DCT" indicator="mention" comment=""/> 
     <DIABETES id="D5" start="296" end="299" text="DM2" time="before DCT" indicator="mention" comment=""/> 
    </DIABETES> 
    <MEDICATION id="DOC3" time="after DCT" type1="sulfonylureas" type2=""> 
     <MEDICATION id="M5" start="1734" end="1743" text="Glyburide" time="after DCT" type1="sulfonylureas" type2="" comment=""/> 
     <MEDICATION id="M6" start="1734" end="1743" text="Glyburide" time="after DCT" type1="sulfonylureas" type2="" comment=""/> 
     <MEDICATION id="M7" start="1734" end="1743" text="Glyburide" time="after DCT" type1="sulfonylureas" type2="" comment=""/> 
    </MEDICATION> 
    <MEDICATION id="DOC4" time="after DCT" type1="metformin" type2=""> 
     <MEDICATION id="M8" start="1758" end="1768" text="Glucopahge" time="after DCT" type1="metformin" type2="" comment=""/> 
     <MEDICATION id="M9" start="1758" end="1768" text="Glucopahge" time="after DCT" type1="metformin" type2="" comment=""/> 
     <MEDICATION id="M10" start="1758" end="1768" text="Glucopahge" time="after DCT" type1="metformin" type2="" comment=""/> 
    </MEDICATION> 
    <MEDICATION id="DOC5" time="during DCT" type1="metformin" type2=""> 
     <MEDICATION id="M11" start="1758" end="1768" text="Glucopahge" time="during DCT" type1="metformin" type2="" comment=""/> 
     <MEDICATION id="M12" start="1758" end="1768" text="Glucopahge" time="during DCT" type1="metformin" type2="" comment=""/> 
     <MEDICATION id="M13" start="1758" end="1768" text="Glucopahge" time="during DCT" type1="metformin" type2="" comment=""/> 
    </MEDICATION> 
    <HYPERTENSION id="DOC6" time="during DCT" indicator="high bp"> 
     <HYPERTENSION id="H0" start="2100" end="2106" text="149/84" time="during DCT" indicator="high bp" comment=""/> 
     <HYPERTENSION id="H1" start="828" end="834" text="145/79" time="during DCT" indicator="high bp" comment=""/> 
    </HYPERTENSION> 
    <MEDICATION id="DOC7" time="before DCT" type1="ACE inhibitor" type2=""> 
     <MEDICATION id="M14" start="1782" end="1789" text="Zestril" time="before DCT" type1="ACE inhibitor" type2="" comment=""/> 
     <MEDICATION id="M15" start="1782" end="1789" text="Zestril" time="before DCT" type1="ACE inhibitor" type2="" comment=""/> 
     <MEDICATION id="M16" start="1782" end="1789" text="Zestril" time="before DCT" type1="ACE inhibitor" type2="" comment=""/> 
    </MEDICATION> 
    <SMOKER id="DOC8" status="current"> 
     <SMOKER id="S0" start="7163" end="7191" text=" smoking cessation consult. " status="current" comment=""/> 
     <SMOKER id="S1" start="1965" end="1995" text="has smoked 1/2ppd for 35 years" status="current" comment=""/> 
     <SMOKER id="S2" start="1969" end="1995" text="smoked 1/2ppd for 35 years" status="current" comment=""/> 
    </SMOKER> 
    <MEDICATION id="DOC9" time="before DCT" type1="metformin" type2=""> 
     <MEDICATION id="M17" start="1758" end="1768" text="Glucopahge" time="before DCT" type1="metformin" type2="" comment=""/> 
     <MEDICATION id="M18" start="1758" end="1768" text="Glucopahge" time="before DCT" type1="metformin" type2="" comment=""/> 
     <MEDICATION id="M19" start="1758" end="1768" text="Glucopahge" time="before DCT" type1="metformin" type2="" comment=""/> 
    </MEDICATION> 
    <MEDICATION id="DOC10" time="after DCT" type1="ACE inhibitor" type2=""> 
     <MEDICATION id="M20" start="1782" end="1789" text="Zestril" time="after DCT" type1="ACE inhibitor" type2="" comment=""/> 
     <MEDICATION id="M21" start="1782" end="1789" text="Zestril" time="after DCT" type1="ACE inhibitor" type2="" comment=""/> 
     <MEDICATION id="M22" start="1782" end="1789" text="Zestril" time="after DCT" type1="ACE inhibitor" type2="" comment=""/> 
    </MEDICATION> 
    <MEDICATION id="DOC11" time="during DCT" type1="sulfonylureas" type2=""> 
     <MEDICATION id="M23" start="1734" end="1743" text="Glyburide" time="during DCT" type1="sulfonylureas" type2="" comment=""/> 
     <MEDICATION id="M24" start="1734" end="1743" text="Glyburide" time="during DCT" type1="sulfonylureas" type2="" comment=""/> 
     <MEDICATION id="M25" start="1734" end="1743" text="Glyburide" time="during DCT" type1="sulfonylureas" type2="" comment=""/> 
    </MEDICATION> 
    <DIABETES id="DOC12" time="during DCT" indicator="mention"> 
     <DIABETES id="D6" start="296" end="299" text="DM2" time="during DCT" indicator="mention" comment=""/> 
     <DIABETES id="D7" start="296" end="299" text="DM2" time="during DCT" indicator="mention" comment=""/> 
     <DIABETES id="D8" start="1180" end="1183" text="DM2" time="during DCT" indicator="mention" comment=""/> 
     <DIABETES id="D9" start="6444" end="6447" text="DM2" time="during DCT" indicator="mention" comment=""/> 
     <DIABETES id="D10" start="7195" end="7198" text="DM2" time="during DCT" indicator="mention" comment=""/> 
     <DIABETES id="D11" start="296" end="299" text="DM2" time="during DCT" indicator="mention" comment=""/> 
    </DIABETES> 
    <MEDICATION id="DOC13" time="after DCT" type1="aspirin" type2=""> 
     <MEDICATION id="M26" start="7111" end="7133" text="adding ASA and lipitor" time="after DCT" type1="aspirin" type2="" comment=""/> 
     <MEDICATION id="M27" start="7118" end="7121" text="ASA" time="after DCT" type1="aspirin" type2="" comment=""/> 
    </MEDICATION> 
    <FAMILY_HIST id="DOC14" indicator="not present"> 
     <FAMILY_HIST id="F0" indicator="not present"/> 
     <FAMILY_HIST id="F1" indicator="not present"/> 
     <FAMILY_HIST id="F2" indicator="not present"/> 
    </FAMILY_HIST> 
    <DIABETES id="DOC15" time="after DCT" indicator="mention"> 
     <DIABETES id="D12" start="296" end="299" text="DM2" time="after DCT" indicator="mention" comment=""/> 
     <DIABETES id="D13" start="296" end="299" text="DM2" time="after DCT" indicator="mention" comment=""/> 
     <DIABETES id="D14" start="1180" end="1183" text="DM2" time="after DCT" indicator="mention" comment=""/> 
     <DIABETES id="D15" start="6444" end="6447" text="DM2" time="after DCT" indicator="mention" comment=""/> 
     <DIABETES id="D16" start="7195" end="7198" text="DM2" time="after DCT" indicator="mention" comment=""/> 
     <DIABETES id="D17" start="296" end="299" text="DM2" time="after DCT" indicator="mention" comment=""/> 
    </DIABETES> 
    <MEDICATION id="DOC16" time="before DCT" type1="sulfonylureas" type2=""> 
     <MEDICATION id="M28" start="1734" end="1743" text="Glyburide" time="before DCT" type1="sulfonylureas" type2="" comment=""/> 
     <MEDICATION id="M29" start="1734" end="1743" text="Glyburide" time="before DCT" type1="sulfonylureas" type2="" comment=""/> 
     <MEDICATION id="M30" start="1734" end="1743" text="Glyburide" time="before DCT" type1="sulfonylureas" type2="" comment=""/> 
    </MEDICATION> 
    <PHI id="P0" start="16" end="26" text="2078-09-07" TYPE="DATE"/> 
    <PHI id="P1" start="39" end="54" text="RYBURY HOSPITAL" TYPE="HOSPITAL"/> 
    <PHI id="P2" start="88" end="102" text="Goldberg, Joel" TYPE="PATIENT"/> 
    <PHI id="P3" start="110" end="117" text="0370149" TYPE="MEDICALRECORD"/> 
    <PHI id="P4" start="139" end="147" text="9/6/2078" TYPE="DATE"/> 
    <PHI id="P5" start="159" end="164" text="Lange" TYPE="DOCTOR"/> 
    <PHI id="P6" start="165" end="171" text="Bailey" TYPE="DOCTOR"/> 
    <PHI id="P7" start="184" end="191" text="Schmidt" TYPE="DOCTOR"/> 
    <PHI id="P8" start="201" end="212" text="Odom, Kacie" TYPE="DOCTOR"/> 
    <PHI id="P9" start="267" end="269" text="48" TYPE="AGE"/> 
    <PHI id="P10" start="441" end="445" text="July" TYPE="DATE"/> 
    <PHI id="P11" start="1197" end="1201" text="2075" TYPE="DATE"/> 
    <PHI id="P12" start="1422" end="1430" text="12/14/76" TYPE="DATE"/> 
    <PHI id="P13" start="1474" end="1481" text="9/03/76" TYPE="DATE"/> 
    <PHI id="P14" start="1554" end="1561" text="11/6/76" TYPE="DATE"/> 
    <PHI id="P15" start="1635" end="1642" text="7/20/76" TYPE="DATE"/> 
    <PHI id="P16" start="1671" end="1675" text="4/75" TYPE="DATE"/> 
    <PHI id="P17" start="1866" end="1879" text="Arroyo Grande" TYPE="CITY"/> 
    <PHI id="P18" start="1927" end="1938" text="copy editor" TYPE="PROFESSION"/> 
    <PHI id="P19" start="2717" end="2720" text="RSC" TYPE="HOSPITAL"/> 
    <PHI id="P20" start="2740" end="2748" text="09/06/78" TYPE="DATE"/> 
    <PHI id="P21" start="5510" end="5521" text="21-Jul-2076" TYPE="DATE"/> 
    <PHI id="P22" start="5638" end="5647" text="24-Jul-76" TYPE="DATE"/> 
    <PHI id="P23" start="6427" end="6429" text="48" TYPE="AGE"/> 
    <PHI id="P24" start="7431" end="7441" text="Ian Jurado" TYPE="DOCTOR"/> 
    <PHI id="P25" start="7485" end="7490" text="14558" TYPE="PHONE"/> 
    </TAGS> 
</root> 

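Whenever I try to tokenize the text above, NLTK gets it wrong: it lumps together all the phrases that come before a period (".") and treats them as one sentence.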


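I think that for this particular case you are better off writing a simple sentence splitter yourself. Splitting on two consecutive newlines seems to do the trick here. Afterwards, you can check whether feeding the elements of the resulting list to the NLTK sentence segmenter improves the output quality. – Igor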
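The splitting step suggested in this comment could be sketched roughly as follows. This is only an illustration, not Igor's actual code: the function name `split_blocks` and the sample string are made up for the example, and it assumes that blocks are always separated by at least one blank line.

```python
import re

def split_blocks(text):
    """Split text into blocks separated by one or more blank lines."""
    # A blank line shows up as a newline, optional whitespace, then another newline.
    blocks = re.split(r"\n\s*\n", text)
    # Strip surrounding whitespace and discard blocks that end up empty.
    return [block.strip() for block in blocks if block.strip()]

# Hypothetical sample modelled on the note in the question.
sample = "CC: L foot pain\n\n\n\nPMH:\n\n1. DM2 diagnosed in 2075.\n"
blocks = split_blocks(sample)
print(blocks)
```

Each resulting block could then be passed to `nltk.sent_tokenize` to see whether the sentence boundaries inside it come out better.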


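Thanks. I thought so too, but I also happen to have some files where the line breaks are in the wrong places, for example somewhere in the middle of a sentence. –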
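For files where a line break falls in the middle of a sentence, one possible workaround is to rejoin the lines of each block before tokenizing. The sketch below assumes that blank lines still mark the real block boundaries and that single newlines inside a block are spurious. Neither assumption is confirmed for the files mentioned here, and `unwrap_blocks` is a hypothetical helper name.

```python
import re

def unwrap_blocks(text):
    """Join hard-wrapped lines inside each blank-line-separated block."""
    unwrapped = []
    for block in re.split(r"\n\s*\n", text):
        # str.split() with no arguments splits on any whitespace run,
        # so re-joining with one space removes line breaks inside the block.
        joined = " ".join(block.split())
        if joined:
            unwrapped.append(joined)
    return unwrapped

# Hypothetical sample with a line break in the middle of a sentence.
wrapped = "He now presents with acute\non chronic swelling.\n\nNo dysuria.\n"
print(unwrap_blocks(wrapped))
```

Note that this also collapses runs of spaces, so the column alignment of table-like rows is lost; that may or may not matter for the task.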

Answer


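Some lines in this file (really paragraphs) contain several sentences. Break the file up into lines, then apply the sentence tokenizer to each line separately. This prevents text from different lines being merged, and it will give better results than rolling your own regex-based sentence splitter. For example: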

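import nltk  # sent_tokenize needs NLTK's Punkt tokenizer data to be installed

with open("note.txt", encoding="utf-8") as file:  # "note.txt" is a placeholder name
    text = file.read()
lines = text.splitlines()  # one entry per line (really per paragraph) of the file
sentences = [s for line in lines for s in nltk.sent_tokenize(line)]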
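The sample in the question wraps the note in XML, so the text under TEXT probably has to be pulled out before it is split into lines. A minimal sketch using the standard library's `xml.etree.ElementTree` follows; the helper name `note_lines` and the shortened sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def note_lines(xml_string):
    """Return the non-empty lines of the <TEXT> element, stripped."""
    root = ET.fromstring(xml_string)
    # findtext returns the element's text (CDATA included) or the default.
    text = root.findtext("TEXT", default="")
    return [line.strip() for line in text.splitlines() if line.strip()]

# Shortened, hypothetical stand-in for the file shown in the question.
sample_xml = """<root>
    <TEXT><![CDATA[
CC: L foot pain

ROS: Per HPI. No F/C/NS.
]]></TEXT>
    <TAGS></TAGS>
</root>"""

print(note_lines(sample_xml))
```

Each returned line can then be handed to `nltk.sent_tokenize`. For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring`. Stripping and dropping lines changes character positions, so the `start`/`end` offsets in the TAGS section would no longer line up with the processed text.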

What is lines.text.splitlines() here? Is it just a sample format for whatever method/class I might use? –


Oops, typo! Sorry about that :-( No, this is complete working code, as long as you have opened the file for reading. – alexis
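On the original question of telling table rows and short phrases apart from running prose, neither the comments nor the answer give a method. One crude heuristic, offered purely as an assumption-laden sketch, is to treat a line as prose only if it ends with sentence punctuation and has several words. The function name `looks_like_sentence` and the threshold of five words are arbitrary choices for illustration.

```python
def looks_like_sentence(line, min_words=5):
    """Guess whether a line is running prose rather than a table row or label."""
    stripped = line.strip()
    # Prose usually ends with sentence punctuation and contains several words.
    ends_like_sentence = stripped.endswith((".", "?", "!"))
    return ends_like_sentence and len(stripped.split()) >= min_words

# Lines taken from the sample note in the question.
lines = [
    "NA   137",
    "Amikacin      Susceptible",
    "LENIS: Negative for DVT, did not assess arteries.",
]
prose = [line for line in lines if looks_like_sentence(line)]
table_like = [line for line in lines if not looks_like_sentence(line)]
print(prose)
print(table_like)
```

A rule like this will misclassify short sentences such as "No dysuria." and any prose line that lacks final punctuation, so it would need tuning against real files.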