2015-10-19 55 views

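How do I return the original sentence instead of the lowercased one?

I have a short piece of code, and I want it to print the original sentence rather than the lowercased one it extracts. The code is as follows: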

import re 
from nltk.tokenize import sent_tokenize, word_tokenize 
def foo(): 
    txt = "Risk factors for breast cancer have been well characterized. Breast cancer is 100 times more frequent in women than in men.\ 
    Factors associated with an increased exposure to estrogen have also been elucidated including early menarche, late menopause, later age\ 
    at first pregnancy, or nulliparity. The use of hormone replacement therapy has been confirmed as a risk factor, although mostly limited to \ 
    the combined use of estrogen and progesterone, as demonstrated in the WHI (2). Analysis showed that the risk of breast cancer among women using \ 
    estrogen and progesterone was increased by 24% compared to placebo. A separate arm of the WHI randomized women with a prior hysterectomy to \ 
    conjugated equine estrogen (CEE) versus placebo, and in that study, the use of CEE was not associated with an increased risk of breast cancer (3).\ 
    Unlike hormone replacement therapy, there is no evidence that oral contraceptive (OCP) use increases risk. A large population-based case-control study \ 
    examining the risk of breast cancer among women who previously used or were currently using OCPs included over 9,000 women aged 35 to 64 \ 
    (half of whom had breast cancer) (4). The reported relative risk was 1.0 (95% CI, 0.8 to 1.3) among women currently using OCPs and 0.9 \ 
    (95% CI, 0.8 to 1.0) among prior users. In addition, neither race nor family history was associated with a greater risk of breast cancer among OCP users." 
    words = txt 
    corpus = " ".join(words).lower() 
    sentences1 = sent_tokenize(corpus) 
    a = [" ".join([sentences1[i-1],j]) for i,j in enumerate(sentences1) if 'risk' in word_tokenize(j)] 


    for i in a: 
     print i,'\n','\n' 

foo() 

What I keep getting is (for example):

>>risk factors for breast cancer have been well characterized 

instead of this:

>>Risk factors for breast cancer have been well characterized. 

@TimCastelijns, this is part of a larger program that works with .lower() text, which is why I use it. – wakamdr

Answers

corpus = " ".join(words).lower() 

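It looks like you call .lower() on the string so that you can later compare it against 'risk'. As you have already noticed, this lowercases the entire string, and there is no easy way to reverse that operation.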

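To avoid this, lowercase only at the point of comparison: word_tokenize(j.lower()). Note that word_tokenize returns a list, so .lower() has to be applied to the sentence j before tokenizing, not to the result. Change these lines

corpus = " ".join(words).lower() 
a = [" ".join([sentences1[i-1],j]) for i,j in enumerate(sentences1) if 'risk' in word_tokenize(j)] 

to

corpus = " ".join(words) 
a = [" ".join([sentences1[i-1],j]) for i,j in enumerate(sentences1) if 'risk' in word_tokenize(j.lower())] 

This keeps the string in its original state while still making it easy to compare against 'risk'.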
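As a rough, self-contained illustration of the idea (lowercase only for the comparison, return the untouched sentence), here is a Python 3 sketch. To keep it runnable without nltk or its tokenizer data, split_sentences and tokenize_words are hypothetical regex stand-ins for sent_tokenize and word_tokenize; they are much cruder than the real tokenizers, and the text is a shortened excerpt of the one in the question.

```python
import re

def split_sentences(text):
    # Naive stand-in for nltk's sent_tokenize (an assumption for this sketch):
    # split on whitespace that follows '.', '!' or '?'.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize_words(sentence):
    # Naive stand-in for nltk's word_tokenize: runs of letters, digits or apostrophes.
    return re.findall(r"[A-Za-z0-9']+", sentence)

def sentences_mentioning(text, keyword):
    # Lowercase only the tokens used for the comparison;
    # the sentences that are returned keep their original case.
    keyword = keyword.lower()
    return [s for s in split_sentences(text)
            if keyword in (w.lower() for w in tokenize_words(s))]

txt = ("Risk factors for breast cancer have been well characterized. "
       "Breast cancer is 100 times more frequent in women than in men. "
       "Analysis showed that the risk of breast cancer was increased by 24% compared to placebo.")

for s in sentences_mentioning(txt, "risk"):
    print(s)
```

With nltk installed, the two stand-ins can be swapped for sent_tokenize and word_tokenize without changing the rest.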


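Thanks. My biggest problem is that I have user input (for example the string "risk") that is automatically converted with .lower(), which is why I use a lowercased corpus. So if I keep the capital letters in the corpus, I guess I will have to figure out how to do the extraction. – wakamdr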
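One possible way to reconcile a lowercased query with an original-case corpus is to keep a lowercased "shadow" copy of the sentences purely for matching, and use the shared index to return the originals. The sketch below is only an illustration under assumptions: match_with_context is a hypothetical helper, the word matching uses a simple regex instead of word_tokenize, and sentences is assumed to be a list of original-case sentences (for example the output of sent_tokenize on the untouched text).

```python
import re

def match_with_context(sentences, query):
    # sentences: original-case sentences; query: user input in any case.
    query = query.lower()
    # Lowercased shadow copy, used only for matching.
    lowered = [s.lower() for s in sentences]
    results = []
    for i, low in enumerate(lowered):
        if query in re.findall(r"\w+", low):
            # The same index i is valid in both lists, so the
            # original-case sentence can be looked up directly.
            previous = sentences[i - 1] if i > 0 else ""
            results.append((previous + " " + sentences[i]).strip())
    return results

sentences = [
    "Risk factors for breast cancer have been well characterized.",
    "Breast cancer is 100 times more frequent in women than in men.",
    "The use of hormone replacement therapy has been confirmed as a risk factor.",
]

for hit in match_with_context(sentences, "RISK"):
    print(hit)
```

Unlike the list comprehension in the question, this sketch does not prepend anything when the match is the first sentence; in the original, sentences1[i-1] with i equal to 0 picks up the last sentence of the text.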
