2015-10-30 100 views

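Extracting specific parts of a text file in Python

I am trying to read data from a corpus of scientific paper abstracts (available here). I have posted a sample file below, which I read in with: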

with open(filePath, "r") as f:
    data = f.readlines()
for i, x in enumerate(data):
    print(i, x)

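I want to extract only the category name on line 25 and the text of the abstract; in the example below that would be ("Commercial exploitation over the...", "Life Science Biological"). I cannot assume that the category name and the abstract will always appear at these specific line numbers. The abstract always starts 2 lines after the "Abstract" line and runs to the end of the file.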

0 Title  : CRB: Genetic Diversity of Endangered Populations of Mysticete Whales: 

1    Mitochondrial DNA and Historical Demography 

2 Type  : Award 

3 NSF Org  : DEB 

4 Latest 

5 Amendment 

6 Date  : August 1, 1991  

7 File  : a9000006 

8 

9 Award Number: 9000006 

10 Award Instr.: Continuing grant        

11 Prgm Manager: Scott Collins       

12  DEB DIVISION OF ENVIRONMENTAL BIOLOGY  

13  BIO DIRECT FOR BIOLOGICAL SCIENCES   

14 Start Date : June 1, 1990  

15 Expires  : November 30, 1992 (Estimated) 

16 Expected 

17 Total Amt. : $179720    (Estimated) 

18 Investigator: Stephen R. Palumbi (Principal Investigator current) 

19 Sponsor  : U of Hawaii Manoa 

20  2530 Dole Street 

21  Honolulu, HI 968222225 808/956-7800 

22 

23 NSF Program : 1127  SYSTEMATIC & POPULATION BIOLO 

24 Fld Applictn: 0000099 Other Applications NEC     

25    61  Life Science Biological     

26 Program Ref : 9285, 

27 Abstract : 

28                        

29    Commercial exploitation over the past two hundred years drove     

30    the great Mysticete whales to near extinction. Variation in     

31    the sizes of populations prior to exploitation, minimal       

32    population size during exploitation and current population      

33    sizes permit analyses of the effects of differing levels of      

34    exploitation on species with different biogeographical       

35    distributions and life-history characteristics. Dr. Stephen     

36    Palumbi at the University of Hawaii will study the genetic      

37    population structure of three whale species in this context,     

38    the Humpback Whale, the Gray Whale and the Bowhead Whale. The     

39    effect of demographic history will be determined by comparing     

40    the genetic structure of the three species. Additional studies     

41    will be carried out on the Humpback Whale. The humpback has a     

42    world-wide distribution, but the Atlantic and Pacific       

43    populations of the northern hemisphere appear to be discrete     

44    populations, as is the population of the southern hemispheric     

45    oceans. Each of these oceanic populations may be further      

46    subdivided into smaller isolates, each with its own migratory     

47    pattern and somewhat distinct gene pool. This study will      

48    provide information on the level of genetic isolation among      

49    populations and the levels of gene flow and genealogical      

50    relationships among populations. This detailed genetic       

51    information will facilitate international policy decisions      

52    regarding the conservation and management of these magnificent     

53    mammals 

UPDATE: The code below works for me, but is there a more efficient way to do this?

import re

# extract_abstract_and_category is a placeholder name for the enclosing function
def extract_abstract_and_category(filePath):
    with open(filePath, "r") as f:
        data = f.readlines()

    # Find the abstract and category
    abstract = re.compile("Abstract")
    for i, line in enumerate(data):
        if abstract.search(line):
            break
    # i is the line number of the "Abstract" identifier
    temp = "".join(data[i+1:])
    abstractText = " ".join(re.findall('[A-Za-z]+', temp))
    category = " ".join(re.findall('[A-Za-z]+', data[i-2]))

    return abstractText, category
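
For comparison, here is a minimal sketch of a variant that uses no regular expressions. It relies on the same assumptions as the code above (the category sits two lines above the "Abstract" line and the abstract runs to the end of the file). The function name parse_award and the shortened sample list are hypothetical, made up for illustration. Unlike the [A-Za-z]+ filter, str.split() keeps punctuation and digits in the abstract.

```python
def parse_award(lines):
    """Return (abstract_text, category) from the lines of one award file.

    Assumption: the category is two lines above the line that starts with
    "Abstract", and the abstract runs from there to the end of the file.
    """
    for i, line in enumerate(lines):
        if line.startswith("Abstract"):
            break
    else:
        raise ValueError("no 'Abstract' line found")

    # split() with no arguments collapses runs of spaces and newlines.
    abstract_text = " ".join(" ".join(lines[i + 1:]).split())
    # Drop purely numeric tokens such as the leading code "61".
    category = " ".join(w for w in lines[i - 2].split() if not w.isdigit())
    return abstract_text, category


# Hypothetical, shortened excerpt in the layout of the sample file.
sample = [
    "Fld Applictn: 0000099      Other Applications NEC\n",
    "              61        Life Science Biological\n",
    "Program Ref : 9285,\n",
    "Abstract    :\n",
    "\n",
    "              Commercial exploitation over the past two hundred years drove\n",
    "              the great Mysticete whales to near extinction.\n",
    "\n",
]
print(parse_award(sample))
```

Using startswith instead of a search anywhere in the line avoids stopping early if the word "Abstract" happens to appear in a title.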

Answer

1

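What have you tried so far?

If the format is consistent, you can do this with a regular expression.

An example that would capture the abstract might look like: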

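abstract = re.compile(r"Abstract\s*:\s*(.*)", re.DOTALL)

The code above assumes that nothing follows the abstract text and that the abstract body is always preceded by "Abstract :" (the \s* allows for the whitespace before the colon seen in the sample file). With re.DOTALL, (.*) also matches newlines, so the group captures everything up to the end of the file, punctuation included.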
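
As a rough sketch of how the regular-expression approach could be applied to the whole file content at once (for example the result of f.read()): the names ABSTRACT_RE, CATEGORY_RE and extract_fields are hypothetical, and the category pattern rests on an assumption taken from the single sample file, namely that the category is the indented "<number> <name>" line directly below the "Fld Applictn:" line.

```python
import re

# Assumption: "Abstract" starts a line and the abstract runs to end of file.
ABSTRACT_RE = re.compile(r"^Abstract\s*:\s*(.*)", re.MULTILINE | re.DOTALL)
# Assumption: the category line follows the "Fld Applictn:" line directly.
CATEGORY_RE = re.compile(r"^Fld Applictn:.*\n\s+\d+\s+(.+?)\s*$", re.MULTILINE)


def extract_fields(text):
    """Return (abstract, category); either is None if its pattern fails."""
    abstract_match = ABSTRACT_RE.search(text)
    category_match = CATEGORY_RE.search(text)
    # " ".join(s.split()) collapses the line breaks and indentation.
    abstract = " ".join(abstract_match.group(1).split()) if abstract_match else None
    category = " ".join(category_match.group(1).split()) if category_match else None
    return abstract, category


# Hypothetical, shortened excerpt in the layout of the sample file.
sample_text = (
    "Fld Applictn: 0000099      Other Applications NEC\n"
    "              61        Life Science Biological\n"
    "Program Ref : 9285,\n"
    "Abstract    :\n"
    "\n"
    "              Commercial exploitation over the past two hundred years drove\n"
    "              the great Mysticete whales to near extinction.\n"
)
print(extract_fields(sample_text))
```

Anchoring the category pattern to "Fld Applictn:" keeps it from matching other indented lines that begin with a number, such as the street address in the sponsor block.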