2014-09-20

I want to create a corpus from Java source code.

I am following section [2.1] of http://cs.queensu.ca/~sthomas/data/Thomas_2011_MSR.pdf

According to the paper, preprocessing should handle the following:
- Syntax characters of the programming language [done by removePunctuation]
- Programming-language keywords [done by tm_map(dsc, removeWords, javaKeywords)]
- Common English stop words [done by tm_map(dsc, removeWords, stopwords("english"))]
- Stemming [done by tm_map(dsc, stemDocument)]

The remaining step for creating the source-code corpus is to split identifiers and method names into their parts, based on common naming conventions.

For example, 'firstName' should be split into 'first' and 'name'.

Another example: 'calculateAge' should be split into 'calculate' and 'age'.

Can anyone help me?

library(tm) 
    dd <- DirSource(pattern = "\\.java$", recursive = TRUE) 
    javaKeywords <- c("abstract","continue","for","new","switch","assert","default","package","synchronized","boolean","do","if","private","this","break","double","implements","protected","throw","byte","else","null","NULL","TRUE","FALSE","true","false","import","public","throws","case","enum","instanceof","return","transient","catch","extends","int","short","try","char","final","interface","static","void","class","finally","long","volatile","const","float","native","super","while") 
    dsc <- Corpus(dd) 
    dsc <- tm_map(dsc, stripWhitespace) 
    dsc <- tm_map(dsc, removePunctuation) 
    dsc <- tm_map(dsc, removeNumbers) 
    dsc <- tm_map(dsc, removeWords, stopwords("english")) 
    dsc <- tm_map(dsc, removeWords, javaKeywords) 
    dsc <- tm_map(dsc, stemDocument) 
    dtm <- DocumentTermMatrix(dsc, control = list(weighting = weightTf, stopwords = FALSE)) 

Answers


I have written a tool in Perl that does various kinds of source-code preprocessing, including identifier splitting:

https://github.com/stepthom/lscp

The relevant code is:

=head2 tokenize 
Title : tokenize 
Usage : tokenize($wordsIn) 
Function : Splits words based on camelCase, under_scores, and dot.notation. 
      : Leaves other words alone. 
Returns : $wordsOut => string, the tokenized words 
Args  : named arguments: 
      : $wordsIn => string, the white-space delimited words to process 
=cut 
sub tokenize{ 
    my $wordsIn = shift; 
    my $wordsOut = ""; 

    for my $w (split /\s+/, $wordsIn) { 
     # Split up camel case: aaA ==> aa A 
     $w =~ s/([a-z]+)([A-Z])/$1 $2/g; 

     # Split up camel case: AAa ==> A Aa 
     # Split up camel case: AAAAa ==> AAA Aa 
     $w =~ s/([A-Z]{1,100})([A-Z])([a-z]+)/$1 $2$3/g; 

     # Split up underscores 
     $w =~ s/_/ /g; 

     # Split up dots 
     $w =~ s/([a-zA-Z0-9])\.+([a-zA-Z0-9])/$1 $2/g; 

     $wordsOut = "$wordsOut $w"; 
    } 

    return removeDuplicateSpaces($wordsOut); 
} 

The hacks above are based on my own experience preprocessing source code for text analysis. Feel free to steal and modify.
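For a tm-based pipeline like the one in the question, the Perl rules above can be ported to R. This is a sketch of my own translation; the name tokenizeIdentifiers is mine, not part of lscp:

```r
# Sketch: an R port of the Perl tokenize() rules above.
# Splits camelCase, under_scores, and dot.notation; leaves other words alone.
tokenizeIdentifiers <- function(wordsIn) {
  w <- wordsIn
  # Split up camel case: aaA ==> aa A
  w <- gsub("([a-z]+)([A-Z])", "\\1 \\2", w)
  # Split up camel case: AAa ==> A Aa, AAAAa ==> AAA Aa
  w <- gsub("([A-Z]+)([A-Z])([a-z]+)", "\\1 \\2\\3", w)
  # Split up underscores
  w <- gsub("_", " ", w)
  # Split up dots
  w <- gsub("([a-zA-Z0-9])\\.+([a-zA-Z0-9])", "\\1 \\2", w)
  # Collapse duplicate spaces
  gsub(" +", " ", w)
}

tokenizeIdentifiers("firstName my_var obj.field HTTPServer")
# "first Name my var obj field HTTP Server"
```

Since gsub is vectorized, this also works directly on a character vector of documents.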


You can create a custom function to split words by their capital letters (vectorized here):

splitCapital <- function(x) 
    unlist(strsplit(tolower(sub('(.*)([A-Z].*)','\\1 \\2',x)),' ')) 

Examples:

splitCapital('firstName') 
[1] "first" "name" 

splitCapital(c('firstName','calculateAge')) 
[1] "first"  "name"  "calculate" "age" 

Then you can iterate over your corpus:

corpus.split <- lapply(dsc,splitCapital) 
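Note that sub() replaces only the first match, so identifiers with more than two parts are split at just one boundary. A gsub-based variant (my tweak, not from the answer; 'getFirstName' is an illustrative input) splits every lowercase-to-uppercase transition:

```r
# Split at every lowercase-to-uppercase boundary, then lowercase and split on spaces.
splitCapitalAll <- function(x)
  unlist(strsplit(tolower(gsub("([a-z])([A-Z])", "\\1 \\2", x)), " "))

splitCapitalAll("getFirstName")
# "get" "first" "name"
```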

Just put the function call inside the control, e.g. 'dtm <- TermDocumentMatrix(dsc, control = list(tokenize = splitCapital))'. Thanks – Fawaz 2014-09-21 09:41:13


@Fawaz Just curious: why do you want to do text mining on Java code? I mean, what is your objective, and what makes Java different from other languages such as C++ from a text-mining perspective? – agstudy 2014-09-21 09:44:42


I am doing some research. The main question of my work is "Can we explain source-code evolution from a text-evolution perspective?" Source code can be viewed as natural language or regular text. I hope I have fed your curiosity :) @agstudy – Fawaz 2014-09-21 09:56:18