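I want to create a corpus from Java source code.
I am following section [2.1] of http://cs.queensu.ca/~sthomas/data/Thomas_2011_MSR.pdf
According to the preprocessing described in that paper, the following steps should be applied: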
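- Remove characters related to the syntax of the programming language [already done by removePunctuation]
- Remove programming language keywords [already done by tm_map(dsc, removeWords, javaKeywords)]
- Remove common English stopwords [already done by tm_map(dsc, removeWords, stopwords("english"))]
- Stem words [already done by tm_map(dsc, stemDocument)]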
Split identifiers and method names when creating a source code corpus
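The remaining step is to split identifiers and method names into multiple parts based on common naming conventions.
For example, 'firstName' should be split into 'first' and 'name'.
Likewise, 'calculateAge' should be split into 'calculate' and 'age'.
Can anybody help me with this?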
library(tm)
# pattern is a regular expression: escape the dot and anchor it so only *.java files match
dd = DirSource(pattern = "\\.java$", recursive = TRUE)
javaKeywords = c("abstract","continue","for","new","switch","assert","the","default","package","synchronized","boolean","do","if","private","this","break","double","implements","protected","throw","byte","else","the","null","NULL","TRUE","FALSE","true","false","import","public","throws","case","enum", "instanceof","return","transient","catch","extends","int","short","try","char","final","interface","static","void","class","finally","long","volatile","const","float","native","super","while")
dsc <- Corpus(dd)
dsc <- tm_map(dsc, stripWhitespace)
dsc <- tm_map(dsc, removePunctuation)
dsc <- tm_map(dsc, removeNumbers)
dsc <- tm_map(dsc, removeWords, stopwords("english"))
dsc <- tm_map(dsc, removeWords, javaKeywords)
dsc <- tm_map(dsc, stemDocument)
dtm <- DocumentTermMatrix(dsc, control = list(weighting = weightTf, stopwords = FALSE))
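
The splitting step asked about above could be sketched in base R roughly as follows. This is only a minimal illustration under stated assumptions: `split_identifier` is a hypothetical helper name (it is not part of tm), and it assumes identifiers follow camelCase, PascalCase, or underscore-separated conventions.

```r
# Hypothetical helper (not part of tm): split identifiers on common naming
# conventions and return the text in lower case.
split_identifier <- function(x) {
  # "firstName" -> "first Name": break between a lowercase letter or digit
  # and the uppercase letter that follows it
  x <- gsub("([a-z0-9])([A-Z])", "\\1 \\2", x, perl = TRUE)
  # "XMLParser" -> "XML Parser": break between an acronym run and a
  # following capitalised word
  x <- gsub("([A-Z]+)([A-Z][a-z])", "\\1 \\2", x, perl = TRUE)
  # "MAX_VALUE" -> "MAX VALUE": treat underscores as separators
  x <- gsub("_+", " ", x, perl = TRUE)
  tolower(x)
}

split_identifier("firstName")     # "first name"
split_identifier("calculateAge")  # "calculate age"
```

If this is used in the pipeline above, it would probably need to be wrapped as `tm_map(dsc, content_transformer(split_identifier))` and applied before `removePunctuation` and `stemDocument`, since removing underscores or stemming first would destroy the boundaries the function relies on.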
I just put the function call inside the control list, like 'dtm <- TermDocumentMatrix(dsc, control = list(tokenize = splitCapital))'. Thanks – Fawaz 2014-09-21 09:41:13
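@Fawaz Just curious, why do you want to do text mining on Java code? I mean, what is your objective, and in what way does Java differ from other languages such as C++, ... from a text-mining perspective? – agstudy 2014-09-21 09:44:42
I am doing some research. The main question of my work is "Can we explain source code evolution from the perspective of text evolution?" Source code can be viewed as natural language or regular text. I hope I have satisfied your curiosity :) @agstudy – Fawaz 2014-09-21 09:56:18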