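I would like to use Perl to take a previously generated SPSS syntax file and format it for use in an R environment.
This is probably a very simple task for those familiar with Perl and regular expressions, but I keep stumbling.
The steps I have laid out for this Perl script are as follows: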
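- Read in the SPSS file
- Find the appropriate chunks of the SPSS file (regex) for further processing and formatting
- Do the further processing mentioned above (more regex)
- Return the R syntax to the command line or, preferably, to a file.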
The basic format of SPSS value labels syntax is:
...A bunch of nonsense I do not care about...
...
Value Labels
/gender
1 "M"
2 "F"
/purpose
1 "business"
2 "vacation"
3 "tiddlywinks"
execute .
...Resume nonsense...
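One way to isolate the block between "Value Labels" and "execute ." (the second step in the list above) is Perl's range operator in scalar context, often called the flip-flop. This is only a sketch: the sub name value_label_block and the exact marker patterns are assumptions for illustration, and it expects each block to be closed by an "execute ." line.

```perl
use strict;
use warnings;

# Return only the lines between "Value Labels" and "execute ."
# (the two marker lines themselves are dropped).
# value_label_block is a hypothetical helper name, not part of any library.
sub value_label_block {
    my @lines = @_;
    my @block;
    for my $line (@lines) {
        # In scalar context, `..` turns true on the line matching the left
        # pattern and stays true through the line matching the right pattern.
        # Its value is a sequence number, with "E0" appended on the last line.
        my $pos = ($line =~ /^\s*value labels\b/i
                   .. $line =~ /^\s*execute\s*\.?\s*$/i);
        next unless $pos;                         # outside the block
        next if $pos == 1 || $pos =~ /E0$/;       # skip the marker lines
        push @block, $line;
    }
    return @block;
}

# Small demonstration on in-memory lines instead of a real file.
my @sample = (
    "...A bunch of nonsense I do not care about...\n",
    "Value Labels\n",
    "/gender\n",
    "1 \"M\"\n",
    "2 \"F\"\n",
    "execute .\n",
    "...Resume nonsense...\n",
);
print value_label_block(@sample);
```

Because the operator keeps its state between iterations, an input that opens a block and never closes it would leave the flip-flop switched on; an explicit flag variable is a reasonable alternative if that matters.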
And the R syntax I am after looks like this:
gender <- as.factor(gender
, levels= c(1,2)
, labels= c("M","F")
)
...
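Here is the Perl script I have written so far. I have successfully read each line into the appropriate array. I have the general flow of the final print function I need, but I need to figure out how to print only the appropriate @levels and @labels arrays for each entry in @vars.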
#!/usr/bin/perl
#Need to change to read from argument in command line
open(VARVAL, "append.txt");
@lines = <VARVAL>;
close(VARVAL);
#Read through each line and put into a variable, a value, or a reject
#I really only want to read in everything between "value labels" and "execute ."
#That probably requires more regex...
foreach (@lines){
if ($_ =~ /\//){ #Anything with a / is a variable, remove the / and push
$_ =~ tr/\///d;
push(@vars, $_)
} elsif ($_ =~/\d/) {
push(@vals, $_) #Anything that has a number in the line is a value
}
}
#Splitting each @vals array into levels or labels arrays
foreach (@vals){
@values = split(/\s+/, $_); #Splitting on whitespace, vulnerable...better to split on the first non-digit character?
foreach (@values) {
if ($_ =~/\d/){
push(@levels, $_);
} else {
push(@labels, $_)
}
}
}
#Get rid of newline
#I should probably do this somewhere else?
chomp(@vars);
chomp(@levels);
chomp(@labels);
#Need to tell it when to stop adding in @levels & @labels. While loop? Hash lookup?
#Need to get rid of final comma
#Need to redirect output to a file
foreach (@vars){
print $_ ." <- as.factor(" . $_ . "\n\t, levels = c(" ;
foreach (@levels){
print $_ . ",";
}
print ")\n\t, labels = c(";
foreach(@labels){
print $_ . ",";
}
print ")\n\t)\n";
}
Finally, here is sample output from the script as it currently runs:
gender <- as.factor(gender
, levels = c(1,2,1,2,3,)
, labels = c("M","F","biz","action","tiddlywinks",)
)
I need this to include only levels 1,2 and labels M and F.
Thanks for the help!
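The "Hash lookup?" idea from the script's comments can be sketched as follows: keep one list of levels and one list of labels per variable in hashes keyed by the variable name, then build each comma-separated list with join so no trailing comma appears. The sub names parse_value_labels and to_r are hypothetical, and the sketch assumes the input has already been narrowed to the lines between "Value Labels" and "execute .".

```perl
use strict;
use warnings;

# Parse value-label lines into an ordered list of [name, \@levels, \@labels].
# Hypothetical helper; assumes only the "Value Labels" block is passed in.
sub parse_value_labels {
    my @lines = @_;
    my (@vars, %levels, %labels);
    my $current;                                   # variable that owns the next values
    for my $line (@lines) {
        chomp(my $text = $line);
        if ($text =~ m{^\s*/\s*(\w+)}) {           # "/gender" starts a new variable
            $current = $1;
            push @vars, $current;
        }
        elsif (defined $current && $text =~ /^\s*(\d+)\s+(".*")\s*$/) {
            push @{ $levels{$current} }, $1;       # numeric code
            push @{ $labels{$current} }, $2;       # label, quotes kept for R
        }
    }
    return map { [ $_, $levels{$_} || [], $labels{$_} || [] ] } @vars;
}

# Render one R statement per variable in the layout shown in the question.
sub to_r {
    my @parsed = @_;
    return join '', map {
        my ($name, $levels, $labels) = @$_;
        sprintf "%s <- as.factor(%s\n\t, levels = c(%s)\n\t, labels = c(%s)\n\t)\n",
            $name, $name, join(',', @$levels), join(',', @$labels);
    } @parsed;
}

# Demonstration with the sample data from the question.
my @sample = (
    "/gender\n", "1 \"M\"\n", "2 \"F\"\n",
    "/purpose\n", "1 \"business\"\n", "2 \"vacation\"\n", "3 \"tiddlywinks\"\n",
);
print to_r(parse_value_labels(@sample));
```

To send the result to a file, redirecting standard output from the shell is probably the simplest route. Note that the sketch reproduces the as.factor layout requested above; as far as I know, it is R's factor() that accepts levels and labels arguments, so the function name in the sprintf format may need changing.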
Well, I guess it was that simple. I need to spend some time digesting what you did there, but I should be able to figure it out. Thanks! – Chase 2010-07-29 00:02:42
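Can you explain the second if statement in the code above? It seems that "if (length $current_label)" will return true for every line, no? Is that your intention? Also, is my interpretation of the next line correct: "if ($line =~ /^(\d) "(.*)"$/)" means "if my line begins with a digit, then grab any and all characters inside the quotes and put them in the $1 variable"? – Chase 2010-07-29 00:36:20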
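@Chase, it looks like the second "if" is there to skip the "bunch of nonsense" lines (assuming they do not begin with '/'). It prevents the code from recording values until it has found a valid label (note that '$current_label' is initialized to the empty string, which has no length). Personally, I would have left '$current_label' uninitialized and then tested for 'defined $current_label' instead, but this works too. – cjm 2010-07-29 01:28:46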
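The guard discussed in these comments can be illustrated with a small sketch of the variant cjm describes, where $current_label starts out undefined and is tested with defined. It is not the answer's original code: the sub name collect_pairs and the exact patterns are assumptions. It also shows which capture holds what: with two groups in the pattern, $1 is the numeric code and $2 is the text inside the quotes.

```perl
use strict;
use warnings;

# Collect [variable, code, label] triples, ignoring every line that
# appears before the first "/name" line. collect_pairs is a hypothetical name.
sub collect_pairs {
    my @lines = @_;
    my $current_label;                             # deliberately left undefined
    my @pairs;
    for my $line (@lines) {
        chomp(my $text = $line);
        if ($text =~ m{^/(\w+)}) {
            $current_label = $1;                   # a "/name" line sets the label
        }
        elsif (defined $current_label) {           # false until a "/name" line is seen
            if ($text =~ /^(\d+)\s+"(.*)"$/) {
                # $1 = leading digits, $2 = text between the quotes
                push @pairs, [ $current_label, $1, $2 ];
            }
        }
    }
    return @pairs;
}

# The first line looks like a value but precedes any "/name", so it is skipped.
for my $pair (collect_pairs("3 \"ignored\"\n", "/gender\n", "1 \"M\"\n", "2 \"F\"\n")) {
    print join(' | ', @$pair), "\n";
}
```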