Perl regular expression syntax

I would like to use Perl to take a previously generated SPSS syntax file and format it for use in an R environment.

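This is probably a very simple task for those familiar with Perl and regular expressions, but I am stumbling.

The steps for this Perl script, as I have laid them out, are as follows: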

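Read in the SPSS file
Find the appropriate chunks of the SPSS file (regex) for further processing and formatting
Do the further processing mentioned above (more regex)
Return the R syntax to the command line or, preferably, to a file.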

The basic format of the SPSS value labels syntax is:

...A bunch of nonsense I do not care about... 
... 
Value Labels 
/gender 
1 "M" 
2 "F" 
/purpose 
1 "business" 
2 "vacation" 
3 "tiddlywinks" 

execute . 
...Resume nonsense...

And the desired R syntax would look like this:

gender <- as.factor(gender 
    , levels= c(1,2) 
    , labels= c("M","F") 
    ) 
...

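Here is the Perl script I have written so far. I have successfully read each line into the appropriate array. I have the general flow of the final print function I need, but I need to figure out how to print only the appropriate @levels and @labels arrays for each entry in @vars.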

#!/usr/bin/perl 

#Need to change to read from argument in command line 
open(VARVAL, "append.txt"); 
@lines = <VARVAL>; 
close(VARVAL); 

#Read through each line and put into a variable, a value, or a reject 
#I really only want to read in everything between "value labels" and "execute ." 
#That probably requires more regex... 
foreach (@lines){ 
    if ($_ =~ /\//){  #Anything with a / is a variable, remove the / and push
     $_ =~ tr/\///d; 
     push(@vars, $_) 
    } elsif ($_ =~/\d/) { 
     push(@vals, $_) #Anything that has a number in the line is a value 
     } 
} 
#Splitting each @vals array into levels or labels arrays 
foreach (@vals){ 
    @values = split(/\s+/, $_); #Splitting on a space, vulnerable...better to split on first non-digit character?
    foreach (@values) { 
     if ($_ =~/\d/){ 
      push(@levels, $_); 
     } else { 
      push(@labels, $_) 
     } 
    } 
} 

#Get rid of newline 
#I should probably do this somewhere else?
chomp(@vars); 
chomp(@levels); 
chomp(@labels); 

#Need to tell it when to stop adding in @levels & @labels. While loop? Hash lookup? 
#Need to get rid of final comma 
#Need to redirect output to a file 
foreach (@vars){ 
    print $_ ." <- as.factor(" . $_ . "\n\t, levels = c(" ; 
     foreach (@levels){ 
      print $_ . ","; 
     } 
    print ")\n\t, labels = c("; 
    foreach(@labels){ 
      print $_ . ","; 
     } 
    print ")\n\t)\n"; 
}

Finally, here is sample output from the script as it currently runs:

gender <- as.factor(gender 
    , levels = c(1,2,1,2,3,) 
    , labels = c("M","F","biz","action","tiddlywinks",) 
    )

I need this to include only levels 1,2 and labels M and F.

Thanks for the help!

Source

2010-07-28 Chase

This seems to work for me:

#!/usr/bin/env perl 
use strict; 
use warnings; 

my @lines = <DATA>; 

my $current_label = ''; 
my @ordered_labels; 
my %data; 
for my $line (@lines) { 
    if ($line =~ /^\/(.*)$/) { # starts with slash 
     $current_label = $1; 
     push @ordered_labels, $current_label; 
     next; 
    } 
    if (length $current_label) { 
     if ($line =~ /^(\d) "(.*)"$/) { 
      $data{$current_label}{$1} = $2; 
      next; 
     } 
    } 
} 

for my $label (@ordered_labels) { 
    print "$label <- as.factor($label\n"; 
    print " , levels= c("; 
    print join(',',map { $_ } sort keys %{$data{$label}}); 
    print ")\n"; 
    print " , labels= c("; 
    print join(',', 
     map { '"' . $data{$label}{$_} . '"' } 
     sort keys %{$data{$label}}); 
    print ")\n"; 
    print " )\n"; 
} 

__DATA__
...A bunch of nonsense I do not care about...
...
Value Labels
/gender
1 "M"
2 "F"
/purpose
1 "business"
2 "vacation"
3 "tiddlywinks"

execute .

And the output:

gender <- as.factor(gender 
    , levels= c(1,2) 
    , labels= c("M","F") 
    ) 
purpose <- as.factor(purpose 
    , levels= c(1,2,3) 
    , labels= c("business","vacation","tiddlywinks") 
    )
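
The script above reads from __DATA__, matches single-digit levels only, and sorts the keys as strings. A minimal sketch of one way to extend it follows. It assumes the block of interest starts at a line beginning with "Value Labels" and ends at "execute .", and the names spss_to_r and $in_block, as well as the command-line usage, are hypothetical and not part of the script as posted. It keeps the same hash-of-hashes idea, accepts multi-digit levels, sorts them numerically, and reads the input file (and optionally an output file) from the command line.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: takes the lines of an SPSS syntax file and returns
# the R code for every variable found inside the "Value Labels" block.
sub spss_to_r {
    my @lines = @_;
    my ($current, @order, %data);
    my $in_block = 0;

    for my $line (@lines) {
        # Assumed block delimiters: "Value Labels" ... "execute ."
        if ($line =~ /^\s*value\s+labels/i) { $in_block = 1; next; }
        if ($line =~ /^\s*execute\s*\./i)   { $in_block = 0; next; }
        next unless $in_block;

        if ($line =~ m{^\s*/(\S+)}) {            # "/gender" starts a new variable
            $current = $1;
            push @order, $current;
        }
        elsif (defined $current and $line =~ /^\s*(\d+)\s+"(.*)"\s*$/) {
            $data{$current}{$1} = $2;            # level => label
        }
    }

    my $out = '';
    for my $var (@order) {
        my @levels = sort { $a <=> $b } keys %{ $data{$var} };   # numeric sort
        my @labels = map { qq("$data{$var}{$_}") } @levels;      # re-quote labels
        $out .= "$var <- as.factor($var\n"
              . "    , levels= c(" . join(',', @levels) . ")\n"
              . "    , labels= c(" . join(',', @labels) . ")\n"
              . "    )\n";
    }
    return $out;
}

# Assumed usage: perl spss2r.pl input.sps [output.R]
if (@ARGV) {
    my ($in, $out) = @ARGV;
    open my $fh, '<', $in or die "Cannot read $in: $!";
    my @lines = <$fh>;
    close $fh;

    my $r = spss_to_r(@lines);
    if (defined $out) {
        open my $ofh, '>', $out or die "Cannot write $out: $!";
        print {$ofh} $r;
        close $ofh;
    }
    else {
        print $r;                                # fall back to the command line
    }
}
```

Building each join from the sorted key list means there is no trailing comma to strip. The sketch keeps as.factor only to match the output requested in the question; in R it is factor() that accepts the levels and labels arguments, so the generated call may need that one-word change before it runs.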

Source

2010-07-28 23:51:00 mfontani

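Hmm, I guess it is that simple. I need to spend some time digesting what you did there, but I should be able to figure it out. Thanks! – Chase 2010-07-29 00:02:42

Can you explain the second if statement in the code above? It seems that 'if (length $current_label)' will return true for every line, no? Was that your intent? Also, is my interpretation of the next line correct: 'if ($line =~ /^(\d) "(.*)"$/)' means "if my line starts with a digit, then grab any and all characters within the quotes and place them in the $1 variable"? – Chase 2010-07-29 00:36:20

@Chase, it looks like the second 'if' is there to skip the "bunch of nonsense" lines (assuming they do not start with '/'). It prevents the code from recording values until it has found a valid label (note that '$current_label' is initialized to the empty string, which has no length). Personally, I would leave '$current_label' uninitialized and test for 'defined $current_label' instead, but this works too. – cjm 2010-07-29 01:28:46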
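
The variant cjm describes might look like the following sketch. The sample lines and the @seen array are made up for illustration. The first sample line would match the value regex, but it is skipped because no "/name" line has been seen yet. The comments also note what each capture group holds: $1 is the digit and $2 is the text between the quotes.

```perl
use strict;
use warnings;

my $current_label;    # deliberately left undefined until a "/name" line is seen
my @seen;             # illustrative only: collects what was recorded

for my $line ('9 "ignored"', '/gender', '1 "M"', '2 "F"') {
    if ($line =~ m{^/(.*)$}) {            # a line starting with a slash names a variable
        $current_label = $1;
        next;
    }
    next unless defined $current_label;   # skip everything before the first label
    if ($line =~ /^(\d) "(.*)"$/) {
        # $1 is the single digit (the level); $2 is the quoted text (the label)
        push @seen, "$current_label: level=$1 label=$2";
    }
}

print "$_\n" for @seen;
```

With these sample lines, only the two values under /gender are recorded.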
