2012-02-20 171 views
1

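Movie scraper, regex not grabbing every movie

This is my program's output for this link (http://www.rottentomatoes.com/movie/box_office.php). As you can see, I'm missing some of the movies on the page; for example, number 18 (One for the Money) isn't there. Can anyone check my regex and help me figure out why it doesn't grab all the movies, or whether there is an error in my code that I can't find?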

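I should add that I'm using the lynx command to get the data. Yes, I have to use it =(. I've updated the code to show how I get the information from the web page.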

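Also, I only want to print 35 characters of the movie title, so if a title is longer than that I just want to truncate everything after the 35th character.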

OUTPUT:

## ## Movie Title       Weekend  Cume T-Meter 
1 2 Safe House        $78.2M  $7.7k  52% 
2 1 The Vow        $85.5M  $8.0k  30% 
3 -- Ghost Rider: Spirit of Vengeance  $22.0M  $6.9k  15% 
4 3 Journey 2: The Mysterious Island  $53.2M  $5.7k  43% 
5 -- This Means War       $19.2M  $5.5k  25% 
6 4 Star Wars: Episode I - The Phantom Menace (in 3D) $33.7M  $3.0k  57% 
7 5 Chronicle        $51.0M  $2.9k  84% 
8 6 The Woman in Black      $45.3M  $2.6k  63% 
9 -- The Secret World of Arrietty   $6.4M  $4.2k  93% 
10 7 The Grey        $47.9M  $1.4k  78% 
11 9 The Descendants      $75.0M  $2.4k  89% 
12 13 The Artist        $27.4M  $2.9k  97% 
13 8 Big Miracle       $16.6M  $1.3k  73% 
14 14 Hugo         $66.7M  $2.9k  93% 
15 11 Red Tails        $47.5M  $1.4k  36% 
16 10 Underworld Awakening     $61.3M  $1.3k  28% 
17 18 The Iron Lady       $24.4M  $1.7k  53% 
19 15 Extremely Loud & Incredibly Close  $30.6M  $1.1k  45% 
20 17 Contraband        $65.7M  $1.2k  49% 
21 23 Alvin and the Chipmunks: Chipwrecked $129.7M  $1.2k  13% 
22 20 Mission: Impossible Ghost Protocol $207.3M  $1.8k  93% 
23 22 Tinker Tailor Soldier Spy    $22.7M  $2.6k  84% 
24 29 The Adventures of Tintin    $76.4M  $1.3k  75% 
25 33 A Separation       $2.1M  $6.2k  99% 
27 31 Albert Nobbs       $2.4M  $1.6k  53% 
28 -- Thin Ice        $0.2M  $3.6k  72% 
29 36 My Week with Marilyn     $13.6M  $1.5k  84% 
30 37 A Dangerous Method      $5.2M  $1.7k  77% 
31 35 Puss in Boots       $149.0M  $1.0k  83% 
33 53 In Darkness        $0.1M  $5.5k  86% 
34 44 We Need to Talk About Kevin    $0.6M  $4.0k  80% 
36 48 W.E.         $0.2M  $2.5k  13% 
37 47 Rampart         $0.1M  $1.8k  73% 
38 52 Coriolanus        $0.3M  $2.9k  94% 
39 -- Bullhead        $33.6k  $4.8k  86% 
40 -- Undefeated        $30.9k  $6.2k  92% 
42 55 Chico & Rita       $56.2k  $5.3k  93% 
43 54 Pariah         $0.7M  $1.5k  96% 


Biggest Debut: Ghost Rider: Spirit of Vengeance (3) 
Weakest Debut: Undefeated (40) 
Biggest Gain: In Darkness (20 places) 
Biggest Loss: Underworld Awakening (6 places) 

CODE:

my $pageToGrab = "http://www.rottentomatoes.com/movie/box_office.php"; 
my $command = "/usr/bin/lynx -dump -width=150 $pageToGrab"; 
my $tempPageFile = `$command`; 


print "## "."## "."Movie Title       "."Weekend  "."Cume "."T-Meter \n"; 
do 
{ 
     if ($tempPageFile =~ /(\d+)\s+(\d+|\-\-)\s+(\d+\%)\s+\[\d+\](.*)\s+(\d+)\s+(\$\d+(?:.\d+)?[Mk])\s+(\$\d+(?:.\d+)?[Mk])\s+(\$\d+(?:.\d+)?[Mk])\s+(\d+)/g) 
     { 
      $curweek[$i] = $1; 
      $lastweek[$i] = $2; 
      $tmeter[$i] = $3; 
      $title[$i] = $4; 
      $weekend[$i] = $7; 
      $cume[$i] = $8; 
      printf("%-4s%-4s%-38s%7s%10s%10s\n",$curweek[$i], $lastweek[$i], $title[$i], $weekend[$i], $cume[$i], $tmeter[$i]); 

      if ($lastweek[$i] ne '--') 
      { 
        $gain = $lastweek[$i] - $curweek[$i]; 
      } 

      if($gain > $largest) 
      { 
        $largest = $gain; 
        $biggestgaintitle = $title[$i]; 
      } 

      if($gain < $smallest) 
      { 
        $smallest = $gain; 
        $biggestlosstitle = $title[$i]; 
      } 

      if($lastweek[$i] eq '--') 
      { 
        $moviedebut[$j] = $curweek[$i]; 
        $lastmovie = $title[$i]; 
        $j++; 
      } 
      $i++; 
    } 
} 
while($i < 38); 
+1

Have you considered using an HTML parser to parse the HTML page? – Borealid 2012-02-20 19:19:16

+0

I have to do it this way for a homework assignment. – Trance339 2012-02-20 19:24:20

+0

But parsing HTML with regular expressions is not the right approach, even if it works in some cases. – 2012-02-20 21:54:56

Answers

2

Here is number 18:

18 12 2% [82]One for the Money 4 $0.8M $25.5M $830 933 

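Note that the third dollar amount ($830) has no M or k suffix, so the mandatory [Mk] in your pattern rejects the whole row. Use [Mk]? instead, perhaps for all three dollar amounts: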

if ($tempPageFile =~ /(\d+)\s+(\d+|\-\-)\s+(\d+\%)\s+\[\d+\](.*)\s+(\d+)\s+(\$\d+(?:.\d+)?[Mk])\s+(\$\d+(?:.\d+)?[Mk])\s+(\$\d+(?:.\d+)?[Mk]?)\s+(\d+)/g) { 
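
To see the effect in isolation, here is a minimal sketch that runs both variants of the amount pattern against the row 18 line quoted above. The names `parse_row`, `$with_suffix` and `$optional_suffix` are made up for this illustration, and the decimal point is escaped as `\.`, which is a small tightening that is not in the original pattern.

```perl
use strict;
use warnings;

# Row 18 exactly as quoted above from the lynx dump.
my $row = '18 12 2% [82]One for the Money 4 $0.8M $25.5M $830 933';

# Hypothetical names. The dot is escaped here, unlike in the original pattern.
my $with_suffix     = qr/\$\d+(?:\.\d+)?[Mk]/;    # M or k suffix required
my $optional_suffix = qr/\$\d+(?:\.\d+)?[Mk]?/;   # M or k suffix optional

# Returns an array ref of the nine captures, or undef when the row does not match.
sub parse_row {
    my ($line, $amount) = @_;
    my @fields = $line =~ /(\d+)\s+(\d+|--)\s+(\d+%)\s+\[\d+\](.*)\s+(\d+)\s+($amount)\s+($amount)\s+($amount)\s+(\d+)/;
    return @fields ? \@fields : undef;
}

for my $case ([required => $with_suffix], [optional => $optional_suffix]) {
    my ($label, $amount) = @$case;
    my $fields = parse_row($row, $amount);
    print "$label suffix: ", ($fields ? "matched '$fields->[3]'" : 'no match'), "\n";
}
```

Building the three amount groups from one `qr//` object keeps them in sync, so making the suffix optional is a one-character change instead of three.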

To truncate:

$title[$i] = substr $4, 0, 35; 

perldoc -f substr
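
As a rough illustration of the truncation, the sketch below applies `substr` to one long and one short title taken from the output in the question. `truncate_title` is a hypothetical helper name. It also shows an alternative that is not part of the answer above: a precision in the `%s` conversion (`%-38.35s`) truncates at print time and leaves the stored title untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: keep at most $max characters of a title.
sub truncate_title {
    my ($title, $max) = @_;
    return substr $title, 0, $max;
}

my $long  = 'Star Wars: Episode I - The Phantom Menace (in 3D)';
my $short = truncate_title($long, 35);
my $same  = truncate_title('Hugo', 35);   # shorter titles come back unchanged

# Alternative: the .35 precision cuts the string to 35 characters,
# then -38 pads it on the right to a 38-character column.
my $cell = sprintf '%-38.35s|', $long;

print "$short\n$same\n$cell\n";
```

Truncating in the format string is convenient when the full title is still needed later, for example for the "Biggest Gain" summary lines.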

+1

Thanks for the help, it worked out perfectly. This is only my second regex program, and I was going crazy trying to fix this huge regex. – Trance339 2012-02-20 19:52:07

+1

You're welcome. You can use the '//x' modifier to make your regex more readable. See 'perldoc perlre'. – toolic 2012-02-20 19:54:35
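
For reference, this is one way the pattern might be laid out with the `/x` modifier, which ignores whitespace inside the pattern and allows `#` comments. It is only a sketch: the variable names are invented, the column descriptions in the comments are deliberately generic because the thread does not name every column, and the decimal point is escaped, which differs slightly from the original.

```perl
use strict;
use warnings;

# Dollar amount with an optional M or k suffix (hypothetical variable name).
my $amount = qr/\$\d+(?:\.\d+)?[Mk]?/;

my $row_re = qr/
    (\d+)        \s+   # 1: rank this week
    (\d+|--)     \s+   # 2: rank last week, or two dashes for a debut
    (\d+%)       \s+   # 3: T-Meter
    \[\d+\] (.*) \s+   # 4: title, right after the numbered lynx link marker
    (\d+)        \s+   # 5: integer column after the title
    ($amount)    \s+   # 6: first dollar amount
    ($amount)    \s+   # 7: second dollar amount
    ($amount)    \s+   # 8: third dollar amount
    (\d+)              # 9: final integer column
/x;

# Row 18 as quoted in the answer.
my $row = '18 12 2% [82]One for the Money 4 $0.8M $25.5M $830 933';
my @f = $row =~ $row_re;
print "rank $f[0], title '$f[3]', third amount $f[7]\n" if @f;
```

Note that comments inside a `/x` pattern should avoid the delimiter character and `$` or `@`, since Perl still scans them when it parses and interpolates the pattern.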