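Movie scraper: regex does not grab every movie
This is the output of my program for this link (http://www.rottentomatoes.com/movie/box_office.php). As you can see, I am missing some of the movies on the page; for example, number 18 (One for the Money) is not there. Could anyone check my regex and help me figure out why it does not grab all of the movies, or spot an error in my code that I cannot find?
I should add that I am using the lynx command to fetch the data. Yes, I have to use it =(. I updated the code to show how I get the information from the web page.
Also, I only want to print 35 characters of the movie title, so if a title is longer than that I just want to truncate everything after the 35th character.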
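For the 35-character limit, one possible approach is sketched below. It assumes plain ASCII titles; `truncate_title` is a hypothetical helper name, not something from the code in the question. It shows two equivalent options: cutting the string with `substr`, or letting `printf` do it through a precision in the format (`%-38.35s` pads to 38 columns but prints at most 35 characters).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original program):
# trim surrounding whitespace and keep at most $max characters.
sub truncate_title {
    my ($title, $max) = @_;
    $title =~ s/^\s+|\s+$//g;          # drop leading/trailing whitespace
    return substr($title, 0, $max);    # shorter titles are returned unchanged
}

my $long = "Star Wars: Episode I - The Phantom Menace (in 3D)";

# Option 1: truncate first, then pad to the column width.
printf("%-38s|\n", truncate_title($long, 35));

# Option 2: let the format do it; ".35" caps the printed length at 35.
printf("%-38.35s|\n", $long);
```

Either option keeps the numeric columns aligned even when a title is longer than the column.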
OUTPUT:
## ## Movie Title Weekend Cume T-Meter
1 2 Safe House $78.2M $7.7k 52%
2 1 The Vow $85.5M $8.0k 30%
3 -- Ghost Rider: Spirit of Vengeance $22.0M $6.9k 15%
4 3 Journey 2: The Mysterious Island $53.2M $5.7k 43%
5 -- This Means War $19.2M $5.5k 25%
6 4 Star Wars: Episode I - The Phantom Menace (in 3D) $33.7M $3.0k 57%
7 5 Chronicle $51.0M $2.9k 84%
8 6 The Woman in Black $45.3M $2.6k 63%
9 -- The Secret World of Arrietty $6.4M $4.2k 93%
10 7 The Grey $47.9M $1.4k 78%
11 9 The Descendants $75.0M $2.4k 89%
12 13 The Artist $27.4M $2.9k 97%
13 8 Big Miracle $16.6M $1.3k 73%
14 14 Hugo $66.7M $2.9k 93%
15 11 Red Tails $47.5M $1.4k 36%
16 10 Underworld Awakening $61.3M $1.3k 28%
17 18 The Iron Lady $24.4M $1.7k 53%
19 15 Extremely Loud & Incredibly Close $30.6M $1.1k 45%
20 17 Contraband $65.7M $1.2k 49%
21 23 Alvin and the Chipmunks: Chipwrecked $129.7M $1.2k 13%
22 20 Mission: Impossible Ghost Protocol $207.3M $1.8k 93%
23 22 Tinker Tailor Soldier Spy $22.7M $2.6k 84%
24 29 The Adventures of Tintin $76.4M $1.3k 75%
25 33 A Separation $2.1M $6.2k 99%
27 31 Albert Nobbs $2.4M $1.6k 53%
28 -- Thin Ice $0.2M $3.6k 72%
29 36 My Week with Marilyn $13.6M $1.5k 84%
30 37 A Dangerous Method $5.2M $1.7k 77%
31 35 Puss in Boots $149.0M $1.0k 83%
33 53 In Darkness $0.1M $5.5k 86%
34 44 We Need to Talk About Kevin $0.6M $4.0k 80%
36 48 W.E. $0.2M $2.5k 13%
37 47 Rampart $0.1M $1.8k 73%
38 52 Coriolanus $0.3M $2.9k 94%
39 -- Bullhead $33.6k $4.8k 86%
40 -- Undefeated $30.9k $6.2k 92%
42 55 Chico & Rita $56.2k $5.3k 93%
43 54 Pariah $0.7M $1.5k 96%
Biggest Debut: Ghost Rider: Spirit of Vengeance (3)
Weakest Debut: Undefeated (40)
Biggest Gain: In Darkness (20 places)
Biggest Loss: Underworld Awakening (6 places)
CODE:
my $pageToGrab = "http://www.rottentomatoes.com/movie/box_office.php";
my $command = "/usr/bin/lynx -dump -width=150 $pageToGrab";
my $tempPageFile = `$command`;
print "## "."## "."Movie Title "."Weekend "."Cume "."T-Meter \n";
do
{
    if ($tempPageFile =~ /(\d+)\s+(\d+|\-\-)\s+(\d+\%)\s+\[\d+\](.*)\s+(\d+)\s+(\$\d+(?:.\d+)?[Mk])\s+(\$\d+(?:.\d+)?[Mk])\s+(\$\d+(?:.\d+)?[Mk])\s+(\d+)/g)
    {
        $curweek[$i] = $1;
        $lastweek[$i] = $2;
        $tmeter[$i] = $3;
        $title[$i] = $4;
        $weekend[$i] = $7;
        $cume[$i] = $8;
        printf("%-4s%-4s%-38s%7s%10s%10s\n",$curweek[$i], $lastweek[$i], $title[$i], $weekend[$i], $cume[$i], $tmeter[$i]);
        if ($lastweek[$i] ne '--')
        {
            $gain = $lastweek[$i] - $curweek[$i];
        }
        if($gain > $largest)
        {
            $largest = $gain;
            $biggestgaintitle = $title[$i];
        }
        if($gain < $smallest)
        {
            $smallest = $gain;
            $biggestlosstitle = $title[$i];
        }
        if($lastweek[$i] eq '--')
        {
            $moviedebut[$j] = $curweek[$i];
            $lastmovie = $title[$i];
            $j++;
        }
        $i++;
    }
}
while($i < 38);
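One thing that might explain the skipped rows, offered as a guess rather than a diagnosis: the regex requires every dollar column to end in `M` or `k`, so a row containing a plain amount such as `$917` would never match. The sketch below makes that suffix optional, escapes the decimal point, uses a non-greedy title, and walks the page with `while (... =~ /.../g)` instead of `if` inside `do/while` (in scalar context a failed `/g` match resets the search position, so the original loop can start over from the top of the page). The sample rows, the column meanings (weeks, weekend, total, average, theaters) and the names `$money`, `$row` and `@rows` are all assumptions made for illustration; the real `lynx -dump` layout may differ.

```perl
use strict;
use warnings;

# Made-up rows in the layout the original regex implies:
# rank, last rank, T-meter, [link number]title, weeks, three dollar columns, theaters.
my $page = <<'END';
   1   2   52%  [30]Example Movie A           2  $23.6M  $78.2M  $7.7k  3121
   2   --  15%  [31]Example Movie B           1  $22.0M  $22.0M  $6.9k  3174
   3   12  2%   [32]Example Movie C           4  $0.9M   $25.5M  $917   1012
END

# Assumed money format: "$", digits, optional decimal part, optional M/k suffix.
my $money = qr/\$\d+(?:\.\d+)?[Mk]?/;

# One row per line; the title is matched lazily so it stops before the week count.
my $row = qr/^\s*(\d+)\s+(\d+|--)\s+(\d+%)\s+\[\d+\](.*?)\s+(\d+)\s+($money)\s+($money)\s+($money)\s+(\d+)\s*$/m;

my @rows;
while ($page =~ /$row/g) {    # each successful match resumes where the last one ended
    push @rows, { rank => $1, last => $2, tmeter => $3, title => $4,
                  weeks => $5, weekend => $6, total => $7, average => $8 };
}

# Print rank, last rank, title (capped at 35 characters) and three of the columns.
printf("%-4s%-4s%-38.35s%8s%8s%6s\n",
       @{$_}{qw(rank last title weekend total tmeter)}) for @rows;
```

With the optional suffix, the third sample row (per-theater average `$917`) is captured; with the original `[Mk]` requirement it would be dropped. Comparing the raw `lynx` text for row 18 against the pattern would confirm or rule this out.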
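Have you considered using an HTML parser to parse the HTML page? – Borealid 2012-02-20 19:19:16
I have to do it this way for an assignment. – Trance339 2012-02-20 19:24:20
But parsing HTML with regular expressions is not the right approach, even if it works in some cases. – 2012-02-20 21:54:56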