2012-02-15 53 views
1

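Problem parsing with a regular expression

I want to write a program that uses the lynx command on this page, "http://www.rottentomatoes.com/movie/box_office.php", and I can't seem to wrap my head around one problem: getting the titles on their own. The difficulty is that a title can contain special characters and numbers, and titles vary in length. I want to write a regular expression that can parse the whole page and find lines like the ones below. (I added spaces between the title and the next number, which is how many weeks the film has been out, to distinguish the title from the weeks-released figure.)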

1 -- 30% The Vow           1 $41.2M $41.2M $13.9k 2958 
2 -- 53% Safe House          1 $40.2M $40.2M $12.9k 3119 
3 -- 42% Journey 2: The Mysterious Island     1 $27.3M $27.3M $7.9k 3470 
4 -- 57% Star Wars: Episode I - The Phantom Menace (in 3D) 1 $22.5M $22.5M $8.5k 2655 
5 1 86% Chronicle           2 $12.1M $40.0M $4.2k 2908 

The regular expression I started with is:

/(\d+)\s(\d+|\-\-)\s(\d+\%)\s 

If anyone could help me figure out how to grab the title successfully, it would be much appreciated! Thanks in advance.

+2

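Is your task to parse the page, or to write a regular expression that parses the page? If it is the former, you should consider using a DOM library instead of a regular expression. – Borealid 2012-02-15 17:21:55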

+0

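Is using a regular expression required for this? Since the data is already justified into columns, why not cut out the appropriate columns and then apply a trim function? – VeeArr 2012-02-15 17:22:52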

+0

I completely agree with both of you, but the assignment is to use the lynx command and parse all the information =/ – Trance339 2012-02-15 17:25:56

Answers

2

Capture all the things!

^(\d+)\s+(\d+|\-\-)\s+(\d+\%)\s+(.*)\s+(\d+)\s+(\$\d+(?:\.\d+)?[Mk])\s+(\$\d+(?:\.\d+)?[Mk])\s+(\$\d+(?:\.\d+)?[Mk])\s+(\d+)$

Explanation:

^       <- Start of the line
    (\d+)\s+     <- Digits (captured) followed by one or more whitespace characters
    (\d+|\-\-)\s+   <- Digits [or "--"] (captured) followed by one or more whitespace characters
    (\d+\%)\s+    <- Digits [with '%'] (captured) followed by one or more whitespace characters
    (.*)\s+     <- Anything [the title; ideally matched non-greedily] (captured) followed by one or more whitespace characters
    (\d+)\s+     <- Digits (captured) followed by one or more whitespace characters
    (\$\d+(?:\.\d+)?[Mk])\s+ <- "$" and digits [with optional decimal part] and "M" or "k" (captured) followed by one or more whitespace characters
    (\$\d+(?:\.\d+)?[Mk])\s+ <- "$" and digits [with optional decimal part] and "M" or "k" (captured) followed by one or more whitespace characters
    (\$\d+(?:\.\d+)?[Mk])\s+ <- "$" and digits [with optional decimal part] and "M" or "k" (captured) followed by one or more whitespace characters
    (\d+)     <- Digits (captured)
$       <- End of the line
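
As a rough illustration, the pattern above can be tried on the sample lines from the question. This is only a sketch under stated assumptions: Python is chosen for the example (the thread does not name a language), the `parse_line` helper and `SAMPLE` list are hypothetical names, the decimal point is written as `\.`, and the title group uses the lazy form `(.*?)` so the padding before the weeks column is left out. Lines are stripped first because captured lynx output may carry trailing whitespace, which the `$` anchor would otherwise reject.

```python
import re

MONEY = r"\$\d+(?:\.\d+)?[Mk]"  # a dollar amount such as $41.2M or $13.9k

LINE_RE = re.compile(
    r"^(\d+)\s+(\d+|--)\s+(\d+%)\s+"          # number, number or "--", percentage
    r"(.*?)\s+"                               # title (lazy, so padding is excluded)
    r"(\d+)\s+"                               # weeks since release
    rf"({MONEY})\s+({MONEY})\s+({MONEY})\s+"  # three dollar amounts
    r"(\d+)$"                                 # final number
)

SAMPLE = [
    "1 -- 30% The Vow           1 $41.2M $41.2M $13.9k 2958 ",
    "3 -- 42% Journey 2: The Mysterious Island     1 $27.3M $27.3M $7.9k 3470 ",
    "4 -- 57% Star Wars: Episode I - The Phantom Menace (in 3D) 1 $22.5M $22.5M $8.5k 2655 ",
    "5 1 86% Chronicle           2 $12.1M $40.0M $4.2k 2908 ",
]


def parse_line(line):
    """Return the nine captured fields as a tuple, or None if the line does not match."""
    match = LINE_RE.match(line.strip())  # strip: tolerate trailing whitespace
    return match.groups() if match else None


for line in SAMPLE:
    fields = parse_line(line)
    if fields is not None:
        print(repr(fields[3]))  # the title is the fourth captured group
```

Lines that are not box-office rows (headers, navigation text) simply return `None`, so the same helper can be run over every line of the page.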

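So, seriously, here is what I did: I cheated a little and captured everything you want (I think you would end up doing that anyway), so that the rest of the pattern works like a look-ahead for the title capture.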

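In a non-greedy regex, (.*) [or (.*?), if you want to force "ungreediness"] captures as few characters as possible, while the rest of the regex tries to capture everything else. In most engines (.*) is greedy by default, so prefer (.*?) if the title should come back without its trailing padding.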

That group of the regex ends up capturing only the title (the only thing left).
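
The difference between the greedy and lazy forms can be seen on a fragment of one sample line. The snippet below is a small sketch, assuming Python's `re` module (greedy by default) and a shortened tail pattern chosen only for illustration; the variable names are hypothetical.

```python
import re

# Fragment of a sample line: percentage, padded title, weeks, first amount.
line = "30% The Vow           1 $41.2M"

# Greedy: (.*) grabs as much as it can, then backtracks only as far as needed,
# so it gives back a single space to \s+ and keeps the rest of the padding.
greedy = re.match(r"^\d+%\s+(.*)\s+(\d+)\s+\$", line)

# Lazy: (.*?) grows one character at a time and stops at the end of the title,
# leaving all of the padding to \s+.
lazy = re.match(r"^\d+%\s+(.*?)\s+(\d+)\s+\$", line)

print(repr(greedy.group(1)))  # title with trailing spaces
print(repr(lazy.group(1)))    # title only
```

Both forms find the same weeks value in the second group; only the whitespace attached to the title differs, so a greedy capture can also be cleaned up afterwards with a trim.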

What you could do instead is use an actual look-ahead assertion.
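
One possible reading of that suggestion, sketched in Python: match the title lazily after the percentage and let a look-ahead `(?=...)` assert that the weeks column and the start of a dollar amount follow, without consuming them. The language, the names `TITLE_RE` and `title_of`, and the exact pattern are assumptions for illustration, not the answerer's code.

```python
import re

# \d+%\s+              the percentage and the whitespace after it
# (.*?)                the title, as short as possible
# (?=\s+\d+\s+\$\d)    look-ahead: weeks number, then "$" and a digit must follow
TITLE_RE = re.compile(r"\d+%\s+(.*?)(?=\s+\d+\s+\$\d)")

SAMPLE = [
    "1 -- 30% The Vow           1 $41.2M $41.2M $13.9k 2958 ",
    "2 -- 53% Safe House          1 $40.2M $40.2M $12.9k 3119 ",
    "3 -- 42% Journey 2: The Mysterious Island     1 $27.3M $27.3M $7.9k 3470 ",
    "4 -- 57% Star Wars: Episode I - The Phantom Menace (in 3D) 1 $22.5M $22.5M $8.5k 2655 ",
    "5 1 86% Chronicle           2 $12.1M $40.0M $4.2k 2908 ",
]


def title_of(line):
    """Return the title found in a box-office line, or None if there is none."""
    match = TITLE_RE.search(line)
    return match.group(1) if match else None


titles = [title_of(line) for line in SAMPLE]
print(titles)
```

Because the look-ahead requires a number followed by a dollar amount, digits inside a title (such as the "2" in "Journey 2" or the "3" in "(in 3D)") do not end the match early.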


Resources:

+0

This definitely helps with grabbing everything on the line! Thanks for your help! – Trance339 2012-02-15 17:44:26

+0

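I did most of the work (the assignment) because I don't think there is an easy way to explain a regex without giving the regex itself. But I hope you read the explanation, and that you will read the regular-expressions.info link (it is really interesting anyway). – 2012-02-15 17:47:37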

+0

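Grabbing this is only a small part of the assignment. I am writing more regexes; I was just having trouble working out how to get the whole title. This is only my second regex assignment, so I am still trying to figure it all out. – Trance339 2012-02-15 17:52:56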