2014-07-21 57 views

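WWW::Mechanize does not handle apostrophes or dashes

I have been trying to extract information from Metacritic, but I have run into a problem: text containing apostrophes or dashes does not come out cleanly.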

use WWW::Mechanize; 
$reviewspage = 'http://www.metacritic.com/movie/a-band-called-death/critic-reviews'; 
$Review = 'In the end Death triumphs, but its allure and obsession remain a mystery.'; 
$l = WWW::Mechanize->new(); 
    $l->get($reviewspage); 
    $k = $l->content; 
    @Review = $k =~ m{$Review.*?<div class="review_body">(.*?)</div>}s; 
    print "@Review\n"; 

Output:

       Too much of the doc takes our taste for granted; Alice Cooper, Henry Rollins and others won’t persuade you that Death could have been huge, nor does a clichéd last-act reunion show. But the film’s alternating inquiry â€」 into family love, slow compromise and, yes, death â€」 resonates strongly. 

Even though the markup on the website is:

<div class="review_body"> 
           Too much of the doc takes our taste for granted; Alice Cooper, Henry Rollins and others won’t persuade you that Death could have been huge, nor does a clichéd last-act reunion show. But the film’s alternating inquiry — into family love, slow compromise and, yes, death — resonates strongly. 
          </div> 

The code above illustrates the problem. I have written similar scripts with WWW::Mechanize before, and none of them replaced characters like this.

Answers


Metacritic uses the UTF-8 character set:

<meta http-equiv="content-type" content="text/html; charset=UTF-8"> 

So to print this content to the console, the output has to be encoded in that character set.

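On my Windows machine, I had to run chcp 65001 in the console (which switches the console code page to UTF-8) before executing my Perl script. I also had to specify the STDOUT character set: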

use strict; 
use warnings; 
use utf8; 

use WWW::Mechanize; 

binmode STDOUT, ':utf8'; # output should be in UTF-8 

my $url = 'http://www.metacritic.com/movie/a-band-called-death/critic-reviews'; 
my $Review = 'In the end Death triumphs, but its allure and obsession remain a mystery.'; 

my $lwp = WWW::Mechanize->new(); 
$lwp->get($url); 
my $data = $lwp->content; 

if ($data =~ m{$Review.*?<div class="review_body">(.*?)</div>}s) { 
    print "$1\n"; 
} else { 
    warn "Review not found"; 
} 

Output (line breaks added):

Too much of the doc takes our taste for granted; Alice Cooper, Henry Rollins 
and others won’t persuade you that Death could have been huge, nor does a 
clichéd last-act reunion show. But the film’s alternating inquiry — into 
family love, slow compromise and, yes, death — resonates strongly. 
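To see where the garbage in the question's output comes from, here is a minimal sketch. It assumes the console interpreted the output as Windows-1252; the question does not say which code page was active, so that part is a guess. The helper names hex_bytes and code_points are made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# Hypothetical helper: render a character string as U+XXXX code points.
sub code_points {
    my ($chars) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $chars;
}

# The em dash is a single character, U+2014.
my $dash = "\x{2014}";

# Encoded as UTF-8 it becomes three bytes.
my $bytes = encode('UTF-8', $dash);
print "UTF-8 bytes: ", hex_bytes($bytes), "\n";      # E2 80 94

# Assumption: the console reads those bytes as Windows-1252,
# so each byte turns into a separate character.
my $misread = decode('cp1252', $bytes);
print "misread as:  ", code_points($misread), "\n";  # U+00E2 U+20AC U+201D
```

U+00E2 and U+20AC are the "â€" seen at the start of each garbled dash in the question. The data itself was fine; only the interpretation of the bytes on output was wrong.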

Clearly, this is a Unicode problem.

Following the suggestion in this answer, I was able to get this version of the code working:

use v5.12 ; 
use utf8 ; 
use open qw(:encoding(UTF-8) :std) ; 

use WWW::Mechanize ; 
my $reviewspage = 'http://www.metacritic.com/movie/a-band-called-death/critic-reviews' ; 
my $Review = 'In the end Death triumphs, but its allure and obsession remain a mystery.' ; 
my $l = WWW::Mechanize->new() ; 
    $l->get($reviewspage) ; 
    my $k = $l->content ; 
    my @Review = $k =~ m{$Review.*?<div class="review_body">(.*?)</div>}s ; 
    print "@Review\n" ; 

Output:

        Too much of the doc takes our taste for granted; Alice Cooper, Henry Rollins and others won’t persuade you that Death could have been huge, nor does a clichéd last-act reunion show. But the film’s alternating inquiry — into family love, slow compromise and, yes, death — resonates strongly.
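What an encoding layer does to a string can be checked without a network request. The sketch below writes to an in-memory filehandle instead of STDOUT so that the resulting bytes can be inspected. The function name utf8_bytes_for is made up for this illustration, and the sample string is just a word from the review.

```perl
use strict;
use warnings;

# Hypothetical helper: push a character string through a UTF-8 output
# layer and return the bytes that would reach the terminal or file.
sub utf8_bytes_for {
    my ($text) = @_;
    my $buffer = '';
    # In-memory filehandle with an explicit :encoding(UTF-8) layer,
    # the same kind of layer that "use open qw(:encoding(UTF-8) :std)"
    # or binmode applies to STDOUT.
    open(my $out, '>:encoding(UTF-8)', \$buffer)
        or die "cannot open in-memory handle: $!";
    print {$out} $text;
    close $out;
    return $buffer;
}

my $text  = "won\x{2019}t";   # U+2019 is the curly apostrophe
my $bytes = utf8_bytes_for($text);

printf "%d characters became %d bytes\n", length($text), length($bytes);
# prints "5 characters became 7 bytes"
```

Without such a layer, Perl emits a "Wide character in print" warning for code points above 255 (when warnings are enabled). With the layer in place, the output is well-formed UTF-8 and only needs a console that displays UTF-8.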