Perl script to open a file, fetch URLs, and clean the HTML


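I'm new to Perl and to programming in general, and I'm facing some real problems here. I need a Perl script that can open a text file, read a list of URLs, fetch the content of each page, clean the HTML, and save the content to another file.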

Thanks a lot for your guidance.


2012-05-16 stomp

"Make HTML cleaning"? – Borodin

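It would go like this:

Open the text file using open
Read from it using while (<$fh>) { ... }
chomp each line to remove the newline
Fetch each URL using the LWP module
Do the HTML cleaning
Write to a file using open and print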
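The steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the names process_urls, fetch_url, and clean_html are hypothetical, and because the question never defines "cleaning", clean_html just strips tags with a crude regex as a stand-in. The fetcher is passed in as a code reference so that the file handling can be tried without network access.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical placeholder for "HTML cleaning": crudely removes tags.
# A real parser-based module would be more robust than this regex.
sub clean_html {
    my ($html) = @_;
    $html =~ s/<[^>]*>//g;
    return $html;
}

# Fetch one URL with LWP::UserAgent; returns the body, or undef on failure.
sub fetch_url {
    my ($url) = @_;
    require LWP::UserAgent;    # loaded only when a real fetch is needed
    my $ua       = LWP::UserAgent->new(timeout => 10);
    my $response = $ua->get($url);
    return $response->is_success ? $response->decoded_content : undef;
}

# Read URLs from $in_path, fetch and clean each page, append to $out_path.
sub process_urls {
    my ($in_path, $out_path, $fetch) = @_;
    $fetch ||= \&fetch_url;

    open(my $in,  '<', $in_path)                  or die "Cannot open $in_path: $!";
    open(my $out, '>:encoding(UTF-8)', $out_path) or die "Cannot open $out_path: $!";

    while (my $url = <$in>) {
        chomp $url;                 # remove the trailing newline
        next unless length $url;    # skip blank lines

        my $html = $fetch->($url);
        unless (defined $html) {
            warn "Failed to fetch $url\n";
            next;
        }
        print {$out} clean_html($html), "\n";
    }

    close $in;
    close $out or die "Cannot close $out_path: $!";
    return;
}

# Usage: perl script.pl urls.txt output.txt
process_urls(@ARGV) if @ARGV == 2;
```

A failed fetch only produces a warning here, so one bad URL does not stop the rest of the list; whether that is the right behaviour depends on what you need.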


2012-05-17 00:23:30 Borodin

See the real-life example below; a simple way to do this is:

The file to read from:

$ cat /tmp/list.txt 
http://stackoverflow.com/questions/10627644/perl-script-to-open-file-get-url-and-make-html-cleaning 
http://google.com

The Perl code; I use the basic LWP::UserAgent "browser":

#!/usr/bin/env perl

use strict;
use warnings;

use LWP::UserAgent;

open my $fh, '<', '/tmp/list.txt' or die "Cannot open /tmp/list.txt: $!";

my $ua = LWP::UserAgent->new;

$ua->timeout(10);

while (my $line = <$fh>) {
    chomp $line;    # strip the trailing newline from the URL
    my $response = $ua->get($line);

    if ($response->is_success) {
        # you need another file handle here to write to a new file
        print $response->decoded_content;
    }
    else {
        die $response->status_line;
    }
}

close $fh;

This is a good base; you need to do a little more work to cover all your requirements: - use another file handle to write to a new file - clean the HTML

Edit: I'm really not sure what "cleaning" means here: do you want to dump the page as text without any HTML? If so, consider:

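#!/usr/bin/env perl

use strict;
use warnings;

while (my $url = <>) {
    chomp $url;    # drop the newline so it does not end up in the command
    # the host name captured in $1 becomes the output file name
    system(qq{links -dump "$url" > "$1"}) if $url =~ m!https?://([^/]+)!;
}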

Then, in your shell, you can call the script like this:

$ perl script.pl < /path/to/URLs.list
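Because the URL is interpolated into a shell command in the script above, a URL containing quotes or other shell metacharacters could break it. Below is a minimal sketch of one way around that, assuming links is installed: the list form of a pipe open runs the command without a shell. The helper name dump_url and its optional command argument are hypothetical, added only for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Run a text-dumping command on $url and save its output to $out_path.
# The command defaults to `links -dump`. Passing it as a list means no
# shell is involved, so the URL needs no quoting or escaping.
sub dump_url {
    my ($url, $out_path, @cmd) = @_;
    @cmd = ('links', '-dump') unless @cmd;

    open(my $pipe, '-|', @cmd, $url) or die "Cannot run $cmd[0]: $!";
    open(my $out,  '>',  $out_path)  or die "Cannot write $out_path: $!";

    print {$out} $_ while <$pipe>;

    close $out  or die "Cannot close $out_path: $!";
    close $pipe or warn "$cmd[0] exited with status $?\n";
    return;
}

# When URL list files are named on the command line, dump each URL to a
# file named after its host, as the script above does.
if (@ARGV) {
    while (my $url = <>) {
        chomp $url;
        my ($host) = $url =~ m{^https?://([^/]+)} or next;
        dump_url($url, $host);
    }
}
```

This version would be called as perl script.pl /path/to/URLs.list, with the list passed as a file name rather than redirected to standard input.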


2012-05-17 00:30:36 wam

Here is an example of how it could be done, including the HTML cleaning and saving to a file:

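#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::Clean;

open my $in, '<', '/path/to/file/urls.txt' or die $!;
while (my $url = <$in>) {
    chomp $url;
    my $content = get($url);
    next unless defined $content;    # get() returns undef on failure

    my $h = HTML::Clean->new(\$content);
    $h->compat();
    $h->strip();
    my $data = $h->data();           # reference to the cleaned HTML

    # use the host name as the file name
    my ($host) = $url =~ m{^https?://([^/]+)} or next;

    open my $out, '>>', "/path/to/file/$host.html" or die $!;
    binmode($out, ':utf8');
    print {$out} $$data;
    close $out;
}
close $in;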

This will save 'http://url.com/something' as 'url.com.html'
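As an alternative to a hand-written regex, the host name can be taken from the URL with the URI module, which LWP itself depends on. This is a small sketch, not part of the answer above; the helper name filename_for_url is hypothetical.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use URI;

# Hypothetical helper: map a URL to "<host>.html".
# Returns undef when the URL has no host component.
sub filename_for_url {
    my ($url) = @_;
    my $uri = URI->new($url);
    return undef unless $uri->can('host');    # some URI schemes have no host

    my $host = $uri->host;
    return (defined($host) && length($host)) ? "$host.html" : undef;
}

print filename_for_url('http://url.com/something'), "\n";    # url.com.html
```

Note that all pages from the same host map to the same file name, which is why the script above opens the output file in append mode.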


2012-05-17 00:50:58 orhanhenrik
