2013-02-13 69 views
0

Perl WWW::Mechanize: store a URL unless it has already been found

I have a problem that I hope you can help with.

foreach my $url (keys %{$newURLs}) { 
    # first get the base URL and save its content length 
    $mech->get($url); 
    my $content_length = $mech->response->header('Content-Length'); 

    # now iterate all the 'child' URLs 
    foreach my $child_url (@{ $newURLs->{$url} }) { 
        # get the content 
        $mech->get($child_url); 

        # compare 
        if ($mech->response->header('Content-Length') != $content_length) { 
            print "$child_url: different content length: $content_length vs " 
                . $mech->response->header('Content-Length') . "!\n"; 
            #HERE I want to store the urls that are found to have different content 
            #lengths to the base url 
            #only if the same url has not already been stored 
        } elsif ($mech->response->header('Content-Length') == $content_length) { 
            print "Content lengths are the same\n"; 
            #HERE I want to store the urls that are found to have the same content 
            #length as the base url 
            #only if the same url has not already been stored 
        } 
    } 
} 

The problem I am having:

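As you can see in the code above, I want to store the URLs depending on whether their content length is the same as or different from that of the base URL. I should end up with one set of URLs whose content length differs from their base URL, and another set whose content length matches it.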

I know how to do this easily using arrays:

push (@differentContentLength, $url); 
push (@sameContentLength, $url); 

But how would I go about doing this with a hash (or another preferred method)?

I am still getting to grips with hashes, so your help would be much appreciated.

Many thanks.

+0

You should add the closing braces to your loops. – simbabque 2013-02-13 12:18:09

+0

@simbabque - yes, you're right, apologies – 2013-02-13 12:25:07

Answers

1

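You can create a hashref outside the loops to store all the URLs. Let's call it $content_lengths. It is a scalar because it is a reference to a hash. In your $child_url loop, add the content length to that data structure. We first use the base URL as a key, which gives us another hashref inside $content_lengths->{$url}. There we decide whether we need 'equal' or 'different'. Inside each of those two keys is another hashref that holds the $child_url entries, which in turn have their content length as the value. Because hash keys are unique, the same URL can never be stored twice. Of course, if you don't want to store the length, you could just say ++ here instead.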

my $content_lengths; # this is at the top 
foreach my $url (# ... more stuff 

# compare 
if ($mech->response->header('Content-Length') != $content_length) { 
    print "$child_url: different content length: $content_length vs " 
    . $mech->response->header('Content-Length') . "!\n"; 

    # store the urls that are found to have different content 
    # lengths to the base url only if the same url has not already been stored 
    $content_lengths->{$url}->{'different'}->{$child_url} = $mech->response->header('Content-Length'); 

} elsif ($mech->response->header('Content-Length') == $content_length) { 
    print "Content lengths are the same\n"; 

    # store the urls that are found to have the same content length as the base 
    # url only if the same url has not already been stored 
    $content_lengths->{$url}->{'equal'}->{$child_url} = $mech->response->header('Content-Length'); 
} 
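
For illustration, here is a minimal, self-contained sketch of the same nested structure that runs without any network access. The %fake_length table, the example.com URLs, and the classify() helper are hypothetical stand-ins for the $mech->get(...) calls and the Content-Length headers above; they are not part of WWW::Mechanize.

```perl
use strict;
use warnings;

# Hypothetical data standing in for real responses: URL => Content-Length.
my %fake_length = (
    'http://example.com/'  => 100,
    'http://example.com/a' => 100,
    'http://example.com/b' => 250,
);

# Same shape as $newURLs in the question: base URL => list of child URLs.
my $newURLs = {
    'http://example.com/' => [
        'http://example.com/a',
        'http://example.com/b',
        'http://example.com/b',    # duplicate on purpose
    ],
};

# classify() is a hypothetical helper. It builds
# { base URL => { equal|different => { child URL => length } } }.
sub classify {
    my ($urls, $lengths) = @_;
    my $content_lengths = {};
    foreach my $url (keys %{$urls}) {
        my $base_length = $lengths->{$url};
        foreach my $child_url (@{ $urls->{$url} }) {
            my $child_length = $lengths->{$child_url};
            my $bucket = $child_length == $base_length ? 'equal' : 'different';
            # Hash keys are unique, so a repeated child URL overwrites itself.
            $content_lengths->{$url}{$bucket}{$child_url} = $child_length;
        }
    }
    return $content_lengths;
}

my $content_lengths = classify($newURLs, \%fake_length);

# Walk the structure with keys to list each stored URL once.
foreach my $url (sort keys %{$content_lengths}) {
    foreach my $bucket (sort keys %{ $content_lengths->{$url} }) {
        foreach my $child_url (sort keys %{ $content_lengths->{$url}{$bucket} }) {
            print "$url [$bucket] $child_url\n";
        }
    }
}
```

The duplicate child URL ends up as a single key under 'different', which is the "only if not already stored" behaviour asked for in the question.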
+0

When you say 'use ++ if you don't want the length to be stored', should that be written as `$content_lengths->{$url}->{'different'}->{$child_url}++;`? And to clarify, what exactly is the `++` doing? – 2013-02-13 14:04:56

+1

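@perl-user Yes, that's what I meant. It is the increment shorthand operator. It adds 1 to the variable on its left and assigns the result back to it. So they will all have the value 1. If one of the sites is seen twice, its value will be 2. This is how counters and 'remember the name, but don't care how many' are implemented. You can access the names with `keys`. Think of it like `GROUP BY` in SQL. – simbabque 2013-02-13 15:53:45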

+0

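Oh, I see now how that prevents duplicates (using Data::Dumper): instead of adding another copy of a duplicate URL, it just registers its presence by incrementing the number assigned to that URL. The number itself doesn't matter, since we are not interested in that part. Thanks, your comment above explains it well :) – 2013-02-13 16:40:17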
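
The `++` idiom discussed in these comments can be sketched on its own as follows. This is a minimal illustration under assumed data: the @found list and the %seen hash are hypothetical names, not taken from the code above.

```perl
use strict;
use warnings;

# Hypothetical list of URLs, with one repeated.
my @found = (
    'http://example.com/a',
    'http://example.com/b',
    'http://example.com/a',
);

my %seen;
$seen{$_}++ for @found;    # a new key starts at 1, a repeated key counts up

# keys returns each URL once, however often it was seen.
my @unique = sort keys %seen;
print "$_ (seen $seen{$_} time(s))\n" for @unique;
```

If the counts are not needed, only the keys are read and the values are simply ignored.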

1

Please check this solution:

my %content_length; 

foreach my $url (keys %{$newURLs}) { 
    # first get the base URL and save its content length 
    $mech->get($url); 
    my $content_length = $mech->response->header('Content-Length'); 

    # now iterate all the 'child' URLs 
    foreach my $child_url (@{ $newURLs->{$url} }) { 
        # get the content 
        $mech->get($child_url); 
        my $new_content_length = $mech->response->header('Content-Length'); 
        # report new URLs; compare lengths only for URLs already in the hash, 
        # so an undefined entry is never used in a numeric comparison 
        if (! defined $content_length{$child_url}) { 
            print "New URL! url: $child_url\n"; 
        } elsif ($new_content_length != $content_length{$child_url}) { 
            print "Different content_length! url: $child_url, old_content_length: $content_length, new_content_length: $new_content_length\n"; 
        } 
        # store in hash 
        $content_length{$child_url} = $new_content_length; 
    } 
}
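
If the two arrays from the question are preferred (for example, to keep the URLs in the order they were found), a hash can still act as the "already stored" check. The following is only a sketch: the remember() helper, the %seen_different hash, and the example.com URLs are hypothetical names introduced for illustration.

```perl
use strict;
use warnings;

# remember() is a hypothetical helper: it appends $url to the array
# only the first time that URL is offered, tracked in the %$seen hash.
sub remember {
    my ($list, $seen, $url) = @_;
    # Post-increment returns the old count, which is false on first sight.
    push @{$list}, $url unless $seen->{$url}++;
    return;
}

my (@differentContentLength, %seen_different);
remember(\@differentContentLength, \%seen_different, $_)
    for qw(http://example.com/b http://example.com/c http://example.com/b);

print "$_\n" for @differentContentLength;
```

A second array and hash pair would be used in the same way for the URLs whose content length matches the base URL.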