Perl substr: deleting everything between two positions in a string

So I have this file clip.txt, which contains only:

<a href="https://en.wikipedia.org/wiki/Kanye_West">Kanye West</a>, 
<a href="http://en.wikipedia.org/wiki/Chris_Martin">Chris Martin</a>

Now I want to remove everything between < and > so that I end up with:

Kanye West, Chris Martin.

In Perl, this is my current code:

#!/usr/local/bin/perl 

$file = 'clip.txt'; 
open(FILE, $file); 
@lines = <FILE>; 
close(FILE); 
$line = @lines[0]; 

while (index($line, "<") != -1) { 
    my $from = rindex($line, "<"); 
    my $to = rindex($line, ">"); 

    print $from; 
    print ' - '; 
    print $to; 
    print ' '; 

    print substr($line, $from, $to+1); 
    print '|'; # to see where the line stops 
    print "\n"; 
    substr($line, $from, $to+1) = ""; # removes between lines 
    $counter += 1; 

} 

print $line;

All the "print" lines are fairly redundant, but good for debugging.

The result is now:

138 - 141 </a> 
| 
67 - 125 <a href="http://http://en.wikipedia.org/wiki/Chris_Martin">Chris Martin| 
61 - 64 </a>, | 
0 - 50 <a href="https://en.wikipedia.org/wiki/Kanye_West">| 
Kanye West

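First the script finds the positions 138 - 141 and deletes them. Then it finds 67 - 125, but it deletes 67 - 137. Next it finds 61 - 64, but it deletes 61 - 66.

Why does it do this? On the bottom line it finds 0 - 50 and deletes it perfectly. So I can't see the logic here.

Source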

2013-07-28 Mattis Asp

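One solution might be to use an HTML parser for Perl.

The third argument of substr is a length, not an end index, so you should pass $to-$from+1.

(You should also adjust your code to make sure it finds both a < and a >, and that the > comes after the <.)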
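To make the two points above concrete, here is a minimal sketch of the loop with a length passed to substr and with both checks in place. The helper name strip_tags is hypothetical (it is not from the question), and the sample string is the question's data joined onto one line. It searches forward with index rather than backward with rindex, which is just one way to guarantee the > comes after the <.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Remove every <...> span from a string using only index and substr.
# strip_tags is a hypothetical helper name used for illustration.
sub strip_tags {
    my ($line) = @_;
    while (1) {
        my $from = index($line, '<');
        last if $from < 0;                  # no '<' left
        my $to = index($line, '>', $from);  # only look for '>' after the '<'
        last if $to < 0;                    # unmatched '<': leave the rest alone
        # The third argument is a LENGTH, so convert the end index to one.
        substr($line, $from, $to - $from + 1) = '';
    }
    return $line;
}

my $line = '<a href="https://en.wikipedia.org/wiki/Kanye_West">Kanye West</a>, '
         . '<a href="http://en.wikipedia.org/wiki/Chris_Martin">Chris Martin</a>';

print strip_tags($line), "\n";   # prints: Kanye West, Chris Martin
```

With the original call substr($line, $from, $to+1), the deleted span only matched the tag when $from was 0, which is why the last deletion in the question looked correct.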

Source

2013-07-28 10:28:04 ysth

You can use the s/// operator:

$line =~ s/<[^>]+>//g
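As a quick check of that substitution, the following sketch applies it to the question's sample data, joined onto one line here as an assumption about how the file is read. It works on a copy so the original string stays intact. This is only meant for simple input like this; it makes no attempt to handle a > inside an attribute value, comments, or other real-world HTML.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $html = '<a href="https://en.wikipedia.org/wiki/Kanye_West">Kanye West</a>, '
         . '<a href="http://en.wikipedia.org/wiki/Chris_Martin">Chris Martin</a>';

# Copy first, then strip each "<", one or more non-">" characters, and ">".
(my $text = $html) =~ s/<[^>]+>//g;

print "$text\n";   # prints: Kanye West, Chris Martin
```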

Source

2013-07-28 09:44:02 choroba

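Sick, thanks a lot! :D

Or non-greedy? $line =~ s/<.*?>//g :) – jm666

The proper solution really is to use something like HTML::TokeParser::Simple. But if you are just doing this as a learning exercise, you can simplify it by extracting what you want rather than deleting what you don't: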

#!/usr/bin/env perl 

use strict; 
use warnings; 
use feature 'say'; 

while (my $line = <DATA>) { 
    my $x = index $line, '>'; 
    next unless ++$x; 
    my $y = index $line, '<', $x; 
    next unless $y >= 0; 
    say substr($line, $x, $y - $x); 
} 

__DATA__ 
<a href="https://en.wikipedia.org/wiki/Kanye_West">Kanye West</a>, 
<a href="http://en.wikipedia.org/wiki/Chris_Martin">Chris Martin</a>

Output:

Kanye West 
Chris Martin

On the other hand, using an HTML parser is not really that complicated:

#!/usr/bin/env perl 

use strict; 
use warnings; 
use feature 'say'; 

use HTML::TokeParser::Simple; 

my $parser = HTML::TokeParser::Simple->new(\*DATA); 

while (my $anchor = $parser->get_tag('a')) { 
    my $text = $parser->get_text('/a'); 
    say $text; 
} 

__DATA__ 
<a href="https://en.wikipedia.org/wiki/Kanye_West">Kanye West</a>, 
<a href="http://en.wikipedia.org/wiki/Chris_Martin">Chris Martin</a>

Source

2013-07-28 10:07:50

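While a simple regex substitution should do what you want for the sample data, parsing (X)HTML with regexes is generally a bad idea (and doing the same thing with simple character searches is basically no different). A more flexible and more readable approach is to use a proper HTML parser.

Example with Mojo::DOM: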

#!/usr/bin/env perl 

use strict; 
use warnings; 
use feature 'say'; 
use Mojo::DOM; 

# slurp data into a parser object 
my $dom = Mojo::DOM->new(do { local $/; <DATA> }); 

# iterate all links 
for my $link ($dom->find('a')->each) { 

    # print the link text 
    say $link->text; 
} 

__DATA__ 
<a href="https://en.wikipedia.org/wiki/Kanye_West">Kanye West</a>, 
<a href="http://en.wikipedia.org/wiki/Chris_Martin">Chris Martin</a>

Output:

Kanye West 
Chris Martin

Source

2013-07-28 10:14:57 memowe
