2012-03-04 87 views
0

Removing leftover HTML markup

I have a lab assignment and I'm stuck on a problem with removing HTML tags. Here is the method that removes the HTML tags:

public String getFilteredPageContents() {
    String str = getUnfilteredPageContents();
    String temp = "";
    boolean b = false;
    for (int i = 0; i < str.length(); i++) {
        if (str.charAt(i) == '&' || str.charAt(i) == '<') {
            b = true;
        }
        if (b == false) {
            temp += str.charAt(i);
        }
        if (str.charAt(i) == '>' || str.charAt(i) == ';') {
            b = false;
        }
    }
    return temp;
}

Here is my text in its original form:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN"> 
<html> 

<head> 
<meta http-equiv="Content-Type" 
content="text/html; charset=iso-8859-1"> 
<meta name="GENERATOR" content="Microsoft FrontPage 2.0"> 
<title>A Shropshire Lad</title> 
</head> 

<body bgcolor="#008000" text="#FFFFFF" topmargin="10" 
leftmargin="20"> 

<p align="center"><font size="6"><strong></strong></font>&nbsp;</p> 
<div align="center"><center> 

<pre><font size="7"><strong>A Shropshire Lad 
</strong></font><strong> 
by A.E. Housman 
Published by Dover 1990</strong></pre> 
</center></div> 

<p><strong>This collection of sixty three poems appeared in 1896. 
Many of them make references to Shrewsbury and Shropshire, 
however, Housman was not a native of the county. The Shropshire 
of his book is a mindscape in which he blends old ballad meters, 
classical reminiscences and intense emotional experiences 
&quot;recollected in tranquility.&quot; Although they are not 
particularly to my taste, their style, simplicity and 
timelessness are obvious even to me. Below are two short poems 
which amused me, I hope you find them interesting too.</strong></p> 

<hr size="8" width="80%" color="#FFFFFF"> 
<div align="left"> 

<pre><font size="5"><strong><u> 
XIII</u></strong></font><font size="4"><strong> 

When I was one-and-twenty 
I heard a wise man say, 
'Give crowns and pounds and guineas 
But not your heart away;</strong></font></pre> 
</div><div align="left"> 

<pre><font size="4"><strong>Give pearls away and rubies 
But keep your fancy free. 
But I was one-and-twenty, 
No use to talk to me.</strong></font></pre> 
</div><div align="left"> 

<pre><font size="4"><strong>When I was one-and-twenty 
I heard him say again, 
'The heart out of the bosom 
Was never given in vain; 
'Tis paid with sighs a plenty 
And sold for endless rue' 
And I am two-and-twenty, 
And oh, 'tis true 'tis true. 

</strong></font><strong></strong></pre> 
</div> 

<hr size="8" width="80%" color="#FFFFFF"> 

<pre><font size="5"><strong><u>LVI . The Day of Battle</u></strong></font><font 
size="4"><strong> 

'Far I hear the bugle blow 
To call me where I would not go, 
And the guns begin the song, 
&quot;Soldier, fly or stay for long.&quot;</strong></font></pre> 

<pre><font size="4"><strong>'Comrade, if to turn and fly 
Made a soldier never die, 
Fly I would, for who would not? 
'Tis sure no pleasure to be shot.</strong></font></pre> 

<pre><font size="4"><strong>'But since the man that runs away 
Lives to die another day, 
And cowards' funerals, when they come, 
Are not wept so well at home,</strong></font></pre> 

<pre><font size="4"><strong>'Therefore, though the best is bad, 
Stand and do the best, my lad; 
Stand and fight and see your slain, 
And take the bullet in your brain.'</strong></font></pre> 

<hr size="8" width="80%" color="#FFFFFF"> 
</body> 
</html> 

And this is the text after my method has been applied to it:

charset=iso-8859-1"> 

A Shropshire Lad 







A Shropshire Lad 

by A.E. Housman 
Published by Dover 1990 


This collection of sixty three poems appeared in 1896. 
Many of them make references to Shrewsbury and Shropshire, 
however, Housman was not a native of the county. The Shropshire 
of his book is a mindscape in which he blends old ballad meters, 
classical reminiscences and intense emotional experiences 
recollected in tranquility. Although they are not 
particularly to my taste, their style, simplicity and 
timelessness are obvious even to me. Below are two short poems 
which amused me, I hope you find them interesting too. 
. 
. 
. 

My question is: how can I get rid of the small piece of code charset=iso-8859-1"> at the beginning of the text? I can't manage to remove it. Thanks...

+0

You could start by avoiding FrontPage. Tools like that trade correct code for convenience. – Joseph 2012-03-04 00:48:36

+0

Avoiding FrontPage is probably a good idea. But I think the assignment is about processing HTML code regardless of where it comes from? – Nayuki 2012-03-04 00:51:51

Answers

2

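I can see that your intent is to remove things that look like <xxx> and &xxx;. You are using the variable b to remember whether you are currently skipping content.

Did you notice that your algorithm also skips things of the form <xxx; and &xxx>? That is, either & or < causes skipping to start, and either > or ; causes skipping to end, but you don't require < to be matched by > or & to be matched by ;. In your input, the ; in content="text/html; charset=iso-8859-1"> ends the skip that <meta started, which is why charset=iso-8859-1"> leaks into the output. So how about implementing code that remembers which character started the skipping?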
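A minimal sketch of that idea, assuming it is enough to remember a single expected closing character at a time. The class name SkipFilter and the method name strip are made up for illustration and take the page text as a parameter rather than calling getUnfilteredPageContents():

```java
final class SkipFilter {
    /** Removes <...> and &...; runs, remembering which character opened the current run. */
    static String strip(String str) {
        StringBuilder out = new StringBuilder();
        // 0 means "not skipping"; otherwise the character that ends the current skip.
        char closing = 0;
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (closing != 0) {
                // Inside a skipped run: only the matching closer ends it.
                if (c == closing) {
                    closing = 0;
                }
            } else if (c == '<') {
                closing = '>';
            } else if (c == '&') {
                closing = ';';
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "<meta content=\"text/html; charset=iso-8859-1\">A &quot;Lad&quot;";
        System.out.println(strip(sample)); // prints "A Lad"
    }
}
```

With this version the ; inside the meta tag no longer ends the skip, because only > can close a run opened by <. One caveat of the sketch: a bare & in ordinary text (as in "AT&T") would still start a skip that lasts until the next ;.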

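A further complication, though, is that &xxx; things can be embedded inside <xxx> things, like this: <p title="&amp;">

By the way, temp += str.charAt(i); will make your program very slow when the string is long, because each += copies the whole string built so far. Use a StringBuilder instead.


Here is some code that should solve your problem, or come close: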

import java.util.Stack;

public String getFilteredPageContents() {
    String str = getUnfilteredPageContents();
    StringBuilder temp = new StringBuilder();

    // The closing character for each thing that we're inside
    Stack<Character> expectedClosing = new Stack<Character>();

    for (int i = 0; i < str.length(); i++) {
        char c = str.charAt(i);
        if (c == '<')
            expectedClosing.push('>');
        else if (c == '&')
            expectedClosing.push(';');

        // Is the current character going to close something?
        else if (!expectedClosing.empty() && c == expectedClosing.peek())
            expectedClosing.pop();

        else {
            // Only add to output if not currently inside something
            if (expectedClosing.empty())
                temp.append(c);
        }
    }
    return temp.toString();
}
0

I know this is homework for school, but is there any chance you could use a well-established HTML parser such as this one for the assignment?

+0

I can't use external libraries or parsers for this assignment. – El3ctr0n1c4 2012-03-04 09:49:33

0

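The most elegant way to solve this is probably to use regular expressions. With them you can search specifically for tag structures and remove them from the output.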
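A rough sketch of the regular-expression approach using only java.util.regex from the standard library. The class name RegexStripper, the method name strip, and the two patterns are illustrative assumptions, not a complete HTML grammar; like the question's method, it deletes entities rather than decoding them:

```java
import java.util.regex.Pattern;

final class RegexStripper {
    // A '<', then any run of characters other than '>', then '>'.
    // A negated character class also matches line breaks, so tags that
    // span several lines (like the meta tag in the question) are covered.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Named or numeric character references such as &nbsp; &quot; or &#160;
    private static final Pattern ENTITY = Pattern.compile("&#?[A-Za-z0-9]+;");

    static String strip(String html) {
        String withoutTags = TAG.matcher(html).replaceAll("");
        return ENTITY.matcher(withoutTags).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "<meta http-equiv=\"Content-Type\"\n"
                + "content=\"text/html; charset=iso-8859-1\">"
                + "<title>A Shropshire Lad</title>";
        System.out.println(strip(sample)); // prints "A Shropshire Lad"
    }
}
```

This is only adequate for simple pages like the one in the question. A > inside an attribute value, or the contents of comments and script blocks, would not be handled correctly.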

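However, since you have already written procedural code that works fine apart from the problem you mentioned, a quick & dirty solution may be enough.

One thing I can think of is to apply a filter-like pass that scans the text output line by line and removes the leftovers wherever they occur. For example, read each line and check whether its last character is >. If it is, delete the line or replace it with an empty string. Normal text should not have a > at the end of a sentence, so you shouldn't run into much trouble there.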
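A small sketch of such a line filter, assuming lines are separated by \n and that trailing whitespace should be ignored when checking the last character (the output shown in the question has trailing spaces). The class name LineFilter and the method name dropTagRemnants are hypothetical:

```java
final class LineFilter {
    /** Drops every line whose last non-whitespace character is '>'. */
    static String dropTagRemnants(String text) {
        StringBuilder out = new StringBuilder();
        boolean first = true;
        // The limit -1 keeps trailing empty lines instead of discarding them.
        for (String line : text.split("\n", -1)) {
            if (line.trim().endsWith(">")) {
                continue; // looks like the tail of a tag, so skip the whole line
            }
            if (!first) {
                out.append('\n');
            }
            out.append(line);
            first = false;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String filtered = "charset=iso-8859-1\"> \n\nA Shropshire Lad ";
        System.out.println(dropTagRemnants(filtered));
    }
}
```

This treats the symptom rather than the cause: it would also drop a legitimate line of text that happens to end in >, and it does nothing about a remnant that shares a line with real content.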