2013-10-28 77 views

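Format HTML as plain-text output using jsoup

I have a document to parse that contains HTML, and I want to convert it from HTML to plain text while keeping some formatting.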

Example extract:

<p>My simple paragragh</p> 
<p>My paragragh with <a>Link</a></p> 
<p>My paragragh with an <img/></p> 

I can handle the simple case easily (though perhaps not efficiently):

StringBuilder sb = new StringBuilder();

// Append the text of every <p> element, followed by a blank line
for (Element element : doc.getAllElements()) {
    if (element.tagName().equals("p")) {
        sb.append(element.text());
        sb.append("\n\n");
    }
}

Is it possible (and if so, how) to insert output for inline elements at the correct position? For example:

<p>My paragragh with <a>Link</a> in the middle</p> 

would become:

My paragragh with (Location: http://mylink.com) in the middle 

Answer


You can replace each link tag with a TextNode:

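import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

final String html = "<p>My simple paragragh</p>\n"
        + "<p>My paragragh with <a>Link</a></p>\n"
        + "<p>My paragragh with an <img/></p>";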

Document doc = Jsoup.parse(html, ""); 

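// Select all link tags and replace them with TextNodes.
// Recent jsoup versions take only the text; older releases (before 1.11)
// need the two-argument form: new TextNode(text, "")
for(Element element : doc.select("a"))
{
    element.replaceWith(new TextNode("(Location: http://mylink.com)"));
}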


StringBuilder sb = new StringBuilder(); 

// Format as needed
for(Element element : doc.select("*"))
{
    // An alternative to the 'if' statement (switching on a String needs Java 7+)
    switch(element.tagName())
    {
        case "p":
            sb.append(element.text()).append("\n\n");
            break;
        // Other tags may need their own formatting here ...
    }
}

System.out.println(sb);
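
The snippet above hardcodes the link text. If the links in your document carry a real href attribute, a variant along the following lines might be closer to what the question asks for. This is only a sketch under some assumptions: it targets a jsoup version that has the single-argument TextNode constructor (1.11 or later), the class name HtmlToPlainText and the method toPlainText are made up for illustration, and the "(Image: ...)" marker for img tags is my own invention, since the question does not say how images should be rendered.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

// Hypothetical helper class; the name is not from the question or from jsoup.
public class HtmlToPlainText {

    /** Turns each <p> into a text paragraph, inlining link targets and image sources. */
    public static String toPlainText(String html) {
        Document doc = Jsoup.parse(html);

        // Replace every link that has an href with "(Location: <href>)".
        // select() returns a separate list, so replacing nodes while looping is safe.
        for (Element link : doc.select("a[href]")) {
            link.replaceWith(new TextNode("(Location: " + link.attr("href") + ")"));
        }

        // Assumed format for images; adjust or drop this loop as needed.
        for (Element img : doc.select("img[src]")) {
            img.replaceWith(new TextNode("(Image: " + img.attr("src") + ")"));
        }

        // Emit one paragraph of text per <p>, separated by blank lines.
        StringBuilder sb = new StringBuilder();
        for (Element p : doc.select("p")) {
            sb.append(p.text()).append("\n\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<p>My simple paragraph</p>\n"
                + "<p>My paragraph with <a href=\"http://mylink.com\">Link</a> in the middle</p>\n"
                + "<p>My paragraph with an <img src=\"pic.png\"/></p>";
        System.out.print(toPlainText(html));
    }
}
```

Because the replacement happens in the DOM before text() is called, the substituted text stays at the position of the original inline element. Links without an href and images without a src are left untouched here, so they contribute their normal text (or nothing, for img).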