Extracting text from HTML: I'm using the fully HTML5-compliant tokenizer and parser. How can I extract only the text from HTML in Golang? This is my code:
s := `
<p>Links:</p><ul><li><a href="foo">Foo</a><li>
<a href="/bar/baz">BarBaz</a></ul><span>TEXT <b>I</b> WANT</span>
<script type='text/javascript'>
/* <![CDATA[ */
var post_notif_widget_ajax_obj = {"ajax_url":"http:\/\/site.com\/wp-admin\/admin-ajax.php","nonce":"9b8270e2ef","processing_msg":"Processing..."};
/* ]]> */
</script>`
domDocTest := html.NewTokenizer(strings.NewReader(s))
// Walk every token and print only the non-empty text tokens.
for tokenType := domDocTest.Next(); tokenType != html.ErrorToken; tokenType = domDocTest.Next() {
	if tokenType != html.TextToken {
		continue
	}
	TxtContent := strings.TrimSpace(html.UnescapeString(string(domDocTest.Text())))
	if len(TxtContent) > 0 {
		fmt.Printf("%s\n", TxtContent)
	}
}
But I get this result:
Links:
Foo
BarBaz
TEXT
I
WANT
/* <![CDATA[ */
var post_notif_widget_ajax_obj = {"ajax_url":"http:\/\/site.com\/wp-admin\/admin-ajax.php","nonce":"9b8270e2ef","processing_msg":"Processing..."};
/* ]]> */
I don't want the CDATA content. Any ideas on how to get only the text content?
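What you really want here is to ignore anything inside non-rendered elements, in this case the 'script' tag. To do that, you need to look not only at `TextToken` but also at `StartTagToken`. If the token is the start of a script tag, ignore the text token that follows it.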