html
  • go
  • text
  • 2017-06-08 26 views 1 likes 
    1

    How can I extract only the text from HTML in Golang? To extract the text I am using a fully HTML5-compliant tokenizer and parser, like this:

    s := ` 
    <p>Links:</p><ul><li><a href="foo">Foo</a><li> 
    <a href="/bar/baz">BarBaz</a></ul><span>TEXT <b>I</b> WANT</span> 
    <script type='text/javascript'> 
    /* <![CDATA[ */ 
    var post_notif_widget_ajax_obj = {"ajax_url":"http:\/\/site.com\/wp-admin\/admin-ajax.php","nonce":"9b8270e2ef","processing_msg":"Processing..."}; 
    /* ]]> */ 
    </script>` 
    
        domDocTest := html.NewTokenizer(strings.NewReader(s))
        // Walk every token and print the text tokens only
        for tokenType := domDocTest.Next(); tokenType != html.ErrorToken; tokenType = domDocTest.Next() {
            if tokenType != html.TextToken {
                continue
            }
            TxtContent := strings.TrimSpace(html.UnescapeString(string(domDocTest.Text())))
            if len(TxtContent) > 0 {
                fmt.Printf("%s\n", TxtContent)
            }
        }
    

    However, I get this result:

    Links: 
    Foo 
    BarBaz 
    TEXT 
    I 
    WANT 
    /* <![CDATA[ */ 
    var post_notif_widget_ajax_obj = {"ajax_url":"http:\/\/site.com\/wp-admin\/admin-ajax.php","nonce":"9b8270e2ef","processing_msg":"Processing..."}; 
    /* ]]> */ 
    

    I don't want the CDATA content. Any ideas on how to get only the text content?

    +0

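    What you really seem to want here is to ignore anything inside non-rendered elements, such as the `script` tag. To do that, you need to look not only at `TextToken` but also at `StartTagToken`. If the token is the start of a script tag, ignore the text token that follows it.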

    Answers

    0

    As @Eric波利 suggested, I look at both TextTokens and StartTagTokens. Here is my solution:

    s := ` 
    <p>Links:</p><ul><li><a href="foo">Foo</a><li> 
    <a href="/bar/baz">BarBaz</a></ul><span>TEXT <b>I</b> WANT</span> 
    <script type='text/javascript'> 
    /* <![CDATA[ */ 
    var post_notif_widget_ajax_obj = {"ajax_url":"http:\/\/site.com\/wp-admin\/admin-ajax.php","nonce":"9b8270e2ef","processing_msg":"Processing..."}; 
    /* ]]> */ 
    </script>` 
    
        domDocTest := html.NewTokenizer(strings.NewReader(s))
        // Most recent start tag seen; the zero value matches no element
        var previousStartTokenTest html.Token
    loopDomTest:
        for {
            tt := domDocTest.Next()
            switch {
            case tt == html.ErrorToken:
                break loopDomTest // End of the document, done
            case tt == html.StartTagToken:
                previousStartTokenTest = domDocTest.Token()
            case tt == html.EndTagToken:
                // An element was closed, so text that follows is no longer inside <script>
                previousStartTokenTest = html.Token{}
            case tt == html.TextToken:
                if previousStartTokenTest.Data == "script" {
                    continue // Skip the body of the script element
                }
                TxtContent := strings.TrimSpace(html.UnescapeString(string(domDocTest.Text())))
                if len(TxtContent) > 0 {
                    fmt.Printf("%s\n", TxtContent)
                }
            }
        }
    
    2

    If you use github.com/PuerkitoBio/goquery, achieving what you want is fairly easy.

    So the final code would be:

    package main 
    
    import (
        "fmt" 
        "strings" 
        "github.com/PuerkitoBio/goquery" 
    ) 
    
    func main(){ 
        s := `<p>Links:</p><ul><li><a href="foo">Foo</a><li><a href="/bar/baz">BarBaz</a></ul><span>TEXT <b>I</b> WANT</span><script type='text/javascript'>/* <![CDATA[ */var post_notif_widget_ajax_obj = {"ajax_url":"http:\/\/site.com\/wp-admin\/admin-ajax.php","nonce":"9b8270e2ef","processing_msg":"Processing..."};/* ]]> */</script>` 
    
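        p := strings.NewReader(s)
        doc, err := goquery.NewDocumentFromReader(p)
        if err != nil {
            panic(err)
        }
    
        // Remove every <script> element so its contents are left out of the text
        doc.Find("script").Each(func(i int, el *goquery.Selection) {
            el.Remove()
        })
    
        fmt.Println(doc.Text()) // Links:FooBarBazTEXT I WANT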
    
    } 
    