Does the Google Cloud Natural Language API actually support parsing HTML?

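I am trying to extract the main body content from news sites and blogs. Does the Google Cloud Natural Language API actually support parsing HTML?

The documentation makes it look as though documents.analyzeSyntax will work correctly with HTML if you pass a document whose content is the raw HTML of the page (UTF-8) and whose type is set to HTML. The documentation definitely lists HTML as a supported content type.

In practice, however, the resulting sentences and tokens are mixed in with HTML tags, as if the parser treated the input as plain text. As it stands, this rules out the GC NL API for my use case, and presumably for many other people too, since running natural language processing over web pages is a very common task.

For reference, here is an example of the kind of output from the Dandelion API that one would expect for a given HTML input (or rather, in this case, for the URL of an HTML page as input).

So my question is: am I missing something, perhaps calling the API incorrectly, or does the NL API not support HTML?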

Source

2017-06-12 fisch2

Yes, it does.

I don't know which language you are using, but here is an example in Python using the client library:

from google.cloud import language 

client = language.Client() 

# document of type PLAIN_TEXT 
text = "hello" 
document_text = client.document_from_text(text) 
syntax_text = document_text.analyze_syntax() 

print("\n\ndocument of type PLAIN_TEXT:")
for token in syntax_text.tokens: 
    print(token.__dict__) 

# document of type HTML 
html = "<p>hello</p>" 
document_html = client.document_from_html(html) 
syntax_html = document_html.analyze_syntax() 

print("\n\ndocument of type HTML:") 
for token in syntax_html.tokens: 
    print(token.__dict__) 

# document of type PLAIN_TEXT but should be HTML 
document_mismatch = client.document_from_text(html) 
syntax_mismatch = document_mismatch.analyze_syntax() 

print("\n\ndocument of type PLAIN_TEXT but with HTML content:") 
for token in syntax_mismatch.tokens: 
    print(token.__dict__)

This works for me: the HTML tags <p> and </p> are not processed as natural language.
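One way to check this claim on your own pages is to scan the token texts returned for each document and see whether any markup leaked through. The following is a minimal sketch, not part of the client library: `leaked_markup` is a hypothetical helper, the sample token lists are made up for illustration rather than real API output, and how you pull the token strings out of the response depends on the client library version you have installed.

```python
import re

# Matches a whole tag such as <p> or </p>, or a stray angle bracket.
MARKUP = re.compile(r"</?[A-Za-z][^>]*>|[<>]")


def leaked_markup(token_texts):
    """Return the token strings that still contain HTML markup."""
    return [text for text in token_texts if MARKUP.search(text)]


# Hypothetical token texts, for illustration only (not real API output).
clean_tokens = ["hello"]
leaky_tokens = ["<", "p", ">", "hello", "<", "/p", ">"]

print(leaked_markup(clean_tokens))  # nothing leaked
print(leaked_markup(leaky_tokens))  # the angle brackets are reported
```

An empty result for the HTML document and a non-empty one for the mismatched PLAIN_TEXT document would be consistent with the behaviour described above.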

If you go through the setup steps on this page, you can quickly experiment with the gcloud command-line tool:

gcloud beta ml language analyze-syntax --content="<p>hello</p>" --content-type="HTML"
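If you call the REST endpoint directly instead, the thing to double-check is that the document's type field really is HTML. The sketch below only builds and prints the JSON body for documents.analyzeSyntax. `build_syntax_request` is a hypothetical helper name, and authentication and the HTTP call itself are left out on purpose.

```python
import json

# REST endpoint for syntax analysis (v1 of the API).
ENDPOINT = "https://language.googleapis.com/v1/documents:analyzeSyntax"


def build_syntax_request(content, doc_type="HTML"):
    """Build the JSON body for documents.analyzeSyntax.

    Hypothetical helper: doc_type must be "HTML" for markup to be
    treated as markup; "PLAIN_TEXT" makes the tags part of the text.
    """
    if doc_type not in ("PLAIN_TEXT", "HTML"):
        raise ValueError("doc_type must be 'PLAIN_TEXT' or 'HTML'")
    return {
        "document": {"type": doc_type, "content": content},
        "encodingType": "UTF8",
    }


body = build_syntax_request("<p>hello</p>")
print("POST", ENDPOINT)
print(json.dumps(body, indent=2))
```

If the type is left out or set to PLAIN_TEXT, tags showing up in the tokens is the expected result, which matches the third case in the Python example.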

Source

2017-06-16 19:07:24 dizcology
