2013-03-13 56 views
2

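Converting ASCII-encoded text to HTML

I am experimenting with the DownloadData method in WebClient. My current problem is that I have not been able to figure out how to convert the ASCII result (&lt; to <, \n, &gt; to >) produced by Encoding.ASCII.GetString(myDataBuffer); into real markup, outside of the page.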

Page source: http://iforce.co.nz/i/z4f2wggp.evi.png

    /// <summary>
    /// Curl data from the PMID
    /// </summary>
    private void ClientPMID(int pmid)
    {
        //generate the URL for the client
        StringBuilder pmid_url_string = new StringBuilder();
        pmid_url_string.Append("http://www.ncbi.nlm.nih.gov/pubmed/").Append(pmid.ToString()).Append("?report=xml");
        Uri PMIDUri = new Uri(pmid_url_string.ToString());
        //declare and initialize the client
        WebClient client = new WebClient();
        // Download the Web resource and save it into a data buffer.
        byte[] myDataBuffer = client.DownloadData(PMIDUri);
        this.DownloadCompleted(myDataBuffer);
    }

    /// <summary>
    /// Crawl over the binary from myDataBuffer
    /// </summary>
    /// <param name="myDataBuffer">Binary Buffer</param>
    private void DownloadCompleted(byte[] myDataBuffer)
    {
        string download = Encoding.ASCII.GetString(myDataBuffer);
        PMIDCrawler pmc = new PMIDCrawler(download, "/pre/PubmedArticle/MedlineCitation/Article");
        //iterate over each node in the file
        foreach (XmlNode xmlNode in pmc.crawl)
        {
            string AbstractTitle = xmlNode["ArticleTitle"].InnerText;
            string AbstractText = xmlNode["Abstract"]["AbstractText"].InnerText;
        }
    }

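The code for PMIDCrawler is available in my other SO question about DownloadStringCompletedEventHandler. However, after receiving the content from Encoding.ASCII.GetString, the output of string html = HttpUtility.HtmlDecode(nHtml); is not valid HTML (or XML), because the server does not respond to XML HTTP headers.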

+1

Here is how to do this in JavaScript, for example: http://stackoverflow.com/questions/5796718/html-entity-decode – Hogan 2013-03-13 02:48:28

Answers

2

Unfortunately, this server does not respond correctly to Accept: text/xml or Accept: application/xml, so you will have to do this the hard way, with HttpUtility:

string download = HttpUtility.HtmlDecode(Encoding.ASCII.GetString(myDataBuffer)); 

(or WebUtility.HtmlDecode in .NET FX 4.5+)

Or, replacing the entities by hand:

string download = Encoding.ASCII.GetString(myDataBuffer); 
if (download != null) { // this won't get all HTML escaped characters... 
    download = download.Replace("&lt;", "<").Replace("&gt;", ">"); 
} 

See also this question for more information.
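As a rough illustration of the decode step, here is a minimal sketch that decodes the entities and then loads the result as XML. The class name, method name, and sample payload are hypothetical and not taken from the original code; decoding the bytes as UTF-8 instead of ASCII is also an assumption, based on the utf-8 declaration quoted in the comments. The real response may contain more than this sample (for example a DOCTYPE), which the sketch does not handle.

```csharp
using System;
using System.Net;
using System.Text;
using System.Xml;

public static class EscapedXmlSketch
{
    // Hypothetical helper: turns entity-escaped markup (&lt;tag&gt;) into real XML.
    public static XmlDocument DecodeEscapedXml(byte[] buffer)
    {
        // Assumption: the payload is UTF-8; ASCII would mangle non-ASCII characters.
        string escaped = Encoding.UTF8.GetString(buffer);

        // Decodes &lt;, &gt;, &amp;, &quot; and the other HTML entities in one pass.
        string decoded = WebUtility.HtmlDecode(escaped);

        var doc = new XmlDocument();
        doc.LoadXml(decoded); // throws XmlException if the result is not well-formed
        return doc;
    }

    public static void Main()
    {
        // Made-up stand-in for the downloaded buffer, not real PubMed data.
        byte[] sample = Encoding.UTF8.GetBytes(
            "<pre>&lt;PubmedArticle&gt;&lt;ArticleTitle&gt;Example&lt;/ArticleTitle&gt;&lt;/PubmedArticle&gt;</pre>");

        XmlDocument doc = DecodeEscapedXml(sample);
        XmlNode title = doc.SelectSingleNode("/pre/PubmedArticle/ArticleTitle");
        Console.WriteLine(title.InnerText); // prints Example
    }
}
```

Decoding the whole string once keeps text that was escaped inside the XML itself (such as &amp;amp;) valid, since it comes out as &amp; and the XML parser then handles it.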

+0

+1 for a good suggestion, but is there any way to deal with the fact that every 'attribute' is being escaped? For example [<?xml version=\"1.0\" encoding=\"utf-8\" ?>](http://pastebin.com/hjCwhEhL) – Killrawr 2013-03-13 03:16:16

+1

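Make sure the '\"' and '\n' you are seeing are not just artifacts of the Visual Studio debugger, which is what you get if you inspect a string at a breakpoint (it has always been that way). You can verify with Console.WriteLine, if I remember my C#/.NET correctly. – cfeduke 2013-03-13 03:19:07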

+0

Are you certain? 'curl --header "Accept: text/html" http://www.ncbi.nlm.nih.gov/pubmed/22918716\?report\=xml' shows me the HTML-entity-escaped "XML", but no "\n" or "\"" markers. – cfeduke 2013-03-13 03:22:23
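The debugger point raised in these comments can be checked with a small sketch like the one below. The class name and sample string are hypothetical. A string literal written with \" and \n contains no backslash characters at run time; the escapes are only how the source code (and the debugger display) represents quotes and newlines.

```csharp
using System;

public static class DebuggerEscapeSketch
{
    // Hypothetical sample: written with C# escapes, as a debugger would display it.
    public static string Sample()
    {
        return "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<pre/>";
    }

    public static void Main()
    {
        string download = Sample();

        // The run-time string holds a quote and a newline, not backslashes.
        Console.WriteLine(download.Contains("\\")); // prints False

        // Console.WriteLine shows the actual characters, with a real line break.
        Console.WriteLine(download);
    }
}
```

If the downloaded string does contain literal backslashes when printed this way, the escaping is in the data itself and would need a separate fix.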