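I've been trying to get either an <object> or an <embed> tag that gets added to the DOM by a script.

Can anyone please tell me how to get these tags and their InnerHtml?

An embedded YouTube video looks like this: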

<embed height="385" width="640" type="application/x-shockwave-flash" 
src="http://s.ytimg.com/yt/swf/watch-vfl184368.swf" id="movie_player" flashvars="..." 
allowscriptaccess="always" allowfullscreen="true" bgcolor="#000000">

I have a feeling the JavaScript might stop the SWF player from working; I hope not...

Cheers

2010-08-25 Alex

Update 2010-08-26 (in response to the OP's comment):

I think you're thinking about it the wrong way, Alex. Suppose I wrote some C# code that looked like this:

string codeBlock = "if (x == 1) Console.WriteLine(\"Hello, World!\");";

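Now, if I wrote a C# compiler, should it recognize the contents of the string literal above as C# code and highlight it (or whatever) as such? No, because in the context of a well-formed C# file, that text is just a string being assigned to the codeBlock variable.

Similarly, in the HTML of a YouTube page, the <object> and <embed> elements are not really elements at all in the context of the current HTML document. They are the contents of string values that reside within JavaScript code.

In fact, even if HtmlAgilityPack did ignore this and tried to recognize every portion of text that could be HTML, it still wouldn't succeed with these elements because, being inside JavaScript, they are heavily escaped with \ characters (note the Unescape method in the code I posted, which works around this issue).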
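To make the escaping concrete, here is a minimal sketch that uses only the .NET base class library. The sample string is made up for illustration (it is not copied from a real YouTube page) and `EscapeDemo` is a hypothetical name. It assumes only simple escapes such as `\"` and `\/` occur, and strips one backslash from each of them.

```csharp
using System;
using System.Text.RegularExpressions;

static class EscapeDemo
{
    // Removes the backslash from every backslash-escaped character.
    // Assumption: only simple escapes such as \" and \/ are present;
    // sequences like \n or \u0041 would NOT be decoded correctly.
    public static string Unescape(string escaped)
    {
        return Regex.Replace(escaped, @"\\(.)", "$1");
    }

    static void Main()
    {
        // Made-up sample of markup as it might sit inside a JavaScript string.
        string escaped =
            @"<embed id=\""movie_player\"" src=\""http:\/\/example.com\/player.swf\"">";

        Console.WriteLine(escaped);
        // <embed id=\"movie_player\" src=\"http:\/\/example.com\/player.swf\">

        Console.WriteLine(Unescape(escaped));
        // <embed id="movie_player" src="http://example.com/player.swf">
    }
}
```

Only the second form is something an HTML parser could turn into an element with usable attributes.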

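I'm not saying my hacky solution below is the right way to approach this problem; I'm just explaining why getting these elements isn't as simple as grabbing them with HtmlAgilityPack.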

`YouTubeScraper`

OK, Alex: you asked for it, so here it is. Some truly hacky code to extract your precious <object> and <embed> elements from that sea of JavaScript.

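using System;                         // Console, used by the demo Program
using System.Net;                     // WebRequest
using System.Text.RegularExpressions; // Regex, Match
using HtmlAgilityPack;

class YouTubeScraper
{
    // Returns the <object> element hidden in the page's JavaScript, or null if none is found.
    public HtmlNode FindObjectElement(string url)
    {
        HtmlNodeCollection scriptNodes = FindScriptNodes(url);

        for (int i = 0; i < scriptNodes.Count; ++i)
        {
            HtmlNode scriptNode = scriptNodes[i];

            string javascript = scriptNode.InnerHtml;

            // The markup lives inside a JavaScript string, so search the raw script text.
            int objectNodeLocation = javascript.IndexOf("<object");

            if (objectNodeLocation != -1)
            {
                string htmlStart = javascript.Substring(objectNodeLocation);

                // The escaped markup ends where the string literal closes (>" :).
                int objectNodeEndLocation = htmlStart.IndexOf(">\" :");

                if (objectNodeEndLocation != -1)
                {
                    // + 1 keeps the final '>' of the closing tag.
                    string finalEscapedHtml = htmlStart.Substring(0, objectNodeEndLocation + 1);

                    string unescaped = Unescape(finalEscapedHtml);

                    // Parse the recovered markup as its own little HTML document.
                    var objectDoc = new HtmlDocument();

                    objectDoc.LoadHtml(unescaped);

                    HtmlNode objectNode = objectDoc.GetElementbyId("movie_player");

                    return objectNode;
                }
            }
        }

        return null;
    }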

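    // Returns the <embed> element hidden in the page's JavaScript, or null if none is found.
    public HtmlNode FindEmbedElement(string url)
    {
        HtmlNodeCollection scriptNodes = FindScriptNodes(url);

        for (int i = 0; i < scriptNodes.Count; ++i)
        {
            HtmlNode scriptNode = scriptNodes[i];

            string javascript = scriptNode.InnerHtml;

            // The <embed> string follows the escaped </object> string in the script.
            int approxEmbedNodeLocation = javascript.IndexOf("<\\/object>\" : \"<embed");

            if (approxEmbedNodeLocation != -1)
            {
                // Skip the 15 characters of <\/object>" : " so the text starts at <embed.
                string htmlStart = javascript.Substring(approxEmbedNodeLocation + 15);

                // The escaped markup ends where the string literal closes (>";).
                int embedNodeEndLocation = htmlStart.IndexOf(">\";");

                if (embedNodeEndLocation != -1)
                {
                    // + 1 keeps the final '>' of the tag.
                    string finalEscapedHtml = htmlStart.Substring(0, embedNodeEndLocation + 1);

                    string unescaped = Unescape(finalEscapedHtml);

                    // Parse the recovered markup as its own little HTML document.
                    var embedDoc = new HtmlDocument();

                    embedDoc.LoadHtml(unescaped);

                    HtmlNode videoEmbedNode = embedDoc.GetElementbyId("movie_player");

                    return videoEmbedNode;
                }
            }
        }

        return null;
    }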

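    // Downloads the page at url and returns all of its <script> nodes.
    protected HtmlNodeCollection FindScriptNodes(string url)
    {
        var doc = new HtmlDocument();

        WebRequest request = WebRequest.Create(url);
        using (var response = request.GetResponse())
        using (var stream = response.GetResponseStream())
        {
            doc.Load(stream);
        }

        HtmlNode root = doc.DocumentNode;
        HtmlNodeCollection scriptNodes = root.SelectNodes("//script");

        // SelectNodes can return null when nothing matches; hand back an empty
        // collection instead so the callers' loops don't throw.
        return scriptNodes ?? new HtmlNodeCollection(root);
    }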

    static string Unescape(string htmlFromJavascript)
    {
        // The JavaScript has escaped all of its HTML using backslashes. We need
        // to reverse this.

        // DISCLAIMER: I am a TOTAL Regex n00b; I make no claims as to the robustness
        // of this code. If you could improve it, please, I beg of you to do so. Personally,
        // I tested it on a grand total of three inputs. It worked for those, at least.
        return Regex.Replace(htmlFromJavascript, @"\\(.)", UnescapeFromBeginning);
    }

    static string UnescapeFromBeginning(Match match)
    {
        string text = match.ToString();

        if (text.StartsWith("\\"))
        {
            return text.Substring(1);
        }

        return text;
    }
}

And in case you're interested, here's a little demo I threw together (super fancy, I know):

class Program
{
    static void Main(string[] args)
    {
        var scraper = new YouTubeScraper();

        HtmlNode davidAfterDentistEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=txqiwrbYGrs");
        Console.WriteLine("David After Dentist:");
        Console.WriteLine(davidAfterDentistEmbedNode.OuterHtml);
        Console.WriteLine();

        HtmlNode drunkHistoryObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=jL68NyCSi8o");
        Console.WriteLine("Drunk History:");
        Console.WriteLine(drunkHistoryObjectNode.OuterHtml);
        Console.WriteLine();

        HtmlNode jessicaDailyAffirmationEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=qR3rK0kZFkg");
        Console.WriteLine("Jessica's Daily Affirmation:");
        Console.WriteLine(jessicaDailyAffirmationEmbedNode.OuterHtml);
        Console.WriteLine();

        HtmlNode jazzerciseObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=VGOO8ZhWFR4");
        Console.WriteLine("Jazzercise - Move your Boogie Body:");
        Console.WriteLine(jazzerciseObjectNode.OuterHtml);
        Console.WriteLine();

        Console.Write("Finished! Hit Enter to quit.");
        Console.ReadLine();
    }
}

Original answer

Why not try using the element's Id instead?

HtmlNode videoEmbedNode = doc.GetElementbyId("movie_player");

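Update: Oh, so you're looking for HTML tags that are themselves inside JavaScript? That is definitely why this isn't working. (From HtmlAgilityPack's perspective, they aren't really tags; all of that JavaScript is really just one big string inside a <script> tag.) Maybe there is some way to parse the inner text of the <script> tag itself as HTML and go from there.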
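A minimal sketch of that situation, assuming the HtmlAgilityPack package is referenced. The page markup is made up for illustration, and `ScriptStringDemo` and `FindEmbed` are hypothetical names. Because the parser keeps the body of a <script> tag as raw text, the lookup for an <embed> element should come back null even though the text "<embed" is present in the script.

```csharp
using System;
using HtmlAgilityPack;

static class ScriptStringDemo
{
    // Parses the given HTML and looks for an <embed> element in the tree.
    public static HtmlNode FindEmbed(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.SelectSingleNode("//embed");
    }

    static void Main()
    {
        // Made-up page: the <embed> markup exists only inside a JavaScript string.
        string html =
            "<html><body><script>" +
            "var player = \"<embed id=\\\"movie_player\\\" src=\\\"player.swf\\\">\";" +
            "</script></body></html>";

        // No <embed> element is expected in the parsed tree.
        HtmlNode embed = FindEmbed(html);
        Console.WriteLine(embed == null ? "no <embed> element" : "found <embed> element");

        // The text is still reachable as the raw content of the <script> node.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode script = doc.DocumentNode.SelectSingleNode("//script");
        Console.WriteLine(script.InnerHtml.Contains("<embed"));
    }
}
```

This matches the null result reported in the comments: the lookup itself is fine, there is simply no such element in the tree to find.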

2010-08-25 23:29:00

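I get an error with that code: 'HtmlAgilityPack.HtmlDocument' does not contain a definition for 'GetElementById' and no extension method 'GetElementById' – Alex 2010-08-25 23:32:58

@AlexW: It looks like the 'b' should be lowercase? Try that ('GetElementbyId') and see if you have any luck. – 2010-08-25 23:44:10

@Dan Tao - it still isn't grabbing anything: 'videoEmbedNode == null;' – Alex 2010-08-25 23:50:00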
