2017-06-22 43 views

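I am using the JSoup API in Java to read HTML content and extract the file names and their associated timestamps from the listing it contains. After parsing the HTML document, I cannot get the expected data.

The HTML data that the file names are read from: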

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"> 
<html dir="ltr" class="ms-isBot" lang="en-US"> 
<head> 
    <meta name="GENERATOR" content="Microsoft SharePoint" /> 
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" /> 
    <title> 
    ALL ELP REPORTS 
</title> 
    <!-- === Favicon/Windows Tile ==================================================================== --> 
    <link rel="shortcut icon" href=" " type="image/vnd.microsoft.icon" id="favicon" /> 
    <meta name="msapplication-TileImage" content=" " /> 
    <meta name="msapplication-TileColor" content="#0072C6" /> 
    <script type="text/javascript" src=" "></script> 
    <link rel="stylesheet" type="text/css" href=" " /> 
    <link id="CssRegistration1" rel="stylesheet" type="text/css" href=" " /> 
    <link id="CssRegistration2" rel="stylesheet" type="text/css" href=" 0" /> 
    <script type="text/javascript">CallASP("one.js"); 
    </script> 
    <script type="text/javascript">RegisterSod("strings.js", "\u002f_layouts\u002f15\u002f1033\u002fstrings.js?rev=cG2ZohQxWuyz1\u00252BF2exRTjA\u00253D\u00253D");RegisterSodDep("strings.js", "initstrings.js"); 

    <link type="text/xml" rel="alternate" href="/_asd.xls" /> 
    <!-- Additional header placeholder =========================== --> 
    <link rel="alternate" type="application/rss+xml" title="Documents" href="/_layouts/15/listfeed.aspx?List=573d80cd%2D44f6%2D47b4%2D942f%2Da12a5a1841cb" /> 
    <span id="analytics"> 
    <script language="JavaScript" type="text/javascript"> 

    <noscript> 
    <div class="noindex"> 
    You may be trying to access this site from a secured browser on the server. Please enable scripts and reload this page. 
    </div> 
    </noscript> 
    <!-- ===== SP IDs/Prefetch SP images/SP Form =========================================================================== --> 
    <div id="imgPrefetch" style="display:none"> 
    <img src="/_layouts/15/images/spcommon.png" /> 
    </div> 
    <form method="post" action="./AllItems.aspx?RootFolder=%2fShared+Documents%2f08.Test+Report%2fMY20+Test+Reports%2fSanity%2fRaw+Data&amp;FolderCTID=0x0120003C2FB175ACD9FE42B875BA259F53A6E3&amp;View=%7bF8BC514C-49A5-47A2-8A6D-52DF70D61AE7%7d" id="aspnetForm"> 
    <input type="hidden" name="_wpcmWpid" id="_wpcmWpid" value="" /> 
    <input type="hidden" name="wpcmVal" id="wpcmVal" value="" /> 
    <input type="hidden" name="MSOWebPartPage_PostbackSource" id="MSOWebPartPage_PostbackSource" value="" /> 

    </script> 

     <div id="ctl00_ctl47_asdasd" class="asdaBrandMenu"> 
     <a href="http://www.qwer.com/" target="_blank"> </a> 

     <!-- =============Suite Bar Links ======================--> 
     <div id="DeltaSuiteLinks" class="ms-core-deltaSuiteLinks"> 
     <div id="suiteLinksBox"> 
     <div id="SuiteLinksHidden" style="display: none"> 
      </div> 
      <div id="launcherIconContainer"> 
      </div> 

     <span style="display:none"> 
     <menu type="ServerMenu" id="zz1_ID_PersonalActionMenu" hideicons="true"> 
     <ie:menuitem id="zz2_ID_MyProfile" type="option" onmenuclick="" text="My Profile" menugroupid="100"></ie:menuitem> 
     <ie:menuitem id="zz3_ID_Logout" type="option" onmenuclick="" text="Sign Out" description="Logout of this site." menugroupid="100"></ie:menuitem> 
     </menu></span> 
     <span id="zz4_Menu_t" class="ms-menu-althov ms-welcome-root" title="Open Menu" onmouseover="MMU_PopMenuIfShowing(this);MMU_EcbTableMouseOverOut(this, true)" hoveractive="ms-menu-althov-active ms-welcome-root ms-welcome-hover" hoverinactive="ms-menu-althov ms-welcome-root" onclick=" CoreInvoke('MMU_Open',byid('zz1_ID_PersonalActionMenu'), MMU_GetMenuFromClientId('zz4_Menu'),event,true, null, 0); return false;" foa="MMU_GetMenuFromClientId('zz4_Menu')" oncontextmenu="ClkElmt(this); return false;" style="white-space:nowrap"><a class="ms-core-menu-root" id="zz4_Menu" accesskey="/" href="javascript:;" title="Open Menu" onfocus="MMU_EcbLinkOnFocusBlur(byid('zz1_ID_PersonalActionMenu'), this, true);" onkeydown="MMU_EcbLinkOnKeyDown(byid('zz1_ID_PersonalActionMenu'), MMU_GetMenuFromClientId('zz4_Menu'), event);" onclick=" CoreInvoke('MMU_Open',byid('zz1_ID_PersonalActionMenu'), MMU_GetMenuFromClientId('zz4_Menu'),event,true, null, 0); return false;" oncontextmenu="ClkElmt(this); return false;" menutokenvalues="MENUCLIENTID=zz4_Menu,TEMPLATECLIENTID=zz1_ID_PersonalActionMenu" serverclientid="zz4_Menu">Bhavani Borra<span class="ms-accessible">Use SHIFT+ENTER to open the menu (new window).</span></a><span style="height:4px;width:7px;position:relative;display:inline-block;overflow:hidden;" class="s4-clust ms-viewselector-arrow ms-menu-stdarw ms-core-menu-arrow"><img src="/_catalogs/theme/Themed/EB5E82F/spcommon-B35BB0A9.themedpng?ctag=3" alt="Open Menu" style="position:absolute;left:-95px !important;top:-259px !important;" /></span><span style="height:4px;width:7px;position:relative;display:inline-block;overflow:hidden;" class="s4-clust ms-core-menu-arrow ms-viewselector-arrow ms-menu-hovarw"><img src="/_catalogs/theme/Themed/EB5E82F/spcommon-B35BB0A9.themedpng?ctag=3" alt="Open Menu" style="position:absolute;left:-86px !important;top:-259px !important;" /></span></span> 
     </div> 
     <!-- ======== Start: Site Actions menu ============= --> 
     <div id="suiteBarButtons"> 
     <span class="ms-siteactions-root" id="siteactiontd"> <span style="display:none"> 
     <menu type="ServerMenu" id="zz5_FeatureMenuTemplate1" hideicons="true"> 
      <ie:menuitem id="zz6_MenuItem_ShareThisSite" type="option" onmenuclick="" description="See who's here and invite new people." menugroupid="100"></ie:menuitem> 
      <ie:menuitem id="zz7_MenuItem_ViewAllSiteContents" type="option" iconsrc="" onmenuclick="STSNavigate2(event,'/_layouts/15/viewlsts.aspx');" text="Site contents" description="View all libraries and lists in this site." menugroupid="200"></ie:menuitem> 
     </menu></span><span id="zz8_SiteActionsMenu_t" class="ms-siteactions-normal" title="Settings" onmouseover="MMU_PopMenuIfShowing(this);MMU_EcbTableMouseOverOut(this, true)" hoveractive="ms-siteactions-normal ms-siteactions-hover" hoverinactive="ms-siteactions-normal"> 
     <a class="ms-core-menu-root" id="zz8_SiteActionsMenu" accesskey="/" href="javascript:;" title="Settings" onkeydown="MMU_EcbLinkOnKeyDown(byid('zz5_FeatureMenuTemplate1'), MMU_GetMenuFromClientId('zz8_SiteActionsMenu'));" menutokenvalues="MENUCLIENTID=zz8_SiteActionsMenu,TEMPLATECLIENTID=zz5_FeatureMenuTemplate1" serverclientid="zz8_SiteActionsMenu"><span class="ms-siteactions-imgspan"><img class="ms-core-menu-buttonIcon" src="/_catalogs/theme/Themed/EB5E82F/Settings-white-94FE89A9.themedpng?ctag=3" alt="Settings" title="Settings" /></span><span class="ms-accessible">Use SHIFT+ENTER to open the menu (new window).</span></a></span> </span> 
     </div> 
     <!-- ================== End: Site Actions Menu ============================================ --> 
     <!-- ================== IT Help Link ============================================ --> 

      <div class="ms-core-listMenu-verticalBox"> 
      </div> 
     </div> 
     </div> 
     </div> 
     <!-- ===== Main Content ========================================================================================== --> 

     <tr class="ms-alternating ms-itmhover" iid="47,1430,0"> 
     <td class="ms-vb-itmcbx ms-vb-firstCell"><input type="checkbox" class="s4-itm-cbx" /></td> 
     <td class="ms-vb-icon"><img border="0" alt="ECS-dailyTask.xls" title="ECS-dailyTask.xls" src="" /></td> 
     <td height="100%" onmouseover="OnChildItem(this)" class="ms-vb-title"> 
     <div class="ms-vb itx" onmouseover="OnItem(this)" ctxname="ctx47" id="1430" field="LinkFilename" perm="0x1b03c4312ef" eventtype=""> 
      <a onfocus="OnLink(this)" href="/MyDocuments/ECS-dailyTask.xls" onmousedown="">ECS-dailyTask</a> 
     </div> 
     <div class="s4-ctx" onmouseover="OnChildItem(this.parentNode); return false;"> 
      <span>&nbsp;</span> 
      <a onfocus="OnChildItem(this.parentNode.parentNode); return false;" onclick="" href="javascript:;" title="Open Menu"></a> 
      <span>&nbsp;</span> 
     </div></td> 
     <td class="ms-vb2"> 
     <nobr> 
      3/31/2013 11:04 AM 
     </nobr></td> 

     <tr class="ms-alternating ms-itmhover" iid="47,1429,0"> 
     <td class="ms-vb-itmcbx ms-vb-firstCell"><input type="checkbox" class="s4-itm-cbx" /></td> 
     <td class="ms-vb-icon"><img border="0" alt="ECS-MontlhyTask.xls" title="ECS-MontlhyTask.xls" src="/_layouts/15/images/icxls.png?rev=23" /></td> 
     <td height="100%" onmouseover="OnChildItem(this)" class="ms-vb-title"> 
      <div class="ms-vb itx" onmouseover="OnItem(this)" ctxname="ctx47" id="1429" field="LinkFilename" perm="0x1b03c4312ef" eventtype=""> 
      <a onfocus="OnLink(this)" href="/MyDocs/ECS-MontlhyTask.xls" onmousedown="">ECS-MontlhyTask</a> 
      </div> 
      <div class="s4-ctx" onmouseover="OnChildItem(this.parentNode); return false;"> 
      <span>&nbsp;</span> 
      <a onfocus="" onclick="" href="javascript:;" title="Open Menu"></a> 
      <span>&nbsp;</span> 
      </div></td> 
     <td class="ms-vb2"> 
      <nobr> 
      7/24/2016 10:09 PM 
      </nobr></td> 
     <td class="ms-vb-user"><span class="ms-noWrap"><span class="ms-imnSpan"><a href="#" onclick="" class="ms-imnlink ms-spimn-presenceLink"> 
     <span class="ms-spimn-presenceWrapper ms-imnImg ms-spimn-imgSize-10x10"> 
     <img name="imnmark" class="" title="" showofflinepawn="1" src="" alt="No presence information" id="imn_16532,type=sip" /> 
     </span></a></span><span class="ms-noWrap ms-imnSpan"> 
     <a href="#" onclick="" class="ms-imnlink" tabindex="-1"><img name="imnmark" class="ms-hide" title="" showofflinepawn="1" src="" alt=""/></a> 
     <a class="ms-subtleLink" onclick="" href="/_layouts/15/userdisp.aspx?ID=113">ASDF</a></span></span></td> 
    </tr> 
    .. 

Java code:

import org.jsoup.Jsoup; 
import org.jsoup.nodes.Document; 
import org.jsoup.nodes.Element; 
import org.jsoup.select.Elements; 
import java.io.File; 
import java.io.IOException; 

public class ReadFileNamesFromHTMLContent { 

    public static void main(String args[]) throws IOException { 
     File input = new File("C:/jsoupTest/readData.html"); 
     Document doc = Jsoup.parse(input, "UTF-8"); 
     Elements trs = doc.select("a"); //select all "tr" elements from document 
     for (Element tr : trs) { 
      //Getting the class string form tr element 
      System.out.println("The file class is: " + tr.attr("href"));/* 
        //getting the filename string that holds inside td element 
        + " The filamee is: " + tr.select("td").text());*/ 
     } 
    } 
} 

My output:

The file class is: javascript:; 
The file class is: javascript:; 
The file class is: /MyDocuments/ECS-dailyTask.xls 
The file class is: javascript:; 
The file class is: /MyDocs/ECS-MontlhyTask.xls 
The file class is: javascript:; 
The file class is: # 
The file class is: # 
The file class is: /_layouts/15/userdisp.aspx?ID=113 

Expected output:

ECS-dailyTask 3/31/2013 11:04 AM 
ECS-MontlhyTask 7/24/2016 10:09 PM 

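Any suggestions would be helpful. I have tried iterating over the document in several different ways, but the output never matches the expected output.

Answers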


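First: I don't understand why you want to get the tr tags. A tr tag has no href attribute, and the values you expect are not found there either.

Second: the timestamps you expect are found in td tags: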

<td class="ms-vb2"> 
    <nobr> 
    3/31/2013 11:04 AM 
    </nobr> 
</td> 

So the code to get your expected values should be (untested):

Elements nameDivs = doc.select("div.ms-vb.itx"); // divs that carry both classes, ms-vb and itx
for (Element div : nameDivs) {
    // the link text inside the div is the file name
    System.out.println("The file name is: " + div.select("a").text());
}

Elements dates = doc.select("nobr"); // every nobr element holds a timestamp
for (Element date : dates) {
    System.out.println("The date is: " + date.text());
}
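
To print each file name next to its timestamp, as in the expected output, one option is to walk the rows and query both cells inside each row. The following is only a minimal sketch and has not been run against the real SharePoint page. It assumes the rows sit inside a `<table>` element (the excerpt in the question does not show one), and the names `FileListingExtractor`, `extractRows`, and `SAMPLE_HTML` are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class FileListingExtractor {

    // Hypothetical sample: two rows cut down from the question's markup and
    // wrapped in <table> (an assumption) so the parser keeps <tr> and <td>.
    static final String SAMPLE_HTML =
          "<table>"
        + "<tr class=\"ms-alternating ms-itmhover\">"
        + "<td class=\"ms-vb-title\"><div class=\"ms-vb itx\">"
        + "<a href=\"/MyDocuments/ECS-dailyTask.xls\">ECS-dailyTask</a></div></td>"
        + "<td class=\"ms-vb2\"><nobr> 3/31/2013 11:04 AM </nobr></td>"
        + "</tr>"
        + "<tr class=\"ms-alternating ms-itmhover\">"
        + "<td class=\"ms-vb-title\"><div class=\"ms-vb itx\">"
        + "<a href=\"/MyDocs/ECS-MontlhyTask.xls\">ECS-MontlhyTask</a></div></td>"
        + "<td class=\"ms-vb2\"><nobr> 7/24/2016 10:09 PM </nobr></td>"
        + "</tr>"
        + "</table>";

    // Returns one "name timestamp" string per listing row.
    static List<String> extractRows(String html) {
        Document doc = Jsoup.parse(html);
        List<String> rows = new ArrayList<>();
        for (Element row : doc.select("tr.ms-itmhover")) {
            // Both lookups are scoped to the current row, so the values stay paired.
            String name = row.select("div.ms-vb.itx a").text();
            String modified = row.select("td.ms-vb2 nobr").text();
            if (!name.isEmpty()) {
                rows.add(name + " " + modified);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        extractRows(SAMPLE_HTML).forEach(System.out::println);
    }
}
```

Scoping the selectors to a row keeps unrelated links (the menu anchors, the user link) out of the result. If the real page has several `td.ms-vb2` cells per row, the selector would need to be narrowed further.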

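When I tried your code, the output was blank. I also want the file names, which is why I used that tag; please see my Java class and its output. I want to display each file name together with its associated time, as shown in the expected output in my post above. – sss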


Updated the code. Because the values you need are in different tags, you will have to select them separately. –
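
When the two values are selected separately, they can still be printed together by pairing the two result lists by position. This is a minimal sketch under a strong assumption: every file name is followed by exactly one `nobr` timestamp, in document order, and no other `nobr` elements exist on the page. `PairByIndex`, `pair`, and `SAMPLE_HTML` are hypothetical names, and the sample markup is a cut-down stand-in for the real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

class PairByIndex {

    // Hypothetical sample with no table markup at all: only the name divs
    // and the nobr timestamps, in the order they appear in the question.
    static final String SAMPLE_HTML =
          "<div class=\"ms-vb itx\">"
        + "<a href=\"/MyDocuments/ECS-dailyTask.xls\">ECS-dailyTask</a></div>"
        + "<nobr> 3/31/2013 11:04 AM </nobr>"
        + "<div class=\"ms-vb itx\">"
        + "<a href=\"/MyDocs/ECS-MontlhyTask.xls\">ECS-MontlhyTask</a></div>"
        + "<nobr> 7/24/2016 10:09 PM </nobr>";

    // Selects names and timestamps separately, then joins them by index.
    static List<String> pair(String html) {
        Document doc = Jsoup.parse(html);
        Elements names = doc.select("div.ms-vb.itx a");
        Elements dates = doc.select("nobr");
        // Stop at the shorter list so a missing value cannot cause an exception.
        int count = Math.min(names.size(), dates.size());
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            lines.add(names.get(i).text() + " " + dates.get(i).text());
        }
        return lines;
    }

    public static void main(String[] args) {
        pair(SAMPLE_HTML).forEach(System.out::println);
    }
}
```

Pairing by position is fragile: a single extra or missing element shifts every later pair, so selecting per row is preferable whenever the row elements are available in the parsed document.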