2014-01-23 15 views
0

I have HTML like the following, and I'm scraping the page with Node.js. How can I use JavaScript to find the text that is not inside any element?

<div id="content_column"> 

<CENTER>01/23/2014</CENTER> 
<BR> <B>Name : </B> GLUCK MARTIN <BR> <B>Address : </B> <BR> 
<B>Profession : </B> MEDICINE <BR> <B>License No: </B> 077798 <BR> 
<B>Date of Licensure : </B> 05/05/56 <BR> <B>Additional 
    Qualification : </B> &nbsp; <BR> <B> <A 
    href="http://www.op.nysed.gov/help.htm#status"> Status :</A></B> DECEASED 
11/24/13 <BR> <B>Registered through last day of : </B> <BR> <B>Medical 
    School: </B> UNIVERSITY OF GENEVA <B>&nbsp;&nbsp;&nbsp; Degree Date : 
</B> Not on file <BR> 
<HR> 
<div class="note"> 
    (Use your browser's back key to return to licensee list.)<BR> <BR> 
    * Use of this online verification service signifies that you have read 
    and agree to the <A href="http://www.op.nysed.gov/usage.htm">terms 
     and conditions of use</A>. See <A href="http://www.op.nysed.gov/help.htm">HELP 
     glossary</A> for further explanations of terms used on this page. <BR> 
    <BR> <B>Note: </B> The Board of Regents does not discipline <i>physicians(medicine), 
     physician assistants,</i> or <i>specialist assistants.</i> The status of 
    individuals in these professions may be impacted by information 
    provided by the NYS Department of Health. To search for the latest 
    discipline actions against individuals in these professions, please 
    check the New York State Department of Health's <A 
     href="http://www.health.state.ny.us/nysdoh/opmc/main.htm"> Office 
     of Professional Medical Conduct</A> homepage. 
    </UL> 
</div> 
<HR> 
<div class="note"> 
    Further information on physicians may be found on the following 
    external sites (The State Education Department is not responsible for 
    the accuracy or completeness of information located on external 
    Internet addresses.): <BR> <BR> <a 
     href="http://www.abms.org/">American Board of Medical Specialties</a> 
    <BR> <BR> <a href="http://www.ama-assn.org/">American 
     Medical Association:</a> <BR> - For the general public: <a 
     href="http://www.ama-assn.org/aps/amahg.htm">AMA Physician 
     Select, On-line Doctor Finder</a><BR>&nbsp;&nbsp;&nbsp; <BR> - 
    For organizations that verify physician credentials: <a 
     href="http://www.ama-assn.org/physdata/physrel/physrel.htm">AMA 
     Physician Profiles</a> <BR> <BR> <a 
     href="http://www.aoa-net.org/">American Osteopathic Association, 
     AOA-Net</a> <BR> <BR> <a href="http://www.docboard.org/">Association 
     of State Medical Board Executive Directors-(A.I.M."DOCFINDER")</a> <BR> 
    <BR> <a href="http://www.nydoctorprofile.com/welcome.jsp">New 
     York State Department of Health Physician Profiles</a><BR> <BR>The 
    following sites provide additional information concerning the medical 
    profession: <BR> <BR> <a href="http://www.clearhq.org/">CLEAR 
     (Council on Licensure, Enforcement and Regulation)</a> <BR> <BR> 
    <a href="http://www.fsmb.org/">Federation of State Medical Boards</a><BR> 
    <BR> 
</div> 
<CENTER> 
    <BR> <IMG SRC="http://www.op.nysed.gov/Sedseal.jpg" WIDTH="100" 
     HEIGHT="101" ALT="Seal of the State Education Department"><BR> 
    <BR> 
</CENTER> 
</div> 

How can I find the values that are not inside any element? In this case they are GLUCK MARTIN, MEDICINE, 077798, 05/05/56, and so on.

+1

I tried this: var test = $('#content_column').contents().filter(function () { return this.nodeType == 3; }).text(); – sh977218

+0

Why do you need to do this? – DaniP

+0

var test is empty. – sh977218

Answers

0

This is easy with jQuery by combining :not and :contains:

$("#content_column").not(":contains('GLUCK MARTIN')") 
+0

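Unless I'm misunderstanding, I don't think he/she is asking how to find elements that do not contain the phrase "GLUCK MARTIN". I believe he/she is asking how to find text that is not wrapped in any tag. – krispy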

+0

In that case, I believe he/she would have posted code without a quoted string to search for. I may have misunderstood the question (not 100% sure). –

0

See this answer:

$('#content_column').clone().children().remove().end().text() 

Here is a fiddle with your example markup.
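As a rough sketch of what that chain computes, here is the same idea without jQuery: cloning the element and removing its child elements leaves only the text nodes that are direct children, so the result is their concatenation. The `ownText` helper and the plain-object tree are hypothetical stand-ins added for illustration, so the snippet runs without a DOM library; in a browser or jsdom you would pass a real element instead.

```javascript
// Node type constants as defined by the DOM standard.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical helper: concatenate the text nodes that are direct
// children of `node`, ignoring any child elements and their contents.
function ownText(node) {
  let out = "";
  for (const child of node.childNodes) {
    if (child.nodeType === TEXT_NODE) {
      out += child.nodeValue;
    }
  }
  return out;
}

// Minimal stand-in tree (an assumption for this sketch) mirroring:
// <div><B>Name : </B> GLUCK MARTIN <BR><B>Profession : </B> MEDICINE <BR></div>
const text = (value) => ({ nodeType: TEXT_NODE, nodeValue: value, childNodes: [] });
const element = (tagName, ...childNodes) => ({ nodeType: ELEMENT_NODE, tagName, childNodes });

const sample = element("DIV",
  element("B", text("Name : ")),
  text(" GLUCK MARTIN "),
  element("BR"),
  element("B", text("Profession : ")),
  text(" MEDICINE "),
  element("BR")
);

const sampleOwnText = ownText(sample);
console.log(JSON.stringify(sampleOwnText));
```

Note that the values come back joined into a single string, so separating GLUCK MARTIN from MEDICINE still takes an extra step.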

0

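In Node, I suggest using a DOM rather than regular expressions for this kind of work. jsdom is a good example: it lets you build a DOM from a fragment. From there you can query document.documentElement (my example uses jQuery) and pull out the text nodes that are not wrapped in a tag of their own.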

// Collect the text nodes in the document
var jsdom = require("jsdom");

// Note: jsdom.env is the API of older jsdom releases;
// newer releases construct a DOM with `new JSDOM(html)` instead.
jsdom.env(
    "URL OR YOUR HTML STRING HERE",
    ["http://code.jquery.com/jquery.js"],
    function (errors, window) {
        var textNodes = window.$(window.document.documentElement)
            .find(":not(iframe)")
            .addBack()
            .contents()
            .filter(function () {
                return this.nodeType === 3; // 3 is Node.TEXT_NODE
            });
        // do something with textNodes
    }
);
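
One way to make those text nodes useful is to pair each bare value with the `<B>` label that precedes it, since the markup in the question follows that pattern. The following is only a sketch under stated assumptions: `labelledValues` and `textOf` are hypothetical helpers, and the plain-object tree stands in for the children of `#content_column` so the snippet runs without jsdom. With a real DOM, the same loop would iterate over the element's `childNodes`.

```javascript
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Gather all text beneath a node, similar to the DOM's textContent.
function textOf(node) {
  if (node.nodeType === TEXT_NODE) return node.nodeValue;
  return node.childNodes.map(textOf).join("");
}

// Hypothetical helper: walk the direct children of `parent`. Each <B>
// element starts a new label; bare text nodes that follow it are
// collected as that label's value. Text before the first <B> is ignored.
function labelledValues(parent) {
  const result = {};
  let label = null;
  for (const child of parent.childNodes) {
    if (child.nodeType === ELEMENT_NODE && child.tagName === "B") {
      // Collapse whitespace and drop the trailing colon: "Name : " becomes "Name".
      label = textOf(child).replace(/\s+/g, " ").replace(/\s*:\s*$/, "").trim();
      result[label] = "";
    } else if (child.nodeType === TEXT_NODE && label !== null) {
      const value = child.nodeValue.replace(/\s+/g, " ").trim();
      if (value) result[label] = (result[label] + " " + value).trim();
    }
  }
  return result;
}

// Stand-in tree (an assumption for this sketch) based on the question's markup.
const text = (value) => ({ nodeType: TEXT_NODE, nodeValue: value, childNodes: [] });
const element = (tagName, ...childNodes) => ({ nodeType: ELEMENT_NODE, tagName, childNodes });

const sample = element("DIV",
  element("CENTER", text("01/23/2014")),
  element("BR"),
  element("B", text("Name : ")), text(" GLUCK MARTIN "), element("BR"),
  element("B", text("Address : ")), text(" "), element("BR"),
  element("B", text("License No: ")), text(" 077798 "), element("BR"),
  element("B", element("A", text(" Status :"))), text(" DECEASED\n11/24/13 "), element("BR")
);

const fields = labelledValues(sample);
console.log(fields);
```

Keying on the preceding label keeps empty fields such as Address visible as empty strings instead of silently shifting later values.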