2013-04-16 230 views
0

String to String[] of words

I need to split a String and get a String[] of words. I tried this:

String[] plain = plainText.split(" ,;<>/[(!)*=]"); 

But in my case this doesn't work. After the split, the array plain still holds only one value, which is the entire string from plainText. My string looks like this:

<table class="content" border="0" cellpadding="0" cellspacing="0" style="width:540px;" bgcolor="#ffffff"> 
      <tr> 
       <td align="left" valign="top"> 
        <font color="#666666" face="Arial, Verdana" size="1"> 
        eBay Inc.<br /> 
        2145 Hamilton Avenue<br /> 
        San Jose, California 95125<br /><br /> 

        Designated trademarks and brands are the property of their respective owners. eBay and the eBay logo are trademarks of eBay Inc. 
        <br /><br /> 

        <strong>&copy; 2013 eBay Inc. All Rights Reserved</strong><br /><br /> 


        eBay Inc. sent this e-mail to you at [email protected] because you opted in to the eBay Deals Daily Alert campaign by signing up at ebay.com/deals.<br /><br /> 


        Pricing: We compared the selling price for the featured Deals items on eBay to the List Price for the item. The List price is the price (excluding shipping and handling fees) the seller of the item has provided at which the same item, or one that is nearly identical to it, is being offered for sale or has been offered for sale in the recent past. The price may be the seller's own price elsewhere or another seller's price. The "% off" simply signifies the calculated percentage difference between seller-provided List Price and the seller's price for the eBay Deals item. If you have any questions related to the pricing and/or discount offered in eBay Deals, please contact the seller. All items subject to availability.<br /><br /> 

        If you wish to unsubscribe from eBay Deals email alerts, please <a href="http://dailydeal.ebay.com/unsubscribe.jsp?s=4IwA&i=883690252203">click here</a>. 
        Please note that you are only opting out of the eBay Deals email alerts. If you are an eBay customer and wish to change your other eBay Notification Preferences, please log in to My eBay by <a href="http://l.deals.ebay.com/u.d?R4GrxGghJ4SpZccF_r3SS=21801">clicking here</a>. Please note that it may take up to 10 days to process changes to your eBay Notification Preferences. <br /><br /> 

        Visit our <a href="http://l.deals.ebay.com/u.d?f4GrxGghJ4SpZccF_r3Sf=21811">Privacy Policy</a> and <a href="http://l.deals.ebay.com/u.d?KYGrxGghJ4SpZccF_r3SY=21821">User Agreement</a> if you have any questions.<br /><br /> 

        </font> 
       </td> 

This is part of an e-mail that is being parsed. So how can I convert this text into an array of words?

+2

Which words do you want to include? –

+1

What should the String array look like? What is the expected output? –

+0

Do you mean you want the text with the HTML tags ignored? – NewUser

Answers

3

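This regex is wrong because some of its characters are regex control characters (e.g. [, (, * and so on) and have to be escaped before they can be used as split separators. The whole group of characters also has to be wrapped inside a []: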

String[] plain = plainText.split("[ ,;<>/\\[\\(!\\)\\*=\\]]"); 
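To illustrate the difference, here is a minimal sketch (the class and method names are made up for this example). As far as I can tell, the original pattern is read as the literal sequence ` ,;<>/` followed by one character out of `(!)*=`, which never occurs in the e-mail, so nothing is split. The bracketed version splits on each delimiter individually, which also means adjacent delimiters leave empty entries behind.

```java
import java.util.Arrays;

class DelimiterSplitDemo {
    // The pattern from the question: parsed as a literal run of characters
    // followed by a small character class, not as a list of delimiters.
    static String[] splitWithOriginalPattern(String text) {
        return text.split(" ,;<>/[(!)*=]");
    }

    // The corrected pattern: one character class containing every delimiter,
    // with the regex metacharacters escaped.
    static String[] splitOnDelimiters(String text) {
        return text.split("[ ,;<>/\\[\\(!\\)\\*=\\]]");
    }

    public static void main(String[] args) {
        String sample = "San Jose, California 95125<br />";
        // The original pattern finds no match, so the array has one element.
        System.out.println(splitWithOriginalPattern(sample).length);
        // The corrected pattern splits, but ", " yields an empty entry.
        System.out.println(Arrays.toString(splitOnDelimiters(sample)));
    }
}
```

Note the empty string between "Jose" and "California" in the second result: the comma and the space are two separate delimiters with nothing between them.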

Read more about Java regex here.

EDIT: Following up on CPerkins's comment, you can also use this expression:

String[] plain = plainText.split("[\\s^\\W]+"); 

What it does is split on all whitespace characters and all non-word characters, which I think is more or less what you want.
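A small sketch of how this expression behaves (class and method names are invented for the example). Inside a character class, a `^` that is not the first character is a literal caret, and `\W` already covers whitespace and the caret, so the expression should behave the same as the plain `\\W+`. One thing to watch for: if the text starts with a non-word character, the first array entry is an empty string.

```java
import java.util.Arrays;

class NonWordSplitDemo {
    // The expression from the answer: runs of whitespace / non-word characters.
    static String[] splitOnNonWord(String text) {
        return text.split("[\\s^\\W]+");
    }

    // Assumed-equivalent shorter form, shown for comparison.
    static String[] splitSimpler(String text) {
        return text.split("\\W+");
    }

    public static void main(String[] args) {
        String sample = "eBay Inc.<br />\n  2145 Hamilton Avenue";
        System.out.println(Arrays.toString(splitOnNonWord(sample)));
        System.out.println(Arrays.toString(splitSimpler(sample)));
        // Text that begins with a delimiter produces a leading empty entry.
        System.out.println(Arrays.toString(splitSimpler("<tr> x")));
    }
}
```

Because `\W` treats everything outside `[a-zA-Z_0-9]` as a separator by default, tag names such as `br` end up as words, and a word like `seller's` is cut in two.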

NB: The above is only a direct answer to your question; there are much better ways to read/parse HTML.

+0

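An improvement, but this will produce multiple empty array entries for whitespace, and newlines will be treated as text rather than being stripped. – CPerkins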

+0

@CPerkins Thanks, I've updated the answer with a shorter/cleaner regex. – maksimov

+0

No worries. Good answer. Nice and clean. – CPerkins

0

You can use the Scanner class. You can read the words using a

while(scanner.hasNext()){} 

type construct.

Link: Scanner
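A minimal sketch of that loop, assuming the whole text is already in a String (the class and method names are hypothetical). With the default delimiter, Scanner splits on whitespace only, so punctuation and tag fragments stay attached to the tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerWordsDemo {
    // Collects every whitespace-separated token of the text.
    static List<String> readWords(String text) {
        List<String> words = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNext()) {
                words.add(scanner.next());
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(readWords("eBay Inc.<br />\n 2145 Hamilton Avenue"));
    }
}
```

If whitespace-only tokenizing is too coarse, `Scanner.useDelimiter` accepts a regex, so a pattern such as the `\\W+` discussed above could be supplied instead.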

0

String noTags = htmlString.replaceAll("\\<.*?\\>", ""); 
String clearTxt = noTags.replaceAll("[ \t\n.,!;\\(\\)]+", " "); 
String[] words = clearTxt.split(" "); 

+2

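If the text inside the tags (such as URLs) can be ignored, I like this approach, but I'd switch the pattern in noTags.replaceAll() to "[^\\w]+" to throw out more of the non-alpha characters. If the tag text is needed, @maksimov's regex could be modified to clean it up in two passes. – n0741337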
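Putting the answer and this comment together might look like the sketch below. The class and method names are invented, and two details are my own assumptions rather than part of the answer: tags are replaced with a space instead of an empty string, so that words on either side of a `<br />` do not fuse, and the text is trimmed before splitting to avoid an empty first entry.

```java
import java.util.Arrays;

class StripTagsDemo {
    static String[] wordsWithoutTags(String html) {
        // Pass 1: drop anything that looks like a tag (assumption: replace
        // with a space, not "", so neighbouring words stay separate).
        String noTags = html.replaceAll("<.*?>", " ");
        // Pass 2: collapse every run of non-word characters into one space.
        String clearTxt = noTags.replaceAll("[^\\w]+", " ").trim();
        // An empty string would otherwise split into a single "" entry.
        return clearTxt.isEmpty() ? new String[0] : clearTxt.split(" ");
    }

    public static void main(String[] args) {
        String sample = "<td>eBay Inc.<br />\n 2145 Hamilton</td>";
        System.out.println(Arrays.toString(wordsWithoutTags(sample)));
    }
}
```

Entities such as `&copy;` are not decoded by this approach; the `&` and `;` are stripped and `copy` is left behind as a word.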

+1

@rebeliagamer Much better, but your entries have newlines in them. n0741337's addition corrects that. – CPerkins
