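Split a string containing quotes with a delimiter

I want to split a string using a space as the delimiter, but it should handle quoted strings intelligently. For example, for a string like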

"John Smith" Ted Barry

it should return three strings: John Smith, Ted and Barry.

2012-05-22 fastcodejava

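You may need to split out the quoted strings first, and then split the rest of the string on spaces. There must be some questions here about how to do the first step. The second step is trivial. – jahroy

What have you tried? –

A decent CSV parser library would suit you. Most allow you to choose the delimiter, and will respect quoted text and avoid splitting it. –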
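
The two-step idea from the first comment can be sketched roughly as follows. This is only an illustration, assuming the quotes are balanced and never escaped; the class name `TwoStepSplit` and the method `split` are made up for the example. Splitting on `"` yields segments that alternate between unquoted and quoted text, so odd-indexed segments are kept whole and even-indexed ones are split on whitespace.

```java
import java.util.ArrayList;
import java.util.List;

class TwoStepSplit {
    // Hypothetical helper: assumes balanced, unescaped double quotes.
    static List<String> split(String input) {
        List<String> result = new ArrayList<>();
        // Step 1: cut at every quote; odd-indexed segments were inside quotes.
        String[] segments = input.split("\"");
        for (int i = 0; i < segments.length; i++) {
            if (i % 2 == 1) {
                result.add(segments[i]); // quoted text, kept whole
            } else {
                // Step 2: split the unquoted remainder on runs of whitespace.
                for (String word : segments[i].trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        result.add(word);
                    }
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(split("\"John Smith\" Ted Barry")); // [John Smith, Ted, Barry]
    }
}
```

With an unbalanced quote, the text after the last quote is simply treated as quoted, so real input would need a check for that.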

After playing around with it, you can use a regex for this. Run the equivalent of "match all" on:

((?<=("))[\w ]*(?=("(\s|$))))|((?<!")\w+(?!"))

Java example:

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Test
{
    public static void main(String[] args)
    {
        String someString = "\"Multiple quote test\" not in quotes \"inside quote\" \"A work in progress\"";
        Pattern p = Pattern.compile("((?<=(\"))[\\w ]*(?=(\"(\\s|$))))|((?<!\")\\w+(?!\"))");
        Matcher m = p.matcher(someString);

        while (m.find()) {
            System.out.println("'" + m.group() + "'");
        }
    }
}

Output:

'Multiple quote test' 
'not' 
'in' 
'quotes' 
'inside quote' 
'A work in progress'

A breakdown of the regex, with the example used above, can be viewed here:

http://regex101.com/r/wM6yT9

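With all that said, regex should not be the go-to solution for everything - I was just having fun. This example has many edge cases, such as handling Unicode characters, symbols, etc. You are better off using a tried and true library for this kind of task. Please look at the other answers before using this.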
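
One way to loosen the Unicode and symbol limitation is to stop using `\w` altogether. The sketch below is a different, simpler pattern than the one above, not a drop-in fix for it: it matches either a double-quoted run or a run of non-whitespace, and relies on capture groups instead of lookarounds. It assumes quotes start at a token boundary and are never escaped; the class name `QuotedRegexSplit` is invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedRegexSplit {
    // Either a double-quoted run (group 1, quotes excluded)
    // or a run of non-whitespace characters (group 2).
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    static List<String> split(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Exactly one of the two groups participates in each match.
            tokens.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Multiple quote test, not, in, quotes, inside quote]
        System.out.println(split("\"Multiple quote test\" not in quotes \"inside quote\""));
    }
}
```

Because `[^"]` and `\S` are not limited to ASCII word characters, non-Latin text and punctuation pass through unchanged. An unmatched quote is not reported as an error; it just ends up inside an ordinary token.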

2012-05-22 03:12:23

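I'm not sure whether the input contains Unicode, but your code will not be able to handle it. – nhahtdh

This is a nice example. +1. Why not add an if to check whether m.group() returns a space, so that you don't have to output the spaces? –

Brilliant... +1 –

Try this ugly code.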

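// Fragment: needs java.util.ArrayList, java.util.Arrays and java.util.List.
String str = "hello my dear \"John Smith\" where is Ted Barry";
List<String> list = Arrays.asList(str.split("\\s"));
List<String> resultList = new ArrayList<String>();
StringBuilder builder = new StringBuilder();
for (String s : list) {
    if (s.startsWith("\"")) {
        // Opening quote: buffer the first word of the quoted phrase.
        builder.append(s.substring(1)).append(" ");
    } else {
        // Any other word: append it to the buffer (dropping a closing quote,
        // if present), emit the result and clear the buffer.
        resultList.add((s.endsWith("\"")
                ? builder.append(s.substring(0, s.length() - 1))
                : builder.append(s)).toString());
        builder.delete(0, builder.length());
    }
}
System.out.println(resultList);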

2012-05-22 03:35:13

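Much better than my code. +1 –

Excess whitespace will cause the program to generate empty strings. – nhahtdh

@nhahtdh: Oh yeah. Actually, I was only offering a hint, not a 100% working solution. Trevor Senior nailed it. However, that one has the same whitespace problem too. But it is not a real problem and can be fixed easily. –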
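
The whitespace problem discussed in these comments can be avoided in the same word-by-word style by splitting on `\\s+` and tracking whether a quote is currently open. What follows is a rough sketch, not the answerer's code: it assumes quotes are attached to the first and last word of a phrase and are never escaped, and the names `WordwiseSplit` and `split` are invented for illustration. Runs of whitespace inside quotes collapse to a single space.

```java
import java.util.ArrayList;
import java.util.List;

class WordwiseSplit {
    static List<String> split(String input) {
        List<String> result = new ArrayList<>();
        StringBuilder phrase = new StringBuilder();
        boolean inQuote = false;
        for (String word : input.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // only happens when the input is blank
            }
            if (!inQuote) {
                if (word.startsWith("\"")) {
                    String rest = word.substring(1);
                    if (rest.endsWith("\"")) {
                        // Quote opens and closes in the same word, e.g. "John"
                        result.add(rest.substring(0, rest.length() - 1));
                    } else {
                        phrase.append(rest);
                        inQuote = true;
                    }
                } else {
                    result.add(word);
                }
            } else if (word.endsWith("\"")) {
                // Closing word of the phrase: emit the buffered text.
                phrase.append(' ').append(word, 0, word.length() - 1);
                result.add(phrase.toString());
                phrase.setLength(0);
                inQuote = false;
            } else {
                phrase.append(' ').append(word); // middle word of the phrase
            }
        }
        if (inQuote) {
            result.add(phrase.toString()); // unclosed quote: keep what was buffered
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(split("hello   my dear \"John Smith\"  where is Ted Barry"));
    }
}
```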

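commons-lang has a StrTokenizer class to do this for you, and there is also the java-csv library.

Example with StrTokenizer:

// Commons Lang 3 package; Commons Lang 2.x uses org.apache.commons.lang.text
import org.apache.commons.lang3.text.StrTokenizer;

String params = "\"John Smith\" Ted Barry";
// Initialize tokenizer with input string, delimiter character, quote character
StrTokenizer tokenizer = new StrTokenizer(params, ' ', '"');
for (String token : tokenizer.getTokenArray()) {
    System.out.println(token);
}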

Output:

John Smith 
Ted 
Barry
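
If adding a library is not an option, the JDK's own `java.io.StreamTokenizer` can be configured to behave in a similar way. This is a different class from StrTokenizer and is shown only as a sketch; the wrapper class `StreamTokenizerSplit` and its `split` method are hypothetical names. Note that StreamTokenizer interprets backslash escapes inside quoted strings and ends a quoted string at a line terminator, which may or may not be what you want.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class StreamTokenizerSplit {
    static List<String> split(String input) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        st.resetSyntax();           // drop the default number and comment handling
        st.wordChars('!', 255);     // characters from '!' up to 255 form words
        st.whitespaceChars(0, ' '); // control characters and space separate tokens
        st.quoteChar('"');          // '"' delimits a quoted string
        List<String> tokens = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                // sval holds the word, or the body of the quoted string
                tokens.add(st.sval);
            }
        } catch (IOException e) {
            // Not expected when reading from a StringReader
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("\"John Smith\" Ted Barry")); // [John Smith, Ted, Barry]
    }
}
```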

2012-05-22 03:35:18 Matt

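@BasilioGerman I added an example, so you may want to consider deleting your comment. –

Well, I wrote a small snippet that does what you want, and a bit more. Since you did not specify any further conditions, I did not go to the trouble of handling them. I know this is a somewhat dirty way of doing it and you can probably get better results. But for the fun of programming, here is the example: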

String example = "hello\"John Smith\" Ted Barry lol\"Basi German\"hello";
int wordQuoteStartIndex = 0;
int wordQuoteEndIndex = 0;

int wordSpaceStartIndex = 0;
int wordSpaceEndIndex = 0;

boolean foundQuote = false;
for (int index = 0; index < example.length(); index++) {
    if (example.charAt(index) == '\"') {
        if (foundQuote == true) {
            wordQuoteEndIndex = index + 1;
            // Print the quoted word
            System.out.println(example.substring(wordQuoteStartIndex, wordQuoteEndIndex)); // here you can remove quotes by changing to (wordQuoteStartIndex+1, wordQuoteEndIndex-1)
            foundQuote = false;
            if (index + 1 < example.length()) {
                wordSpaceStartIndex = index + 1;
            }
        } else {
            wordSpaceEndIndex = index;
            if (wordSpaceStartIndex != wordSpaceEndIndex) {
                // Print the word in spaces
                System.out.println(example.substring(wordSpaceStartIndex, wordSpaceEndIndex));
            }
            wordQuoteStartIndex = index;
            foundQuote = true;
        }
    }

    if (foundQuote == false) {
        if (example.charAt(index) == ' ') {
            wordSpaceEndIndex = index;
            if (wordSpaceStartIndex != wordSpaceEndIndex) {
                // Print the word in spaces
                System.out.println(example.substring(wordSpaceStartIndex, wordSpaceEndIndex));
            }
            wordSpaceStartIndex = index + 1;
        }

        if (index == example.length() - 1) {
            if (example.charAt(index) != '\"') {
                // Print the word in spaces
                System.out.println(example.substring(wordSpaceStartIndex, example.length()));
            }
        }
    }
}

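This also handles words that are not separated by a space before or after a quote, such as the hello before "John Smith" and the hello after "Basi German".

When the string is changed to "John Smith" Ted Barry, the output is three strings: 1) "John Smith" 2) Ted 3) Barry

In the example, the string is hello"John Smith" Ted Barry lol"Basi German"hello, and it prints 1) hello 2) "John Smith" 3) Ted 4) Barry 5) lol 6) "Basi German" 7) hello

Hope it helps.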

2012-05-22 03:35:29

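This is the best code of all of these. It can handle Unicode input and does not generate empty strings when there is excess whitespace. It keeps everything inside the quotes, including the quotes themselves (well, this can be a plus or a minus). I think the code can be modified slightly to remove the quotes. A further extension could be adding support for escaped quotes. – nhahtdh

Of course, the quotes can be removed. I just chose to keep them. I've added a comment about removing the quotes. –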
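
The escaped-quote extension suggested in the comment above could look something like this. It is a separate minimal sketch rather than a modification of the answer's code, and it makes two assumptions of its own: a quote is escaped as `\"`, and quotes are stripped from the output. The class name `EscapeAwareTokenizer` is hypothetical. Unlike the answer, text glued to a quote (as in hello"John Smith") stays in the same token.

```java
import java.util.ArrayList;
import java.util.List;

class EscapeAwareTokenizer {
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        boolean hasToken = false; // true once a token has started, so "" yields an empty token
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\\' && i + 1 < input.length() && input.charAt(i + 1) == '"') {
                current.append('"'); // escaped quote: keep it as a literal character
                hasToken = true;
                i++;                 // skip the quote that was just consumed
            } else if (c == '"') {
                inQuotes = !inQuotes; // opening or closing quote, not copied to the output
                hasToken = true;
            } else if (Character.isWhitespace(c) && !inQuotes) {
                if (hasToken) {       // whitespace outside quotes ends the current token
                    tokens.add(current.toString());
                    current.setLength(0);
                    hasToken = false;
                }
            } else {
                current.append(c);
                hasToken = true;
            }
        }
        if (hasToken) {
            tokens.add(current.toString()); // flush the last token
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [hello, John Smith, say "hi", Ted]
        System.out.println(tokenize("hello \"John Smith\" \"say \\\"hi\\\"\" Ted"));
    }
}
```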

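Here is my own version, cleaned up from http://pastebin.com/aZngu65y (posted in the comments). It takes care of Unicode. It cleans up all excess whitespace (even inside quotes) - this may be good or bad depending on your needs. Escaped quotes are not supported.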

// Requires: import java.util.Arrays;
private static String[] parse(String param) {
    String[] output;

    param = param.replaceAll("\"", " \" ").trim();
    String[] fragments = param.split("\\s+");

    int curr = 0;
    boolean matched = fragments[curr].matches("[^\"]*");
    if (matched) curr++;

    for (int i = 1; i < fragments.length; i++) {
        if (!matched)
            fragments[curr] = fragments[curr] + " " + fragments[i];

        if (!fragments[curr].matches("(\"[^\"]*\"|[^\"]*)"))
            matched = false;
        else {
            matched = true;

            if (fragments[curr].matches("\"[^\"]*\""))
                fragments[curr] = fragments[curr].substring(1, fragments[curr].length() - 1).trim();

            if (fragments[curr].length() != 0)
                curr++;

            if (i + 1 < fragments.length)
                fragments[curr] = fragments[i + 1];
        }
    }

    if (matched) {
        return Arrays.copyOf(fragments, curr);
    }

    return null; // Parameter failure (double-quotes do not match up properly).
}

Sample input for comparison:

"sdfskjf" sdfjkhsd "hfrif ehref" "fksdfj sdkfj fkdsjf" sdf sfssd 


asjdhj sdf ffhj "fdsf fsdjh" 
日本語　中文 "Tiếng Việt" "English" 
    dsfsd  
    sdf  " s dfs fsd f " sd f fs df fdssf "日本語　中文" 
"" ""  "" 
" sdfsfds " "f fsdf

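(The second line is empty, the third line consists of spaces, and the last line is malformed.) Please judge against your own expected output, as it may vary, but the baseline is that the first case should return [sdfskjf, sdfjkhsd, hfrif ehref, fksdfj sdkfj fkdsjf, sdf, sfssd].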

2012-05-22 04:23:00 nhahtdh
