2016-05-05 118 views
0

Split string containing Chinese or Japanese or English into words

How can I split a string containing Chinese, Japanese, or English text into words using a regex or a utility class?

Example 1:

根據從2013年的一項研究,由一羣來自美國俄亥俄州立大學的研

Output 1:

根據從2013 年的一項研究,由一羣來自美國俄亥俄州立大學的研

Example 2:

According to a 2013 study by a research group from the US to

Output 2:

According, to, a, 2013, study, by, a, research, group, from, the, US, to

The input will never mix English with Japanese; the two always arrive as separate strings. An English string should still be split by this piece of code:

words = input.split("[ ./()\\[\\]=,<>;\"']+"); 

If this is not possible in Java, please suggest if the Non-English input strings could be separated by whitespace characters only.
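For reference, here is a minimal runnable sketch of the English case using the delimiter pattern above. The class name `EnglishSplitDemo` and the helper `splitEnglish` are made up for illustration; only `String.split` and the pattern come from the question.

```java
class EnglishSplitDemo {
    // Splits on runs of the delimiters from the question:
    // space . / ( ) [ ] = , < > ; " '
    static String[] splitEnglish(String input) {
        return input.split("[ ./()\\[\\]=,<>;\"']+");
    }

    public static void main(String[] args) {
        String input = "According to a 2013 study by a research group from the US to";
        // Join with ", " to mirror the format of Output 2.
        System.out.println(String.join(", ", splitEnglish(input)));
    }
}
```

Note that if the input starts with one of the delimiters, `String.split` returns an empty string as the first element, so a caller may want to filter empty tokens.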

+1

May I ask why there is no space in-between "年的", "項研究" and "羣來" in output 1? – Pang

+0

Sorry, I don't know Chinese, so that was a mistake. – Kishore

Answers

3

I think the problem that you may have with Chinese (and maybe Japanese as well, but I don't know as much about it) is that the word breaks are contextual. Sometimes two characters will be two separate words, sometimes the same two characters will be a single word.

So I think you will need to parse the text to be able to do this.
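As a starting point, the JDK ships a locale-aware boundary analyzer, `java.text.BreakIterator`. The sketch below is only an assumption about how one might use it here: the class name `BreakIteratorDemo` and the helper `words` are hypothetical, and it keeps only the segments that contain a letter or digit. It makes no promise that the Chinese or Japanese segments are linguistically correct words; for that, a dictionary-based segmenter is likely needed.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakIteratorDemo {
    // Collects the segments that BreakIterator reports between word
    // boundaries, dropping segments made only of whitespace or punctuation.
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("According to a 2013 study", Locale.ENGLISH));
        // How this text is segmented depends on the JDK's boundary rules.
        System.out.println(words("根據從2013年的一項研究", Locale.TRADITIONAL_CHINESE));
    }
}
```

Because the iterator only reports boundaries, the caller decides which segments count as words; the letter-or-digit filter here is one simple choice, not the only one.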

+1

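Definitely in the case of Japanese, no regex is going to come close to doing this. For the most part there is no whitespace, and knowing where words begin and end depends on recognizing kanji, knowing how verbs and adjectives conjugate, and recognizing particles. You need a natural-language parser. –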

+1

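Hmm, I suppose I could edit it to say "you can't" - not every question can be answered within the bounds the asker hopes for. It's a reasonable enough question, but the answer isn't "like this..." but "you can't, because..." – dlu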

+0

@dlu, yes, I tried a natural-language parser, "java.text.BreakIterator". It works fine for Japanese, but for English it does not give the same result as splitting by "[ ./()\\[\\]=,<>;\"']+". – Kishore

1

Example 1:

根據從2013年的一項研究,由一羣來自美國俄亥俄州立大學的研

Output 1:

根據從2013 年的一項研究,由一羣來自美國俄亥俄州立大學的研

This is incorrect Chinese. The correct output should be:

根據從2013年的一項研究,由一羣來自美國俄亥俄州立大學的研

You need a Chinese word-segmentation library to do this.

+0

OK... can you recommend any free Java library? – Kishore