Split string containing Chinese or Japanese or English into words

How can I split a string containing Chinese, Japanese, or English into words using a regex or a utility class?

Example 1:

根據從2013年的一項研究，由一羣來自美國俄亥俄州立大學的研

Output 1:

根據從2013 年的一項研究，由一羣來自美國俄亥俄州立大學的研

Example 2:

According to a 2013 study by a research group from the US to

Output 2:

According, to, a, 2013, study, by, a, research, group, from, the, US, to

It's certain that the input string will not mix English with Japanese - both will come in separate strings; but yes, an English string should also be split by this piece of code:

String[] words = input.split("[ ./()\\[\\]=,<>;\"']+");

If this is not possible in Java, please suggest if the Non-English input strings could be separated by whitespace characters only.

Source

2016-05-05 Kishore

May I ask why there is no space in-between "年的", "項研究" and "羣來" in output 1? – Pang

Sorry, I don't know Chinese, so that was a mistake on my part. – Kishore

I think the problem that you may have with Chinese (and maybe Japanese as well, but I don't know as much about it) is that the word breaks are contextual. Sometimes two characters will be two separate words, sometimes the same two characters will be a single word.

So I think you will need to parse the text to be able to do this.

Source

2016-05-05 14:05:17 dlu

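Definitely in the case of Japanese, no regular expression will even begin to do this. For the most part there is no whitespace, and knowing where words begin and end depends on recognizing kanji, knowing how verbs and adjectives conjugate, and recognizing particles. You need a natural language parser. –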

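Well, I suppose I could edit it to say "you can't" - not every question can be answered within the bounds the asker hopes for. It's a reasonable enough question, but the answer isn't "like this..." it's "you can't, because..." – dlu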

@dlu, yes, I tried a natural language parser, "java.text.BreakIterator". It works fine for Japanese, but it does not satisfy the requirement of splitting English strings by "[ ./()\\[\\]=,<>;\"']+". – Kishore
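For readers who want to try the java.text.BreakIterator approach mentioned in the comment above, here is a minimal sketch. The class name BreakIteratorSplitter and the method splitWords are made up for illustration, and the filter that drops whitespace and punctuation segments is my assumption about the desired output, not something stated in the thread. The JDK's built-in rules are not guaranteed to produce linguistically correct words for Chinese, so treat the CJK result as a starting point only.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, for illustration only.
class BreakIteratorSplitter {

    // Returns the segments found by BreakIterator that contain at least
    // one letter or digit, so whitespace and punctuation are skipped.
    static List<String> splitWords(String text, Locale locale) {
        List<String> words = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                words.add(segment);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // English input from Example 2 in the question.
        System.out.println(splitWords(
                "According to a 2013 study by a research group from the US to",
                Locale.ENGLISH));
        // CJK input; the segmentation here depends on the JDK's rules.
        System.out.println(splitWords("根據從2013年的一項研究", Locale.TRADITIONAL_CHINESE));
    }
}
```

For the English sentence this should give the same words as Output 2 in the question. It is not a drop-in replacement for the regex, though: the boundary rules differ from the character class in the question (a string such as "a=b" may be handled differently), so check it against your own English inputs.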

Example 1:

根據從2013年的一項研究，由一羣來自美國俄亥俄州立大學的研

Output 1:

根據從2013 年的一項研究，由一羣來自美國俄亥俄州立大學的研

This is incorrect Chinese. The correct output should be:

根據從2013年的一項研究，由一羣來自美國俄亥俄州立大學的研

You need a library for Chinese words to do this.
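To give a rough idea of what such a library does internally, here is a minimal sketch of dictionary-based forward maximum matching. This is a simplification I am swapping in for illustration, not the method of any particular library: the class MaxMatchSegmenter, the tiny DICTIONARY, and MAX_WORD_LENGTH are all made up, and a real segmenter would use a large lexicon plus statistics to resolve the contextual ambiguity described in the other answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical class, for illustration only.
class MaxMatchSegmenter {

    // Toy dictionary; a real one would hold many thousands of entries.
    static final Set<String> DICTIONARY =
            Set.of("根據", "研究", "來自", "美國", "俄亥俄州立大學", "一項", "一羣");

    // Length of the longest dictionary entry, in chars.
    static final int MAX_WORD_LENGTH = 7;

    // Greedy forward maximum matching: at each position take the longest
    // dictionary word that starts there, falling back to a single char.
    // Assumes the text contains no surrogate pairs.
    static List<String> segment(String text) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + MAX_WORD_LENGTH);
            // Shrink the candidate until it is in the dictionary
            // or only one character is left.
            while (end > i + 1 && !DICTIONARY.contains(text.substring(i, end))) {
                end--;
            }
            words.add(text.substring(i, end));
            i = end;
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(segment("來自美國俄亥俄州立大學"));
    }
}
```

Anything not in the dictionary, including digits such as "2013", comes out one character at a time, and the greedy choice can be wrong when the same characters form different words in different contexts. That is why a maintained segmentation library is the better choice for real text.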

Source

2016-05-08 03:34:08

OK... can you tell me of any free Java library? – Kishore
