Where does the leading whitespace in an RDD come from, and how can it be avoided?

string_integers.txt

a 1 2 3 
b 4 5 6 
c 7 8 9

sample.py

import re 
pattern = re.compile("(^[a-z]+)\s") 

txt = sc.textFile("string_integers.txt") 
string_integers_separated = txt.map(lambda x: pattern.split(x)) 

print string_integers_separated.collect()

Result

[[u'', u'a', u'1 2 3'], [u'', u'b', u'4 5 6'], [u'', u'c', u'7 8 9']]

Expected result

[[u'a', u'1 2 3'], [u'b', u'4 5 6'], [u'c', u'7 8 9']]

Source

2016-12-31 030

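You are splitting on a pattern anchored at the beginning of the string, so the prefix (the text before the match) will always be an empty string. You can, for example, use match instead: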

pattern = re.compile("([a-z]+)\s+(.*$)") 
pattern.match("a 1 2 3").groups() 
# ('a', '1 2 3')

or a lookbehind (as written, this pattern only matches whitespace that follows the letter a):

pattern = re.compile("(?<=a)\s") 
pattern.split("a 1 2 3", maxsplit=1) 
# ['a', '1 2 3']

or just split:

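"a 1 2 3".split(maxsplit=1)  # keyword form works in Python 3 only
# ['a', '1 2 3']
# In Python 2, use the positional form: "a 1 2 3".split(None, 1)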
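
To see where the empty string comes from, here is a minimal sketch that runs the question's pattern on a single line outside Spark. It uses only the standard re module and a sample line from string_integers.txt. The generalized lookbehind at the end, (?<=[a-z]), is an assumption about how the lookbehind above could be extended to any lowercase prefix, not something stated in the answer.

```python
import re

# The pattern from the question: a capturing group anchored at the start.
pattern = re.compile(r"(^[a-z]+)\s")

line = "a 1 2 3"

# re.split returns the text before the match, then the captured group,
# then the text after the match. The match starts at index 0, so the
# text before it is the empty string.
parts = pattern.split(line)
print(parts)  # ['', 'a', '1 2 3']

# Dropping empty pieces afterwards gives the expected result.
cleaned = [p for p in parts if p]
print(cleaned)  # ['a', '1 2 3']

# Assumed generalization of the lookbehind: split once on the first
# whitespace character that follows any lowercase letter.
lookbehind = re.compile(r"(?<=[a-z])\s")
generalized = lookbehind.split("b 4 5 6", maxsplit=1)
print(generalized)  # ['b', '4 5 6']
```

Filtering out empty pieces works, but avoiding the anchored split in the first place (match, lookbehind, or plain str.split) is simpler.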

Source

2016-12-31 22:18:52 user7361501

'TypeError: split() takes no keyword arguments' – 030

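It is still not clear to me where the leading whitespace (the empty u'' element) comes from. For some reason, the regular expression split introduces it. Based on an example found in this documentation, I wrote a split that does not introduce a leading empty element: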

txt.map(lambda x: x.split(' ', 1)).collect() 
#[[u'a', u'1 2 3'], [u'b', u'4 5 6'], [u'c', u'7 8 9']]

Description

str.split(str="", num=string.count(str))

The optional num argument limits the number of splits to num.

Using x.split(' ', 2) returns [[u'a', u'1', u'2 3'], [u'b', u'4', u'5 6'], [u'c', u'7', u'8 9']]
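
The effect of the split limit can be checked without a Spark context. The sketch below applies a small helper to the three sample lines in plain Python; split_prefix is a hypothetical name introduced here for illustration, and the positional form of str.split is used so the code runs under both Python 2 and Python 3.

```python
def split_prefix(line):
    """Split a line into [prefix, rest] at the first run of whitespace."""
    # Positional arguments work in both Python 2 and Python 3;
    # None means "split on any whitespace", 1 is the split limit.
    return line.split(None, 1)

lines = ["a 1 2 3", "b 4 5 6", "c 7 8 9"]
result = [split_prefix(line) for line in lines]
print(result)  # [['a', '1 2 3'], ['b', '4 5 6'], ['c', '7 8 9']]

# With an RDD this would be (assuming `txt` is the RDD from the question):
# txt.map(split_prefix).collect()
```

Passing a named function instead of a lambda also makes it easy to test the parsing logic locally before running it on the cluster.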

Source

2017-01-01 19:08:08 030
