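I have to write a function removePunctuation that strips punctuation, and its result has to pass the test below. How do I get the name of a Column, or change its existing name?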
# TEST Capitalization and punctuation (4b)
testPunctDF = sqlContext.createDataFrame([(" The Elephant's 4 cats. ",)])
testPunctDF.show()
Test.assertEquals(testPunctDF.select(removePunctuation(col('_1'))).first()[0],
'the elephants 4 cats',
'incorrect definition for removePunctuation function')
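The string this test expects can be reproduced in plain Python, which may help pin down what the cleaning has to do before it is expressed with Spark column functions. This is only an illustrative sketch: clean_sentence is a hypothetical helper, not part of the lab or of Spark, and the character class is one reading of "only spaces, letters, and numbers".

```python
import re


def clean_sentence(text):
    """Plain-Python illustration of the cleaning the test expects."""
    # Delete (rather than replace with a space) every character that is not
    # an ASCII letter, a digit, or a space, so "Elephant's" becomes "Elephants".
    kept = re.sub(r"[^a-zA-Z0-9 ]", "", text)
    # Lower-case and strip leading/trailing spaces after punctuation is gone.
    return kept.lower().strip()


sample = " The Elephant's 4 cats. "
print(clean_sentence(sample))

# For contrast: replacing each run of punctuation with a space
# splits the possessive apart instead of closing it up.
print(re.sub(r"[\W_]+", " ", sample).lower().strip())
```

The first call yields the string the test asserts; the second shows why substituting a space for punctuation would not match it.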
This is what I have managed to write:
def removePunctuation(column):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.

    Note:
        Only spaces, letters, and numbers should be retained. Other characters should be
        eliminated (e.g. it's becomes its). Leading and trailing spaces should be removed after
        punctuation is removed.

    Args:
        column (Column): A Column containing a sentence.

    Returns:
        Column: A Column named 'sentence' with clean-up operations applied.
    """
    return lower(trim(regexp_replace("column_name", "[\W_]+"," "))).alias("sentence");
But I still cannot get regexp_replace to work with the alias "sentence". I get this error:
AnalysisException: u"cannot resolve 'sentence' given input columns: [_1];"
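Oh, sorry, there is an error in the code I posted: the first argument of regexp_replace() should have been "column_name". Anyway, I have already solved it, but thanks. –
@DmitrijKostyushko Glad you solved it! If I had known that the code in your question was not the code you were actually using, I might have posted a better answer. Remember to accept an answer later. ;) – gsamaras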