Suppose I want to remove every match of the regular expression "RT\s*@USER\w\w{8}:\s*" from the strings in my RDD. How can I strip a given regex from an RDD in PySpark?
My current RDD is:
text = sc.textFile(...)
delimited = text.map(lambda x: x.split("\t"))
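Here is the part where I try to remove the regex. I tried each of the following RDD transformations to get rid of every string that matches it, but every one of them gives me an error.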
abc = delimited.map(lambda x: re.sub(r"RT\s*@USER\w\w{8}:\s*", " ", x))
TypeError: expected string or buffer
and
abc = re.sub(r"RT\s*@USER\w\w{8}:\s*", " ", delimited)
TypeError: expected string or buffer
and
abc = delimited.map(lambda x: re.sub(r"RT\s*@USER\w\w{8}:\s*", " ", text))
Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
I want to remove this regex so that I can move on to the next RDD transformation. How do I write this in PySpark?
Thanks very much... – kys92
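For reference, the three errors seem to share one cause: re.sub needs a single string as its third argument. After split("\t"), each element x of delimited is a list of strings. The second attempt passes the RDD object itself, and the third refers to the RDD text from inside a map, which Spark does not allow. Below is a minimal sketch of one possible fix, assuming the substitution should apply to every tab-separated field. The names strip_retweet_prefix and RETWEET_PREFIX are hypothetical and not from the original post. The sample record is invented for illustration, and the example runs on a plain Python list so it can be tried without a Spark context.

```python
import re

# Pattern copied from the question; compiled once so every record reuses it.
RETWEET_PREFIX = re.compile(r"RT\s*@USER\w\w{8}:\s*")

def strip_retweet_prefix(fields, replacement=" "):
    """Apply the substitution to each string in one split record (a list)."""
    return [RETWEET_PREFIX.sub(replacement, field) for field in fields]

# Local stand-in for a single element of `delimited`: one line split on tabs.
# The line itself is made up for illustration.
sample = "42\tRT @USER_abcd1234: hello world".split("\t")
cleaned_sample = strip_retweet_prefix(sample)

# On the real RDD the same function would be handed to map (not run here):
# abc = delimited.map(strip_retweet_prefix)
```

If only one column holds the tweet text, it may be simpler to apply the substitution to that index alone, or to run it on each raw line before splitting, e.g. text.map(lambda line: RETWEET_PREFIX.sub(" ", line)).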