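Conditionally split comma-separated values in a PySpark list

I am trying to run a job in PySpark. My data is in an RDD created with the PySpark SparkContext class (sc) as follows: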
directory_file = sc.textFile('directory.csv')
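*I don't think Python's csv module can be used on the data inside an RDD.
This generates a list for each row in the csv. I know it's messy, but here is a sample of one list (equivalent to a row from the original csv):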
[u'14K685,El Puente Academy for Peace and Justice,Brooklyn,K778,718-387-1125,718-387-4229,9,12,,,"B24, B39, B44, B44-SBS, B46, B48, B57, B60, B62, Q54, Q59","G to Broadway ; J, M to Hewes St ; Z to Marcy Ave",250 Hooper Street,Brooklyn,NY,11211,www.elpuente.us,225,N/A,Consortium School,"We are a small, innovative learning community that promotes comprehensive academic excellence for all students while inspiring and nurturing leadership for peace and justice. Our state-of-the-art facility allows for a creative and intellectually challenging environment where every student thrives. Our project-based curriculum is designed to prepare students to be active citizens and independent thinkers who share a passion for transforming their communities and the world into a better place. Our trimester system allows students to complete most of their high school credits by the 11th grade, opening opportunities for exciting internships and college courses during the school day in their senior year.","Accelerated credit accumulation (up to 18 credits per year), iLearn, iZone 360, Year-long SAT (Scholastic Aptitude Test) preparatory course, Individualized college counseling, Early College Awareness & Preparatory Program (ECAPP). 
Visits to college campuses in NYC, Visits to colleges outside NYC in partnership with the El Puente Leadership Center, Internships, Community-based Projects, Portfolio Assessment, Integrated-Arts Projects, Before- and After-school Tutoring; Elective courses include: Drama, Dance (Men\'s and Women\'s Groups), Debate Team partnership with Tufts University, Guitar, Filmmaking, Architecture, Glee",Spanish,,,,"AM and PM Academic Support, B-Boy/B-Girl, Chorus, College and Vocational Counseling and Placement, College Prep, Community Development Project, Computers, Dance Level 1 and 2, Individual Drama; Education for Public Inquiry and International Citizenship (EPIIC), El Puente Leadership Center, Film, Fine Arts, Liberation, Media, Men\u2019s and Women\u2019s Groups, Movement Theater Level 1, Movement Theater Level 2, Music, Music Production, Pre-professional training in Dance, PSAT/SAT Prep, Spoken Word, Student Council, Teatro El Puente, Visual Art",,,,"Boys & Girls Basketball, Baseball, Softball, Volleyball",El Puente Williamsburg Leadership Center; The El Puente Bushwick Center; Leadership Center at Taylor-Wythe Houses; Beacon Leadership Center at MS50.,"Woodhull Medical Center, Governor Hospital","Hunter College (CUNY), Eugene Lang College The New School for Liberal Arts, Pratt College of Design, Tufts University, and Touro College.","El Puente Leadership Center, El Puente Bushwick Center, Beacon Leadership Center at MS50, Leadership Center at Taylor-Wythe Houses, Center for Puerto Rican Studies, Hip- Hop Theatre Festival, Urban Word, and Summer Search.",,,,,Our school requires assessment of an Academic Portfolio for graduation.,,9:00 AM,3:30 PM,This school will provide students with disabilities the supports and services indicated on their IEPs.,ESL,Not Functionally Accessible,1,Priority to Brooklyn students or residents,Then to New York City residents,,,,,,,,,"250 Hooper Street']
I would like to split each item using the comma as the delimiter, except when the comma sits between double quotes (e.g. ",,,").
parsed = directory_file.map(lambda x: x.split(','))
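This obviously does not handle commas between double quotes. Is there a way to do this? I have looked at this question, which specifically mentions csv, but because in this case the csv is first loaded into a Spark RDD, I'm pretty sure the csv module doesn't apply here.
Thanks.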
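For reference, the csv module can still be applied to one line at a time, since `csv.reader` accepts any iterable of strings, including a one-element list. The following is a minimal sketch under that assumption, not a tested Spark job: `split_csv_line` and the shortened `sample` string are hypothetical names introduced here for illustration, and the Spark call is left as a comment so the snippet runs without a cluster.

```python
import csv

def split_csv_line(line):
    """Split one CSV line on commas, keeping quoted fields intact."""
    # csv.reader takes an iterable of lines; wrap the single line in a list
    # and pull out the one parsed row.
    return next(csv.reader([line]))

# Shortened, hypothetical version of the row shown in the question.
sample = ('14K685,El Puente Academy,"B24, B39, B44",'
          '"G to Broadway ; J, M to Hewes St",250 Hooper Street')
fields = split_csv_line(sample)
print(fields)

# With a SparkContext available, the same function could be passed to map:
# parsed = directory_file.map(split_csv_line)
```

Two caveats. The `u'...'` prefix in the sample suggests Python 2, where the csv module expects byte strings, so each line may need to be encoded to UTF-8 first. Also, a quoted field that contains a line break would arrive as two separate lines from `sc.textFile`, which this per-line approach cannot rejoin.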
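Thanks for the response, but this is specific to a list that exists as an RDD object; since that is not iterable, this doesn't work for me. It's a pretty strange situation. – dstar
Sorry, I'm not familiar with Spark. Is there a way to make it iterable? Also, I'm only iterating over a 'str' here. – Signal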
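On the iterability point: an RDD's `mapPartitions` method calls a function with an iterator over the lines in each partition, and `csv.reader` accepts exactly that kind of iterator. A minimal sketch, assuming each record fits on one line; `parse_partition` is a hypothetical helper name, and the Spark call is commented out so the snippet runs standalone.

```python
import csv

def parse_partition(lines):
    """Parse an iterator of CSV lines, yielding one list of fields per row."""
    # csv.reader is itself an iterator, which is what mapPartitions expects back.
    return csv.reader(lines)

# Stand-in for the iterator Spark would hand to the function.
rows = list(parse_partition(iter(['a,"b, c",d', 'e,f,"g, h"'])))
print(rows)

# With a SparkContext available:
# parsed = directory_file.mapPartitions(parse_partition)
```

This builds one reader per partition rather than one per line, which may reduce overhead on large files.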