2013-06-03 25 views
1

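Aggregate text fields from multiple lines (JavaScript/Python)

First of all, I should mention that I am fairly new to coding and have only a superficial knowledge of Python and JavaScript.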

I have a huge TXT file containing names and their team names, structured as follows:

Name1, Surname1 Team1 
        Team2 
        Team3 
Name2, Surname2 Team2 
        Team4 
Name3, Surname3 Team1 
        Team5 

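Ideally, I would like to search my data by Team# and get back the names of the people who belong to that team.

For example, suppose I need the members of Team1 and Team2. My new TXT output should look like this: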

Team1, Name1, Surname1, Name3, Surname3 
Team2, Name1, Surname1, Name2, Surname2 

Thank you very much for your help.

+0

What exactly is your input structure? One line, multiple lines, and where are the line breaks? – Johannes

+0

Are there spaces in the surnames and/or team names? Are there tabs in between, or are the team names in a fixed column? –

+0

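@Johannes: The input is very messy. The only "structured" part is "Name1, Surname1", which always has a comma followed by one space. As for the teams, they are usually placed in a fixed column; however, the first team reported (on the name/surname line) is often not aligned with the team column, depending on the "Name, Surname" text on that line. – user2447387

Answers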

0

Here is a Python version you can take a look at:

import io
from collections import defaultdict

# Sample input: a name line carries the first team,
# each continuation line carries one more team for the same person.
fobj_in = io.StringIO("""Name1, Surname1 Team1
        Team2
        Team3
Name2, Surname2 Team2
        Team4
Name3, Surname3 Team1
        Team5""")

fobj_out = io.StringIO()

teams = defaultdict(list)  # team name -> list of [name, surname] pairs

for line in fobj_in:
    items = line.split()
    if not items:
        continue  # skip blank lines
    if len(items) == 3:
        # "Name, Surname Team": remember the person for the following lines
        name = items[:2]
        team = items[2]
    else:
        # continuation line: only a team, belonging to the last person seen
        team = items[0]
    teams[team].append(name)

for team_name in sorted(teams):
    people = ', '.join('{} {}'.format(*name) for name in teams[team_name])
    fobj_out.write('{}, {}\n'.format(team_name, people))

fobj_out.seek(0)
print(fobj_out.read())

Output:

Team1, Name1, Surname1, Name3, Surname3 
Team2, Name1, Surname1, Name2, Surname2 
Team3, Name1, Surname1 
Team4, Name2, Surname2 
Team5, Name3, Surname3 
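
The question only asks for Team1 and Team2, while the output above lists every team. A minimal sketch of restricting the result to requested teams follows. The `format_teams` helper and the `wanted` argument are illustrative names that do not appear in the code above, and the names are stored as plain strings here for brevity:

```python
from collections import defaultdict


def format_teams(teams, wanted):
    """Return one 'Team, Name, Surname, ...' line per requested team that exists."""
    lines = []
    for team_name in wanted:
        if team_name in teams:  # membership test does not create missing keys
            lines.append('{}, {}'.format(team_name, ', '.join(teams[team_name])))
    return lines


# Small stand-in for the mapping built by the parsing loop (assumed data).
teams = defaultdict(list)
teams['Team1'] += ['Name1, Surname1', 'Name3, Surname3']
teams['Team2'] += ['Name1, Surname1', 'Name2, Surname2']
teams['Team3'] += ['Name1, Surname1']

for line in format_teams(teams, ['Team1', 'Team2']):
    print(line)
```

Teams that are requested but never appear in the input are silently skipped in this sketch; raising an error instead may be preferable, depending on the use case.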

To read from and write to actual files, just do this:

fobj_in = open('in_file.txt') 
fobj_out = open('out_file.txt', 'w') 
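
As a possible variation, the files can be opened with `with` so that they are closed automatically, even if parsing fails. The sketch below wraps the same parsing idea in a hypothetical `summarize` function and uses a temporary directory so it runs without existing files. The explicit `utf-8` encoding is an assumption about the data:

```python
import os
import tempfile
from collections import defaultdict


def summarize(fobj_in, fobj_out):
    """Group names by team, reading and writing any text file objects."""
    teams = defaultdict(list)
    name = None
    for line in fobj_in:
        items = line.split()
        if not items:
            continue  # skip blank lines
        if len(items) == 3:
            name = ' '.join(items[:2])  # "Name1," + "Surname1"
            team = items[2]
        else:
            team = items[0]
        teams[team].append(name)
    for team_name in sorted(teams):
        fobj_out.write('{}, {}\n'.format(team_name, ', '.join(teams[team_name])))


# Demo: create a small input file, process it, and read the result back.
with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, 'in_file.txt')
    out_path = os.path.join(tmp, 'out_file.txt')
    with open(in_path, 'w', encoding='utf-8') as f:
        f.write('Name1, Surname1 Team1\n        Team2\nName2, Surname2 Team2\n')
    with open(in_path, encoding='utf-8') as fobj_in, \
            open(out_path, 'w', encoding='utf-8') as fobj_out:
        summarize(fobj_in, fobj_out)
    with open(out_path, encoding='utf-8') as f:
        result = f.read()

print(result)
```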

EDIT: The sample data does not seem to contain a case that would lead to multiple names on one output line.

With this input data, we need to change the code:

from collections import defaultdict

teams = defaultdict(list)  # team -> list of names

for line in fobj_in:
    if not line.strip():
        continue  # skip empty lines
    # Fields are tab-separated; drop empty fields caused by leading or repeated tabs.
    items = [entry.strip() for entry in line.split('\t') if entry.strip()]
    if len(items) == 2:
        # name line: the name, then the first team
        name = items[0]
        team = items[1]
    else:
        # continuation line: only a team, belonging to the last name seen
        team = items[0]
    teams[team].append(name)

for team_name in sorted(teams):
    fobj_out.write('{}, {}\n'.format(team_name, ', '.join(teams[team_name])))

The resulting file content looks like this:

"Décore ta vie" (2003), Boilard, Naggy 
"Mouki" (2010), Boileau, Sonia 
A chacun sa place (2011), Boinem, Victor Emmanuel 
Absence (2009) (V), Boillat, Patricia 
C.A.L.L.E. (2005), Boillat, Patricia 
Comment devenir un trou de cul et enfin plaire aux femmes (2004), Boire, Roger 
Couleur de peau: Miel (2012), Boileau, Laurent 
Hergé:Les aventures de Tintin (2004), Boillot, Olivier 
Isola, là dove si parla la lingua di Bacco (2011) (co-director), Boillat, Patricia 
L'île (2011), Boillot, Olivier 
La beauté fatale et féroce... (1996), Boire, Roger 
Last Call Indian (2010), Boileau, Sonia 
Le Temple Oublié (2005), Boillot, Olivier 
Le pied tendre (1988), Boire, Roger 
Legit (2006), Boinski, James W. 
Nubes (2010), Boira, Francisco 
Questions nationales (2009), Boire, Roger 
Reconciling Rwanda (2007), Boiko, Patricia 
Soviet Gymnasts (1955), Boikov, Vladimir 
The Corporal's Diary (2008) (V) (head director), Boiko, Patricia 
Un gars ben chanceux (1977), Boire, Roger 
+0

Thanks, but will it handle multiple names, i.e. double first names/surnames, separate team names... (see the comments above)? – user2447387

+0

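It handles the sample input. How am I supposed to know what your actual input looks like? Are compound names also separated by spaces, commas, or something else? How many parts can a first name, surname, or team have? The code needs to be adapted to that. –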

+0

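Yes, I know, and I'm sorry. I edited my question and posted a link to a sample of my database to clarify things (https://www.dropbox.com/s/sl3tu7m77gei987/sample.txt). Actually, there may be multiple first names and surnames. Also, the team field is quite long, since it can also carry other kinds of information (when available), such as a time period, and quotation marks. Ideally, I should search for my keyword within the "team" string (which contains the description above as well as other information), and the code should return the names of the people associated with it. – user2447387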
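
The keyword search described in this last comment could be sketched roughly as follows. This assumes the tab-separated layout that the edited answer's code parses. The `find_people` function name and the case-insensitive matching are choices made for illustration, and the sample lines are arranged from entries in the output listed above:

```python
import io


def find_people(fobj_in, keyword):
    """Return names whose team field contains keyword (case-insensitive)."""
    keyword = keyword.lower()
    matches = []
    name = None
    for line in fobj_in:
        # Fields are assumed to be tab-separated; empty fields are dropped.
        items = [entry.strip() for entry in line.split('\t') if entry.strip()]
        if not items:
            continue  # skip empty lines
        if len(items) == 2:
            name, team = items  # name line: name, then first team
        else:
            team = items[0]     # continuation line: team only
        if name is not None and keyword in team.lower() and name not in matches:
            matches.append(name)
    return matches


sample_text = (
    'Boire, Roger\tLe pied tendre (1988)\n'
    '\t\t\tUn gars ben chanceux (1977)\n'
    'Boiko, Patricia\tReconciling Rwanda (2007)\n'
)
print(find_people(io.StringIO(sample_text), 'chanceux'))
```

Because the match is a plain substring test, a keyword such as "2007" would also match on the year in parentheses; a stricter match would need to know how the team field is structured.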