2016-12-20 84 views
-3

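Parsing a text file

I have 200 names listed in a text file. Every character in the names is lowercase, each name is 6 or 7 characters long, and the names are grouped under several headers. Some groups also have subheadings. I tried splitting on spaces, but that splits the file at the whitespace between every name, and some \n characters get printed too. I have two different ideas and am stuck on both.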

Header 
subheading 
namenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamename 


Heading 


Header 
subheading 
namenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamename 

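In the end I am trying to ignore the headers and subheaders (which vary between all uppercase, all lowercase, and a mix of both) and print only the names. I started by trying to append everything to a list, but because I can't parse the text correctly, I end up with errors or with every letter of each string printed separately.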

path_to_file = 'pathgoeshere' 

check_list = [] 

# Iterating over .read() yields one character at a time, not one word
for word in open(path_to_file).read():
    username = str(word)
    check_list.append(username)
    print(username)

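List = open(path_to_file).readlines()
print(List)


for x in List:
    user_name = str(x)
    # Note: .lower() returns a string, which is truthy for any non-empty
    # line, so this condition does not filter anything out
    if user_name.lower():
        print(user_name)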

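In my actual code the formatting is correct, but this is the gist of what I'm going for.

Ultimately, I want to parse out and count the names without counting the extraneous text I don't need.

I'm not sure where to go from here.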

+0

http://stackoverflow.com/help/someone-answers –

Answers

0

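I don't fully understand what you're trying to do. But this should get you started (it ignores headers and subheadings and prints only the names):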

with open('pathgoeshere') as infile:
    for line in infile:
        line = line.strip()
        # skip any line containing an uppercase character (headers, subheadings)
        if any(char.isupper() for char in line):
            continue
        print(line)

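Because your names (the things you care about) are all lowercase, you should be able to get away with simply testing whether a line contains any uppercase characters.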
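The uppercase test still lets blank lines and all-lowercase headers through. As a rough sketch of a stricter filter (not the answerer's code), the snippet below keeps only tokens made of exactly 6 or 7 lowercase letters and counts them. It assumes the names on a line are separated by whitespace and contain only the letters a-z; `extract_names`, `NAME_RE`, and the sample data are hypothetical names invented for illustration.

```python
import re

# Assumption: a name is a standalone run of 6 or 7 lowercase ASCII letters.
NAME_RE = re.compile(r'[a-z]{6,7}')

def extract_names(lines):
    """Return every whitespace-separated token that looks like a name."""
    names = []
    for line in lines:
        # split() with no argument also discards '\n' and skips blank lines
        for token in line.split():
            if NAME_RE.fullmatch(token):
                names.append(token)
    return names

# Made-up sample data in the shape described in the question
sample = [
    "Header\n",
    "subheading\n",
    "smithw jonesab brownc\n",
    "\n",
    "HEADING\n",
    "leeabcd wilsonx\n",
]

names = extract_names(sample)
print(names)
print(len(names))
```

With a real file, `extract_names(open(path_to_file))` should work the same way, since a file object yields lines. One caveat: a header word that happens to be 6 or 7 lowercase letters (for example `header`) would still be counted, so a filter like this may need an explicit exclusion list.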

+0

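Sorry if my question sounded confusing. Basically I have a lot of text (headers, subheaders, usernames). The characters are all uppercase, all lowercase, or a mix of both. The usernames are all lowercase and are 6 or 7 characters long. I'm trying to parse the usernames out of the rest of the text without picking up any of the mixed-case subheader or header information. – Smithw1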

+0

@Smithw1: Have you tried running my code? It should do what you're asking – inspectorG4dget

+0

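Thank you for your help. Initially, when I ran the code, it returned some empty values, but after some tweaking we managed to get it working. Your code was a great starting point. It did include extraneous values that I had to watch out for, so I had to modify the any(char.isupper()) check. Thanks again for your help. – Smithw1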

0

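Here is my thinking:

  • Names seem to appear only after a Header and a subheading
  • Split by \n and take the last line of every group of lines that has at least 3 lines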

txt = """Header 
subheading 
namenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamename 


Heading 


Header 
subheading 
namenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamenamename 
""" 

import pandas as pd

# split text into lines
s = pd.Series(txt.split('\n'))

# regex to find lines with nothing but whitespace 
blanks = s.str.match(r'^\s*$') 

# assign groups of lines starting with the first non blank 
# filter groups with `~blanks` to focus on just non blank lines 
non_blank_groups = (~blanks & blanks.shift().fillna(True)).cumsum().loc[~blanks] 

# get value counts of the groups to get rid of groups of lines 
# with only one line like `Heading` 
value_counts = non_blank_groups.value_counts() 

# filter `non_blank_groups` with only the groups with more than 2 lines 
groups = non_blank_groups[non_blank_groups.isin(value_counts.index[value_counts.ge(3)])] 

# finally, groupby and grab last one 
s.groupby(groups).last() 

1.0 namenamenamenamenamenamenamenamenamenamenamena... 
3.0 namenamenamenamenamenamenamenamenamenamenamena... 
dtype: object
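
For comparison, the same idea (group consecutive non-blank lines, keep the last line of each group with at least 3 lines) can be sketched with only the standard library. This is an alternative illustration rather than the pandas code above; `last_lines_of_big_groups`, `min_lines`, and the sample text are hypothetical names and data chosen for the example.

```python
from itertools import groupby

def last_lines_of_big_groups(text, min_lines=3):
    """Return the last line of each run of non-blank lines that
    contains at least `min_lines` lines."""
    results = []
    lines = text.split('\n')
    # groupby clusters consecutive lines by whether they are blank
    for is_blank, group in groupby(lines, key=lambda l: not l.strip()):
        if is_blank:
            continue
        block = [l.strip() for l in group]
        # a lone line such as `Heading` forms a group of 1 and is dropped
        if len(block) >= min_lines:
            results.append(block[-1])
    return results

# Made-up sample in the same layout as the question
sample = (
    "Header\nsubheading\nsmithw jonesab\n"
    "\n\nHeading\n\n\n"
    "Header\nsubheading\nbrownc wilsonx\n"
)

name_lines = last_lines_of_big_groups(sample)
print(name_lines)
```

Like the pandas version, this relies on every block of names being preceded by both a header and a subheading; a block with a header but no subheading would have only 2 lines and be skipped.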