Extracting @mentions from tweets using findall in Python (giving incorrect results)

I have a CSV file that looks like this:

text 
RT @CritCareMed: New Article: Male-Predominant Plasma Transfusion Strategy for Preventing Transfusion-Related Acute Lung Injury... htp://… 
#CRISPR Inversion of CTCF Sites Alters Genome Topology &amp; Enhancer/Promoter Function in @CellCellPress htp://.co/HrjDwbm7NN 
RT @gvwilson: Where's the theory for software engineering? Behind a paywall, that's where. htp://.co/1t3TymiF3M #semat #fail 
RT @sciencemagazine: What’s killing off the sea stars? htp://.co/J19FnigwM9 #ecology 
RT @MHendr1cks: Eve Marder describes a horror that is familiar to worm connectome gazers. htp://.co/AEqc7NOWoR via @nucAmbiguous htp://…

I want to extract all the mentions (starting with '@') from the tweet text. So far I have done this:

import pandas as pd 
import re 

mydata = pd.read_csv("C:/Users/file.csv") 
X = mydata.ix[:,:] 
X=X.iloc[:,:1] #I have multiple columns so I'm selecting the first column only that is 'text' 

for i in range(X.shape[0]): 
    result = re.findall("(^|[^@\w])@(\w{1,25})", str(X.iloc[:i,:])) 

print(result);

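There are two problems here. First, with str(X.iloc[:1,:]) it gives me ['CritCareMed'], which is not right, because it should give me ['CellCellPress']. With str(X.iloc[:2,:]) it again gives me ['CritCareMed'], which is of course not right either. The final result I get is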

[(' ', 'CritCareMed'), ('', 'gvwilson'), ('', 'sciencemagazine')]

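It does not include the mention in the second row or the two mentions in the last row at all. I want it to look something like this:

How can I achieve these results? This is just sample data; my original data has a lot of tweets, so is this approach OK?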

2017-10-08 melissa

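You can use the str.findall method to avoid the for loop, and use a negative lookbehind in place of (^|[^@\w]), which forms another capture group that you don't need in your regex: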

df['mention'] = df.text.str.findall(r'(?<![@\w])@(\w{1,25})').apply(','.join)
df
#                                                text                  mention
#0  RT @CritCareMed: New Article: Male-Predominant...              CritCareMed
#1  #CRISPR Inversion of CTCF Sites Alters Genome ...            CellCellPress
#2  RT @gvwilson: Where's the theory for software ...                 gvwilson
#3  RT @sciencemagazine: What’s killing off the se...          sciencemagazine
#4  RT @MHendr1cks: Eve Marder describes a horror ...  MHendr1cks,nucAmbiguous
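To illustrate the point about the extra capture group, here is a minimal sketch using only the re module. The tweet string is shortened from the last row of the question's sample; the variable names are made up for this example.

```python
import re

# Shortened from the last sample row in the question.
tweet = "RT @MHendr1cks: Eve Marder describes a horror. via @nucAmbiguous"

# Two capture groups: findall returns (prefix, name) tuples.
with_group = re.findall(r"(^|[^@\w])@(\w{1,25})", tweet)
print(with_group)       # [(' ', 'MHendr1cks'), (' ', 'nucAmbiguous')]

# The lookbehind checks the preceding character without capturing it,
# so findall returns plain strings.
with_lookbehind = re.findall(r"(?<![@\w])@(\w{1,25})", tweet)
print(with_lookbehind)  # ['MHendr1cks', 'nucAmbiguous']
```

With a single group, the result can be passed straight to ','.join, which is what the pandas one-liner relies on.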

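Also, X.iloc[:i,:] returns a DataFrame, so str(X.iloc[:i,:]) gives you the string representation of a DataFrame, which is very different from the element in the cell. To extract the actual string from the text column, you can use X.text.iloc[0], or, better, loop over the column with items (iteritems in older pandas, removed in 2.0):

import re

for index, s in df.text.items():
    result = re.findall(r"(?<![@\w])@(\w{1,25})", s)
    print(','.join(result))

#CritCareMed
#CellCellPress
#gvwilson
#sciencemagazine
#MHendr1cks,nucAmbiguous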
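As a rough sketch of why the string representation loses mentions: the slice [:4] stops before row 4, and the printed form shortens long cells. The rows below come from the question with the trailing URLs dropped for brevity; the column-width option is set explicitly to pandas' usual default of 50, which is an assumption about the asker's setup.

```python
import re
import pandas as pd

# Rows from the question, with the trailing URLs dropped for brevity.
tweets = [
    "RT @CritCareMed: New Article: Male-Predominant Plasma Transfusion "
    "Strategy for Preventing Transfusion-Related Acute Lung Injury...",
    "#CRISPR Inversion of CTCF Sites Alters Genome Topology &amp; "
    "Enhancer/Promoter Function in @CellCellPress",
    "RT @gvwilson: Where's the theory for software engineering? "
    "Behind a paywall, that's where. #semat #fail",
    "RT @sciencemagazine: What’s killing off the sea stars? #ecology",
    "RT @MHendr1cks: Eve Marder describes a horror that is familiar "
    "to worm connectome gazers. via @nucAmbiguous",
]
df = pd.DataFrame({"text": tweets})
pattern = re.compile(r"(?<![@\w])@(\w{1,25})")

# What the question's loop searches on its last pass (i = 4): the printed
# form of rows 0..3. Row 4 is outside the slice, and long cells are cut off.
with pd.option_context("display.max_colwidth", 50):
    printed = str(df.iloc[:4, :])
from_repr = pattern.findall(printed)

# Searching each cell's own string finds every mention.
from_cells = [pattern.findall(s) for s in df["text"]]

print(from_repr)   # misses CellCellPress (cut off) and everything in row 4
print(from_cells)
```

This matches the three names the asker ended up with, which suggests the regex itself was not the main problem.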

2017-10-08 17:18:54 Psidom

How do I select the first column from df, if iloc gives a DataFrame? My file has multiple columns and I have to process only the first one, i.e. 'text' – melissa

To select the first column, you can use the column name, i.e. df.text or df['text'], or use iloc: df.iloc[:, 0]. – Psidom
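A small sketch of the difference between the two iloc forms mentioned in these comments. The two-column frame is hypothetical and only stands in for the asker's CSV.

```python
import pandas as pd

# Hypothetical two-column frame standing in for the CSV in the question.
df = pd.DataFrame({
    "text": ["RT @gvwilson: Where's the theory?", "no mention here"],
    "lang": ["en", "en"],
})

as_series = df.iloc[:, 0]   # integer position: a Series of strings
as_frame = df.iloc[:, :1]   # slice: still a one-column DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame

# The .str accessor is available on the Series, so findall runs per cell.
mentions = as_series.str.findall(r"(?<![@\w])@(\w{1,25})").tolist()
print(mentions)                  # [['gvwilson'], []]
```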

Although you already have an answer, you could even try to optimize the whole import process, like this:

import re, pandas as pd 

rx = re.compile(r'@([^:\s]+)') 

with open("test.txt") as fp: 
    dft = ([line, ",".join(rx.findall(line))] for line in fp.readlines()) 

    df = pd.DataFrame(dft, columns = ['text', 'mention']) 
    print(df)

Which yields:

                                                text                  mention
0  RT @CritCareMed: New Article: Male-Predominant...              CritCareMed
1  #CRISPR Inversion of CTCF Sites Alters Genome ...            CellCellPress
2  RT @gvwilson: Where's the theory for software ...                 gvwilson
3  RT @sciencemagazine: What’s killing off the se...          sciencemagazine
4  RT @MHendr1cks: Eve Marder describes a horror ...  MHendr1cks,nucAmbiguous

This might be a bit faster, since you don't need to change the df once it has been constructed.
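One thing that may be worth checking before choosing between the two patterns on this page: @([^:\s]+) is more permissive than (?<![@\w])@(\w{1,25}). A minimal sketch of the difference, using a made-up line that is not part of the question's data:

```python
import re

loose = re.compile(r"@([^:\s]+)")              # pattern from this answer
strict = re.compile(r"(?<![@\w])@(\w{1,25})")  # pattern from the earlier answer

# Made-up line, not from the question's data.
line = "Thanks @alice, mail bob@example.com or ask @carol."

# The loose pattern keeps trailing punctuation and also matches
# the domain part of an email address.
loose_hits = loose.findall(line)
print(loose_hits)    # ['alice,', 'example.com', 'carol.']

# The strict pattern stops at the first non-word character and
# skips an '@' that directly follows a word character.
strict_hits = strict.findall(line)
print(strict_hits)   # ['alice', 'carol']
```

On the five sample rows both patterns give the same mentions, so this only matters if the real data contains such cases.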

2017-10-10 05:54:27 Jan

Thank you very much, I'll give it a try :) – melissa
