Regex to extract text from a Dataframe and insert it into new columns

-1

I have been hunting through all the regex posts, but I can't seem to get anything to work for me.

Example (some words redacted or changed):

df$text: "CommonWord #79 - EVENT type for 1200 seconds [Objects] xxx.xxx.xxx.xxx/## xxx.xxx.xxx.xxx/## Port: ##"

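1. I want to extract the number after the # and place it in a new column. I tried: df$number <- sub("\\#([0-9]{2,4}).*", "\\1", df$text)
   The result is "CommonWord 79". I can't seem to find the right regex to remove the first word.
2. With the next regex I want to pull "EVENT type" into another column. Both "EVENT" and "type" can change, so I need to pull the text after the " - " and before the "for".
3. The last two regexes I need are for the IP address with its subnet mask, and then the port number (digits only). I need all of these in new columns.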

Sorry for the long-winded question. I've been banging my head on this one.

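Solved part 1, the event type and the port:

df$number <- sub(".*\\#(\\d{1,4}).*", "\\1", df$text)
df$attackType <- sub(".*\\-.(\\w+\\s\\w+).*","\\1", df$text)
df$port <- as.numeric(sub(".*\\:(\\d{1, })?","\\1", df$text))

I'm still having trouble with the IP address: only one digit of the first group of numbers comes back. For example, the actual IP is 127.0.0.1/28, but I get 7.0.0.1/28 returned. Once I've figured out how to get the IP address/mask, I need to work out how to find multiple results in the text. The regex is lengthy; I expect to optimise it later.

df$IPs <- sub(".*(+\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\/\\d{2, }).*","\\1", df$text)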
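The truncated 7.0.0.1/28 is consistent with the leading ".*" being greedy: it consumes as many characters as it can, leaving the capture group only the minimum it needs (a single digit before the first dot). The sketch below reproduces that on made-up sample strings (the addresses, masks and the names `one_ip`, `two_ips` and `ip_pat` are assumptions for illustration) and shows one possible way around it: dropping the ".*" wrapper and pulling the matches out with regmatches() and gregexpr(), which also returns every address in the string rather than just one.

```r
# Hypothetical sample strings; the real data was redacted in the question.
one_ip  <- "CommonWord #79 - EVENT type for 1200 seconds [Objects] 127.0.0.1/28 Port: 80"
two_ips <- "CommonWord #79 - EVENT type for 1200 seconds [Objects] 127.0.0.1/28 10.0.0.1/24 Port: 80"

# Four groups of 1-3 digits separated by dots, then "/" and a 1-2 digit mask.
ip_pat <- "\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}/\\d{1,2}"

# Wrapping the pattern in ".*( ... ).*" lets the leading ".*" eat the "12" of "127".
greedy <- sub(paste0(".*(", ip_pat, ").*"), "\\1", one_ip)
print(greedy)   # "7.0.0.1/28"

# Matching the bare pattern finds each address in full, and all of them.
all_ips <- regmatches(two_ips, gregexpr(ip_pat, two_ips))[[1]]
print(all_ips)  # "127.0.0.1/28" "10.0.0.1/24"
```

On a whole column, regmatches(df$text, gregexpr(ip_pat, df$text)) returns a list with one character vector per row, which can then be collapsed or split into columns as needed.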

Source

2016-11-29 user3192046

-1

You just had to add ".*" before the #, which matches any number of characters:

sub(".*\\#([0-9]{2,4}).*", "\\1", x)

To create a new column:

df$new_col <- as.numeric(sub(".*\\#([0-9]{2,4}).*", "\\1", df$text))
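To see why the leading ".*" matters, here is a small comparison on a made-up string (the sample text and the names `x`, `partial` and `full` are assumptions for illustration). sub() only replaces the part of the string that the pattern matches, so anything the pattern does not cover is left in place.

```r
# Hypothetical sample string modelled on the redacted example in the question.
x <- "CommonWord #79 - EVENT type for 1200 seconds [Objects] 192.168.0.24/24 10.0.0.1/28 Port: 80"

# Without a leading ".*" the match starts at "#", so "CommonWord " survives.
partial <- sub("#([0-9]{2,4}).*", "\\1", x)
print(partial)  # "CommonWord 79"

# With ".*" on both sides the pattern covers the whole string,
# so only the captured digits remain.
full <- sub(".*#([0-9]{2,4}).*", "\\1", x)
print(full)              # "79"
print(as.numeric(full))  # 79
```

One thing to keep in mind with this approach: when the pattern does not match at all, sub() returns the input unchanged, and as.numeric() then yields NA with a warning.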

Source

2016-11-29 04:55:22

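Thanks for the as.numeric! I got sub working as you suggested, where the regex is: ".*\\#(\\d{1,4}).*". Now I need to figure out the other requirements. Thanks again – user3192046

Are those x's supposed to represent digits? Some sample values would help, especially since IP addresses don't exactly follow that pattern.

In any case, I've added a few things to search for. I like to use the rex package together with stringr::str_view_all to test regex patterns. The matches are highlighted in the Viewer pane.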

text <- "CommonWord #79 - EVENT type for 1200 seconds [Objects] 192.168.0.24/## xxx.xxx.xxx.xxx/## Port: 80" 
library(stringr) 
library(rex) 

# show matches where at least one digit follows # 
str_view_all(text, rex(at_least(digit, 1) %if_prev_is% "#")) 

# show matches where characters are after - and before 'for' 
str_view_all(text, rex((prints %if_prev_is% "-") %if_next_is% "for")) 

# show matches where each x group in your IP text is 1-3 digits and the address ends with /
str_view_all(text, rex(between(digit, 1, 3), dot, 
         between(digit, 1, 3), dot, 
         between(digit, 1, 3), dot, 
         between(digit, 1, 3), "/")) 

# show matches where digits follow 'Port:' 
str_view_all(text, rex(digits %if_prev_is% "Port: "))
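str_view_all only highlights matches. To actually fill new columns, one option is base R's regexec() and regmatches(), which return capture groups directly. The following is a minimal sketch under stated assumptions: the two sample rows, the helper `first_group`, the column name `eventType` and the exact patterns are all made up for illustration, and the event-type pattern assumes " for " appears only once in each string.

```r
# Hypothetical sample data; the real values were redacted in the question.
df <- data.frame(
  text = c(
    "CommonWord #79 - EVENT type for 1200 seconds [Objects] 192.168.0.24/24 10.0.0.1/28 Port: 80",
    "CommonWord #1234 - OTHER kind for 30 seconds [Objects] 127.0.0.1/28 Port: 443"
  ),
  stringsAsFactors = FALSE
)

# Illustrative helper: first capture group of `pattern` per element, NA when there is no match.
first_group <- function(pattern, x) {
  m <- regmatches(x, regexec(pattern, x))
  vapply(m, function(g) if (length(g) >= 2) g[2] else NA_character_, character(1))
}

df$number    <- as.numeric(first_group("#(\\d+)", df$text))        # digits after "#"
df$eventType <- first_group(" - (.+) for ", df$text)               # text between " - " and " for "
df$port      <- as.numeric(first_group("Port: (\\d+)", df$text))   # digits after "Port: "

# All address/mask pairs per row, collapsed into one comma-separated string.
ip_pat <- "\\d{1,3}(\\.\\d{1,3}){3}/\\d{1,2}"
ips    <- regmatches(df$text, gregexpr(ip_pat, df$text))
df$IPs <- vapply(ips, paste, character(1), collapse = ", ")

print(df[, c("number", "eventType", "port", "IPs")])
```

Because no pattern is wrapped in ".*", rows that fail to match come back as NA (or an empty string for IPs) instead of the original text.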

Source

2016-11-29 04:55:36

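The x's do represent digits; I redacted them for privacy reasons. Assume each one is an IP address, meaning four groups of numbers ranging from 1 to 254, for example 127.0.0.1, and so on – user3192046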
