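Search for strings from file1 in file2 and replace them

I am new to shell scripting and need guidance on a typical requirement. I have two files (1. a master file and 2. a pattern file). The master file contains many fields separated by |, and only the 10th and 15th fields need to be updated based on the pattern file.

Master file: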

H|20170101 

123|field2|field3|...|field10|field11...|field15|....|field150 

... 

... 

T|1000000

Pattern file:

Europe|EU 

Australia|AU 

China|CN

For example,

123|1|2|3|...|9|nice weather in europe today|11|.....

The line above needs to be changed to

123|1|2|3|...|9|nice weather in EU today|11|.....

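I started with a simple sed command that takes its values from the pattern file and replaces them in the master file, but it is incomplete: I don't know how to handle a huge master file, and it also replaces text outside the specific fields.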

while read line
do
value1=$(echo $line | awk -F"|" '{print $1}')
value2=$(echo $line | awk -F"|" '{print $2}')
sed -i 's/ '${value1}' /'${value2}'/g' master.txt
done < pattern.txt

The script above is very slow even on a 10 MB file, and my master file is considerably larger (100 MB).

Please help.

Source

2017-04-11 Vensmira

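This is a shot in the dark, since your sample data doesn't even have 10 fields and I don't have time to create a test set. Hopefully it works, using awk. Next time, please take the trouble to create a working data set (enough fields, Europe =/= europe, etc.). As I said, untested: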

$ awk ' 
BEGIN { FS=OFS="|" }      # delimiters 
NR==FNR { a[$1]=$2; next }    # read patterns and hash them 
{ 
    for(i=10;i<=NF;i+=5)     # iterate every fifth field 
     if(i%10==0||i%15==0){    # pick only mod 10 and mod 15 
      n=split($i,b," ")    # split to b the chosen ones 
      for(j=1;j<=n;j++)    # iterate thru the chosen ones 
       if(b[j] in a)    # if word is found among patterns 
        sub(b[j],a[b[j]],$i) # switch the matching pattern 
     } 
}1' pattern master
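
The loop above visits every field whose index is a multiple of 10 or 15, not only fields 10 and 15, and it matches case-sensitively. A minimal sketch of a narrower variant follows, assuming that words inside a field are separated by single spaces and that matching should ignore case (the pattern file has Europe, the sample data has europe). The names `replace_fields`, `fix` and `map` are made up for illustration; this is not benchmarked on a 100 MB file.

```shell
# Hypothetical helper: replace whole words in fields 10 and 15 only.
# $1 = pattern file (word|replacement), $2 = master file; result goes to stdout.
replace_fields() {
    awk '
    BEGIN { FS = OFS = "|" }
    # First file: load the patterns, keyed in lower case
    NR == FNR { map[tolower($1)] = $2; next }
    # Second file: touch only fields 10 and 15, then print the line
    { fix(10); fix(15); print }

    function fix(i,    n, j, k, changed, words, out) {
        if (i > NF) return              # short lines such as the H and T records
        n = split($i, words, " ")
        changed = 0
        for (j = 1; j <= n; j++) {
            k = tolower(words[j])
            if (k in map) { words[j] = map[k]; changed = 1 }
        }
        if (!changed) return            # leave the field exactly as it was
        out = words[1]
        for (j = 2; j <= n; j++) out = out " " words[j]
        $i = out
    }
    ' "$1" "$2"
}
```

It could be called as `replace_fields pattern.txt master.txt > master.new`. Because it compares whole space-separated words, `chinas` or `european` would not be touched; a field that does get rewritten is rejoined with single spaces.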

Source

2017-04-11 20:54:56

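Are we sure the set of patterns won't be too large to fit in an associative array? – cdarke

@cdarke No, but a 100 MB master. Pfft. We're bold. –

@cdarke I hash the pattern file and iterate through the master. I think that's fine. –

The script is probably slow because of the number of child processes you are creating: every pattern line starts two awk processes and one sed. In addition, you are reading the larger file (master.txt) many more times than the smaller one, once per pattern.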

Note that the -i option to sed is non-standard.
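
Where `-i` is not available, the usual workaround is to write to a temporary file and then rename it over the original. A minimal sketch, assuming `mktemp` is installed (it is widespread but not part of POSIX); the function name `edit_in_place` is made up for illustration:

```shell
# edit_in_place SED_SCRIPT FILE
# Hypothetical helper: stand-in for the non-standard `sed -i`.
edit_in_place() {
    script=$1
    file=$2
    # Create the temporary file next to the target so that mv is a rename
    tmp=$(mktemp "${file}.XXXXXX") || return 1
    if sed "$script" "$file" > "$tmp"; then
        mv "$tmp" "$file"
    else
        rm -f "$tmp"        # keep the original untouched if sed failed
        return 1
    fi
}
```

For example, `edit_in_place 's/europe/EU/g' master.txt`. One caveat: the result carries the permissions of the temporary file rather than those of the original.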

You can get rid of the calls to awk and sed by doing the work in bash:

# Read patterns into an associative array
# Requires Bash 4 or later
declare -A patterns

while IFS='|' read -r key value
do
    patterns[$key]="$value"
done < pattern.txt

# Set the option for case insensitive patterns
shopt -s nocasematch

# IFS= and -r keep leading whitespace and backslashes intact
while IFS= read -r line
do
    # Iterate through the patterns array
    for key in "${!patterns[@]}"
    do
        line="${line//$key/${patterns[$key]}}"
    done

    echo "$line"
done < master.txt

That does not restrict the edit to particular fields. This does:

# Read patterns into an associative array
# Requires Bash 4 or later
declare -A patterns

while IFS='|' read -r key value
do
    patterns[$key]="$value"
done < pattern.txt

# Set the option for case insensitive patterns
shopt -s nocasematch

# IFS is set here because localised setting for 'echo' does not work in bash
oldIFS="$IFS"
IFS='|'

# "line" is an array
while read -r -a line
do
    # Check there are at least 15 fields
    if ((${#line[@]} >= 15))
    then
        # Iterate through the patterns array
        for key in "${!patterns[@]}"
        do
            # We are only interested in the 10th and 15th fields
            # (index 9 and 14 since arrays index from zero)
            val="${line[9]}"
            line[9]="${val//$key/${patterns[$key]}}"
            val="${line[14]}"
            line[14]="${val//$key/${patterns[$key]}}"
        done
    fi
    echo "${line[*]}"
done < master.txt

IFS="$oldIFS"

Source

2017-04-11 18:42:51 cdarke

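This will replace europe with EU in any field, not just field 10 or field 15 as the OP asked. –

@GeorgeVasiliou: My script replaces the 'awk' and 'sed' calls. It behaves the same as the OP's script. But you are right, I did not isolate the required fields. That will take a few more lines; I'll look at it tomorrow. – cdarke

Either way, if the OP is happy with sed, your script will make him happy too. Perhaps the OP needs to clarify this. By the way, I think you could improve the performance of your solution by not opening and closing the pattern file for every line of the huge master file. You might consider reading the patterns once at the start, storing them in an array, and iterating over that array while reading the master file. –

Here is a sed alternative, based on the fact that sed can read its commands from a file.

First, I create a sed command file from the contents of the pattern file: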

$ cat file1 
europe|EU 
australia|AU 
china|CN 

$ while IFS="|" read -r a b;do 
> echo -e "s/((.[^|]*.){9})(.+)\<$a\>([^|]+)(.*)/\1\3$b\4\5/g"; 
> echo -e "s/((.[^|]*.){14})(.+)\<$a\>([^|]+)(.*)/\1\3$b\4\5/g"; 
> done<file1 >file11 

$ cat file11 
s/((.[^|]*.){9})(.+)\<europe\>([^|]+)(.*)/\1\3EU\4\5/g 
s/((.[^|]*.){14})(.+)\<europe\>([^|]+)(.*)/\1\3EU\4\5/g 
s/((.[^|]*.){9})(.+)\<australia\>([^|]+)(.*)/\1\3AU\4\5/g 
s/((.[^|]*.){14})(.+)\<australia\>([^|]+)(.*)/\1\3AU\4\5/g 
s/((.[^|]*.){9})(.+)\<china\>([^|]+)(.*)/\1\3CN\4\5/g 
s/((.[^|]*.){14})(.+)\<china\>([^|]+)(.*)/\1\3CN\4\5/g

Then the only thing left to do is to call sed and feed it the command file file11 created above.

$ cat file2 
1|2|3|4|5|europe|7|8|9|nice weather in europe today|11|12|europe|14|nice weather in europe today|16 
1|2|3|4|5|europe|7|8|9|nice european weather today|11|12|europe|14|nice european weather today|16 
1|2|3|4|5|europe|7|8|9|nice weather in china today|11|12|china|14|nice weather in china today|16 
1|2|3|4|5|europe|7|8|9|nice weather in china today|11|12|china|14|best of chinas today|16 
1|2|3|4|5|europe|7|8|9|nice weather in australia today|11|12|australia|14|nice weather in australia today|16

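I populated file2 with a variety of values for testing, to make sure that the sed regex replaces only the 10th and 15th fields, and only on a whole-word match (i.e. the word europe is replaced by EU, but the word european is not).

The results look quite good. I hope this sed solution will be fast enough for your large file.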

$ sed -E -f file11 file2 
1|2|3|4|5|europe|7|8|9|nice weather in EU today|11|12|europe|14|nice weather in EU today|16 
1|2|3|4|5|europe|7|8|9|nice european weather today|11|12|europe|14|nice european weather today|16 
1|2|3|4|5|europe|7|8|9|nice weather in CN today|11|12|china|14|nice weather in CN today|16 
1|2|3|4|5|europe|7|8|9|nice weather in CN today|11|12|china|14|best of chinas today|16 
1|2|3|4|5|europe|7|8|9|nice weather in AU today|11|12|australia|14|nice weather in AU today|16
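
The whole-word behaviour can be checked in isolation from the field logic. A minimal sketch, assuming GNU sed: the word anchors `\<` and `\>` and the `I` (ignore case) flag are GNU extensions, and the `I` flag is an addition here so that a capitalised pattern such as Europe also matches europe. The function name `replace_word` is made up for illustration.

```shell
# replace_word WORD REPLACEMENT
# Hypothetical helper: filter stdin, replacing WORD only where it stands
# as a whole word, ignoring case. Relies on GNU sed extensions.
replace_word() {
    sed -E "s/\\<$1\\>/$2/gI"
}

printf '%s\n' 'nice weather in europe today' \
              'nice european weather today' \
              'Europe is nice' |
    replace_word Europe EU
```

With GNU sed, the first and third lines come out with EU, while european in the second line is left alone.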

Source

2017-04-11 19:00:47

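@cdarke I deleted my earlier awk solution and built the sed solution after a lot of testing. What do you think of this sed ...? I enjoyed this one! –

I can see where you are going, and I must admit I would not have thought of that approach. I think we (you and I) got more out of this question than the OP did :-) – cdarke

@cdarke Thank you. Your approach is very good too. We certainly went many steps beyond the OP's question, but that is why we do this: it's fun! Besides, we each gain something from the other's answers! –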
