2013-08-02 44 views

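Comparing website migration results (while running both sites in parallel)

I'm migrating a client's homegrown website to Drupal 7. The process is taking a while, with design decisions, a few new requirements, and so on; I'm sure you've all been there. I have a script that (a) gets the list of page paths from the old database, (b) fetches each page from both the Drupal site and the old site, (c) runs an XPath query on each page using xidel to get the contents of div#maincontent and div#main, and (d) saves that data in new.txt and old.txt files, while preserving a folder structure similar to the site's for reference.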

gather_data.sh

#!/bin/bash 
# get URLS 
urls=$(ssh [email protected]_ser "~/data_urls.sh" | grep -E '^/' | sort -u)

# clear out current working folder 
rm -rf ./working 

# loop through paths 
for i in $urls 
do 

    # screen status update, set storage area with url_path in folder path, make folder
    echo "$i"
    storage_area="./working/$i/"
    mkdir -p "$storage_area"

    # strip trailing slash
    i=${i%/}

    # pull each page and run the xpath query
    xidel "http://old_server$i" -e '//div[@id="maincontent"]//p' > "$storage_area/old.txt"
    xidel "http://new_server$i" -e '//div[@id="content"]//p' > "$storage_area/new.txt"

    # run a compare and output data into cmp.cmp
    cmp "$storage_area/old.txt" "$storage_area/new.txt" > "$storage_area/cmp.cmp"

done 

A secondary script loops through the results of the cmp.cmp files.

run_diff.sh

echo "------------------------------------------------------- " 
echo "The following may have differences in content based on wdiff analysis" 

# only visit folders that contain a saved page (intermediate folders have no old.txt)
find ./working -type f -name old.txt | sort | while IFS= read -r old_file; do

    dir=$(dirname "$old_file")

    # turn the folder path back into a URL path
    better_url_name=${dir#./working}

    echo -e "\e[1;37m"
    echo -----------------------------------------------------------------------
    echo "http://old_server$better_url_name"
    echo "http://new_server$better_url_name"
    echo -----------------------------------------------------------------------
    echo -e "\e[00m"
    wdiff -3s "$dir/old.txt" "$dir/new.txt" | colordiff
done

The above produces results like the following.

----------------------------------------------------------------------- 
http://old_server/career_services/career_fair.php 
http://new_server/career_services/career_fair.php 
----------------------------------------------------------------------- 


====================================================================== 
[-9. 
School-] {+9.School+} 
====================================================================== 
[-Imagination 
April-] {+ImaginationApril+} 
====================================================================== 
[-contract. 
April-] {+contract.April+} 
====================================================================== 

{+ +} 
====================================================================== 
./working/epics/career_services/career_fair.php/old.txt: 1001 words 995 99% common 0 0% deleted 6 1% changed 
./working/epics/career_services/career_fair.php/new.txt: 999 words 995 100% common 1 0% inserted 3 0% changed 

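My questions:

  • How do I ignore these false positives?
  • How do I filter out whitespace and line-return characters?
  • Is this the right approach? Should I abandon it and look for a method that gives better results?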

Answer

With the diff command, you can use the following options:

    -b --ignore-space-change
        Ignore changes in the amount of white space.

    -w --ignore-all-space
        Ignore all white space.

    -B --ignore-blank-lines
        Ignore changes whose lines are all blank.

    --strip-trailing-cr
        Strip trailing carriage return on input.
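
As a rough sketch of how these options might be applied to a pair of files like old.txt and new.txt: the sample file contents and the variable names below are made up for illustration, and GNU diff is assumed. Note that diff compares line by line, so these options probably will not hide text that was joined across a line break (the "9." / "School" versus "9.School" case in the question). For that, one possible workaround, not part of the options above, is to delete all whitespace with tr before comparing.

```shell
#!/bin/bash
# Hypothetical sample data standing in for a page's old.txt and new.txt.
tmp=$(mktemp -d)
printf 'Career  Fair\r\n\r\nApril 9\r\n' > "$tmp/old.txt"
printf 'Career Fair\nApril 9\n' > "$tmp/new.txt"

# Line-based check: -w ignores whitespace inside a line, -B ignores
# blank-line changes, --strip-trailing-cr drops trailing carriage returns.
if diff -w -B --strip-trailing-cr "$tmp/old.txt" "$tmp/new.txt" > /dev/null; then
    diff_result="same"
else
    diff_result="different"
fi

# Assumed workaround for text joined across a line break: delete every
# whitespace character (spaces, tabs, CR, LF) and compare what is left.
printf '9.\nSchool\n' > "$tmp/old_wrapped.txt"
printf '9.School\n' > "$tmp/new_wrapped.txt"
old_stripped=$(tr -d '[:space:]' < "$tmp/old_wrapped.txt")
new_stripped=$(tr -d '[:space:]' < "$tmp/new_wrapped.txt")
if [ "$old_stripped" = "$new_stripped" ]; then
    stripped_result="same"
else
    stripped_result="different"
fi

echo "diff -w -B --strip-trailing-cr: $diff_result"
echo "all whitespace removed: $stripped_result"
rm -rf "$tmp"
```

The trade-off with the tr approach is that it also hides real spacing changes, such as a space accidentally dropped between two words, so it is probably best used only to decide which pages deserve a closer look with wdiff.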