2017-01-23 44 views

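One parameter for multiple patterns - grep

I am trying to search PDF files from the terminal, supplying the search string on the command line. The search string can be a single word, multiple words (AND, OR), or an exact phrase. I want to keep just one parameter for all search queries. I will save the command below as a shell script and call it through an alias defined in .aliases for zsh or bash.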

This follows sjr's answer here: search multiple pdf files

I used sjr's answer like this:

find ${1} -name '*.pdf' -exec sh -c 'pdftotext "{}" - | 
     grep -E -m'${2}' --line-buffered --label="{}" '"${3}"' '${4}'' \; 

$1 takes the path

$2 limits the number of results

$3 is the context parameter (it accepts -A, -B, -C, individually or combined)

$4 takes the search string

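The problem I am facing is with the value of $4. As I said earlier, I want this parameter to carry my search string, which can be a phrase, a single word, or multiple words in an AND/OR relationship.

I cannot get the desired results. Until I followed Robin Green's comment I got no results at all for phrase searches, and even now the phrase results are not accurate.

Edit: sample text from a judgment: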

The original rule was that you could not claim for psychiatric injury in 
negligence. There was no liability for psychiatric injury unless there was also 
physical injury (Victorian Rly Commrs v Coultas [1888]). The courts were worried 
both about fraudulent claims and that if they allowed claims, the floodgates would 
open. 

The claimant was 15 metres away behind a tram and did not see the accident but 
later saw blood on the road. She suffered nervous shock and had a miscarriage. She 
sued for negligence. The court held that it was not reasonably foreseeable that 
someone so far away would suffer shock and no duty of care was owed. 

White v Chief Constable of South Yorkshire [1998] The claimants were police 
officers who all had some part in helping victims at Hillsborough and suffered 
psychiatric injury. The House of Lords held that rescuers did not have a special 
position and had to follow the normal rules for primary and secondary victims. 
They were not in physical danger and not therefore primary victims. Neither could 
they establish they had a close relationship with the injured so failed as 
secondary victims. It is necessary to define `nervous shock' which is the rather 
quaint term still sometimes used by lawyers for various kinds of 
psychiatric injury...rest of para 

word1 can be: shock, (nervous shock)

word2 can be: psychiatric

exact phrase: (nervous shock)

Commands

alias s='sh /path/shell/script.sh' 
export p='path/pdf/files' 

In the terminal:

s "$p" 10 -5 "word1/|word2"   #for OR search 
s "$p" 10 -5 "word1.*word2.*word3" #for AND search 
s "$p" 10 -5 ""exact phrase""  #for phrase search 

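Second test sample: an example PDF file, since the command runs on PDF files: Test-File. It is 4 pages (part of a 361-page file).

If we run the following command on it, as mentioned in the solution: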

s "$p" 10 -5 'doctrine of basic structure' > ~/desktop/BSD.txt && open ~/desktop/BSD.txt

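we get the relevant text and avoid going through the entire file. I thought this would be a cool way to read just what we want, rather than the traditional approach.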


Why the downvote? I would like to know, so that I can take more care when asking questions. – lawsome


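Single quotes prevent the quoted parameters from being expanded (assuming you are using bash or sh), which is not what you want. You should use double quotes to quote parameters in bash or sh. Or are you using some other shell? –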


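I did not downvote, and I too wish people would leave feedback when they do. That said, reducing a question to an [MCVE (Minimal, Complete, and Verifiable Example)](http://stackoverflow.com/help/mcve) is always worthwhile. General tips on asking questions can be found [here](http://stackoverflow.com/help/how-to-ask). – mklement0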

Answers


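You need to:

  • Pass a double-quoted command string to sh -c, so that the embedded shell variable references get expanded (embedded " instances then need to be escaped as \").

  • Quote the regex with printf %q for safe inclusion in the command string - note that this requires bash, ksh, or zsh as the shell.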

dir=$1 
numMatches=$2 
context=$3 
regexQuoted=$(printf %q "$4") 

find "${dir}" -type f -name '*.pdf' -exec sh -c "pdftotext \"{}\" - | 
    grep -E -m${numMatches} --with-filename --label=\"{}\" ${context} ${regexQuoted}" \; 
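The two points above can be tried in isolation, without find or any PDF. The following is only a rough sketch: the sample value of regex and the variable names single and double are made up for illustration, and printf %q is called through an explicit bash -c so that the snippet also behaves when the calling shell is a plain sh without %q support (this assumes bash is installed).

```shell
regex='exact phrase'   # illustrative search string, not exported

# Single-quoted command string: the child sh receives a literal $regex,
# which is unset in that child, so nothing is substituted.
single=$(sh -c 'printf "[%s]" "$regex"')

# printf %q (a bash/ksh/zsh feature) adds the escaping needed for the value
# to survive a second round of shell parsing.
regexQuoted=$(bash -c 'printf %q "$1"' _ "$regex")

# Double-quoted command string: the calling shell expands ${regexQuoted}
# before sh starts, and the escaping keeps the phrase a single argument.
double=$(sh -c "printf '[%s]' ${regexQuoted}")

printf 'single-quoted: %s\n' "$single"
printf 'quoted regex:  %s\n' "$regexQuoted"
printf 'double-quoted: %s\n' "$double"
```

The single-quoted variant yields empty brackets, which matches the symptom of phrase searches returning nothing, while the double-quoted variant passes the phrase through as one argument.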

The 3 invocation scenarios would then be:

s "$p" 10 -5 'word1|word2'   #for OR search 
s "$p" 10 -5 'word1.*word2.*word3' #for AND search 
s "$p" 10 -5 'exact phrase'   #for phrase search 

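Note that there is no need to escape |, nor to add an extra layer of double quotes around exact phrase.

Also note that I have replaced --line-buffered with --with-filename, since I assume that is what you meant (matching lines prefixed with the PDF file path).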
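The three pattern styles can be checked against plain text before involving pdftotext. This is a small sketch only: the three sample lines are adapted from the judgment text in the question, and the variable names are invented for the demonstration. It also shows a limitation worth knowing: word1.*word2 matches only when both words sit on the same line in that order, so an order-independent AND needs both orders joined by |.

```shell
sample='She suffered nervous shock and had a miscarriage.
The claimants suffered psychiatric injury.
The court held that no duty of care was owed.'

# OR search: lines containing either word.
or_hits=$(printf '%s\n' "$sample" | grep -E -c 'shock|psychiatric')

# AND search: both words on one line, in this order only.
and_hits=$(printf '%s\n' "$sample" | grep -E -c 'suffered.*shock')

# Same words in the opposite order match nothing (grep -c then exits 1).
reversed_hits=$(printf '%s\n' "$sample" | grep -E -c 'shock.*suffered' || true)

# Order-independent AND, still a single pattern argument.
either_order_hits=$(printf '%s\n' "$sample" |
    grep -E -c 'shock.*suffered|suffered.*shock')

# Phrase search: the phrase is simply the pattern.
phrase_hits=$(printf '%s\n' "$sample" | grep -E -c 'nervous shock')

printf 'OR=%s AND=%s reversed=%s either=%s phrase=%s\n' \
    "$or_hits" "$and_hits" "$reversed_hits" "$either_order_hits" "$phrase_hits"
```

Because grep works line by line, a phrase that pdftotext happens to split across two output lines will not be found by any of these patterns, which may be one reason phrase results look incomplete.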
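The effect of combining --with-filename with --label can be seen on standard input alone. A minimal sketch, assuming GNU grep (both are non-POSIX options); the file name cases.pdf and the variable name labelled are placeholders, and printf stands in for the output of pdftotext.

```shell
# Stand-in for `pdftotext file.pdf -`: any text arriving on stdin will do.
labelled=$(printf '%s\n' 'She suffered nervous shock.' |
    grep -E --with-filename --label='cases.pdf' 'nervous shock')

printf '%s\n' "$labelled"
```

With GNU grep this prints cases.pdf:She suffered nervous shock. so each match carries the label given for standard input rather than the default (standard input) marker.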


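Note that the approach above must create a shell instance for every input path, which is inefficient, so consider rewriting your command as follows, which also avoids the need for printf %q (assuming regex=$4):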

find "${dir}" -type f -name '*.pdf' | 
    while IFS= read -r f; do   # read each path into f, the name used below
        pdftotext "$f" - | 
            grep -E -m${numMatches} --with-filename --label="$f" ${context} "${regex}" 
    done 

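The above assumes that your filenames have no embedded newlines, which is rarely a real-world concern. If it is, there are ways to address that.

An additional advantage of this solution is that it uses only POSIX-compliant features, though note that the grep command uses nonstandard options.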
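If filenames with embedded newlines do matter, one possible way around the limitation is to let find hand the paths to a single inline sh script in batches via -exec ... {} +, which is also POSIX-specified. This is a sketch under assumptions, not a drop-in replacement: the function name search_pdfs is hypothetical, the four arguments follow the same order as the script above, and the grep options remain the same nonstandard ones.

```shell
# search_pdfs DIR MAX_MATCHES CONTEXT_OPTION REGEX   (hypothetical helper)
search_pdfs() {
    find "$1" -type f -name '*.pdf' -exec sh -c '
        # The first three arguments are the search settings, the rest are paths.
        numMatches=$1 context=$2 regex=$3
        shift 3
        for f do
            pdftotext "$f" - |
                grep -E -m"$numMatches" --with-filename --label="$f" $context "$regex"
        done
    ' sh "$2" "$3" "$4" {} +
}
```

A call would look like search_pdfs "$p" 10 -5 'nervous shock'. Paths reach the inline script as separate arguments, so no name is ever parsed from a line of text, and only a few sh instances are started regardless of the number of files. The exit status reflects the last grep of each batch, so a nonzero status does not necessarily mean that nothing matched.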