2012-11-07 115 views
2

Possible duplicate:
php: pdf to string

How to extract text from a PDF file and save it to a database

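I am trying to save the text content of a PDF file to a DB. I found this link helpful, Converting PDF to string, and got it working. But it only converts a very small amount of the data :( Why is that?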

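Or is there any other solution for converting complex PDF files (containing headers, footers, tables, and two-column layouts on some pages, etc.) into a string and saving it to the DB?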

+0

As far as I know there is no OCR in PHP, so it largely depends on how much text can be parsed out of your PDF. – Tom

Answers

4

A long time ago I wrote a script that downloads a PDF file and converts it to text. This function does the conversion:

function pdf2string($sourcefile) {
    // Despite its name, $sourcefile holds the raw PDF bytes, not a file path.
    $content = $sourcefile;

    $searchstart = 'stream';
    $searchend = 'endstream';
    $pdfText = '';
    $startpos = 0;

    while (true) {
        $pos = strpos($content, $searchstart, $startpos);
        if ($pos === false) {
            break;
        }
        $pos2 = strpos($content, $searchend, $pos + strlen($searchstart));
        if ($pos2 === false) {
            break;
        }

        // The data begins after the end-of-line marker that follows "stream".
        $pos += strlen($searchstart);
        if (substr($content, $pos, 2) === "\r\n") {
            $pos += 2;
        } elseif (substr($content, $pos, 1) === "\n") {
            $pos++;
        }

        // The data ends before the end-of-line marker that precedes "endstream".
        $end = $pos2;
        if (substr($content, $end - 2, 2) === "\r\n") {
            $end -= 2;
        } elseif (substr($content, $end - 1, 1) === "\n") {
            $end--;
        }

        // Only zlib-compressed (FlateDecode) streams can be inflated here;
        // anything else makes gzuncompress() fail and is skipped.
        $data = @gzuncompress(substr($content, $pos, max(0, $end - $pos)));
        if ($data !== false) {
            $pdfText .= pdfExtractText($data);
        }
        $startpos = $pos2 + strlen($searchend);
    }

    // Collapse runs of whitespace into single spaces.
    return preg_replace('/\s+/', ' ', $pdfText);
}
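The stream-scanning step can be tried in isolation on a hand-built fragment. This is only a minimal sketch: the helper name inflateFirstStream() and the sample object are made up for illustration, and it assumes the stream uses the FlateDecode filter, which is the zlib format that gzuncompress() reads. Streams stored with another filter (for example LZWDecode or ASCII85Decode) or with no filter at all are not handled, which may be one reason why only part of a document comes out.

```php
<?php
// Hypothetical helper: return the inflated bytes of the first stream found in
// $pdfBytes, or null when there is no stream or it is not zlib data.
function inflateFirstStream(string $pdfBytes): ?string
{
    $start = strpos($pdfBytes, 'stream');
    if ($start === false) {
        return null;
    }
    $start += strlen('stream');

    // Skip the CRLF or LF that follows the "stream" keyword.
    if (substr($pdfBytes, $start, 2) === "\r\n") {
        $start += 2;
    } elseif (substr($pdfBytes, $start, 1) === "\n") {
        $start += 1;
    }

    $end = strpos($pdfBytes, 'endstream', $start);
    if ($end === false) {
        return null;
    }

    // Drop the end-of-line marker that precedes "endstream".
    if (substr($pdfBytes, $end - 2, 2) === "\r\n") {
        $end -= 2;
    } elseif (substr($pdfBytes, $end - 1, 1) === "\n") {
        $end -= 1;
    }

    // gzuncompress() returns false (with a warning) on non-zlib input.
    $data = @gzuncompress(substr($pdfBytes, $start, max(0, $end - $start)));
    return $data === false ? null : $data;
}

// Hand-built object (assumption: a fragment for illustration, not a full PDF).
$compressed = gzcompress("BT (Hello) Tj ET");
$fragment = "4 0 obj\n<< /Length " . strlen($compressed)
    . " /Filter /FlateDecode >>\nstream\n"
    . $compressed
    . "\nendstream\nendobj\n";

echo inflateFirstStream($fragment), "\n";
```

The inflated bytes are still PDF content-stream operators, so a second pass is needed to pull the readable strings out of them.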

Edit: I call pdfExtractText(). This function is defined as follows:

function pdfExtractText($psData) {

    if (!is_string($psData)) {
        return '';
    }

    $text = '';

    // Handle brackets in the text stream that could be mistaken for
    // the end of a text field. I'm sure you can do this as part of the
    // regular expression, but my skills aren't good enough yet.
    $psData = str_replace('\)', '##ENDBRACKET##', $psData);
    $psData = str_replace('\]', '##ENDSBRACKET##', $psData);

    preg_match_all(
        '/(T[wdcm*])[\s]*(\[([^\]]*)\]|\(([^\)]*)\))[\s]*Tj/si',
        $psData,
        $matches
    );
    for ($i = 0; $i < count($matches[0]); $i++) {
        if ($matches[3][$i] != '') {
            // Run another match over the contents.
            preg_match_all('/\(([^)]*)\)/si', $matches[3][$i], $subMatches);
            foreach ($subMatches[1] as $subMatch) {
                $text .= $subMatch;
            }
        } elseif ($matches[4][$i] != '') {
            $text .= ($matches[1][$i] == 'Tc' ? ' ' : '') . $matches[4][$i];
        }
    }

    // Translate special characters and put back brackets.
    $trans = array(
        '...' => '…',
        '\205' => '…',
        '\221' => chr(145),
        '\222' => chr(146),
        '\223' => chr(147),
        '\224' => chr(148),
        '\226' => '-',
        '\267' => '•',
        '\374' => 'ü',
        '\344' => 'ä',
        '\247' => '§',
        '\366' => 'ö',
        '\337' => 'ß',
        '\334' => 'Ü',
        '\326' => 'Ö',
        '\304' => 'Ä',
        '\(' => '(',
        '\[' => '[',
        '##ENDBRACKET##' => ')',
        '##ENDSBRACKET##' => ']',
        chr(133) => '-',
        chr(141) => chr(147),
        chr(142) => chr(148),
        chr(143) => chr(145),
        chr(144) => chr(146),
    );
    $text = strtr($text, $trans);

    return $text;
}

EDIT2: To get the content from a local file, use:

$fp = fopen($sourcefile, 'rb'); 
$content = fread($fp, filesize($sourcefile)); 
fclose($fp); 
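Since pdf2string() works on the file's bytes and not on its path, it can help to wrap the read in a small function that fails loudly instead of silently returning nothing. The sketch below is one possible way to do that; the helper name readPdfBytes() is hypothetical, and it assumes the "%PDF-" header sits at the very start of the file.

```php
<?php
// Hypothetical helper: load a local PDF into a string of raw bytes.
function readPdfBytes(string $path): string
{
    $bytes = @file_get_contents($path);
    if ($bytes === false) {
        throw new RuntimeException("Cannot read file: $path");
    }

    // Assumption: the file starts with the "%PDF-" header at byte 0.
    if (strncmp($bytes, '%PDF-', 5) !== 0) {
        throw new RuntimeException("Not a PDF file: $path");
    }

    return $bytes;
}

// Intended use together with the function from the answer:
// $text = pdf2string(readPdfBytes('CROI0311.pdf'));
```

Passing the path string itself to pdf2string() yields an empty result, because the path contains no "stream" keyword to scan for.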

EDIT3: Before saving the data to the database I use this escape function:

function escape($str)
{
    $search = array("\\", "\0", "\n", "\r", "\x1a", "'", '"');
    $replace = array("\\\\", "\\0", "\\n", "\\r", "\Z", "\'", '\"');
    return str_replace($search, $replace, $str);
}
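Hand-written escaping is easy to get subtly wrong, so a parameterised query is a common alternative: the driver receives the text as a bound value and no manual escaping is needed. The following is a sketch only. The table documents, its columns, and the helper saveDocumentText() are hypothetical, and an in-memory SQLite database (which needs the pdo_sqlite extension) stands in for the real one so the example runs on its own; for another database, use that driver's DSN and credentials.

```php
<?php
// Assumption: pdo_sqlite is available; an in-memory database is used so the
// example is self-contained.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec(
    'CREATE TABLE documents (id INTEGER PRIMARY KEY, filename TEXT, body TEXT)'
);

// Hypothetical helper: store the extracted text and return the new row id.
function saveDocumentText(PDO $pdo, string $filename, string $text): int
{
    $stmt = $pdo->prepare(
        'INSERT INTO documents (filename, body) VALUES (:filename, :body)'
    );
    // Values are bound as parameters, so quotes and backslashes are stored
    // unchanged without any manual escaping.
    $stmt->execute([':filename' => $filename, ':body' => $text]);

    return (int) $pdo->lastInsertId();
}

$text = "It's a \"quoted\" text with a back\\slash";
$id = saveDocumentText($pdo, 'CROI0311.pdf', $text);
```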
+0

Thanks for the reply, but it outputs nothing when I use it: $result = pdf2string('CROI0311.pdf'); echo $result; – atif

+0

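Hi, sorry about that. I edited my post earlier and added the missing function. In my case the variable '$sourcefile' does not hold the path to the PDF file. You have to pass in the PDF stream data. – Sentencio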

+0

I believe something is still missing, because I still get a blank page when I echo the result. – atif

Related questions