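I am trying to save the text content of a PDF file to a DB. I found this link helpful, Converting PDF to string, and got it working. But it only converts a very small amount of the data :( Why is that?
Or is there any other way to convert complex PDF files (with headers, footers, tables, two-column layouts on some pages, etc.) to a string and save it to a DB?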
A long time ago I wrote a script that downloads a PDF file and converts it to text. This function does the conversion:
function pdf2string($sourcefile) {
    // $sourcefile holds the raw PDF bytes (see EDIT2), not a file path.
    $content = $sourcefile;
    $searchstart = 'stream';
    $searchend = 'endstream';
    $pdfText = '';
    $startpos = 0;
    while (($pos = strpos($content, $searchstart, $startpos)) !== false) {
        $pos2 = strpos($content, $searchend, $pos + strlen($searchstart));
        if ($pos2 === false) {
            break;
        }
        // Skip the end-of-line marker (CRLF or LF) after the "stream" keyword.
        $begin = $pos + strlen($searchstart);
        if (substr($content, $begin, 2) === "\r\n") {
            $begin += 2;
        } else if (substr($content, $begin, 1) === "\n") {
            $begin++;
        }
        // Drop the end-of-line marker in front of "endstream".
        $end = $pos2;
        if (substr($content, $end - 2, 2) === "\r\n") {
            $end -= 2;
        } else if (substr($content, $end - 1, 1) === "\n") {
            $end--;
        }
        $textsection = substr($content, $begin, max(0, $end - $begin));
        // Only zlib (Flate) streams can be inflated here; skip everything else.
        $data = @gzuncompress($textsection);
        if ($data !== false) {
            $pdfText .= pdfExtractText($data);
        }
        // Continue after "endstream" so its trailing "stream" is not matched again.
        $startpos = $pos2 + strlen($searchend);
    }
    return preg_replace('/(\s)+/', ' ', $pdfText);
}
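To see what this kind of scan depends on, here is a minimal sketch that builds a synthetic buffer with gzcompress() and pulls the stream back out. The helper name inflateStreams() and the buffer layout are assumptions for illustration only; the buffer is not a valid PDF. It shows one likely reason little text comes back: any stream that is not plain zlib data (other filters, images, encrypted content) fails to inflate and is skipped.

```php
<?php
// Hypothetical helper: returns every stream in $pdf that inflates cleanly.
function inflateStreams(string $pdf): array
{
    $found = [];
    // Capture the bytes between "stream<EOL>" and "<EOL>endstream".
    preg_match_all('/stream\r?\n(.*?)\r?\nendstream/s', $pdf, $m);
    foreach ($m[1] as $raw) {
        $data = @gzuncompress($raw); // false when the data is not zlib
        if ($data !== false) {
            $found[] = $data;
        }
    }
    return $found;
}

// Synthetic buffer: one Flate-compressed stream and one uncompressed stream.
$pdf = "1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
     . gzcompress('BT 0 0 Td (Hello) Tj ET')
     . "\nendstream\nendobj\n"
     . "2 0 obj\n<< >>\nstream\nnot compressed\nendstream\nendobj\n";

$streams = inflateStreams($pdf);
echo count($streams), "\n"; // only the Flate stream survives
echo $streams[0], "\n";
```

Counting how many streams of a real file inflate, compared with how many "stream" keywords it contains, gives a rough idea of how much content this approach can reach at all.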
EDIT: The function above calls pdfExtractText(),
which is defined as follows:
function pdfExtractText($psData){
    if (!is_string($psData)) {
        return '';
    }
    $text = '';
    // Handle brackets in the text stream that could be mistaken for
    // the end of a text field. I'm sure you can do this as part of the
    // regular expression, but my skills aren't good enough yet.
    $psData = str_replace('\)', '##ENDBRACKET##', $psData);
    $psData = str_replace('\]', '##ENDSBRACKET##', $psData);
    preg_match_all(
        '/(T[wdcm*])[\s]*(\[([^\]]*)\]|\(([^\)]*)\))[\s]*Tj/si',
        $psData,
        $matches
    );
    for ($i = 0; $i < sizeof($matches[0]); $i++) {
        if ($matches[3][$i] != '') {
            // Run another match over the contents.
            preg_match_all('/\(([^)]*)\)/si', $matches[3][$i], $subMatches);
            foreach ($subMatches[1] as $subMatch) {
                $text .= $subMatch;
            }
        } else if ($matches[4][$i] != '') {
            $text .= ($matches[1][$i] == 'Tc' ? ' ' : '') . $matches[4][$i];
        }
    }
    // Translate special characters and put back brackets.
    $trans = array(
        '...' => '…',
        '\205' => '…',
        '\221' => chr(145),
        '\222' => chr(146),
        '\223' => chr(147),
        '\224' => chr(148),
        '\226' => '-',
        '\267' => '•',
        '\374' => 'ü',
        '\344' => 'ä',
        '\247' => '§',
        '\366' => 'ö',
        '\337' => 'ß',
        '\334' => 'Ü',
        '\326' => 'Ö',
        '\304' => 'Ä',
        '\(' => '(',
        '\[' => '[',
        '##ENDBRACKET##' => ')',
        '##ENDSBRACKET##' => ']',
        chr(133) => '-',
        chr(141) => chr(147),
        chr(142) => chr(148),
        chr(143) => chr(145),
        chr(144) => chr(146),
    );
    $text = strtr($text, $trans);
    return $text;
}
EDIT2: To get the content from a local file, use:
$fp = fopen($sourcefile, 'rb');
$content = fread($fp, filesize($sourcefile));
fclose($fp);
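The same read can be done in one call with file_get_contents(), which also makes a failed read easier to catch. This is only a sketch: the wrapper name readPdfBytes() is made up for illustration and is not part of the original script.

```php
<?php
// Hypothetical wrapper: returns the whole file as raw bytes or throws.
function readPdfBytes(string $path): string
{
    $content = @file_get_contents($path);
    if ($content === false) {
        throw new RuntimeException("Cannot read $path");
    }
    return $content;
}
```

With that in place the conversion would be called as pdf2string(readPdfBytes($sourcefile)).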
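EDIT3: Before saving the data to the database I run it through this escape function:
function escape($str)
{
    // Backslash-escape the characters that would break a quoted SQL string.
    $search=array("\\","\0","\n","\r","\x1a","'",'"');
    $replace=array("\\\\","\\0","\\n","\\r","\Z","\'",'\"');
    return str_replace($search,$replace,$str);
}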
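A hand-written escape function is easy to get wrong, so an alternative worth considering is a prepared statement, where the driver handles quoting. The sketch below is not what the original script did: it assumes the pdo_sqlite extension is available, and the table name documents and column name body are invented for the example. An in-memory SQLite database is used only to keep it self-contained.

```php
<?php
// Assumed setup: in-memory SQLite with a made-up table and column.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)');

// Text with the characters the escape function worries about.
$text = "It's a \"quoted\" line\nwith a backslash \\ in it";

// The value is bound as a parameter, so no manual escaping is needed.
$stmt = $pdo->prepare('INSERT INTO documents (body) VALUES (:body)');
$stmt->execute([':body' => $text]);

// Read it back to confirm that it was stored unchanged.
$stored = $pdo->query('SELECT body FROM documents')->fetchColumn();
var_dump($stored === $text);
```

For MySQL only the DSN and credentials passed to the PDO constructor would change; the prepare() and execute() calls stay the same.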
As far as I know there is no OCR available in PHP, so how much text can be parsed depends largely on your PDF. – Tom