Scraping a plain text file with no HTML?

I have the following data in a plain text file:

1. Value 
Location : Value 
Owner: Value 
Architect: Value 

2. Value 
Location : Value 
Owner: Value 
Architect: Value 

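... up to 200+ ...

The number and the values differ for each block.

Now I need to insert this data into a MySQL database.

Do you have a suggestion on how I can traverse and scrape it, so that I get the value of the text beside the number as well as the values of "Location", "Owner" and "Architect"?

Since there are no HTML tags, it seems hard to do with DOM scraping classes.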

Source

2011-12-08 IMB

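A simple approach should be good enough.

You will be fine with a very simple stateful line-oriented parser. On every line you accumulate the parsed data into an array(). When something tells you that a new record is starting, you dump what you have parsed and start over.

Line-oriented parsers have a nice property: they need very little memory and, most importantly, a constant amount of it. They can chew through gigabytes of data without breaking a sweat. I manage a lot of production servers, and nothing is worse than scripts that slurp a whole file into memory (and then fill arrays with the parsed content, which takes more than twice the size of the original file in memory).

This works and is mostly unbreakable: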

<?php 
$in_name = 'in.txt';
$in = fopen($in_name, 'r') or die("Cannot open $in_name\n");

function dump_record($r) {
    print_r($r);
}

$current = array();
/* Compare against false so that a line holding only "0" does not end the loop */
while (($line = fgets($in)) !== false) {
    /* Skip empty lines (any amount of whitespace counts as 'empty') */
    if (preg_match('/^\s*$/', $line)) continue;

    /* Search for '123. <value>' stanzas */
    if (preg_match('/^(\d+)\.\s+(.*?)\s*$/', $line, $start)) {
        /* If we already parsed a record, this is the time to dump it */
        if (!empty($current)) dump_record($current);

        /* Let's start the new record, keeping the value next to the number */
        $current = array('id' => $start[1], 'value' => $start[2]);
    }
    else if (preg_match('/^\s*([^:]+?)\s*:\s*(.*?)\s*$/', $line, $keyval)) {
        /* Otherwise parse a plain 'key: value' stanza (key and value are trimmed) */
        $current[ $keyval[1] ] = $keyval[2];
    }
    else {
        error_log("parsing error: '" . rtrim($line) . "'");
    }
}

/* Don't forget to dump the last parsed record, a situation
 * we only detect at EOF (end of file) */
if (!empty($current)) dump_record($current);

fclose($in);
?>

Obviously you will want something suited to your taste in function dump_record, such as printing a properly formatted INSERT SQL statement.

Source
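
As a minimal sketch of that idea, dump_record could turn a parsed record into a parameterised INSERT instead of pasting values into the SQL string. The table name projects, the column names and the helper build_insert are assumptions made for illustration; they do not come from the question. Keys are normalised, so 'Location ' and 'Location' end up in the same column.

```php
<?php
/* Hypothetical helper: maps one parsed record to an INSERT statement with
 * placeholders, plus the values to bind. Table and column names are assumed. */
function build_insert(array $record) {
    // Normalise keys such as 'Location ' or 'Owner' to 'location', 'owner'.
    $clean = array();
    foreach ($record as $key => $value) {
        $clean[strtolower(trim($key))] = trim($value);
    }

    $columns = array('id', 'location', 'owner', 'architect');
    $params = array();
    foreach ($columns as $column) {
        // A missing field is bound as NULL instead of aborting the import.
        $params[] = isset($clean[$column]) ? $clean[$column] : null;
    }

    $sql = 'INSERT INTO projects (' . implode(', ', $columns) . ') VALUES ('
         . implode(', ', array_fill(0, count($columns), '?')) . ')';
    return array($sql, $params);
}

function dump_record($r) {
    list($sql, $params) = build_insert($r);
    // With an open PDO connection in $pdo (assumed), this would become:
    //   $pdo->prepare($sql)->execute($params);
    echo $sql, ' -- ', json_encode($params), "\n";
}

dump_record(array('id' => '1', 'Location ' => 'Paris', 'Owner' => 'Ann', 'Architect' => 'Bob'));
```

Binding the values separately means quotes or other punctuation inside a value cannot break the statement.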

2011-12-08 14:31:09 zerodeux

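I edited my comment to emphasize stream-oriented/line-oriented parsers. PHP culture is oriented towards file()/file_get_contents(), but that does not scale. And you often want to scale, especially for data import problems like this one. Process only one record at a time! – zerodeux

This is better than a boss – IMB

If every block has the same structure, you can do this with the file() function: http://nl.php.net/manual/en/function.file.php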

$data = file('path/to/file.txt');

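With this, every line is an item in the array, and you can loop through it.

for ($i = 0, $n = count($data); $i + 3 < $n; $i += 5) {
    // Each block is 4 data lines followed by 1 blank line, hence the step of 5
    $valuerow = $data[$i];
    $locationrow = $data[$i+1];
    $ownerrow = $data[$i+2];
    $architectrow = $data[$i+3];
    // strip the data you don't want here, and insert it into the database.
}

Source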

2011-12-08 14:15:45

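Of course, inside the for loop, after those statements, there should be the SQL query that inserts the data.

@Aurelio: Not necessarily... I always prefer to write a tab-delimited or similar file and then use the database's bulk-load tool (sqlldr, MySQL's LOAD DATA INFILE, etc.), which gives you a chance to review the parse before inserting. – Joe

@Joe Good solution, but it is neither what Topener wrote nor my solution; in his words, the variables are the same every time. That is my opinion.

Agreeing with Topener's solution, here is an example for the case where every block is 4 lines plus a blank line: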
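
A rough sketch of the bulk-load route Joe describes, assuming the records have already been parsed into arrays: fputcsv with a tab delimiter writes the intermediate file. The helper write_tsv, the file name records.tsv, the sample records and the table name in the SQL comment are all illustrative assumptions.

```php
<?php
/* Hypothetical helper: writes parsed records as tab-separated lines. */
function write_tsv($handle, array $records) {
    foreach ($records as $r) {
        // Enclosure and escape characters are passed explicitly instead of relying on defaults.
        fputcsv($handle, array($r['id'], $r['location'], $r['owner'], $r['architect']), "\t", '"', '\\');
    }
}

$records = array(
    array('id' => 1, 'location' => 'Paris', 'owner' => 'Ann', 'architect' => 'Bob'),
    array('id' => 2, 'location' => 'New York', 'owner' => 'Cy', 'architect' => 'Di'),
);

$path = sys_get_temp_dir() . '/records.tsv';
$out = fopen($path, 'w') or die("Cannot open $path\n");
write_tsv($out, $records);
fclose($out);

/* After checking the file by eye, MySQL could load it with something like:
 *   LOAD DATA LOCAL INFILE '/tmp/records.tsv' INTO TABLE projects
 *   FIELDS TERMINATED BY '\t' OPTIONALLY ENCLOSED BY '"'
 *   (id, location, owner, architect);
 */
```

fputcsv wraps fields that contain spaces in double quotes, which is why the LOAD DATA sketch declares the fields as optionally enclosed.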

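$data = file('path/to/file.txt', FILE_IGNORE_NEW_LINES);
$id = 0;
$parsedData = array();
foreach ($data as $n => $row) {
    if (($n % 5) == 0) {
        // First line of a block, e.g. "12. Value": the cast keeps the leading number
        $id = (int) $row;
    } elseif (strpos($row, ':') !== false) {
        // "Key : Value" line; the blank separator line has no colon and is skipped
        list($key, $value) = array_map('trim', explode(':', $row, 2));
        $parsedData[$id][$key] = $value;
    }
}

The resulting structure is convenient to use, for MySQL or anything else. Keys and values are trimmed, so the space before the colon in "Location :" does not end up in the key.

Good luck!

Source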

2011-12-08 14:22:49 Mikhail

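If the data always has the same structure, you can use fscanf to scan it from the file.

/* fscanf() reads a single line per call, so each line of a
 * record gets its own format. */
$file = fopen('file.txt', 'r');
while ($data = fscanf($file, '%d. %s')) {
    list($number, $title) = $data;
    list($location) = fscanf($file, 'Location : %s');
    list($owner) = fscanf($file, 'Owner: %s');
    list($architect) = fscanf($file, 'Architect: %s');
    fgets($file); // Skip the blank line between records
    // Insert the data to database here
}
fclose($file);

More about fscanf in the docs.

Source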

2011-12-08 14:26:13 RCE

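+1 for line-oriented parsing; besides, many people will prefer the fscanf style over the regular expressions I proposed myself. My solution is more robust with respect to whitespace and reports errors at the line level, but it is twice the amount of code. – zerodeux

+1 for 'fscanf'. I did not know this function; it looks very useful.

How do you make this work if the values contain spaces and other characters? For example 'This is, (a value)!'. – IMB

This will give you what you want: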
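
One possible answer, offered as a sketch and not taken from the thread: %s stops at the first whitespace, whereas a negated character class such as %[^\n] reads up to the end of the line, so a value may contain spaces and punctuation. The sample lines are made up.

```php
<?php
/* %[^\n] captures everything up to the newline; trim() drops trailing blanks. */
list($number, $title) = sscanf("12. This is, (a value)! \n", "%d. %[^\n]");
list($location) = sscanf("Location : 5th Avenue, New York\n", "Location : %[^\n]");

$title = trim($title);
$location = trim($location);

echo $number, ' | ', $title, ' | ', $location, "\n";
```

The same format strings can be passed to fscanf, which applies them to one line of the file per call.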

// $txt holds the file contents and $pdo is an open PDO connection (both assumed).
// your_table is a placeholder: change the query to match your schema.
$stmt = $pdo->prepare('INSERT INTO your_table (id, location, owner, architect) VALUES (?, ?, ?, ?)');

// Split on blank lines, tolerating stray whitespace on the separator line
$blocks = preg_split('/\n\s*\n/', trim($txt));
foreach ($blocks as $key => $block) {
    // Skip any block that does not contain all four fields
    if (!preg_match('#^' . ($key + 1) . '\.\s*(.*)$#m', $block, $id)
        || !preg_match('#^Location\s*:\s*(.*)$#m', $block, $location)
        || !preg_match('#^Owner\s*:\s*(.*)$#m', $block, $owner)
        || !preg_match('#^Architect\s*:\s*(.*)$#m', $block, $architect)) {
        continue;
    }

    // Bound parameters replace the removed mysql_query() and avoid SQL injection
    $stmt->execute(array(trim($id[1]), trim($location[1]), trim($owner[1]), trim($architect[1])));
}

Source

2011-12-08 14:27:11 nine7ySix

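/* $txt holds the whole file contents (assumed) */
preg_match_all('/(\d+)\.(.*?)\sLocation\s*:\s*(.*?)\sOwner\s*:\s*(.*?)\sArchitect\s*:\s*(.*)/i', $txt, $m);

$matched = array();

foreach ($m[1] as $k => $v) {
    /* Groups are indexed by match position $k; $v is the record number */
    $matched[$v] = array(
        "value" => trim($m[2][$k]),
        "location" => trim($m[3][$k]),
        "owner" => trim($m[4][$k]),
        "architect" => trim($m[5][$k])
    );
}

Source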

2011-12-08 14:28:24 Lee
