2010-07-20 93 views
3

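Split a large string into an array without breaking a tag at the split point

I wrote a script that sends large chunks of text to Google for translation. Sometimes the text (which is HTML source code) ends up being split in the middle of an HTML tag, and Google then returns broken code.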

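I already know how to split a string into an array, but is there a better way to do it that guarantees each output string is no longer than 5000 characters and is never split inside a tag?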

UPDATE: Thanks to the answer, this is the code I ended up using in my project, and it works great:

function handleTextHtmlSplit($text, $maxSize) {
    // our collection array, starting with one empty group
    $niceHtml = array('');

    // Splits on tags, but also includes each tag as an item in the result
    $pieces = preg_split('/(<[^>]*>)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    // the current position of the index
    $currentPiece = 0;

    // start assembling a group until it gets to max size
    foreach ($pieces as $piece) {
        // make sure string length of this piece will not exceed max size when inserted
        // (an empty group is never skipped, so no empty chunk is produced;
        // a single piece longer than $maxSize still ends up in a group of its own)
        if ($niceHtml[$currentPiece] !== ''
            && strlen($niceHtml[$currentPiece] . $piece) > $maxSize) {
            // advance current piece; the overflow goes into the next group
            $currentPiece += 1;
            // create empty string as value for next piece in the index
            $niceHtml[$currentPiece] = '';
        }
        // insert piece into our master array
        $niceHtml[$currentPiece] .= $piece;
    }

    // return array of nicely handled html
    return $niceHtml;
}

Answers

3

Note: I haven't had a chance to test this (so there may be a minor bug or two), but it should give you an idea:

function get_groups_of_5000_or_less($input_string) {

    // Splits on tags, but also includes each tag as an item in the result.
    // PREG_SPLIT_NO_EMPTY drops the empty strings produced between adjacent tags.
    $pieces = preg_split('/(<[^>]*>)/', $input_string,
        -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $groups = array('');
    $current_group = 0;

    // Compare against null so that a piece such as "0" does not end the loop early
    while (($cur_piece = array_shift($pieces)) !== null) {
        $piecelen = strlen($cur_piece);

        if (strlen($groups[$current_group]) + $piecelen > 5000) {
            // Adding the next piece whole would go over the limit,
            // figure out what to do.
            if ($cur_piece[0] == '<') {
                // Tag goes over the limit, just put it into a new group
                $groups[++$current_group] = $cur_piece;
            } else {
                // Non-tag goes over the limit, split it and put the
                // remainder back on the list of un-grabbed pieces.
                // max() guards against a group that already holds an oversized tag.
                $grab_amount = max(0, 5000 - strlen($groups[$current_group]));
                $groups[$current_group] .= substr($cur_piece, 0, $grab_amount);
                $groups[++$current_group] = '';
                array_unshift($pieces, substr($cur_piece, $grab_amount));
            }
        } else {
            // Adding this piece doesn't go over the limit, so just add it
            $groups[$current_group] .= $cur_piece;
        }
    }
    return $groups;
}

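Also note that this can split in the middle of regular words. If you don't want that, modify the part that begins with // Non-tag goes over the limit so that it chooses a better value for $grab_amount. I didn't bother coding that in, since this is only meant as an example of how to avoid splitting tags, not a drop-in solution.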
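One possible way to pick that better value, as a minimal sketch only: the helper name choose_grab_amount() and its parameters are hypothetical and not part of the answer. It assumes that cutting just after the last whitespace that fits is acceptable, and falls back to a hard cut when the text has no whitespace at all.

```php
<?php
// Hypothetical helper (not from the original answer): decide how many bytes
// of a text piece to take so that the cut lands just after whitespace.
//   $piece - a non-tag piece of text
//   $room  - how many bytes are still free in the current group
function choose_grab_amount($piece, $room)
{
    // No room left: take nothing.
    if ($room <= 0) {
        return 0;
    }
    // The whole piece fits: take all of it.
    if (strlen($piece) <= $room) {
        return strlen($piece);
    }
    // Only look at the part that would fit.
    $head = substr($piece, 0, $room);
    // Greedy match up to and including the last whitespace character.
    if (preg_match('/^.*\s/s', $head, $m)) {
        return strlen($m[0]);
    }
    // No whitespace in range: fall back to a hard cut.
    return $room;
}
```

In the function above it would replace the plain subtraction, for example $grab_amount = choose_grab_amount($cur_piece, 5000 - strlen($groups[$current_group]));. Keep in mind that strlen() and substr() count bytes, so the hard-cut fallback can still land inside a multibyte character.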

+0

Wow Amber, thank you. This should really get my wheels turning. I'll give it a go. – james 2010-07-21 01:57:49

0

Why not strip the HTML tags from the string before sending it to Google? PHP has a strip_tags() function that can do this for you.
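A quick sketch of what that looks like on a small sample (the sample string is made up for illustration). Note that the markup is discarded entirely, so this only suits cases where the formatting does not need to survive the round trip.

```php
<?php
// Made-up sample input for illustration.
$html = '<p>Hello <b>world</b></p>';

// strip_tags() removes the HTML tags and keeps only the text between them.
$plain = strip_tags($html);

echo $plain, "\n"; // prints "Hello world"
```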

+0

Because I need to keep the HTML intact, since it will eventually be rendered on a page – james 2010-07-21 01:56:32

+0

Doesn't Google translate the tags as well? – 2010-07-21 07:44:24

+0

No, as far as my testing shows, it ignores HTML tags and attributes other than 'alt'. It returns them untouched – james 2010-07-21 18:21:15

0

preg_split with a good regular expression will do this for you.
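As a rough illustration of that idea, here is a minimal sketch. The function name split_keeping_tags() and the sample string are hypothetical, and the pattern is the simple one used elsewhere in this thread, which assumes no literal > appears inside an attribute value, comment, or script.

```php
<?php
// Hypothetical wrapper: split HTML into tag pieces and text pieces.
function split_keeping_tags($html)
{
    // The capturing group makes preg_split() return the tags themselves
    // (PREG_SPLIT_DELIM_CAPTURE); PREG_SPLIT_NO_EMPTY drops empty strings.
    return preg_split('/(<[^>]*>)/', $html, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
}

$pieces = split_keeping_tags('<p>Hello <b>world</b></p>');
print_r($pieces);
// The pieces are: <p>, "Hello ", <b>, "world", </b>, </p>
```

Because every tag arrives as a piece of its own, the pieces can be regrouped into chunks of any size without ever cutting through a tag, and joining them with implode('', $pieces) gives back the original string.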