Simple GLSL convolution shader is atrociously slow

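I'm trying to implement a 2D outline shader in OpenGL ES 2.0 on iOS. It is extremely slow, as in 5 fps slow. I've tracked it down to the texture2D() calls, but without those no convolution shader is possible. I tried using lowp instead of mediump, but then everything is just black; it does gain another 5 fps, but that is still unusable.

This is my fragment shader.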

varying mediump vec4 colorVarying;
varying mediump vec2 texCoord;

uniform bool enableTexture;
uniform sampler2D texture;

uniform mediump float k;

void main() {

    const mediump float step_w = 3.0/128.0;
    const mediump float step_h = 3.0/128.0;
    const mediump vec4 b = vec4(0.0, 0.0, 0.0, 1.0);
    const mediump vec4 one = vec4(1.0, 1.0, 1.0, 1.0);

    mediump vec2 offset[9];
    mediump float kernel[9];
    offset[0] = vec2(-step_w, step_h);
    offset[1] = vec2(-step_w, 0.0);
    offset[2] = vec2(-step_w, -step_h);
    offset[3] = vec2(0.0, step_h);
    offset[4] = vec2(0.0, 0.0);
    offset[5] = vec2(0.0, -step_h);
    offset[6] = vec2(step_w, step_h);
    offset[7] = vec2(step_w, 0.0);
    offset[8] = vec2(step_w, -step_h);

    kernel[0] = kernel[2] = kernel[6] = kernel[8] = 1.0/k;
    kernel[1] = kernel[3] = kernel[5] = kernel[7] = 2.0/k;
    kernel[4] = -16.0/k;

    if (enableTexture) {
        mediump vec4 sum = vec4(0.0);
        for (int i=0;i<9;i++) {
            mediump vec4 tmp = texture2D(texture, texCoord + offset[i]);
            sum += tmp * kernel[i];
        }

        gl_FragColor = (sum * b) + ((one-sum) * texture2D(texture, texCoord));
    } else {
        gl_FragColor = colorVarying;
    }
}

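This is not optimized and not finalized, but I need to bring the performance up before continuing. I tried replacing the texture2D() call in the loop with a plain vec4, and it runs without a problem despite everything else that is going on.

How can I optimize this? I know it's possible, because I've seen far more involved 3D effects run without a problem. I can't see why this is causing any trouble at all.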


2012-09-18 user1137704

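"*I tried replacing the texture2D() call in the loop with a plain vec4, and it runs without a problem*" What does that mean? Did it get faster? Did performance not change? What happened?

"*I can't see why this is causing any trouble at all.*" You are doing *ten texture accesses* per shader invocation and you don't see what could be causing a problem? Also, you access the center texel twice.

Without the texture lookups (excluding the final one) I get a solid 60 fps. Like I said, it's not optimized, but there is no way to avoid those texture calls; the filter couldn't work otherwise. Yet I've seen plenty of games, mobile and not, that use effects based on convolution filters, and they don't seem to have any problem. Unless there is some trick to avoid them? – user1137704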

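The only way I know of to reduce the time taken by this shader is to reduce the number of texture fetches. Since your shader samples the texture at equally spaced points around the center pixel and combines them linearly, you can reduce the number of fetches by making use of the GL_LINEAR mode available for texture sampling.

Basically, instead of sampling at every texel, sample in between a pair of texels to get a linearly weighted sum directly.

Let us call the samples at offsets (-stepw, -steph) and (-stepw, 0) x0 and x1 respectively. Then your sum is

sum = x0*k0 + x1*k1

If you instead sample in between these two texels, at a distance of k1/(k0+k1) from x0 and therefore k0/(k0+k1) from x1 (as fractions of the spacing between them), the GPU performs the linear weighting during the fetch and gives you

y = x1*k1/(k0+k1) + x0*k0/(k0+k1)

Thus the sum can be computed from just one fetch:

sum = y*(k0 + k1)

If you repeat this for the other neighbouring pixels, you end up with 4 texture fetches covering the 8 neighbouring offsets, plus one extra fetch for the center pixel.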
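As a rough illustration of the pairing described above, here is a sketch applied to the question's kernel (corner weight 1/k, edge weight 2/k, center weight -16/k). It is not the original poster's code. It assumes the texture uses GL_LINEAR filtering, that the samples are exactly one texel apart, and that `texCoord` lands on texel centers; `texelSize` is a hypothetical uniform introduced for the example. The trick only works for pairs of adjacent texels whose weights have the same sign.

```glsl
// Sketch only. Assumptions (not from the original post):
//  - the texture is sampled with GL_LINEAR min/mag filtering,
//  - texelSize (hypothetical uniform) is vec2(1.0/width, 1.0/height),
//    so neighbouring samples are exactly one texel apart,
//  - texCoord lands on texel centers.
precision mediump float;

varying vec2 texCoord;

uniform sampler2D texture;
uniform float k;
uniform vec2 texelSize;

void main() {
    // An edge texel (weight 2/k) and the adjacent corner texel (weight 1/k)
    // are merged into one fetch of weight 3/k, taken 1/3 of the way from
    // the edge texel towards the corner texel.
    const float t = 1.0 / 3.0;

    vec4 pairs =
        texture2D(texture, texCoord + texelSize * vec2(-t,  1.0)) +  // ( 0, 1) and (-1, 1)
        texture2D(texture, texCoord + texelSize * vec2( 1.0,  t)) +  // ( 1, 0) and ( 1, 1)
        texture2D(texture, texCoord + texelSize * vec2( t, -1.0)) +  // ( 0,-1) and ( 1,-1)
        texture2D(texture, texCoord + texelSize * vec2(-1.0, -t));   // (-1, 0) and (-1,-1)

    vec4 center = texture2D(texture, texCoord);

    // 4 paired fetches + 1 center fetch replace the 9-tap loop.
    vec4 sum = (3.0 * pairs - 16.0 * center) / k;

    // Same final blend as the question's shader, reusing the center fetch.
    const vec4 b = vec4(0.0, 0.0, 0.0, 1.0);
    gl_FragColor = sum * b + (vec4(1.0) - sum) * center;
}
```

The hardware interpolates with limited fractional precision, so the effective weights are approximate. The five coordinates are still computed in the fragment shader here; they could equally be computed per vertex and passed in as varyings.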

The link explains this better.


2012-09-18 05:04:17 Slartibartfast

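I've done this exact thing myself, and I see several things that could be optimized here.

First, I'd remove the enableTexture conditional and instead split the shader into two programs, one for the true state and one for the false. Conditionals are very expensive in iOS fragment shaders, particularly ones that contain texture reads.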
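A minimal sketch of what that split might look like for the untextured case, assuming the application selects the program per draw call (for example with glUseProgram) instead of setting the enableTexture uniform. The second program would then contain only the convolution path, with no branch.

```glsl
// Sketch only: the former "else" branch as its own fragment shader.
// No conditional and no texture reads are left in this program.
varying mediump vec4 colorVarying;

void main() {
    gl_FragColor = colorVarying;
}
```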

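Second, you have nine dependent texture reads here. These are texture reads where the texture coordinates are calculated within the fragment shader. Dependent texture reads are very expensive on the PowerVR GPUs in iOS devices, because they prevent the hardware from optimizing texture reads using caching and so on. Because you are sampling at fixed offsets for the 8 surrounding pixels and one central pixel, these calculations should be moved up into the vertex shader. This also means the calculations don't have to be performed for every pixel, only once per vertex, and hardware interpolation handles the rest.

Third, for() loops haven't been handled all that well by the iOS shader compiler so far, so I tend to avoid them where I can.

As I mentioned, I've done convolution shaders like this in my open source iOS GPUImage framework. For a generic convolution filter, I use the following vertex shader: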

attribute vec4 position; 
attribute vec4 inputTextureCoordinate; 

uniform highp float texelWidth; 
uniform highp float texelHeight; 

varying vec2 textureCoordinate; 
varying vec2 leftTextureCoordinate; 
varying vec2 rightTextureCoordinate; 

varying vec2 topTextureCoordinate; 
varying vec2 topLeftTextureCoordinate; 
varying vec2 topRightTextureCoordinate; 

varying vec2 bottomTextureCoordinate; 
varying vec2 bottomLeftTextureCoordinate; 
varying vec2 bottomRightTextureCoordinate; 

void main() 
{ 
    gl_Position = position; 

    vec2 widthStep = vec2(texelWidth, 0.0); 
    vec2 heightStep = vec2(0.0, texelHeight); 
    vec2 widthHeightStep = vec2(texelWidth, texelHeight); 
    vec2 widthNegativeHeightStep = vec2(texelWidth, -texelHeight); 

    textureCoordinate = inputTextureCoordinate.xy; 
    leftTextureCoordinate = inputTextureCoordinate.xy - widthStep; 
    rightTextureCoordinate = inputTextureCoordinate.xy + widthStep; 

    topTextureCoordinate = inputTextureCoordinate.xy - heightStep; 
    topLeftTextureCoordinate = inputTextureCoordinate.xy - widthHeightStep; 
    topRightTextureCoordinate = inputTextureCoordinate.xy + widthNegativeHeightStep; 

    bottomTextureCoordinate = inputTextureCoordinate.xy + heightStep; 
    bottomLeftTextureCoordinate = inputTextureCoordinate.xy - widthNegativeHeightStep; 
    bottomRightTextureCoordinate = inputTextureCoordinate.xy + widthHeightStep; 
}

and the following fragment shader:

precision highp float; 

uniform sampler2D inputImageTexture; 

uniform mediump mat3 convolutionMatrix; 

varying vec2 textureCoordinate; 
varying vec2 leftTextureCoordinate; 
varying vec2 rightTextureCoordinate; 

varying vec2 topTextureCoordinate; 
varying vec2 topLeftTextureCoordinate; 
varying vec2 topRightTextureCoordinate; 

varying vec2 bottomTextureCoordinate; 
varying vec2 bottomLeftTextureCoordinate; 
varying vec2 bottomRightTextureCoordinate; 

void main() 
{ 
    mediump vec4 bottomColor = texture2D(inputImageTexture, bottomTextureCoordinate); 
    mediump vec4 bottomLeftColor = texture2D(inputImageTexture, bottomLeftTextureCoordinate); 
    mediump vec4 bottomRightColor = texture2D(inputImageTexture, bottomRightTextureCoordinate); 
    mediump vec4 centerColor = texture2D(inputImageTexture, textureCoordinate); 
    mediump vec4 leftColor = texture2D(inputImageTexture, leftTextureCoordinate); 
    mediump vec4 rightColor = texture2D(inputImageTexture, rightTextureCoordinate); 
    mediump vec4 topColor = texture2D(inputImageTexture, topTextureCoordinate); 
    mediump vec4 topRightColor = texture2D(inputImageTexture, topRightTextureCoordinate); 
    mediump vec4 topLeftColor = texture2D(inputImageTexture, topLeftTextureCoordinate); 

    mediump vec4 resultColor = topLeftColor * convolutionMatrix[0][0] + topColor * convolutionMatrix[0][1] + topRightColor * convolutionMatrix[0][2]; 
    resultColor += leftColor * convolutionMatrix[1][0] + centerColor * convolutionMatrix[1][1] + rightColor * convolutionMatrix[1][2]; 
    resultColor += bottomLeftColor * convolutionMatrix[2][0] + bottomColor * convolutionMatrix[2][1] + bottomRightColor * convolutionMatrix[2][2]; 

    gl_FragColor = resultColor; 
}

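The texelWidth and texelHeight uniforms are the inverse of the width and height of the input image, and the convolutionMatrix uniform specifies the weights for the various samples in the convolution.

On an iPhone 4, this runs in 4-8 ms for a 640x480 frame of camera video, which is good enough for 60 FPS rendering at that image size. If you only need something like edge detection, you can simplify the above: convert the image to luminance in a pre-pass, then sample from just one color channel. That is even faster, at about 2 ms per frame on the same device.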
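A sketch of how the single-channel variant might look, not GPUImage's actual edge-detection shader. It assumes an earlier pass (not shown) has already written luminance into the red channel of inputImageTexture, and that the vertex shader above supplies the nine varyings.

```glsl
// Sketch only. Assumption: a previous pass stored luminance in the red
// channel of inputImageTexture, so each sample is a single float.
precision highp float;

uniform sampler2D inputImageTexture;
uniform mediump mat3 convolutionMatrix;

varying vec2 textureCoordinate;
varying vec2 leftTextureCoordinate;
varying vec2 rightTextureCoordinate;

varying vec2 topTextureCoordinate;
varying vec2 topLeftTextureCoordinate;
varying vec2 topRightTextureCoordinate;

varying vec2 bottomTextureCoordinate;
varying vec2 bottomLeftTextureCoordinate;
varying vec2 bottomRightTextureCoordinate;

void main()
{
    // Read only the red channel of each of the nine samples.
    mediump float topLeft     = texture2D(inputImageTexture, topLeftTextureCoordinate).r;
    mediump float top         = texture2D(inputImageTexture, topTextureCoordinate).r;
    mediump float topRight    = texture2D(inputImageTexture, topRightTextureCoordinate).r;
    mediump float left        = texture2D(inputImageTexture, leftTextureCoordinate).r;
    mediump float center      = texture2D(inputImageTexture, textureCoordinate).r;
    mediump float right       = texture2D(inputImageTexture, rightTextureCoordinate).r;
    mediump float bottomLeft  = texture2D(inputImageTexture, bottomLeftTextureCoordinate).r;
    mediump float bottom      = texture2D(inputImageTexture, bottomTextureCoordinate).r;
    mediump float bottomRight = texture2D(inputImageTexture, bottomRightTextureCoordinate).r;

    // Scalar multiply-adds instead of vec4 ones.
    mediump float result = topLeft * convolutionMatrix[0][0] + top * convolutionMatrix[0][1] + topRight * convolutionMatrix[0][2];
    result += left * convolutionMatrix[1][0] + center * convolutionMatrix[1][1] + right * convolutionMatrix[1][2];
    result += bottomLeft * convolutionMatrix[2][0] + bottom * convolutionMatrix[2][1] + bottomRight * convolutionMatrix[2][2];

    // Write the filtered value as an opaque grey.
    gl_FragColor = vec4(vec3(result), 1.0);
}
```

The number of texture reads is unchanged; the saving in this sketch comes from doing the arithmetic on one float per sample rather than on a full vec4.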


2012-09-18 16:40:04

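Thanks a lot. It saved me! – hiepnd

Great example. tl;dr: **avoid dependent texture reads**. Also try testing a separable convolution rendered in two passes to reduce the number of fetches (although for a 9-sample example like this it won't cut them to less than half, so a two-pass approach may be a bad idea in this case).

@StevenLu - Once you go above 9 texture reads in a single pass on many GPUs, the drop in performance is surprisingly sharp. Splitting the work into two passes can have a nonlinear effect on performance compared with the number of samples in a single pass. I've tested this, and running it in a single pass is much slower than separating the kernel, even for this small a number of samples.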
