@jenshannoschwalm
Collaborator

Fix misleading OpenCL oversize buffer log

Fallback to non-const buffer should only be reported in DT_DEBUG_VERBOSE mode.

Fix some VNG OpenCL performance regressions

  1. subtle vng border_interpolate kernel improvements
    • #define AVGWINDOW
    • use samplerA where coordinates have been checked
  2. in the OpenCL VNG code, don't calculate and copy device buffers that are not required when we only do the linear interpolation part.
  3. capture log fix

@jenshannoschwalm added the labels "bugfix" (pull request fixing a bug) and "scope: performance" (doing everything the same but faster) on Jan 4, 2026
@jenshannoschwalm added this to the 5.4.1 milestone on Jan 4, 2026
@jenshannoschwalm
Collaborator Author

Inspired by #20055, which reports an OpenCL performance regression for the VNG demosaicer.
This PR certainly fixes some performance regressions, but I don't think VNG should be as slow as reported there. As VNG doesn't tile in the logs reported there, I currently suspect the barrier(CLK_LOCAL_MEM_FENCE) to be blocking on the old card with only a few compute units. Not sure what we can do about that ...

@agat114 if you can self-compile you might want to check
@kofa73 @TurboGit would you be able to check performance?

For performance checks use -d pipe -d opencl -d perf -d verbose log options (verbose to report demosaic tiling and modes) and watch out for lines like

   30.2377 [opencl_profiling] spent  0.6216 seconds in vng_lin_interpolate
   30.2378 [opencl_profiling] spent  0.6642 seconds in vng_interpolate

Compared with RCD, VNG shouldn't be that bad.
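A minimal sketch for comparing such runs, assuming the `[opencl_profiling]` line format shown above stays the same; the helper name `kernel_seconds` is made up for this example and is not part of darktable:

```python
import re
from collections import defaultdict

# Matches lines such as:
#   30.2377 [opencl_profiling] spent  0.6216 seconds in vng_lin_interpolate
PROFILE_RE = re.compile(
    r"\[opencl_profiling\]\s+spent\s+([0-9.]+)\s+seconds in\s+(\S+)"
)


def kernel_seconds(log_text):
    """Sum the reported seconds per kernel name over all matching log lines."""
    totals = defaultdict(float)
    for line in log_text.splitlines():
        match = PROFILE_RE.search(line)
        if match:
            totals[match.group(2)] += float(match.group(1))
    return dict(totals)


# The two sample lines quoted above.
sample = """\
   30.2377 [opencl_profiling] spent  0.6216 seconds in vng_lin_interpolate
   30.2378 [opencl_profiling] spent  0.6642 seconds in vng_interpolate
"""

for name, seconds in sorted(kernel_seconds(sample).items()):
    print(f"{name:28s} {seconds:.4f} s")
```

Running it on the log of a build with and without this PR gives two dictionaries that can be compared kernel by kernel.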

Release note: Fixed some performance regressions for OpenCL VNG demosaicers

@jenshannoschwalm added the "OpenCL" label (Related to darktable OpenCL code) on Jan 4, 2026
@agat114

agat114 commented Jan 4, 2026

I'm going to compile today.

@TurboGit
Member

TurboGit commented Jan 4, 2026

Without this PR:


    14.3559 [opencl_profiling] spent  0.0019 seconds in vng_border_interpolate
    14.3559 [opencl_profiling] spent  0.1546 seconds in vng_lin_interpolate
    14.3559 [opencl_profiling] spent  0.1483 seconds in vng_interpolate
    14.3559 [opencl_profiling] spent  0.0173 seconds in vng_green_equilibrate

With this PR:

   144.4473 [opencl_profiling] spent  0.0020 seconds in vng_border_interpolate
   144.4473 [opencl_profiling] spent  0.1455 seconds in vng_lin_interpolate
   144.4473 [opencl_profiling] spent  0.1350 seconds in vng_interpolate
   144.4473 [opencl_profiling] spent  0.0171 seconds in vng_green_equilibrate

So a bit faster, but nothing noticeable to the user, I would say. This is using an NVIDIA CUDA Quadro T1000.

@jenshannoschwalm
Collaborator Author

> So a bit faster, but nothing noticeable to the user, I would say. This is using an NVIDIA CUDA Quadro T1000.

Thanks also for the profiling data. I also checked here against the 5.2 branch and couldn't find a performance regression.
So release note: Slight performance gains for OpenCL VNG demosaicer.

@kofa73
Contributor

kofa73 commented Jan 4, 2026

I'll run some measurements soon. Should I use huge files that cause tiling, or smaller ones, without tiling?

@jenshannoschwalm
Collaborator Author

> I'll run some measurements soon. Should I use huge files that cause tiling, or smaller ones, without tiling?

Small ones will do. Please also compare against RCD if possible.

@agat114

agat114 commented Jan 4, 2026

@jenshannoschwalm Ignore the above. It seems I pulled the wrong repo and branch.

@agat114

agat114 commented Jan 4, 2026

@jenshannoschwalm I have compiled DT from your branch
~/compile_darktable/darktable git remote get-url origin
git@github.com:jenshannoschwalm/darktable.git
~/compile_darktable/darktable git branch --show-current
opencl_vng_demosaicer

The stutter is still there with VNG4 when turning it on and modifying module parameters. No stutter whatsoever with AMAZE.
Tested on a freshly imported file, generating a new XMP.
Please find the log and the XMP attached:
dt_opencl_vng_demosaicer_perf_pipe_opencl_verbose.txt
_DSC0010.nef.xmp.txt

UPD: the version of the compiled app is still weird, even though I pulled it from your repo and this PR branch:
/opt/darktable-test/bin/darktable --version
darktable 3.3.0+18191~g91629d8df3
This leaves me unsure whether the version has to be selected somehow when pulling.

@kofa73
Contributor

kofa73 commented Jan 4, 2026

I see no difference (well, no speed-up).

RCD
master:

  1167.3953 OpenCL:0 direct bayer demosaic RCD. tiles=1 tileheight=5520 overlap=0
  1167.5148 [dev_pixelpipe] took 0.120 secs (0.093 CPU) [export] processed `demosaic' on GPU, blended on GPU
  1167.8056 [opencl_profiling] spent  0.0050 seconds in rcd_border_green
  1167.8056 [opencl_profiling] spent  0.0069 seconds in rcd_border_redblue
  1167.8056 [opencl_profiling] spent  0.0072 seconds in rcd_populate
  1167.8056 [opencl_profiling] spent  0.0053 seconds in rcd_step_1_1
  1167.8056 [opencl_profiling] spent  0.0038 seconds in rcd_step_1_2
  1167.8056 [opencl_profiling] spent  0.0026 seconds in rcd_step_2_1
  1167.8056 [opencl_profiling] spent  0.0083 seconds in rcd_step_3_1
  1167.8056 [opencl_profiling] spent  0.0067 seconds in rcd_step_4_1
  1167.8056 [opencl_profiling] spent  0.0041 seconds in rcd_step_4_2
  1167.8056 [opencl_profiling] spent  0.0061 seconds in rcd_step_5_1
  1167.8056 [opencl_profiling] spent  0.0095 seconds in rcd_step_5_2
  1167.8056 [opencl_profiling] spent  0.0093 seconds in rcd_write_output

PR 5.5.0+58~g91629d8df3:

    19.3543 OpenCL:0 direct bayer demosaic RCD. tiles=1 tileheight=5520 overlap=0
    19.4708 [dev_pixelpipe] took 0.117 secs (0.083 CPU) [export] processed `demosaic' on GPU, blended on GPU
    19.7462 [opencl_profiling] spent  0.0043 seconds in rcd_border_green
    19.7462 [opencl_profiling] spent  0.0066 seconds in rcd_border_redblue
    19.7462 [opencl_profiling] spent  0.0073 seconds in rcd_populate
    19.7462 [opencl_profiling] spent  0.0060 seconds in rcd_step_1_1
    19.7462 [opencl_profiling] spent  0.0062 seconds in rcd_step_1_2
    19.7462 [opencl_profiling] spent  0.0035 seconds in rcd_step_2_1
    19.7462 [opencl_profiling] spent  0.0092 seconds in rcd_step_3_1
    19.7462 [opencl_profiling] spent  0.0034 seconds in rcd_step_4_1
    19.7462 [opencl_profiling] spent  0.0041 seconds in rcd_step_4_2
    19.7462 [opencl_profiling] spent  0.0056 seconds in rcd_step_5_1
    19.7462 [opencl_profiling] spent  0.0087 seconds in rcd_step_5_2
    19.7462 [opencl_profiling] spent  0.0093 seconds in rcd_write_output

VNG4
master:

  1191.8212 OpenCL:0 direct bayer demosaic VNG4. tiles=1 tileheight=5520 overlap=0
  1192.0855 [dev_pixelpipe] took 0.265 secs (0.279 CPU) [export] processed `demosaic' on GPU, blended on GPU
  1192.3631 [opencl_profiling] spent  0.0015 seconds in vng_border_interpolate
  1192.3631 [opencl_profiling] spent  0.0870 seconds in vng_lin_interpolate
  1192.3631 [opencl_profiling] spent  0.1536 seconds in vng_interpolate
  1192.3631 [opencl_profiling] spent  0.0113 seconds in vng_green_equilibrate

PR 5.5.0+58~g91629d8df3:

   106.0179 OpenCL:0 direct bayer demosaic VNG4. tiles=1 tileheight=5520 overlap=0
   106.3035 [dev_pixelpipe] took 0.286 secs (0.309 CPU) [export] processed `demosaic' on GPU, blended on GPU
   106.5816 [opencl_profiling] spent  0.0018 seconds in vng_border_interpolate
   106.5816 [opencl_profiling] spent  0.1056 seconds in vng_lin_interpolate
   106.5816 [opencl_profiling] spent  0.1555 seconds in vng_interpolate
   106.5816 [opencl_profiling] spent  0.0115 seconds in vng_green_equilibrate

Image: https://discuss.pixls.us/uploads/short-url/hqnf0h20vaTzsr5Ay0NmWsFsWu2.NEF with minimal processing, capture sharpen OFF.

Nvidia 1060/6GB, resources = large.

@jenshannoschwalm
Collaborator Author

Thanks for all that testing, we seem to have a very small benefit with this PR on small GPUs. The main culprit seems to be the VNG algo itself.

@agat114

agat114 commented Jan 4, 2026

> Thanks for all that testing, we seem to have a very small benefit with this PR on small GPUs. The main culprit seems to be the VNG algo itself.

No problem. It was interesting to help.

@kofa73
Contributor

kofa73 commented Jan 4, 2026

> Thanks for all that testing, we seem to have a very small benefit with this PR on small GPUs. The main culprit seems to be the VNG algo itself.

In my case, it actually became a bit slower, 286 vs 265 ms. Of course, it was a single test. Such small changes, either speed-ups or slow-downs, are not really perceptible. I could try with a huge file, but then the tiling would dominate over the algo, I think.
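To put a number on that, here is a small sketch that computes the relative change from the VNG4 figures in the logs above. The function name `relative_change` is only illustrative, and the inputs are single-run timings, so the result says nothing about run-to-run variation:

```python
def relative_change(before, after):
    """Return the change from `before` to `after` as a fraction of `before`."""
    return (after - before) / before


# VNG4 per-kernel seconds copied from the [opencl_profiling] lines above.
master = {
    "vng_border_interpolate": 0.0015,
    "vng_lin_interpolate": 0.0870,
    "vng_interpolate": 0.1536,
    "vng_green_equilibrate": 0.0113,
}
pr = {
    "vng_border_interpolate": 0.0018,
    "vng_lin_interpolate": 0.1056,
    "vng_interpolate": 0.1555,
    "vng_green_equilibrate": 0.0115,
}

# Whole-module times from the [dev_pixelpipe] lines (master vs PR).
print(f"demosaic module: {relative_change(0.265, 0.286):+.1%}")

for name in master:
    print(f"{name:24s} {relative_change(master[name], pr[name]):+.1%}")
```

On these numbers the module time rises by roughly 8%, which is small enough that several repeated runs would be needed before reading it as a real slow-down.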

Fallback to non-const buffer should only be reported in DT_DEBUG_VERBOSE mode.
1. subtle vng border_interpolate kernel improvements
   - #define AVGWINDOW
   - use samplerA where coordinates have been checked
2. in OpenCL VNG code don't calculate and copy to device buffers
   if not required as we do only the linear interpolation part.
3. use vectorized copy_zero
4. capture log fix
When dual demosaicing we use the only_linear VNG/VNG4 demosaicer mode.

For bayer sensors we must also do the green equilibration to avoid color casts.

@TurboGit
Member

TurboGit commented Jan 5, 2026

This introduces at least 2 regressions:

Test 0163-demosaic-rcd-vng4
      Image mire1.cr2
      CPU & GPU version differ by 28191 pixels
      CPU vs. GPU report :
      ----------------------------------
      Max dE                   : 9.62408
      Avg dE                   : 0.00697
      Std dE                   : 0.09082
      ----------------------------------
      Pixels below avg + 0 std : 98.99 %
      Pixels below avg + 1 std : 99.00 %
      Pixels below avg + 3 std : 99.13 %
      Pixels below avg + 6 std : 99.58 %
      Pixels below avg + 9 std : 99.78 %
      ----------------------------------
      Pixels above tolerance   : 0.03 %
 
      Expected CPU vs. current CPU report :
      ----------------------------------
      Max dE                   : 11.96889
      Avg dE                   : 0.46261
      Std dE                   : 0.42572
      ----------------------------------
      Pixels below avg + 0 std : 49.84 %
      Pixels below avg + 1 std : 87.90 %
      Pixels below avg + 3 std : 98.81 %
      Pixels below avg + 6 std : 99.95 %
      Pixels below avg + 9 std : 100.00 %
      ----------------------------------
      Pixels above tolerance   : 0.27 %
 
  FAILS: image visually changed
         see diff.png for visual difference
         (1.97302e+06 pixels changed)

The diff for 0163:

[image: diff for test 0163]

Test 0164-demosaic-amaze-vng4
      Image mire1.cr2
      CPU & GPU version differ by 16587 pixels
      CPU vs. GPU report :
      ----------------------------------
      Max dE                   : 1.18136
      Avg dE                   : 0.00275
      Std dE                   : 0.03894
      ----------------------------------
      Pixels below avg + 0 std : 99.41 %
      Pixels below avg + 1 std : 99.41 %
      Pixels below avg + 3 std : 99.42 %
      Pixels below avg + 6 std : 99.50 %
      Pixels below avg + 9 std : 99.60 %
      ----------------------------------
      Pixels above tolerance   : 0.00 %
 
      Expected CPU vs. current CPU report :
      ----------------------------------
      Max dE                   : 14.11847
      Avg dE                   : 0.42845
      Std dE                   : 0.40227
      ----------------------------------
      Pixels below avg + 0 std : 48.50 %
      Pixels below avg + 1 std : 86.90 %
      Pixels below avg + 3 std : 98.87 %
      Pixels below avg + 6 std : 99.96 %
      Pixels below avg + 9 std : 100.00 %
      ----------------------------------
      Pixels above tolerance   : 0.16 %
 
  FAILS: image visually changed
         see diff.png for visual difference
         (1.89394e+06 pixels changed)

The diff for 0164:

[image: diff for test 0164]

In both cases the Max dE (expected CPU vs. current CPU) is > 10 and the number of changed pixels is quite large.
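For readers unfamiliar with these reports, the sketch below shows one way such a summary could be derived from per-pixel dE values. The exact formulas used by the integration test suite are not shown in this thread, so the population standard deviation, the strict "below" comparison, and the name `de_report` are all assumptions made for illustration:

```python
from statistics import fmean, pstdev


def de_report(delta_e, ks=(0, 1, 3, 6, 9)):
    """Summarise per-pixel dE values: max, avg, std and share below avg + k*std."""
    avg = fmean(delta_e)
    std = pstdev(delta_e)  # assumption: population std, not sample std
    below = {
        k: 100.0 * sum(1 for d in delta_e if d < avg + k * std) / len(delta_e)
        for k in ks
    }
    return {"max": max(delta_e), "avg": avg, "std": std, "below": below}


# Made-up dE values, for illustration only: most pixels match, two do not.
values = [0.0] * 98 + [0.5, 9.5]

report = de_report(values)
print(f"Max dE : {report['max']:.5f}")
print(f"Avg dE : {report['avg']:.5f}")
print(f"Std dE : {report['std']:.5f}")
for k, percent in report["below"].items():
    print(f"Pixels below avg + {k} std : {percent:.2f} %")
```

The made-up data illustrates why a report can show a large Max dE next to a tiny Avg dE: a few strongly deviating pixels dominate the maximum while barely moving the average.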

@jenshannoschwalm : Can you look at this? TIA.

@jenshannoschwalm
Collaborator Author

Yes, I just did again and confess openly: I don't have the integration test suite working on my new system yet :-)

I am sure the new algo including green equilibration is correct and leads to better results. IIRC I mentioned the color cast while working on the 5.4 demosaicer. It's not very visible; especially on light-brown sands there was some color discrepancy between low- and high-frequency content, which is now gone. (X-Trans was not affected, as that uses VNG.)

How to proceed? If you want me to do that, I could add another demosaic version bump introducing a flag oldvng_linear to keep old edits exact, since people might have added some tricky hue correction for this, but personally I would always want the better color.

@TurboGit
Member

TurboGit commented Jan 6, 2026

@jenshannoschwalm : I had a look at the expected vs new output and I cannot see a difference visually. So let this be the new expected output. Thanks.

@TurboGit merged commit 7df59c6 into darktable-org:master on Jan 6, 2026
5 checks passed
@jenshannoschwalm deleted the opencl_vng_demosaicer branch on January 6, 2026 13:32
@jenshannoschwalm
Collaborator Author

Maybe a release note: Fixed subtle color casts in bayer dual demosaicers

@TurboGit
Member

TurboGit commented Jan 7, 2026

Seems like this has also affected 0172-capture-dual-rcd. Is that expected?

@jenshannoschwalm
Collaborator Author

Will check again

@jenshannoschwalm
Collaborator Author

Sorry, I misunderstood your question. Yes, of course! The VNG4 linear-only step for dual demosaicing was missing the green equilibration for bayer sensors. BTW, there is one more PR about VNG coming later.
