CUJ2K is an implementation of JPEG 2000 in CUDA. It makes use of the highly parallel architecture of graphics cards.
Our aim was a high-performance implementation of the basic features of JPEG 2000. Some parts, such as pre-processing and the wavelet transform, achieve a very high speedup compared to the CPU implementation (for the irreversible DWT of an 18.8 MB picture: 280 times faster).
Because of its serial structure, such a high performance gain was not possible for the tier 1 algorithm (the entropy coding), which takes about 90% of the computation time in the current implementation. The other parts (file reading, tier 2, which organizes the final data stream, and file writing) are still performed on the CPU, because it does not make sense to run them on the graphics device.
Streaming scheme of CUJ2K (only the beginning of the encoding is shown)
CUJ2K takes advantage of streaming. This is a CUDA feature that helps improve performance by overlapping GPU kernels with both CPU jobs (such as file reading, file writing and tier 2) and copies between device (GPU) and host (CPU) memory. We implemented a three-stage pipeline, which means that three pictures are processed concurrently, each of them in a different encoding step. This way, the time needed for reading and writing files is completely hidden, and only the GPU kernels are performance-critical.
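The schedule of such a three-stage pipeline can be sketched in a few lines of host-only C++. This is not the CUJ2K code and contains no CUDA calls. The stage names (`load`, `gpu`, `store`) and the helper functions are assumptions made for illustration, since the description above only states that there are three stages.

```cpp
#include <string>
#include <vector>

// Assumed stage names: "load" = read file and copy to the device,
// "gpu" = run the GPU kernels, "store" = copy back, tier 2, write file.
const char* const kStages[3] = {"load", "gpu", "store"};

// For a given time step, returns the picture index handled by each stage,
// or -1 if that stage is idle (while the pipeline fills or drains).
std::vector<int> picturesAtStep(int step, int numPictures) {
    std::vector<int> active(3, -1);
    for (int stage = 0; stage < 3; ++stage) {
        const int picture = step - stage;  // picture p is in stage s at step p + s
        if (picture >= 0 && picture < numPictures) {
            active[stage] = picture;
        }
    }
    return active;
}

// Number of time steps needed to push all pictures through the three stages.
int totalSteps(int numPictures) {
    return numPictures == 0 ? 0 : numPictures + 2;
}

// Human-readable description of one time step.
std::string describeStep(int step, int numPictures) {
    const std::vector<int> active = picturesAtStep(step, numPictures);
    std::string line = "step " + std::to_string(step) + ":";
    for (int stage = 0; stage < 3; ++stage) {
        line += std::string(" ") + kStages[stage] + "=";
        line += active[stage] < 0 ? std::string("idle")
                                  : "picture " + std::to_string(active[stage]);
    }
    return line;
}
```

With five pictures, `describeStep(2, 5)` describes the first step in which all three stages are busy: picture 2 is being loaded, picture 1 is on the GPU and picture 0 is being stored. In steady state the time per step is set by the slowest stage, which is why only the GPU kernels remain performance-critical once file I/O takes less time than they do.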
For more detailed information, please read our documentation.