FPGA JPEG Encoder

April 5, 2026 · Code available on GitHub

This post outlines a project where I built a JPEG encoder on an FPGA; the image you see above was encoded on it. It briefly covers how JPEG encoding maps onto an FPGA (from a software engineer's perspective :) ). For a great explanation of how JPEG encoding itself works, I'd recommend this article.

The setup: a Lattice CrossLink-NX FPGA connected to a Raspberry Pi over SPI. The Pi sends raw 1920x1080 RGB pixels, the FPGA compresses them through a full JPEG pipeline, and sends back valid JPEG data.

The Pipeline

The full encoding pipeline is 10 modules connected via valid/ready handshakes.

Stage by Stage

This section runs through the interesting modules from the pipeline.

Block reordering

JPEG operates on 8x8 blocks, but pixels arrive in raster order. In software, you'd just index into a 2D array. On an FPGA, you have to build the memory yourself and decide when to write and when to read.

Filling row buffer (raster order)

The row buffer stores 8 full rows (1920 x 8 = 15,360 pixels) in a dual-port RAM block. A state machine controls the lifecycle:

S_FILL - accept bytes, assemble RGB triplets, write to RAM.
S_ADDR - compute the RAM read address, mapping block-order coordinates back to raster-order.
S_LATCH - wait one cycle for the RAM read data to settle. RAM output isn't available until the cycle after you present the address.
S_READ - latch the pixel and offer it downstream.
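The address mapping in S_ADDR can be sketched in Python. This is only a software model, not the actual Verilog: it assumes the buffer stores pixel (row, col) of the 8-row strip at address row * 1920 + col, and the function names are made up for illustration.

```python
WIDTH = 1920   # pixels per row (from the post)
BLOCK = 8      # JPEG block edge length

def read_address(block_idx: int, y: int, x: int, width: int = WIDTH) -> int:
    """Raster-order RAM address of pixel (x, y) inside the block_idx-th 8x8 block.

    Assumes pixel (row, col) of the strip was written to address row * width + col.
    """
    return y * width + block_idx * BLOCK + x

def block_order_addresses(width: int = WIDTH):
    """Yield the read addresses for one 8-row strip, one 8x8 block at a time."""
    for block_idx in range(width // BLOCK):
        for y in range(BLOCK):
            for x in range(BLOCK):
                yield read_address(block_idx, y, x, width)
```

Every address in the 15,360-entry buffer is visited exactly once per strip, just in a different order than it was written. In hardware the multiply would more likely be a running counter that adds the row width, but that is a guess about the implementation.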

DCT

The hardest part about the DCT on an FPGA: no floating-point unit. Everything has to be fixed-point integer math. Instead of computing the 2D DCT directly, we separate it into two 1D DCT passes (rows, then columns). Each output of a 1D DCT is a dot product of 8 input values with 8 cosine coefficients.

Fixed-point multiplication

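The DCT needs cosine coefficients like cos(π/16) = 0.9808. On an FPGA there are no floats, so we scale by 1024 and round to get a 12-bit signed integer: round(0.9808 × 1024) = 1004. This is precomputed once and stored in ROM. This is how we multiply 0.9808 × 52 on an FPGA: the integer product 1004 × 52 = 52,208 still carries the 1024 scale, so shifting right by 10 bits gives about 50.98, close to the exact 51.00.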

Fixed-point multiply: 0.9808 × 52

All 64 cosine coefficients for the 8×8 DCT are precomputed this way and stored in a single 64-entry ROM of 12-bit signed values. Both the row and column passes use the same ROM since the DCT is separable.
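To make the fixed-point arithmetic concrete, here is a small Python sketch of the same 0.9808 × 52 multiply. The helper names are hypothetical, and whether the hardware rounds or truncates when it drops the scale factor is not stated in this post, so both variants are shown.

```python
import math

FRAC_BITS = 10              # scale factor 1024 = 2**10 (from the post)
SCALE = 1 << FRAC_BITS

def to_fixed(value: float) -> int:
    """Convert a real coefficient to a scaled integer, as stored in ROM."""
    return round(value * SCALE)

def fixed_mul_truncate(coeff_fixed: int, sample: int) -> int:
    """Integer multiply, then drop the scale by shifting (rounds toward -infinity)."""
    return (coeff_fixed * sample) >> FRAC_BITS

def fixed_mul_round(coeff_fixed: int, sample: int) -> int:
    """Same, but add half an LSB before the shift to round to nearest."""
    return (coeff_fixed * sample + (SCALE >> 1)) >> FRAC_BITS

c = to_fixed(math.cos(math.pi / 16))
print(c, c * 52)                  # 1004 52208
print(fixed_mul_truncate(c, 52))  # 50
print(fixed_mul_round(c, 52))     # 51
```

The exact answer is about 51.0, so the rounding variant lands on it while plain truncation is off by one. A common way to limit this kind of error is to keep the extra bits in the accumulator and remove the scale only once at the end of the dot product.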

8 MACs in parallel

The building block is a multiply-accumulate (MAC) unit: a multiplier feeding into an adder with a 32-bit accumulator register that loops back. Each cycle, one product is added to the running total. With 8 MAC units running simultaneously, all 8 output coefficients for a row are computed at once. Each cycle, one input pixel is broadcast to all 8 MACs, each multiplying by its own ROM coefficient. After 8 cycles the row is done.

The clock waveform below shows the timing: every rising edge, all 8 MACs fire.

Figure: 8 parallel MACs computing one row. Signals shown: CLK, input pixels (16-bit signed), ROM coefficients (12-bit signed), accumulators (32-bit).
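The 8-MAC arrangement can be modelled in a few lines of Python. This is a behavioural sketch under assumptions: the ROM layout ROM[u][x] = round(cos((2x+1)·u·π/16) × 1024) is my guess at how the table is indexed, and the DCT normalisation factors are left out, so the outputs are unnormalised and still scaled by 1024.

```python
import math

SCALE = 1024  # coefficient scale from the post

# Hypothetical ROM layout: one row of 8 coefficients per output frequency u.
ROM = [[round(math.cos((2 * x + 1) * u * math.pi / 16) * SCALE) for x in range(8)]
       for u in range(8)]

def dct_row(samples):
    """Model 8 MAC units fed one input sample per clock cycle."""
    acc = [0] * 8                              # one accumulator per MAC
    for cycle, sample in enumerate(samples):   # 8 cycles, one sample each
        for u in range(8):                     # in hardware these 8 updates happen at once
            acc[u] += ROM[u][cycle] * sample   # multiply-accumulate
    return acc                                 # unnormalised, scaled by 1024

print(dct_row([1] * 8))   # [8192, 0, 0, 0, 0, 0, 0, 0]
```

A constant input puts all the energy in the first (DC) output, which is a quick sanity check for the coefficient table. The inner loop over u is what the hardware flattens into 8 physical multipliers; only the outer loop costs clock cycles.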

Quantization

Quantization divides each DCT coefficient by a value from a table and rounds the result. Small high-frequency coefficients round to zero. This is the lossy step, and it's where JPEG actually compresses.

Division is expensive in hardware. The trick: multiply by the reciprocal instead. For each table entry Q[i], precompute round(65536 / Q[i]) and store it in a ROM. Then:

quantized = (|coefficient| * reciprocal + 32768) >> 16
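Here is the reciprocal trick as a short Python sketch. The function names are hypothetical, and reapplying the sign after the shift is my assumption, implied by the absolute value in the formula.

```python
def make_reciprocal(q: int) -> int:
    """Precompute round(65536 / q) for one quantization table entry."""
    return round(65536 / q)

def quantize(coefficient: int, reciprocal: int) -> int:
    """Divide by multiplying with a 16-bit-scaled reciprocal, rounding to nearest."""
    magnitude = (abs(coefficient) * reciprocal + 32768) >> 16
    # Assumption: the sign is restored after working on the magnitude.
    return -magnitude if coefficient < 0 else magnitude

r = make_reciprocal(16)
print(r)                  # 4096
print(quantize(100, r))   # 6   (100 / 16 = 6.25)
print(quantize(-100, r))  # -6
print(quantize(7, r))     # 0   (small coefficients vanish)
```

The + 32768 is half of 65536, so it turns the truncating shift into round-to-nearest. Because the stored reciprocal is itself rounded, the result is an approximation of true division; for table entries that are not powers of two it could in principle differ from exact rounding on very large coefficients.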

Huffman encoding

Huffman encoding produces variable-length codes, so you're no longer working with byte-aligned data.

The encoder maintains a 40-bit shift register as a bit buffer with a 6-bit counter tracking valid bits. Huffman codes (from ROM) and raw coefficient bits are shifted in. When 8+ bits accumulate, a byte is emitted.

The state machine walks through each block's channels sequentially: Y DC, Y AC, Cb DC, Cb AC, Cr DC, Cr AC. For AC coefficients, it tracks runs of zeros for run-length encoding. The bit buffer persists across blocks since the JPEG bitstream is continuous.
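The bit buffer is easier to picture as code. Below is a minimal Python model, not the real design: the class and method names are invented, Python integers are unbounded so the 40-bit width is not modelled, and the loop drains all complete bytes at once where hardware would more likely emit one per cycle. It also includes JPEG's standard byte stuffing (a 0x00 after every 0xFF in the entropy-coded data), which this post does not describe but a valid JPEG stream requires.

```python
class BitPacker:
    """Software model of a shift-register bit buffer with a valid-bit counter."""

    def __init__(self):
        self.buffer = 0          # pending bits, newest at the LSB end
        self.count = 0           # number of valid bits in buffer
        self.out = bytearray()   # emitted bytes

    def push(self, code: int, length: int) -> None:
        """Shift in `length` bits of `code`, then emit every complete byte."""
        self.buffer = (self.buffer << length) | (code & ((1 << length) - 1))
        self.count += length
        while self.count >= 8:
            self.count -= 8
            byte = (self.buffer >> self.count) & 0xFF
            self.out.append(byte)
            if byte == 0xFF:
                self.out.append(0x00)   # byte stuffing, per the JPEG standard
        self.buffer &= (1 << self.count) - 1   # keep only the leftover bits

packer = BitPacker()
packer.push(0b101, 3)      # 3 bits: nothing emitted yet
packer.push(0b11111, 5)    # now 8 bits: emits 0b10111111
print(packer.out.hex())    # bf
```

The leftover bits stay in the buffer between calls, which is the software equivalent of the bit buffer persisting across blocks.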

Backpressure

Every module implements the same interface: in_valid, in_ready, out_valid, out_ready. A transfer happens when both valid and ready are high on the same clock edge.

A slow downstream module automatically stalls everything upstream. Each module's ready signal is one line of Verilog:

assign in_ready = out_ready || !out_valid;

Flow control that would require threads, queues, or async/await in software is just a wire in hardware.
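For software readers, here is a cycle-by-cycle Python model of one pipeline stage built around that ready expression. It is a sketch under assumptions: a stage holding a single output register is my simplification, and the Stage class is hypothetical rather than taken from the project.

```python
class Stage:
    """One pipeline register with valid/ready handshakes on both sides."""

    def __init__(self):
        self.out_valid = False
        self.out_data = None

    def in_ready(self, out_ready: bool) -> bool:
        # Mirrors: assign in_ready = out_ready || !out_valid;
        return out_ready or not self.out_valid

    def clock(self, in_valid: bool, in_data, out_ready: bool):
        """Advance one clock edge; return the word handed downstream, or None."""
        ready = self.in_ready(out_ready)   # combinational, sampled before the edge
        sent = self.out_data if (self.out_valid and out_ready) else None
        if ready:                          # slot is empty or being drained this cycle
            self.out_valid = in_valid
            self.out_data = in_data if in_valid else None
        return sent                        # otherwise the stage holds its word

stage = Stage()
stage.clock(True, "a", out_ready=False)          # "a" is accepted into the empty slot
print(stage.in_ready(False))                     # False: full and downstream stalled
print(stage.clock(True, "b", out_ready=True))    # a  (and "b" is accepted in the same cycle)
```

While downstream is stalled the stage refuses new input and holds its word; as soon as downstream is ready, it hands one word on and accepts the next in the same cycle.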

Results

The encoder produces valid JPEG files, pixel-identical to a software encoder using the same pipeline. The image above was encoded entirely on the FPGA.

Image resolution	1920 x 1080
Input size	6.2 MB (raw RGB)
Output size	321 KB (JPEG)
Compression ratio	19.4:1
FPGA clock	~19 MHz
DCT cycles per block	1,152
Total blocks (1080p)	32,400

Next?

AV1...?