Parallelizing Linear Feedback Systems with Matrices

Input-driven linear feedback shift registers are a common pattern in digital design. They are usually implemented as serial circuits, processing one input symbol per clock. In many applications, this throughput is not enough. Fortunately, these feedback systems are linear. Once written as a matrix recurrence, the serial update can be raised to an arbitrary power, producing a parallel implementation directly from the polynomial description.

Although LFSRs are often presented as autonomous systems, as in PRBS generators, the same linear-feedback structure can be driven by input symbols, as in CRC and Reed-Solomon encoders.

Matrix representation

In a serial input-driven LFSR, the state is updated once per cycle as

$$ \mathbf{s}_{t+1} = f(\mathbf{s}_t, x_t) $$

where $\mathbf{s}_t$ is a vector representing the system state at time $t$ and $x_t$ is a scalar input symbol. Since $f$ is linear over the underlying field, we can also express the equation above as

$$ \mathbf{s}_{t+1} = \mathbf{A} \mathbf{s}_t + \mathbf{B} x_t $$

where we define $\mathbf{A}$ as the state transition matrix and $\mathbf{B}$ as the input injection matrix.

For example, consider a binary Fibonacci-style LFSR with $n$ state bits:

$$ \mathbf{s}_t = \begin{bmatrix} s_{t,0} \\ s_{t,1} \\ \vdots \\ s_{t,n-1} \end{bmatrix}. $$

If the input is injected into the first stage, the state update has the form

$$ \mathbf{s}_{t+1} = \begin{bmatrix} f_t + x_t \\ s_{t,0} \\ \vdots \\ s_{t,n-2} \end{bmatrix}, $$

where $f_t$ is the feedback value computed from the tapped state bits. To avoid getting buried in indices, let's use a concrete example. Suppose we have a 4-bit state with feedback polynomial $P(x) = x^4 + x^3 + 1$. With our state ordering, this gives $f_t = s_{t,0} + s_{t,3}$.

LFSR block diagram

With the state ordering above, the state transition matrix is

$$ \mathbf{A} = \begin{bmatrix} 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}. $$

The lower rows perform the shift, while the first row computes the feedback from the polynomial taps. Since the input is injected into the first stage, the input injection matrix is simply

$$ \mathbf{B} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}. $$

Processing multiple inputs per cycle

After two steps we have

$$ \mathbf{s}_{t+2} = \mathbf{A} \mathbf{s}_{t+1} + \mathbf{B} x_{t+1} $$

Substituting the expression for $\mathbf{s}_{t+1}$ gives

$$ \begin{align} \mathbf{s}_{t+2} &= \mathbf{A} (\mathbf{A} \mathbf{s}_{t} + \mathbf{B} x_{t}) + \mathbf{B} x_{t+1} \\ \mathbf{s}_{t+2} &= \mathbf{A}^2 \mathbf{s}_{t} + \mathbf{A}\mathbf{B} x_{t} + \mathbf{B} x_{t+1} \end{align} $$

Repeating this substitution for $k$ steps gives

$$ \mathbf{s}_{t+k} = \mathbf{A}^k \mathbf{s}_t + \sum_{i=0}^{k-1} \mathbf{A}^{k-1-i}\mathbf{B}x_{t+i}. $$

So the next state can be computed directly from the current state and $k$ input symbols, without evaluating the intermediate states.

Define the parallel input vector as

$$ \mathbf{x} = \begin{bmatrix} x_{t} \\ x_{t+1} \\ \vdots \\ x_{t+k-1} \end{bmatrix}. $$

Then the resulting state after $k$ inputs can be written as

$$ \mathbf{s}_{t+k} = \mathbf{C}\mathbf{s}_t + \mathbf{D}\mathbf{x} $$

where

$$ \mathbf{C} = \mathbf{A}^k $$

and

$$ \mathbf{D} = \begin{bmatrix} \mathbf{A}^{k-1}\mathbf{B} & \mathbf{A}^{k-2}\mathbf{B} & \dots & \mathbf{A}\mathbf{B} & \mathbf{B}\end{bmatrix} $$

Hardware interpretation

So, how does this translate into hardware? For an LFSR over $\mathrm{GF}(2)$, scalar multiplication is just an AND and addition is just an XOR. Better yet, for a fixed feedback polynomial and parallelization factor $k$, both $\mathbf{C}$ and $\mathbf{D}$ are constants that can be computed ahead of time. In other words, the hardware does not need to perform any matrix multiplication or exponentiation at runtime: it all reduces to a combinational XOR network.

Going back to our 4-bit LFSR, suppose we want to process four input bits per cycle, so $k=4$. Computing $\mathbf{C} = \mathbf{A}^4$ and the corresponding $\mathbf{D}$ matrix gives

$$ \mathbf{C} = \begin{bmatrix} 0 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 0 & 0 & 1 \end{bmatrix} $$

and

$$ \mathbf{D} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix}. $$

Expanding $\mathbf{s}_{t+4} = \mathbf{C}\mathbf{s}_t + \mathbf{D}\mathbf{x}$ makes the resulting logic clear:

$$ \begin{align} s_{t+4,0} &= s_{t,1} \oplus s_{t,2} \oplus s_{t,3} \oplus x_t \oplus x_{t+1} \oplus x_{t+2} \oplus x_{t+3} \\ s_{t+4,1} &= s_{t,0} \oplus s_{t,1} \oplus s_{t,2} \oplus s_{t,3} \oplus x_t \oplus x_{t+1} \oplus x_{t+2} \\ s_{t+4,2} &= s_{t,0} \oplus s_{t,2} \oplus s_{t,3} \oplus x_t \oplus x_{t+1} \\ s_{t+4,3} &= s_{t,0} \oplus s_{t,3} \oplus x_t. \end{align} $$

Each next-state bit is now just the XOR of a fixed set of state and input bits, so all four can be computed in parallel during one clock cycle. The diagram below shows the resulting parallel architecture, drawing each bit as a separate XOR network. In the synthesized circuit, common terms may be shared or duplicated depending on the implementation constraints.

Parallel LFSR block diagram

The same approach also works for symbol-based feedback systems such as Reed-Solomon encoders. In a follow-up, we'll apply it over $\mathrm{GF}(2^m)$, generate the resulting hardware with Chisel, and look at how the structure can be optimized for timing.