Input-driven linear feedback shift registers are a common pattern in digital design. They are usually implemented as serial circuits, processing one input symbol per clock. In many applications, this throughput is not enough. Fortunately, these feedback systems are linear. Once written as a matrix recurrence, the serial update can be raised to an arbitrary power, producing a parallel implementation directly from the polynomial description.
Although LFSRs are often presented as autonomous systems, as in PRBS generators, the same linear-feedback structure can be driven by input symbols, as in CRC and Reed-Solomon encoders.
Matrix representation
In a serial input-driven LFSR, the state is updated once per cycle as
where \(\mathbf{s}_t\) is a vector representing the system state at time \(t\) and \(x_t\) is a scalar input symbol. Since \(f\) is linear over the underlying field, we can also express the equation above as
where we define \(\mathbf{A}\) as the state transition matrix and \(\mathbf{B}\) as the input injection matrix.
For example, consider a binary Fibonacci-style LFSR with \(n\) state bits:
If the input is injected into the first stage, the state update has the form
where \(f_t\) is the feedback value computed from the tapped state bits. To avoid getting buried in indices, let's use a concrete example. Suppose we have a 4-bit state with feedback polynomial \(P(x) = x^4 + x^3 + 1\). With our state ordering, this gives \(f_t = s_{t,0} + s_{t,3}\).
With the state ordering above, the state transition matrix is
The lower rows perform the shift, while the first row computes the feedback from the polynomial taps. Since the input is injected into the first stage, the input injection matrix is simply
Processing multiple inputs per cycle
After two steps we have
Substituting the expression for \(\mathbf{s}_{t+1}\) gives
Repeating this substitution for \(k\) steps gives
So the next state can be computed directly from the current state and \(k\) input symbols, without evaluating the intermediate states.
Define the parallel input vector as
Then the resulting state after \(k\) inputs can be written as
where
and
Hardware interpretation
So, how does this translate into hardware? For an LFSR over \(\mathrm{GF}(2)\), scalar multiplication is just an AND and addition is just an XOR. Better yet, for a fixed feedback polynomial and parallelization factor \(k\), both \(\mathbf{C}\) and \(\mathbf{D}\) are constants that can be computed ahead of time. In other words, the hardware does not need to perform any matrix multiplication or exponentiation at runtime: it all reduces to a combinational XOR network.
Going back to our 4-bit LFSR, suppose we want to process four input bits per cycle, so \(k=4\). Computing \(\mathbf{C} = \mathbf{A}^4\) and the corresponding \(\mathbf{D}\) matrix gives
and
Expanding \(\mathbf{s}_{t+4} = \mathbf{C}\mathbf{s}_t + \mathbf{D}\mathbf{x}\) makes the resulting logic clear:
Each next-state bit is now just the XOR of a fixed set of state and input bits, so all four can be computed in parallel during one clock cycle. The diagram below shows the resulting parallel architecture, drawing each bit as a separate XOR network. In the synthesized circuit, common terms may be shared or duplicated depending on the implementation constraints.
The same approach also works for symbol-based feedback systems such as Reed-Solomon encoders. In a follow-up, we'll apply it over \(\mathrm{GF}(2^m)\), generate the resulting hardware with Chisel, and look at how the structure can be optimized for timing.