BIO: The Bao I/O Coprocessor

BIO is the I/O co-processor in the Baochip-1x, a mostly open source 22nm SoC I helped design. You can read more about the Baochip-1x’s background here, or pick up an evaluation board at Crowd Supply.

In this post, I’ll cover the origins of the BIO, beginning with a detailed study of the Raspberry Pi PIO as a reference point, before diving into the BIO’s architecture. I’ll then work through three BIO programming examples, two in assembly and one in C. If you’re only interested in how to use the BIO, you can skip the background and jump about halfway down the post to the section titled “Design of the BIO”, or go straight to the code examples.

Background

I/O co-processors off-load I/O tasks from main CPU cores. A main CPU has to juggle multiple priorities using some form of multi-tasking, which makes its response times unpredictable. That unpredictability shows up as jitter or delay in time-critical responses. Dedicating a co-processor to an I/O task achieves determinism approaching that of a dedicated hardware state machine while retaining the flexibility of a general-purpose CPU.

A well-known example of an I/O co-processor is the Raspberry Pi’s PIO. It consists of a set of four “processors”, each with nine instructions, with an instruction memory of 32 locations, and it is highly tuned to provide great flexibility with easy, cycle-accurate manipulation of GPIOs. For example, a SPI implementation with clock, in, and out consists of a configuration modifier plus just two instructions. These execute as an “effective loop” thanks to configurable side effects available in the PIO configuration, such as automatic code wrap-around and FIFO management:

".side_set 1",
"out pins, 1 side 0 [1]",
"in pins, 1 side 1 [1]",

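To make the timing of that two-instruction loop concrete, here is a rough cycle-level model in Python. It is only a sketch under stated assumptions, not a PIO emulator: the function name pio_spi_model and its parameters are made up for illustration, data is assumed to shift MSB-first, the output register is preloaded instead of modeling autopull and the FIFOs, and pin synchronizer latency is ignored. Each instruction takes one cycle plus the one-cycle delay given by [1], so one bit costs four cycles.

```python
def pio_spi_model(tx_byte, miso_bits=None, nbits=8):
    """Simplified model of the two-instruction SPI loop (hypothetical helper).

    Returns (received_value, trace), where trace holds one (sck, mosi)
    tuple per clock cycle. If miso_bits is None, MOSI is looped back to MISO.
    """
    osr = tx_byte & ((1 << nbits) - 1)  # output shift register, preloaded (stands in for autopull)
    isr = 0                             # input shift register
    trace = []
    for i in range(nbits):
        # "out pins, 1 side 0 [1]": drive the next bit (MSB-first assumed),
        # side-set pulls the clock low, then one extra delay cycle.
        mosi = (osr >> (nbits - 1 - i)) & 1
        trace.extend([(0, mosi)] * 2)

        # "in pins, 1 side 1 [1]": sample the input pin, side-set raises
        # the clock, then one extra delay cycle.
        sampled = mosi if miso_bits is None else miso_bits[i]
        isr = (isr << 1) | sampled
        trace.extend([(1, mosi)] * 2)
        # Wrap-around back to the first instruction costs no cycles.
    return isr, trace


if __name__ == "__main__":
    rx, trace = pio_spi_model(0xA5)
    print(f"received 0x{rx:02X} in {len(trace)} cycles")
    for cycle, (sck, mosi) in enumerate(trace[:8]):
        print(f"cycle {cycle}: sck={sck} mosi={mosi}")
```

In this model the data pin changes while the clock is low and the input is sampled as the clock goes high, with no loop-control instructions at all. That is the “effective loop” at work: the wrap-around and the shift-register refills come from configuration rather than from code.
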
I wanted some form of I/O co-processor in Baochip, so I studied the PIO the best way I knew how: by copying it. I forked Lawrie Griffith’s fpga_pio as a starting point, then did a lot of regression testing and detailed simulation to fill in the missing corner cases. You can find what I think is fairly close to a fully spec-compliant RP2040-generation PIO core in this GitHub repo.

Lessons Learned from the PIO

After building a PIO clone and compiling it for an FPGA, I found that the PIO consumes a surprisingly large amount of resources. If you’re thinking about using it in an FPGA, you’d be better off skipping the PIO and just implementing whatever peripherals you want directly in RTL.

Above is a hierarchical resource map of the placed & routed PIO core targeting a XC7A100 FPGA. I’ve highlighted the portion occupied by the PIO in magenta. It uses up more than half the FPGA, even more than the RISC-V CPU core (the “VexRiscAxi4” block on the right)! Despite only being able to run nine instructions, each PIO core consumes about 5,000 logic cells. Compare this to the VexRiscv CPU, which, if you don’t count the I-cache and D-cache, consumes only 4,600 logic cells.
