Switchboard

Switchboard (SB) is a framework for communication between distinct hardware models, such as RTL simulations, RTL implemented on FPGAs, and fast SW models. This makes it possible to simulate large hardware systems in a distributed fashion, using whatever models are available for the different components.

In such a simulation, each hardware model has one or more SB ports. Each is unidirectional: it may act as an input or an output, but not both. In addition, each SB connection is single-producer, single-consumer (SPSC): an output port may not drive more than one input port, and an input port may not be driven by more than one output port.

Here's an example of what a switchboard connection topology might look like:

The method for adding a switchboard port depends on the language that a HW model is implemented in. For RTL-based models, SB ports are instantiated as Verilog models, whereas for C++ and Python-based models, these ports are instantiated as objects. We provide both a low-level interface for moving data directly between SB ports, as well as a higher-level interface for running UMI transactions over SB connections.

Under the hood, communication happens through shared-memory queues, where an SB output port is driving packets into the queue, and an SB input port is reading from that queue. This standardization is what allows any two kinds of models to talk to each other. A shared-memory SPSC queue is an appealing common interface because it is one of the fastest interprocess communication techniques, with latencies on the order of hundreds of nanoseconds; no system calls are required to transmit and receive data. At the same time, this type of queue is straightforward to implement for FPGA platforms, with queue read and write operations only requiring a handful of memory transactions.

Switchboard also has mixed-signal features that make it easy to incorporate SPICE subcircuits into a system emulation; see the xyce example.

Installation

The fastest way to install this package is from PyPI:

pip install switchboard-hw

However, if you want to run the examples below (or if you're a switchboard developer), clone this repository and install the Python package in-place:

git clone https://github.com/zeroasiccorp/switchboard.git

cd switchboard

git submodule update --init

pip install --upgrade pip

pip install -e .

Examples

Various examples demonstrating the features of switchboard are in the examples folder. If you'd like to run them yourself, please run this command first:

pip install -r examples/requirements.txt

This clones some additional repositories that are needed by the examples.

A good starting point is the python example, where a Python script sends packets to and receives packets from a Verilator RTL simulation. The configuration is simple: there is a small RTL simulation that accepts an SB packet, increments the data payload, and transmits the result on its SB output port. On the other side, a Python script sends an SB packet to the simulation, and checks that the packet it gets back has been incremented.

To run this example, you'll need verilator (sudo apt install verilator for Ubuntu, brew install verilator for macOS). You can then run the example by changing directory to examples/python and then typing make. That should produce output similar to the following:

*** TX packet ***
dest: 123456789
last: 1
data: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31]

*** RX packet ***
dest: 123456789
last: 1
data: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32]

- ../verilog/testbench.sv:72: Verilog $finish
PASS!

To get a sense of how this works, open the Python script examples/python/test.py. The core logic is essentially:

from switchboard import PySbPacket, PySbTx, PySbRx

...

tx = PySbTx("to_rtl.q", fresh=True)
rx = PySbRx("from_rtl.q", fresh=True)

...

txp = PySbPacket(...)
tx.send(txp)

...

rxp = rx.recv()

In other words, we create an SB output port (tx) and an SB input port (rx). An SB packet is then created (txp) and sent via the output port. Finally, a new SB packet is received from the input port.

To get a sense of how switchboard is used in RTL, have a look at the Verilog part of this example in examples/python/testbench.sv. The core logic is the instantiation of queue_to_sb_sim (SB input port) and sb_to_queue_sim (SB output port), along with the initialization step to define the name of each SB connection. Notice that the Python output port is matched to the Verilog input port (to_rtl.q) and similarly the Python input port is matched to the Verilog output port (from_rtl.q).

`include "switchboard.vh"

...

`SB_WIRES(to_rtl, DW);
`QUEUE_TO_SB_SIM(to_rtl, DW, "to_rtl.q");

...

`SB_WIRES(from_rtl, DW);
`SB_TO_QUEUE_SIM(from_rtl, DW, "from_rtl.q");

Using the same name for two ports is what establishes a connection between them. You can use any name that you like for a SB connection, as long as it is a valid file name. The reason is that SB connections are visible as files on your file system. After this example runs, it will leave behind files called to_rtl.q and from_rtl.q. It's convenient to name SB connections in a way that is amenable to pattern matching, so that you can do things like rm *.q to clean up old connections.

We encourage you to explore the other examples, which demonstrate simulation with Icarus Verilog and switchboard's C++ library (minimal), bridging SB connections via TCP (tcp), and switchboard's UMI abstraction (umiram).

Build automation

We also provide build automation powered by SiliconCompiler that makes it easy to build RTL simulations with switchboard infrastructure (queue_to_sb_sim, sb_to_queue_sim, etc.). This is mainly important because Verilog DPI and VPI are used under the hood, requiring certain flags to be passed to the RTL simulator during the build. Using our build automation lets you focus on specifying RTL sources, without having to deal with these details.

As an example, we return to examples/python. The basic logic for a Verilator build is:

from switchboard import SbDut

dut = SbDut('name-of-top-level-module')

dut.input('path/to/file/1')
dut.input('path/to/file/2')
...

dut.build()

dut.simulate()

In other words, create an SbDut object, input() files, build() it to compile the Verilator simulator, and use simulate() to start the simulator. SbDut is a subclass of siliconcompiler.Chip, which allows you to invoke a range of features to control the simulator build, such as specifying include paths and `define macros. More information about siliconcompiler.Chip can be found here.

Packet format

An SB packet is a simple data structure with three parts, defined in switchboard/cpp/switchboard.hpp.

A 32-bit destination.
A 32-bit flags bit vector. Currently only bit "0" is used, providing the last flag.
A 416-bit data payload. This width was chosen to accommodate a UMI packet with a 256 bit payload, 64-bit source and destination addresses, and a 32-bit command. In the future, we may support parameterizable data widths for switchboard connections.

destination and flags control how the packet is routed. destination indicates the intended recipient of the packet as a flat, unsigned 32-bit integer. This provides a mechanism where a packet can be routed through multiple hops before reaching its final destination.

For example, consider using switchboard to build a simple topology in which packets can be sent from one HW block to one of two other blocks. One could indicate which block should receive the packet using the destination field, with a router transmitting the packet to the right one.

The last indicator (part of the flags bit vector) indicates whether there is more to come as part of a transaction. The rule is that a transmission cannot be interrupted as long as as last is zero. As an example, consider the system below, where Block A and Block B are both sending SB packets to the same port on Block C, using a router to multiplex between the two. Following the rule of unbroken transmissions, if the router starts sending a sequence of packets from Block A to Block C, it cannot switch to sending packets from Block B to Block C until it gets a packet from Block A that has last set to one. It is legal to have last=1 set in all packets, meaning that packets can be interspersed at any time.

The purpose of last is two-fold. For one, it simplifies the process of transmitting "burstable" protocols such as UMI through switchboard. It also provides opportunities for performance optimization. For example, if a long sequence of SB packets is being sent over TCP, the TCP bridge knows it can wait to fill up its transmission buffer as long as last=0. Without the last bit, the bridge would have to send each packet one at a time (or speculatively wait for more packets), since any given packet may be the last one.

UMI interface

In addition to supporting data movement directly through SB packets, we provide a higher-level interface for running UMI transactions over switchboard connections. The mechanisms for this can be seen in the examples/umi* examples. Here's a sketch of what UMI transactions look like, adapted from the definition of python_intf() in examples/umiram/test.py:

from switchboard import UmiTxRx

umi = UmiTxRx(from_client, to_client, fresh=True)

wrbuf = np.array([elem1, elem2, ...], dtype)
umi.write(wraddr, wrbuf)

rdbuf = umi.read(rdaddr, num, dtype)  # also a numpy array

We are no longer creating PySbTx and PySbRx objects, but rather a single UmiTxRx object with two SB ports: from_client, and to_client. Transactions are sent by the Python script through the from_client port, and responses are received back through the to_client port.

UMI write transactions are generated with the umi.write() method, which accepts an address and numpy array or scalar as arguments. This sends out one or more SUMI packets to implement the write request, packing the data, source address, destination address, and command into SB packets. Since an SB packet is 416 bits, and the two addresses + command take up 160 bits, each SB packet contains up to 256b data. Switchboard automatically splits up larger transactions into multiple SUMI packets as needed, incrementing the source and destination addresses automatically. Optional arguments to write() control where a ack'd or non-ack'd (posted) write is used and the maximum amount of data to send in a single SUMI packet. If an ack'd write is used, write() blocks until the response is received.

In a similar fashion, umi.read() reads a certain number of words from a given address. For example, umi.read(0x1234, 4, np.uint16) will send out a UMI read request with dstaddr=0x1234, LEN=3, SIZE=1 from the SB port from_client. When it gets the response to that query on to_client, it will return an array of 4 np.uint16 words to the Python script. A umi.atomic() method is also provided to generate UMI atomic transactions.

Sometimes it is convenient to work directly with SUMI packets, for example when testing a UMI FIFO or UMI router. For that situation, we provide send() and recv() methods for UmiTxRx, highlighted in examples/umi_fifo/test.py. In that exampe, we are sending SUMI packets into a UMI FIFO, and want to make sure that the sequence of packets read out of the FIFO is the same as the sequence of packets written in.

The main while loop is essentially:

txq = []

while ...:
    txp = random_umi_packet()
    if umi.send(txp, blocking=False):
        txq.append(txp)

    rxp = umi.recv(blocking=False)
    if rxp is not None:
        assert rxp == txq[0]
        txq.pop(0)

In other words, first try to write a random packet into the FIFO. If successful, add it to the back of a list of outstanding packets. Then, try to read a packet from the FIFO. If successful, make sure that the packet is equal to the oldest outstanding packet (since this is a first-in, first-out queue) and remove that outstanding packet from our records. Continue in a loop until a sufficient number of transactions have been checked.

This code example demonstrates several features:

send() and recv() for working with SUMI packets, represented using PyUmiPacket objects.
blocking=False for non-blocking transactions. send() returns True if successful and False otherwise; recv() returns a PyUmiPacket if successful, and None otherwise. A transaction might be unsuccessful if the underlying UMI FIFO is full or empty. For example, if we don't call umi.recv(), eventually the FIFO will fill, and subsequent send() invocations will fail (returning False). Similarly, if we keep calling umi.recv() without calling umi.send(), eventually the FIFO will be empty, and umi.recv() will fail (returning None).
The ability to generate random SUMI packets with random_umi_packet(). Various optional arguments can constrain the opcodes, addresses, and data.
PyUmiPacket objects can be compared using Python == and != operators. This checks if two packets have equal commands, addresses, and data.

Queue format

Under the hood, SB ports are implemented using shared memory queues. The data structure used is made simple enough that RTL running on FPGAs can directly read and write to these queues, without the need for bridge programs. In fact, if two FPGAs have access to the same memory space, they can communicate through a shared memory queue without any involvement from the host operating system, after the initial setup.

The layout of the queue is:

Bytes 0-3: head (int32)
Bytes 64-67: tail (int32)
Bytes 128-179: SB packet
Bytes 256-307: SB packet
Bytes 320-371: SB packet
...
Bytes 4,032-4,095: SB packet

To write an SB packet to the queue, compute next_head = head + 1. If next_head equals 62 (the end of the queue), then set next_head to 0. If next_head equals tail, then the write fails - the queue is full. Otherwise, write the SB packet to address 128 + (64 * head), and then set head to next_head.

Reading an SB packet works in a similar fashion. If tail equals head, the read fails - the queue is empty. Otherwise, read the SB packet from address 128 + (64 * tail), and then increment tail. If tail equals 62 (the end of the queue), then set tail to 0.

The queue implementation in C is in switchboard/cpp/spsc_queue.h, with care taken to avoid memory ordering hazards, and various cache-oriented optimizations. The queue implementation in Verilog (intended for FPGA-based emulation) can be found in switchboard/verilog/fpga/sb_rx_fpga.sv and switchboard/verilog/fpga/sb_tx_fpga.sv.