Cedar Version 1
Dataplane · Situation
When I started working for the Oregon Network Research Group, we had an initial controller, built by Chris Misa, that interfaced with Broadcom’s SDK for the Broadscan dataplane telemetry system. My onboarding task was to adapt IPFIXCOL2 to parse the IPFIX packets Broadscan was generating and spit them out as JSON blobs for the controller to consume. With that done, we had the earliest working version of the system: a controller that allowed an operator to submit queries over HTTP in a custom query language called STQL, along with an ID for each query. Results would then come back from IPFIXCOL2 tagged with the ID of the query that generated them. The general structure of operation was:
- Submit a query to the controller which specifies which fields to filter on, which fields to aggregate on, which fields to reduce on and what reduction function to apply, and how long to run the query for.
- Allow this query to run on the switch.
- The switch hardware emits IPFIX packets, which are collected by IPFIXCOL2 and re-emitted as JSON, which is sent to the controller.
- The controller collects all the JSON lines coming in and re-emits them as a blob to the address that submitted the initial query.

This was a good foundation, but it only went about halfway to being useful: you need some program that submits queries and does something interesting with the results. The natural starting point for our research was to try to emulate the Sonata query set, which we could have hard-coded as independent programs, but that seemed like a good way to generate a lot of duplicate effort. Luckily, we had an initial version of what our advisor termed “the flowchart program”, where you would specify a flowchart of STQL queries, result parsers, and decision branches in an XML file and run it. We hit the limits of this system very quickly when trying to implement anything nontrivial, as we had to code up new flowchart nodes whenever we wanted to do something extremely complex and abstract, like average two values. Furthermore, adding new flowchart nodes was challenging, since each one had to be compiled and linked as a dynamic library, and we did not have a good interface for actually creating flowcharts.
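The result-collection step above can be sketched in a few lines. This is only an illustration of the shape of the data, not the controller’s actual code: the field names besides the query ID are hypothetical, since the post doesn’t show the STQL or IPFIX schema.

```typescript
// Hypothetical sketch of the controller's collection step: each JSON line
// coming out of IPFIXCOL2 is tagged with the ID of the query that produced
// it, and the controller groups lines into one blob per query ID.
// Field names other than queryId are illustrative.

interface ResultLine {
  queryId: string;            // ID supplied when the query was submitted
  [field: string]: unknown;   // parsed IPFIX fields (bytes, addresses, ...)
}

// Group incoming JSON lines by query ID so each submitter gets one blob.
function collectResults(lines: string[]): Map<string, ResultLine[]> {
  const blobs = new Map<string, ResultLine[]>();
  for (const line of lines) {
    const parsed = JSON.parse(line) as ResultLine;
    const bucket = blobs.get(parsed.queryId) ?? [];
    bucket.push(parsed);
    blobs.set(parsed.queryId, bucket);
  }
  return blobs;
}
```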
In retrospect this approach was a bit of overkill: we were basically a C-only group at this point, but C was absolutely not a good fit for what we needed. We needed to let users easily add new flowchart nodes via scripts, and to execute those nodes in a reasonably short time (the switch could deliver results at most 10 times a second, so “reasonable time” is something like 50ms). The former goal required some sort of interpreted language, and the latter meant we could afford one, since 50ms is an eternity in computer years. I reached for TypeScript on Node.js: I was starting to dabble in more functional languages at that point, and Python’s broken-by-design capture-by-reference-not-value closures meant it wasn’t capable of the functional style I wanted to write in.
I ended up putting together an interpreter that would load a flowchart from JSON (we finally graduated from XML), dynamically load all the .js scripts in a flowchart_nodes folder, and execute the nodes, with a bit of extra logic to forward STQL queries to the switch and collect results back. The first key improvement was that there was now a rudimentary type system (you could use Ints, Floats, IPs, Dicts, and Arrays). This ended up being super helpful for constructing and editing flowcharts, as 90% of the problems we had with the C version came from type mismatches between nodes (which resulted in a segfault). The other key improvement was that data flow and control flow were finally separate (just like in real programming languages), so you could have queries where multiple concurrent requests would be running on the switch, and then, once results from all those requests were available, the dataflow graph requesting those inputs would execute. The first goal was accomplished quite trivially, as adding a node basically boiled down to exporting an object containing a specification of the node’s inputs and outputs and a function to run when control flow was passed to it. The overhead of running an individual node was measured in microseconds, so we were gated more by the latency of decoding and processing the data coming back from the switch than by actually running nodes, but overall the second goal was achieved-enough.
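A node module in this style might look like the sketch below. The exact interface the interpreter expected isn’t shown in the post, so the type names and the shape of the spec object are assumptions; the point is just that a node is an exported object pairing a typed input/output specification with a run function.

```typescript
// Hypothetical flowchart node in the style described above: each .js file in
// the flowchart_nodes folder exports an object declaring its typed inputs
// and outputs, plus a function to run when control flow reaches the node.
// Names and types here are illustrative, not the interpreter's real API.

type ValueType = "Int" | "Float" | "IP" | "Dict" | "Array";

interface NodeSpec {
  name: string;
  inputs: Record<string, ValueType>;   // checked against upstream outputs
  outputs: Record<string, ValueType>;  // checked against downstream inputs
  run(inputs: Record<string, unknown>): Record<string, unknown>;
}

// The once-painful "average two values" node becomes a few lines of script.
export const averageNode: NodeSpec = {
  name: "average",
  inputs: { a: "Float", b: "Float" },
  outputs: { mean: "Float" },
  run: ({ a, b }) => ({ mean: ((a as number) + (b as number)) / 2 }),
};
```

Declaring the input and output types in the spec is what lets the interpreter reject a mismatched connection up front instead of segfaulting at runtime, as the C version did.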
The thing that got people very excited, though, was actually a webapp I built that allowed someone to make the flowchart in a visual editor based on Rete.js, which would live-update the values in every node and connection as the process ran and highlight the currently executing nodes. I added a few nodes with d3.js line graphs in them that would mindlessly graph the values of whatever was put into them. The graphing nodes ended up being the biggest hit; the entire thing basically became kind of like an oscilloscope for the network, where some “signal” would come out of the switch, be processed in the “circuit” laid out in the flowchart, and the graphs served as probes measuring different effects over time. Even though none of the systems described here have been worked on since 2020, we still get them out occasionally for demos and troubleshooting, since the live node interface ended up being so useful.
Photo Description
This is a picture I took of the control panel for the EBR-1 nuclear reactor, which was the experimental reactor where they first proved that you could generate electricity with nuclear fission, not just make bombs. I have always been in love with the dials, gauges, and control panels of the 40s-60s, and for the flowchart webapp described in this post, I originally toyed with designing everything to be skeuomorphic dials, but it ended up taking a long time to do a single gauge, so I dropped it in the interest of finishing the project in the allotted time.