Data Flow Architecture
The Most Scalable Design for Programmable Packet Processing
Forwarding data in today’s public networks is a daunting task. Hundreds of millions of packets are processed on each line card every second, and each packet needs individual treatment: packets are classified, counted, and encapsulated into tunnels; they are dropped, terminated, and forwarded; they are associated with flows for application-layer treatment and encryption; they are shaped for an assured user experience. Executing these operations is not in itself very complex. The challenge is to process packets at tens or even hundreds of gigabits per second with full control and deterministic behavior. This calls for a data-driven design in which packets are treated in a homogeneous manner regardless of type and load.
How It Works
The dataflow architecture is beautifully simple. At a high level, it resembles a highly rationalized assembly line in manufacturing, with the manufactured parts replaced by data packets. The assembly line consists of optimized stages, each performing one or more dedicated operations on a packet before it continues to the next stage. The line can be flexibly programmed to perform any type of operation on the packets. For every packet type and flow, the assembly line executes a given set of operations, and several thousand packet types and flows can be supported in parallel.
The dataflow architecture includes the following components:
- The Packet Instruction Set Computer (PISC) is a processor core specifically designed for packet processing. A pipeline can include several hundred (400+) PISCs.
- The Engine Access Point (EAP) is a specialized I/O unit for classification tasks. EAPs unify access to tables stored in embedded or external memory (TCAM, SRAM, DRAM) and include resource engines for metering, counting, hashing, formatting, traffic management and table search.
- Execution context is packet-specific data available to the programmer. It includes the first 256 bytes of the packet, general-purpose registers, device registers, and condition flags. An execution context is uniquely associated with every packet and follows the packet through the pipeline.
Packets travel through the pipeline as if advancing through a fixed-length first-in-first-out (FIFO) device. In each clock cycle, every packet in the pipeline shifts one stage ahead and executes in the next processor core or EAP. Each instruction executes to completion within a single clock cycle and can perform up to four operations in parallel; the packet then continues to the next PISC or EAP.
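The lock-step shift described above can be sketched in C. This is an invented toy model, not the actual device: the four-stage depth, field names, and per-stage operations are all assumptions for illustration (a real pipeline has hundreds of PISCs and a 256-byte context).

```c
#include <assert.h>
#include <string.h>

#define STAGES 4  /* toy depth; a real pipeline has hundreds of PISCs */

/* Toy execution context: stands in for the 256 packet bytes,
 * general-purpose registers, and flags that travel with the packet. */
typedef struct {
    unsigned char hdr[4];
    int reg;
    int valid;   /* stage holds a packet this cycle */
} Ctx;

/* Each stage holds one fixed, preloaded instruction and always
 * completes it within the cycle. */
static void run_stage(int stage, Ctx *c) {
    if (!c->valid) return;
    switch (stage) {
    case 0: c->reg = c->hdr[0]; break;     /* classify: read a field */
    case 1: c->reg += 1;        break;     /* count                  */
    case 2: c->hdr[1] = 0x2A;   break;     /* modify a header byte   */
    case 3: /* forwarding decision */ break;
    }
}

/* One clock tick: every context shifts one stage ahead, then the
 * stage it lands in executes on it. */
static void tick(Ctx pipe[STAGES], Ctx in) {
    memmove(&pipe[1], &pipe[0], (STAGES - 1) * sizeof(Ctx));
    pipe[0] = in;
    for (int s = 0; s < STAGES; s++)
        run_stage(s, &pipe[s]);
}

/* Drive one packet (first header byte b) through all stages and
 * return the register value it exits with. */
int process_one(unsigned char b) {
    Ctx pipe[STAGES] = {0};
    Ctx in   = { {b, 0, 0, 0}, 0, 1 };
    Ctx idle = { {0}, 0, 0 };
    tick(pipe, in);
    for (int t = 1; t < STAGES; t++)
        tick(pipe, idle);
    return pipe[STAGES - 1].reg;
}
```

Because every context moves exactly one stage per cycle, a full pipeline emits one packet per clock regardless of packet type, which is the "wirespeed by design" property discussed below.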
The data plane program is compiled to instruction memory located in the processor cores, eliminating the need to load instructions into the processor cores during program execution.
The pure linear pipelined approach has a number of advantages. One is its deterministic behavior. It is “wirespeed by design”—operations and classification on every packet type are always guaranteed.
This translates to significant engineering gains as engineers can focus on functionality rather than performance optimization. Deterministic performance is important for all types of systems, be it over-subscribed access pizza boxes or wirespeed line cards in chassis-based systems.
Compiling the data plane software maps it to the pipeline, and the architecture then guarantees that packets are processed at a predefined data rate. So if the software compiles, the data plane is wirespeed.
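In spirit, the "compiles implies wirespeed" guarantee reduces to a compile-time capacity check: the program must fit the fixed number of stages, and any program that fits sustains the same fixed rate. The function names and numbers below are invented for this sketch.

```c
#include <assert.h>

/* Hypothetical sketch of the compile-time guarantee: a program
 * "compiles" for the pipeline only if its instruction count fits
 * the fixed number of pipeline stages. */
int fits_pipeline(int n_insns, int n_stages) {
    return n_insns <= n_stages;
}

/* For any program that fits, one packet enters and one leaves per
 * clock, so throughput is set by the clock alone, not the program. */
double wirespeed_pps(double clock_hz) {
    return clock_hz;
}
```

The check is static: there is no runtime scheduling that could degrade under load, which is why performance tuning drops out of the engineering effort.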
Greater Service Density
Another key dataflow advantage is service density, or the ability to support many packet types and ensure all operations are executed at wirespeed. The dataflow architecture features many hundreds of processor cores and provides six times more operations per second than any competing architecture on the market. This allows for feature-rich data plane implementations in a single chip.
The dataflow architecture has been optimized for the flow of data through the chip rather than for the flow of instructions through a processing element. This contrasts with the load/store approach used by multicore RISC-based architectures. While load/store has advantages in instruction-intensive applications, such as application-layer processing, it falls short in data-intensive applications like layer 2-4 packet processing.
The pure linear pipeline organization of the PISCs eliminates a number of design elements:
- No complex, power-hungry bus or crossbar interconnect
- No load/store of instructions to processing elements
- No memory stalls
- No replicated storage of packet data
- No branching delays or load balancing of parallel threads
- No context switching with associated synchronization stalls
This significantly reduces die area, power dissipation, and pin count. In addition, the programming model is significantly simplified: in a load/store architecture, the programmer has to manage pools of non-deterministic processor cores and budget headroom to obtain a given level of performance across a complex matrix of available resources.
Unlike load/store architectures with pools of processor cores, the dataflow architecture never stalls. This is because all operands reside in the execution context and data plane instructions are always available in the local instruction memory.
Flexible Use of Resources
The organization of EAP classification resources and blocks of processor cores lets the dataflow architecture map naturally onto the OSI protocol layers, where identification, classification, and modification at one layer often depend on modifications made at lower layers. It also allows the data plane software to be flexibly reorganized to accommodate feature enhancements in field upgrades.
All tables and resource engines can be accessed from any EAP, and all processor cores are identical, allowing consistent programming and fully flexible use of resources for inspection, classification, and modification of packets.
The dataflow architecture is presented to the programmer as a standard sequential uni-processor. The compiled code is loaded to the instruction memory located in each PISC, mapping the first instruction to the first PISC, the second instruction to the second PISC, and so on.
The programs that prepare classifications, analyze lookup results, and modify packet headers are therefore written as conventional sequential code, with direct addressing of the instruction memory and indirect addressing of the packet memory via an offset register.
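The "instruction i runs on PISC i" mapping and the offset-register addressing might be modeled as follows. The three-operation instruction set, field offset, and names are all invented for this sketch.

```c
/* Invented three-op instruction set for illustration. Instruction i
 * of the compiled sequential program is loaded into PISC i, so the
 * stage index doubles as the program counter. */
enum Op { LOAD, ADD, STORE };
typedef struct { enum Op op; int arg; } Insn;

typedef struct {
    unsigned char pkt[256]; /* first 256 bytes of the packet         */
    int offset;             /* offset register: indirect addressing  */
    int acc;                /* a general-purpose register            */
} Ctx;

/* Stage i always executes prog[i]; packet bytes are reached
 * indirectly as pkt[offset + arg]. */
static void exec_stage(const Insn *prog, int stage, Ctx *c) {
    const Insn *in = &prog[stage];
    switch (in->op) {
    case LOAD:  c->acc = c->pkt[c->offset + in->arg]; break;
    case ADD:   c->acc += in->arg; break;
    case STORE: c->pkt[c->offset + in->arg] = (unsigned char)c->acc; break;
    }
}

/* A three-instruction "program": decrement the byte at pkt[offset + 8]
 * (think of a TTL), mapped onto three consecutive stages. */
int run_program(unsigned char ttl) {
    Insn prog[] = { {LOAD, 8}, {ADD, -1}, {STORE, 8} };
    Ctx c = { {0}, 0, 0 };
    c.pkt[8] = ttl;
    for (int stage = 0; stage < 3; stage++)
        exec_stage(prog, stage, &c);
    return c.pkt[8];
}
```

The sequential view is preserved for the programmer: the same code would run unchanged on a single processor stepping through `prog`, yet it deploys as one instruction per stage.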
C data types are used to simplify data structure management and to stay consistent with control plane programming models. To keep the programmer in full control of the processing resources, the code is written in assembly language.
Dataflow Architecture Advantages
| Advantage | Benefit |
| --- | --- |
| Deterministic performance | Wirespeed designs at no extra engineering cost |
| Data-driven processor design | Perfect mapping of identification, classification, and modification resources to layer 2-4 data plane requirements |
| Simplified programming model | Two developers and six months to complete a data plane software implementation |