*High Level Synthesis* (HLS) has been a research topic for some three decades. Since the late 1980s,
there has been a continuous effort to automate the process of RTL design by generating
HDL code, such as Verilog HDL and VHDL, automatically. Early HLS research focused on optimizing the
resource usage and/or the latency of the whole design, treating synthesis as an optimization problem.
Thus, several mathematical optimization models were proposed to solve the HLS problem. The earliest
of these models might be the *Integer Linear Programming* (ILP) model, which can find the
solution with optimal total latency subject to resource constraints,
or the solution with optimal resource usage subject to latency constraints.
However, the ILP model is very slow to reach the optimal solution.
SDC (System of Difference Constraints)
is another model used to solve this optimization problem;
it is faster than the ILP model, but still much slower than a software compiler.

Recent HLS research also focuses on loop optimization and transformation, an existing topic mainly from software compiler optimization. However, there are problems with using loop optimization or transformation for HLS. The first problem is that the loop structure is not strictly necessary for either programming or hardware description; it is only especially suitable for sequential programming on a CPU. It is well known that functional programming languages prefer recursion to loops, and often replace loops with recursive functions combined with tail-call optimization. Since high-level synthesis targets hardware, which is intrinsically parallel, loops might not be a proper structure for describing intrinsically parallel hardware.

Loop optimization has another problem: it changes the architecture too much. Take loop unrolling, a common loop optimization technique. If the loop is running on a CPU, unrolling only reduces execution time by removing the instructions used for loop-bound checking and other trivial calculations such as pointer increments; the asymptotic time complexity of the program stays the same. However, when loop unrolling is used for high-level synthesis, it actually copies the loop body. If the loop is fully unrolled, the asymptotic space (area) changes from constant to linear. Although the execution time of the loop also drops from linear to constant, area is as crucial for hardware as execution time. It is very difficult to describe the hardware architecture clearly through loop optimization or transformation pragmas; recursion might be the solution.

Unfortunately, few existing high-level synthesis frameworks support recursion. The difficulty is that if the HLS tool inlines all function calls, the size of the synthesized circuit cannot be determined statically (at compile time) for recursive function calls with runtime arguments; if the HLS tool doesn't inline functions, a call stack is required to implement function calls. However, large contiguous memory storage might not always be present in hardware: on an FPGA, for example, there are many distributed memory elements, but the size of random-access memory is limited. This difficulty leads many HLS tools to disable recursion completely.

Despite the difficulty of recursive function calls, a certain type of recursion is still very useful for high-level synthesis, and its circuit size can be statically determined; I call it structure recursion. The programming literature has the concept of structural recursion (structurally recursive functions), which is a very similar concept, though not exactly the same idea.

The key characteristic of structure recursion is that the structure of the argument to a recursive function is recursively defined, and the function decomposes the argument recursively. A good example of structure recursion is prefix sum, a simple and fundamental operation for parallel algorithms. A simplified version of prefix sum, which only computes the sum of all N numbers rather than the sums of the first n numbers for each n in 1, 2, ..., N, is shown in the figure below. Note that the addition operation can be generalized to any associative operation, such as max, min, and multiplication.

We can see that the operation for the sum of eight numbers decomposes the eight numbers into two sets
of four numbers each; the operation for the sum of four numbers recursively decomposes them into two
sets of two numbers each; and finally, the operation for the sum of two numbers terminates
the recursive decomposition and returns the sum of those two numbers.
Structure recursion can be implemented as a Verilog parameterized module with a `generate` block.
For example, below is a Verilog implementation of the simplified prefix sum example.
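A minimal sketch of such a module might look like the following (module and port names are my own; N is assumed to be a power of two, and recursive module instantiation requires a tool that supports it):

```verilog
// Recursive sum of N numbers as a parameterized module.
// Names are illustrative, not from the DMHLS framework.
module sum #(
    parameter N = 8,       // number of inputs, assumed a power of two
    parameter W = 32       // bit width of each input
) (
    input  [N*W-1:0] in,   // N numbers packed into one vector
    output [W-1:0]   out   // their sum
);
    generate
        if (N == 2) begin : base
            // Base case: the sum of two numbers terminates the recursion.
            assign out = in[W-1:0] + in[2*W-1:W];
        end else begin : rec
            // Recursive case: split the inputs into two halves,
            // instantiate two half-size adders, and add their results.
            wire [W-1:0] lo, hi;
            sum #(.N(N/2), .W(W)) s_lo (.in(in[N*W/2-1:0]),   .out(lo));
            sum #(.N(N/2), .W(W)) s_hi (.in(in[N*W-1:N*W/2]), .out(hi));
            assign out = lo + hi;
        end
    endgenerate
endmodule
```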

Another example of structure recursion is the bitonic sorter, shown in the figure below, which is a parallel implementation of merge sort and is used in GPU computing for sorting. Each line in the bitonic sorter represents a number, and each arrow means the two numbers it connects will be sorted according to the direction of the arrow. This example is more complex and has two nested recursive structures.

Another very useful type of recursion is tail recursion. The functional programming paradigm already uses tail recursion to implement loops. Since tail recursion is just another form of looping, the circuit size can also be statically determined if the relevant input argument of the tail-recursive function is constant. With respect to high-level synthesis, tail recursion is better than looping because transformed and optimized loops can be expressed directly and more clearly using tail recursion.

*Direct-Mapping High-Level Synthesis* (DMHLS)
is a novel framework for *High Level Synthesis*,
which is fast, aiming to achieve the speed of software compilers for microprocessors.
Rather than formulating HLS mainly as an optimization problem, DMHLS considers direct mapping in the first place,
trying to enable accurate and clear architecture descriptions through high-level languages such as the C programming
language, without automated resource sharing and loop optimizations that change the architecture too much or
"obscure the architecture"
(BSV by Example, p12).
DMHLS starts from the LLVM Intermediate Representation (LLVM IR)
and generates Verilog HDL code. LLVM IR is a typical
Static Single Assignment (SSA)
form of intermediate representation. Although DMHLS only supports LLVM IR currently,
the techniques underneath the framework should
work for any SSA form of intermediate representation.

Since this is an ongoing project, new features will be added, and the interface might change frequently. Currently, I have finished Verilog code generation for functions without loops or recursion. I use As-Soon-As-Possible (ASAP) scheduling and enforce pipeline parallelism, which means the generated circuits are fully pipelined with no resource sharing. I am considering using loops exclusively for resource sharing. Next, I will focus on recursion. Arrays and pointers are not supported yet, because using arrays or pointers (e.g. pointer increments) assumes that a large random-access memory with contiguous addresses exists. I am still thinking about the proper way to translate them.

2018-02-13 03:29:34 UTC

by Chaofan Li