Multi core processor for QR decomposition based on FPGA

Hardware design of multicore 32-bits processor is implemented to achieve low latency and high throughput QR decomposition (QRD) based on two algorithms which they are Gram Schmidt (GS) and Givens Rotation (GR). The orthogonal matrices are computed using the first core processor by Gram Schmidt algorithm, and the upper triangular matrices are computed using the second core processor by Giv-ens Rotation algorithm. This design of multicore processor can achieve 50M QRD/s throughput for (4 × 4) matrices at running frequency 200 MHz.


Introduction
The algebraic matrix operations are very important in the signal processing algorithms. It is used in many fields and the most important of these fields are wireless communication systems, such as beam-forming, multiple input and multiple output (MIMO), cancellation, multi-user detection and so on [1]. The most useful operator in algebraic matrix operations is the QR decomposition, especially for adaptive filtering [2] and MIMO technologies [3]. Some of these systems require high throughput of QR decomposition but they are for small sizes of matrices. The implementation of QR decomposition by hardware is most widely used due to its easy parallelization and its robust numerical properties [4]. Many researchers were interested in QR decomposition based on Field-Programmable Gate-Array (FPGA). P. Luethi [5] proposed a VLSI architecture processor to detect multi-input multi-output (MIMO) up to 1.56 million complex valued of 4x4 dimensional matrices per second. Robert [6] proposed iterative decomposition architecture based on GS algorithm. Ji-Hwan Yoon [7] proposed an architecture to achieve high speed and low latency QR decomposition using Givens Rotation algorithm. Rakesh Gangarajah [8] proposed a parallel hardware architecture to perform QR decomposition using householder transformation algorithm for 5 band carrier aggregation LTE-A (Long Term Evolution-Advanced). The main objective of this paper is to design and implementation of multi core processor to perform a QR decomposition using two algorithms together which they are Gram Schmidt algorithm and Givens Rotation. In this way the latency is decreased while the throughput is increased. This paper is organized as follows: in the next section, an overview of QR decomposition with Gram Schmidt algorithm and Givens Rotation algorithm are given. In section 3, an explanation for the design of 32-bits multi core processor. In section 4, the execution and results of the designed processor are given, while in the last section, the conclusion of this work is presented.

QR decomposition
A QR-decomposition (also called QR-factorization) of a matrix is the decomposition of the matrix A as the product of A = QR, an orthogonal matrix Q (QQ T = I, Q T = Q −1 ) and an upper triangular matrix R (all entries below the main diagonal equal to 0). One note to mention here is that, the QRD of a matrix A is not unique. If A is m × n where m not equal to n, then there are both QR decompositions where Q is either m × m, R is m × n or Q is m × n and R is n × n, where m is the number of rows and n is the number of columns. Commonly, the Gram Schmidt process, Householder Transformation and Givens Rotations are known as the most widely used algorithms for QR decomposition. The Householder Transformation and givens rotations illustrate the feature of numerical stability while the Gram-Schmidt process provides an occasion to perform successive orthogonalization [9]. In this paper the Gram-Schmidt orthonormalization algorithm and Givens Rotation algorithm together are used for the decomposition of matrix (A). Where the Gram Schmidt algorithm is used to calculate an orthogonal matrix (Q) and the Givens Rotation algorithm is used to calculate an upper triangular matrix (R). In the following subsections an explanation for Gram Schmidt and Givens Rotation algorithms, based on the steps of Gram Schmidt algorithm, the elements of the Q matrix (e1, e2, e3 and e4) are computed firstly, then the R matrix is computed by using the elements (ei) of matrix Q. While the R matrix is calculated at first by using the Givens Rotation algorithm, then the Q matrix is calculated depending on the elements computed in matrix R. Fig. 1 shows the outline for these algorithms.

Gram schmidt algorithm
A matrix A can be decomposed into the product of a Q orthogonal matrix and an R upper triangular matrix, where: A = QR, Fig. 2 illustrates the part of Q matrix computation from GS algorithm [10].

Givens rotation algorithm
The QR decomposition can be computed using Givens Rrotation [11]. Givens Rotation algorithm eliminates one element in the matrix at a time.
If matrix A consists of (4 × 3) elements, A = [ a 11 a 12 a 13 a 21 a 22 a 23 a 31 a 32 a 33 a 41 a 42 a 43 ] to compute the QR decomposition by Givens Rotation, the first step, is to use a 3,1 to eliminate a 4,1 by using equation (1) then, computing Cosθ and Sinθ using equations (2) and (3) respectively.
Where, a i−1k , a ik are elements of matrix A, k denotes to the column number, while, i denotes to the row number that is required to eliminate its element.
Then repeating this process five times to compute R matrix. Equation (6) shows this process, and Fig. 3 shows the givens rotation algorithm [11].
[ After finding R matrix, equation (7) is used to compute Q matrix (orthogonal matrix) from G matrices.

Multi core processor design
Specific processor design is proposed to compute the QR decomposition using Gram Schmidt algorithm and Givens Rotation algorithm. This processor design is based on multi core processor architecture, where it consists of two cores and each core is responsible on specific function as shown in Fig. 4. The first core processor is used to compute the Q matrix using a Gram Schmidt algorithm and, the second core processor is used to compute the R matrix using Givens Rotation algorithm to achieve high levels of parallelism that leads to higher throughput.
In this paper by using a multicore processor, the first core (Q core) is used to compute the Q matrix using Gram Schmidt algorithm, while the second core (R core) is used to compute the R matrix using the Givens Rotation algorithm. So that, both cores were working simultaneously to compute Q and R matrices, which gives a high level of parallelism. The details of each core processor is explained in the following subsections.

Design the Q core processor
This core processor is designed to perform QR decomposition using Gram Schmidt algorithm to compute the Q matrix only. A 32-bit architecture is used in this design for high accuracy and to decreases the percent of error in arithmetic operations where the fixed point method is used as system of numbers. As shown in Fig. 5, this design consists of several units, which they are the Register set, instruction pointer, instruction memory, data memory, control unit, stage 1, stage 2 and E-reg unit. Following a discussion and details for each of these units. Register Set, is an internal memory, which consists of a set of storage locations, where each location consists of 32 bits (32 flip flops), the design consists of 32 registers. Instruction Pointer unit (IP), is a 32-bit register used to store 32-bits address of memory location from which an instruction to be fetched. In single cycle processor the memory divided into two parts instruction and data, the first part to store the instructions and the second part to store the data, this arrangement enables programmer to read instruction and data at the same time or in the same cycle, this type of design is known as Harvard memory architecture [12]. In this processor the memory is designed as four way enter leaving and the size of instruction memory is 1K byte while the size of data memory is 256 bytes. Control unit is responsible for controlling the data flow and an interface between the units. The proposed control unit is the largest unit. The control unit is designed to produce six control signals (regwrite, qen, lq1, lq2, lq3 and eout) details of these signals are shown in table 1.  The main units of this design is stage 1 unit and stage 2 unit. Where, these two units are used to compute Q matrices using Gram Schmidt algorithm, u1 is computed by using stage 1, while, e1 is computed by using stage 2. Following details for each unit. As shown in Fig. 6, stage 1 unit is consist of three parts to compute: Where This processor is designed for matrix A (4 x 4) elements, so, equation (8) can be rewritten as in equation (10) to become suitable for each column.
Where the initial values for each is equal to zero. The stage 1 unit is designed depending on equation (8). As shown in Fig. 4 each part refers to "proj", so, the parts of stage 1 unit consist of eight multipliers and three adders connected in parallel, then, the subtraction operation is performed between and (part 1, part 2 and part 3). From this unit u vector (4 x 1) will be produced and then sends it to the next stage unit under control signals, Fig. 7 shows the block diagram of stage 2.
Where ( ) = √∑ 2 (12) Fig. 8 shows the internal architecture for stage 2 unit. As shown in equations (11) and (12) the square root and division operations are needed to perform these equations. Coordinate Rotation Digital Computer algorithm (CORDIC) is used to calculate the square root operation. This algorithm depends on shift-add operations. To increase the speed of execution, multiplier is used. Fig. 9 shows an example for this method to compute a square root for number (x) which consists of 16-bits. 16 iterations are needed to compute the square root for number consists of 32-bits. So, to decrease the latency this stage is designed to work in parallel. Fig. 10 shows the block diagram for this design.  Where: X equal to 12056 and the square root is 109. L equal to N-bits / 2. Y is square root value. The DSP divider is used to perform a division operation. This technique is used to achieve low latency. This type of divider supports only in top series of FPGA boards as Virtex-7 and Spartan-6 [13]. E-reg unit is a set of registers (16 registers) each one consists of 32-bits used to store the orthogonal matrix Q (Q = e). This unit directly connected with stage 2 to reduce the time for data transfer between them.

Design the R core processor
The R core processor is designed based on Givens Rotation algorithm to compute the R matrix. The design of the R core processor is consisting of four stages and each stage consists of cells, where these cells are responsible on rotation of columns. The classic implementation of Givens Rotation algorithm is shown in Fig. 11, beginning with zeroing the bottom most element of first column and continues up in the same column, then repeating this procedure on next column and so on to complete the R matrix. But in this paper, the higher level of parallelism is used to rotate many rows in the same time as shown in Fig. 11 [4]. It is clear from Fig. 11; four stages are required to compute R matrix for matrix A (4 × 4). Stages one and two are consisting of two cells while stages three and four are consisting of one cell only as shows in Fig. 12. Each cell consists of two parts, the function of the first part is to calculate the values of Sine and Cosine based on equations (2 and 3), as shows in Fig. 13. It is clear from Fig. 13, that the calculation of Sine and Cosine require math operations, which they are multiplication, division and square root. The CORDIC algorithm is used in calculating the square root in the design of this part. While the function of the second part is to perform the matrix multiplication using several of DSP multipliers and adders. The cells are needed to 8 clock cycles to perform its function.

Design results
The proposed design of multi core 32-bits processor has been written using Verilog Hardware Description Language and implemented using Virtex-7 FPGA board. 32-bits fixedpoint number representation is used in this design. A testbench is created for testing the results of QRD with matlab program. The Isim behavior simulation result shows that the total latency are 4 clock cycles. Matrix A shown below is given from matlab program as test matrix for this design:  Because this design of multi core processor is pipelining architecture, it is possible to produce orthogonal and upper triangular matrices every 4 clock cycles only. Table 2 shows a comparison between this work and with previous works. This comparison based on the maximum run frequency, data word length, type of algorithm (Gram Schmidt, Householder Transformation and Givens Rotation) and latency. It is clear from this table that this design has the lower latency from other works and has a higher throughput. These results an achieved because the high parallelism of the designed work. Fig. 14, shows the screen shot of RTL (Register Transfer Level) for this design by ISE Design Suite 14.7 program, while Fig. 15 shows the capture of test bench of the result.

Conclusions
In this work 32-bits multi core processor is designed to perform QR decomposition based on two algorithms which they are Gram Schmidt and Givens Rotation, where each one of them is used to implement by one core. The first core is working using Gram Schmidt and it is responsible to calculate Q matrices, while the second core is responsible to calculate R matrices. The two cores working simultaneously, so that in the same time the Q matrix and the R matrix are calculated. This design of parallelism in computing the Q and R matrices simultaneously achieves a low latency and high throughput. This design is implemented using Virtex-7 development board and Verilog HDL. By this multicore processor the throughput is increased to 50 M matrix per second at run frequency 200MHz.