Solve block tridiagonal (non-periodic) systems of equations using parallel LU decomposition: this function can solve multiple independent systems with one call; the systems need not share the same left- or right-hand sides. It implements an iterative substructuring method, which can be briefly described through the following four stages:
- Stage 1: Parallel elimination of the tridiagonal blocks on each processor, comprising all points of the subdomain except the first point (unless it is the first global point, i.e., a physical boundary).
- Stage 2: Elimination of the first row on each processor (except the first processor) using the last row of the previous processor.
- Stage 3: Solution of the reduced tridiagonal system that represents the coupling of the system across the processors (shown schematically below); this implementation uses blocktridiagIterJacobi().
- Stage 4: Backward solve to obtain the final solution.
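For concreteness, the reduced system of stage 3 can be written schematically as follows (an illustration implied by the stage descriptions above, not a formula taken from the source):
\begin{equation} \hat A^p \hat X^{p-1} + \hat B^p \hat X^p + \hat C^p \hat X^{p+1} = \hat R^p, \ \ p = 0,\cdots,nproc-1 \end{equation}
where \(\hat X^p\) denotes the block of subdomain-boundary unknowns on processor \(p\), and \(\hat A^p\), \(\hat B^p\), \(\hat C^p\), \(\hat R^p\) are the block coefficients and right-hand-sides remaining after stages 1 and 2.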
Specific details of the method implemented here are available in:
- Ghosh, D., Constantinescu, E. M., Brown, J., "Scalable Nonlinear Compact Schemes", Technical Memorandum, ANL/MCS-TM-340, Argonne National Laboratory, April 2014 (http://debog.github.io/Files/2014_Ghosh_Consta_Brown_MCSTR340.pdf).
- Ghosh, D., Constantinescu, E. M., Brown, J., Efficient Implementation of Nonlinear Compact Schemes on Massively Parallel Platforms, SIAM Journal on Scientific Computing, 37 (3), 2015, C354–C383 (http://dx.doi.org/10.1137/140989261).
More references on this class of parallel tridiagonal solvers:
- E. Polizzi and A. H. Sameh, "A parallel hybrid banded system solver: The SPIKE algorithm", Parallel Comput., 32 (2006), pp. 177–194.
- E. Polizzi and A. H. Sameh, "SPIKE: A parallel environment for solving banded linear systems", Comput. & Fluids, 36 (2007), pp. 113–120.
Array layout: The arguments a, b, and c are local 1D arrays (containing this processor's parts of the sub-diagonal, diagonal, and super-diagonal, respectively) of size (n X ns X bs^2), and x is a local 1D array (containing this processor's part of the right-hand-side; it contains the solution on exit) of size (n X ns X bs), where n is the local size of the system, ns is the number of independent systems to solve, and bs is the block size. The ordering of the elements in these arrays is as follows:
- Each block is stored in the row-major format.
- Blocks of the same row for each of the independent systems are stored adjacent to each other.
For example, consider the following systems:
\begin{equation} \left[\begin{array}{ccccc} B_0^k & C_0^k & & & \\ A_1^k & B_1^k & C_1^k & & \\ & A_2^k & B_2^k & C_2^k & \\ & & A_3^k & B_3^k & C_3^k \\ & & & A_4^k & B_4^k \end{array}\right] \left[\begin{array}{c} X_0^k \\ X_1^k \\ X_2^k \\ X_3^k \\ X_4^k \end{array}\right] = \left[\begin{array}{c} R_0^k \\ R_1^k \\ R_2^k \\ R_3^k \\ R_4^k \end{array}\right]; \ \ k= 1,\cdots,ns \end{equation}
where \(A\), \(B\), and \(C\) are square matrices of size bs = 2 (say), and \(ns = 3\). In the equation above, we have
\begin{equation} B_i^k = \left[\begin{array}{cc} b_{00,i}^k & b_{01,i}^k \\ b_{10,i}^k & b_{11,i}^k \end{array}\right], X_i^k = \left[\begin{array}{c} x_{0,i}^k \\ x_{1,i}^k \end{array} \right], R_i^k = \left[\begin{array}{c} r_{0,i}^k \\ r_{1,i}^k \end{array} \right] \end{equation}
Note that in the code, \(X\) and \(R\) are the same array x.
Then, the array b must be a 1D array with the following layout of elements:
[
b_{00,0}^0, b_{01,0}^0, b_{10,0}^0, b_{11,0}^0, b_{00,0}^1, b_{01,0}^1, b_{10,0}^1, b_{11,0}^1, b_{00,0}^2, b_{01,0}^2, b_{10,0}^2, b_{11,0}^2,
b_{00,1}^0, b_{01,1}^0, b_{10,1}^0, b_{11,1}^0, b_{00,1}^1, b_{01,1}^1, b_{10,1}^1, b_{11,1}^1, b_{00,1}^2, b_{01,1}^2, b_{10,1}^2, b_{11,1}^2,
...,
b_{00,n-1}^0, b_{01,n-1}^0, b_{10,n-1}^0, b_{11,n-1}^0, b_{00,n-1}^1, b_{01,n-1}^1, b_{10,n-1}^1, b_{11,n-1}^1, b_{00,n-1}^2, b_{01,n-1}^2, b_{10,n-1}^2, b_{11,n-1}^2
]
The arrays a and c are stored similarly.
The array corresponding to a vector (the solution and the right-hand-side x) must be a 1D array with the following layout of elements:
[
x_{0,0}^0, x_{1,0}^0, x_{0,0}^1, x_{1,0}^1,x_{0,0}^2, x_{1,0}^2,
x_{0,1}^0, x_{1,1}^0, x_{0,1}^1, x_{1,1}^1,x_{0,1}^2, x_{1,1}^2,
...,
x_{0,n-1}^0, x_{1,n-1}^0, x_{0,n-1}^1, x_{1,n-1}^1,x_{0,n-1}^2, x_{1,n-1}^2
]
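Given these layouts, the flat-array index of any entry follows directly; the helpers below are a minimal sketch (the macro names are hypothetical, introduced here only for illustration):

/* index of entry (p,q) of the block in row i, system k, in the arrays a, b, or c */
#define _BLOCK_ENTRY_(i,k,p,q) ((((i)*ns + (k))*bs + (p))*bs + (q))
/* index of component p of the block in row i, system k, in the array x */
#define _VEC_ENTRY_(i,k,p) (((i)*ns + (k))*bs + (p))

For example, with bs = 2 and ns = 3 as above, \(b_{01,1}^2\) is stored at b[_BLOCK_ENTRY_(1,2,0,1)], i.e., b[(1*3+2)*4+1] = b[21].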
Notes:
- This function does not preserve the sub-diagonal, diagonal, and super-diagonal elements, or the right-hand-sides: they are overwritten during the solve.
- The input array x contains the right-hand-side on entering the function, and the solution on exiting it.
Parameters:
- a: Array containing the sub-diagonal elements
- b: Array containing the diagonal elements
- c: Array containing the super-diagonal elements
- x: Right-hand side; will contain the solution on exit
- n: Local size of the system on this processor (not multiplied by the block size)
- ns: Number of systems to solve
- bs: Block size
- r: Object of type TridiagLU (contains wall times at exit)
- m: MPI communicator
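A minimal calling sketch is shown below. It assumes that blocktridiagLU() shares the prototype listed for blocktridiagIterJacobi() in the referenced symbols at the end of this page; the header name and the initialization of the TridiagLU object are assumptions, not taken from this page.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include "tridiagLU.h"   /* assumed header declaring TridiagLU and blocktridiagLU() */

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  MPI_Comm comm = MPI_COMM_WORLD;

  int n = 16, ns = 3, bs = 2;   /* local rows, number of systems, block size */
  double *a = (double*) calloc(n*ns*bs*bs, sizeof(double)); /* sub-diagonal   */
  double *b = (double*) calloc(n*ns*bs*bs, sizeof(double)); /* diagonal       */
  double *c = (double*) calloc(n*ns*bs*bs, sizeof(double)); /* super-diagonal */
  double *x = (double*) calloc(n*ns*bs,    sizeof(double)); /* RHS / solution */

  /* ... fill a, b, c, and x in the layout described above ... */

  TridiagLU params;
  /* ... initialize params here (e.g., choose the reduced-system solver);
     the exact initialization call is not shown on this page ... */

  int ierr = blocktridiagLU(a, b, c, x, n, ns, bs, &params, &comm);
  if (ierr) fprintf(stderr, "blocktridiagLU() returned %d\n", ierr);

  /* x now holds the solution; params holds the stage wall times */
  free(a); free(b); free(c); free(x);
  MPI_Finalize();
  return 0;
}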
Definition at line 107 of file blocktridiagLU.c.
int d,i,j,istart,iend,size;
int rank,nproc,bs2=bs*bs,nsbs=ns*bs;
struct timeval start,stage1,stage2,stage3,stage4;
/* ... (remaining declarations elided in this listing) ... */

MPI_Comm *comm = (MPI_Comm*) m;
MPI_Comm_size(*comm,&nproc);
MPI_Comm_rank(*comm,&rank);

/* guard against NULL arguments (the enclosing check is elided here) */
fprintf(stderr,
        "Error in tridiagLU(): NULL pointer passed for parameters.\n");

gettimeofday(&start,NULL);

/* nothing to do for an empty system */
if ((ns == 0) || (n == 0) || (bs == 0)) return(0);

/* buffers to exchange the first block row of the solution with the
   neighboring ranks (used in the back-solve of stage 4) */
xs1 = (double*) calloc (nsbs, sizeof(double));
xp1 = (double*) calloc (nsbs, sizeof(double));
for (i=0; i<nsbs; i++) xs1[i] = xp1[i] = 0;

/* Stage 1: parallel elimination of the sub-diagonal blocks within each
   subdomain (all rows except those handled in stage 2) */
istart = (rank == 0 ? 1 : 2);
for (i = istart; i < iend; i++) {   /* iend (set in elided code) marks the end of the local rows */
  double binv[bs2], factor[bs2];
  for (d = 0; d < ns; d++) {
    /* ... block forward elimination of row i, system d
       (elided; see the sketch below) ... */
  }
}
gettimeofday(&stage1,NULL);
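/* [Sketch] The elided stage-1 loop body performs a standard block forward
   elimination. This is an illustration, not the verbatim source; it assumes
   the helper macros behave as:
     _MatrixInvert_(A,B,N)             B <- inv(A)
     _MatrixMultiply_(A,B,C,N)         C <- A*B
     _MatrixMultiplySubtract_(C,A,B,N) C <- C - A*B
     _MatVecMultiplySubtract_(y,A,x,N) y <- y - A*x
   One elimination step for row i, system d:
     _MatrixInvert_           (b+((i-1)*ns+d)*bs2, binv, bs);
     _MatrixMultiply_         (a+(i*ns+d)*bs2, binv, factor, bs);
     _MatrixMultiplySubtract_ (b+(i*ns+d)*bs2, factor, c+((i-1)*ns+d)*bs2, bs);
     _MatVecMultiplySubtract_ (x+(i*ns+d)*bs, factor, x+((i-1)*ns+d)*bs, bs);
   (the sub-diagonal block is also updated with the fill-in created by the
   elimination; elided here) */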
/* Stage 2: send this rank's last block row (a, b, c, and x) to the next
   rank, which uses it to eliminate the first row of its subdomain */
double *sendbuf, *recvbuf;
size = ns*bs2*(nvar-1)+nsbs;   /* nvar (set in elided code): number of arrays exchanged */
sendbuf = (double*) calloc (size, sizeof(double));
recvbuf = (double*) calloc (size, sizeof(double));
for (d=0; d<ns; d++) {
  for (i=0; i<bs2; i++) {
    sendbuf[(0*ns+d)*bs2+i] = a[((n-1)*ns+d)*bs2+i];
    sendbuf[(1*ns+d)*bs2+i] = b[((n-1)*ns+d)*bs2+i];
    sendbuf[(2*ns+d)*bs2+i] = c[((n-1)*ns+d)*bs2+i];
  }
  for (i=0; i<bs; i++) sendbuf[3*ns*bs2+d*bs+i] = x[((n-1)*ns+d)*bs+i];
}
MPI_Request req[2] = {MPI_REQUEST_NULL,MPI_REQUEST_NULL};
if (rank) MPI_Irecv(recvbuf,size,MPI_DOUBLE,rank-1,1436,*comm,&req[0]);
if (rank != nproc-1) MPI_Isend(sendbuf,size,MPI_DOUBLE,rank+1,1436,*comm,&req[1]);
MPI_Status status_arr[2];
MPI_Waitall(2,&req[0],status_arr);

/* unpack the previous rank's last block row and use it to eliminate this
   rank's first row */
for (d = 0; d < ns; d++) {
  double am1[bs2], bm1[bs2], cm1[bs2], xm1[bs];
  for (i=0; i<bs2; i++) {
    am1[i] = recvbuf[(0*ns+d)*bs2+i];
    bm1[i] = recvbuf[(1*ns+d)*bs2+i];
    cm1[i] = recvbuf[(2*ns+d)*bs2+i];
  }
  for (i=0; i<bs; i++) xm1[i] = recvbuf[3*ns*bs2+d*bs+i];
  double factor[bs2], binv[bs2];
  /* ... (elimination steps elided; see the sketch below) ... */
  _MatrixInvert_ (b+((n-1)*ns+d)*bs2,binv,bs);
  if (ierr) return(ierr);
  /* ... (further elimination steps elided) ... */
}
gettimeofday(&stage2,NULL);
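/* [Sketch] The elided stage-2 elimination mirrors the stage-1 step, but
   uses the previous rank's last block row (am1, bm1, cm1, xm1) to
   eliminate this rank's first row; under the same macro conventions as
   the stage-1 sketch above, schematically:
     _MatrixInvert_           (bm1, binv, bs);
     _MatrixMultiply_         (a+d*bs2, binv, factor, bs);
     _MatrixMultiplySubtract_ (b+d*bs2, factor, cm1, bs);
     _MatVecMultiplySubtract_ (x+d*bs, factor, xm1, bs);
   (fill-in updates involving am1 are elided) */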
/* Stage 3: solve the reduced block tridiagonal system that couples the
   subdomains; zero and eye hold ns block zero and identity matrices */
zero = (double*) calloc (ns*bs2, sizeof(double));
eye  = (double*) calloc (ns*bs2, sizeof(double));
for (d=0; d<ns*bs2; d++) zero[d] = eye[d] = 0.0;
for (d=0; d<ns; d++) {
  for (i=0; i<bs; i++) eye[d*bs2+(i*bs+i)] = 1.0;
}

/* only the iterative (Jacobi) reduced-system solver is implemented for the
   block version; the gather-and-solve option reports an error */
fprintf(stderr,
        "Error in blocktridiagLU(): Gather-and-solve for reduced system not available.\n");

/* ... (call to the reduced-system solver elided; see the sketch below) ... */

/* send the first block row of the solution to the previous rank, and
   receive the next rank's first block row (needed for stage 4) */
MPI_Request req[2] = {MPI_REQUEST_NULL,MPI_REQUEST_NULL};
for (d=0; d<nsbs; d++) xs1[d] = x[d];
if (rank+1 < nproc) MPI_Irecv(xp1,nsbs,MPI_DOUBLE,rank+1,1323,*comm,&req[0]);
if (rank) MPI_Isend(xs1,nsbs,MPI_DOUBLE,rank-1,1323,*comm,&req[1]);
MPI_Status status_arr[2];
MPI_Waitall(2,&req[0],status_arr);

/* sanity check for serial builds */
fprintf(stderr,
        "Error: nproc > 1 for a serial run!\n");

gettimeofday(&stage3,NULL);
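/* [Sketch] Per the documentation above, the elided reduced-system solve
   calls blocktridiagIterJacobi() on the single block row each rank
   contributes to the reduced system; a hypothetical call mirroring the
   prototype in the referenced symbols at the end of this page:
     ierr = blocktridiagIterJacobi(ra, rb, rc, rx, 1, ns, bs, params, comm);
     if (ierr) return(ierr);
   where ra, rb, rc, rx (hypothetical names) point to this rank's block row
   of the reduced system, and params is the TridiagLU* obtained from the
   argument r in elided code */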
/* Stage 4: backward solve; rank 0 solves down to row 0, all other ranks
   down to row 1 */
iend = (rank == 0 ? 0 : 1);
for (d = 0; d < ns; d++) {   /* back-solve the last local block row first */
  double binv[bs2],xt[bs];
  /* ... (computation of xt for the last row, using xp1 received from the
     next rank, elided) ... */
  for (j=0; j<bs; j++) x[(istart*ns+d)*bs+j]=xt[j];   /* istart: reset in elided code */
}
for (i = istart-1; i > iend-1; i--) {
  for (d = 0; d < ns; d++) {
    double binv[bs2],xt[bs];
    /* ... (back-solve for block row i elided; see the sketch below) ... */
    for (j=0; j<bs; j++) x[(i*ns+d)*bs+j] = xt[j];
  }
}
gettimeofday(&stage4,NULL);
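/* [Sketch] The elided back-solve for block row i computes
   x_i <- inv(b_i) * (x_i - c_i * x_{i+1}); under the macro conventions of
   the stage-1 sketch above (with _MatVecMultiply_(A,x,y,N) as y <- A*x):
     _MatVecMultiplySubtract_ (x+(i*ns+d)*bs, c+(i*ns+d)*bs2, x+((i+1)*ns+d)*bs, bs);
     _MatrixInvert_           (b+(i*ns+d)*bs2, binv, bs);
     _MatVecMultiply_         (binv, x+(i*ns+d)*bs, xt, bs);
   after which xt is copied back into x (as shown above); any fill-in terms
   are elided */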
walltime = ((stage1.tv_sec * 1000000 + stage1.tv_usec) - (start.tv_sec * 1000000 + start.tv_usec));
params->stage1_time = (double) walltime / 1000000.0;
walltime = ((stage2.tv_sec * 1000000 + stage2.tv_usec) - (stage1.tv_sec * 1000000 + stage1.tv_usec));
params->stage2_time = (double) walltime / 1000000.0;
walltime = ((stage3.tv_sec * 1000000 + stage3.tv_usec) - (stage2.tv_sec * 1000000 + stage2.tv_usec));
params->stage3_time = (double) walltime / 1000000.0;
walltime = ((stage4.tv_sec * 1000000 + stage4.tv_usec) - (stage3.tv_sec * 1000000 + stage3.tv_usec));
params->stage4_time = (double) walltime / 1000000.0;
walltime = ((stage4.tv_sec * 1000000 + stage4.tv_usec) - (start.tv_sec * 1000000 + start.tv_usec));
params->total_time = (double) walltime / 1000000.0;
Referenced symbols:
- int blocktridiagIterJacobi(double *, double *, double *, double *, int, int, int, void *, void *)
- char reducedsolvetype[50]
- #define _MatVecMultiplySubtract_(y, A, x, N)
- #define _MatrixMultiply_(A, B, C, N)
- #define _MatVecMultiply_(A, x, y, N)
- #define _MatrixInvert_(A, B, N)
- #define _MatrixZero_(A, N)
- #define _MatrixMultiplySubtract_(C, A, B, N)