matrices contributed by Jin Chen, Dec 10, 2013

Jin Chen via mcs.anl.gov, Dec 10, to Matthew

Hi,

We are using superlu_dist through the petsc interface to invert our diagonal block. While mumps works well and correctly, superlu_dist crashes with the following seg fault error, e.g., from process 620. Do you know what is happening? Why does mumps work when superlu_dist doesn't?

This is on nersc edison using cray-petsc/3.4.2.0 with compiler PrgEnv-cray/5.1.18.

[620]PETSC ERROR: ------------------------------------------------------------------------
[620]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[620]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[620]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[620]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[620]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[620]PETSC ERROR: to get more information on the crash.
[620]PETSC ERROR: User provided function() line 0 in unknown directory unknown file (null)
_pmiu_daemon(SIGCHLD): [NID 03546] [c2-2c1s6n2] [Tue Dec 3 20:36:29 2013] PE RANK 620 exit signal Aborted

Thanks,
-- Jin

-----------------------

Hong Zhang, 10:45 AM, to Jin, petsc-maint

Jin,

The matrix size is M=147648, so I tested it on a local linux machine.
Here is what I get using g and O builds of petsc (master branch):

1. mumps:
-------------
mpiexec -n 10 ./ex10j -f0 ~/tmp/A22.dat -rhs ~/tmp/b22.dat -pc_type lu -pc_factor_mat_solver_package mumps
[0]PETSC ERROR: [0] MatLUFactorSymbolic_AIJMUMPS line 888 /sandbox/hzhang/petsc/src/mat/impls/aij/mpi/mumps/mumps.c
[0]PETSC ERROR: [0] MatLUFactorSymbolic line 2883 /sandbox/hzhang/petsc/src/mat/interface/matrix.c
[0]PETSC ERROR: [0] PCSetUp_LU line 99 /sandbox/hzhang/petsc/src/ksp/pc/impls/factor/lu/lu.c
[0]PETSC ERROR: [0] PCSetUp line 866 /sandbox/hzhang/petsc/src/ksp/pc/interface/precon.c
[0]PETSC ERROR: [0] KSPSetUp line 192 /sandbox/hzhang/petsc/src/ksp/ksp/interface/itfunc.c
[0]PETSC ERROR: --------------------- Error Message ------------------------------------
[0]PETSC ERROR: Signal received!

Using a different matrix ordering:
mpiexec -n 10 ./ex10j -f0 ~/tmp/A22.dat -rhs ~/tmp/b22.dat -pc_type lu -pc_factor_mat_solver_package mumps -mat_mumps_icntl_7 2
Using vector of mine for RHS
A size: M=147648
b size: m=147648
c1 norm = 1.000000e+00
Number of iterations = 1
Residual norm 6.30834e-12

2. superlu_dist
--------------------
mpiexec -n 10 ./ex10j -f0 ~/tmp/A22.dat -rhs ~/tmp/b22.dat -pc_type lu -pc_factor_mat_solver_package superlu_dist
…
c1 norm = 1.000000e+00
Number of iterations = 1
Residual norm 3.04565e-12

3. superlu
---------------
./ex10j -f0 ~/tmp/A22.dat -rhs ~/tmp/b22.dat -pc_type lu -pc_factor_mat_solver_package superlu -mat_superlu_conditionnumber
Recip. condition number = 9.440944e-11
c1 norm = 1.000000e+00
Number of iterations = 1
Residual norm 9.44639e-11

Note: Recip. condition number = 9.440944e-11! Your matrix is very ill-conditioned.

4.
petsc iterative solver gmres/bjacobi/lu works well:
--------------------------------------------------------------------------
mpiexec -n 10 ./ex10j -f0 ~/tmp/A22.dat -rhs ~/tmp/b22.dat -ksp_monitor_true_residual -ksp_rtol 1.e-10 -sub_pc_type lu
A size: M=147648
b size: m=147648
  0 KSP preconditioned resid norm 7.760281413224e+03 true resid norm 2.798860304714e+02 ||r(i)||/||b|| 1.000000000000e+00
  1 KSP preconditioned resid norm 2.740436965153e+03 true resid norm 1.369087432512e+02 ||r(i)||/||b|| 4.891589016451e-01
  2 KSP preconditioned resid norm 8.101470860329e+02 true resid norm 8.217267177233e+01 ||r(i)||/||b|| 2.935933302349e-01
…
100 KSP preconditioned resid norm 1.884696470509e-08 true resid norm 2.264991334782e-10 ||r(i)||/||b|| 8.092548709797e-13
101 KSP preconditioned resid norm 1.356108536033e-08 true resid norm 1.466574592659e-10 ||r(i)||/||b|| 5.239899219654e-13
102 KSP preconditioned resid norm 9.569540519562e-09 true resid norm 1.180704824507e-10 ||r(i)||/||b|| 4.218520025878e-13
103 KSP preconditioned resid norm 7.234168385933e-09 true resid norm 9.470642147608e-11 ||r(i)||/||b|| 3.383749496771e-13
c1 norm = 1.000000e+00
Number of iterations = 103
Residual norm 9.47064e-11

The matrix is ill-conditioned, i.e., sensitive to solvers, compilers etc. Check the model to investigate why it gives such an ill-conditioned matrix. Meanwhile, you may experiment with different solvers to push the application forward.

Best,
Hong

-------------------------

Jin,

The crash of mumps occurs in metis:

Program received signal SIGSEGV, Segmentation fault.
0x00007ffbab2d2cd1 in libmetis__rpqUpdate (queue=0x12a4b00, node=12773, newkey=-36) at /sandbox/hzhang/petsc/externalpackages/metis-5.1.0-p1/libmetis/gklib.c:34
34 GK_MKPQUEUE(rpq, rpq_t, rkv_t, real_t, idx_t, rkvmalloc, REAL_MAX, key_g

This explains why switching to a mumps matrix ordering (-mat_mumps_icntl_7 2) works. Changing to a different matrix ordering might fix your seg fault error with superlu_dist.
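
-------------------------

Illustration (not part of the original exchange): the remark above that a reciprocal condition number near 1e-10 makes the matrix "sensitive to solvers, compilers etc." can be sketched on a toy problem. The 2x2 matrix, the value of eps, and all helper names below are assumptions made up for this sketch; it does not use A22.dat or any PETSc call. It shows that a tiny relative residual can coexist with a 100% relative error, and that the standard bound (relative error <= relative residual / rcond) permits exactly that.

```python
from fractions import Fraction

def norm1_vec(v):
    """Vector 1-norm: sum of absolute values."""
    return sum(abs(x) for x in v)

def norm1_mat(a):
    """Matrix 1-norm: largest absolute column sum."""
    return max(sum(abs(a[i][j]) for i in range(len(a))) for j in range(len(a[0])))

def inverse2(a):
    """Exact inverse of a 2x2 matrix via the adjugate formula."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]

def matvec(a, x):
    """Matrix-vector product for nested-list matrices."""
    return [sum(a[i][j] * x[j] for j in range(len(x))) for i in range(len(a))]

# Hypothetical nearly singular matrix; eps is chosen for illustration only.
eps = Fraction(1, 10**10)
A = [[Fraction(1), Fraction(1)],
     [Fraction(1), 1 + eps]]

x_true = [Fraction(1), Fraction(1)]   # exact solution
b = matvec(A, x_true)                 # right-hand side consistent with x_true
x_bad = [Fraction(2), Fraction(0)]    # a badly wrong "solution"

# Reciprocal condition number in the 1-norm: 1 / (||A|| * ||A^-1||).
rcond = 1 / (norm1_mat(A) * norm1_mat(inverse2(A)))

residual = [bi - yi for bi, yi in zip(b, matvec(A, x_bad))]
rel_residual = norm1_vec(residual) / norm1_vec(b)
rel_error = norm1_vec([u - v for u, v in zip(x_bad, x_true)]) / norm1_vec(x_true)

print(f"reciprocal condition number: {float(rcond):.3e}")
print(f"relative residual:           {float(rel_residual):.3e}")
print(f"relative error:              {float(rel_error):.3e}")
print("error bound holds:", rel_error <= rel_residual / rcond)
```

Exact rational arithmetic is used so that the numbers reflect the conditioning of the matrix itself rather than floating-point rounding. In this toy case a residual norm of the same order as those reported in the runs above says little about the accuracy of the computed solution, which is one reason to look at the model that produces the matrix.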