See
http://www.mcs.anl.gov/petsc/miscellaneous/mailing-lists.html
Writing Scientific Software: A Guide to Good Style
PETSc can be used with any kind of parallel system that supports MPI
BUT for any decent performance one needs
-
a fast, low-latency interconnect; any ethernet, even 10 gigE
simply cannot provide the needed performance.
-
high per-CPU memory performance. Each CPU (core in multi-core
systems) needs to have its own memory bandwidth of roughly 2 or
more gigabytes/second. For example, standard dual-processor "PCs" will
not provide better performance when the second processor is
used; that is, you will not see speed-up when you use the second
processor. This is because the speed of sparse matrix computations is
almost totally determined by the speed of the memory, not the speed of
the CPU. Smart process-to-core/socket binding may help you. For
example, consider using fewer processes than cores and binding
processes to separate sockets so that each process uses a different
memory bus:
- MPICH2 binding with the Hydra process manager
mpiexec.hydra -n 4 --binding cpu:sockets
- Open MPI binding
mpiexec -n 4 --bysocket --bind-to-socket --report-bindings
Other tools to manage affinity include
- taskset, part of the util-linux package
-
Usage: taskset [options] [mask | cpu-list] [pid|cmd [args...]], type man taskset for details.
Make sure to set affinity for your program, not for the mpiexec program.
- numactl
- In addition to task affinity, this tool also allows changing the default memory affinity policy.
On Linux, the default policy is to attempt to find memory on the same memory bus that serves the core that a thread is running on at whatever time the memory is faulted (not when malloc() is called).
If local memory is not available, it is found elsewhere, possibly leading to serious memory imbalances.
The option --localalloc allocates memory on the local NUMA node, similar to the numa_alloc_local() function in the libnuma library.
The option --cpunodebind=nodes binds the process to a given NUMA node (note that this can be larger or smaller than a CPU (socket); a NUMA node usually has multiple cores).
The option --physcpubind=cpus binds the process to a given processor core (numbered according to /proc/cpuinfo, therefore including logical cores if Hyper-threading is enabled).
With Open MPI, you can use knowledge of the NUMA hierarchy and core numbering on your machine to calculate the correct NUMA node or processor number given the environment variable OMPI_COMM_WORLD_LOCAL_RANK.
In most cases, it is easier to make mpiexec or a resource manager set affinities.
-
The software http://open-mx.org
provides faster speed for ethernet systems. We have not tried it, but it
claims it can dramatically reduce latency and increase bandwidth on
Linux systems. You must first install this software and then install
MPICH or Open MPI to use it.
-
In ${PETSC_DIR} run make streams and, when requested, enter the number of
cores your system has. The more the achieved memory bandwidth increases
as cores are added, the more performance you can expect across your
multiple cores. If the bandwidth does not increase significantly then you
cannot expect to get any improvement in parallel performance.
See the
licensing notice.
C enables us to build data structures for storing sparse matrices, solver
information, etc. in ways that Fortran simply does not allow. ANSI C is
a complete standard that all modern C compilers support. The language is
identical on all machines. C++ is still evolving and compilers on different
machines are not identical. Using C function pointers to provide data
encapsulation and polymorphism allows us to get many of the advantages of
C++ without using such a large and more complicated language. It would be
natural and reasonable to have coded PETSc in C++; we opted to use
C instead.
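The function-pointer technique mentioned above can be sketched in a few lines of plain C. This is only an illustration under stated assumptions: the names ToyMat, toy_mat_dense(), toy_mat_diagonal(), and toy_mat_mult() are hypothetical and do not reflect how the PETSc Mat object is actually laid out. The caller always goes through toy_mat_mult(); the storage format decides which routine runs.

```c
typedef struct ToyMat ToyMat;

/* Hypothetical toy object. In a real library this struct would live in a
   private header so that callers cannot look inside (data encapsulation). */
struct ToyMat {
  int           n;       /* the matrix is n x n */
  const double *values;  /* storage; the layout depends on the format */
  void (*mult)(const ToyMat *A, const double *x, double *y); /* y = A x */
};

/* Dense format: values holds n*n entries in row-major order. */
static void toy_mult_dense(const ToyMat *A, const double *x, double *y) {
  for (int i = 0; i < A->n; i++) {
    y[i] = 0.0;
    for (int j = 0; j < A->n; j++) y[i] += A->values[i * A->n + j] * x[j];
  }
}

/* Diagonal format: values holds only the n diagonal entries. */
static void toy_mult_diagonal(const ToyMat *A, const double *x, double *y) {
  for (int i = 0; i < A->n; i++) y[i] = A->values[i] * x[i];
}

/* Constructors select the format-specific routine once. */
ToyMat toy_mat_dense(int n, const double *values) {
  ToyMat A = { n, values, toy_mult_dense };
  return A;
}

ToyMat toy_mat_diagonal(int n, const double *diagonal) {
  ToyMat A = { n, diagonal, toy_mult_diagonal };
  return A;
}

/* The single public entry point (polymorphism): the same call works for
   every format because it dispatches through the function pointer. */
void toy_mat_mult(const ToyMat *A, const double *x, double *y) {
  A->mult(A, x, y);
}
```

Adding a new format in this sketch means writing one more multiply routine and one more constructor; code that calls toy_mat_mult() does not change.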
No.
- We work very efficiently.
-
We use Emacs for all editing; the etags feature makes navigating
and changing our source code very easy.
-
Our manual pages are generated automatically from
formatted comments in the code, thus alleviating the
need for creating and maintaining manual pages.
-
We employ automatic nightly tests of PETSc on several
different machine architectures. This process helps us
to discover problems the day after we have introduced
them rather than weeks or months later.
-
We are very careful in our design (and are constantly
revising our design) to make the package easy to use,
write, and maintain.
-
We are willing to do the grunt work of going through
all the code regularly to make sure that all code
conforms to our interface design. We will never
keep in a bad design decision simply because changing it
will require a lot of editing; we do a lot of editing.
-
We constantly seek out and experiment with new design
ideas; we retain the useful ones and discard the rest.
All of these decisions are based on practicality.
-
Function and variable names are chosen to be very
consistent throughout the software. Even the rules about
capitalization are designed to make it easy to figure out
the name of a particular object or routine. Our memories
are terrible, so careful consistent naming puts less
stress on our limited human RAM.
-
The PETSc directory tree is carefully designed to make
it easy to move throughout the entire package.
-
Our bug reporting system, based on email to [email protected],
makes it very simple to keep track of what bugs have been found and
fixed. In addition, the bug report system retains an archive of all
reported problems and fixes, so it is easy to refind fixes to
previously discovered problems.
-
We contain the complexity of PETSc by using object-oriented programming
techniques including data encapsulation (this is why your program
cannot, for example, look directly at what is inside the object Mat)
and polymorphism (you call MatMult() regardless of whether your matrix
is dense, sparse, parallel or sequential; you don't call a different
routine for each format).
- We try to provide the functionality requested by our users.
- We never sleep.
To use PETSc with complex numbers, run ./configure with the option
--with-scalar-type=complex
and either
--with-clanguage=c++
or, the default,
--with-clanguage=c
. In our experience the two will deliver very similar performance (speed),
but if you are concerned, just try both and see if one is faster.
Inner products and norms in PETSc are computed using the MPI_Allreduce()
command. In different runs the values can arrive at a given process (via
MPI) in a different order, thus the order in which some floating point
arithmetic operations are performed will be different. Since floating point
arithmetic is not associative, the computed quantity may be (slightly)
different. Over a run the many slight differences in the inner products and
norms will affect all the computed results. It is important to realize that
none of the computed answers are any less right or wrong (in fact the
sequential computation is no more right than the parallel ones); they are
all equally valid.
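The effect of summation order can be seen without MPI. The following plain C sketch (the function names are hypothetical, chosen for illustration) adds the same values in two different orders, which is a stand-in for contributions arriving in a different order during a reduction.

```c
#include <stddef.h>

/* Add the n values from first to last. */
double sum_forward(const double *v, size_t n) {
  double s = 0.0;
  for (size_t i = 0; i < n; i++) s += v[i];
  return s;
}

/* Add the same n values from last to first. */
double sum_backward(const double *v, size_t n) {
  double s = 0.0;
  for (size_t i = n; i > 0; i--) s += v[i - 1];
  return s;
}
```

For the three values 1.0, 1e100, -1e100 the forward sum is 0.0, because 1.0 is lost when it is added to 1e100, while the backward sum is 1.0, because the two large values cancel first. This example is deliberately extreme; in a real solver the differences are usually in the last few digits.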
The discussion above assumes that the exact same algorithm is being used on
the different number of processes. When the algorithm is different for the
different number of processes (almost all preconditioner algorithms except
Jacobi are different for different number of processes) then one expects to
see (and does) a greater difference in results for different numbers of
processes. In some cases (for example block Jacobi preconditioner) it may
be that the algorithm works for some number of processes and does not work
for others.
The convergence of many of the preconditioners in PETSc, including the
default parallel preconditioner block Jacobi, depends on the number of
processes. The more processes, the (slightly) slower the convergence.
This is the nature of iterative solvers: more parallelism means that
more "older" information is used in the solution process, hence slower
convergence.
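A small serial experiment can make this concrete. The sketch below is an illustration only, under several assumptions: it uses a simple stationary (Richardson) iteration rather than a Krylov method, the matrix is the 1D Laplacian tridiag(-1,2,-1), the right-hand side is all ones, and each diagonal block plays the role of one process. The names block_jacobi_iterations() and TOY_NMAX are hypothetical. The function returns how many block Jacobi iterations are needed to reduce the residual norm by the factor rtol.

```c
#define TOY_NMAX 256

/* Solve tridiag(-1,2,-1) z = r for one block of size m (Thomas algorithm). */
static void solve_block(int m, const double *r, double *z) {
  double c[TOY_NMAX], d[TOY_NMAX];
  c[0] = -0.5;
  d[0] = r[0] / 2.0;
  for (int i = 1; i < m; i++) {
    double denom = 2.0 + c[i - 1];
    c[i] = -1.0 / denom;
    d[i] = (r[i] + d[i - 1]) / denom;
  }
  z[m - 1] = d[m - 1];
  for (int i = m - 2; i >= 0; i--) z[i] = d[i] - c[i] * z[i + 1];
}

/* Block Jacobi iteration x <- x + M^{-1} (b - A x), where M keeps only the
   nblocks diagonal blocks of A. Assumes nblocks divides n and n <= TOY_NMAX.
   Returns the iteration count at convergence, or maxit if not converged. */
int block_jacobi_iterations(int n, int nblocks, double rtol, int maxit) {
  double x[TOY_NMAX] = {0.0}, r[TOY_NMAX], z[TOY_NMAX];
  int    m  = n / nblocks;  /* size of each block */
  double bb = (double)n;    /* squared 2-norm of b = (1,...,1) */

  for (int it = 0; it < maxit; it++) {
    double rr = 0.0;
    for (int i = 0; i < n; i++) {  /* residual r = b - A x */
      double left  = (i > 0)     ? x[i - 1] : 0.0;
      double right = (i < n - 1) ? x[i + 1] : 0.0;
      r[i] = 1.0 - (2.0 * x[i] - left - right);
      rr += r[i] * r[i];
    }
    if (rr <= rtol * rtol * bb) return it;  /* converged */
    for (int b = 0; b < nblocks; b++) solve_block(m, r + b * m, z + b * m);
    for (int i = 0; i < n; i++) x[i] += z[i];
  }
  return maxit;
}
```

With n = 16, a single block is an exact solve and converges in one iteration, two blocks need many more iterations, and sixteen blocks (point Jacobi) need more still. Couplings that cross a block boundary are only seen through the previous iterate, which is the "older" information referred to above.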
Here is an example:
[linux]% hg push
pushing to https://petsc.cs.iit.edu/petsc/petsc-dev
searching for changes
abort: push creates new remote branches!
This is almost always an indication that you have done serious harm to
your local repo. If you run hg heads and there is more than one head (which
causes this), then you know it's true.
Here is how it happens. You make some local changes, but do not commit.
You pull down and it aborts part way because you have "uncommitted local
changes". However, you do not hg rollback. Instead you just hg commit,
which creates another head. This is supposed to be a feature. I think it
should have a user disable.
Fixing this is complicated. Basically, you clone the repo before you made
head #2, then create the diff for the bad changeset that made head #2.
Apply it to the clone and check in, then pull the master.
PETSc-dev has some support for running portions of the computation on
Nvidia GPUs. See
PETSc GPUs for
more information. PETSc has a Vec class VECCUSP that performs almost all
the vector operations on the GPU. The Mat class MATCUSP performs
matrix-vector products on the GPU but does not have matrix assembly on the
GPU yet. Both of these classes run in parallel with MPI. All KSP methods,
except KSPIBCGS, run all their vector operations on the GPU; thus, for
example, Jacobi preconditioned Krylov methods run completely on the GPU.
Preconditioners are a problem; we could do with some help for these. The
example
src/snes/examples/tutorials/ex47cu.cu
demonstrates how the nonlinear function evaluation can be done on the
GPU.
Yes, with gcc 4.6 and later (and gfortran 4.6 and later). Run ./configure
with the options
--with-precision=__float128 --download-f2cblaslapack
External packages cannot be used in this mode and some print statements in
PETSc (those that use the %G format) will not print correctly.
We tried really hard but could not. The problem is that the QD C++ classes,
though they try to mimic the built-in data types such as double, are not
native types and cannot "just be used" in a general piece of numerical
source code; rather, the code has to be rewritten to live within the
limitations of the QD classes.
Assuming that the PETSc libraries have been successfully built for
a particular architecture and level of optimization, a new user must
merely:
-
Set the environment variable PETSC_DIR to the full
path of the PETSc home directory (for example,
/home/username/petsc).
-
Set the environment variable PETSC_ARCH, which indicates the
configuration on which PETSc will be used. Note that the PETSC_ARCH is
simply a name the installer used when installing the libraries. There
may be several on a single system, like mylinux-g for the debug
versions of the library and mylinux-O for the optimized version, or
petscdebug for the debug version and petscopt for the optimized
version.
-
Begin by copying one of the many PETSc examples (in, for example,
petsc/src/ksp/examples/tutorials) and its corresponding makefile.
-
See the introductory section of the PETSc users manual for tips on
documentation.
-
The directory ${PETSC_DIR}/docs contains a set of HTML manual pages
for use with a browser. You can delete these pages to save about 0.8
Mbyte of space.
-
The PETSc users manual is provided in PDF in
${PETSC_DIR}/docs/manual.pdf. You can delete this.
-
The PETSc test suite contains sample output for many of the examples.
These are contained in the PETSc directories
${PETSC_DIR}/src/*/examples/tutorials/output and
${PETSC_DIR}/src/*/examples/tests/output. Once you have run the test
examples, you may remove all of these directories to save about 300
Kbytes of disk space.
-
The debugging versions of the libraries are larger than the optimized
versions. In a pinch you can work with the optimized version although
we do not recommend it generally because finding bugs is much easier
with the debug version.
No, run ./configure with the option
--with-mpi=0
Yes. Run ./configure with the additional flag
--with-x=0
MPI is the message-passing standard. Because it is a standard, it will not
change over time; thus, we do not have to change PETSc every time the
provider of the message-passing system decides to make an interface change.
MPI was carefully designed by experts from industry, academia, and
government labs to provide the highest quality performance and capability.
For example, the careful design of communicators in MPI allows the easy
nesting of different libraries; no other message-passing system provides
this support. All of the major parallel computer vendors were involved in
the design of MPI and have committed to providing quality implementations.
In addition, since MPI is a standard, several different groups have already
provided complete free implementations. Thus, one does not have to rely on
the technical skills of one particular group to provide the message-passing
libraries. Today, MPI is the only practical, portable approach to writing
efficient parallel numerical software.
Most MPI implementations provide compiler wrappers (such as mpicc) which
give the include and link options necessary to use that version of MPI to
the underlying compilers. These wrappers are either absent or broken in
the MPI pointed to by --with-mpi-dir. You can rerun configure with the
additional option --with-mpi-compilers=0, which will try to auto-detect
working compilers; however, these compilers may be incompatible with the
particular MPI build. If this fix does not work, run with
--with-cc=c_compiler where you know c_compiler works with this particular
MPI, and likewise for C++ and Fortran.
By default the type that PETSc uses to index into arrays and keep sizes of
arrays is a PetscInt defined to be a 32 bit int. If your problem
- involves more than 2^31 - 1 unknowns (around 2 billion) OR
- your matrix might contain more than 2^31 - 1 nonzeros on a single process
then you need to use this option. Otherwise you will get strange crashes.
This option can be used when you are using either 32 bit or 64 bit
pointers. You do not need to use this option if you are using 64 bit
pointers unless the two conditions above hold.
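The two conditions above amount to a range check against the largest signed 32 bit integer. A minimal sketch of that check in plain C follows; the function name needs_64bit_indices() is hypothetical, and the option referred to above is presumably the ./configure option --with-64-bit-indices.

```c
#include <stdint.h>

/* Returns 1 if either count does not fit in a signed 32 bit integer
   (that is, exceeds 2^31 - 1), and 0 otherwise.
   global_unknowns:    total number of unknowns in the problem
   max_local_nonzeros: largest number of matrix nonzeros on any one process */
int needs_64bit_indices(int64_t global_unknowns, int64_t max_local_nonzeros) {
  return global_unknowns > INT32_MAX || max_local_nonzeros > INT32_MAX;
}
```

For example, a 1500 x 1500 x 1500 grid with one unknown per grid point has 3,375,000,000 unknowns, which is above 2^31 - 1 = 2,147,483,647, so the check returns 1.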
You can follow these steps
- grab petsc4py-dev repo [from hg]
- install Cython
- make cython [in petsc4py-dev]
- place petsc4py-dev in PETSC_DIR/externalpackages
- export ARCHFLAGS=''
- install PETSc with --download-petsc4py etc..
(as of 11/6/2010) We recommend installing gfortran from http://hpc.sourceforge.net. They
have gfortran-4.6.0 (experimental) for Snow Leopard (10.6) and gfortran
4.4.1 (prerelease) for Leopard (10.5).
Please contact Apple at http://www.apple.com/feedback
and urge them to bundle gfortran with future versions of Xcode.
To overload just the error messages write your own MyPrintError() function
that does whatever you want (including pop up windows etc) and use it like
below.
#include <windows.h>
#include <stdio.h>
#include <string.h>
#include <petscsys.h>
#include <mpi.h>
extern "C" {
  int PASCAL WinMain(HINSTANCE inst,HINSTANCE dumb,LPSTR param,int show);
}
const char help[] = "Set up from main";
/* Replacement error printer; this version only prints the message text */
int MyPrintError(const char error[], ...) {
  printf("%s", error);
  return 0;
}
int main(int ac, char *av[]) {
  char buf[256];
  int i;
  HINSTANCE inst;
  inst=(HINSTANCE)GetModuleHandle(NULL);
  PetscErrorPrintf = MyPrintError;
  /* join the command line arguments without writing past the end of buf */
  buf[0]=0;
  for (i=1; i<ac; i++) {
    strncat(buf,av[i],sizeof(buf)-strlen(buf)-1);
    strncat(buf," ",sizeof(buf)-strlen(buf)-1);
  }
  PetscInitialize(&ac, &av, PETSC_NULL, help);
  return WinMain(inst,NULL,buf,SW_SHOWNORMAL);
}
Place this file in the project and compile with these preprocessor definitions:
WIN32,_DEBUG,_CONSOLE,_MBCS,USE_PETSC_LOG,USE_PETSC_BOPT_g,USE_PETSC_STACK,_AFXDLL
And these link options: /nologo /subsystem:console /incremental:yes
/debug /machine:I386 /nodefaultlib:"libcmtd.lib"
/nodefaultlib:"libcd.lib" /nodefaultlib:"mvcrt.lib"
/pdbtype:sept
Note that it is compiled and linked as if it was a console program. The
linker will search for a main, and then from it the WinMain will start.
This works with MFC templates and derived classes too.
Note: When writing a Windows console application you do not need to do
anything; stdout and stderr are automatically output to the console
window.
To change where all PETSc stdout and stderr go, you can reassign
PetscVFPrintf() to handle stdout and stderr any way you like. Write the
following function:
#include <stdarg.h>
#include <petscsys.h>
#define BIG 1024 /* buffer size; an assumed value, choose one large enough for your messages */
PetscErrorCode mypetscvfprintf(FILE *fd, const char format[], va_list Argp) {
  PetscErrorCode ierr;
  PetscFunctionBegin;
  if (fd != stdout && fd != stderr) { /* handle regular files */
    ierr = PetscVFPrintfDefault(fd,format,Argp);CHKERRQ(ierr);
  } else {
    char buff[BIG];
    int length;
    ierr = PetscVSNPrintf(buff,BIG,format,&length,Argp);CHKERRQ(ierr);
    /* now send buff to whatever stream or whatever you want */
  }
  PetscFunctionReturn(0);
}
and assign
PetscVFPrintf = mypetscvfprintf;
before
PetscInitialize()
in your main program.
You should run with -ksp_type richardson to have PETSc run several V or
W cycles. -ksp_type preonly causes boomerAMG to use only one V/W cycle.
You can control how many cycles are used in a single application of the
boomerAMG preconditioner with
-pc_hypre_boomeramg_max_iter <it>
(the default is 1). You can also control the tolerance
boomerAMG uses to decide whether to stop before max_iter with
-pc_hypre_boomeramg_tol <tol>
(the default is 1.e-7).
Run with
-ksp_view
to see all the hypre options used and
-help | grep boomeramg
to see all the command line options.
Just for historical reasons; the SBAIJ format with blocksize one is just as
efficient as an SAIJ would be.
PETSc includes Additive Schwarz methods in the suite of preconditioners.
These may be activated with the runtime option -pc_type asm.
Various other options may be set, including the degree of overlap
-pc_asm_overlap <number>
the type of restriction/extension
-pc_asm_type [basic,restrict,interpolate,none]
and several others. You may see the available ASM options by using
-pc_type asm -help
Also, see the procedural interfaces in the
manual pages, with names PCASMxxxx(),
and check the index of the
users manual
for PCASMxxx().
PETSc also contains a domain decomposition inspired wirebasket or face
based two level method where the coarse mesh to fine mesh interpolation
is defined by solving specific local subdomain problems. It currently
only works for 3D scalar problems on structured grids created with PETSc
DMDAs. See the manual page for PCEXOTIC and
src/ksp/ksp/examples/tutorials/ex45.c for an example.
PETSc also contains a balancing Neumann-Neumann preconditioner, see the
manual page for PCNN. This requires matrices be constructed with
MatCreateIS()
via the finite element method. There are currently
no examples that demonstrate its use.
Sorry, this is not possible; the BAIJ format only supports a single fixed
block size on the entire matrix. But the AIJ format automatically searches
for matching rows and thus still takes advantage of the natural blocks in
your matrix to obtain good performance. Unfortunately you cannot use
MatSetValuesBlocked().
- On each process create a local vector large enough to hold all the values it wishes to access
- Create a VecScatter that scatters from the parallel vector into the local vectors
- Use VecGetArray() to access the values in the local vector
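-
Create the scatter context that will do the communication
VecScatterCreateToAll(v,&ctx,&w);
-
Actually do the communication; this can be done repeatedly as needed
VecScatterBegin(ctx,v,w,INSERT_VALUES,SCATTER_FORWARD);
VecScatterEnd(ctx,v,w,INSERT_VALUES,SCATTER_FORWARD);
-
Remember to free the scatter context when no longer needed
VecScatterDestroy(&ctx);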
Note that this simply concatenates in the parallel ordering of the vector.
If you are using a vector from DMCreateGlobalVector() you likely want to
first call DMDAGlobalToNaturalBegin/End() to scatter the original vector
into the natural ordering in a new global vector before calling
VecScatterBegin/End() to scatter the natural vector onto all processes.
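-
Create the scatter context that will do the communication
VecScatterCreateToZero(v,&ctx,&w);
-
Actually do the communication; this can be done repeatedly as needed
VecScatterBegin(ctx,v,w,INSERT_VALUES,SCATTER_FORWARD);
VecScatterEnd(ctx,v,w,INSERT_VALUES,SCATTER_FORWARD);
-
Remember to free the scatter context when no longer needed
VecScatterDestroy(&ctx);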
Note that this simply concatenates in the parallel ordering of the vector.
If you are using a vector from DMCreateGlobalVector() you likely want to
first call DMDAGlobalToNaturalBegin/End() to scatter the original vector
into the natural ordering in a new global vector before calling
VecScatterBegin/End() to scatter the natural vector onto process 0.
See the examples in src/mat/examples/tests, specifically ex72.c, ex78.c,
and ex32.c. You will likely need to modify the code slightly to match your
required ASCII format. Note: Never read or write an ASCII matrix file in
parallel. Instead, for reading: read the file in sequentially with a
standalone code based on ex72.c, ex78.c, or ex32.c, then save the matrix
with the binary viewer PetscViewerBinaryOpen() and load the matrix in
parallel in your "real" PETSc program with MatLoad(). For writing: save with
the binary viewer and then load with the sequential code to store it as
ASCII.
If XXSetFromOptions() is used (with -xxx_type aaaa) to change the type of
the object then all parameters associated with the previous type are
removed. Otherwise it does not reset parameters.
TS/SNES/KSPSetXXX() commands that set properties for a particular type of
object (such as KSPGMRESSetRestart()) ONLY work if the object is ALREADY
of that type. For example, with
KSPCreate(PETSC_COMM_WORLD,&ksp); KSPGMRESSetRestart(ksp,10);
the restart will be ignored since the type has not yet been set to GMRES.
To have those values take effect you should do one of the following:
XXXCreate(..,&obj);
-
XXXSetFromOptions(obj); allow setting the type from the command line; if
it is not on the command line then the default type is automatically set
-
XXXSetYYYYY(obj,...); if the obj is the appropriate type then the
operation takes place
-
XXXSetFromOptions(obj); allow user to overwrite options hardwired in code
(optional)
-
The other approach is to replace the first XXXSetFromOptions() with
XXXSetType(obj,type) and hardwire the type at that point.
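The ordering rule can be illustrated with a toy model in plain C. Everything here is hypothetical (ToySolver, toy_create(), toy_set_type(), toy_gmres_set_restart()) and it is not how PETSc implements its objects; it only mimics the behavior described above: a type-specific setter is silently ignored unless the object already has that type, and changing the type resets the type-specific parameters. The default restart of 30 matches the PETSc GMRES default.

```c
#include <string.h>

/* Hypothetical stand-in for a solver object. */
typedef struct {
  char type[16];  /* empty string until a type has been set */
  int  restart;   /* only meaningful when type is "gmres" */
} ToySolver;

/* Create the object; no type is set yet. */
void toy_create(ToySolver *s) {
  s->type[0] = '\0';
  s->restart = 30;
}

/* Set (or change) the type; type-specific parameters return to defaults. */
void toy_set_type(ToySolver *s, const char *type) {
  strncpy(s->type, type, sizeof(s->type) - 1);
  s->type[sizeof(s->type) - 1] = '\0';
  s->restart = 30;
}

/* Type-specific setter: does nothing unless the solver is already "gmres". */
void toy_gmres_set_restart(ToySolver *s, int restart) {
  if (strcmp(s->type, "gmres") == 0) s->restart = restart;
}
```

Calling toy_gmres_set_restart(&s, 10) directly after toy_create(&s) leaves the restart at 30, while calling it after toy_set_type(&s, "gmres") changes it to 10. This mirrors why the type must be set before the type-specific routine is called.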
Yes, see the section of the
users manual called Makefiles
Use the FindPETSc.cmake module from
this repository.
See the CMakeLists.txt from
Dohp for example usage.
You can use the same notation as in C, just put a \n in the string. Note
that no other C format instruction is supported.
Or you can use the Fortran concatenation // and char(10); for example
'some string'//char(10)//'another string on the next line'
The update in Newton's method is computed as u^{n+1} = u^n - lambda
* approx-inverse[J(u^n)] * F(u^n). The reasons PETSc doesn't default to
computing both the function and Jacobian at the same time are
-
In order to do the line search, F(u^n - lambda * step) may need to be
computed for several lambda. The Jacobian is not needed for each of
those, and one does not know in advance which will be the final lambda
until after the function value is computed, so many extra Jacobians would
be computed.
-
In the final step if || F(u^p)|| satisfies the convergence criteria
then a Jacobian need not be computed.
You are free to have your "FormFunction" compute as
much of the Jacobian at that point as you like, keep
the information in the user context (the final
argument to FormFunction and FormJacobian) and then
retrieve the information in your FormJacobian()
function.
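The imbalance between function and Jacobian evaluations can be seen in a scalar sketch. This is an illustration only, not the SNES implementation: the model problem F(u) = u / (1 + |u|), the simple "accept any decrease" backtracking rule, and all names (EvalCounts, newton_line_search()) are assumptions made for the example. The counters record how often F and J are evaluated.

```c
typedef struct {
  int nfunction;  /* number of F evaluations */
  int njacobian;  /* number of J evaluations */
} EvalCounts;

static double absval(double v) { return v < 0.0 ? -v : v; }

/* Model problem: F(u) = u / (1 + |u|), whose only root is u = 0. */
static double F(double u, EvalCounts *c) {
  c->nfunction++;
  return u / (1.0 + absval(u));
}

/* Its derivative: J(u) = 1 / (1 + |u|)^2. */
static double J(double u, EvalCounts *c) {
  double t = 1.0 + absval(u);
  c->njacobian++;
  return 1.0 / (t * t);
}

/* Newton's method with a backtracking line search on |F|. */
double newton_line_search(double u, double tol, int maxit, EvalCounts *c) {
  double f;
  c->nfunction = 0;
  c->njacobian = 0;
  f = F(u, c);
  for (int it = 0; it < maxit && absval(f) > tol; it++) {
    double step = f / J(u, c);  /* one Jacobian per Newton step */
    double lambda = 1.0, utrial, ftrial;
    for (;;) {  /* every trial lambda costs one function evaluation */
      utrial = u - lambda * step;
      ftrial = F(utrial, c);
      if (absval(ftrial) < absval(f) || lambda < 1e-12) break;
      lambda *= 0.5;  /* halve the step and try again */
    }
    u = utrial;
    f = ftrial;
  }
  return u;  /* no Jacobian is evaluated at the converged point */
}
```

Starting from u = 1.5 the full Newton step overshoots, so the first step needs more than one trial lambda, and the run ends with more function evaluations than Jacobian evaluations. Computing a Jacobian with every function value would have wasted the Jacobians at the rejected trial points and at the final iterate.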
For small matrices, the condition number can be reliably computed using
-pc_type svd -pc_svd_monitor
. For larger matrices, you can run with
-pc_type none -ksp_type gmres -ksp_monitor_singular_value -ksp_gmres_restart 1000
to get approximations to the condition number of the operator. This will
generally be accurate for the largest singular values, but may overestimate
the smallest singular value unless the method has converged. Make sure to
avoid restarts. To estimate the condition number of the preconditioned
operator, use
-pc_type somepc
in the last command.
It is very expensive to compute the inverse of a matrix and very rarely
needed in practice. We highly recommend avoiding algorithms that need it.
The inverse of a matrix (dense or sparse) is essentially always dense, so
begin by creating a dense matrix B and fill it with the identity matrix
(ones along the diagonal), also create a dense matrix X of the same size
that will hold the solution. Then factor the matrix you wish to invert with
MatLUFactor() or MatCholeskyFactor(), call the result A. Then call MatMatSolve(A,B,X)
to compute the inverse into X.
See also.
It is very expensive to compute the Schur complement of a matrix and very
rarely needed in practice. We highly recommend avoiding algorithms that
need it. The Schur complement of a matrix (dense or sparse) is essentially
always dense, so begin by
- forming a dense matrix Kab,
- also create another dense matrix T of the same size.
-
Then either factor the matrix Kaa directly with MatLUFactor() or
MatCholeskyFactor(), or use MatGetFactor() followed by
MatLUFactorSymbolic() followed by MatLUFactorNumeric() if you wish to
use an external solver package like SuperLU_Dist. Call the result A.
- Then call MatMatSolve(A,Kab,T).
- Then call MatMatMult(Kba,T,MAT_INITIAL_MATRIX,1.0,&S).
- Now call MatAXPY(S,-1.0,Kbb,SUBSET_NONZERO_PATTERN).
- Followed by MatScale(S,-1.0);
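The arithmetic behind these steps can be sketched on small dense arrays in plain C, without PETSc. The sketch assumes the block convention K = [Kaa, Kab; Kba, Kbb], so that S = Kbb - Kba inv(Kaa) Kab, with Kaa of size na x na, Kab of size na x nb, Kba of size nb x na, and Kbb of size nb x nb, all stored row-major. The function names dense_solve() and schur_complement() are hypothetical.

```c
static double absval(double v) { return v < 0.0 ? -v : v; }

/* Solve A X = B by Gaussian elimination with partial pivoting.
   A is n x n and B is n x m, both row-major. On return B holds X and A is
   overwritten. Returns 0 on success, -1 if a zero pivot is found. */
int dense_solve(int n, int m, double *A, double *B) {
  for (int k = 0; k < n; k++) {
    int p = k;
    for (int i = k + 1; i < n; i++)
      if (absval(A[i * n + k]) > absval(A[p * n + k])) p = i;
    if (A[p * n + k] == 0.0) return -1;
    if (p != k) {  /* swap rows k and p of A and B */
      for (int j = 0; j < n; j++) {
        double t = A[k * n + j]; A[k * n + j] = A[p * n + j]; A[p * n + j] = t;
      }
      for (int j = 0; j < m; j++) {
        double t = B[k * m + j]; B[k * m + j] = B[p * m + j]; B[p * m + j] = t;
      }
    }
    for (int i = k + 1; i < n; i++) {  /* eliminate below the pivot */
      double l = A[i * n + k] / A[k * n + k];
      for (int j = k; j < n; j++) A[i * n + j] -= l * A[k * n + j];
      for (int j = 0; j < m; j++) B[i * m + j] -= l * B[k * m + j];
    }
  }
  for (int k = n - 1; k >= 0; k--)  /* back substitution */
    for (int j = 0; j < m; j++) {
      double s = B[k * m + j];
      for (int i = k + 1; i < n; i++) s -= A[k * n + i] * B[i * m + j];
      B[k * m + j] = s / A[k * n + k];
    }
  return 0;
}

/* S = Kbb - Kba * inv(Kaa) * Kab. Kaa and Kab are overwritten; on return
   Kab holds T = inv(Kaa) * Kab. Returns 0 on success, -1 if Kaa is singular. */
int schur_complement(int na, int nb, double *Kaa, double *Kab,
                     const double *Kba, const double *Kbb, double *S) {
  if (dense_solve(na, nb, Kaa, Kab) != 0) return -1;  /* T = inv(Kaa) Kab */
  for (int i = 0; i < nb; i++)
    for (int j = 0; j < nb; j++) {
      double s = Kbb[i * nb + j];
      for (int k = 0; k < na; k++) s -= Kba[i * na + k] * Kab[k * nb + j];
      S[i * nb + j] = s;  /* S = Kbb - Kba T */
    }
  return 0;
}
```

For Kaa = diag(2, 4), Kab = (1, 2)^T, Kba = (3, 4), and Kbb = 10, the solve gives T = (0.5, 0.5)^T and the result is S = 10 - (3*0.5 + 4*0.5) = 6.5. Even in this tiny case the work is dominated by the solve with Kaa against every column of Kab, which is why the explicit computation is expensive for large problems.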
For computing Schur complements like this it does not make sense to use the
KSP iterative solvers since for solving many moderate size problems using
a direct factorization is much faster than iterative solvers. As you can
see, this requires a great deal of work space and computation so is best
avoided. However, it is not necessary to assemble the Schur complement
S in order to solve systems with it.
Use MatCreateSchurComplement(Kaa,Kaa_pre,Kab,Kba,Kbb,&S) to create
a matrix that applies the action of S (using Kaa_pre to solve with Kaa),
but does not assemble. Alternatively, if you already have a block matrix
K = [Kaa, Kab; Kba, Kbb] (in some ordering), then you can create index sets
(IS) isa and isb to address each block, then use MatGetSchurComplement() to
create the Schur complement and/or an approximation suitable for
preconditioning. Since S is generally dense, standard preconditioning
methods cannot typically be applied directly to Schur complements. There
are many approaches to preconditioning Schur complements including using
the SIMPLE approximation Kbb - Kba inv(diag(Kaa)) Kab to create a sparse
matrix that approximates the Schur complement (this is returned by default
for the optional "preconditioning" matrix in MatGetSchurComplement()). An
alternative is to interpret the matrices as differential operators and
apply approximate commutator arguments to find a spectrally equivalent
operation that can be applied efficiently (see the "PCD" preconditioners
from Elman, Silvester, and Wathen). A variant of this is the least squares
commutator, which is closely related to the Moore-Penrose pseudoinverse,
and is available in PCLSC which operates on matrices of type
MATSCHURCOMPLEMENT.
There are at least two ways to write a finite element code using PETSc
-
use the Sieve construct in PETSc. This is a high
level approach that uses a small number of
abstractions to help you manage distributing the
grid data structures and computing the element contributions into
the matrices.
-
manage the grid data structure yourself and use
PETSc IS and VecScatter to perform the required
ghost point communication. See src/snes/examples/tutorials/ex10d/ex10.c
The MPI_Cart_create() first divides the mesh along the z direction, then
the y, then the x. DMDA divides along the x, then y, then z. Thus, for
example, rank 1 of the processes will be in a different part of the mesh
for the two schemes. To resolve this you can create a new MPI
communicator that you pass to DMDACreate() that renumbers the process
ranks so that each physical process shares the same part of the mesh with
both the DMDA and the MPI_Cart_create(). The code to determine the new
numbering was provided by Rolf Kuiper.
// the numbers of processors per direction are (int) x_procs, y_procs, z_procs respectively
// (no parallelization in direction 'dir' means dir_procs = 1)
MPI_Comm NewComm;
int MPI_Rank, NewRank, x,y,z;
// get rank from MPI ordering:
MPI_Comm_rank(MPI_COMM_WORLD, &MPI_Rank);
// calculate coordinates of cpus in MPI ordering:
x = MPI_Rank / (z_procs*y_procs);
y = (MPI_Rank % (z_procs*y_procs)) / z_procs;
z = (MPI_Rank % (z_procs*y_procs)) % z_procs;
// set new rank according to PETSc ordering:
NewRank = z*y_procs*x_procs + y*x_procs + x;
// create communicator with new ranks according to PETSc ordering:
MPI_Comm_split(PETSC_COMM_WORLD, 1, NewRank, &NewComm);
// override the default communicator (was MPI_COMM_WORLD as default)
PETSC_COMM_WORLD = NewComm;
For nonsymmetric systems put the appropriate boundary solutions in the
x vector and use MatZeroRows() followed by KSPSetOperators(). For symmetric
problems use MatZeroRowsColumns() instead. If you have many Dirichlet
locations you can use MatZeroRows() (not MatZeroRowsColumns()) and
-ksp_type preonly -pc_type redistribute (see the manual page for
PCREDISTRIBUTE) and PETSc will repartition the parallel matrix for load
balancing; in this case the new matrix solved remains symmetric even though
MatZeroRows() is used.
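What the two approaches do to the linear system can be sketched on a small dense matrix in plain C. This is an illustration of the idea only, with hypothetical function names; it is not the PETSc implementation and it uses a diagonal value of 1. Zeroing only the row imposes the boundary value but destroys symmetry, while also eliminating the column (after moving its known contribution to the right-hand side) keeps a symmetric matrix symmetric.

```c
/* Impose u[k] = value by zeroing row k of the dense n x n row-major matrix A,
   putting 1 on the diagonal, and setting b[k] = value. */
void dirichlet_zero_row(int n, double *A, double *b, int k, double value) {
  for (int j = 0; j < n; j++) A[k * n + j] = 0.0;
  A[k * n + k] = 1.0;
  b[k] = value;
}

/* Symmetric variant: first move the known contribution A[i][k] * value of
   column k to the right-hand side and zero that column, then zero the row. */
void dirichlet_zero_row_column(int n, double *A, double *b, int k, double value) {
  for (int i = 0; i < n; i++) {
    if (i != k) {
      b[i] -= A[i * n + k] * value;
      A[i * n + k] = 0.0;
    }
  }
  dirichlet_zero_row(n, A, b, k, value);
}
```

For A = [2, -1; -1, 2], b = (0, 0), and the condition u[0] = 5, the row-only version gives A = [1, 0; -1, 2] and b = (5, 0), which is no longer symmetric. The row-and-column version gives A = [1, 0; 0, 2] and b = (5, 5). Both systems have the same solution u = (5, 2.5).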
An alternative approach is, when assembling the matrix (generating values
and passing them to the matrix), to never include locations for the
Dirichlet grid points in the vector and matrix; instead take them into
account as you put the other values into the load.
There are five ways to work with PETSc and MATLAB
-
Using the MATLAB Engine, this allows PETSc to automatically call MATLAB
to perform some specific computations. It does not allow MATLAB to be
used interactively by the user. See the PetscMatlabEngine.
-
To save PETSc Mats and Vecs to files that can be read from MATLAB, use a
PetscViewerBinaryOpen() viewer and VecView() or MatView() to save objects
for MATLAB, and VecLoad() and MatLoad() to get the objects that MATLAB has
saved. See PetscBinaryRead.m and PetscBinaryWrite.m in bin/matlab for
loading and saving the objects in MATLAB.
-
You can open a socket connection between MATLAB and PETSc to allow
sending objects back and forth between an interactive MATLAB session
and a running PETSc program. See
PetscViewerSocketOpen()
for access from the PETSc side and PetscOpenSocket in bin/matlab for
access from the MATLAB side.
-
You can save PETSc Vecs (not Mats) with the PetscViewerMatlabOpen()
viewer that saves .mat files, which can then be loaded into MATLAB.
-
We are just beginning to develop in petsc-dev an API to call most of
the PETSc functions directly from MATLAB; we could use help in
developing this. See bin/matlab/classes/PetscInitialize.m
Steps I used:
-
Learn how to build a Cython module
-
Go through the simple example provided by Denis here.
Note also the next comment that shows how to create numpy arrays in the Cython and pass them back.
-
Check out this page which tells you how to get fast indexing
-
Have a look at the petsc4py array source
We find this annoying as well. On most machines PETSc can use shared
libraries, so executables should be much smaller; run ./configure with the
additional option --with-shared-libraries. Also, if you have room,
compiling and linking PETSc on your machine's /tmp disk or similar local
disk, rather than over the network, will be much faster.
Running the PETSc program with the option -help will print many of the
options. To print the options that have been specified within a program,
employ -optionsleft to print any options that the user specified but were
not actually used by the program and all options used; this is helpful for
detecting typos.
You can use the option -info to get more details about the solution
process. The option -log_summary provides details about the distribution of
time spent in the various phases of the solution process. You can run with
-ts_view or -snes_view or -ksp_view to see what solver options are being
used. Run with -ts_monitor -snes_monitor or -ksp_monitor to watch
convergence of the methods. -snes_converged_reason and
-ksp_converged_reason will indicate why and if the solvers have converged.
See the
Performance chapter of the users manual for many tips on this.
-
Preallocate enough space for the sparse matrix. For example, rather
than calling MatCreateSeqAIJ(comm,n,n,0,PETSC_NULL,&mat); call
MatCreateSeqAIJ(comm,n,n,rowmax,PETSC_NULL,&mat); where rowmax is
the maximum number of nonzeros expected per row. Or if you know the
number of nonzeros per row, you can pass this information in instead of
the PETSC_NULL argument. See the manual pages for each of the
MatCreateXXX() routines.
-
Insert blocks of values into the matrix, rather than individual components.
Preallocation of matrix memory is crucial for good performance for large problems, see:
If you can set several nonzeros in a block at the same time, this is faster
than calling MatSetValues() for each individual matrix entry.
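As a rough illustration of the per-row counts mentioned above, the sketch below computes the number of nonzeros in each row for a 5-point finite-difference stencil on an nx-by-ny grid. It is plain Python, not PETSc code; the function name nnz_per_row_5pt and the natural row ordering are assumptions made only for this example. The resulting list is the kind of per-row information one could pass in place of PETSC_NULL, and its maximum is a valid rowmax.

```python
def nnz_per_row_5pt(nx, ny):
    """Count nonzeros per matrix row for a 5-point stencil on an nx-by-ny grid.

    Rows are numbered in natural ordering: row = j * nx + i (an assumption
    of this sketch, not something PETSc requires).
    """
    counts = []
    for j in range(ny):
        for i in range(nx):
            n = 1  # diagonal entry
            if i > 0:
                n += 1  # west neighbor
            if i < nx - 1:
                n += 1  # east neighbor
            if j > 0:
                n += 1  # south neighbor
            if j < ny - 1:
                n += 1  # north neighbor
            counts.append(n)
    return counts


nnz = nnz_per_row_5pt(3, 3)
rowmax = max(nnz)  # upper bound usable as the single "rowmax" argument
```

Passing exact per-row counts avoids over-allocating boundary rows, which matters more as the stencil or the block size grows.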
It is best to generate most matrix entries on the process they belong to
(so they do not have to be stashed and then shipped to the owning process).
Note: it is fine to have some entries generated on the "wrong" process,
just not many.
Use this option at runtime: -log_summary. See the
Performance chapter of the users manual
for information on interpreting the summary data. If using the PETSc
(non)linear solvers, one can also specify -snes_view or -ksp_view for
a printout of solver info. Only the highest level PETSc object used needs
to specify the view option.
Most commonly, you are using a preconditioner which behaves differently
based upon the number of processors, such as Block-Jacobi which is the
PETSc default. However, since computations are reordered in parallel, small
roundoff errors will still be present with identical mathematical
formulations. If you set a tighter linear solver tolerance (using
-ksp_rtol), the differences will decrease.
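The effect of reordering can be seen without any solver at all. The following minimal sketch (plain Python, values chosen only for illustration) adds the same three numbers with two different groupings, which is roughly what happens when a parallel reduction combines partial sums in a different order than a sequential loop would.

```python
import math

values = [0.1, 0.2, 0.3]

# Same mathematical sum, grouped two different ways.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

# The two results are not bitwise identical ...
differs = left_to_right != right_to_left
# ... but they agree to roughly machine precision.
tiny = math.isclose(left_to_right, right_to_left, rel_tol=1e-12)
```

An iterative solver amplifies or damps such differences depending on conditioning, which is why a tighter tolerance brings the results closer together.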
Run with
-log_summary
and
-pc_mg_log
Some makefiles use ${DATAFILESPATH}/matrices/medium and other files. These
test matrices in PETSc binary format can be found with anonymous ftp from
ftp.mcs.anl.gov in the directory
pub/petsc/matrices. They are not included with the PETSc distribution in the
interest of reducing the distribution size.
PETSc binary viewers put some additional information into .info files, like
the matrix block size; it is harmless, but if you really don't like it you can
use -viewer_binary_skip_info or PetscViewerBinarySkipInfo(). Note that you need
to call PetscViewerBinarySkipInfo() before PetscViewerFileSetName(). In
other words, you cannot use PetscViewerBinaryOpen() directly.
This can happen for many reasons:
-
First make sure it is truly the time in KSPSolve() that is slower (by
running the code with -log_summary).
Often the slower time is in generating the matrix
or some other operation.
-
There must be enough work for each process to outweigh the
communication time. We recommend an absolute minimum of about 10,000
unknowns per process, better is 20,000 or more.
-
Make sure the communication speed of the parallel computer
is good enough for parallel solvers.
-
Check the number of solver iterates with the parallel solver against
the sequential solver. Most preconditioners require more iterations
when used on more processes; this is particularly true for block
Jacobi, the default parallel preconditioner. You can try -pc_type asm
(PCASM);
its iterations scale a bit better for more processes. You may also consider
multigrid preconditioners like PCMG
or BoomerAMG in PCHYPRE.
PETSc does NOT do any explicit conversion of single precision to double
before performing computations; it depends on the hardware and
compiler what happens. For example, the compiler could choose to put the
single precision numbers into the usual double precision registers and then
use the usual double precision floating point unit. Or it could use SSE2
instructions that work directly on the single precision numbers. It is
a bit of a mystery what decisions get made sometimes. There may be compiler
flags in some circumstances that can affect this.
Newton's method may not converge for many reasons; here are some of the most common.
- The Jacobian is wrong (or correct in sequential but not in parallel).
- The linear system is not solved or is not solved accurately enough.
- The Jacobian system has a singularity that the linear solver is not handling.
- There is a bug in the function evaluation routine.
- The function is not continuous or does not have continuous first derivatives (e.g. phase change or TVD limiters).
-
The equations may not have a solution (e.g. limit cycle instead of
a steady state) or there may be a "hill" between the initial guess and
the steady state (e.g. reactants must ignite and burn before reaching
a steady state, but the steady-state residual will be larger during
combustion).
Here are some of the ways to help debug lack of convergence of Newton.
- Run with the options
-snes_monitor -ksp_monitor_true_residual -snes_converged_reason -ksp_converged_reason
.
-
If the linear solve does not converge, check if the Jacobian is
correct, then see this question.
-
If the preconditioned residual converges, but the true residual
does not, the preconditioner may be singular.
-
If the linear solve converges well, but the line search fails, the
Jacobian may be incorrect.
-
Run with
-pc_type lu
or -pc_type svd
to see
if the problem is a poor linear solver
-
Run with
-mat_view
or -mat_view_draw
to see
if the Jacobian looks reasonable
-
Run with
-snes_type test -snes_test_display
to see if the
Jacobian you are using is wrong. Compare the output when you add
-mat_fd_type ds
to see if the result is sensitive to the
choice of differencing parameter.
-
Run with
-snes_mf_operator -pc_type lu
to see if the
Jacobian you are using is wrong. If the problem is too large for
a direct solve, try -snes_mf_operator -pc_type ksp -ksp_ksp_rtol
1e-12
. Compare the output when you add -mat_mffd_type
ds
to see if the result is sensitive to choice of differencing
parameter.
- Run on one processor to see if the problem is only in parallel.
-
Run with
-snes_ls_monitor
to see if the line search is failing
(this is usually a sign of a bad Jacobian). Use -info instead in PETSc 3.1 and
older versions, and -snes_linesearch_monitor
in PETSc-dev.
-
Run with
-info
to get more detailed information on the
solution process.
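The idea behind comparing a hand-coded Jacobian with a differenced one, as the Jacobian-testing options above do, can be sketched in a few lines. This is plain Python and not how PETSc implements the test; the residual, the evaluation point, and the step size h are all hypothetical choices for illustration.

```python
def residual(x):
    """Hypothetical two-equation residual F(x), used only for illustration."""
    return [x[0] ** 2 + x[1] - 3.0, x[0] + x[1] ** 3 - 5.0]


def jacobian(x):
    """Hand-coded Jacobian of residual(); this is the thing being checked."""
    return [[2.0 * x[0], 1.0], [1.0, 3.0 * x[1] ** 2]]


def fd_jacobian(f, x, h=1e-7):
    """Approximate the Jacobian column by column with forward differences."""
    f0 = f(x)
    n = len(x)
    J = [[0.0] * n for _ in range(len(f0))]
    for j in range(n):
        xp = list(x)
        xp[j] += h  # perturb one unknown at a time
        fp = f(xp)
        for i in range(len(f0)):
            J[i][j] = (fp[i] - f0[i]) / h
    return J


def max_abs_difference(A, B):
    """Largest entrywise difference between two dense matrices."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


x0 = [1.5, -0.5]
error = max_abs_difference(jacobian(x0), fd_jacobian(residual, x0))
```

A small error suggests the hand-coded Jacobian is consistent with the residual; a large one usually points to a wrong or missing derivative term. Repeating the comparison with a different h gives a feel for how sensitive the result is to the differencing parameter.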
Here are some ways to help the Newton process if everything above checks out
-
Run with grid sequencing (
-snes_grid_sequence
if working
with a DM is all you need) to generate a better initial guess on your
finer mesh.
-
Run with quad precision (./configure with --with-precision=__float128
--download-f2cblaslapack with PETSc 3.2 and later and recent versions
of the GNU compilers)
-
Change the units (nondimensionalization), boundary condition scaling,
or formulation so that the Jacobian is better conditioned.
-
Mollify features in the function that do not have continuous first
derivatives (often occurs when there are "if" statements in the
residual evaluation, e.g. phase change or TVD limiters). Use
a variational inequality solver (SNESVIRS) if the discontinuities are
of fundamental importance.
-
Try a trust region method (
-snes_type tr
, may have to adjust
parameters).
-
Run with some continuation parameter from a point where you know the
solution, see TSPSEUDO for steady-states.
-
There are homotopy solver packages like PHCpack that can get you all
possible solutions (and tell you that it has found them all) but those
are not scalable and cannot solve anything but small problems.
Always run with -ksp_converged_reason -ksp_monitor_true_residual
when trying to learn why a method is not converging. Common reasons for
KSP not converging are
-
The equations are singular by accident (e.g. forgot to impose boundary
conditions). Check this for a small problem using
-pc_type svd
-pc_svd_monitor
.
-
The equations are intentionally singular (e.g. constant null space),
but the Krylov method was not informed, see KSPSetNullSpace().
-
The equations are intentionally singular and KSPSetNullSpace() was
used, but the right hand side is not consistent. You may have to call
MatNullSpaceRemove() on the right hand side before calling KSPSolve().
-
The equations are indefinite so that standard preconditioners don't
work. Usually you will know this from the physics, but you can check
with
-ksp_compute_eigenvalues -ksp_gmres_restart 1000 -pc_type none
.
For simple saddle point problems, try -pc_type fieldsplit
-pc_fieldsplit_type schur -pc_fieldsplit_detect_saddle_point
.
For more difficult problems, read the literature to find robust methods
and ask on the PETSc mailing lists (petsc-maint or petsc-users) if you want
advice about how to implement them.
-
If the method converges in preconditioned residual, but not in true
residual, the preconditioner is likely singular or nearly so. This is
common for saddle point problems (e.g. incompressible flow) or strongly
nonsymmetric operators (e.g. low-Mach hyperbolic problems with large
time steps).
-
The preconditioner is too weak or is unstable. See if
-pc_type
asm -sub_pc_type lu
improves the convergence rate. If GMRES is
losing too much progress in the restart, see if longer restarts help
-ksp_gmres_restart 300
. If a transpose is available, try
-ksp_type bcgs
or other methods that do not require
a restart. (Note that convergence with these methods is frequently
erratic.)
-
The preconditioner is nonlinear (e.g. a nested iterative solve), try
-ksp_type fgmres
or -ksp_type gcr
.
-
You are using geometric multigrid, but some equations (often boundary
conditions) are not scaled compatibly between levels. Try
-pc_mg_galerkin
to algebraically construct a correctly
scaled coarse operator or make sure that all the equations are scaled
in the same way if you want to use rediscretized coarse levels.
-
The matrix is very ill-conditioned. Check the condition number.
- Try to improve it by choosing the relative scaling of components/boundary conditions.
- Try
-ksp_diagonal_scale -ksp_diagonal_scale_fix
.
- Perhaps change the formulation of the problem to produce more friendly algebraic equations.
-
The matrix is nonlinear (e.g. evaluated using finite differencing of
a nonlinear function). Try different differencing parameters,
./configure --with-precision=__float128 --download-f2cblaslapack
,
check if it converges in "easier" parameter regimes.
- A symmetric method is being used for a non-symmetric problem.
-
Classical Gram-Schmidt is becoming unstable, try
-ksp_gmres_modifiedgramschmidt
or use a method that orthogonalizes differently, e.g. -ksp_type gcr
.
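For the case above where the operator is intentionally singular with a constant null space, making the right hand side consistent amounts to removing its component along the constant vector. The sketch below shows that projection in plain Python, assuming a symmetric operator whose null space is exactly the constants; remove_constant_component is a hypothetical helper for this example, not a PETSc routine.

```python
def remove_constant_component(b):
    """Return a copy of b with its mean subtracted.

    The result is orthogonal to the constant vector, which is the
    consistency condition assumed in this sketch (symmetric operator,
    null space spanned by the constants).
    """
    mean = sum(b) / len(b)
    return [bi - mean for bi in b]


rhs = [1.0, 2.0, 3.0, 6.0]
consistent_rhs = remove_constant_component(rhs)
```

After the projection the entries sum to zero, so the Krylov method is no longer asked to match a component that the operator cannot produce.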
Immediately after calling PetscInitialize(), call PetscPopSignalHandler().
Some Fortran compilers including the IBM xlf, xlF etc compilers have
a compile option (-C for IBM's) that causes all array access in Fortran
to be checked that they are in-bounds. This is a great feature but does
require that the array dimensions be set explicitly, not with a *.
On newer Mac OS X machines, one has to be in the admin group to be able to use the debugger.
On newer Ubuntu Linux machines, one has to disable ptrace_scope
with "echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope" to get
-start_in_debugger working.
If -start_in_debugger does not work on your OS, then for a uniprocessor
job just try the debugger directly, for example: gdb ex1. You can also
use Totalview, which is a good graphical parallel debugger.
You can use the -start_in_debugger option to start all processes in the
debugger (each will come up in its own xterm) or run in Totalview. Then use
cont (for continue) in each xterm. Once you are sure that the program is
hanging, hit control-c in each xterm and then use 'where' to print a stack
trace for each process.
I will illustrate this with gdb, but it should be similar on other
debuggers. You can look at local Vec values directly by obtaining the
array. For a Vec v, we can print all local values using
(gdb) p ((Vec_Seq*) v->data)->array[0]@v->map.n
However, this becomes much more complicated for a matrix. Therefore, it
is advisable to use the default viewer to look at the object. For a Vec
v and a Mat m, this would be
(gdb) call VecView(v, 0)
(gdb) call MatView(m, 0)
or with a communicator other than MPI_COMM_WORLD,
(gdb) call MatView(m, PETSC_VIEWER_STDOUT_(m->comm))
Totalview 8.8.0 has a new feature that allows libraries to provide their
own code to display objects in the debugger. Thus in theory each PETSc
object, Vec, Mat etc could have custom code to print values in the
object. We have only done this for the most elementary display of Vec and
Mat. See the routine TV_display_type() in src/vec/vec/interface/vector.c
for an example of how these may be written. Contact us if you would like
to add more.
The Intel compilers use shared libraries (like libimf) that cannot be found by
default at run time. When using the Intel compilers (and running the
resulting code) you must make sure that the proper Intel initialization
scripts are run. This is usually done by putting some code into your
.cshrc, .bashrc, .profile etc file. Sometimes on batch systems that
do not access your initialization files (like .cshrc) you must include
the initialization calls in your batch file submission.
For example, on my Mac using csh I have the following in my .cshrc file
source /opt/intel/cc/10.1.012/bin/iccvars.csh
source /opt/intel/fc/10.1.012/bin/ifortvars.csh
source /opt/intel/idb/10.1.012/bin/idbvars.csh
in my .profile I have
source /opt/intel/cc/10.1.012/bin/iccvars.sh
source /opt/intel/fc/10.1.012/bin/ifortvars.sh
source /opt/intel/idb/10.1.012/bin/idbvars.sh
Many operations on PETSc objects require that the specific type of the
object be set before the operation is performed. You must call
XXXSetType() or XXXSetFromOptions() before you make the offending call. For
example, MatCreate(comm,&A); MatSetValues(A,....); will not work. You
must add MatSetType(A,...) or MatSetFromOptions(A); before the call to
MatSetValues();
In a previous call to VecSetSizes(), MatSetSizes(), VecCreateXXX() or
MatCreateXXX() you passed in local and global sizes that do not make sense
for the correct number of processors. For example if you pass in a local
size of 2 and a global size of 100 and run on two processors, this cannot
work since the sum of the local sizes is 4, not 100.
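The consistency requirement is simply that the local sizes over all processes add up to the global size. The sketch below checks the failing case from the text and shows one even split of the kind PETSC_DECIDE is generally understood to produce (lower ranks get one extra entry when the division is not exact). It is plain Python with hypothetical helper names, not PETSc's implementation.

```python
def local_size(global_size, nprocs, rank):
    """Local size when entries are split as evenly as possible.

    Lower ranks receive one extra entry when the division is not exact
    (an assumption of this sketch).
    """
    return global_size // nprocs + (1 if global_size % nprocs > rank else 0)


def sizes_are_consistent(local_sizes, global_size):
    """True if the per-process local sizes add up to the global size."""
    return sum(local_sizes) == global_size


# The failing case from the text: local size 2 on each of two processes.
bad = sizes_are_consistent([2, 2], 100)

# An even split of 100 entries over two processes.
good_sizes = [local_size(100, 2, r) for r in range(2)]
```

In practice the simplest fix is to give only one of the two sizes and let PETSc determine the other.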
Sometimes it can mean an argument to a function is invalid. In Fortran
this may be caused by forgetting to list an argument in the call,
especially the final ierr.
Otherwise it is usually caused by memory corruption; that is, somewhere
the code is writing out of array bounds. To track this down, rerun the
debug version of the code with the option -malloc_debug. Occasionally the
code may crash only with the optimized version; in that case run the
optimized version with -malloc_debug. If you determine the problem is
from memory corruption, you can put the macro CHKMEMQ in the code near the
crash to determine exactly what line is causing the problem.
If -malloc_debug does not help, on GNU/Linux and Apple Mac OS X machines
you can try using
http://valgrind.org
to look for memory corruption.
- Make sure valgrind is installed.
- We recommend building PETSc with
--download-mpich --with-debugging
[debugging is enabled by default]
- Compile application code with this build of PETSc
- run with valgrind using:
${PETSC_DIR}/bin/petscmpiexec -valgrind -n NPROC PETSCPROGRAMNAME -malloc off PROGRAMOPTIONS
- or invoke valgrind directly with:
mpiexec -n NPROC valgrind --tool=memcheck -q --num-callers=20 --log-file=valgrind.log.%p PETSCPROGRAMNAME -malloc off PROGRAMOPTIONS
Notes:
- option
--with-debugging
enables valgrind to give stack trace with additional source-file:line-number info.
- option
--download-mpich
gives valgrind clean MPI - hence the recommendation.
- As for other MPI implementations, Open MPI should also work. MPICH1 will not work.
- if
--download-mpich
is used, mpiexec will be in PETSC_ARCH/bin
- option
--log-file=valgrind.log.%p
tells valgrind to store the output from each process in a different file [as %p, i.e. the PID, is different for each MPI process].
- On Apple you need the additional valgrind option
--dsymutil=yes
- memcheck will not find certain array accesses that violate static array declarations, so if memcheck runs clean you can try the
--tool=exp-ptrcheck
instead.
A zero pivot in LU, ILU, Cholesky, or ICC sparse factorization does not
always mean that the matrix is singular. You can use '-pc_factor_shift_type
NONZERO -pc_factor_shift_amount [amount]' or '-pc_factor_shift_type
POSITIVE_DEFINITE'; '-[level]_pc_factor_shift_type NONZERO
-[level]_pc_factor_shift_amount [amount]' or '-[level]_pc_factor_shift_type
POSITIVE_DEFINITE' to prevent the zero pivot. [level] is "sub" when lu,
ilu, cholesky, or icc are employed in each individual block of the bjacobi
or ASM preconditioner; and [level] is "mg_levels" or "mg_coarse" when lu,
ilu, cholesky, or icc are used inside multigrid smoothers or in the coarse
grid solver. See PCFactorSetShiftType(), PCFactorSetShiftAmount().
This error can also happen if your matrix is singular, see KSPSetNullSpace() for how to handle this.
If this error occurs in the zeroth row of the matrix, it is likely you have
an error in the code that generates the matrix.
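To see why a zero pivot does not imply a singular matrix, consider a factorization done without row exchanges. The sketch below is plain Python and deliberately simplified: substituting a small value for a vanishing pivot is only a stand-in for a nonzero shift and does not reproduce PETSc's actual shift logic. The function name, the tolerance, and the shift value are assumptions for this example.

```python
def lu_pivots_no_pivoting(A, shift=0.0, tol=1e-14):
    """Return the pivots of an LU factorization done without row exchanges.

    If a pivot is smaller than tol in magnitude, `shift` is substituted for
    it (a simplified stand-in for a nonzero shift). With shift == 0.0 a
    ZeroDivisionError is raised, mimicking a zero-pivot failure.
    """
    n = len(A)
    U = [row[:] for row in A]  # work on a copy
    pivots = []
    for k in range(n):
        if abs(U[k][k]) < tol:
            if shift == 0.0:
                raise ZeroDivisionError("zero pivot in row %d" % k)
            U[k][k] = shift
        pivots.append(U[k][k])
        # Eliminate entries below the pivot.
        for i in range(k + 1, n):
            factor = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= factor * U[k][j]
    return pivots


# Nonsingular (determinant is -1), yet the first pivot is exactly zero.
A = [[0.0, 1.0], [1.0, 0.0]]
```

Calling lu_pivots_no_pivoting(A) fails on the first row, while lu_pivots_no_pivoting(A, shift=1e-3) completes with two nonzero pivots. The shifted factors are only an approximation of the original matrix, which is acceptable for a preconditioner but not for a direct solve.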
The libraries were compiled without support for X windows. Make sure that
./configure was run with the option
--with-x
Problem: Possibly some of the following:
- You are creating new PETSc objects but never freeing them.
- There is a memory leak in PETSc or your code.
-
Something much more subtle: (if you are using Fortran). When you
declare a large array in Fortran, the operating system does not
allocate all the memory pages for that array until you start using the
different locations in the array. Thus, in a code, if at each step you
start using later values in the array your virtual memory usage will
"continue" to increase as measured by
ps
or top
.
-
You are running with the -log, -log_mpe, or -log_all option. Here a great
deal of logging information is stored in memory until the conclusion of
the run.
-
You are linking with the MPI profiling libraries; these cause logging
of all MPI activities. Another symptom is that at the conclusion of the run
it may print some message about writing log files.
Cures:
-
Run with the -malloc_debug option and -malloc_dump. Or use the commands
PetscMallocDump() and PetscMallocLogDump() sprinkled in your code to
track memory that is allocated and not later freed. Use the commands
PetscMallocGetCurrentUsage() and PetscMemoryGetCurrentUsage() to
monitor memory allocated and PetscMallocGetMaximumUsage() and PetscMemoryGetMaximumUsage()
for total memory used as the code progresses.
- This is just the way Unix works and is harmless.
-
Do not use the -log, -log_mpe, or -log_all option, or use
PLogEventDeactivate() or PLogEventDeactivateClass(),
PLogEventMPEDeactivate() to turn off logging of specific events.
- Make sure you do not link with the MPI profiling libraries.
The graph of the matrix you are using is not symmetric. You must use symmetric matrices for partitioning.
26 KSP Residual norm 3.421544615851e-04
27 KSP Residual norm 2.973675659493e-04
28 KSP Residual norm 2.588642948270e-04
29 KSP Residual norm 2.268190747349e-04
30 KSP Residual norm 1.977245964368e-04
30 KSP Residual norm 1.994426291979e-04 <----- At restart the residual norm is printed a second time
Problem: Actually this is not surprising. GMRES computes the norm of the
residual at each iteration via a recurrence relation between the norms of
the residuals at the previous iterations and quantities computed at the
current iteration; it does not compute it directly as || b - A x^{n} ||.
Sometimes, especially with an ill-conditioned matrix, or computation of the
matrix-vector product via differencing, the residual norms computed by
GMRES start to "drift" from the correct values. At the restart, we compute
the residual norm directly, hence the "strange stuff," the difference
printed. The drifting, if it remains small, is harmless (doesn't affect the
accuracy of the solution that GMRES computes).
Cure: There really isn't a cure, but if you use a more powerful
preconditioner the drift will often be smaller and less noticeable. Or if
you are running matrix-free you may need to tune the matrix-free
parameters.
1198 KSP Residual norm 1.366052062216e-04
1198 KSP Residual norm 1.931875025549e-04
1199 KSP Residual norm 1.366026406067e-04
1199 KSP Residual norm 1.931819426344e-04
Some Krylov methods, for example tfqmr, actually have a "sub-iteration"
of size 2 inside the loop; each of the two substeps has its own matrix
vector product and application of the preconditioner and updates the
residual approximations. This is why you get this "funny" output where it
looks like there are two residual norms per iteration. You can also think
of it as twice as many iterations.
When using DYNAMIC libraries - the libraries cannot be moved after they are
installed. This could also happen on clusters - where the paths are
different on the (run) nodes - than on the (compile) front-end. Do not use
dynamic libraries & shared libraries. Run ./configure with
--with-shared-libraries=0 --with-dynamic-loading=0
If at some point [in petsc code history] you had a working code, but the
latest petsc code broke it, it is possible to determine the petsc code change
that might have caused this behavior. This is achieved by:
- using Mercurial DVCS to access petsc-dev sources [and BuildSystem sources]
- knowing the changeset number [in mercurial] for the known working version of petsc
- knowing the changeset number [in mercurial] for the known broken version of petsc
- using bisect functionality of mercurial
This process can be as follows:
- get petsc-dev and BuildSystem sources:
hg clone http://petsc.cs.iit.edu/petsc/petsc-dev
hg clone http://petsc.cs.iit.edu/petsc/BuildSystem petsc-dev/config/BuildSystem
-
Find the good and bad markers to
start the bisection process. This can be done either by checking
hg log
or hg view
or http://petsc.cs.iit.edu/petsc/petsc-dev
or http://petsc.cs.iit.edu/petsc/BuildSystem
or the web history of petsc-release clones. Let's say the known bad
changeset is 21af4baa815c and the known good changeset is 5ae5ab319844.
-
Now start the bisection process with these known revisions. [build PETSc, and test your code to confirm known good/bad behavior]
hg update -C 21af4baa815c
hg update -C --date "<`hg parent --template '{date|date}'`" -R config/BuildSystem
- <build/test/confirm-bad>
hg bisect --bad
hg update -C 5ae5ab319844
hg update -C --date "<`hg parent --template '{date|date}'`" -R config/BuildSystem
- <build/test/confirm-good>
hg bisect --good
hg update -C --date "<`hg parent --template '{date|date}'`" -R config/BuildSystem
-
Now until done - keep bisecting, building PETSc, and testing your code with it and determine if the code is working or not. i.e:
-
if <build> broken:
hg bisect --skip
hg update -C --date "<`hg parent --template '{date|date}'`" -R config/BuildSystem
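-
if <test> bad:
hg bisect --bad
hg update -C --date "<`hg parent --template '{date|date}'`" -R config/BuildSystem
-
if <test> good:
hg bisect --good
hg update -C --date "<`hg parent --template '{date|date}'`" -R config/BuildSystem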
Notice the hg update -C --date "<`hg parent --template '{date|date}'`" -R config/BuildSystem
after each hg update
or hg bisect
. This is to update
BuildSystem to be in sync to petsc-dev. If this is not done - and
BuildSystem is out of sync with petsc-dev - configure will keep
failing.
-
After something like 5-15 iterations -
hg bisect
will
pin-point the exact code change that resulted in the difference in
application behavior
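The "5-15 iterations" estimate follows from bisection halving the candidate range on every round. A minimal sketch of that arithmetic, in plain Python with a hypothetical helper name, and ignoring skipped changesets and non-linear history:

```python
import math


def bisect_steps(num_changesets):
    """Approximate number of build-and-test rounds needed to bisect a range.

    Each round halves the number of candidate changesets, so the count
    grows like log2 of the range size.
    """
    if num_changesets <= 1:
        return 0
    return math.ceil(math.log2(num_changesets))
```

For example, bisect_steps(1000) gives 10, so a range of about a thousand changesets needs roughly ten build-and-test rounds; every hg bisect --skip adds to that.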
Yes. Run
./configure --with-shared-libraries
When you link to shared libraries, the function symbols from the shared
libraries are not copied in the executable. This way the size of the
executable is considerably smaller than when using regular libraries. This
helps in a couple of ways:
- saves disk space when more than one executable is created, and
- improves the compile time immensely, because the compiler has to write a much smaller file (executable) to the disk.
By default, the compiler should pick up the shared libraries instead of the regular ones. Nothing special should be done for this.
You must run ./configure without the option --with-shared-libraries (you
can use a different PETSC_ARCH for this build so you can easily switch
between the two).
You would also need to have access to the shared libraries on this new
machine. The other alternative is to build the executable without shared
libraries by first deleting the shared libraries, and then creating the
executable.
PETSc libraries are installed as dynamic libraries when the ./configure
flag --with-dynamic-loading is used. The difference between these and
shared libraries is the way the libraries are used. From the program
the library is loaded using dlopen() and the functions are looked up
using dlsym(). This moves the resolution of function names from
link time to run time, i.e. when dlopen()/dlsym() are called.
When using Dynamic libraries - PETSc libraries cannot be moved to
a different location after they are built.