First, supercomputers nowadays offer a great variety of architectures, with many cores per node. Thus, shared memory parallelism is gaining more and more attention, as OpenMP offers more flexibility for parallel programming. In fact, sequential kernels can be parallelized at the shared memory level using OpenMP: one example is, once more, the coarse solve of iterative solvers; another is the possibility of using dynamic load balance on shared memory nodes, as explained in [25] and introduced in Section 4.

As mentioned earlier, the parallelization of the assembly has traditionally been based on loop parallelism using OpenMP. Two main characteristics of this loop have led to different algorithms in the literature. On the one hand, there exists a race condition. It comes from the fact that different OpenMP threads can access the same degree of freedom coefficient when performing the scatter of the element matrix and RHS, in step 4 of Algorithm 1.

On the other hand, spatial locality must be taken care of in order to obtain an efficient algorithm.

Figure: Shared memory parallelism techniques using OpenMP.

The first method consists in protecting the scatter with the ATOMIC pragma. The cost of the ATOMIC pragma comes from the fact that we do not know a priori when conflicts occur, and thus this pragma must be used at each loop iteration. This lowers the IPC defined in Section 1.


Loop parallelism using element coloring. The second method consists in coloring [26] the elements of the mesh such that elements of the same color do not share nodes [27], or such that cells of the same color do not share faces in the FV context. Elements of the same color can then be assembled concurrently without any race condition. The main drawback is that spatial locality is lessened by construction of the coloring. In [28], a comprehensive comparison of this technique and the previous one is presented.

Loop parallelism using element partitioning. In order to preserve spatial locality while disposing of the ATOMIC pragma, another technique consists in partitioning the local mesh of each MPI process into disjoint sets of elements.
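The element coloring idea can be illustrated with a minimal sketch. It assumes a simple greedy strategy, which is only one of several possible coloring algorithms and is not necessarily the one used in [26, 27]; the function name `color_elements` and the small 2x2 mesh are hypothetical.

```python
def color_elements(elements):
    """Greedy coloring: two elements sharing a node never get the same color.

    `elements` is a list of tuples of node ids (hypothetical connectivity).
    Returns one color index per element.
    """
    node_colors = {}  # node id -> colors already used by elements touching it
    colors = []
    for nodes in elements:
        # Collect every color already taken by a neighbor through a shared node.
        used = set()
        for n in nodes:
            used |= node_colors.get(n, set())
        # Pick the smallest color not used by any neighbor.
        c = 0
        while c in used:
            c += 1
        colors.append(c)
        for n in nodes:
            node_colors.setdefault(n, set()).add(c)
    return colors


# Four quadrilaterals forming a 2x2 patch; all of them touch the central node 4.
elements = [(0, 1, 4, 3), (1, 2, 5, 4), (3, 4, 7, 6), (4, 5, 8, 7)]
colors = color_elements(elements)
```

Because every element of this patch touches node 4, each one ends up with its own color. On a larger mesh, all elements of one color could be assembled in a parallel loop without ATOMIC, at the price of jumping across the mesh in memory.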

Then, one defines separators as the layers of elements which connect neighboring subdomains. By doing this, elements of different subdomains do not share nodes.

Task parallelism using multidependences. Task parallelism could be used instead of loop parallelism, but the three algorithms presented previously would not change [30, 31, 32]. Two new features implemented in OmpSs (a forerunner for OpenMP), which are not yet included in the standard, can help: multidependences and commutative.

These would allow us to express incompatibilities between subdomains. The mesh of each MPI process is partitioned into disjoint sets of elements, and by prescribing the neighboring information in the OpenMP pragma, the runtime will take care of not executing neighboring subdomains at the same time [33]. As explained in Section 1. The x-axis is time, while the y-axis is the MPI process number, and the dark grey color represents the element loop assembly (Algorithm 1).

After the assembly, the next operation is a reduction involving MPI: the initial residual norm of the iterative solver, which is quite common in practice. Therefore, this is a synchronization point where MPI processes are stuck until all of them have reached it. We can observe in the figure that one of the cores takes almost twice as long to perform this operation, resulting in a load imbalance. Load imbalance has many causes: mesh adaptation as described in Section 2.

The example presented in the figure is due to wrong element weights given to the METIS partitioner for the partition of a hybrid mesh [28]. There are several works in the literature that deal with load imbalance at runtime. We can classify them into two main groups: the ones implemented by the application (possibly using external tools) and the ones provided by runtime libraries and transparent to the application code.

In the first group, one approach would be to perform local element redistribution from neighbors to neighbors. Thus, only limited point-to-point communications are necessary, but this technique also provides only limited control over the global load balance. Another option consists in repartitioning the mesh to achieve a better load distribution.

In order for this to be efficient, a parallel partitioner is necessary. In addition, this method is expensive, so the imbalance should be high for it to be an interesting option. In general, the libraries of the second group detect the load imbalance and migrate objects or specific data structures between processes. They usually require the use of a specific programming language, programming model, or data structures, and thus a high level of code rewriting in the application.

Finally, the approach that has been used by the authors is called DLB [25] and has been extensively studied in [28, 33, 36] in the CFD context.

Figure: Principles of dynamic load balance with DLB [25], via resource sharing at the shared memory level.

Threads running on cores 3 and 4 are clearly responsible for the load imbalance. When using DLB, threads running on core 1 and core 2 lend their resources as soon as they enter the synchronization point, for example, an MPI reduction represented by the orange bar. Then, MPI process 2 can use four threads to finish its element assembly.

As already noted in Section 1. This is due to the combination of: substituting loop parallelism using coloring by task parallelism, thus giving a higher IPC; and using the dynamic load balance library DLB to improve the load balance at the shared memory level.

Let us close this section with some basic HPC optimizations to take advantage of some hardware characteristics presented in Section 1.

Spatial and temporal locality. The main memory access bottlenecks of the assembly depicted in Algorithm 1 are the gather and scatter operations.

For example, when assembling an element, the gather is more efficient after renumbering (top right part of the figure), as nodal array positions are closer in memory. Data locality is thus enhanced. However, the assembly loop accesses elements successively. Therefore, when going from element 1 to element 2, there is no data locality if element 1 accesses positions 1, 2, 3, 4 and element 2 accesses positions far from them. Renumbering the elements according to the node numbering enables one to achieve temporal locality, as shown in the bottom right part of the figure.
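A minimal sketch of the element renumbering step is given here, assuming that the nodes have already been renumbered and that ordering elements by their smallest node indices is an acceptable heuristic. The function name `renumber_elements` and the three-element connectivity are hypothetical and do not reproduce the figure.

```python
def renumber_elements(elements):
    """Return a new element ordering that follows the node numbering.

    Elements are sorted by their sorted node list, so that consecutive
    elements in the assembly loop tend to touch neighboring node positions.
    """
    return sorted(range(len(elements)), key=lambda e: sorted(elements[e]))


# Hypothetical connectivity: element 1 is far from element 0 in node space,
# while element 2 shares nodes 3 and 4 with element 0.
elements = [(1, 2, 3, 4), (9, 10, 11, 12), (3, 4, 5, 6)]
order = renumber_elements(elements)
reordered = [elements[e] for e in order]
```

With this ordering, the element sharing nodes 3 and 4 with the first element is visited right after it, so the data of these two nodes is likely to still be in cache.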

Data already present in cache can be reused (data of nodes 3 and 4).

Figure: Optimization of memory access by renumbering nodes and elements.

Depending on the available hardware, vectorization may be activated as a form of data-level parallelism. However, vectorization will be efficient only if the compiler is able to vectorize the appropriate loops. Let us consider a typical element matrix assembly. Let us denote nnode and ngaus as the number of nodes and Gauss integration points of this element; Ae, Jac, and N are the element matrix, the weighted Jacobian, and the shape function, respectively.

This loop, part of step 3 of Algorithm 1, will be carried out on each element of the mesh. Now, let us define Ne, a parameter defined at compilation time. In order to help vectorization, the last loop can be substituted by one that processes Ne elements at a time. Finally, note that this formalism can be relatively easily applied to port the assembly to GPU architectures [39].
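The loop restructuring can be sketched as follows. This is an illustration only: it assumes a mass-matrix-like integrand (the product of two shape functions weighted by the Jacobian), which is an assumption and may differ from the kernel the authors had in mind, and it uses plain Python lists, so it shows the loop ordering rather than actual SIMD execution. The array layout with the element index last is also an assumption.

```python
def element_matrices_batched(N, Jac, nnode, ngaus, Ne):
    """Assemble Ne element matrices at once.

    Assumed layout (hypothetical): N[i][g][e], Jac[g][e], Ae[i][j][e],
    with the element index e in the innermost, contiguous position.
    """
    Ae = [[[0.0] * Ne for _ in range(nnode)] for _ in range(nnode)]
    for g in range(ngaus):
        for j in range(nnode):
            for i in range(nnode):
                # Innermost loop over the Ne elements: iterations are
                # independent and contiguous, which is the pattern a
                # compiler can vectorize in a compiled language.
                for e in range(Ne):
                    Ae[i][j][e] += Jac[g][e] * N[i][g][e] * N[j][g][e]
    return Ae


# Two elements (Ne = 2), two nodes, one Gauss point, with made-up values.
N = [[[1.0, 2.0]], [[3.0, 4.0]]]   # N[i][g][e]
Jac = [[0.5, 1.0]]                 # Jac[g][e]
Ae = element_matrices_batched(N, Jac, nnode=2, ngaus=1, Ne=2)
```

Since Ne is fixed at compilation time in the original setting, the compiler knows the trip count of the innermost loop, which is what makes this form attractive for vectorization.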

The matrix and the right-hand side are distributed over the MPI processes, the matrix having a partial row or full row format. The algebraic solvers are mainly responsible for the limitation of the strong and weak scalabilities of a code (see Section 1). Thus, adapting the solver to a particular algebraic system is fundamental. This is a particularly difficult task for large distributed systems, where scalability and load balance enter into play, in addition to the usual convergence and timing criteria.

The section does not intend to be exhaustive, but rather to expose the experience of the authors on the topic. The main techniques to solve Eq. The explicit method can be viewed as the simplest iterative solver to solve Eq.

In practice, matrix A is not needed and only the residual $r^k$ is assembled. Semi-implicit methods are mainly represented by fractional step techniques [42, 43]. They generally involve an explicit update of the velocity, such as Eq. Other semi-implicit methods exist, based on the splitting of the unknowns at the algebraic level. This splitting can be achieved for example by extracting the pressure Schur complement of the incompressible Navier-Stokes Eqs.

The Schur complement is generally solved with iterative solvers, whose solution involves the consecutive solutions of algebraic systems involving unsymmetric and symmetric (SPD for the pressure) matrices. These kinds of methods have the advantage of extracting better conditioned and smaller algebraic systems than the original coupled one, at the cost of introducing an additional iteration loop to converge to the original monolithic solution. Finally, implicit methods deal with the coupled system 1.

In general, much more complex solvers and preconditioners are required to solve this system than in the case of semi-implicit methods.


So, in any case, we always end up with algebraic systems like Eq. We start with the parallelization of the operation that occupies the central place in iterative solvers, namely the sparse matrix-vector product (SpMV). When using the partial row format, the local result of the SpMV in each MPI process is only partial, as the matrices are also partial on the interface, as explained in Section 4. By applying the distributive property of the multiplication, the results of neighboring subdomains add up to the correct solution on the interface.
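The partial row SpMV can be illustrated with a small sketch that imitates two MPI processes within a single Python script. The one-dimensional two-element mesh, the element matrix `K`, and the helper `spmv` are hypothetical, and the MPI exchange is replaced by a plain addition.

```python
def spmv(A, x):
    """Dense matrix-vector product, standing in for a local sparse SpMV."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


# 1D mesh with nodes 0-1-2. Subdomain 1 owns element (0, 1) and subdomain 2
# owns element (1, 2), so node 1 is duplicated on the interface.
K = [[1.0, -1.0], [-1.0, 1.0]]      # element matrix, also the local partial matrix
x_global = [1.0, 2.0, 4.0]

y1 = spmv(K, x_global[0:2])         # partial result on nodes 0, 1
y2 = spmv(K, x_global[1:3])         # partial result on nodes 1, 2

# Stand-in for the MPI exchange: each side adds the neighbor contribution
# on the interface node.
interface = y1[1] + y2[0]
y1[1] = interface
y2[0] = interface

# Reference: product with the fully assembled global matrix.
A_global = [[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]]
y_ref = spmv(A_global, x_global)
```

After the exchange, both subdomains hold the same value on the interface node, and it matches the product obtained with the globally assembled matrix.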

Note that with this partial row format, due to the duplicity of the interface nodes, the MPI messages are symmetric in neighborhood (a subdomain is a neighbor of its neighbors) and in size of interfaces (the interface of i with j involves the same degrees of freedom as that of j with i). After this matrix and RHS exchange, the solution of Eq. Nothing needs to be done with the RHS as it has been fully assembled through the presence of halos. The previous two algorithms are said to be synchronous, as the MPI communication comes before or after the complete local SpMV for the partial or full row formats, respectively.

This strategy makes it possible to overlap communication (results of the SpMV for interface nodes) and work (SpMV for internal nodes). The loop parallelization with OpenMP is quite simple to implement in this case. However, care must be taken with the size of the chunks, as the overhead for creating the threads may be penalizing if the chunks are too small. In the FV method, the degrees of freedom are located at the center of the elements. The partitioning into disjoint sets of elements can thus be used for both assembly and solver.

In the case of the finite element method, the number of degrees of freedom involved in the SpMV corresponds to the nodes and could differ considerably from the number of elements involved in the assembly. So the question of partitioning a finite-element mesh into disjoint sets of nodes may be posed, depending on which operation dominates the computation.

As an example, if one balances a hexahedra subdomain with a tetrahedra subdomain in terms of elements, the former will hold roughly six times more nodes than the latter.

The Richardson iteration given by Eq. SpMV does not involve any global communication mechanism among the degrees of freedom (DOF) from one iteration to the next one.
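Since the equation of the Richardson iteration is not reproduced here, the following sketch assumes its textbook form, x^{k+1} = x^k + omega (b - A x^k), with a scalar relaxation parameter `omega` standing in for the preconditioner. The 2x2 system and the value of `omega` are made up for illustration.

```python
def richardson(A, b, omega, iters):
    """Relaxed Richardson iteration: x <- x + omega * (b - A x)."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        # The only coupling between unknowns is through this SpMV.
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [x[i] + omega * r[i] for i in range(n)]
    return x


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = richardson(A, b, omega=0.25, iters=50)   # exact solution is (1/11, 7/11)
```

Each iteration only propagates information from a degree of freedom to its direct neighbors in the matrix graph, which is why the number of iterations grows with the mesh size when no other communication mechanism is added.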

Figure: Accelerating iterative solvers. From top to bottom: (1) SpMV has a node-to-node influence; (2) domain decomposition (DD) solvers have a subdomain-to-subdomain influence; and (3) coarse solvers couple the subdomains.

Such methods seek some optimality, thus providing a certain global communication mechanism. Nevertheless, this global communication mechanism is very limited and the convergence of such solvers degrades with the mesh size.

Just like the Richardson method, Krylov methods damp high-frequency errors through the SpMV, but do not have inherent low-frequency error damping. The selection of the preconditioning of Eq. Preconditioning should provide robustness at the lowest price for a given problem and, in general, robustness is expensive. Domain decomposition preconditioners provide this robustness, but can turn out to be too expensive compared to smarter methods, as we now briefly analyze.

Domain Decomposition. Erhel and Giraud summarized the attractiveness of domain decomposition (DD) methods as follows: "One route to the solution of large sparse linear systems in parallel scientific computing is the use of numerical methods that combine direct and iterative methods. These techniques inherit the advantages of each approach, namely the limited amount of memory and easy parallelization for the iterative component and the numerical robustness of the direct part."

DD preconditioners are based on the exact or almost exact solution of the problem local to each subdomain. The different methods mainly differ in the way the subdomains are coupled (interface conditions) and in terms of overlap between them, the Schwarz method being the most famous representative. On the one hand, SpMV is in charge of damping high frequencies. On the other hand, DD methods provide a communication mechanism at the level of the subdomains. The convergence of Krylov solvers using such preconditioners now depends on the number of subdomains. Coarse solvers try to resolve this dependence, providing a global communication mechanism among the subdomains, generally with one degree of freedom per subdomain.

Let us mention the deflated conjugate gradient (DCG) method [47], which provides a coarse grain coupling, but which can be independent of the partition. As we have explained, solvers involving DD preconditioners together with a coarse solver aim at making the solver convergence independent of the mesh size and the number of subdomains. In terms of CPU time, this is translated into the concept of weak scalability (Section 1).

This can be achieved in some cases, but is hard to obtain in the general case. Multigrid solvers or preconditioners provide a similar multilevel mechanism, but using a different mathematical framework [48]. They only involve a direct solver at the coarsest level, and intermediate levels are still carried out in an iterative way, thus exhibiting good strong (based on SpMV) and weak (multilevel) scalabilities. Convergence is nevertheless problem dependent [49, 50].

Physics and numerics based solvers. DD preconditioners are brute force preconditioners in the sense that they attack local problems with a direct solver, regardless of the matrix properties.

Smarter approaches may provide more efficient solutions, at the expense of not being weak scalable. But do we really need weak scalability to solve a given problem on a given number of available CPUs? Well, this depends. The linelet preconditioner is presented in [51]. In a boundary layer mesh, a typical situation in CFD, the discretization of the Laplacian operator tends to a tridiagonal matrix when anisotropy tends to infinity (depending also on the discretization technique), and the dominant coefficients are along the direction normal to the wall. The anisotropy linelets consist of a list of nodes, renumbered in the direction normal to the wall.

By assembling tridiagonal matrices along each linelet, the preconditioner thus consists of a series of tridiagonal matrices, very easy to invert. Let us also mention finally the streamwise linelet [52]. In the discretization of a hyperbolic problem, the dependence between degrees of freedom follows the streamlines. By renumbering the nodes along these streamlines, one can thus use a bidiagonal or Gauss-Seidel solver as a preconditioner. In an ideal situation where nodes align with the streamlines, the bidiagonal preconditioner makes the problem converge in one complete sweep.
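To illustrate why tridiagonal matrices are easy to invert, here is a minimal sketch of the Thomas algorithm, a standard direct solver for tridiagonal systems with a cost linear in the number of unknowns. It is not taken from [51]; the storage convention (three separate lists) and the sample system are assumptions made for the example, and no pivoting is performed.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm.

    Row i reads lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i];
    lower[0] and upper[-1] are ignored. No pivoting: the sketch assumes a
    diagonally dominant matrix, as obtained for a discrete Laplacian.
    """
    n = len(diag)
    c = [0.0] * n   # modified upper diagonal
    d = [0.0] * n   # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward elimination.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# One linelet with three nodes and a 1D Laplacian stencil (-1, 2, -1).
x = solve_tridiagonal([0.0, -1.0, -1.0], [2.0, 2.0, 2.0],
                      [-1.0, -1.0, 0.0], [0.0, 0.0, 4.0])
```

Applying the preconditioner then amounts to one such solve per linelet, and the linelets are independent of each other.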

These two examples show that, by listening to the physics and numerics of a problem, one can devise simple and cheap preconditioners that perform local operations. The figure presents the convergence history in terms of number of iterations and time for solving the pressure equation (SPD) [53], with four different solvers and preconditioners: CG for the Schur complement preconditioned by an additive Schwarz method; the DCG with diagonal preconditioner; the DCG with linelet preconditioner [51]; and the DCG with a block LU (block Jacobi) preconditioner.

However, taking a look at the CPU time, the performance is completely inverted and the best one is the DCG with linelet preconditioner. We should stress that these conclusions are problem dependent, and one should adapt to each situation. In a simulation of millions of CPU hours, a factor of six in time can cost several hundred thousand euros (see [54] for a comparison of preconditioners for the pressure equation). Iterative parallel computing requires a lot of global synchronizations between processes, coming from the scalar products needed to compute descent and orthogonalization parameters or residual norms.

These synchronizations are very expensive due to the high latencies of the networks. They also imply a lot of wasted time if the workloads are not well balanced, as explained in Section 4. The heterogeneous nature of the machines makes such load balancing very hard to achieve, resulting in higher time loss, compared to homogeneous machines.

Pipelined solvers. Pipelined solvers are algorithmically equivalent variants of classical solvers. The main advantage of pipelined versions is the possibility to overlap reduction operations with other operations, like preconditioning. This enables one to hide latency, provided that the work to be overlapped is sufficient, and thus to increase the strong scaling.

Although algorithmically equivalent to their classical versions, pipelined solvers introduce local rounding errors due to the additional recurrence relations, which limit their attainable accuracy [57].

Communication avoiding solvers. Asynchronous iterations provide another mechanism to overcome the synchronism limitation. In order to illustrate the method, let us take the example of the Richardson method of Eq.

Then, let us define $A_{ij}^{i}$ as the matrix block of subdomain $i$ connected to subdomain $j$, and $x_i^k$ as the solution in subdomain $i$ at iteration $k$. In the resulting method, each subdomain $i$ updates its solution with the last available solution of its neighbors $j$. The main difficulty of such methods consists in establishing a common stopping criterion among all the MPI processes, while minimizing the number of synchronizations. Such asynchronous Jacobi and block Jacobi solvers have been developed since [58].
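The idea of updating with the last available neighbor solution can be mimicked in a sequential sketch. It assumes a Jacobi-type update (Richardson preconditioned by the diagonal), one unknown per "subdomain", and a fixed, hypothetical delay per subdomain that models stale neighbor data; a real asynchronous solver would have unpredictable delays driven by the network.

```python
def async_jacobi(A, b, delays, iters):
    """Jacobi iteration in which subdomain i reads stale neighbor values.

    delays[i] is the (hypothetical) age, in iterations, of the neighbor data
    seen by subdomain i. With all delays equal to 0, this is plain Jacobi.
    """
    n = len(b)
    history = [[0.0] * n]   # history[k][i] is x_i at iteration k
    for k in range(iters):
        new = []
        for i in range(n):
            # Last solution of the neighbors available to subdomain i.
            stale = history[max(0, k - delays[i])]
            coupling = sum(A[i][j] * stale[j] for j in range(n) if j != i)
            new.append((b[i] - coupling) / A[i][i])
        history.append(new)
    return history[-1]


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = async_jacobi(A, b, delays=[2, 0], iters=100)   # exact solution (1/11, 7/11)
```

For this diagonally dominant system, the iteration still converges to the exact solution despite the stale data, only more slowly than its synchronous counterpart.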

Recent developments have extended these algorithms to the asynchronous substructuring method [59] and to the asynchronous optimized Schwarz method [60].

Scientific visualization focuses on the creation of images to provide important information about underlying data and processes. In recent decades, the unprecedented growth in computing and sensor performance has made it possible to capture the physical world in ever greater detail and to model and simulate complex physical phenomena.


Visualization is of particular importance for CFD data, as the results lend themselves well to the three-dimensional representation familiar to us. Output of files for postmortem visualization usually represents the highest volume of output from a CFD code, along with some possibly separate operations, especially explicit checkpointing and restart, which require writing and reading large datasets.

This is called checkpointing. The computation may be restarted from the state reached by reading the checkpoint from a previous run. This incurs both writing and reading. Some codes use the same file format for visualization output and checkpointing, but this assumes data required are sufficiently similar and often that the code has a privileged output format.

When restarting requires additional data (such as field values at locations not exactly matching those of the visualization, or multiple time steps for smooth restart of higher order time schemes), code-specific formats are used. This may require less programming on the solver side, at the expense of larger checkpoint sizes.

In practice, BLCR does not seem to have evolved in recent years, and support in some MPI libraries has been dropped; so it seems the increasing complexity of systems has made this approach more difficult. As datasets used by CFD tools are often large, it is recommended to use mostly binary representations rather than text representations.

This has multiple advantages when done well. As binary data are not easily human-readable, additional precautions are necessary, such as providing sufficient metadata for the file to be portable. This can be as simple as providing a fixed-size string with the relevant information, and associating a fixed-size description with name, type, and size for each array, or much more advanced depending on the needs. Many users with experience with older fields tend to feel more comfortable with text files, so this advice may seem counterintuitive, but the issues which plagued older binary representations have disappeared, while text files are not as simple as they used to be, with the many character encodings possible today.
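A minimal sketch of such a self-describing binary file is given here, using Python's standard `struct` module. The layout (an 8-byte format tag, a 16-byte array name, a one-character type code, and a 64-bit element count, all little-endian) and the tag `CFDBIN01` are hypothetical choices made for this example and do not correspond to any existing format.

```python
import struct

MAGIC = b"CFDBIN01"                    # hypothetical fixed-size format tag
HEADER = struct.Struct("<8s16s1sQ")    # tag, array name, type code, count


def write_array(path, name, values):
    """Write one array of doubles preceded by a fixed-size description."""
    with open(path, "wb") as f:
        # Explicit little-endian layout, so the file does not depend on the
        # byte order of the machine that wrote it.
        f.write(HEADER.pack(MAGIC, name.encode("ascii"), b"d", len(values)))
        f.write(struct.pack("<%dd" % len(values), *values))


def read_array(path):
    """Read back the array, using only the metadata stored in the file."""
    with open(path, "rb") as f:
        magic, name, code, count = HEADER.unpack(f.read(HEADER.size))
        if magic != MAGIC:
            raise ValueError("unexpected file format")
        fmt = "<%d%s" % (count, code.decode("ascii"))
        values = struct.unpack(fmt, f.read(struct.calcsize(fmt)))
    # The name was padded with null bytes up to 16 characters.
    return name.rstrip(b"\0").decode("ascii"), list(values)
```

A call such as `write_array("p.bin", "pressure", [1.0, 2.5, -3.0])` followed by `read_array("p.bin")` returns the name and the values unchanged. Real formats would add more metadata (version, precision, mesh association), but the principle is the same.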

Twenty years ago, some systems such as Cray used proprietary floating-point types, while many already used the IEEE standard for single-, double-, and extended-precision floating point values. Though vendors have improved compatibility over the years, Fortran binary files are not portable by default.

The simplest solution is to read or write a separate file for each MPI task.

On some file systems, this may be the fastest method, but it leads to the generation of many files on large systems, and requires external tools to reassemble data for visualization, unless using libraries which can assemble data when reading it (such as VTK using its own format). This approach provides the benefit of allowing checkpointing and restarting on different numbers of nodes and making parallelism more transparent for the user, though it requires additional work for the developers.

Even on machines with similar systems but different file system tuning parameters, performance may vary. In any case, for good performance on parallel file systems (which should be all shared file systems on modern clusters), it is recommended to avoid funneling all data through a single node, except possibly as a fail-safe mode.

Visualization pipeline. The pipeline filter step includes raw data processing and image processing algorithm operations. Rendering uses computer graphics methods to generate the final image from the geometric primitives of the mapping process.

While the selection of visualization applications is considerable, visualization techniques in science are generally classified according to the dimensionality of the data fields. A distinction is made between scalar fields (temperature, density, pressure, etc.), vector fields, and tensor fields. Regardless of the dimensionality of the data fields, any visualization of the whole three-dimensional volume can easily flood the user with too much information, especially on a two-dimensional display or piece of paper.

The most common technique is slicing the volume data with cut planes, which reduces three-dimensional data to two dimensions. Color information is often mapped onto these cut planes using another basic, well-known technique called color mapping. Color mapping is a one-dimensional visualization technique. It maps a scalar value into a color specification. The scalar mapping is done by indexing into a color reference table, the lookup table. The scalar values serve as indexes in this lookup table, which may include local transparency. A more general form of the lookup table is the transfer function.
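Color mapping through a lookup table can be sketched in a few lines. The blue-to-red table, the RGBA tuple convention, and the function names are assumptions made for this example; visualization packages offer many other tables and interpolation schemes.

```python
def make_lookup_table(n):
    """Build a blue-to-red table with n RGBA entries (hypothetical colors)."""
    return [(i / (n - 1), 0.0, 1.0 - i / (n - 1), 1.0) for i in range(n)]


def map_scalar(value, vmin, vmax, table):
    """Map a scalar to a color by indexing into the lookup table."""
    # Normalize to [0, 1] and clamp values outside the scalar range.
    t = (value - vmin) / (vmax - vmin)
    t = min(max(t, 0.0), 1.0)
    # Convert the normalized value into a table index.
    index = min(int(t * len(table)), len(table) - 1)
    return table[index]


table = make_lookup_table(5)
cold = map_scalar(300.0, 300.0, 400.0, table)   # lowest value: pure blue
mid = map_scalar(350.0, 300.0, 400.0, table)    # middle of the range
hot = map_scalar(400.0, 300.0, 400.0, table)    # highest value: pure red
```

The fourth component of each entry is the opacity, which is how local transparency can be stored in the same table.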

A transfer function is any expression that maps scalars or multidimensional values to a color specification. Color mapping is not limited to 2D objects like cut planes, but it is also often used for 3D objects like isosurfaces. Isosurfaces belong to the general visualization technique of data fields, which we focus on in the following.

Visualization of scalar fields. Isosurface extraction is a powerful tool for the investigation of volumetric scalar fields. An isosurface in a scalar volume is a surface on which the data value is constant, separating areas of higher and lower value. Given the physical or biological significance of the scalar data value, the position of an isosurface and its relationship to other adjacent isosurfaces can reveal much of the structure of the scalar field.

The second fundamental visualization technique for scalar fields is volume rendering. Volume rendering is a method of rendering three-dimensional volumetric scalar data in two-dimensional images without the need to calculate intermediate geometries. The individual values in the dataset are made visible by selecting a transfer function that maps the data to optical properties such as color and opacity.

These are then projected and blended together to form an image. For a meaningful visualization, a transfer function must be found that highlights interesting regions and characteristics of the data. Finding a good transfer function is crucial for creating an informative image. Multidimensional transfer functions enable a more precise separation of the important from the unimportant.
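The projection and blending step can be illustrated for a single ray. The sketch assumes front-to-back alpha compositing, a common choice in volume rendering but not the only one, and a toy transfer function `red_ramp`; both names and the cutoff value are hypothetical.

```python
def composite_ray(samples, transfer_function, opacity_cutoff=0.999):
    """Blend the samples met along one ray, ordered front to back."""
    color = [0.0, 0.0, 0.0]
    alpha = 0.0
    for value in samples:
        # The transfer function maps the data value to color and opacity.
        r, g, b, a = transfer_function(value)
        # A sample only contributes through what is still transparent.
        weight = (1.0 - alpha) * a
        color[0] += weight * r
        color[1] += weight * g
        color[2] += weight * b
        alpha += weight
        # Early ray termination: nothing behind an opaque pixel is visible.
        if alpha >= opacity_cutoff:
            break
    return color, alpha


def red_ramp(value):
    """Toy transfer function: red intensity follows the value, opacity 0.5."""
    return (value, 0.0, 0.0, 0.5)


color, alpha = composite_ray([1.0, 0.0], red_ramp)
```

In this example the first sample contributes half of its red color, and the second, darker sample only increases the accumulated opacity.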


Figure: Visualization of flame simulation results (left) using slicing and color mapping in the background, and isosurface extraction and volume rendering for the flame structure. Visualization of an inspiratory flow in the human nasal cavity (right) using streamlines colored by the velocity magnitude [68].

Visualization of vector fields. The visualization of vector field data is challenging because no existing natural representation can visually convey a large amount of three-dimensional directional information.

Visualization methods for three-dimensional vector fields must therefore reconcile the opposing goals of an informative and a clear representation of a large amount of directional information. The techniques relevant for the visual analysis of vector fields can be categorized as follows. The simplest representations of the discrete vector information are oriented glyphs. Glyphs are graphical symbols that range from simple arrows to complex graphical icons encoding directional information and additional derived variables such as rotation.

Streamlines provide a natural way to follow a vector dataset. With a user-selected starting position, the numerical integration results in a curve that can be made easily visible by continuously displaying the vector field. Streamlines can be calculated quickly and provide an intuitive representation of the local flow behavior. Since streamlines are not able to fill space without visual disorder, the task of selecting a suitable set of starting points is crucial for effective visualization.
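The numerical integration behind a streamline can be sketched with a classical fourth-order Runge-Kutta scheme. The two-dimensional setting, the analytic `rotation` field, and the fixed step size are assumptions made for the example; production tools interpolate the velocity from the mesh and often use adaptive steps.

```python
def streamline(velocity, start, step, nsteps):
    """Integrate a streamline from a seed point with the classical RK4 scheme."""
    x, y = start
    points = [(x, y)]
    for _ in range(nsteps):
        # Four velocity evaluations per step (RK4 stages).
        k1 = velocity(x, y)
        k2 = velocity(x + 0.5 * step * k1[0], y + 0.5 * step * k1[1])
        k3 = velocity(x + 0.5 * step * k2[0], y + 0.5 * step * k2[1])
        k4 = velocity(x + step * k3[0], y + step * k3[1])
        x += step * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0]) / 6.0
        y += step * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1]) / 6.0
        points.append((x, y))
    return points


def rotation(x, y):
    """Hypothetical test field: rigid rotation around the origin."""
    return (-y, x)


# Seed at (1, 0); the exact streamline is the unit circle.
points = streamline(rotation, (1.0, 0.0), step=0.01, nsteps=628)
```

For this field the computed curve stays very close to the unit circle, which gives a simple check of the integrator before applying it to simulation data.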

A limitation of flow visualizations based on streamlines concerns the difficult interpretation of the depth and relative position of the curves in a three-dimensional space. One solution is to create artificial light effects that accentuate the curvature and support the user in depth perception. Stream surfaces represent a significant improvement over individual streamlines for the exploration of three-dimensional vector fields, as they provide a better understanding of depth and spatial relationships.

Conceptually, they correspond to the surface that is spanned by any starting curve which is advected along the flow. The finite-time Lyapunov exponent (FTLE) enables the visualization of significant coherent structures in the flow. Texture-based flow visualization methods are a unique means to address the limitations of representations based on a limited set of streamlines. They effectively convey the essential patterns of a vector field without lengthy interpretation of streamlines. Their main application is the visualization of flow structures defined on a plane or a curved surface.

The best known of these methods is the line integral convolution (LIC) proposed by Cabral and Leedom [71]. This work has inspired a number of other methods. In particular, improvements have been proposed, such as texture-based visualization of time-dependent flows or flows defined via arbitrary surfaces.

Some attempts were made to extend the method to three-dimensional flows. Furthermore, vector fields can be visualized using topological approaches. Topological approaches have established themselves as a reference method for the characterization and visualization of flow structures.

Topology offers an abstract representation of the flow and its global structure, for example, sinks, sources, and saddle points. A prominent example is the Morse-Smale complex that is constructed based on the gradient of a given scalar field [72].

Visualization of tensor fields. Compared to the visualization of vector fields, the state of the art in the visualization of tensor fields is less advanced.

It is an active area of research. Simple techniques for tensor visualization draw the three eigenvectors by color, vectors, streamlines, or glyphs.

In situ visualization. However, with each generation of supercomputers, memory and CPU performance grows faster than the access and capacity of hard disks.

This trend hinders the traditional processing paradigm. In situ visualizing is visualization that necessarily starts before the data producer finishes. These interfaces allow a fixed coupling between the simulation and the visualization and integrate large parts of the visualization libraries into the program code of the simulation. Recent developments [ 75 , 76 ] favor methods for loose coupling as tight coupling proves to be inflexible and susceptible to faults.

Here, the simulation program and visualization are independent applications that only exchange certain data among each other via clearly defined interfaces.

Licensee IntechOpen. This chapter is distributed under the terms of the Creative Commons Attribution 3.


A Heisenbug is an error that changes or disappears when an attempt is made to isolate and probe it, for example by running the program under a debugger or by adding constructs such as synchronization requests or delay statements. Another issue arises from the unpredictable behavior of the scheduler. Differences in system load influence scheduler behavior, and this behavior cannot be changed manually. To counter this indeterminism, the program must be executed many times under various execution environments. Even then, it is not guaranteed that a bug can be reproduced. Most of the time the program runs correctly, and the bug is visible only when specific conditions are met.

As a result, the non-repeatability of concurrent programs is a major obstacle to detecting errors. As an example, consider a program that contains a potential deadlock: it may deadlock in some runs, while in others it runs successfully.
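A program of this kind, which deadlocks in some runs but not in others, might look like the following minimal Python sketch. It assumes the classic pattern of two threads taking the same two locks in opposite order, which is one common cause and not necessarily the example the text had in mind. The function name `run_two_lockers`, the `force_bad_interleaving` flag and the timeout value are all hypothetical. A timeout is used on the second lock so that the sketch reports the stand-off instead of hanging.

```python
import threading


def run_two_lockers(force_bad_interleaving, timeout=0.2):
    """Return True if at least one thread gave up waiting for its second lock."""
    lock_a, lock_b = threading.Lock(), threading.Lock()
    barrier = threading.Barrier(2)
    gave_up = []

    def worker(first, second):
        with first:
            if force_bad_interleaving:
                # Wait until both threads hold their first lock.
                barrier.wait()
            # A plain second.acquire() could block forever at this point.
            if second.acquire(timeout=timeout):
                second.release()
            else:
                gave_up.append(threading.current_thread().name)

    # T1 takes A then B; T2 takes B then A (opposite order).
    t1 = threading.Thread(target=worker, args=(lock_a, lock_b), name="T1")
    t2 = threading.Thread(target=worker, args=(lock_b, lock_a), name="T2")
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return bool(gave_up)
```

With `force_bad_interleaving=False` the outcome depends on how the operating system schedules the two threads, so repeated runs may disagree. With `force_bad_interleaving=True` the barrier forces the unlucky interleaving, and at least one thread always times out.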

The probe effect is seen in parallel programs when delay statements are inserted into programs that have synchronization problems. Like a Heisenbug, this effect alters the program's behavior in ways that may obscure the problems.

Detecting the source of a probe effect is a great challenge in testing parallel applications. The differences between sequential and concurrent programs lead to differences in their testing strategies. Strategies for sequential programs can be modified to make them suitable for concurrent applications, and specialized strategies have also been developed. Conventionally, testing consists of designing test cases and checking that the program produces the expected results. Thus, errors in specification, functionality, etc. are detected by running the program and comparing its output with the expected results. Using static analysis before functionality testing can save time.

Static analysis techniques can detect problems such as missing or improper synchronization, and can predict the occurrence of deadlocks and post-wait errors in rendezvous requests. The indeterminacy of scheduling has two sources. To make concurrent programs repeatable, an external scheduler is used.

The program under test is instrumented to add calls to this scheduler.

Such calls are made at the beginning and end of each thread as well as before every synchronization request. This scheduler selectively blocks threads of execution by maintaining a semaphore associated with each thread, such that only one thread is ready for execution at any given time. Thus, it converts a parallel, non-deterministic application into a serial execution sequence in order to achieve repeatability. The number of scheduling decisions made by the serializing scheduler is given by —. To obtain more accurate results using deterministic scheduling, an alternate approach can be chosen.
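The serializing scheduler described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the scheduler of any particular tool. The class `SerializingScheduler`, its methods `begin`, `switch` and `end`, and the helper `run` are hypothetical names. The schedule is assumed to be a list of thread names in which each name appears exactly once per code segment of that thread (twice here), since a malformed schedule would leave threads blocked. The instrumented program is a deliberately unsynchronized counter increment, with a scheduler call placed between the read and the write.

```python
import threading


class SerializingScheduler:
    """Lets exactly one thread run at a time, in a fixed, replayable order."""

    def __init__(self, schedule):
        self.schedule = list(schedule)  # thread name for each step
        self.pos = 0
        # One semaphore per thread; a thread runs only when its semaphore is released.
        self.sems = {name: threading.Semaphore(0) for name in set(self.schedule)}

    def start(self):
        self.sems[self.schedule[0]].release()

    def begin(self, name):
        # Called at the start of a thread: wait for the first turn.
        self.sems[name].acquire()

    def switch(self, name):
        # Called at a scheduling point: hand over, then wait for the next turn.
        self._hand_over()
        self.sems[name].acquire()

    def end(self, name):
        # Called at the end of a thread: hand over without waiting again.
        self._hand_over()

    def _hand_over(self):
        self.pos += 1
        if self.pos < len(self.schedule):
            self.sems[self.schedule[self.pos]].release()


def run(schedule):
    """Run two instrumented threads under the given schedule; return the counter."""
    sched = SerializingScheduler(schedule)
    shared = {"counter": 0}

    def increment(name):
        sched.begin(name)
        tmp = shared["counter"]          # segment 1: read
        sched.switch(name)
        shared["counter"] = tmp + 1      # segment 2: write
        sched.end(name)

    threads = [threading.Thread(target=increment, args=(n,)) for n in ("A", "B")]
    for t in threads:
        t.start()
    sched.start()
    for t in threads:
        t.join()
    return shared["counter"]
```

Because the schedule is the only source of ordering, `run(["A", "A", "B", "B"])` always yields 2, while `run(["A", "B", "A", "B"])` always yields 1 (a lost update). The bug is reproduced every time the same schedule is replayed.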

A few properly-placed pre-emptions in the concurrent program can detect bugs related to data-races. The existence of one bug establishes a high probability of more bugs in the same region of code. Thus each pass of the testing process identifies sections of code with bugs. The next pass more thoroughly scrutinizes those sections by adding scheduler calls around them. Allowing the problematic locations to execute in a different order can reveal unexpected behavior. This strategy ensures that the application is not prone to the Probe Effect.
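The idea that a few well-placed pre-emptions are enough to expose a data race can be illustrated with a small model. The sketch below is a toy under stated assumptions, not a real testing tool. Threads are modelled as Python generators that yield at each possible pre-emption point. The names `all_schedules`, `preemptions`, `increment` and `final_counter` are hypothetical. It enumerates every interleaving of two unsynchronized increments and counts how many pre-emptions each one needs.

```python
def all_schedules(steps_per_thread):
    """Yield every interleaving as a tuple of thread indices."""
    def rec(remaining, prefix):
        if not any(remaining):
            yield tuple(prefix)
            return
        for t, left in enumerate(remaining):
            if left:
                remaining[t] -= 1
                prefix.append(t)
                yield from rec(remaining, prefix)
                prefix.pop()
                remaining[t] += 1
    yield from rec(list(steps_per_thread), [])


def preemptions(schedule, steps_per_thread):
    """Count switches away from a thread that still had steps left to run."""
    remaining = list(steps_per_thread)
    count = 0
    for i, t in enumerate(schedule):
        remaining[t] -= 1
        is_switch = i + 1 < len(schedule) and schedule[i + 1] != t
        if is_switch and remaining[t] > 0:
            count += 1
    return count


def increment(shared):
    tmp = shared["counter"]        # step 1: read
    yield                          # possible pre-emption point
    shared["counter"] = tmp + 1    # step 2: write


def final_counter(schedule):
    """Execute the two modelled threads in the given order."""
    shared = {"counter": 0}
    gens = [increment(shared), increment(shared)]
    for t in schedule:
        next(gens[t], None)        # run one step of thread t
    return shared["counter"]


STEPS = (2, 2)  # two threads, two steps each
schedules = list(all_schedules(STEPS))
buggy = [s for s in schedules if final_counter(s) != 2]
buggy_with_one_preemption = [s for s in buggy if preemptions(s, STEPS) <= 1]
```

In this model there are 6 interleavings, 4 of which lose an update, and 2 of those 4 need only a single pre-emption. Searching only schedules with a small pre-emption budget therefore still finds the bug.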

Sources of errors that cause the Probe Effect can range from task creation issues to synchronization and communication problems. Requirements of timing-related tests: [3]. This equation has exponential order. Various issues must be handled. This method applies the concept of a define-use pair in order to determine the paths to be tested. Software verification is a process that proves that software is working correctly and is performing the intended task as designed.

Input is given to the system to generate a known result.