Parallelizing VOLVIS for Multiprocessor SGI Workstations Samuel P. Uselton Abstract The direct volume rendering program, VOLVIS, has been modified to take advantage of Silicon Graphics workstations with multiple processors. The resulting parallel program, called PVOLVIS, has been run on several different systems. The modification process is described and the results obtained are reported. Introduction: Volvis This report describes the process and results of parallelizing a direct volume rendering program called VOLVIS to take advantage of multiprocessor workstations made by Silicon Graphics (SGI). VOLVIS was designed and written to be a testbed for experiments in volume rendering data generated by computational fluid dynamics (CFD) simulations. While there were already several volume rendering modules available when this development was begun, none was able to deal with fluid dynamics data sets. The fluid flow simulations solve the Navier-Stokes equations for flow around or through a body on a discretized grid of the volume containing the fluid. These grids are most often composed of hexahedral cells whose descriptions are stored in multidimensional arrays. This strategy allows adjacency information to remain implicit, saving much storage space. The geometry of the grid is warped to fit the aerodynamic surfaces and contains cells of widely varying sizes to permit high spatial resolution in the areas where it is needed without using a large number of cells in areas where that resolution is unnecessary. These grids of hexahedral cells are generally referred to as "curvilinear" grids. The variation in grid cell size and shape means that the methods which have been developed for rapid direct volume rendering of regular grids can not be used. Volvis uses a ray casting (ray tracing without reflection or refraction) strategy because the ray-grid cell intersection could be implemented robustly without special knowledge of the grid geometry. Volvis was initially implemented in C as a serial application. Sampling issues have been deferred, so only one ray per pixel in the final image is calculated. For ray intersection purposes, each quadrilateral grid cell face is associated with a planar equation. These faces are not necessarily planar, although one of the criteria of a "good" grid is that the faces are all at least nearly planar. The approximation used is the same one widely used in aeronautics applications. The cross product of the cell face diagonals determines a normal vector, and a displacement is found by averaging the values for the planes passing through each of the four vertices. The main disadvantage of ray casting is the large computation time required. A trick for finding the initial cell intersected by each ray and grid coherence are used in the serial version to reduce the time somewhat. The trick is a variation of the item buffer idea used in ray tracing to find the first object hit by rays traced from the eye. The grid geometry is displayed, using the workstation's Z buffer, as a collection of quadrilaterals with the i, j and k grid array indices mapped to red, green and blue color components, respectively. The Z buffered display is controlled interactively to select the desired view. Once the view has been selected, the color at each pixel in the z buffered image can be decoded to determine the first cell hit by the corresponding ray (or that the ray misses the grid entirely). The alpha planes are also used, storing an indication of which of three faces is hit. From that point on, grid coherence is used to limit the ray intersection testing that must be done. Grid coherence is the notion that a ray traversing the grid always moves between adjacent cells. Knowing that the ray has entered a grid cell through one face (and assuming planar faces) it must exit through one of the other five faces of that cell. The parametric distance along the ray for each of the five planes is easily calculated. The smallest distance greater than the distance to the entry face must correspond to the exit face. (This idea is illustrated in two dimensions in Figure 1.) The location of the intersection is calculated. A data value for this location is found by a bilinear interpolation between the face vertices. This data value is used to generate color and opacity values which are composited according to a "glowing smoke" shading model. The distance along the ray between intersections can be used to modulate the attenuation, or each face can have equal weight, resulting in a bias toward smaller cells. The time required to generate the volume rendered image is completely dominated by the calculations performed for each ray-grid cell face intersection. The time per image is linearly proportional to the product of the number of image pixels actually covered by the grid and the grid "thickness," that is the number of grid cells penetrated by each ray. Trivial rejection of pixels that miss the grid is very efficient, and so adds very little to the time required to produce an image. For more information on this software, please see reference [2]. The sample grids used to test this software range from about 40,000 cells to a few hundred thousand cells. The images generated are 501 pixels in both width and height. Using a single 33 megahertz MIPS R3000 processor, the total time required for intersection, shading and compositing is roughly one quarter millisecond per intersection. Producing complete images takes tens of minutes. The code is not too finely tuned, but tuning is unlikely to produce the orders of magnitude improvement needed to make this software even moderately interactive. Ray tracing may have been the first application to be designated "embarassingly parallel." It is a frequently used demonstration for parallel computers. This ray casting program is a simpler version, and so even more amenable to parallelizing. Since achieving interactive rendering rates will require orders of magnitude reductions in the time required, using more processors seems an obvious method to try. The difficulty is not in finding a way to parallelize the code but rather choosing from among a multitude of possibilities. Resources At the NAS Systems Division, we have several SGI multiprocessor workstations. Since the program already ran on a single SGI processor it seemed reasonable to parallelize the program on the workstation and test that performance before undertaking a port to a massively parallel machine. In addition to the processor compatibility, parallelizing on the workstations with shared memory is easier than coping with the distributed memory of the massively parallel machines. The workstations are SGI 4D models, all equipped with VGX graphics hardware or better. The particular graphics hardware is only relevant to the performance of the view selection part of the application because the ray casting is all done in software. The workstations have from two (model 320) to eight (model 380) 33 megahertz R3000 microprocessors. The processors are all capable of accessing all memory on the machine directly, that is, these machines have a shared memory architecture. After the initial tests were complete, SGI made some time available on a pre-release Onyx system with four 100 Megahertz, R4400 processors and on a pre-release Challenge system with 19 similar processors. IRIX, SGI's version of Unix, supports shared memory multi-threaded programming. SGI provides a library of function calls for use in parallel programming which are slightly higher level than the basic Unix fork command, and may be used to create lightweight threads which share an address space. I found use for four of these routines [3]. The first one, m_set_procs, selects the number of threads desired. The second m_fork, spawns one less thread than the number specified by m_set_proc (the parent process is counted in the desired number). The third function is m_kill_procs, which terminates all but one thread. Finally, I used m_next to assure mutually exclusive access for the reading and incrementing of a variable used to assign different work to the otherwise identical threads. Otherwise, the parallel application is written entirely in C. The SGI GL library is used to read the image generated by the view selection part of the program and to write the final volume rendered image. Changes Required The m_fork routine has as its parameters a function and the parameters of the function. Each thread is started as an invocation of this function. The context of the calling routine is shared by the threads, but all variables declared within this function are local to the particular invocation, and therefore local to individual threads. The computation intensive part of VOLVIS was contained in a single loop which read a single scan line of the z buffered image, computed pixel values for the corresponding volume rendered scan line and wrote the scan line to the screen. This loop was isolated and placed into a separate function, to be called by means of m_fork. Some rearranging of variable declarations was required to separate variables into those for which each process needs its own copy and those which can (or must) be shared. The large data arrays containing the grid geometry, the solution values and the approximate plane equations of the cell faces are shared, which saves much memory. Pointers into these arrays are local to the individual threads. The copy of the z buffered image of the grid is also shared, but an integer indicating the scan line to operate upon is repeated in each thread. Since the arrays are only read and never written in this part of the computation, no data dependencies exist. The graphics (gl) calls for reading and writing the individual scan lines had to be pushed outside of the procedure to be parallelized, to avoid the mixing in the graphics pipeline of data from different threads. As a result, the parallel version reads the whole z-buffered image at one time and writes the whole volume rendered image at one time. While seeing the scan lines appear one at a time was excellent feedback on the progress of the computation, waiting until the complete image is available fits the parallelization and the reduced time should reduce the importance of feedback on progress. There was no file accessing done in this part of the serial code, but if there had been, it too would have had to be removed from the loop. I chose to explicitly pass all shared variables rather than trust some to global storage. However, m_fork only permits six variables to appear in the parameter list of the function it calls. Therefore, I made logical groupings of variables into structures, so that there are only six structures to be passed. Defining these structures and modifying the references to these variables to fit their new definitions was the most time consuming part of the conversion. Except for these changes, I left the software in the same form as the serial version. In particular, I did no restructuring to improve cache use, avoid memory conflicts or attempt load balancing. For the tests on the Onyx and Challenge systems, the parallel version of VOLVIS had to be ported to Irix 5.0, since that is the version of the operating system available on these machines. The port was done on an Indigo with a beta-test version of Irix 5.0; the port was relatively straight forward. Most changes were due to stricter ANSI C and UNIX System V checking. The Challenge systems have no local graphics, so the final image display was done using dgl (distributed gl). The use of dgl required no changes to the program. Results The parallelized code was run using the blunt fin [1] geometry and the local Mach number field calculated from the original solution data. The selected view is shown in Figure 2. Only part of the grid is visible, but it is the most interesting part of this field, and contains a wide range of grid cell sizes and shapes. Due to the relative aspect ratios of the window and the data set, a view showing the entire data set would contain many more pixels not covered by the data. The thickness of the data set would not change, so the time required would be less. The same view was rendered eight times with the number of processors used varying from one to eight. Interpolation method, color and opacity transfer functions and compositing method were all held constant. Eight identical images were generated. The window is 501 by 501 pixels, for a total of 251,001 pixels, 220,969 of which are covered by the grid. The grid dimensions are 32 by 32 by 40. One would expect the average number of cell walls hit by each ray to be slightly greater than the dimension of the grid most nearly parallel to the line of sight due to the oblique angle. The average number of cell walls intersected by each ray was 33.75. The times reported are the times required to compute a single image and do not include the the time to select the view, or to read or write images to the screen. Num Procs seconds elapsed cpu seconds speed up efficiency --------------------------------------------------------------------------- 1 1846.51 1846.35 1 1.00 2 928.43 1856.47 1.99 0.995 3 620.68 1861.48 2.97 0.992 4 465.66 1861.81 3.97 0.993 5 373.87 1868.16 4.94 0.988 6 312.23 1871.57 5.91 0.986 7 268.32 1875.76 6.88 0.983 8 235.56 1880.68 7.84 0.980 Table 1. SGI 4D/380 with 8 33mhz R3000 processors 2.5 x 10 **-4 seconds per intersection Table 1 summarizes the data collected, and shows both the speed-up and efficiency of this parallel implementation. Speed-up is calculated as the cpu time divided by the single processor elapsed time. The efficiency is calculated as the speed-up divided by the number of processors used. The speed-up is graphed versus number of processors in Figure 3. With an efficiency of 98% or better for all test runs, it is obvious that this application is highly parallel. Given the performance achieved, it would be wasted effort to attempt to redistribute the workload. Any more improvements must come from better serial code within the compute loop, or more and faster processors. The image takes only about 4 minutes to compute using eight processors, but this is still too long for interactivity. It is a great improvement, however, over the 32 minutes required for the single processor version. This is only a small data set, but the performance is dominated as much by pixel count as by data set size. In particular, speed is linearly proportional to number of pixels covered by the data set. The speed is also linearly proportional to the viewing thickness of the data set, which for arbitrary views and data sets of roughly equal resolution in all three dimensions, is proportional to the cube root of the data set size. Results for the tests on the Onyx system are given in Table 2 and Figure 4. The data rendered was the same and the particular view is as close as could be matched by slider control. The differences from the above description are that only 199,836 of the pixels were covered, and that the average number of cells hit per ray was 30.9, due to a difference in the viewing angle. Num Procs seconds elapsed cpu seconds speed up efficiency --------------------------------------------------------------------------- 1 737.90 730.87 N/A N/A 2 386.41 751.49 1.91 .955 3 251.98 749.28 2.93 .977 4 190.87 751.01 3.87 .968 Table 2: SGI Onyx with 4 100mhz R4400 processors 1.21 x 10**-4 seconds per intersection Even considering the slightly smaller amount of work done for the image computed on the Onyx, the R4400 processors are more than twice as fast in real use. A useful image can be produced in the neighborhood of three minutes with only four processors. Results for the tests on the Challenge system are given in Table 3 and Figure 5. The data rendered was again the same and the particular view is as close as could be matched by slider control. In this case the number of pixels covered by the data set was 219,724 and the average number of cells per ray was 33.5. Num Procs seconds elapsed cpu seconds speed up efficiency --------------------------------------------------------------------------- 1 876.98 867.59 1 NA 2 449.00 886.25 1.95 .975 3 299.09 882.49 2.93 .977 4 254.69 952.44 3.44 .861 5 180.78 882.69 4.85 .970 6 154.99 907.42 5.66 .943 7 129.88 882.08 6.75 .965 8 117.41 906.91 7.47 .934 9 101.81 882.69 8.61 .956 10 92.71 885.57 9.46 .946 11 85.63 900.76 10.24 .931 12 81.41 933.24 10.77 .898 13 71.50 883.31 12.27 .943 14 66.85 884.03 13.12 .937 15 62.59 884.65 14.01 .934 16 59.23 887.07 14.81 .925 17 55.82 884.68 15.71 .924 18 52.69 886.01 16.64 .925 19 51.42 903.35 17.06 .898 Table 3: SGI Challenge with 19 100mhz R4400 processors 1.21 x 10**-4 seconds per intersection At 16 processors, the time required became less than one minute. This threshold has been the initial goal of the parallelization work. At this speed, it is reasonable to ask a researcher to try using the software and give feedback on the results. The efficiency does drop as the number of processors increases, but the drop is relatively slow. Further Work More speed can be gained by either greater numbers of processors or faster processors. Silicon Graphics has announced availability later this year of the R4400 processor running at 150 megahertz. A fully configured Onyx can have 24 processors, and the Challenge can have as many as 36. A fully configured Onyx or Challenge machine could be expected to yield even greater improvements. However, conventional wisdom has it that there is a limit to the scalability of shared memory architectures due to bus saturation and memory contention. I plan to test this code on an Onyx workstation with the 150 megahertz R4400 processors at Silicon Graphics as soon as one is available. In the meantime I also plan to test this program on the distributed memory parallel machines at NAS beginning with the CM-5. There are several alternative ways to distribute the computation. I will begin with one that requires minimal modifications to the workstation version. If the performance is below expectations, then additional work can be done guided by the initial results. Bibliography [1] Hung, C.-M. and P.G. Buning, "Simulation of Blunt-Fin Induced Shock Wave and Turbulent Boundary Layer Separation," AIAA Paper 84-0457, AIAA Aerospace Sciences Conference, Reno NV, January 1984. [2] Uselton, Samuel P., "Volume Rendering for Computational Fluid Dynamics: Initial Results," Technical Report RNR-91-026, Applied Research Branch, NAS Systems Division, NASA Ames Research Center, September 1991. [3] "m_fork", Silicon Graphics IRIX Online Manual, release 4.0.3, August 1991.