Software:pRasterBlaster


This page coordinates the effort to develop and improve the USGS parallel map reprojection software - pRasterBlaster.


Overview

  • pRasterBlaster is a parallel map projection transformation program, built on the desktop version of mapIMG, that will scale to thousands of processors for reprojecting large raster datasets. Background on mapIMG is available at http://cegis.usgs.gov/projection/highPerformanceComputing.html. That page also houses reports and presentations related to the development of pRasterBlaster.


Updates

Feb 2014 (accumulation of multiple months)

Jan 2014

- Mattli and Wendel visited UIUC/NCSA. The agenda included: 1) specifying EPSG projections on the command line; 2) prototyping an app (viewshed or TauDEM) that uses pRasterBlaster as the reprojection tool instead of GDALwarp; 3) consideration of retiring or upgrading the Goodchild cluster at CEGIS; 4) separating the I/O part as a parallel computing extension to GDAL GeoTIFF I/O; 5) user interfaces for both the CyberGIS gateway and a desktop GUI.

- Much of the time was spent benchmarking the performance of Mattli's prasterblasterpio program, which uses his librasterblaster and Simple Parallel Tiff Writer libraries, on an XSEDE cluster called Trestles. Liu set up "screen" terminals on his Trestles account that allowed Mattli and Wendel to run commands from their laptops as though they were working on the Trestles machine. Both were able to go through entire git, compile, run, and analyze-results cycles on Trestles, a machine with more than 300 compute nodes and 32 processors per node.

- Trestles also has a Lustre file system for fast I/O. The prasterblasterpio program ran remarkably fast, and the whole team was happy with the results, especially the performance of the Simple Parallel Tiff Writer, which also uses Open MPI and was able to take full advantage of the Lustre-based file system.

- Reads of the input and, especially, writes of the reprojected output were several orders of magnitude better than anything previously achieved.

- Here are the performance results (the data used is a 16GB land cover dataset; the best run took less than 4 minutes):

TestsResultsJan2014.jpg

- This was an invaluable learning experience for the CEGIS team: working directly with XSEDE clusters and learning the cluster architecture (both hardware and software). Wendel demonstrated his pRasterBlaster GUI and it was well received. We talked about future work on that program, focusing mainly on how to provide fast previews of expected reprojection results while varying the output projection.

- Liu demonstrated a CyberGIS gateway application that works with GISolve to give the CEGIS team an idea of how CyberGIS gateway web-based applications work with XSEDE resources.

- Mattli and Wendel were also given a tour of the NPCF (National Petascale Computing Facility), where they saw Blue Waters, one of the fastest computers in the world and the fastest on any university campus. It is housed in its own building and is impressive.

- Wendel demonstrated the pRasterBlaster GUI application internally within CEGIS. He added Qt widgets such as file choosers for the input and output files, combo boxes for the resampler method and output SRS parameters, and an input validation system that pops up message boxes when a parameter is not the correct data type or not within a certain range or domain. The validation system is not complete as far as domains of values (such as all valid SRS values), but the concept is implemented.

- The other significant feature implemented is the ability to start the UI with mpirun while specifying an arbitrary number of helper processes to do parts of the reprojection. Wendel spent time getting familiar with Open MPI to understand the concepts and implement the communication between the GUI process and the helpers (mainly communication that provides the helpers with the GUI input). The program can be run any number of times without a restart while the user varies input parameters between runs.

- Though not complete, the concepts are implemented. Next, he will look at the Simple Parallel TIFF Writer (SPTW) and how Open MPI is used to write files in parallel.
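For illustration, here is a minimal sketch of how a GUI rank might broadcast run parameters to its helper ranks with Open MPI before each run; the RunParams struct and its fields are hypothetical stand-ins, not the actual pRasterBlaster GUI code.

 // Hypothetical sketch: rank 0 (the GUI process) broadcasts run parameters
 // to the helper ranks before each reprojection run. Not the actual GUI code.
 #include <mpi.h>
 #include <cstdio>
 #include <cstring>
 
 struct RunParams {              // hypothetical parameter bundle
   char input_path[1024];
   char output_path[1024];
   char output_srs[256];         // e.g. "+proj=moll" or "EPSG:900913"
   int  resampler;               // index chosen in the resampler combo box
 };
 
 int main(int argc, char **argv) {
   MPI_Init(&argc, &argv);
   int rank;
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 
   RunParams params;
   if (rank == 0) {
     // In the real GUI these values would come from the Qt widgets.
     std::strcpy(params.input_path,  "tests/testdata/veg_geographic_1deg.tif");
     std::strcpy(params.output_path, "tests/testoutput/veg_moll_1deg.tif");
     std::strcpy(params.output_srs,  "+proj=moll");
     params.resampler = 0;
   }
 
   // Send the whole struct as raw bytes from the GUI rank to every helper.
   MPI_Bcast(&params, sizeof(RunParams), MPI_BYTE, 0, MPI_COMM_WORLD);
 
   if (rank != 0) {
     std::printf("helper %d would reproject %s -> %s\n",
                 rank, params.input_path, params.output_path);
     // ...each helper would now process its share of the partitions...
   }
 
   MPI_Finalize();
   return 0;
 }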

Dec 2013

- Mattli will add a color map to the Global Land Cover input raster dataset to make interpretation easier (for validation purposes).

- He also finished the pRasterBlaster output verification tool; it works by comparing individual pixel values and verifying that they are within a threshold (less than a maximum delta apart).
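For illustration, a minimal sketch of this kind of threshold comparison using the GDAL C++ API; the actual verification tool's interface and options may differ.

 // Minimal sketch of a pixel-by-pixel difference check between two rasters
 // using GDAL. An illustration of the idea, not the actual verification tool.
 #include <gdal_priv.h>
 #include <cpl_conv.h>
 #include <cmath>
 #include <cstdio>
 #include <vector>
 
 int main(int argc, char **argv) {
   if (argc != 4) {
     std::fprintf(stderr, "usage: %s reference.tif candidate.tif max_delta\n", argv[0]);
     return 1;
   }
   const double max_delta = CPLAtof(argv[3]);
 
   GDALAllRegister();
   GDALDataset *ref  = static_cast<GDALDataset*>(GDALOpen(argv[1], GA_ReadOnly));
   GDALDataset *cand = static_cast<GDALDataset*>(GDALOpen(argv[2], GA_ReadOnly));
   if (ref == NULL || cand == NULL) return 1;
 
   GDALRasterBand *rb = ref->GetRasterBand(1);
   GDALRasterBand *cb = cand->GetRasterBand(1);
   const int xsize = rb->GetXSize();
   const int ysize = rb->GetYSize();
 
   std::vector<double> ref_row(xsize), cand_row(xsize);
   long long bad = 0;
   for (int y = 0; y < ysize; ++y) {
     // Read one row at a time so arbitrarily large rasters fit in memory.
     rb->RasterIO(GF_Read, 0, y, xsize, 1, &ref_row[0],  xsize, 1, GDT_Float64, 0, 0);
     cb->RasterIO(GF_Read, 0, y, xsize, 1, &cand_row[0], xsize, 1, GDT_Float64, 0, 0);
     for (int x = 0; x < xsize; ++x)
       if (std::fabs(ref_row[x] - cand_row[x]) > max_delta) ++bad;
   }
   std::printf("%lld pixels exceed the allowed delta\n", bad);
 
   GDALClose(ref);
   GDALClose(cand);
   return bad == 0 ? 0 : 2;
 }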

Nov 2013

- Wendel got a librasterblaster GUI working that mimics the functionality found in Mattli's prasterblasterpio command-line program. It takes input from text fields, calls the code found in src/demos/prasterblaster-pio.cc, and outputs program feedback to a text widget. The program showing input and output from a successful run is on the right side of the attached screenshot (below). He is using the QtCreator 2.7.2 IDE, which is based on Qt 5.1.0; the IDE is visible on the left behind the libprasterblaster GUI and is included only to show the development environment.

- Next, Wendel will replace the text fields with drop-down lists, file-choosing widgets, and whatever else makes sense for a particular parameter. Also needed are input validation and better error reporting when something goes wrong. But the core functionality is there and Mattli's code works.



- The EPSG option in the code now works. Mattli hasn't thoroughly tested it, so there may be some bugs associated with it. To use an EPSG code, specify it as "EPSG:x". (For example, this will reproject into Web Mercator: ./prasterblasterpio --t_srs "EPSG:900913" tests/testdata/veg_geographic_1deg.tif tests/testoutput/veg_900913_1deg.tif)

- The EPSG values that are supported depend on the GDAL installation (but the most common ones should be available).
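As a rough illustration of why the GDAL installation matters here, the sketch below resolves an "EPSG:x" string through GDAL/OGR's OGRSpatialReference; it is not the pRasterBlaster code itself.

 // Sketch: resolve an "EPSG:xxxx" string through GDAL/OGR. Whether a given
 // code resolves depends on the EPSG support files of the GDAL installation.
 #include <ogr_spatialref.h>
 #include <cpl_conv.h>
 #include <cstdio>
 
 int main(int argc, char **argv) {
   const char *srs_string = (argc > 1) ? argv[1] : "EPSG:900913";
 
   OGRSpatialReference srs;
   if (srs.SetFromUserInput(srs_string) != OGRERR_NONE) {
     std::fprintf(stderr, "could not resolve %s with this GDAL installation\n", srs_string);
     return 1;
   }
 
   char *proj4 = NULL;
   srs.exportToProj4(&proj4);          // e.g. a "+proj=..." string for the SRS
   std::printf("%s -> %s\n", srs_string, proj4);
   CPLFree(proj4);
   return 0;
 }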


- The CyberGIS team is eager to use pRasterBlaster in our TauDEM workflow. The primary motivation stems from the fact that the GDAL-based reprojection takes so long that it offsets the benefit of using TauDEM. This adds extra importance to the requirement for a pRasterBlaster result validation tool.

- We need to think about how to start using pRasterBlaster in our raster-based analytics, particularly as it relates to projection choices (EPSG numbers instead of the projection system name).

Oct 2013

- The CyberGIS Toolkit (version 0.5-alpha) was released publicly: http://cybergis.cigi.uiuc.edu/cyberGISwiki/doku.php/ct. Reaching this milestone was the work of our cross-institution CyberGIS Toolkit team. This effort has started filling the critical void of open-source and scalable CyberGIS software capabilities in the context of advanced cyberinfrastructure and high-performance computing. We recognize the likelihood of hidden issues in existing components.

Sep 2013

- Liu got performance results for the latest pRasterBlaster on Trestles@SDSC. They look quite promising: http://cgwiki.cigi.uiuc.edu:8080/mediawiki/index.php/Software:pRasterBlaster:NED_reprojection#pRasterBlaster_performance_.2820130923.29

- He noticed in the 1024.csv file that there are quite a few unbalanced read times. Mattli is looking into it.

- We need better ways to verify the output is correct, including visualization (how to view a 25GB output GeoTIFF).

Item of interest: the performance of using pRasterBlaster to reproject a 16GB DEM:

- USGS local cluster: 70 min
- 256 cores@Trestles: 20 min
- 512 cores@Trestles: 17 min
- 1024 cores@Trestles: 13 min

- As long as the output is correct, we already beat sequential GDAL.



- Wendel has been working on replacing the Google-based geocoder embedded in MLAD. Google stopped supporting the version of the API he was using, and it stopped working; therefore, Wendel had to replace it with something The National Map is using. It is done and deployed, so he is back to work on the pRasterBlaster GUI.

- Due to sequestration, Finn and Mattli were unable to attend the CyberGIS All-Hands meeting (http://cybergis.cigi.uiuc.edu/cyberGISwiki/doku.php/ahm13/day1). Dr. Usery did represent USGS/CEGIS at the meeting. We were unable to participate in the "CyberGIS in Action" demos/hands-on session at the meeting to present librasterblaster and pRasterBlaster.

- Mattli added a command-line option to set the tile size (--tile-size=1024). However, oddly, it only works with powers of two. He didn't see that requirement in the TIFF spec, but GDAL rejects anything else.

- Also implemented was a new optimization in the partitioner. Mattli ran a reprojection job on the 16GB NLCD raster dataset and it took much longer than expected (about 25 hours). He noticed that the processes were spending a disproportionate amount of time on file reading.

- For load balancing, pRasterBlaster had been shuffling the partitions it generates before assigning them to processes. This balances the load but results in random reads (e.g. process 1 is reading the beginning of the file while process 2 is reading the end). This worked correctly when the files fit into memory on the fileserver, but the NLCD is larger than the memory on the Goodchild fileserver, so that was likely causing the slow file reads.

- Currently, the implementation sorts the partitions by location in the file and shuffles them locally to try to achieve load balancing. With that change, 48 processors on Goodchild reproject the NLCD raster in 70 minutes. This is still slower than it should be compared to the smaller datasets, but much better.

- For the NLCD raster dataset, Mattli used this call: mpirun -np 96 rasterblasterpio --tile-size 4096 --layout=tiled -n 32532800 --t_srs="+proj=moll" tests/testdata/nlcd2006_landcover_4-20-11_se5.tif tests/testoutput/nlcd2006_landcover_4-20-11_se5_moll.tif
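A minimal sketch of the sort-then-shuffle-locally idea described above, assuming a hypothetical Partition record that carries a file offset; librasterblaster's actual partition type and function names differ.

 // Sketch of the partition ordering idea: sort partitions by their position in
 // the input file (so reads stay close to sequential), then shuffle within
 // small windows to keep some load balancing. Types and names are hypothetical.
 #include <algorithm>
 #include <cstddef>
 #include <cstdint>
 #include <random>
 #include <vector>
 
 struct Partition {
   int64_t file_offset;   // where this partition's input pixels start in the file
   int64_t row, col;      // upper-left corner of the partition in the output raster
 };
 
 void OrderPartitions(std::vector<Partition> &parts, std::size_t window = 64) {
   // 1. Sort by location in the file so reads stay close to sequential.
   std::sort(parts.begin(), parts.end(),
             [](const Partition &a, const Partition &b) {
               return a.file_offset < b.file_offset;
             });
 
   // 2. Shuffle locally, window by window, to spread expensive partitions
   //    across processes without destroying the overall file order.
   std::mt19937 rng(12345);
   for (std::size_t i = 0; i < parts.size(); i += window) {
     std::size_t end = std::min(i + window, parts.size());
     std::shuffle(parts.begin() + i, parts.begin() + end, rng);
   }
 }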

- All the changes are in GitHub and Subversion.

- Mattli got the tiled TIFF writing in the Simple Parallel TIFF Writer (SPTW) module to work correctly. The previously noticed distortion is gone.

- TIFF writing is now implemented with two different code paths in SPTW. One is very much like the original implementation: each row of the raster dataset is written to disk with a separate operation. If, however, the RasterChunk being written to disk spans the entire width of a tile, then up to an entire tile can be written in a single operation. The result from tests is a very significant reduction in write operations. The decomposition of the output raster dataset doesn't yet take the file tiles into account; next, he will push the tile-aware decomposition that should ensure the fast, whole-tile writes are used for nearly the entire file. The other remaining work is to expose the size of the tiles to the user as a settable option. Tile size and partition size are two parameters that can have a large effect on performance.
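A schematic sketch of the two write paths, assuming a simplified chunk description, an already open MPI file handle, and a chunk whose rows are contiguous within a single file tile on the fast path; the real SPTW offset arithmetic and TIFF bookkeeping are more involved.

 // Schematic of the two SPTW write paths: per-row writes versus a single
 // whole-chunk write when the chunk spans the full width of a TIFF tile.
 // The Chunk struct, offsets, and sizes here are simplified placeholders.
 #include <mpi.h>
 #include <cstdint>
 
 struct Chunk {
   unsigned char *pixels;       // chunk data, row-major
   int64_t width_bytes;         // bytes per row of the chunk
   int64_t rows;                // number of rows in the chunk
   bool spans_full_tile;        // rows are contiguous within one file tile
   MPI_Offset file_offset;      // byte offset of the chunk's first row in the file
   MPI_Offset row_stride;       // byte distance between consecutive rows in the file
 };
 
 void WriteChunk(MPI_File fh, const Chunk &c) {
   MPI_Status status;
   if (c.spans_full_tile) {
     // Fast path: the chunk's rows are contiguous, write them in one operation.
     MPI_File_write_at(fh, c.file_offset, c.pixels,
                       static_cast<int>(c.width_bytes * c.rows), MPI_BYTE, &status);
   } else {
     // Slow path: each row lands at a different offset, one write per row.
     for (int64_t r = 0; r < c.rows; ++r) {
       MPI_File_write_at(fh, c.file_offset + r * c.row_stride,
                         c.pixels + r * c.width_bytes,
                         static_cast<int>(c.width_bytes), MPI_BYTE, &status);
     }
   }
 }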

- Here is the performance Mattli got on the Goodchild cluster with the Global Land Cover dataset:

 Processors | Total Runtime | Read I/O | Write I/O | Resampling
 8          | 418.5844s     | 2.1092s  | 27.7189s  | 173.2376s
 16         | 221.2990s     | 0.6633s  | 14.6931s  | 86.4759s
 24         | 153.0111s     | 0.4751s  | 10.4743s  | 60.6124s
 32         | 120.1347s     | 0.3420s  | 9.7591s   | 41.4118s
 40         | 112.7968s     | 0.2687s  | 9.2147s   | 38.3606s
 48         | 97.7441s      | 0.2719s  | 9.7051s   | 33.0026s
 56         | 89.0058s      | 0.1924s  | 10.8344s  | 25.0259s
 64         | 78.2209s      | 0.2075s  | 11.2463s  | 21.6946s
 72         | 75.1096s      | 0.1721s  | 11.8832s  | 21.9639s
 80         | 84.8985s      | 0.1240s  | 16.1605s  | 18.7700s
 88         | 77.7583s      | 0.1278s  | 15.8223s  | 15.0338s
 96         | 75.7510s      | 0.1568s  | 22.6725s  | 17.1675s
 104        | 80.2591s      | 0.1582s  | 27.7282s  | 17.7222s
 112        | 81.2738s      | 0.1045s  | 30.0465s  | 15.0089s

- The code is in the repository.


- Liu checked out the code from github and tested the performance. He couldn't finish the 900MB Global Land Cover dataset on Trestles using 64 cores in 2 hours of wall time.

- Mattli has the librasterblaster and pRasterBlaster code 90% fixed but some edge cases are still not being handled correctly. The performance is looking much better though.

- Liu tested the pRasterBlaster code with the Global Land Cover dataset (900MB) using 64 cores; it finished in 96 seconds. He used the following configuration: 4 partitions on each process, so the chunk size = (number of columns / number of cores) / 2 * number of rows. For this test dataset, that is 6480000.


Aug 2013

- Liu posted another benchmark result: http://cgwiki.cigi.uiuc.edu:8080/mediawiki/index.php/Software:pRasterBlaster:NED_reprojection#Benchmark_Results

- He was able to use GDALwarp to reproject 30GB of NED tiles on a large-memory machine on Trestles@SDSC (64GB memory). It took 5 hours.

- Liu put the initial benchmark using GDALwarp on the CyberGIS wiki at http://cgwiki.cigi.uiuc.edu:8080/mediawiki/index.php/Software:pRasterBlaster:NED_reprojection

- He used, as input, the GeoTIFF version of NED tiles, instead of ADF. The conversion is trivial using GDAL. We expect pRasterBlaster to beat 283 seconds in reprojecting the NED samples.


Jul 2013

Achievements include:

- Separation of librasterblaster and pRasterBlaster. Mattli has done that, but there were compilation issues; these are now fixed.

- Compiled and tested librasterblaster and pRasterBlaster on Trestles and Lonestar (supercomputers). This was time-consuming because of the complexity of GDAL and a problematic system configuration of the Trestles software environment. Liu figured out a way to work around the Trestles cluster configuration. Through this process, quite a few pRasterBlaster autoconf issues were corrected.

- Fixed a doxygen problem in the API documentation. The first version of the documentation is now available at http://gwdev3.cigi.illinois.edu:8080/usgs/prasterblaster/html/ . We had a major discussion on the API design and identified key functions and code snippets to be included in the API. Mattli will work on them and revise the current pRasterBlaster version to accommodate the changes.

- National Middleware Initiative (NMI) build and test: after resolving compilation, runtime linking, and testing issues in the pRasterBlaster code, Liu pushed the code to NMI. The librasterblaster test now works well on NMI. Soltani developed a sequential reprojection test program that uses librasterblaster as sample app code; we will include that code as part of the test codes. A successful build of librasterblaster is here: http://submit-1.batlab.org/nmi/results/details?runID=141959

- More than 15 bugs/issues were fixed. We gained insight into the new pRasterBlaster version (it is dramatically different from the version tested before). librasterblaster is now integrated into the CyberGIS build and test framework. The scalability of the new pRasterBlaster version is a problem; we will explore it further as one of the action items.


4/3/2013

Agenda for the hacking session

  1. Separating librasterblaster and prasterblaster documentation from doxygen and outlining tasks to improve the documentation
  2. Generate librasterblaster shared and static lib files
  3. Compile prasterblaster using the librasterblaster lib files (from step 2), instead of from obj files
  4. Debug prasterblaster issues in reprojection (the asymmetric issue, stripping issues)

All of these items were addressed; items 2, 3, and 4 are fully resolved. Work on item 1 will continue, and item 4 will need to be verified.

Action Items:

  1. Remove MPI dependency during build for librasterblaster (David)
  2. tiff.h is not included in gdal build, but is available in gdal source. Explore our options with this dependency. (David)
  3. Run the experiment at full scale to see if the striping issue is resolved. (Anand will run the experiment, David will verify)
  4. Verify the timing on a large-data run. (Anand will run the experiment and report back the result)
  5. Provide verification of test runs using results that are known to be correct. (David)
  6. Create a wiki page for developer documentation (Anand): Developer Documentation
  7. Create developer guide that includes librasterblaster API and sample code. (Yan coordinates; David and Kornelijus (Illinois) will work on it)

02/20/2013

A summary of relevant recent e-mail traffic:

06 Jan 2013; Padmanabhan :

Anyway, here is some information from our testing of pRasterBlaster (special thanks to Yan!):

- The build process has been improved. However, ./configure should also check for the GDAL header files and the TIFF library.

- The parallel I/O code couldn't compile. We use mpich 1.2.7. (On Trestles we were able to compile everything except the src/demos directory, i.e. the parallel I/O code.) Were you able to get the parallel I/O code working? The error we get is:

 sptw.cc: In function 'sptw::SPTW_ERROR sptw::write_subrow(sptw::PTIFF*, void*, int64_t, int64_t, int64_t)':
 sptw.cc:207:14: error: 'struct MPI_Status' has no member named '_count'
 sptw.cc:210:39: error: 'struct MPI_Status' has no member named '_count'

- The documentation has function-level doc, but we still don't know how to use this library. Please provide sample code and sample data. The sample code must include a serial program that calls librasterblaster and a parallel program that uses parallel I/O. The main concern is whether librasterblaster is a clean cut from prasterblaster and can be used by both serial and parallel apps.

- We also noticed you use GDAL + MPI I/O, which has been a headache for us in the past. If you worked out how to use it, we would like to know how. Could you explain how it works? Maybe David could do this in a short presentation at the weekly telecon around the end of January.


09 Jan 2013; Mattli:

OK, I'll add those checks (./configure should also check for the GDAL header files and the TIFF library) to the configure script.

I updated the code to use the standard MPI_Get_count() instead of the struct member.

The file prasterblaster/html/index.html is supposed to be the overall documentation for the library. I'm still working to make it clearer, but it provides an overview of how the components fit together. The html directory is generated by doxygen.

By carefully studying the TIFF and MPI-IO specs I created a collection of small functions that allow you to correctly write TIFF files in parallel with MPI-IO. I called this the simple parallel tiff writer (sptw). The functionality provided is limited and was mostly intended as a demonstration of how librasterblaster could be used with parallel i/o.
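For reference, a minimal sketch of the portable pattern described above: query the transferred count with MPI_Get_count() instead of reading implementation-specific MPI_Status members. This is illustrative, not the exact sptw.cc code.

 // Portable way to check how much data an MPI-IO write transferred: use
 // MPI_Get_count() on the returned status instead of reading internal
 // MPI_Status members such as "_count", which vary between MPI implementations.
 #include <mpi.h>
 #include <cstdio>
 
 void write_and_check(MPI_File fh, MPI_Offset offset, void *buf, int count) {
   MPI_Status status;
   MPI_File_write_at(fh, offset, buf, count, MPI_BYTE, &status);
 
   int written = 0;
   MPI_Get_count(&status, MPI_BYTE, &written);   // portable across implementations
   if (written != count)
     std::fprintf(stderr, "short write: %d of %d bytes\n", written, count);
 }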


17 Jan 2013; Padmanabhan: No change; it still fails to compile the demo directory.

I have looked at the doxygen-generated documentation. It does provide some function-level documentation, but that does not tell me how to use the library. I think that is very important. Specifically, sample code for a serial program that calls librasterblaster and a parallel program that uses parallel I/O should be provided and documented. I think having such documentation will be critical for anyone who wishes to use the library in their apps.


08 Feb 2013; Padmanabhan:

Kornelijus was able to independently get pRasterBlaster compiled on his machine. Next step may be to do some verification of results and development of developer documentation and install guides. I am sure David and Kornelijus can work together on that to make this happen.


13 Feb 2013; Liu:

At least for me, I think it [the hackathon] was productive. We were able to fix several compilation issues that prevented the UIUC team from compiling it on our machines; ran the prasterblaster program on a 1GB raster (global land cover) and discovered some algorithm implementation issues (the output raster is not symmetric; no librasterblaster library was generated); and identified to-do items for the next step.

This practice will continue as our undergrad, Kornelijus (he's good), gets up to speed. He will also be the driving force to push for a librasterblaster documentation as a developer who is eager to use it.


19 Feb 2013; Mattli:

I think I've fixed the build problems that the CIGI team found, but I haven't heard back definitively yet. I've also been communicating with the student Kornelijus Survila. He's been giving feedback about the documentation.




12/19/2012

David Mattli committed a stable version of librasterblaster to the CIGI subversion server.

To get the code:

1. svn co https://svn2.cigi.uiuc.edu:8443/usgs/prasterblaster/trunk prasterblaster


To compile the library:

1. cd prasterblaster

2. autoreconf -iv

3. ./configure (You might have to specify the MPI wrapper with "./configure CXX=mpiCC" or similar)

4. make

5. The library should now be found at "prasterblaster/src/librasterblaster.a"


To generate documentation:

librasterblaster includes lots of documentation about how to use the library and about specific classes and functions. It has to be generated with doxygen.

1. cd prasterblaster

2. doxygen

3. Open prasterblaster/html/index.html in your web browser.

Release Schedule

  • As of 28 Nov 2012:

V1.0 – Current.

V1.3 – End of Dec 2012? Includes the new libRasterBlaster; currently in testing at UIUC.

V1.5 – End of Feb 2013?? To include Babak's new I/O library.

V2.0 – Mid-Summer 2013??? Firm up the user interface (Mike's effort with Yan to close the gap between the CyberGIS Gateway and pRasterBlaster's current "interface" (term used lightly), while being consistent with the Gateway look and feel and with the dRasterBlaster (mapIMG) interface, historic and current).


11/28/2012

   Presentation by Yan on behalf of David and Mike
   Introduction to map reprojection
        Map reprojection converts a raster map image from an input projection to an output projection (USGS map reprojection project)
       Reprojection methods: coordinate transformation using GCTP; output framing; forward and inverse mapping; resampling
       Forward mapping after inverse mapping corrects wraparound problem
       Resampling recovers categorical information during inverse mapping 
   Parallel computing solution development
        MapIMG is the production software for map reprojection, but it uses a sequential algorithm
        pRasterBlaster is the ongoing parallel solution for map reprojection. pRasterBlaster is based on pRPL, an MPI-based parallel raster processing library
        UIUC tested pRPL performance on the Abe cluster at NCSA using up to 256 processors and datasets with up to 600M cells. Experiment results showed that: 1) with a small number of processors, pRPL might run out of memory on each processor when processing large datasets; 2) as the number of processors increases, communication overhead increases and eventually slows down the computation instead of gaining more speedup; 3) larger datasets benefit more from using multiple processors (the number of processors to be used needs to be optimized) 
    Data transfer service. A GridFTP service has been set up at USGS to facilitate transfers between USGS and TeraGrid. Preliminary tests on a 4GB sample raster image showed the benefit of using parallel data channels (optimal numbers: 8 channels for USGS-NCSA; 16 channels for USGS-TACC and USGS-SDSC) 

       Overview of parallel map reprojection code development (David)
       The parallel computing solution to map reprojection is based on cluster computing using MPI
       Code evolution
           Sequential code: MapIMG; production-quality code; suitable for small-scale map reprojection processing.
               Obtained significant improvements over common commercial software: addressed distortion, used kNN to eliminate strange noise in output
               Applied statistical resampling for categorical data
               Achieved improved results
               Processing high-resolution raster takes hours 
       pRasterBlaster: parallel map reprojection software
            Raster data processing is highly parallelizable by simply dividing rasters into rows, columns, or blocks
            The use of the parallel raster processing library (pRPL): pRPL uses a master-slave model; the master decomposes data and distributes the workload using message passing or a shared file system
           Local test environment at USGS: the Goodchild cluster (128 cores, 16 nodes, NFS) 
   Performance test results on TeraGrid (Yan)
       UIUC team tested current pRasterBlaster version using two sample datasets of resolution 38888x19446 and 643x322, respectively. Both the code and datasets were prepared by David
       Two TeraGrid clusters were used: Trestles@SDSC (10,368 cores; 100 teraflop/second) and Lonestar@TACC (22,656 cores; 302 teraflop/second)
       Compiler: Intel C/C++ 11; MPI: MVAPICH2 optimized for infiniband interconnection
       Test results on the small dataset showed negligible performance variations as the number of processor cores increased
       Test results on the large dataset: 

Prb-sdsc-perf.jpg

Figure 1. Performance results on Trestles@SDSC

Prb-tacc-perf.jpg

Figure 2. Performance results on Lonestar@TACC

        I/O becomes a performance bottleneck when a large number of processor cores is used in the reprojection computation: as the number of cores increases, the parallel file system is saturated by too many file I/O calls in a short amount of time 
   I/O issue discussion
        Two possible causes for the I/O bottleneck: the data servers on the parallel file system were saturated, or the I/O calls in pRasterBlaster (using GDAL's GeoTIFF library) were not efficient for a large-scale parallel computing environment
       Further investigations
           verify the correctness of output on all test scenarios (David helped verify them. Output rasters are correct)
            repeat the performance test using larger datasets to eliminate the possibility that the dataset was over-decomposed and thus resulted in poorer performance at larger numbers of cores. From the performance diagrams this is unlikely, but we need to verify 
       Proposed solutions
           Using MPI IO to parallelize the raster write operation. Eric introduced MPI IO. David had previous experience with MPI IO
               Implementation 1: extend GDAL library to support MPI IO for Geotiff I/O functions (time-consuming)
               Implementation 2: change output format to a parallel I/O-friendly format: NetCDF or HDF5. These two formats are already deployed on TeraGrid clusters. GDAL supports these two formats, too 
                   Mike: output format can be flexible. ArcGIS 10 reads and writes NetCDF
                    Steve: NetCDF support originated from ncar/ucar UMT data. NetCDF is working toward becoming a standard in OGC/ISO 
               Implementation 3: each process dumps its output chunk to local file system; then a master process (say MPI rank 0) aggregates local chunks into global output. This method eliminates the need to write to a single file by thousands of processes. It will be interesting to see if Trestles' SSD local file system can be leveraged to boost I/O performance of this method
               Action: David will identify the I/O part of code in prasterblaster and send to UIUC team 
           Consult cluster administrator on how to tune the underlying parallel file system in order to investigate where the I/O bottleneck occurred exactly 
   The first performance test identified significant performance slowdown on Trestles and Lonestar when the number of processor cores used exceeds a threshold (128 on Trestles, 384 on Lonestar)
   UIUC team is conducting the 2nd performance test to:
       identify performance bottlenecks: Babak uses TAU; Eric uses IPM.
       identify whether single file-locking is an issue
       test if parallel I/O approaches (i.e., MPI IO) can improve the I/O performance
       explore the solution to using parallel I/O for prasterblaster 
   Findings so far:
       Babak found that on Lonestar, if the number of cores is larger than 384, only one processor core was doing the computation. Others were idling. Dataset used has 19446 rows
       Yan modified the code to have each core output to its own output file, and observed the same behavior, i.e., only one file had output content, others had length 0
       Yan investigated the workload distribution algorithm in prasterblaster code (Chunker::getChunksByCount), and found that it has a major flaw. This flaw can explain what we observed:
            The current workload distribution algorithm follows row-wise decomposition: it sets the chunk size at 50 rows and splits the total number of rows into chunks; then each process, based on its process index, takes a subset of contiguous chunks and processes them one by one.
            The problem is that there are often leftover chunks. For example, if a dataset has 20,000 rows, 400 chunks will be created. If num_procs is 256, each process gets 1 chunk, but there are 144 leftover chunks. These chunks are assigned to the last process, which means the last process has to handle 145 chunks. When this happens we see long-tailed execution: one process is busy while all the others finish earlier. If num_procs is 512, each process gets 0 chunks and all the chunks are leftovers for the last process, which explains what we observed 
        Yan modified the workload distribution algorithm to produce a balanced workload for all processes (+/- 1); a minimal sketch of such a balanced split appears after this list
        Experiments on the new workload distribution algorithm produced satisfactory results, as shown in the two diagrams following the sketch below 
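A minimal sketch of one such balanced row-wise split, where every process receives either floor(total_rows / num_procs) rows or one row more; this illustrates the idea rather than the exact code that was committed.

 // Sketch of a balanced row-wise decomposition: every process receives either
 // floor(total_rows / num_procs) rows or that plus one, so the workload differs
 // by at most one row between processes. Illustrative only.
 #include <cstdio>
 
 struct RowRange { long first_row; long row_count; };
 
 RowRange BalancedRange(long total_rows, int num_procs, int rank) {
   const long base  = total_rows / num_procs;   // rows every process gets
   const long extra = total_rows % num_procs;   // the first 'extra' ranks get one more
   RowRange r;
   r.row_count = base + (rank < extra ? 1 : 0);
   r.first_row = rank * base + (rank < extra ? rank : extra);
   return r;
 }
 
 int main() {
   // Example from the discussion above: 20,000 rows over 512 processes.
   const long rows = 20000;
   const int  np   = 512;
   const int  ranks[] = {0, 1, 511};
   for (int i = 0; i < 3; ++i) {
     RowRange r = BalancedRange(rows, np, ranks[i]);
     std::printf("rank %d: rows %ld..%ld (%ld rows)\n",
                 ranks[i], r.first_row, r.first_row + r.row_count - 1, r.row_count);
   }
   return 0;
 }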

Prb sdsc2.jpg

Figure 1. Performance comparison of improved workload distribution algorithm, on Trestles@SDSC

Prb tacc2.jpg

Figure 2. Performance comparison of improved workload distribution algorithm, on Lonestar@TACC

       Guofeng is exploring parallel I/O solutions: parallel raster writes via NetCDF with MPI IO. There are two parallel NetCDF libraries: NetCDF4 from Unidata@UCAR that leverages the parallel HDF5; and the Parallel-NetCDF library from Argonne/Northwestern 
   Indications:
        I/O probably was not the sole bottleneck. On Trestles, after the workload distribution was fixed, it scales well to 256 and 512 cores without a performance penalty.
        The original workload distribution definitely affected performance. In all finished cases with different numbers of cores, performance with the fix was better than with the original.
        We will need to rethink the I/O issues of prasterblaster after the above findings are verified across all finished cases on both Lonestar and Trestles 


05/10/2012

  • Portability issues when deploying pRasterBlaster on XSEDE clusters
    • SVN version: r728
    • Cluster environment: Trestles.SDSC.edu; 32 cores per node; Infiniband interconnection; PBS job queue
    • Compiling environment: Intel compiler version 11; mvapich2 1.5.1; boost 1.46.1; gdal 1.8.0
    • Issue 1: shared_ptr. The shared pointer is now a feature of g++, but we had problems using g++'s shared_ptr and the Intel compiler at the same time. The reason to use the Intel compiler is to leverage the optimized MPI on the cluster, mvapich2, which is compiled only with the Intel compiler. We found 6 places in the source code that use std::shared_ptr and changed them to use boost::shared_ptr.
    • Issue 2: reprojector.cpp:21: the <cstdint> header is supported by g++ but not by icc, so we changed the code to use <stdint.h>. (A combined sketch of both portability fixes follows this list.)
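A minimal sketch of the kind of portability shim these two fixes amount to; the ChunkExtent type and helper function are hypothetical and exist only to show the substitutions.

 // Illustrative portability shim for the two issues above: use <stdint.h>
 // instead of <cstdint>, and boost::shared_ptr instead of std::shared_ptr,
 // so the code builds with both g++ and the Intel compiler on the cluster.
 #include <stdint.h>                  // portable fixed-width integer types
 #include <boost/shared_ptr.hpp>      // portable shared pointer
 
 // Hypothetical type used only to illustrate the substitutions.
 struct ChunkExtent {
   int64_t first_row;
   int64_t row_count;
 };
 
 typedef boost::shared_ptr<ChunkExtent> ChunkExtentPtr;
 
 ChunkExtentPtr make_extent(int64_t first_row, int64_t row_count) {
   ChunkExtentPtr extent(new ChunkExtent);
   extent->first_row = first_row;
   extent->row_count = row_count;
   return extent;
 }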

Q&As

  • Can David share the datasets he used for pRasterBlaster testing so that we can compare results? We can use the ybother machine to share the files
  • As you've seen in my previous report, which was on the older version of pRasterBlaster, ReprojectChunks() is one of the most time-consuming parts of the application. However, when I run David's code (with a much bigger dataset, about 13GB), ReprojectChunks() takes a negligible share of the whole application's time, which is surprising to me. I do see a bunch of "Error in reprojecting chunks #x" messages in the .err file, though, so I'm not sure whether this is because of the dataset we're using or whether it is expected. I think I can answer my question by running David's code using his own dataset.

Parallel I/O implementation

The PIO RasterBlaster works like this:

  • Creates an output raster using librasterblaster in rank 0. This raster will later be opened individually by every process.
  • –– Steps below are executed in parallel ––
  • Partitions the output raster space into a roughly equal number of pixels per process (the last partition might be larger).
  • For each partition:
    • Calculates input area and reads that input chunk
    • Allocates an empty output chunk
    • Reprojects the area
    • Writes output using SPTW

The SPTW (simple parallel tiff writer) is a simple wrapper around MPI-IO. It contains helper functions to open, close, and write TIFF rows. Because atomicity is disabled, each write executes in parallel and does not block another.

While opening a raster, SPTW uses GDAL to query the size and band parameters before using MPI-IO. These parameters are later used while writing rows to calculate the file offset.

Each process only reads and writes the data it needs, therefore file I/O is used optimally.
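Putting the steps above into code form, a highly simplified sketch of the per-partition driver loop; every helper function here is a placeholder standing in for librasterblaster/SPTW functionality, not the actual API.

 // Highly simplified sketch of the per-partition loop described above.
 // All helpers are placeholders with their real implementations elided.
 #include <mpi.h>
 #include <cstddef>
 #include <vector>
 
 struct Area  { long first_row, first_col, rows, cols; };
 struct Chunk { Area area; std::vector<double> pixels; };
 
 // Placeholder steps (bodies elided; names are illustrative, not the real API).
 static std::vector<Area> PartitionOutputSpace(int /*num_procs*/, int /*rank*/) { return std::vector<Area>(); }
 static Area  MapToInputArea(const Area &out) { return out; }
 static Chunk ReadInputChunk(const Area &in) { Chunk c; c.area = in; return c; }
 static Chunk ReprojectChunk(const Chunk &in, const Area &out) { Chunk c = in; c.area = out; return c; }
 static void  WriteChunkSPTW(MPI_File /*fh*/, const Chunk &/*out*/) {}
 
 void ReprojectMyPartitions(MPI_File output_file, int num_procs, int rank) {
   // Rank 0 has already created the output raster; every rank has opened it.
   std::vector<Area> my_areas = PartitionOutputSpace(num_procs, rank);
   for (std::size_t i = 0; i < my_areas.size(); ++i) {
     Area  in_area   = MapToInputArea(my_areas[i]);   // input area for this partition
     Chunk in_chunk  = ReadInputChunk(in_area);       // read only what this rank needs
     Chunk out_chunk = ReprojectChunk(in_chunk, my_areas[i]);
     WriteChunkSPTW(output_file, out_chunk);          // parallel write via SPTW
   }
 }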

NED Re-projection

Q/A
  • Where (and which) GDAL functions are called to create the output raster file?

Output rasters are created using librasterblaster::CreateOutputRaster. Under the hood, this function uses GDALDataset::SetGeoTransform and GDALDataset::SetProjection.
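As a rough illustration (not the librasterblaster implementation itself), creating an output GeoTIFF and setting its georeferencing with those GDAL calls might look like the sketch below; the geotransform values are examples only.

 // Rough illustration of creating an output GeoTIFF and setting its
 // georeferencing with GDAL. librasterblaster::CreateOutputRaster wraps the
 // same SetGeoTransform/SetProjection calls; the values here are examples.
 #include <gdal_priv.h>
 
 GDALDataset *CreateExampleOutput(const char *path, int xsize, int ysize,
                                  const char *wkt_projection) {
   GDALAllRegister();
   GDALDriver *driver = GetGDALDriverManager()->GetDriverByName("GTiff");
   if (driver == NULL) return NULL;
 
   GDALDataset *out = driver->Create(path, xsize, ysize, 1, GDT_Float64, NULL);
   if (out == NULL) return NULL;
 
   // Affine transform: upper-left x, pixel width, 0, upper-left y, 0, -pixel height.
   double geotransform[6] = { -20015000.0, 1000.0, 0.0,
                               10007500.0, 0.0, -1000.0 };
   out->SetGeoTransform(geotransform);
   out->SetProjection(wkt_projection);   // WKT string describing the output SRS
 
   return out;   // the caller is responsible for GDALClose(out)
 }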

  • Does rank 0 close the file before all of the other processes start writing to it?

Yes. The file is automatically closed in librasterblaster::CreateOutputRaster.

  • What is the I/O function to seek the position of write operations for each process, or is it implicitly specified in MPI IO calls?

SPTW uses MPI_File_write_at which takes in a parameter for the file offset.
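A small sketch of that pattern, assuming a simplified layout where each raster row occupies row_size_bytes starting at data_start; the real SPTW offset arithmetic accounts for the actual TIFF structure.

 // Sketch: each process writes its own rows at an explicit file offset, so no
 // seeking or coordination between processes is needed. The offset arithmetic
 // here is simplified relative to the real SPTW code.
 #include <mpi.h>
 #include <cstdint>
 
 void write_rows(MPI_File fh,
                 MPI_Offset data_start,      // byte offset where pixel data begins
                 int64_t row_size_bytes,     // bytes per raster row
                 int64_t first_row,          // first row owned by this process
                 int64_t row_count,
                 unsigned char *rows) {      // buffer of row_count * row_size_bytes
   MPI_Status status;
   for (int64_t r = 0; r < row_count; ++r) {
     MPI_Offset offset = data_start + (first_row + r) * row_size_bytes;
     MPI_File_write_at(fh, offset, rows + r * row_size_bytes,
                       static_cast<int>(row_size_bytes), MPI_BYTE, &status);
   }
 }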

Resources