Software Description for the Parallel Gi*(d) Spatial Statistic (pGID)
pGID is a parallel hot spot analysis application for local clustering detection, based on Getis and Ord's Gi*(d) spatial statistic. Its functionality is similar to that of ArcGIS Hotspot Analysis, with the following enhancements:
- Parallel computing of the Z-score (Gi*(d) value) for all points
- Parametric study of the distance parameter, with support for multiple d values
- Large-scale dataset processing support on cyberinfrastructure through parallelization
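As an illustration of what is computed for each point, the following is a minimal serial sketch of the standardized Gi*(d) Z-score for several d values. It assumes binary weights (1 if a point lies within Euclidean distance d, the point itself included) and the commonly used standardized form of the statistic; the function names `gi_star` and `gi_star_multi` are hypothetical and are not taken from the pGID package.

```python
import math

def gi_star(points, d):
    """Return the Gi*(d) Z-score of every point in `points`.

    points: list of (x, y, z) tuples, z being the attribute value.
    Assumes binary weights: w_ij = 1 if dist(i, j) <= d (i included), else 0.
    """
    n = len(points)
    values = [p[2] for p in points]
    mean = sum(values) / n
    # Global standard deviation (population form); clamp tiny negative rounding.
    s = math.sqrt(max(sum(v * v for v in values) / n - mean * mean, 0.0))
    scores = []
    for xi, yi, _ in points:
        k = 0            # number of points within distance d (sum of weights)
        local_sum = 0.0  # weighted sum of attribute values
        for xj, yj, zj in points:
            if math.hypot(xi - xj, yi - yj) <= d:
                k += 1
                local_sum += zj
        # With binary weights, sum(w^2) equals sum(w) equals k.
        denom = s * math.sqrt((n * k - k * k) / (n - 1))
        # The score is undefined when every point is a neighbour or z is constant.
        scores.append((local_sum - mean * k) / denom if denom > 0 else float("nan"))
    return scores

def gi_star_multi(points, distances):
    """Parametric study: one list of Z-scores per distance value."""
    return {d: gi_star(points, d) for d in distances}

if __name__ == "__main__":
    pts = [(0, 0, 10), (1, 0, 10), (0, 1, 10), (10, 10, 1), (11, 10, 1), (10, 11, 1)]
    for d, scores in gi_star_multi(pts, [2, 5]).items():
        print(d, [round(v, 3) for v in scores])
```

The double loop is quadratic in the number of points, which is why a large dataset benefits from being split and processed in parallel.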
Input and output
pGID takes a point shape file as input. There are several commonly used point shape file formats: Point, PointZ, and PointZM. Currently, pGID accepts only the Point format.
The output of pGID can be an ASCII point dataset and/or a point shape file with the Z-score stored in the dbf file.
For a large point dataset, pGID uses a regular quad-tree-based domain decomposition technique to split the dataset into a set of subdatasets. Each subdataset carries the points whose Gi*(d) values are to be computed, as well as the other points needed in that computation. The Gi*(d) computations for these subdatasets are therefore independent of one another and embarrassingly parallel. A task scheduling algorithm requests an appropriate number of processors from the cyberinfrastructure (currently TeraGrid) and then assigns each processor the Gi*(d) computation for one or more subdatasets. After all subdatasets have been computed, the results are aggregated to form the output file.
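The decomposition idea can be sketched as follows. This is an illustrative approximation, not the pGID code: it assumes a fixed-depth regular quad-tree (a grid of 2^depth by 2^depth cells) and gathers, for each cell, the "core" points whose Gi*(d) is to be computed plus a "halo" of outside points lying within the largest d of the cell rectangle. The names `decompose`, `core`, and `halo` are hypothetical.

```python
import math

def decompose(points, distances, depth):
    """Split points into up to 4**depth subdatasets keyed by cell (cx, cy).

    Each subdataset holds 'core' points (inside the cell) and 'halo' points
    (outside the cell but within max(distances) of the cell rectangle).
    """
    d_max = max(distances)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    xmin, ymin = min(xs), min(ys)
    cells = 2 ** depth
    # Fall back to a unit cell size if all points share one coordinate.
    w = (max(xs) - xmin) / cells or 1.0
    h = (max(ys) - ymin) / cells or 1.0

    def cell_of(p):
        # Points on the upper boundary are assigned to the last cell.
        return (min(int((p[0] - xmin) / w), cells - 1),
                min(int((p[1] - ymin) / h), cells - 1))

    subs = {}
    for p in points:
        subs.setdefault(cell_of(p), {"core": [], "halo": []})["core"].append(p)

    for (cx, cy), sub in subs.items():
        x0, y0 = xmin + cx * w, ymin + cy * h
        x1, y1 = x0 + w, y0 + h
        for p in points:
            if cell_of(p) == (cx, cy):
                continue
            # Distance from the point to the cell rectangle (0 if inside).
            dx = max(x0 - p[0], 0.0, p[0] - x1)
            dy = max(y0 - p[1], 0.0, p[1] - y1)
            if math.hypot(dx, dy) <= d_max:
                sub["halo"].append(p)
    return subs

if __name__ == "__main__":
    pts = [(0, 0, 1), (1, 1, 1), (9, 9, 1), (10, 10, 1), (4.9, 4.9, 1)]
    for cell, sub in sorted(decompose(pts, [1, 2], depth=1).items()):
        print(cell, len(sub["core"]), "core,", len(sub["halo"]), "halo")
```

Because the halo is sized by the largest distance value, one decomposition can serve every d in the vector. Under the standardized form of the statistic, each subdataset would also need the global mean and standard deviation of the attribute values, which this sketch leaves out.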
pGID is implemented as a package of C code and Perl and Bash scripts on the Linux platform. Points with their input Z-value are extracted by shape file tools we developed using shapelib. The domain decomposition and Gi*(d) calculation code is written in C. The domain decomposition code takes the input dataset and a vector of distance values, and produces a set of subdatasets. Task scheduling is implemented in Bash and Perl. MPI is used to collect computing resources for task scheduling.
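The scheduling algorithm itself is not detailed here. As one plausible way to assign several subdatasets to each processor, the sketch below uses a greedy "largest task first" heuristic on an estimated cost per subdataset (for example, its point count). The function `assign_tasks` and the cost estimates are assumptions for illustration only, not the pGID scheduler.

```python
import heapq

def assign_tasks(costs, n_procs):
    """Greedily assign subdatasets to processors, largest cost first.

    costs: dict mapping a subdataset id to an estimated cost.
    Returns a list with one list of subdataset ids per processor.
    """
    # Min-heap of (current load, processor index).
    heap = [(0, p) for p in range(n_procs)]
    heapq.heapify(heap)
    plan = [[] for _ in range(n_procs)]
    for sub_id, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(heap)   # least-loaded processor so far
        plan[p].append(sub_id)
        heapq.heappush(heap, (load + cost, p))
    return plan

if __name__ == "__main__":
    print(assign_tasks({"a": 8, "b": 7, "c": 3, "d": 2}, 2))
```

With two processors and costs 8, 7, 3, and 2, both processors end up with a total load of 10.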
The current implementation accepts only Point shape files and uses Euclidean distance. It does not produce p-values.
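Since the output contains Z-scores only, a user who needs p-values could derive them afterwards. The sketch below assumes the Gi*(d) Z-score is approximately standard normal and computes a two-sided p-value; `two_sided_p` is a hypothetical helper, not part of pGID.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a Z-score under a standard normal assumption."""
    return math.erfc(abs(z) / math.sqrt(2.0))

if __name__ == "__main__":
    for z in (0.0, 1.96, -2.58):
        print(z, round(two_sided_p(z), 4))
```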
pGi*(d) visualization is implemented as a module in CyberGIS Gateway, built on the open source OpenLayers, Geoserver, Postgres, and PostGIS. Because Geoserver is slow at rendering large point shape files, output points are imported into PostGIS and served as a feature type in Geoserver. OpenLayers fetches the feature type and displays it as an overlay map in the Web browser. Users can visualize Gi*(d) results for multiple distance values by interactively selecting a d value in the visualization interface.
- Open source: the pGID code is being polished for open source release.
- Cyberinfrastructure integration: pGID is currently deployed on the TeraGrid Abe cluster. The package is compiled using Intel compiler version 10 and mpich-vmi-2.2.0-3-intel-ofed-1.2. It will also be deployed on other clusters that our Gateway has access to (e.g., Ranger@TACC).
- Gateway: pGID is integrated into CyberGIS Gateway as a gateway application named "Parallel Hot Spot Analysis". Users can access pGID functions by logging in to the Gateway web site.
- Input point dataset: a simulated point shape file containing 20,000 points (x, y, z) in 8 clusters. Within each cluster, points follow a normal distribution.
- Output for d=2
- Output for d=5
- Output for d=8
- Output for d=12