CyberGIS Data Integration Requirements

Copied from below for now

  • CyberGIS requirements
    • Binary format for downloading purpose - large dataset
    • Spatial indexing/query capability
    • Performance on scalability and data transfer time


WFS/WCS/WPS as service-oriented integration approach

We should be able to handle the data integration task (a user of the application wants to upload their own data to a map service to share, or to an analysis service to run) mostly through existing data standards. Using point shapefiles is fine, but it comes with some limitations, and people may be more inclined to use OGC standards. Since we’re talking about building web services, WFS and WCS seem a good place to start, unless we have already found that these are not viable options.
For service integration, WPS is the only standards-based approach right now, and, as discussed in San Jose, it may have some limitations. I think it would be worth exploring these limitations and how they might affect the project, whether WPS can be improved within this project's timeline, and, if not, how to build something new from scratch.
Even if the core system the team builds does not use the OGC standards, it will need to support the OGC protocols if the project is to be useful in the long term and to people outside the project team.
In our Gateway, the current input/output of the parallel Gi*(d) and interpolation applications is published as WFS and WCS after computation. Sharing data as WFS/WCS, feeding our applications with WFS/WCS data sources, and publishing output as WFS/WCS make perfect sense as a service-oriented approach for our integration process.
In a way, these are great requirement specifications for integration, and hence great suggestions for extending the WFS/GML protocol based on the CyberGIS project work. Let us not shy away from being bold and leading the geospatial community to an updated/upgraded specification for what might now be large datasets but will in the near future be very realistic dataset sizes. After all, it is a moving target, and has always been a moving target... speaking as the former technical committee chair of the national spatial data transfer committee. In 1984 there were four committee members who understood what a self-describing file was and why it would be needed for spatial data communication.
  • How best can our CyberGIS group work with the OGC DWGs to get them to consider a change in the specification?
  • What if we were to compose a new specification based upon our needs? Maybe we arrive at a compromise. First, we need to document the way forward.
  • What technical papers are being discussed among the technical group?

Evaluation of WFS/WCS/WPS


  • Pros (WFS/WCS)
    • Using OGC-compliant data services in CyberGIS has the advantage of leveraging the vast amount of spatial data sources available online and of broadening the use of CyberGIS by publishing results produced from CI-based analysis
    • WFS/WCS provides a loosely-coupled integration approach for accessing spatial data via Web services
  • Cons (WFS/WCS)
    • Performance of WFS/WCS in serving large data in a multi-user environment is in question: GML or other text formats are not efficient for storing, transferring, and processing large datasets
    • WFS/WCS data is not spatially indexed or ordered. Two possible solutions: 1) develop a spatial indexing module that runs after WFS/GML data is retrieved from the remote server; and 2) extend WFS to enable spatial indexing
    • Data query/search capability is critical if the raw dataset is large while users request only a portion of the data. Possible solutions include WFS server-side data partitioning and parallel data transfer

  • CyberGIS requirements
    • Binary format for downloading purpose - large dataset
    • Spatial indexing/query capability
    • Performance on scalability and data transfer time
  • Pros (WPS)
    • Capability for specifying data processing interfaces
  • Cons (WPS)
    • Not mature (testbeds are for small-scale data), therefore not practical without CI support
    • Has little semantic power


At the HPDGIS workshop last November, I discussed some fundamental issues with WPS/WFS/GML for service-oriented GIS. A key conclusion is that we need cyberinfrastructure to enhance such work. Lots of WFS/GML resources can be found on the Internet, but we cannot use such resources directly. To quote Carl Reed’s comment: normally people just download such data, convert them into other formats, and then use the converted proprietary data on a local machine.
The problem is that WFS/GML does not maintain spatial index information, which is critical for spatial computation and overlay analysis. Any server that would like to perform spatial computation directly on WFS/GML, for example on one WFS from USGS and another from EPA, needs considerable hardware and computing resources to first copy the remote WFS/GML data into its local space and then build a spatial index before any spatial computation can run. Offering a service that processes remote WFS/GML directly is not feasible without cyberinfrastructure support, especially considering that multiple requests can be sent to the server concurrently.
WPS is also not practical, because the designers did not consider the above infrastructure constraints. The testbeds may handle small-scale datasets but not large-scale ones, or may handle only 2-3 concurrent requests. Thus WPS is not scalable, and therefore not practical, without CI support. As is common in service research, most researchers deal only with the interface definition [either in API or URL], but the interface itself cannot resolve all problems. In particular, WFS and GML were designed for data exchange, not for online data processing, and gave little consideration to HPC or parallelization.
Benefits of using WFS/WCS for data integration in CyberGIS include:
  • WFS/WCS provides a loosely-coupled way to implement our service-oriented integration strategy. In CyberGIS, data access and integration are fundamental capabilities we need. Data processing and the associated semantic and computational challenges, as Xuan mentioned, might need another project to deal with. But direct use of WFS/WCS is a straightforward way for us to share data with each other and access external data sources
  • WFS/WCS provides an efficient way to do CI-based collaborative analysis. For example, suppose we want to run a spatial analysis on TeraGrid using two datasets, one from USGS and the other from EPA, and we want to minimize the number of times data is copied or moved. We can then use WFS/WCS to get the data onto computing nodes directly, instead of maintaining a data repository at the Gateway and involving multi-hop data movement
I think Xuan also raises a good question on spatial indexing. If the data used in an analysis is a subset of a dataset, a spatial query is needed to extract the data of interest from the raw dataset. I think the OpenTopo team has a lot of experience in this regard. We need to elaborate on this with more use cases, I think.
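As a rough illustration of the spatial-query point, a WFS GetFeature request can carry a bounding box so that only the subset of interest is transferred. The sketch below only builds the request URL; the endpoint https://example.org/wfs and the layer name demo:parcels are made-up placeholders, and the coordinate order expected in bbox depends on the WFS version and CRS of the actual server, so treat the x,y order here as an assumption.

```python
from urllib.parse import urlencode

def getfeature_url(base_url, type_name, bbox, version="1.1.0"):
    """Build a WFS GetFeature key-value request limited to a bounding box.

    bbox is (minx, miny, maxx, maxy); axis order is an assumption here.
    """
    params = {
        "service": "WFS",
        "version": version,
        "request": "GetFeature",
        "typeName": type_name,
        "bbox": ",".join(str(v) for v in bbox),
    }
    return base_url + "?" + urlencode(params)

# Hypothetical endpoint and layer, used only to show the call shape.
url = getfeature_url(
    "https://example.org/wfs",
    "demo:parcels",
    (-84.5, 33.6, -84.2, 33.9),
)
print(url)
```

Splitting a large extent into several such boxes is one way the server-side partition and parallel transfer ideas listed earlier could be approached from the client side.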
For WPS, we need further exploration to see if we can even contribute to it through our CI experience.
It is difficult to get the data onto computing nodes directly because spatial data may be captured and recorded in no particular order. When you read WFS/GML, the order of the input spatial features is meaningless, since they are not sorted by either x or y coordinates. Until you have read through the file completely, you have no idea how to partition the data across computing nodes. Even in ArcGIS desktop, if you want to use a GML file directly, the first thing ArcGIS does is build a spatial index before any spatial analysis is run. For large-scale data [e.g. 500,000 polygons], it may take more than 20 minutes to build the spatial index. This is not unusual, since one county in the Atlanta urban area may have more than 600,000 parcel polygons.
The other issue is how to maintain the I/O stream when you read a large WFS/GML response over the Internet. Since you have no idea how the data should be partitioned before you have read through the whole file, you need a considerable memory cache on your gateway to keep the session. If you offer such a function as a service, you may receive multiple concurrent requests to process different data from remote sources. In this case, even though TeraGrid can offer the infrastructure to support such services, there is a lot of work to do in the Cyber environment, such as working out how to maintain the spatial index; otherwise we need to rebuild it every time we receive such a request.
In general, I am afraid WFS/GML integration is not mature at this moment, considering those unknown issues. At the HPDGIS workshop, I mentioned an email I got from a German professor whose team has been working on WPS. However, his testbed could not handle large datasets, and he asked me to tell his team first if I wanted to try a large dataset, since otherwise his server might crash. This is why I think WFS/GML needs to be re-formatted first, since neither was designed for data processing even in the standalone/desktop environment, let alone in the Cyber space. But we can try something on CyberGIS using re-formatted GML if it is appropriate.
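The partitioning problem can be sketched in a few lines. The example below is a simplified illustration, not a proposed design: it assumes features are plain lists of (x, y) tuples and uses a uniform grid, and the function names are invented for this sketch. The point it shows is that the global extent must be known before any feature can be assigned to a cell, which is why unordered input forces a full first pass.

```python
from collections import defaultdict

def bbox_of(coords):
    """Bounding box (minx, miny, maxx, maxy) of a list of (x, y) tuples."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)

def partition_by_grid(features, nx, ny):
    """Assign each feature index to a grid cell by its bounding-box center."""
    # Pass 1: every feature must be read to learn the global extent.
    boxes = [bbox_of(f) for f in features]
    minx = min(b[0] for b in boxes)
    miny = min(b[1] for b in boxes)
    maxx = max(b[2] for b in boxes)
    maxy = max(b[3] for b in boxes)
    cell_w = (maxx - minx) / nx or 1.0  # guard against a zero-width extent
    cell_h = (maxy - miny) / ny or 1.0

    # Pass 2: bucket features into cells; each cell could go to one node.
    cells = defaultdict(list)
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        col = min(int((cx - minx) / cell_w), nx - 1)
        row = min(int((cy - miny) / cell_h), ny - 1)
        cells[(col, row)].append(i)
    return dict(cells)

features = [
    [(0, 0), (1, 0), (1, 1)],
    [(9, 9), (10, 9), (10, 10)],
    [(0, 9), (1, 10)],
]
cells = partition_by_grid(features, nx=2, ny=2)
print(cells)
```

A real spatial index (an R-tree, for instance) would replace the uniform grid, but the dependency on a complete first read stays the same.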
Technically, WPS is an adaptation of my semantic request and response [SRR] approach published by IBM, though this approach was enhanced by ontology [OSRR] in my dissertation. Take the “intersect” concept as an example: it can be used to select features from one dataset that intersect with features in another dataset. However, “intersect” can also be a function that creates new features from the overlap of one dataset with the other. WPS and other OGC standards carry little semantic information and have to be improved.
For this CyberGIS project, if we plan to do direct integration of [remote] WFS/WCS/GML with other analytic tools or services, we need to consider such issues first. Possible solutions include:
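The two readings of “intersect” can be made concrete with a toy example. The sketch below uses axis-aligned rectangles (minx, miny, maxx, maxy) as stand-ins for real geometries, and the function names are invented for illustration; it is not how WPS or SRR defines the operation. The same word maps to two operations with different outputs, which is the ambiguity a process interface alone does not resolve.

```python
def overlaps(a, b):
    """True if two rectangles (minx, miny, maxx, maxy) share area or an edge."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def select_intersecting(layer_a, layer_b):
    """'Intersect' as selection: features of layer_a, unchanged, that touch layer_b."""
    return [a for a in layer_a if any(overlaps(a, b) for b in layer_b)]

def intersection_geometries(layer_a, layer_b):
    """'Intersect' as overlay: new rectangles covering only the shared area."""
    result = []
    for a in layer_a:
        for b in layer_b:
            if overlaps(a, b):
                result.append(
                    (max(a[0], b[0]), max(a[1], b[1]),
                     min(a[2], b[2]), min(a[3], b[3]))
                )
    return result

layer_a = [(0, 0, 4, 4), (10, 10, 12, 12)]
layer_b = [(2, 2, 6, 6)]
selected = select_intersecting(layer_a, layer_b)
overlay = intersection_geometries(layer_a, layer_b)
print(selected)
print(overlay)
```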
  1. Re-format GML/WFS in response to the needs of parallelization and partition;
  2. Re-build spatial index after GML/WFS retrieved from remote servers; and
  3. Develop tools and GUIs for integration testing. However, this single task may need another NSF grant considering the heavy workload for CyberGIS.
Quoted from a German professor: We have not yet made systematic performance evaluations with different input sizes, but this is definitely a good idea. You are right, performance is an issue with larger data sets and we are quite happy that nobody has used our online WPS with "real" data so far ;-) In case you intend to: please inform us ahead so that we can try to monitor the servers...
Quoted from Carl about GML: And one would never process against a GML file. The GML file would most likely be converted into some internal binary format on the server for further processing
WFS is based on GML, and what users retrieve from WFS is GML. If Carl’s comment is correct, then traditionally WFS/GML has to be converted before it can be processed. That’s why I think the current WFS/GML format is a challenge for direct integration in CyberGIS. But we should reformat it and explore what is required in the Cyber environment.
Yan: Does WFS support other formats for downloading? From our GeoServer experience, feature layer data can be downloaded as GML, CSV, or Shapefile. Shapefile is a binary format, which can significantly reduce data transfer traffic.
...getting the data to the process is a first step. That may be the way to start. "Direct connection" or "direct integration" is indeed desired, but it might take some time to "shake out" just what works.
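To make the conversion step concrete, here is a minimal sketch of turning GML point features into a packed binary array while streaming the XML, so the whole document is never held as a parsed tree. It assumes point geometries encoded as gml:pos under the http://www.opengis.net/gml namespace and a layout of two little-endian float64 values per point; both the layout and the function name are choices made for this example, not a format anyone in the discussion proposed.

```python
import io
import struct
import xml.etree.ElementTree as ET

# Assumed namespace for gml:pos; GML 3.2 responses use a different URI.
GML_POS = "{http://www.opengis.net/gml}pos"

def gml_points_to_binary(gml_stream):
    """Stream gml:pos elements and pack each as two little-endian float64s."""
    out = bytearray()
    for _event, elem in ET.iterparse(gml_stream, events=("end",)):
        if elem.tag == GML_POS:
            x, y = (float(v) for v in elem.text.split()[:2])
            out += struct.pack("<dd", x, y)
        elem.clear()  # drop text and children already processed
    return bytes(out)

sample = b"""<wfs:FeatureCollection
    xmlns:wfs="http://www.opengis.net/wfs"
    xmlns:gml="http://www.opengis.net/gml">
  <gml:featureMember><gml:Point><gml:pos>-84.39 33.75</gml:pos></gml:Point></gml:featureMember>
  <gml:featureMember><gml:Point><gml:pos>-84.40 33.76</gml:pos></gml:Point></gml:featureMember>
</wfs:FeatureCollection>"""

packed = gml_points_to_binary(io.BytesIO(sample))
print(len(sample), "bytes of GML ->", len(packed), "bytes packed")
```

Even on this tiny sample the packed form is a fraction of the GML size, which is the same concern raised above about text formats for large transfers.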
I agree, WFS and WCS are not a solution for much beyond copying of data, and the lack of a spatial index in WFS is a show stopper for this project since large data is almost guaranteed.
Once we get some thoughts down on a wiki and have others on the CyberGIS project contribute, I look forward to contributing to a draft manuscript along the lines of "cybergis integration strategy: pros and cons of an approach".