Skip to content

CG_Hadoop: A computational geometry library for MapReduce

Ahmed Eldawy edited this page Feb 26, 2016 · 6 revisions

A recent addition to SpatialHadoop is CG_Hadoop, a set of computational geometry operations written in MapReduce. This is useful to scale up computational geometry operations for datasets of tera bytes and billions of records. We started with the polygon union operation which computes the union of a set of input polygons. We then made other fundamental computation geometry operations including skyline, convex hull, farthest pair and closest pair. Each operation runs on both indexed and non-indexed files. The version that runs on non-indexed (heap) files is much slower as it needs to scan the whole file. The indexed version is much more efficient as it makes good use of the index to speed up the processing and early prune file partitions that do not need to be processed.

The technical details behind these algorithms are described in our paper in ACM SIGSPATIAL paper titled "CG_Hadoop: Computational Geometry in MapReduce". All the source code is available in our github repository in the operations package.

How to use

To use the computational geometry operations in SpatialHadoop, you need to write the correct command for each one. Below, we describe how to use each operation.

Polygon Union

Computes the union of a set of polygons in an input file. This operation only works for input files of type JTSShape. It takes one input file and computes the union of all polygons contained in this file.

  bin/shadoop union <input> <output> -overwrite
  • input and output: Paths to input and output. If input file is indexed, the index will be accessed to speed up the query execution.
  • -overwrite: If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.

Skyline

Computes the skyline of a set of points. The skyline is the set of non-dominated points in the input dataset. A point p is said to dominate another point q if p is larger than or equal to q in both x and y dimensions.

  bin/shadoop skyline <input> <output> dir:<direction> -overwrite
  • input and output: Paths to input and output. If input file is indexed, the index will be accessed to speed up the query execution.
  • direction: the direction to compute the skyline. Valid directions are max-max, max-min, min-max, and min-min. For example, max-min means that a point p dominates another point q if it is larger in x and smaller in y.
  • -overwrite: If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.

Convex Hull

Computes the convex hull of a set of points. The convex hull is the minimal convex polygon that contains all input points. This operations returns only the subset of points that form the convex hull, i.e., the corner points of the convex hull.

  bin/shadoop convexhull <input> <output> -overwrite
  • input and output: paths to input and output. If input file is indexed, the index will be accessed to speed up the query execution.
  • -overwrite: If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.

Farthest Pair

As its name states, this method compute the two points with the largest Euclidean distance. Since the two points forming the farthest pair are always two corners on the convex hull, this method assumes that the input points already form a convex hull. If the input is just a set of scattered points, you should call the convex-hull operation first. The method does not explicitly call the convex hull operation to avoid unnecessary processing in case the input is already on a convex hull.

  bin/shadoop farthestpair <input> <output> -overwrite
  • input and output: paths to input and output. If input file is indexed, the index will be accessed to speed up the query execution.
  • -overwrite: If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.

Closest Pair

This method is the direct opposite of farthest pair. It finds the pair of points that has the minimum Euclidean distance between them.

  bin/shadoop closestpair <input> <output> -overwrite
  • input and output: paths to input and output. If input file is indexed, the index will be accessed to speed up the query execution.
  • -overwrite: If provided, output file will be overwritten without notice. Otherwise, the job will fail if output file already exists.
Clone this wiki locally