Compile SpatialHadoop from source
This page describes how to compile SpatialHadoop from the source code and how to install the version that you compiled. This lets you get the most recent updates without waiting for the next release.
You need to have the following tools installed to be able to compile SpatialHadoop.
- JDK 1.6 or higher.
- Maven
- Git (to obtain the source code only)
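Before starting, it can help to confirm that these tools are on your PATH. The following is a minimal sketch, not part of SpatialHadoop: the function name `check_tools` is made up for illustration, and the check only tests that each command exists, not that its version is recent enough.

```shell
#!/bin/sh
# Report whether each named tool is on PATH.
# Returns non-zero if at least one tool is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# javac ships with the JDK, mvn is Maven, git is only needed for cloning.
check_tools javac mvn git || echo "Install the missing tools before compiling."
```

If you plan to download the source archive instead of cloning, you can drop `git` from the list.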
First step: Obtain the source code: You can use git to get the source code from GitHub using the command:
git clone https://github.com/aseldawy/spatialhadoop2.git
If you do not wish to use git, you can download the source code as an archive from the project page on GitHub.
Second step: Compile: Navigate to the source code directory and issue the command:
mvn compile
This will compile the source code based on Apache Hadoop 2.x.
Third step: Generate a runnable jar: To build a runnable jar file that contains the classes of SpatialHadoop and can be run using the hadoop jar command, issue the following command:
mvn package
Notice that the generated jar contains only the classes of SpatialHadoop without any third-party libraries (e.g., JTS). This means you cannot run this jar unless your Hadoop distribution already has all the required libraries.
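One way to see whether a particular jar bundles a library is to list its entries and look for that library's package path. The sketch below is illustrative only: `listing_has_prefix` is a made-up helper, the jar name in the usage comment is a placeholder, and the path `com/vividsolutions/jts/` assumes an older JTS release (newer releases use `org/locationtech/jts/`).

```shell
#!/bin/sh
# Read a jar listing on stdin (one entry per line, as printed by `jar tf`)
# and succeed if any entry starts with the given path prefix.
listing_has_prefix() {
    prefix=$1
    while IFS= read -r entry; do
        case $entry in
            "$prefix"*) return 0 ;;   # quoted prefix is matched literally
        esac
    done
    return 1
}

# Usage (placeholder jar name; `jar` ships with the JDK):
#   if jar tf target/your-spatialhadoop.jar | listing_has_prefix com/vividsolutions/jts/
#   then echo "JTS is bundled in the jar"
#   else echo "JTS must be provided by the Hadoop installation"
#   fi
```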
Fourth step: Generate a portable redistributable package: The following command generates two redistributable packages of SpatialHadoop, a setup package and a portable jar:
mvn assembly:assembly
After running this command, you will find two files in the 'target' directory:
- A JAR file that ends in '-uber.jar'. This file includes the classes of SpatialHadoop as well as all the non-standard third-party libraries, such as JTS and ESRI-Geometry-API. You can run this JAR file on a standard Apache Hadoop installation. It is also a reasonable option for Amazon Web Services (AWS), including Amazon Elastic MapReduce (EMR), because AWS provides only standard Hadoop distributions without SpatialHadoop.
- An archive that ships the main SpatialHadoop JAR file along with other required files. The files are laid out in a structure similar to Apache Hadoop's to make it easier to install on an existing Hadoop installation. All you need to do is extract this package into the Hadoop home of every cluster node and then restart the cluster so that all nodes load the libraries.
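To pick out the portable jar after the build, you can match on the '-uber.jar' suffix mentioned above. This is a small sketch with a made-up helper name, `find_uber_jars`. The archive's file extension is not stated on this page, so the sketch does not try to match it.

```shell
#!/bin/sh
# Print every file in the given directory (default: target) whose name
# ends in "-uber.jar". Prints nothing if there is no such file.
find_uber_jars() {
    dir=${1:-target}
    for f in "$dir"/*-uber.jar; do
        # When nothing matches, the unexpanded pattern is left in $f; skip it.
        [ -f "$f" ] && echo "$f"
    done
    return 0
}

# Run from the source code directory after the assembly step.
find_uber_jars target
```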
The steps above describe three ways to build the binaries of SpatialHadoop and run them: the runnable jar, the portable jar, and the distribution package. The three techniques trade off performance against portability, as described below.
The distribution package technique is the most efficient because it injects all SpatialHadoop classes and required libraries into the Hadoop distribution so that they are all loaded at startup on every Hadoop node. This means that when you run any SpatialHadoop command, it is served directly from the classes in memory without loading any classes from disk. The drawback is that whenever you change the classes of SpatialHadoop, you need to reinstall the new libraries on every Hadoop node and restart the cluster before you can use them.
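Installing the distribution package on every node can be scripted. The sketch below only prints the commands it would run, so you can review them first. It rests on several assumptions that this page does not confirm: that the archive is a .tar.gz file, that every node uses the same Hadoop home, and that SSH access to each node is already set up. The archive name, Hadoop home, and host names are placeholders.

```shell
#!/bin/sh
# Print (without running) the commands that would copy an archive to each
# node and unpack it into the Hadoop home there.
# Assumptions: .tar.gz archive, same Hadoop home on all nodes, SSH configured.
print_install_commands() {
    archive=$1
    hadoop_home=$2
    shift 2
    for host in "$@"; do
        echo "scp $archive $host:/tmp/"
        echo "ssh $host tar -xzf /tmp/$(basename "$archive") -C $hadoop_home"
    done
}

# Placeholder archive name, Hadoop home, and host names.
print_install_commands target/your-spatialhadoop-archive.tar.gz /opt/hadoop node1 node2
```

After the files are in place on every node, the cluster still has to be restarted so that the new libraries are loaded.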
The portable runnable jar is the other extreme. It is a single runnable jar file that contains SpatialHadoop classes in addition to all required libraries. This jar file can be executed using the hadoop jar command on any Hadoop distribution. This means Hadoop has to distribute the jar file to all cluster nodes before running every job. In addition, each machine has to load all classes from the jar file on each run. Although this adds some overhead, it has the advantage of running on any cluster even if you do not have administrator access to it, because you do not need to restart the cluster or add any files to the Hadoop home directory.
The runnable jar balances the tradeoff between the two other techniques. In this case, you install only the third-party libraries in your Hadoop distribution and then restart the cluster to have these libraries loaded. However, SpatialHadoop classes are not installed as part of the cluster. This makes it easy to modify the code of SpatialHadoop, recompile it, and run it without restarting the cluster. This technique still requires administrative access to the cluster to install the third-party libraries, which are not part of the default Hadoop distribution.
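The tradeoff can be summarized as a simple decision rule. The sketch below is one possible reading of the comparison above, not an official recommendation; the function name `choose_technique` and its yes/no arguments are made up for illustration.

```shell
#!/bin/sh
# Suggest a deployment technique from two yes/no answers:
#   $1: do you have administrator access to the cluster?
#   $2: do you expect to change SpatialHadoop's code often?
choose_technique() {
    admin_access=$1
    frequent_changes=$2
    if [ "$admin_access" != "yes" ]; then
        echo "portable jar"            # needs no files installed on the cluster
    elif [ "$frequent_changes" = "yes" ]; then
        echo "runnable jar"            # only third-party libraries are installed
    else
        echo "distribution package"    # everything is preloaded on every node
    fi
}

choose_technique yes no   # prints "distribution package"
```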