Getting Started with GalaxyGIS
The Clemson Center for Geospatial Technologies (CCGT) cyberinfrastructure includes a High Throughput Computing pool, called GalaxyGIS, to address the needs of desktop GIS users who need additional computational power for their GIS analysis.
The GIS Cluster consists of 34 Windows computers each with:
Installed GIS programs
Intel Xeon E3-1271 Processor
16GB of RAM
Up to 2 terabytes of usable hard disk space per Condor-enabled PC
The entire Cluster includes over 700 computational nodes and a High Throughput scheduler called HTCondor that distributes GIS jobs across the nodes for parallel processing. All of these resources are free to all Clemson students, faculty, and staff. Flocking also makes additional processing power available to users who have an account with the Palmetto Cluster.
Dealing with Large Vector Data
When working with ArcGIS, sometimes jobs can get very large and take hours to process. Using GalaxyGIS and HTCondor to break jobs into smaller pieces can really cut down on processing time. In order to use Condor, you must first write 3 Python scripts. If you have never used Python or never coded at all, don't worry! We have examples and tutorials for helping you write your own scripts. Here is an example workflow using large vector data that you can try for yourself in order to practice writing scripts and using Condor:
One common use case for Condor is to analyze the intersection between two feature classes, often a polyline feature class and a point feature class. In this description, we will explain how Condor can be used for a project like this, as well as provide you with the resources to test this workflow yourself and apply it to your own ArcGIS project.
One such project involved calculating all possible intersections between analyzed traffic routes (1.9 million observations) and all the traffic data collection sites that are spread throughout the city of Greenville. To solve this problem, we used Condor and took a three-step approach:
1. Break up the large 1.9 million entry data set of roadways into smaller chunks of 5,000 roads each. This added up to 395 individual road data sets.
2. Submit each separate data set through Condor to be processed separately and concurrently to calculate the intersection of each road with each collection site.
3. Merge all those observations back into one large data set.
Using Condor, the processing time was cut from over 4 days to about 3 hours!
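The splitting in step 1 can be sketched in plain Python. This is a minimal illustration, not the CCGT split script: the function names chunk_ranges and where_clause are hypothetical, and the sketch assumes the road feature class has a sequential integer ID field (called OBJECTID here) that can be used to select each chunk.

```python
def chunk_ranges(total_rows, chunk_size):
    """Return inclusive, 1-based (start, end) row ranges covering total_rows."""
    if total_rows < 0 or chunk_size <= 0:
        raise ValueError("total_rows must be >= 0 and chunk_size must be > 0")
    ranges = []
    for start in range(1, total_rows + 1, chunk_size):
        # The last chunk may be shorter than chunk_size.
        end = min(start + chunk_size - 1, total_rows)
        ranges.append((start, end))
    return ranges


def where_clause(start, end, id_field="OBJECTID"):
    """Build a selection expression for one chunk (id_field is an assumption)."""
    return f"{id_field} >= {start} AND {id_field} <= {end}"


if __name__ == "__main__":
    # Small example: 12 rows in chunks of 5 gives three chunks.
    for start, end in chunk_ranges(12, 5):
        print(where_clause(start, end))
```

Each expression could then be used by a split script to export one chunk as its own data set, so that every Condor job works on a single chunk.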
If you want to try Condor, we have a subset of this project linked below for practice. Further instructions can be found there if you decide you want to apply Condor to your project.
Flocking to Wisconsin-Madison
Flocking is HTCondor's way of allowing jobs that cannot immediately run (within the pool of machines where the job was submitted) to instead run on a different HTCondor pool. If a machine within HTCondor pool A can send jobs to be run on HTCondor pool B, then we say that jobs from machine A flock to pool B. Flocking can occur in a one way manner, such as jobs from machine A flocking to pool B, or it can be set up to flock in both directions.
You can utilize flocking by sending jobs to the University of Wisconsin-Madison's pool, conveniently named "UW-Madison CS". UW-Madison is where HTCondor was originally developed. To use flocking, you’ll first need access to the Palmetto Cluster. Once you have an account, you’ll then need to request that a home directory be made for you so you can use the condor-cm machine. These requests can be made by contacting us and requesting permission to flock.
The only code you need to modify is your Condor submission file so that it replaces/includes the following lines:
requirements = regexp("ad.wisc.edu$",Machine) == True && TARGET.OpSys == "WINDOWS"
+WantFlocking = true
Next, use an SSH client to connect to “condor-cm.palmetto.clemson.edu” and log in with your Clemson username and password. Then use SFTP to transfer all your data and scripts to your home directory. Finally, submit the jobs to Condor and they will flock to the University of Wisconsin-Madison.
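If you have many submit files, the submit-file change described above can also be scripted. The following is a rough sketch only: the helper name add_flocking is hypothetical, and it assumes your submit file ends with a queue statement and that any existing requirements line should be replaced. The two flocking lines themselves are copied from this page.

```python
FLOCKING_LINES = [
    'requirements = regexp("ad.wisc.edu$",Machine) == True && TARGET.OpSys == "WINDOWS"',
    "+WantFlocking = true",
]


def add_flocking(submit_text):
    """Return submit_text with the flocking lines placed before the queue statement.

    Any existing requirements or +WantFlocking line is dropped first so the
    new lines replace it rather than duplicate it.
    """
    kept = []
    for line in submit_text.splitlines():
        key = line.split("=", 1)[0].strip().lower()
        if key in ("requirements", "+wantflocking"):
            continue
        kept.append(line)

    # Submit commands must appear before the queue statement to take effect.
    for index, line in enumerate(kept):
        if line.strip().lower().split()[:1] == ["queue"]:
            return "\n".join(kept[:index] + FLOCKING_LINES + kept[index:]) + "\n"

    # No queue statement found: append the lines at the end.
    return "\n".join(kept + FLOCKING_LINES) + "\n"


if __name__ == "__main__":
    # Hypothetical submit file contents, for illustration only.
    example = "universe = vanilla\nexecutable = run_job.bat\nqueue 3\n"
    print(add_flocking(example))
```

Review the resulting file by hand before submitting, since your own submit file may contain other settings that interact with the requirements expression.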
If you want to use HTCondor to submit jobs to GalaxyGIS, there are two ways you can do that. The first is through the CCGT computer lab and the second is through your personal machine.
CCGT Computer Lab
Step 1: Creating Your Data Folder
First, log onto a computer and make a folder on the D drive containing:
Your Python scripts
The Condor submit file
An empty log folder
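Before moving on, you may want to confirm that the folder has everything listed above. The check below is a small sketch under stated assumptions: the function name check_job_folder is hypothetical, the submit file is assumed to end in .sub, and the log folder is assumed to be named log. Adjust these to match your own files.

```python
from pathlib import Path


def check_job_folder(folder):
    """Return a list of problems found in a job folder (empty list means ready)."""
    folder = Path(folder)
    problems = []
    if not any(folder.glob("*.py")):
        problems.append("no Python scripts found")
    # Assumption: the Condor submit file uses a .sub extension.
    if not any(folder.glob("*.sub")):
        problems.append("no submit file (*.sub) found")
    # Assumption: the log folder is named "log".
    log_dir = folder / "log"
    if not log_dir.is_dir():
        problems.append("log folder is missing")
    elif any(log_dir.iterdir()):
        problems.append("log folder is not empty")
    return problems


if __name__ == "__main__":
    # Replace "." with the path to your own job folder.
    for problem in check_job_folder("."):
        print("Problem:", problem)
```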
Step 2: Split, Submit, and Merge Your Data
Next, run your split script to split up your job. Then open a command prompt and run the "condor_store_cred add" command, entering your Clemson user password when prompted. Run the "condor_submit <your submit file>" command and wait for your job to finish. Calling "condor_q" will return the status of your jobs. Finally, run your merge script to gather the final product in your geodatabase.
Your Personal Machine
Step 1: Installing Condor
To install HTCondor on your machine, navigate to https://research.cs.wisc.edu/htcondor/downloads/.
From there, click on the install instructions link to the right of Current Stable Release. Make sure you read the instructions thoroughly before installing. Back on the downloads page, click the link to the right of Current Stable Release under the column UW Madison that says “HTCondor [version number]”. Download the tarball version corresponding to your operating system. After downloading, install it by opening the file and accepting all default options when it prompts you for information.
Step 2: Download the GalaxyGIS Condor Configuration File
After installation is complete, you will need to replace the configuration file in C:\condor\condor_config with our condor configuration file. To receive access to this file, please email Palak Matta at pmatta[at]clemson[dot]edu or Patrick Claflin at pat[at]clemson[dot]edu.
Step 3: Using a VPN to Submit Condor Jobs
Finally, log onto Clemson's VPN (only if you are not already connected to the Clemson network) and issue the "condor_store_cred add" command through the command prompt. When prompted for a password, enter your local machine password (not your Clemson user password) and then issue the "condor_q" command. If you get something similar to this:
-- Schedd: PSICOM2 : <18.104.22.168:49844?...
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
then congrats, you are ready to use HTCondor!
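If you would rather check the queue from a script than read it by eye, the summary line shown above can be parsed in Python. This is only a sketch: the function names are hypothetical, and it assumes the summary line has the same wording as the sample output on this page, which may differ between HTCondor versions.

```python
import re

# Matches pairs such as "0 jobs" or "95 running" in the condor_q summary line.
SUMMARY_PATTERN = re.compile(
    r"(\d+) (jobs|completed|removed|idle|running|held|suspended)"
)


def parse_condor_q_summary(line):
    """Turn a condor_q summary line into a dict of counts keyed by state."""
    return {name: int(count) for count, name in SUMMARY_PATTERN.findall(line)}


def queue_is_empty(summary):
    """True when the summary reports no jobs left in the queue."""
    return summary.get("jobs", 0) == 0


if __name__ == "__main__":
    line = "0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended"
    summary = parse_condor_q_summary(line)
    print(summary)
    print("Queue empty:", queue_is_empty(summary))
```

An empty queue only tells you that nothing is waiting or running; check your log folder to confirm that the jobs actually finished successfully.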
Step 4: Creating Your Data Folder
Make a folder somewhere on your drive containing:
Your Python scripts
The Condor submit file
An empty log folder
Step 5: Split, Submit, and Merge Your Data
Next, run your split script to split up your job. Run the "condor_submit <your submit file>" command and wait for your job to finish. Calling "condor_q" will return the status of your jobs. Finally, run your merge script to gather the final product in your geodatabase.
Submitting Jobs to GalaxyGIS
The GalaxyGIS Cluster is available to all Clemson students, faculty, and staff. If you are interested in learning more about High Throughput Computing, we offer two workshops you can attend:
Visit us on GitHub!
You can also download tutorials, processing scripts, and sample data for the workflows on this website to test out yourself. These are all available from the CCGT GitHub page.
If you have any questions about connecting to the cluster, submitting jobs, or any other GIS-related topics, please feel free to contact one of the staff members below:
Patrick Claflin email@example.com
Elham Masoomkhah firstname.lastname@example.org
Blake Lytle email@example.com
Patricia Carbajales-Dale firstname.lastname@example.org