Streamline Data Processing Using Kafka

  1. Copy the Producer script into the folder that contains the folder you want to transfer. For example, to transfer the files in flight_1, place the Producer script in the same folder as flight_1 (next to it, not inside it).

  2. Make sure you have Python and kafka-python installed. To install Python, go to https://www.python.org/downloads/ and download Python 3. Open the installer and choose "Customize installation". Then uncheck IDLE and the "for all users" option. Take note of the path that the installer defaults to. If you do not have write permission to the default install path, change it and take note of the new path.

  3. Then use "set PATH=%PATH%;[your_path_here]" in the Windows command prompt to add your newly installed Python to the PATH environment variable. Replace "[your_path_here]" with the path you noted when installing Python. Then run the command again, this time with \Scripts appended to the path, so that pip is on your PATH as well. Run "echo %PATH%" to double check that you made the modification correctly. Note that "set" changes PATH only for the current command prompt session.

  4. To install kafka-python, type "pip install kafka-python" in a terminal. If you run into trouble installing kafka-python, more details are available here: https://pypi.org/project/kafka-python/

  5. Once you have Python and kafka-python installed, you can run the script from the terminal/command line by typing "python Producer.py [folder_you_wish_to_produce] [project_name]". Try to make the project name as unique as you can, because the broker uses it to distinguish your files from other people's files. A good way to come up with a unique project name is to use your Clemson username as a prefix. For example, for flight_1 I can use xiang3_flight_1 as the project name in Kafka so that there is no topic collision.

  6. The Producer script will automatically send the data to the broker to be consumed later.

  7. On your consuming computer, copy the Consumer script to the directory where you want the folder to appear. For example, if you want the flight_1 folder to be on your desktop, copy the Consumer.py script to your desktop.

  8. Make sure you have python and kafka-python installed on the consuming computer as well.

  9. Run the Consumer script by typing "python Consumer.py [project_name]" in the directory of your script. The project name is the same one provided to the producer.

  10. The Consumer script should pull the files from the broker and put them in a folder with the same name as the one on the producing computer. In our example, that folder is flight_1.
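
The Producer.py and Consumer.py scripts themselves are not reproduced on this page, so the following is only a minimal sketch of how a transfer like the one in steps 5 through 10 might be structured with kafka-python. The function names (make_project_name, iter_file_messages, write_message, produce, consume), the one-message-per-file layout, and the use of the relative path as the message key are all assumptions made for illustration, not the actual scripts. KafkaProducer and KafkaConsumer are real kafka-python classes, and the project name is used as the Kafka topic, as step 5 implies.

```python
import os
import re

# Kafka topic names may contain only ASCII letters, digits, '.', '_' and '-',
# and are limited to 249 characters.
_TOPIC_RE = re.compile(r"^[A-Za-z0-9._-]{1,249}$")


def make_project_name(username, folder):
    """Prefix the folder name with a username, e.g. xiang3 + flight_1."""
    name = f"{username}_{os.path.basename(os.path.normpath(folder))}"
    if not _TOPIC_RE.match(name):
        raise ValueError(f"not a valid Kafka topic name: {name!r}")
    return name


def iter_file_messages(folder):
    """Yield (relative_path, file_bytes) for every file under folder.

    Paths are relative to the folder's parent and use '/' separators, so the
    top-level folder name (e.g. flight_1) is part of each path.
    """
    parent = os.path.dirname(os.path.abspath(folder))
    for root, _dirs, files in os.walk(folder):
        for filename in sorted(files):
            path = os.path.abspath(os.path.join(root, filename))
            rel = os.path.relpath(path, parent).replace(os.sep, "/")
            with open(path, "rb") as handle:
                yield rel, handle.read()


def write_message(dest_root, rel_path, data):
    """Recreate one file under dest_root from a (relative_path, bytes) pair."""
    target = os.path.join(dest_root, *rel_path.split("/"))
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "wb") as handle:
        handle.write(data)
    return target


def produce(folder, project_name, broker):
    """Hypothetical producer side: send each file as one message."""
    from kafka import KafkaProducer  # lazy import; requires kafka-python

    producer = KafkaProducer(bootstrap_servers=broker)
    for rel, data in iter_file_messages(folder):
        producer.send(project_name, key=rel.encode("utf-8"), value=data)
    producer.flush()  # block until buffered messages have been sent
    producer.close()


def consume(project_name, broker, dest_root="."):
    """Hypothetical consumer side: rebuild the folder from the topic."""
    from kafka import KafkaConsumer  # lazy import; requires kafka-python

    consumer = KafkaConsumer(
        project_name,
        bootstrap_servers=broker,
        auto_offset_reset="earliest",  # start from the oldest message
        consumer_timeout_ms=10000,     # stop after 10 s with no new messages
    )
    for message in consumer:
        write_message(dest_root, message.key.decode("utf-8"), message.value)
    consumer.close()
```

With this sketch, make_project_name("xiang3", "flight_1") gives "xiang3_flight_1". Sending each file as a single message only works for files below the broker's message size limit (about 1 MB by default in Kafka), so a real script would likely need to split larger files into chunks.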

 

Notes:

* The Producer and Consumer scripts each produce a text file named meta.txt. It is a by-product, and you may remove it once you have finished producing or consuming.

* To change the IP of the broker, open the script and find the broker IP. The default broker in the script is not always up. To get the CCGT broker up, use Remote Desktop Connection to connect to the machine and log in with proper credentials. The script for starting the broker is at D:\kafka_project\kafka\kafka_2.11-1.0.0\start_zkandkafka.bat. Before starting the broker, make sure that there are no files in the kafka-logs folder.


 

* Both scripts should work on either Linux or Windows. If you are running on Palmetto, you need to set up a conda environment. To do so, run the following commands:

module add anaconda3/5.1.0

conda create -n kafka_env python=3.6

conda activate kafka_env

pip install kafka-python

* After setting up the environment, you can reactivate the conda environment in later sessions using "conda activate kafka_env".
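
Because the default broker is not always up, it can help to run a quick check before starting either script. The following is a small sketch, not part of the project's scripts: the function names module_installed and broker_reachable are hypothetical, and port 9092 is assumed only because it is Kafka's default listener port, so the real broker may use a different one. A successful TCP connection shows that something is listening at that address, not that Kafka itself is healthy.

```python
import importlib.util
import socket


def module_installed(name):
    """Return True if the top-level module can be found (without importing it)."""
    return importlib.util.find_spec(name) is not None


def broker_reachable(host, port=9092, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    import sys

    # Pass the broker IP from the script as the first argument.
    host = sys.argv[1] if len(sys.argv) > 1 else "localhost"
    # kafka-python installs a module named "kafka".
    print("kafka-python installed:", module_installed("kafka"))
    print(f"broker {host}:9092 reachable:", broker_reachable(host))
```

If the second line reports False, the broker is probably down or the IP in the script is out of date, and the note above on starting the broker applies.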

Project Members

If you have any questions about how this project was conducted or about the results, please feel free to contact one of the project members below:

© 2019 by Clemson Center for Geospatial Technologies