Project author: ddaedalus

Project description:
MapReduce, PySpark and HDFS
Language: Python
Project URL: git://github.com/ddaedalus/advanced-databases-ntua.git
Created: 2020-03-12T19:17:51Z
Project community: https://github.com/ddaedalus/advanced-databases-ntua

License:


advanced-databases-ntua: MapReduce, PySpark, HDFS, K-Means

An ML system developed with PySpark and MapReduce. It reads from HDFS a dataset of trip records covering all trips completed in NYC yellow taxis from January 2015 to June 2015, and uses K-Means clustering to find the coordinates of the top five client pickup locations.
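To make the idea concrete, here is a minimal sketch of K-Means expressed as a map step and a reduce step, in plain Python without Spark. It is not the code in map_reduce_kmeans.py. The function names, the choice of the first k points as initial centroids, the fixed iteration count, and the use of squared Euclidean distance on raw (x, y) coordinates are all assumptions made for illustration.

```python
from collections import defaultdict


def closest_centroid(point, centroids):
    """Return the index of the centroid nearest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: (point[0] - centroids[i][0]) ** 2
        + (point[1] - centroids[i][1]) ** 2,
    )


def map_step(points, centroids):
    """Map: emit (cluster_id, (x, y, 1)) for every point."""
    for x, y in points:
        yield closest_centroid((x, y), centroids), (x, y, 1)


def reduce_step(pairs):
    """Reduce: sum the coordinates and the counts per cluster_id."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for cluster_id, (x, y, n) in pairs:
        acc = sums[cluster_id]
        acc[0] += x
        acc[1] += y
        acc[2] += n
    return sums


def kmeans(points, k=5, iterations=3):
    """Run a fixed number of map/reduce rounds and return the k centroids."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        sums = reduce_step(map_step(points, centroids))
        # A cluster that received no points keeps its previous centroid.
        centroids = [
            (sums[i][0] / sums[i][2], sums[i][1] / sums[i][2])
            if i in sums
            else centroids[i]
            for i in range(k)
        ]
    return centroids


if __name__ == "__main__":
    toy_points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
    print(kmeans(toy_points, k=2, iterations=3))
```

In a Spark job the same two steps would typically be carried out by RDD transformations over the pickup coordinates, with the reduce step aggregating partial sums across partitions.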

How to run

  1. Upload the data to the Hadoop Distributed File System (HDFS)
    1. hadoop fs -put ./yellow_tripdata_1m.csv hdfs://master:9000/yellow_tripdata_1m.csv
  2. Install PySpark
    1. pip install pyspark
  3. Submit the job to Spark
    1. spark-submit map_reduce_kmeans.py
  4. Retrieve the results
    1. hadoop fs -getmerge hdfs://master:9000/map_reduce_kmeans.res ./map_reduce_kmeans.res
    2. cat map_reduce_kmeans.res
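
The steps above can also be chained from a small driver script. The sketch below is a hypothetical helper, not part of the repository: `build_commands` and `run_all` are illustrative names, and it assumes `hadoop`, `pip` and `spark-submit` are on the PATH and that the HDFS namenode is reachable at hdfs://master:9000, as in the commands above. By default it only prints the commands (dry run).

```python
import subprocess


def build_commands(
    local_csv="./yellow_tripdata_1m.csv",
    hdfs_root="hdfs://master:9000",
    script="map_reduce_kmeans.py",
    result="map_reduce_kmeans.res",
):
    """Return the README steps as a list of argument lists, in order."""
    csv_name = local_csv.rsplit("/", 1)[-1]
    return [
        ["hadoop", "fs", "-put", local_csv, f"{hdfs_root}/{csv_name}"],
        ["pip", "install", "pyspark"],
        ["spark-submit", script],
        ["hadoop", "fs", "-getmerge", f"{hdfs_root}/{result}", f"./{result}"],
        ["cat", result],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            # check=True stops the chain at the first failing step.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_all(build_commands(), dry_run=True)
```

Calling `run_all(build_commands(), dry_run=False)` would execute the steps for real, stopping at the first command that exits with a non-zero status.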