MapReduce, PySpark and HDFS
An ML system built with PySpark and MapReduce. It reads from HDFS a dataset of trip records for all yellow-taxi trips completed in NYC from January 2015 to June 2015, and uses K-Means clustering to find the coordinates of the top five client pickup locations.
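The clustering step can be pictured as a repeated map and reduce over the pickup coordinates. The following is a minimal pure-Python sketch of that idea, not the contents of map_reduce_kmeans.py: the function names, the initialization from the first k points, the fixed iteration count, and the toy coordinates are all assumptions made for illustration.

```python
from collections import defaultdict
from math import hypot


def closest_centroid(point, centroids):
    """Return the index of the centroid nearest to point (Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: hypot(point[0] - centroids[i][0], point[1] - centroids[i][1]),
    )


def kmeans_step(points, centroids):
    """Run one K-Means iteration as a map phase followed by a reduce phase."""
    # Map: emit (centroid_index, (x, y, 1)) for every point.
    mapped = [(closest_centroid(p, centroids), (p[0], p[1], 1)) for p in points]

    # Reduce: sum coordinates and counts per centroid index.
    sums = defaultdict(lambda: (0.0, 0.0, 0))
    for key, (x, y, n) in mapped:
        sx, sy, sn = sums[key]
        sums[key] = (sx + x, sy + y, sn + n)

    # New centroid = mean of its assigned points; keep the old one if it got none.
    return [
        (sums[i][0] / sums[i][2], sums[i][1] / sums[i][2]) if i in sums else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans(points, k, iterations=3):
    """Assumption: start from the first k points and run a fixed number of iterations."""
    centroids = list(points[:k])
    for _ in range(iterations):
        centroids = kmeans_step(points, centroids)
    return centroids


if __name__ == "__main__":
    # Toy (x, y) points, not taken from the taxi dataset.
    toy_points = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans(toy_points, k=2))
```

In a PySpark version, the map phase would typically correspond to `rdd.map` and the per-centroid summation to `rdd.reduceByKey`, with k set to 5 for the five pickup locations; whether the actual script is structured this way is not stated here.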
hadoop fs -put ./yellow_tripdata_1m.csv hdfs://master:9000/yellow_tripdata_1m.csv
pip install pyspark
spark-submit map_reduce_kmeans.py
hadoop fs -getmerge hdfs://master:9000/map_reduce_kmeans.res ./map_reduce_kmeans.res
cat map_reduce_kmeans.res