Project author: grycap

Project description: MApReduce on AWS Lambda
Primary language: Shell
Repository: git://github.com/grycap/marla.git
Created: 2017-05-17T10:07:50Z
Project community: https://github.com/grycap/marla

License: Apache License 2.0


MARLA - MApReduce on AWS Lambda

MARLA is a tool to create and configure a serverless MapReduce processor on AWS by means of a set of Lambda functions created on AWS Lambda. Files are uploaded to Amazon S3 and this triggers the execution of the functions using the user-supplied Mapper and Reduce functions.

Architecture

[Figure: MARLA architecture diagram]

Installation

MARLA requires:

  • An AWS account
  • AWS CLI (version 1.11.76+), used to create the Lambda functions and S3 buckets
  • An IAM Role on AWS with permissions to create, delete and list keys on the used S3 buckets and permissions to invoke Lambda functions. See an example of such an IAM role in the examples/iam-role.json file.

The code of the Lambda functions and user-defined Mapper and Reduce functions is written in Python.

MARLA can be retrieved by issuing this command:

git clone https://github.com/grycap/marla

Usage

First you need to create your own Mapper and Reduce functions in the same file (as shown in the example/example_functions.py file).

These functions must satisfy some constraints, explained below.

Mapper Function

The mapper function must adhere to the following signature:

def mapper(chunk):

where chunk is the raw text from the input file to be mapped.

The mapper function returns the name-value pairs it extracts from the chunk, as a list of two-element tuples (Pairs[i][0] is the name of element i, and Pairs[i][1] is the value of element i).
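As an illustration of this contract, here is a minimal word-count mapper. It is only a sketch: the word-splitting rule (lowercase runs of the letters a to z) is an assumption chosen for the example, not something MARLA prescribes.

```python
import re

def mapper(chunk):
    """Emit one (word, 1) pair for every word found in the raw text chunk."""
    pairs = []
    # Assumption for this example: a "word" is a run of ASCII letters.
    for word in re.findall(r"[a-z]+", chunk.lower()):
        pairs.append((word, 1))
    return pairs
```

The returned list keeps the words in the order they appear in the chunk; each entry is a (name, value) tuple as described above.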

Reducer Function

The reducer function must adhere to the following signature:

def reducer(Pairs):

where Pairs is a list of two-element name-value tuples, in the same format as the list returned by the mapper function. Pairs is sorted alphabetically by name.

The reducer function returns a list of name-value pairs (Results[i][0] is the name of element i, and Results[i][1] is the value of element i).
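A matching word-count reducer could look like the sketch below. It relies on the guarantee stated above that Pairs arrives sorted by name, so equal names are adjacent and can be grouped in a single pass. Summing the values is an assumption specific to this word-count example.

```python
from itertools import groupby
from operator import itemgetter

def reducer(Pairs):
    """Sum the values of each name; expects Pairs to be sorted by name."""
    Results = []
    # groupby only merges adjacent items, which is enough for sorted input.
    for name, group in groupby(Pairs, key=itemgetter(0)):
        Results.append((name, sum(value for _, value in group)))
    return Results
```

To try it locally, pass in a sorted list of pairs, for example `reducer(sorted([("to", 1), ("be", 1), ("to", 1)]))`.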

Configuration

In addition to the aforementioned functions, the user must specify some parameters in a configuration file. This configuration file must follow the structure of the provided example, examples/config.in. The order of the keys is not important; their meanings are explained here:

  • ClusterName: An identifier for this “Lambda cluster”.

  • FunctionsDir: The directory containing the file that defines the Mapper and Reduce functions.

  • FunctionsFile: The name of the file with the Mapper and Reduce functions.

  • Region: The AWS region where the AWS Lambda functions will be created.

  • BucketIn: The bucket for input files. It must exist.

  • BucketOut: The bucket for output files. We strongly recommend using different buckets for input and output to avoid unwanted recursions.

  • RoleARN: The ARN of the role under which the Lambda functions will be executed.

  • MapperNodes: The desired number of concurrent mapper functions.

  • MinBlockSize: The minimum size, in KB, of text that every mapper will process.

  • MaxBlockSize: The maximum size, in KB, of text that every mapper will process.

  • KMSKeyARN: The ARN of the KMS key used to encrypt environment variables. (Optional)

  • MapperMemory: The memory of the mapper Lambda functions. The maximum text size to process by every Mapper will be restricted by this amount of memory.

  • ReducerMemory: The memory of the reduce Lambda functions.

  • TimeOut: The maximum time a Lambda function may run before it is terminated.

  • ReducersNumber: Number of reducers to use
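The sketch below shows one way to sanity-check these settings before creating a cluster, and one plausible reading of how MapperNodes, MinBlockSize and MaxBlockSize interact. Both helpers (check_config and block_size_kb) are hypothetical and not part of MARLA. In particular, the rule "split the file evenly across MapperNodes, then clamp to the minimum and maximum block size" is an assumption made for illustration, and the settings are passed as a plain dict rather than parsed from config.in.

```python
def check_config(cfg):
    """Return a list of problems found in a dict of MARLA-style settings."""
    problems = []
    if cfg["BucketIn"] == cfg["BucketOut"]:
        problems.append("BucketIn and BucketOut should be different buckets")
    if cfg["MinBlockSize"] > cfg["MaxBlockSize"]:
        problems.append("MinBlockSize is larger than MaxBlockSize")
    if cfg["MapperNodes"] < 1 or cfg["ReducersNumber"] < 1:
        problems.append("MapperNodes and ReducersNumber must be at least 1")
    return problems

def block_size_kb(file_size_kb, cfg):
    """ASSUMPTION: even split across MapperNodes, clamped to [Min, Max] KB."""
    even_split = -(-file_size_kb // cfg["MapperNodes"])  # ceiling division
    return max(cfg["MinBlockSize"], min(even_split, cfg["MaxBlockSize"]))
```

Under this assumption, a file that is too large for the requested number of mappers would be cut into MaxBlockSize pieces, and a very small file would still be processed in blocks of at least MinBlockSize.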

Creating and Processing the Data

Once the previous steps are complete, and assuming that you have modified the config.in file in the example directory, issue:

$ sh marla_create.sh example/config.in

where config.in is the path to the configuration file.

The script will create and configure the Lambda functions and add permissions to the S3 buckets. If the script finishes successfully, you will find a folder with the cluster name in the input bucket specified in the configuration file, such as this one: BucketIn/ClusterName

Every file you upload to this folder will be processed via MapReduce. The output of the MapReduce process will be stored in the BucketOut S3 bucket in the following path: BucketOut/ClusterName/NameFile/results

where NameFile is the name of the uploaded input file without the extension (for example .txt) and “results” is the file with the MapReduce results.
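The path rule above can be expressed as a small helper that predicts where the results for a given upload will appear. The function name result_key is hypothetical, and the sketch assumes that only the last extension is stripped (so "archive.tar.gz" would become "archive.tar"), which the text above does not specify.

```python
import posixpath

def result_key(cluster_name, uploaded_key):
    """Build the key 'ClusterName/NameFile/results' for an uploaded object."""
    file_name = posixpath.basename(uploaded_key)
    # Assumption: only the final extension (for example ".txt") is removed.
    name_file, _extension = posixpath.splitext(file_name)
    return posixpath.join(cluster_name, name_file, "results")
```

For example, an input uploaded as mycluster/book.txt would map to the key mycluster/book/results inside BucketOut.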

Deleting

To remove a “Lambda cluster”, run the marla_remove.sh script with the name of the cluster:

$ sh marla_remove.sh ClusterName

This will remove all the created Lambda functions, but not the files in S3.

Acknowledgement

Please acknowledge the use of MARLA by citing the following publication:

  1. Giménez-Alventosa, V., Moltó, G., Caballer, M., 2019. A framework and a performance assessment for serverless MapReduce on AWS Lambda. Futur. Gener. Comput. Syst. 97, 259–274. https://doi.org/10.1016/j.future.2019.02.057