Infrastructure for the AWS Analytical Environment
The long-running EMR cluster is currently deployed directly by Terraform. The cluster is restarted every night by the taint-emr Concourse job to apply any outstanding user changes (see Authorisation and RBAC).
The user batch cluster is deployed by the emr-launcher (GitHub) lambda using the configurations in the batch_cluster_config directory. The cluster is launched on demand by Azkaban using the custom DataWorks EMR Jobtype or DataWorks EMR Azkaban plugin. Clusters that sit idle are automatically shut down by scheduled Concourse jobs (<env>-stop-waiting).
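As a rough illustration of what such an idle-shutdown job involves (not the real <env>-stop-waiting implementation; the name prefix and selection criteria here are assumptions), a minimal boto3 sketch might look like this:

```python
"""Illustrative sketch of an idle-cluster shutdown job. The real
<env>-stop-waiting Concourse job may use different selection criteria."""
import boto3

emr = boto3.client("emr")


def stop_waiting_clusters(name_prefix="analytical-env-batch"):
    # Only clusters in the WAITING state (bootstrapped, with no running
    # steps) are candidates for shutdown.
    paginator = emr.get_paginator("list_clusters")
    for page in paginator.paginate(ClusterStates=["WAITING"]):
        for cluster in page["Clusters"]:
            if cluster["Name"].startswith(name_prefix):
                emr.terminate_job_flows(JobFlowIds=[cluster["Id"]])


if __name__ == "__main__":
    stop_waiting_clusters()
```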
As part of the EMR Launcher Lambda, when a batch EMR cluster is deployed, a new security configuration is copied from the previous one and associated with the new cluster. As per DW-6602, these security configurations are copied by the EMR Launcher Lambda for the batch EMR clusters only. The reasoning is described in the ticket, but it means security configurations can accumulate. If the number of EMR security configurations reaches the maximum of 600, we will be unable to launch any more EMR clusters, which can cause outages of the user-facing aws-analytical-env EMR cluster and the batch clusters if old configurations aren't periodically cleaned up.
The following (Concourse Job) is responsible for ensuring security configurations are periodically cleaned up.
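The actual cleanup is performed by that Concourse job. Purely as an illustration of what such a cleanup involves (not the job's real implementation; the "keep only configurations attached to live clusters" rule is an assumption), a boto3 sketch could look like this:

```python
"""Illustrative only: delete EMR security configurations that are not
attached to any non-terminated cluster. The real Concourse job may differ."""
import boto3

emr = boto3.client("emr")


def in_use_security_configurations():
    """Collect the security configuration names referenced by live clusters."""
    names = set()
    active_states = ["STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING"]
    paginator = emr.get_paginator("list_clusters")
    for page in paginator.paginate(ClusterStates=active_states):
        for cluster in page["Clusters"]:
            desc = emr.describe_cluster(ClusterId=cluster["Id"])["Cluster"]
            if "SecurityConfiguration" in desc:
                names.add(desc["SecurityConfiguration"])
    return names


def cleanup_security_configurations():
    keep = in_use_security_configurations()
    paginator = emr.get_paginator("list_security_configurations")
    for page in paginator.paginate():
        for config in page["SecurityConfigurations"]:
            if config["Name"] not in keep:
                emr.delete_security_configuration(Name=config["Name"])
```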
Both clusters output their logs to the CloudWatch log group /app/analytical_batch/step_logs. Logs from user-submitted steps via Azkaban go to the CloudWatch log group /aws/emr/azkaban.
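For reference, recent events can be pulled from either group with boto3; the log group names are the ones above, everything else in this sketch is illustrative:

```python
"""Illustrative: fetch recent events from the batch step log group."""
import time
import boto3

logs = boto3.client("logs")


def recent_step_logs(minutes=60, group="/app/analytical_batch/step_logs"):
    start = int((time.time() - minutes * 60) * 1000)  # epoch milliseconds
    paginator = logs.get_paginator("filter_log_events")
    for page in paginator.paginate(logGroupName=group, startTime=start):
        for event in page["events"]:
            print(event["message"])
```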
Authentication is mostly handled by Cognito. There are 2 different authentication mechanisms:
The custom authentication flow (AWS docs) is used to implement additional security checks on top of the default Cognito ones. It is not needed for federated users. It uses the dataworks-analytical-custom-auth-flow lambdas triggered by Cognito hooks (aws-analytical-env repo).
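The real handlers live in dataworks-analytical-custom-auth-flow; as a rough illustration of the shape of a Cognito custom-auth trigger only, a DefineAuthChallenge handler looks something like the sketch below (the actual checks performed by the lambdas are not shown):

```python
"""Illustrative DefineAuthChallenge handler for a Cognito custom auth flow.
This only shows the trigger contract, not the dataworks-specific checks."""


def handler(event, context):
    session = event["request"]["session"]

    if session and session[-1].get("challengeResult") is True:
        # The previous custom challenge was answered correctly: issue tokens.
        event["response"]["issueTokens"] = True
        event["response"]["failAuthentication"] = False
    elif len(session) >= 3:
        # Too many failed attempts: stop the authentication flow.
        event["response"]["issueTokens"] = False
        event["response"]["failAuthentication"] = True
    else:
        # Ask the client to answer another custom challenge.
        event["response"]["issueTokens"] = False
        event["response"]["failAuthentication"] = False
        event["response"]["challengeName"] = "CUSTOM_CHALLENGE"

    return event
```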
Authentication and authorisation checking happens at multiple points throughout the Analytical Environment:
The RBAC system uses EMR security configurations to assign a unique IAM role for each user for S3 EMRFS requests. At the moment there is no RBAC at the Hive metastore level, so users can see all database and table metadata. RBAC is performed when users try to access data in S3 based on the corresponding IAM role specified in the security configuration.
Security configurations map a local Linux PAM user to an IAM role, so all users must exist as Linux users in order to access data. All users are set up by a custom EMR step which only runs when the EMR cluster is started; the EMR cluster is restarted by the taint-emr job every night to ensure all users exist on the cluster.
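As an illustration of the mechanism only (the user and role names below are made up, and the real configurations are generated per user by the platform), an EMR security configuration with EMRFS role mappings has roughly this shape:

```python
"""Illustrative only: the shape of an EMRFS role-mapping security
configuration. Account ID, role and user names here are made up."""
import json
import boto3

emr = boto3.client("emr")

security_configuration = {
    "AuthorizationConfiguration": {
        "EmrFsConfiguration": {
            "RoleMappings": [
                {
                    # S3 requests made by this Linux user are signed with this role.
                    "Role": "arn:aws:iam::123456789012:role/analytical-user-jane",
                    "IdentifierType": "User",
                    "Identifiers": ["jane"],
                }
            ]
        }
    }
}

emr.create_security_configuration(
    Name="analytical-env-batch-example",
    SecurityConfiguration=json.dumps(security_configuration),
)
```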
The users and permissions are stored in a MySQL database. RBAC policies are assigned at the user-group level, and each user can be assigned to a group to inherit that group's permissions. Permissions cannot currently be attached directly to a user.
The RBAC sync lambda (#TODO: add link) synchronises users from the Cognito User Pool to the MySQL database. The lambda is invoked by Concourse (admin-sync-and-munge/sync-congito-users-<env>) daily at 23:00 UTC.
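Conceptually the sync boils down to something like the following sketch; the pool ID, table and column names are hypothetical, and the real lambda (in the aws-analytical-env repo) may work differently:

```python
"""Illustrative sketch of a Cognito -> MySQL user sync. The table and column
names are hypothetical."""
import boto3
import pymysql


def sync_users(user_pool_id, db_connection):
    cognito = boto3.client("cognito-idp")
    paginator = cognito.get_paginator("list_users")

    with db_connection.cursor() as cursor:
        for page in paginator.paginate(UserPoolId=user_pool_id):
            for user in page["Users"]:
                # Insert any Cognito user not yet present in the database.
                cursor.execute(
                    "INSERT IGNORE INTO User (username) VALUES (%s)",
                    (user["Username"],),
                )
    db_connection.commit()
```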
The RBAC ‘munge’ lambda takes all the access policies for a given user and combines them into the smallest possible number of AWS IAM policies, taking into account the resource limits imposed by AWS. The lambda is invoked by Concourse (admin-sync-and-munge/create-roles-and-munged-policies-<env>) after the sync job succeeds.
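The core idea is packing policy statements under IAM's document size limits; a simplified sketch is shown below. The greedy packing strategy is an assumption for illustration, not the lambda's actual logic; the 6,144-character limit is AWS's documented maximum for managed policy documents.

```python
"""Simplified illustration of 'munging' many policy statements into as few
managed policies as possible. The packing strategy here is made up."""
import json

MANAGED_POLICY_CHAR_LIMIT = 6144  # AWS limit on managed policy document size


def munge_statements(statements):
    """Greedily pack statements into policy documents under the size limit."""
    policies, current = [], []
    for statement in statements:
        candidate = {"Version": "2012-10-17", "Statement": current + [statement]}
        if current and len(json.dumps(candidate)) > MANAGED_POLICY_CHAR_LIMIT:
            # Flush the current document and start a new one.
            policies.append({"Version": "2012-10-17", "Statement": current})
            current = [statement]
        else:
            current.append(statement)
    if current:
        policies.append({"Version": "2012-10-17", "Statement": current})
    return policies
```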
There is a requirement for our data products to start using Hive 3 instead of Hive 2. Hive 3 comes bundled with EMR 6.2.0, along with other upgrades including Spark. Below is a list of steps taken to upgrade analytical-env and batch to EMR 6.2.0:
Make sure you are using an AL2 (Amazon Linux 2) AMI.
Point the analytical-env clusters at the new metastore, hive_metastore_v2 in internal-compute, instead of the old one in configurations.yml. The values below should resolve to the new metastore, the details of which are an output of internal-compute:

```yaml
"javax.jdo.option.ConnectionURL": "jdbc:mysql://${hive_metastore_endpoint}:3306/${hive_metastore_database_name}?createDatabaseIfNotExist=true"
"javax.jdo.option.ConnectionUserName": "${hive_metastore_username}"
"javax.jdo.option.ConnectionPassword": "${hive_metastore_pwd}"
```
Alter the security group deployment to point at the new security group for hive-metastore-v2:

```hcl
hive_metastore_sg_id = data.terraform_remote_state.internal_compute.outputs.hive_metastore_v2.security_group.id
```
Rotate the analytical-env user from the internal-compute pipeline so that when analytical-env or batch starts up it can log in to the metastore. Make sure to fetch the new secret, as the secret name has changed:
data "aws_secretsmanager_secret_version" "hive_metastore_password_secret" {
provider = aws
secret_id = "metadata-store-v2-analytical-env"
}
Bump the version of sparklyr from 2.4 to 3.0-2.12.
Make sure that the first thing to use the metastore initialises it with Hive 3; otherwise the metastore will have to be rebuilt.