Wed, Jan 25, 2023
A couple months ago, we received a request from one of our enterprise financial clients looking to build their internal data lake capabilities. The client wanted to know more about security best practices related to the AWS data lake management tool, AWS Lake Formation, and asked our team for help. One of our principal security consultants specializing in cloud got to work, preparing an overview of critical security considerations when architecting a data lake system. We thought this overview was too good to keep a secret, and there isn’t a lot of public information about data lake security, so here is a contribution from us!
By AWS’s definition, a data lake is a, “a centralized repository that allows you to store all your structured and unstructured data at any scale.” Consider a situation in which an organization needs to retain a large blob of data that has various levels of sensitivity and that needs to be accessed by various individuals. This is where a data lake becomes useful.
Going a bit deeper, data lakes are huge data storage “databases”, created with the primary objective of storage, but they must also be able to be searched and queried. Data lakes enable computation tasks (ML/AI, business analytics, etc.) to be performed on data in storage. The data lake provides access to the data, but the computational tasks are done by other services. Data lakes generally don’t have as robust query capabilities as a relational database.
An example might be a large store of employees' personal data that contains details such as names, dates of birth, and paystub information. Different individuals may need to have access to different types of data. If this data is all in one location, creating a seamless solution can be difficult, as can scaling that solution over time.
Data lakes are a technology that is fundamentally enabled by cloud services although similar capabilities can be built in house/on-premises. The security of the data lake is focused on architecting for integration with other services, since that is what makes the data valuable and provides access vectors creating security issues.
This article is specific to AWS Lake Formation, which is AWS’s data lake management tool, although you can see there are several services that are required to make a data lake work. Similar concepts should be considered for other cloud providers of data lake technology, but the services and implementation will differ.
Lake Formation is an AWS service that provides a management layer over this data. It allows administrators to create databases to manage data within a data lake, to create and manage user permissions and to leverage several services to either import data or have data made accessible to end users.
Several services are used to accomplish these tasks:
AWS S3 is a widely used service that has great public documentation for security considerations. Most automated AWS configuration tools cover the basic test cases. Most security in S3 can be configured directly by enabling security options:
More information about these options can be found at the following link: https://docs.aws.amazon.com/AmazonS3/latest/userguide/security-best-practices.html
When an administrator configures a data lake, they must register an S3 location. Once the location has been set, the data at that location can be used with data catalogs. An example of a data location follows:
s3://bucketname/hr/employees/canada
When a location is configured, that location and everything under it are added to the data lake. This data includes all the files that are present within the S3 file structure. When a new data location is being created, the service console prompts users to review existing location permissions for that location. Conducting this review is strongly recommended because these settings can easily be misconfigured.
Once a location has been registered, an administrator can restrict access by using data location and data access permissions.
Data location permissions are granted to principals (e.g., IAM users or roles) via a grant command such as the following:
aws lakeformation grant-permissions --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:user/datalake_user1 --permissions "DATA_LOCATION_ACCESS" --resource '{ "DataLocation": {"ResourceArn":"arn:aws:s3:::/bucketname/hr/employees/canada"}}'
Once a given user has DATA_LOCATION_ACCESS permission to a given location, they can create a metadata database or a table at that location (assuming they have the proper Lake Formation permissions to do so). The security impact of this access is that if a given principal has DATA_LOCATION_ACCESS for a given location, they have a path to access all the data in that location (and its child directories).
Principals can also be granted data access permissions (SELECT, INSERT, DELETE). These permissions are covered in the Lake Formation section.
When Lake Formation is being configured for an S3 bucket or when a review that covers this configuration is being performed, it is important to have a clear folder structure and plan of data segregation. The following steps can be followed during a review:
Lake Formation has a very complex permission system. For any specific task that a customer performs, the customer needs to have both IAM and Lake Formation permissions. For example, when reading data from an S3 location, the customer needs to have the underlying S3 permissions as well as the SELECT data access permissions. They obtain the IAM permissions indirectly via the role that is attached to the data location in S3, and their principal needs to have read permission for that data.
Data lake administrators are a list of users within a data lake. These users have very privileged access, including full read access to all resources, full data location permissions, and the ability to modify permissions across the entire catalog.
The following two permissions are required to create a data lake administrator:
These permissions are given by the administrative access policy. As such, a user with this policy can be assumed to have implicit access to the data in the data lake. Unlike for other services in AWS, the data that is contained in the data lake may be particularly sensitive, so it is important to identify all possible users that have this permission.
In a normal configuration, it is common for an IT administrator to have full, unrestricted admin access to the AWS account. However, there may be regulatory or compliance reasons that this access may need to be limited in an account that has a sensitive data lake.
Database creators are users that have had the permission to create a database (https://docs.aws.amazon.com/lake-formation/latest/dg/lf-permissions-reference.html#perm-create-database).
If a user has database creator permission and data location access for a given location, then they have implicit access to all the data at that location.
In general, the number of users that are allowed to create databases should be limited.
Once a location has been added and a database created for that location, it is possible to define data permissions for databases and tables.
The following screenshot shows the permission types that can be defined for a given database. They include granting principals the ability to create a new table and to alter, drop, or describe tables. There is also an option to allow this principal to grant permissions to other users.
The following screenshot shows, for a given table, the possible permissions a principal can have. A user with select permissions can read data from the table, and a user with describe permissions can describe a table, for example.
Generally, the principals defined here are services that integrate with Lake Formation. For example, if an administrator wants to configure Amazon Redshift to be able to read only from a Lake Formation table, they would create a role, grant it permissions to SELECT for a given table and then launch the Redshift cluster with that role.
It is also possible to set data filtering and column-specific restrictions, which can be used to further limit which data a user is able to access.
These services allow end users to access data from Lake Formation by using the scoped-down and limited permissions that are granted to their principals. These services mostly work as expected, and a user should be able to access only resources that their principal is restricted to if the proper role is used to call the integrated services.
Some caveats should be considered:
By employing these techniques, organizations can harden both their AWS Lake Formation security, as well as the supporting data lake environment. For most organizations, Lake Formation is one part of a larger cloud environment and strategy, and it is important to take a wholistic view of cloud security controls across your environment. To get additional help with your AWS Lake Formation or cloud security strategy, visit: Cyber Security Services | Cyber Risk | Kroll
Incident response, digital forensics, breach notification, managed detection services, penetration testing, cyber assessments and advisory.
Kroll’s multi-layered approach to cloud security consulting services merges our industry-leading team of AWS and Azure-certified architects, cloud security experts and unrivalled incident expertise.
Proactively identify your highest-risk exposures and address key gaps in your security posture. As the No. 1 Incident Response provider, Kroll leverages frontline intelligence from 3000+ IR cases a year with adversary intel from deep and dark web sources to discover unknown exposures and validate defenses.
Kroll delivers more than a typical incident response retainer—secure a true cyber risk retainer with elite digital forensics and incident response capabilities and maximum flexibility for proactive and notification services.
Manage cyber risk and information security governance issues with Kroll’s defensible cyber security strategy framework.