Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Security] CredentialProviderAzure should not implicitly retrieve and use shared storage account keys #20634

Open
sugibuchi opened this issue Jan 9, 2025 · 7 comments
Assignees
Labels
enhancement New feature or an improvement of an existing feature

Comments

@sugibuchi
Copy link

sugibuchi commented Jan 9, 2025

Description

CredentialProviderAzure, introduced by #20384, implicitly tries to retrieve and use a shared storage account key and then falls back to the Entra ID-based authentication using DefaultAzureCredential.

However, I have strong concerns about this behaviour as it is not compliant with Microsoft's security recommendation and has other drawbacks related to security, performance and the practices followed by other libraries.

Microsoft recommends Entra ID-based authentication over shared key-based auth

For optimal security, Microsoft recommends using Microsoft Entra ID with managed identities to authorize requests against blob, queue, and table data, whenever possible.
https://learn.microsoft.com/en-us/azure/storage/common/authorize-data-access

See also https://learn.microsoft.com/en-us/azure/storage/blobs/security-recommendations#identity-and-access-management

However, the current behaviour of CredentialProviderAzure is the opposite (shared key first, then the Entra ID-based auth).

Accessing Azure Blob with a shared key bypasses all fine-grained data access control

A shared key for an Azure storage account is a master key granting permission to read/write all files. Accessing with a shared key does not respect data access control configured by using Azure RBAC, ABAC and ACL.

Implicit use of shared keys may lead to unexpected exposure or modification of data.

Calling Azure CLI to retrieve shared keys imposes unnecessary latency

The current CredentialProviderAzure always executes Azure CLI to retrieve shared keys regardless of whether shared keys are available or not. However, the execution of Azure CLI is slow (particularly in Windows environments) compared to fetching access tokens from Entra ID. The current behaviour imposes unnecessary latency, particularly when we use an authentication flow that directly fetches access tokens from Entra ID (client credential flow, etc.).

None of the other existing libraries implements such behaviour

The other existing libraries accessing Azure Blob, including hadoop-azure for Hadoop HDFS, Apache Arrow, Python fsspec, DuckDB, etc., do not implicitly retrieve and use shared keys. When using the shared key-based authentication with those libraries, we must explicitly specify a shared key.

The current behaviour, inconsistent with most existing libraries, potentially leads to confusion and unexpected incidents.

Proposal

Drop the implicit use of shared storage account keys from CredentialProviderAzure. Instead, shared keys should be explicitly specified in storage_options.

  • Delete
    if self.account_name is not None:
    try:
    creds = {
    "account_key": self._get_azure_storage_account_key_az_cli(
    self.account_name
    )
    }
    if self._verbose:
    print(
    "[CredentialProviderAzure]: Retrieved account key from Azure CLI",
    file=sys.stderr,
    )
    except Exception as e:
    if self._verbose:
    print(
    f"[CredentialProviderAzure]: Could not retrieve account key from Azure CLI: {e}",
    file=sys.stderr,
    )
    else:
    return creds, None # type: ignore[return-value]
  • Delete self._get_azure_storage_account_key_az_cli() as (1) the Entra ID-based authentication does not require account names (2) a shared key should be specified in conjunction with an account name, like most other libraries.
  • Delete self._get_azure_storage_account_key_az_cli(), self._azcli_version(), self._azcli() etc. as those util functions for Azure CLI are used only to retrieve shared keys.
@sugibuchi sugibuchi added the enhancement New feature or an improvement of an existing feature label Jan 9, 2025
@alexander-beedie
Copy link
Collaborator

@sugibuchi: Thanks for the detailed breakdown.
@nameexhaustion: One for you to have a think about? :)

@nameexhaustion
Copy link
Collaborator

I don't know about removing the functionality as can improve the user experience by making the authentication seamless, but I can make it opt-in to reduce security concerns.

@ritchie46
Copy link
Member

I am a bit surprised of a security concern if you are able to access those keys. Shouldn't that key not be available then in the first place?

Is it possible for us to swap the order @nameexhaustion , first Entra-id then (maybe opted in) shared key.

@nameexhaustion
Copy link
Collaborator

Is it possible for us to swap the order @nameexhaustion , first Entra-id then (maybe opted in) shared key.

I have made this change in the PR - https://github.com/pola-rs/polars/pull/20652/files#r1910147426

@sugibuchi
Copy link
Author

sugibuchi commented Jan 10, 2025

Removing the implicit use of shared keys has literally no impact on the UX.

The current CredentialProviderAzure uses az storage account keys list to retrieve shared keys. This approach assumes that the Azure CLI has already logged in to Azure using az login, which authenticates users/service principals with Entra ID.

"az login" (Entra ID-based auth) -> "az storage account keys list" -> Shared key -> Storage account (shared key-based auth)

Even though the Entra ID-based authentication is available, why do we need to use the shared key-based authentication? Why don't we directly use Entra ID-baed auth to access storage accounts?

"az login" (Entra ID-based auth) -> DefaultAzureCredential -> access token -> Storage account (Entra ID-based auth)

In other words, once you run az login, you have a key to open a door. However, the current CredentialProviderAzure uses the same key to open a vault to get a master key in there and uses the master key to open a door. It is simply redundant and not a good practice.

Shared key-based authentication should only be used as a last resort when Entra ID-based authentication is unavailable. We don't need to maintain the current logic that retrieves shared keys by using Entra ID-based auth, even as an opt-in option or a fallback logic.

Instead, we need an option to specify shared keys explicitly.

df = pl.read_parquet(source_abfs_url, srorage_options={"account_name": account_name, "account_key": account_key})

@nameexhaustion
Copy link
Collaborator

nameexhaustion commented Jan 10, 2025

Instead, we need an option to specify shared keys explicitly.

We have this functionality available as an option as well (passing account_name, account_key to storage_options works).

We don't need to maintain the current logic that retrieves shared keys by using Entra ID-based auth, even as an opt-in option or a fallback logic.

It's definitely not a necessity, but for users that work in environments where Entra ID has not been configured, having this available can save the user from having to manually figure out where/how to get the keys from, so I definitely think it's worth it.

Additionally, the most important issue regarding security concerns will be resolved after we update it to be opt-in.

@sugibuchi
Copy link
Author

sugibuchi commented Jan 10, 2025

The version in #20652 has a "permission check" that tries to access the metadata of a file system root with Entra ID-based authorization. The version in #20652 suggests activating a fallback mechanism to automatically use the shared key-based auth if it can't access the metadata.

However, this complex logic looks over-engineering and potentially leads to unexpected results.

For example, we can configure ABAC to allow access to a specific file and refuse to access the file system root. The version in #20652 cannot access blob containers having such ABAC since the "permission check" fails. More importantly, the suggested fallback to shared keys is inappropriate in this case (as the ABAC is configured intentionally not to allow access to the file system root).

In general, users should consider RBAC/ABAC/ACL settings first if data access with Entra ID is refused. I don't believe automatic fallback to shared key-based authentication is a good suggestion.

There are various implementations of ABFS file system clients and applications accessing Azure storage accounts. However, use of shared key-based auth and Entra ID-based auth are usually exclusive and explicit. Such simplicity makes the behaviour of blob client authentication more predictable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature
Projects
None yet
Development

No branches or pull requests

4 participants