Conversation

@alexrenz alexrenz commented Nov 1, 2024

Add support for running on Azure inside Snowflake.

Comment on lines +176 to +185
let mut locked = self.cached.lock().await;

match locked.as_ref() {
    Some(creds) => {
        if matches!(creds.as_ref(), AzureCredential::SASToken(pairs) if *pairs == new_pairs) {
            return Ok(Arc::clone(creds));
        }
    }
    _ => {}
}

@alexrenz alexrenz Nov 5, 2024

@andrebsguedes: I adapted this from the S3 code plus your variant, but I don't follow why we have this caching here.

Speaking about the S3 variant: the request to Snowflake in current_upload_info (which happens before the cache check) should be more expensive than creating an object_store::aws::AwsCredential (which happens after), right? I also don't have a good understanding of when object_store calls get_credential; maybe that is my problem.

@andrebsguedes:

Yeah, you are right that this caching here is very confusing, and my Azure version of this was pretty dumb, to be honest. Let me try to explain the rationale:

  • object_store calls this for every operation, even multiple times in some cases, so we want it to be fast.
  • current_upload_info is a cached call to fetch_upload_info, which means that most of the time calling it should just be a few atomic operations (to read the upload_info cache and perform the Arc::clone) and is thus fast.
  • So we call current_upload_info, but instead of cloning all the strings to build an AwsCredential every time, we cache the credential so that we can Arc::clone it most of the time.

Some critical analysis after looking at my S3 code again:

  • In reality, even if everything goes well, the extra contention from the lock may be far worse than the allocations for constructing AwsCredential.
  • To do better, we would have to employ cleverer tricks with arc_swap and use the address of the current_upload_info result to detect change instead of checking the contents.

TL;DR: I think this caching is not worth it in either S3 or Azure, and we should drop it for now.

@alexrenz:

Agreed, let's drop it, but in another PR. So I will keep this as is here.

@alexrenz alexrenz marked this pull request as ready for review November 5, 2024 09:46
@alexrenz alexrenz changed the title Snowflake Azure support Support Azure in Snowflake Nov 5, 2024
@andrebsguedes andrebsguedes left a comment

Looks awesome! Even more so when we take into account that this is your first PR in the project.

Some minor comments:


# object_store = { version = "0.10.1", features = ["azure", "aws"] }
# Pinned to a specific commit while waiting for upstream
object_store = { git = "https://github.com/andrebsguedes/arrow-rs.git", tag = "v0.10.2-beta1", features = ["azure", "aws", "experimental-azure-list-offset", "experimental-arbitrary-list-prefix"] }
hickory-resolver = "0.24"
@alexrenz:

@andrebsguedes Is there a way to say something like "use whatever version reqwest is using"?

@andrebsguedes:

Unfortunately not. We can add a test later that at least breaks if the versions diverge, but for this I don't think it is even needed, as reqwest does not care about the version of a custom resolver.

@andrebsguedes andrebsguedes left a comment

Looks great! I still have to go through the raicode tests, but it is very unlikely that the review there will cause any changes at this level, so I will approve this one already.


@alexrenz alexrenz merged commit 92c8ef5 into main Nov 22, 2024
4 checks passed
alexrenz added a commit to RelationalAI/RustyObjectStore.jl that referenced this pull request Nov 26, 2024
Add a few tests for SPCS Azure, as introduced by [object_store_ffi#28](RelationalAI/object_store_ffi#28)

---------

Co-authored-by: André Guedes <andre.guedes@relational.ai>