@sbernauer
Member

Description

Please add a description here. This will become the commit message of the merge request later.

Definition of Done Checklist

  • Not all of these items are applicable to all PRs, the author should update this template to only leave the boxes in that are relevant
  • Please make sure all these things are done and tick the boxes
# Author
- [ ] Changes are OpenShift compatible
- [ ] CRD changes approved
- [ ] CRD documentation for all fields, following the [style guide](https://docs.stackable.tech/home/nightly/contributor/docs/style-guide).
- [ ] Helm chart can be installed and deployed operator works
- [ ] Integration tests passed (for non trivial changes)
- [ ] Changes need to be "offline" compatible
# Reviewer
- [ ] Code contains useful comments
- [ ] Code contains useful logging statements
- [ ] (Integration-)Test cases added
- [ ] Documentation added or updated. Follows the [style guide](https://docs.stackable.tech/home/nightly/contributor/docs/style-guide).
- [ ] Changelog updated
- [ ] Cargo.toml only contains references to git tags (not specific commits or branches)
# Acceptance
- [ ] Feature Tracker has been updated
- [ ] Proper release label has been added
- [ ] [Roadmap](https://github.com/orgs/stackabletech/projects/25/views/1) has been updated

@sbernauer sbernauer changed the title from "fix: For cluster internal scopes, also add variant without trailing dot" to "fix: For cluster internal scopes also add variant without trailing dot" on Jan 21, 2025
Contributor

@nightkr nightkr left a comment

We should fix the places that depend on the old value here, not just blindly add both.

Comment on lines +188 to +191
let mut cluster_domains = vec![cluster_domain.to_string()];
if let Some(cluster_domain_without_trailing_dot) = cluster_domain.strip_suffix('.') {
    cluster_domains.push(cluster_domain_without_trailing_dot.to_owned());
}
Contributor

In my testing, Kerberos wants consistency and TLS doesn't really care. Either should be helped by doing both.

Member Author

Tried to respond in #547 (comment)

Comment on lines 175 to 179
[domain_realm]
cluster.local = {realm_name}
cluster.local. = {realm_name}
.cluster.local = {realm_name}
.cluster.local. = {realm_name}
Contributor

IME, this shouldn't be necessary at all (probably since we set the default realm before). But if we do keep it then we should read the actual cluster domain, not hard-code cluster.local specifically.
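
A minimal sketch of what reading the cluster domain could look like, assuming a hypothetical helper `render_domain_realm` (the name, signature, and output layout are illustrative and not taken from secret-operator). It emits the same four `[domain_realm]` variants as the snippet above, but derives them from whatever cluster domain is configured:

```rust
/// Hypothetical helper (not part of secret-operator): renders a krb5.conf
/// `[domain_realm]` section for an arbitrary cluster domain instead of
/// hard-coding `cluster.local`.
fn render_domain_realm(cluster_domain: &str, realm_name: &str) -> String {
    // Normalise to the variant without the trailing dot, so that the input
    // may be given either as `cluster.local` or as `cluster.local.`.
    let bare = cluster_domain.strip_suffix('.').unwrap_or(cluster_domain);

    // Same four variants as in the reviewed snippet: with and without the
    // leading dot, each with and without the trailing dot.
    let variants = [
        bare.to_string(),
        format!("{bare}."),
        format!(".{bare}"),
        format!(".{bare}."),
    ];

    let mut section = String::from("[domain_realm]\n");
    for domain in variants {
        section.push_str(&format!("{domain} = {realm_name}\n"));
    }
    section
}
```

Calling `render_domain_realm("example.internal.", "CLUSTER.LOCAL")` yields the same section as calling it with `"example.internal"`, so the result does not depend on how the cluster domain was entered.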

Member Author

Good catch. This PR was mostly about trying out whether we can fix the TLS cert problems we have.
Fixed it in 5781e66

@sbernauer
Member Author

General comment on why we are thinking of adding both variants (with and without the trailing dot) to scopes, both for TLS and Kerberos:

  1. We don't know what users will be entering. Performance-sensitive users might choose a trailing dot; users with problems might not be able to add one.
  2. The ZooKeeper test ./scripts/run-tests --test smoke_zookeeper-3.9.2_use-server-tls-true_use-client-auth-tls-true_openshift-false failed with [ERROR] Could not establish secure connection using client certificates!
    Everything was configured correctly with trailing dots. We could only make the zk client happy by adding a SAN without the dot.
  3. Let's imagine a 24.11 user first updates secret-op. Until they are able to update all other product operators, everything can potentially break, because everything talks cluster.local, but secret-op only hands out cluster.local. (with the trailing dot).
  4. Is there a downside to adding both? At first glance this PR seems like a case of "better safe than sorry".

That being said, this is a WIP. I would leave it entirely up to @dervoeti and @maltesander to decide how to proceed, as they looked at the issue in the first place. I just happened to bump op-rs and ran into a failing test.

@nightkr
Contributor

nightkr commented Jan 21, 2025

> We don't know what users will be entering. Performance-sensitive users might choose a trailing dot; users with problems might not be able to add one.

Wouldn't secret-operator know just as well as any other operator? Are you planning on supporting mixed environments?

> We could only make the zk client happy by adding a SAN without the dot.

IME, curl was also happy without the dot, so I guess this is one indication that TLS SANs should never have it.

> Let's imagine a 24.11 user first updates secret-op. Until they are able to update all other product operators, everything can potentially break, because everything talks cluster.local, but secret-op only hands out cluster.local. (with the trailing dot).

Migration is a fair concern, that's true. We should explicitly document those migration paths in the comments.

> Is there a downside to adding both? At first glance this PR seems like a case of "better safe than sorry".

We should know what credentials we're asking for, and why. Whether they need to be included when provisioning manually, and so on.

It's fine to add things to that list, it just shouldn't be something we do blindly.

@dervoeti
Member

I agree that, in general, we should prefer not adding the hostname without the dot if it's not really necessary / we can work around it.

I'm not sure what exactly the scenario was (@sbernauer and/or @maltesander did the research on this) but I think one reason was that Zookeeper does a reverse DNS lookup on the client IP and complains that the client cert is not valid for the returned hostname (without the trailing dot).

That would be a reason to add the alternative hostname to the SANs. Other ways to solve this are trying to fix this in Zookeeper, or maybe explicitly not supporting Zookeeper mTLS if you use a cluster domain with a trailing dot. I'm fine with either solution; adding the alternative hostname to the SAN was probably just the easiest way to make it work.
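
To illustrate the suspected mismatch, here is a rough sketch under the assumption that the hostname check behaves like a verbatim comparison against the SAN list. That is a guess about Zookeeper's behaviour, not something confirmed in this thread, and all function names below are hypothetical:

```rust
/// Returns the name without a single trailing dot, if it has one.
fn without_trailing_dot(name: &str) -> &str {
    name.strip_suffix('.').unwrap_or(name)
}

/// Strict check (assumed behaviour): the hostname must appear in the SAN
/// list verbatim, apart from ASCII case.
fn san_matches_exact(sans: &[&str], hostname: &str) -> bool {
    sans.iter().any(|san| san.eq_ignore_ascii_case(hostname))
}

/// Lenient check: a single trailing dot is ignored on both sides.
fn san_matches_lenient(sans: &[&str], hostname: &str) -> bool {
    let wanted = without_trailing_dot(hostname);
    sans.iter()
        .any(|san| without_trailing_dot(san).eq_ignore_ascii_case(wanted))
}
```

With a cert whose only SAN is `zk-0.zk.default.svc.cluster.local.` (an example name) and a reverse lookup that returns the name without the trailing dot, the strict check fails while the lenient one succeeds. Adding the dot-less name to the SANs makes the strict check pass as well.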

@dervoeti
Member

dervoeti commented Feb 6, 2025

So, we have to make a decision. I can't really comment on the Kerberos-related changes, but as far as I understand it, they are not strictly necessary but would make migration easier. In that case I would be fine with not merging these changes if they are controversial. An easy migration path is nice to have, but I think it's okay if we don't have it.
Regarding the TLS-related change, I can't see a practical negative effect on security if we add the non-FQDN DNS name to the SANs when a cert for an FQDN DNS name is requested. It would ease the migration path (at least in one direction) and, more importantly, it would fix the Zookeeper mTLS issue.
So I'd be in favor of merging at least the TLS related change.

But I'm also fine with not merging this at all and explicitly listing Zookeeper mTLS as "known not to work with FQDN cluster domains yet". In that case we probably still support many setups with FQDN cluster domains with 25.3, so it's better than before.

Opinions @nightkr @sbernauer @maltesander ?

@nightkr
Contributor

nightkr commented Feb 6, 2025

For TLS only the non-FQDN variant seems to matter at all, at least in my testing. We should only keep the non-FQDN variant there.

For Kerberos I'm not sure. I think the argument for having both makes sense, at least during the transitional period (though we should probably make sure we have both variants in both cases). Maybe an exception here would be if we can centralize this logic in listener-op, and have that be what decides the Flag Day™️.

@dervoeti
Member

dervoeti commented Feb 6, 2025

> For TLS only the non-FQDN variant seems to matter at all, at least in my testing. We should only keep the non-FQDN variant there.

I'll also do some tests with this later. If it works, I'm fine with that solution as well.

@maltesander
Member

> For TLS only the non-FQDN variant seems to matter at all, at least in my testing. We should only keep the non-FQDN variant there.

Yeah, we definitely need the non-FQDN in there. That fixed most of the problems I had. IIRC zookeeper required the FQDN in the certificate.

I would punt on Kerberos as well. Main thing is to fix the certs?

@dervoeti
Member

dervoeti commented Feb 7, 2025

I did some tests with Zookeeper yesterday, including mTLS tests with and without FQDN cluster domains. Adding just the non-FQDN hostname to the SANs worked fine. Will do some more testing today with other products.
I only changed one line in secret-op: b5f3d51
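
For readers following along, a minimal sketch of the idea (not the actual change in that commit): strip the trailing dot from the cluster domain before building the DNS name that goes into the SAN. The helper name `san_dns_name` and its parameters are hypothetical:

```rust
/// Hypothetical helper (not the code from the commit above): builds the DNS
/// SAN for a host inside the cluster, always without the trailing dot.
fn san_dns_name(host: &str, cluster_domain: &str) -> String {
    // `cluster.local.` and `cluster.local` both end up as `cluster.local`.
    let domain = cluster_domain.strip_suffix('.').unwrap_or(cluster_domain);
    format!("{host}.{domain}")
}
```

With this, `san_dns_name("zk-0.zk.default.svc", "cluster.local.")` and `san_dns_name("zk-0.zk.default.svc", "cluster.local")` both produce `zk-0.zk.default.svc.cluster.local`, so the cert looks the same whether or not the cluster domain is configured as an FQDN.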

@dervoeti
Member

@maltesander @nightkr @sbernauer I created a PR that only adds the non-FQDN variant to the SANs, works fine for me: #564

@nightkr
Contributor

nightkr commented Jun 11, 2025

Replaced by #564, which was merged.

@nightkr nightkr closed this Jun 11, 2025
@lfrancke lfrancke deleted the fix/scopes-wo-trailing-dot branch July 3, 2025 11:59
5 participants