Skip to content

Conversation

@ok2c
Copy link
Member

@ok2c ok2c commented Mar 31, 2025

Presently DNS normalization is performed inconsistently and often multiple times over the same input

This change-set

  1. Optimizes DNS normalization and ensures it is performed once per API call
  2. Ensures DNS hames / identities always get represented in Unicode

@arturobernalg Please review.

@ok2c ok2c requested a review from arturobernalg March 31, 2025 13:44
Comment on lines 287 to 294
checkPublicSuffix("xn--85x722f.Com.Cn", "食狮.com.cn");
checkPublicSuffix("xn--85x722f.xn--55qx5d.CN", "食狮.公司.cn");
checkPublicSuffix("www.xn--85x722f.xn--55qx5d.cn", "食狮.公司.cn");
checkPublicSuffix("shishi.xn--55qx5d.cn", "shishi.公司.cn");
checkPublicSuffix("xn--55qx5d.cn", null);
checkPublicSuffix("xn--85x722f.xn--fiqs8s", "xn--85x722f.xn--fiqs8s");
checkPublicSuffix("www.xn--85x722f.xn--fiqs8s", "xn--85x722f.xn--fiqs8s");
checkPublicSuffix("shishi.xn--fiqs8s", "shishi.xn--fiqs8s");
checkPublicSuffix("xn--85x722f.xn--fiqs8s", "食狮.中国");
checkPublicSuffix("www.xn--85x722f.xn--fiqs8s", "食狮.中国");
checkPublicSuffix("shishi.xn--fiqs8s", "shishi.中国");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes break the definition of the formal algorithm as defined by their unit tests (which are copied here): https://github.com/publicsuffix/list/blob/main/tests/test_psl.txt#L89

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@massdosage And why is that a problem?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We rely on this implementation of the PSL formal algorithm downstream, it's the whole reason we're using the PublicSuffixMatcher. If doesn't implement the PSL according to the spec then we might as well use something else.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@massdosage Do feel free to use something else by all of means. The algorithm description however does not mention anything about results being in Punycode. To the contrary it states "The list uses Unicode, not Punycode"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would use something else but this is the best one that doesn't require constant updating (the one in Guava seems to be a hardcoded list of regular expressions that they maintain and we're trying to avoid a dependency on Guava). The alternative is to fork this internally which is what we had done but then we contributed the code here as we thought others might also want the "formal" behaviour. If you think the formal definition unit tests are wrong and should look like what you have here we could raise a discussion there? To me the formal one seems to respect "punycode in, punycode out" and "unicode in, unicode out" not "punycode in, unicode out" which is what this now does.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the spec could be a lot clearer and would benefit from having a list of examples, but failing that all we have to go on is the test suite that they use and other implementations of the standard.

Just to be clear, the change from Punycode to Unicode isn't my personal need, it appears to be how all the libraries which deal with the PSL listed at https://publicsuffix.org/learn/ work. For comparison I did the following with xn--85x722f.com.cn as input to see if xn--85x722f.com.cn would be returned as per the test across a number of them and they all behaved the same:

Using https://github.com/hamano/regdom4j/:

❯ java -jar regdom4j-1.0.3.jar xn--85x722f.com.cn
xn--85x722f.com.cn

Using Guava

InternetDomainName domain = InternetDomainName.from("xn--85x722f.com.cn");
System.out.println(domain.publicSuffix());
//output com.cn        
System.out.println(domain.topDomainUnderRegistrySuffix());
//output xn--85x722f.com.cn

Using https://github.com/whois-server-list/public-suffix-list

PublicSuffixListFactory factory = new PublicSuffixListFactory();
PublicSuffixList suffixList = factory.build();
System.out.println(suffixList.getPublicSuffix("xn--85x722f.com.cn"));
//output com.cn
System.out.println(suffixList.getRegistrableDomain("xn--85x722f.com.cn"));
//output xn--85x722f.com.cn

Using https://pypi.org/project/publicsuffix2/

from publicsuffix2 import get_public_suffix

  print(get_sld('xn--85x722f.com.cn'))

# output xn--85x722f.com.cn

Using https://pypi.org/project/publicsuffixlist/

from publicsuffixlist import PublicSuffixList

  psl = PublicSuffixList()
  print(psl.suffix('xn--85x722f.com.cn'))

# output xn--85x722f.com.cn    

Using https://pypi.org/project/tldextract/

import tldextract
    print(tldextract.extract('xn--85x722f.com.cn'))
# output ExtractResult(subdomain='', domain='xn--85x722f', suffix='com.cn', is_private=False)    

So I think most users would be surprised if the commons matcher worked differently.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@massdosage I find "all cool kids do that" line of argument rather unconvincing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@massdosage Anyways, I kept the original behavior with somewhat non-elegant hack but this is a good as it gets.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@massdosage I find "all cool kids do that" line of argument rather unconvincing.

I'm not sure they're cool kids ;)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@massdosage Anyways, I kept the original behavior with somewhat non-elegant hack but this is a good as it gets.

OK, looks like the PSL tests are all unchanged and passed so that's good, thanks!

@ok2c ok2c force-pushed the dns_normalization_cleanup branch from ed76b8e to 26ace65 Compare April 1, 2025 19:32
Copy link
Member

@arturobernalg arturobernalg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ok2c LGTM

@ok2c ok2c merged commit ffc12f1 into apache:master Apr 5, 2025
10 checks passed
@ok2c ok2c deleted the dns_normalization_cleanup branch April 5, 2025 11:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants