-
Notifications
You must be signed in to change notification settings - Fork 983
DNS host / identity normalization should be performed only once per public API call #629
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| checkPublicSuffix("xn--85x722f.Com.Cn", "食狮.com.cn"); | ||
| checkPublicSuffix("xn--85x722f.xn--55qx5d.CN", "食狮.公司.cn"); | ||
| checkPublicSuffix("www.xn--85x722f.xn--55qx5d.cn", "食狮.公司.cn"); | ||
| checkPublicSuffix("shishi.xn--55qx5d.cn", "shishi.公司.cn"); | ||
| checkPublicSuffix("xn--55qx5d.cn", null); | ||
| checkPublicSuffix("xn--85x722f.xn--fiqs8s", "xn--85x722f.xn--fiqs8s"); | ||
| checkPublicSuffix("www.xn--85x722f.xn--fiqs8s", "xn--85x722f.xn--fiqs8s"); | ||
| checkPublicSuffix("shishi.xn--fiqs8s", "shishi.xn--fiqs8s"); | ||
| checkPublicSuffix("xn--85x722f.xn--fiqs8s", "食狮.中国"); | ||
| checkPublicSuffix("www.xn--85x722f.xn--fiqs8s", "食狮.中国"); | ||
| checkPublicSuffix("shishi.xn--fiqs8s", "shishi.中国"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes break the definition of the formal algorithm as defined by their unit tests (which are copied here): https://github.com/publicsuffix/list/blob/main/tests/test_psl.txt#L89
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@massdosage And why is that a problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We rely on this implementation of the PSL formal algorithm downstream, it's the whole reason we're using the PublicSuffixMatcher. If doesn't implement the PSL according to the spec then we might as well use something else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@massdosage Do feel free to use something else by all of means. The algorithm description however does not mention anything about results being in Punycode. To the contrary it states "The list uses Unicode, not Punycode"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would use something else but this is the best one that doesn't require constant updating (the one in Guava seems to be a hardcoded list of regular expressions that they maintain and we're trying to avoid a dependency on Guava). The alternative is to fork this internally which is what we had done but then we contributed the code here as we thought others might also want the "formal" behaviour. If you think the formal definition unit tests are wrong and should look like what you have here we could raise a discussion there? To me the formal one seems to respect "punycode in, punycode out" and "unicode in, unicode out" not "punycode in, unicode out" which is what this now does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the spec could be a lot clearer and would benefit from having a list of examples, but failing that all we have to go on is the test suite that they use and other implementations of the standard.
Just to be clear, the change from Punycode to Unicode isn't my personal need, it appears to be how all the libraries which deal with the PSL listed at https://publicsuffix.org/learn/ work. For comparison I did the following with xn--85x722f.com.cn as input to see if xn--85x722f.com.cn would be returned as per the test across a number of them and they all behaved the same:
Using https://github.com/hamano/regdom4j/:
❯ java -jar regdom4j-1.0.3.jar xn--85x722f.com.cn
xn--85x722f.com.cn
Using Guava
InternetDomainName domain = InternetDomainName.from("xn--85x722f.com.cn");
System.out.println(domain.publicSuffix());
//output com.cn
System.out.println(domain.topDomainUnderRegistrySuffix());
//output xn--85x722f.com.cn
Using https://github.com/whois-server-list/public-suffix-list
PublicSuffixListFactory factory = new PublicSuffixListFactory();
PublicSuffixList suffixList = factory.build();
System.out.println(suffixList.getPublicSuffix("xn--85x722f.com.cn"));
//output com.cn
System.out.println(suffixList.getRegistrableDomain("xn--85x722f.com.cn"));
//output xn--85x722f.com.cn
Using https://pypi.org/project/publicsuffix2/
from publicsuffix2 import get_public_suffix
print(get_sld('xn--85x722f.com.cn'))
# output xn--85x722f.com.cn
Using https://pypi.org/project/publicsuffixlist/
from publicsuffixlist import PublicSuffixList
psl = PublicSuffixList()
print(psl.suffix('xn--85x722f.com.cn'))
# output xn--85x722f.com.cn
Using https://pypi.org/project/tldextract/
import tldextract
print(tldextract.extract('xn--85x722f.com.cn'))
# output ExtractResult(subdomain='', domain='xn--85x722f', suffix='com.cn', is_private=False)
So I think most users would be surprised if the commons matcher worked differently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@massdosage I find "all cool kids do that" line of argument rather unconvincing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@massdosage Anyways, I kept the original behavior with somewhat non-elegant hack but this is a good as it gets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@massdosage I find "all cool kids do that" line of argument rather unconvincing.
I'm not sure they're cool kids ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@massdosage Anyways, I kept the original behavior with somewhat non-elegant hack but this is a good as it gets.
OK, looks like the PSL tests are all unchanged and passed so that's good, thanks!
ed76b8e to
26ace65
Compare
arturobernalg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ok2c LGTM
Presently DNS normalization is performed inconsistently and often multiple times over the same input
This change-set
@arturobernalg Please review.