-
Notifications
You must be signed in to change notification settings - Fork 273
TEXT-155: Add a generic OverlapSimilarity measure #109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
505066c to
964077a
Compare
|
@aherbert I will have another play with the code later with more time. Another library also implemented helper class/method for the intersection. I think the design here looks similar. However, I think it would make more sense to have the Wouldn't it be possible to use We can leave the What do you think? |
964077a to
8b92150
Compare
|
This class could be used internally by the But I do not see why it should be package private. This functionality is generic and may be of use to someone. However I am fine with either. My motivation was to try and unify the idea of splitting a If this class is public then the user can write their own method to create the However limiting to N-grams then stops someone from computing a The API link you gave (thanks) just uses functions already in Google's Guava for working with |
+1
Oh, good point. In that case I think I now understand your remark to maybe rename that implementation in the pull request to something like BigramSorensonDice. Good food for thought, no idea what's the best right now, but will think about it for a little while.
Good arguments @aherbert . Then maybe leave it public, and move the F1/jaccard to the related classes. How does that sound? |
|
I'd like to change this a bit to support the I will update the API so that the This will get it working. If a dependency on commons-collections is not a problem then the internal class can be removed later without breaking binary compatibility. I'll also change F1Score to SorensenDiceCoefficient so the name matches the parallel work in #103. |
|
The new API using I have added a test to show the class can produce the same result as a case insensitive word bigram algorithm documented here: How to Strike a Match. Note: Somewhere between switching computers the git history broke and causes a conflict when trying to rebase. It is only 4 files so when finished (merge or not) I'll drop the branch and redo with the final files. |
| * <code>2|A ∩ B| / (|A| + |B|) </code> | ||
| * </pre> | ||
| * | ||
| * <p>This is also known as the F1 score. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing end </p>.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I also noticed that but had stopped firing up insignificant changes to GitHub. It is not detected by checkstyle and I ran mvn javadoc OK. It would have to be a JDK < 1.8 to break building the javadoc IIUC.
| */ | ||
| public double getJaccardIndex() { | ||
| return intersection == 0 ? 0.0 : (double) intersection / getUnion(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is weird. The IntersectionResult is also able to calculate the jaccard index. In my opinion, this is responsibility of somebody else, and the intersection result responsibility should be only store the intersection result.
Here's what I mean in a diff.
diff --git a/src/main/java/org/apache/commons/text/similarity/IntersectionResult.java b/src/main/java/org/apache/commons/text/similarity/IntersectionResult.java
index 823d5a9..df24017 100644
--- a/src/main/java/org/apache/commons/text/similarity/IntersectionResult.java
+++ b/src/main/java/org/apache/commons/text/similarity/IntersectionResult.java
@@ -105,21 +105,6 @@ public class IntersectionResult {
return (long) sizeA + sizeB - intersection;
}
- /**
- * Gets the Jaccard index. The Jaccard is the intersection divided by the union.
- *
- * <pre><code>|A ∩ B| / |A ∪ B| </code></pre>
- *
- * <p>This implementation defines the result as zero if there is no intersection,
- * even when the union is zero to avoid a {@link Double#NaN} result.</p>
- *
- * @return the Jaccard index
- * @see <a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard index</a>
- */
- public double getJaccardIndex() {
- return intersection == 0 ? 0.0 : (double) intersection / getUnion();
- }
-
/**
* Gets the Sørensen-Dice coefficient. The coefficient is twice the size of the intersection
* divided by the size of both sets.
diff --git a/src/main/java/org/apache/commons/text/similarity/JaccardSimilarity.java b/src/main/java/org/apache/commons/text/similarity/JaccardSimilarity.java
index 2e88dd2..20084b7 100644
--- a/src/main/java/org/apache/commons/text/similarity/JaccardSimilarity.java
+++ b/src/main/java/org/apache/commons/text/similarity/JaccardSimilarity.java
@@ -16,8 +16,10 @@
*/
package org.apache.commons.text.similarity;
+import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
+import java.util.function.Function;
/**
* Measures the Jaccard similarity (aka Jaccard index) of two sets of character
@@ -62,22 +64,18 @@ public class JaccardSimilarity implements SimilarityScore<Double> {
* @return index
*/
private Double calculateJaccardSimilarity(final CharSequence left, final CharSequence right) {
- final int leftLength = left.length();
- final int rightLength = right.length();
- if (leftLength == 0 || rightLength == 0) {
- return 0d;
- }
- final Set<Character> leftSet = new HashSet<>();
- for (int i = 0; i < leftLength; i++) {
- leftSet.add(left.charAt(i));
- }
- final Set<Character> rightSet = new HashSet<>();
- for (int i = 0; i < rightLength; i++) {
- rightSet.add(right.charAt(i));
- }
- final Set<Character> unionSet = new HashSet<>(leftSet);
- unionSet.addAll(rightSet);
- final int intersectionSize = leftSet.size() + rightSet.size() - unionSet.size();
- return 1.0d * intersectionSize / unionSet.size();
+ final Function<CharSequence, Collection<Character>> converter = cs -> {
+ final int length = cs.length();
+ final Set<Character> set = new HashSet<>(length);
+ for (int i = 0; i < length; i++) {
+ set.add(cs.charAt(i));
+ }
+ return set;
+ };
+ IntersectionSimilarity<Character> intersectionSimilarity = new IntersectionSimilarity<>(converter);
+ IntersectionResult intersectionResult = intersectionSimilarity.apply(left, right);
+ int intersection = intersectionResult.getIntersection();
+ long union = intersectionResult.getUnion();
+ return intersection == 0 ? 0.0 : (double) intersection / union;
}
}The change above is replacing our existing logic to calculate the jaccard index in JacardSimilarity class. Now it is calling the IntersectionSimilarity class to get an IntersectResult, and then calculating the jaccard index. All unit tests still pass this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this is the correct thing to do. Make Jaccard use this class.
| */ | ||
| public class IntersectionSimilarity<T> implements SimilarityScore<IntersectionResult> { | ||
| /** The converter used to create the elements from the characters. */ | ||
| private final Function<CharSequence, Collection<T>> converter; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we also need a DEFAULT_CHARACTER_CONVERTER in some class/interface? I needed one for characters, so just grabbed the one from the unit tests. I imagine other users with two CharacterSequence would probably use the default one too?
Something like
public static final Function<CharSequence, Collection<Character>> DEFAULT_CHARACTER_CONVERTER = cs -> {
final int length = cs.length();
final Set<Character> set = new HashSet<>(length);
for (int i = 0; i < length; i++) {
set.add(cs.charAt(i));
}
return set;
};I think it is Commons CSV that provides some default parsers or readers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this code gets pulled into use in JaccardSimilarity and SorensonDice then I can add a package private CharSequenceUtils that will have, e.g.:
Set toCharacterSet(CharSequence);
Set toBigramSet(CharSequence);
List toCharacterList(CharSequence);
List toBigramList(CharSequence);
That should be enough to handle the current use cases.
The unit test for creating words of uppercase bigrams is another algorithm but with no name. I do not think we want to support it.
| */ | ||
| Stream<Map.Entry<T, BagCount>> parallelStream() { | ||
| return map.entrySet().parallelStream(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning a parallelStream might be bad here I think. Instead we should either default to normal stream, or give the user a way to choose whether to use a normal or a parallel stream (example discussion about it with some good references).
I had an issue with a parallel stream in a library some months ago, and took a while to identify where the problem was (it involved returning gridded data from an ArcGIS server in a JAXB web service in a legacy app being ported to Java 8... not the best of the experiences troubleshooting that).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good info.
I was just playing with the API. I'd be happy with a standard for loop over the collection KeySet or EntrySet:
intersection = 0;
for (Entry<T, BagCount> entry : bagA.map.entrySet()) {
final T element = entry.getKey();
final int count = entry.getValue().count;
intersection += Math.min(count, bagB.getCount(element));
}
I do not know about performance. It would have to be tested using JMH verses the stream. But the stream would be doing a lot more work pushing objects around, converting to int then summing them.
| * | ||
| * @return <code>|A ∪ B|</code> | ||
| */ | ||
| public long getUnion() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the class description, Container class to store the intersection results between two sets., it makes sense to have a getIntersection() method. But wouldn't the union belong to something like UnionResult?
This class appears to have a similar interface to that class from Simmetrics I mentioned some time ago. Perhaps the class and the similarity should be named something else?
|
|
||
| // The following is adapted from commons-collections for a Bag. | ||
| // A Bag is a collection that can store the count of the number | ||
| // of copies of each element. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it work if we use the Bag from Commons Collections too? If so, we can add it as a dependency as we did with Commons Lang.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
But once I got going I found the amount of code was minimal. This implementation is faster as it does not support fail-fast concurrent modification checking for threads and doesn't even know it's own size (saving unneeded counter incrementation).
I'd favour not adding a dependency just for this 15 lines of code and using the faster implementation.
However if we do use Bag from collections it offers the possibility that the user can pass one in (via the converter function) and the algorithm detects it.
Does it mean that the code in this pull request can be used to calculate the jaccard index, F1 score/sorensen-dice, as well as sorensen-dice with bigrams? If so we can think later what to do with #103
All good 👍 Added a few comments. Thanks for the link to the How to Strike a Match article. Very interesting! For that problem, at the moment, I would know only the solution using a more complete NLP library and something like word embedding combined with some machine learning algorithm to train (which requires a lot of data, and still gives weird results). Having an edit distance that does something similar sounds quite useful for prototyping or even as a simpler solution. Recently - digressing - I needed to remove contractions in Python, and the best out-of-the-box solution I found was pycontractions which does not scale well for thousands/million requests (maybe not even hundreds) and takes a long time to initialize some models. Plus there are some catches with one of its approaches for matching regexes for things like U.S.. So I had to build a simpler solution for my own case, but that won't work in other projects. |
|
Hi Bruno, I'll cover most things in this comment.
Since this class computes the intersect I called it How about
I had added it mainly during testing to check against known results. But there are a lot more metrics than can be computed using the true-positive (intersect) and false-positive (size A - intersect) and false negatives (size B - intersect). Since we are not going to add them all then change to:
It could also define falsePositives (using A) and falseNegatives (using B) as utility methods. Or leave them out as we do not need them at the moment.
This class can then be used within existing similarity scores in the library (Jaccard) and for the new Sorensen-Dice similarity. I thought that was the intension. It will be up to those classes to appropriately use the Alex |
|
@aherbert updated the JIRA issue, and was going to merge and add the Could you fix the build, and then if you would like too, merge & resolve the ticket, please? Here's a snippet of the log in Travis: Cheers |
@aherbert this used to happen this with me quite a lot, then I started practicing to simply type |
9a7d018 to
639e4ab
Compare
|
I have tried to clean up the history into a single commit. I have changed the name back to I dropped the computation of the metrics and the union from the I removed the use of streams and use a classic iteration over the smaller of the two sets of keys to get the intersection. This is now a generic set similarity which just requires a function to split up a |
It was my bad as I usually have checkstyle enabled in the IDE. So I have added that for the commons-text project. I must remember to do |
|
Not a problem @aherbert , some times I forget to run clirr and break backward compatibility (which is hundreds of times worse than checkstyle 🙂 ). And thanks for the explanation on |
|
@aherbert , are there more improvement on this ? if everything looks good please merge to trunk so I can make changes from my side. Thank you. |
This class adds a generic Similarity measure that allows the user to define how a
CharSequenceis split into aSet. The two sets are then compared and the results returned.