Skip to content

Conversation

@aherbert
Copy link
Contributor

@aherbert aherbert commented Mar 7, 2019

This class adds a generic Similarity measure that allows the user to define how a CharSequence is split into a Set. The two sets are then compared and the results returned.

@coveralls
Copy link

coveralls commented Mar 7, 2019

Coverage Status

Coverage increased (+0.03%) to 97.917% when pulling 639e4ab on aherbert:feature-TEXT-155 into ea15c5a on apache:master.

@kinow
Copy link
Member

kinow commented Mar 7, 2019

@aherbert I will have another play with the code later with more time. Another library also implemented helper class/method for the intersection. I think the design here looks similar.

However, I think it would make more sense to have the IntersectionResult being used in other metrics.

Wouldn't it be possible to use IntersectionResult in the Jaccard and even in the new Sorensen-Dice metrics?

We can leave the IntersectionSimilarity but maybe use it as an internal or package protected class? Moving the F1 score and Jaccard to its own classes (in the Jaccard case, I believe it means replacing the code in the existing JaccardSimilarity by IntersectionResult + IntersectionSimilarity, then in the return of the JaccardSimilarity#apply simply have the code we have now in IntersectionResult#getJaccard ).

What do you think?

@aherbert
Copy link
Contributor Author

aherbert commented Mar 7, 2019

This class could be used internally by the JaccardSimilarity. It could also be used by a new Sorenson-Dice Similarity (although I prefer F1Score). Note: that new implementation for the Sorenson-Dice Similarity uses bigrams and not single characters.

But I do not see why it should be package private. This functionality is generic and may be of use to someone. However I am fine with either. My motivation was to try and unify the idea of splitting a CharSequence into things and then computing an intersection of two sets of things.

If this class is public then the user can write their own method to create the Set from the CharSequence. The rest of the similarity metric is implemented. Alternatively we hide it at the package level and create the SorensonDiceSimilarity equivalent to the JaccardSimilarity. I'd suggest adding parameterised constructors to denote what size N-gram to use. The default can be 1. Then add a package level utils class to handle creating the Set<N-gram> for whatever size N-gram is requested.

However limiting to N-grams then stops someone from computing a Jaccard using words, ref. the CosineSimilarity which splits using whitespace. If the class were public this is possible.

The API link you gave (thanks) just uses functions already in Google's Guava for working with Set and their own version of commons-collections Bag named Multiset. It would be easy to extend this helper class to use a Bag approach but I did not do it because commons-text did not depend on commons-collections.

@kinow
Copy link
Member

kinow commented Mar 7, 2019

This class could be used internally by the JaccardSimilarity. It could also be used by a new Sorenson-Dice Similarity (although I prefer F1Score).

+1

Note: that new implementation for the Sorenson-Dice Similarity uses bigrams and not single characters.

Oh, good point. In that case I think I now understand your remark to maybe rename that implementation in the pull request to something like BigramSorensonDice. Good food for thought, no idea what's the best right now, but will think about it for a little while.

But I do not see why it should be package private.

Good arguments @aherbert . Then maybe leave it public, and move the F1/jaccard to the related classes. How does that sound?

@asfgit asfgit closed this in 4fa483c Mar 8, 2019
@aherbert aherbert reopened this Mar 8, 2019
@aherbert
Copy link
Contributor Author

aherbert commented Mar 8, 2019

I'd like to change this a bit to support the Bag functionality. One big nit I had was that:

"aa" vs "aa"       == 1
"aa" vs "aaaaaaaa" == 1            (with no duplicates)
"aa" vs "aaaaaaaa" == 2 / 8 = 0.25 (with duplicates)

I will update the API so that the Function is altered to a Function<CharSequence, Collection<T>>. The user just then need to split the CharSequence into a collection. If it is a Set then the assumption would be that the user would like to perform an intersection without duplicates. Anything else can be converted internally to a Bag representation. I'll put the bag functionality as a private class and only borrow what is needed from commons-collections. The amount of code will be minimal as it is just a HashMap<T, MutableInteger> so the count of the items in the set can be adjusted.

This will get it working. If a dependency on commons-collections is not a problem then the internal class can be removed later without breaking binary compatibility.

I'll also change F1Score to SorensenDiceCoefficient so the name matches the parallel work in #103.

@aherbert
Copy link
Contributor Author

aherbert commented Mar 8, 2019

The new API using Collection is done. The class can now support duplicates.

I have added a test to show the class can produce the same result as a case insensitive word bigram algorithm documented here: How to Strike a Match.

Note: Somewhere between switching computers the git history broke and causes a conflict when trying to rebase. It is only 4 files so when finished (merge or not) I'll drop the branch and redo with the final files.

* <code>2|A &#8745; B| / (|A| + |B|) </code>
* </pre>
*
* <p>This is also known as the F1 score.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing end </p>.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I also noticed that but had stopped firing up insignificant changes to GitHub. It is not detected by checkstyle and I ran mvn javadoc OK. It would have to be a JDK < 1.8 to break building the javadoc IIUC.

*/
public double getJaccardIndex() {
return intersection == 0 ? 0.0 : (double) intersection / getUnion();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is weird. The IntersectionResult is also able to calculate the jaccard index. In my opinion, this is responsibility of somebody else, and the intersection result responsibility should be only store the intersection result.

Here's what I mean in a diff.

diff --git a/src/main/java/org/apache/commons/text/similarity/IntersectionResult.java b/src/main/java/org/apache/commons/text/similarity/IntersectionResult.java
index 823d5a9..df24017 100644
--- a/src/main/java/org/apache/commons/text/similarity/IntersectionResult.java
+++ b/src/main/java/org/apache/commons/text/similarity/IntersectionResult.java
@@ -105,21 +105,6 @@ public class IntersectionResult {
         return (long) sizeA + sizeB - intersection;
     }
 
-    /**
-     * Gets the Jaccard index. The Jaccard is the intersection divided by the union.
-     *
-     * <pre><code>|A &#8745; B| / |A &#8746; B| </code></pre>
-     *
-     * <p>This implementation defines the result as zero if there is no intersection,
-     * even when the union is zero to avoid a {@link Double#NaN} result.</p>
-     *
-     * @return the Jaccard index
-     * @see <a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard index</a>
-     */
-    public double getJaccardIndex() {
-        return intersection == 0 ? 0.0 : (double) intersection / getUnion();
-    }
-
     /**
      * Gets the S&#248;rensen-Dice coefficient. The coefficient is twice the size of the intersection
      * divided by the size of both sets.
diff --git a/src/main/java/org/apache/commons/text/similarity/JaccardSimilarity.java b/src/main/java/org/apache/commons/text/similarity/JaccardSimilarity.java
index 2e88dd2..20084b7 100644
--- a/src/main/java/org/apache/commons/text/similarity/JaccardSimilarity.java
+++ b/src/main/java/org/apache/commons/text/similarity/JaccardSimilarity.java
@@ -16,8 +16,10 @@
  */
 package org.apache.commons.text.similarity;
 
+import java.util.Collection;
 import java.util.HashSet;
 import java.util.Set;
+import java.util.function.Function;
 
 /**
  * Measures the Jaccard similarity (aka Jaccard index) of two sets of character
@@ -62,22 +64,18 @@ public class JaccardSimilarity implements SimilarityScore<Double> {
      * @return index
      */
     private Double calculateJaccardSimilarity(final CharSequence left, final CharSequence right) {
-        final int leftLength = left.length();
-        final int rightLength = right.length();
-        if (leftLength == 0 || rightLength == 0) {
-            return 0d;
-        }
-        final Set<Character> leftSet = new HashSet<>();
-        for (int i = 0; i < leftLength; i++) {
-            leftSet.add(left.charAt(i));
-        }
-        final Set<Character> rightSet = new HashSet<>();
-        for (int i = 0; i < rightLength; i++) {
-            rightSet.add(right.charAt(i));
-        }
-        final Set<Character> unionSet = new HashSet<>(leftSet);
-        unionSet.addAll(rightSet);
-        final int intersectionSize = leftSet.size() + rightSet.size() - unionSet.size();
-        return 1.0d * intersectionSize / unionSet.size();
+        final Function<CharSequence, Collection<Character>> converter = cs -> {
+            final int length = cs.length();
+            final Set<Character> set = new HashSet<>(length);
+            for (int i = 0; i < length; i++) {
+                set.add(cs.charAt(i));
+            }
+            return set;
+        };
+        IntersectionSimilarity<Character> intersectionSimilarity = new IntersectionSimilarity<>(converter);
+        IntersectionResult intersectionResult = intersectionSimilarity.apply(left,  right);
+        int intersection = intersectionResult.getIntersection();
+        long union = intersectionResult.getUnion();
+        return intersection == 0 ? 0.0 : (double) intersection / union;
     }
 }

The change above is replacing our existing logic to calculate the jaccard index in JacardSimilarity class. Now it is calling the IntersectionSimilarity class to get an IntersectResult, and then calculating the jaccard index. All unit tests still pass this way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this is the correct thing to do. Make Jaccard use this class.

*/
public class IntersectionSimilarity<T> implements SimilarityScore<IntersectionResult> {
/** The converter used to create the elements from the characters. */
private final Function<CharSequence, Collection<T>> converter;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we also need a DEFAULT_CHARACTER_CONVERTER in some class/interface? I needed one for characters, so just grabbed the one from the unit tests. I imagine other users with two CharacterSequence would probably use the default one too?

Something like

        public static final Function<CharSequence, Collection<Character>> DEFAULT_CHARACTER_CONVERTER = cs -> {
            final int length = cs.length();
            final Set<Character> set = new HashSet<>(length);
            for (int i = 0; i < length; i++) {
                set.add(cs.charAt(i));
            }
            return set;
        };

I think it is Commons CSV that provides some default parsers or readers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this code gets pulled into use in JaccardSimilarity and SorensonDice then I can add a package private CharSequenceUtils that will have, e.g.:

Set toCharacterSet(CharSequence);
Set toBigramSet(CharSequence);
List toCharacterList(CharSequence);
List toBigramList(CharSequence);

That should be enough to handle the current use cases.

The unit test for creating words of uppercase bigrams is another algorithm but with no name. I do not think we want to support it.

*/
Stream<Map.Entry<T, BagCount>> parallelStream() {
return map.entrySet().parallelStream();
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning a parallelStream might be bad here I think. Instead we should either default to normal stream, or give the user a way to choose whether to use a normal or a parallel stream (example discussion about it with some good references).

I had an issue with a parallel stream in a library some months ago, and took a while to identify where the problem was (it involved returning gridded data from an ArcGIS server in a JAXB web service in a legacy app being ported to Java 8... not the best of the experiences troubleshooting that).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good info.

I was just playing with the API. I'd be happy with a standard for loop over the collection KeySet or EntrySet:

intersection = 0;
for (Entry<T, BagCount> entry : bagA.map.entrySet()) {
    final T element = entry.getKey();
    final int count = entry.getValue().count;
    intersection += Math.min(count, bagB.getCount(element));
}

I do not know about performance. It would have to be tested using JMH verses the stream. But the stream would be doing a lot more work pushing objects around, converting to int then summing them.

*
* @return <code>|A &#8746; B|</code>
*/
public long getUnion() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the class description, Container class to store the intersection results between two sets., it makes sense to have a getIntersection() method. But wouldn't the union belong to something like UnionResult?

This class appears to have a similar interface to that class from Simmetrics I mentioned some time ago. Perhaps the class and the similarity should be named something else?


// The following is adapted from commons-collections for a Bag.
// A Bag is a collection that can store the count of the number
// of copies of each element.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it work if we use the Bag from Commons Collections too? If so, we can add it as a dependency as we did with Commons Lang.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

But once I got going I found the amount of code was minimal. This implementation is faster as it does not support fail-fast concurrent modification checking for threads and doesn't even know it's own size (saving unneeded counter incrementation).

I'd favour not adding a dependency just for this 15 lines of code and using the faster implementation.

However if we do use Bag from collections it offers the possibility that the user can pass one in (via the converter function) and the algorithm detects it.

@kinow
Copy link
Member

kinow commented Mar 9, 2019

The new API using Collection is done. The class can now support duplicates.

I have added a test to show the class can produce the same result as a case insensitive word bigram algorithm documented here: How to Strike a Match.

Does it mean that the code in this pull request can be used to calculate the jaccard index, F1 score/sorensen-dice, as well as sorensen-dice with bigrams? If so we can think later what to do with #103

Note: Somewhere between switching computers the git history broke and causes a conflict when trying to rebase. It is only 4 files so when finished (merge or not) I'll drop the branch and redo with the final files.

All good 👍

Added a few comments. Thanks for the link to the How to Strike a Match article. Very interesting! For that problem, at the moment, I would know only the solution using a more complete NLP library and something like word embedding combined with some machine learning algorithm to train (which requires a lot of data, and still gives weird results). Having an edit distance that does something similar sounds quite useful for prototyping or even as a simpler solution.

Recently - digressing - I needed to remove contractions in Python, and the best out-of-the-box solution I found was pycontractions which does not scale well for thousands/million requests (maybe not even hundreds) and takes a long time to initialize some models. Plus there are some catches with one of its approaches for matching regexes for things like U.S.. So I had to build a simpler solution for my own case, but that won't work in other projects.

@aherbert
Copy link
Contributor Author

aherbert commented Mar 9, 2019

Hi Bruno,

I'll cover most things in this comment.

  1. Name

Since this class computes the intersect I called it IntersectionSimilarity. However it also computes the union. A separate class for compute the union does not makes sense since the two things are intertwined. It is a classic Venn diagram of two sets. If you know the size of each set then if you compute the intersect you can know the union and vice versa.

How about OverlapSimilarity?

  1. Computing the Jaccard + Sorenson Dice

I had added it mainly during testing to check against known results. But there are a lot more metrics than can be computed using the true-positive (intersect) and false-positive (size A - intersect) and false negatives (size B - intersect). Since we are not going to add them all then change to:

OverlapResult - defines set size A, B, intersect and union (defined using the other 3 values).

It could also define falsePositives (using A) and falseNegatives (using B) as utility methods. Or leave them out as we do not need them at the moment.

  1. Shared functionality

This class can then be used within existing similarity scores in the library (Jaccard) and for the new Sorensen-Dice similarity. I thought that was the intension.

It will be up to those classes to appropriately use the OverlapResult to compute their score.

Alex

@kinow
Copy link
Member

kinow commented Mar 10, 2019

Looks good to me! Now we can re-use your work here in #103. Thanks a lot @aherbert !

@kinow kinow changed the title TEXT-155: Add a generic IntersectionSimilarity measure TEXT-155: Add a generic OverlapSimilarity measure Mar 10, 2019
@kinow
Copy link
Member

kinow commented Mar 10, 2019

@aherbert updated the JIRA issue, and was going to merge and add the changes.xml entry, but noticed the broken Travis-CI build.

Could you fix the build, and then if you would like too, merge & resolve the ticket, please? Here's a snippet of the log in Travis:

[INFO] There are 2 errors reported by Checkstyle 8.18 with /home/travis/build/apache/commons-text/checkstyle.xml ruleset.
[ERROR] src/test/java/org/apache/commons/text/similarity/OverlapSimilarityTest.java:[179] (regexp) RegexpSingleline: Line has trailing spaces.
[ERROR] src/test/java/org/apache/commons/text/similarity/OverlapSimilarityTest.java:[192] (regexp) RegexpSingleline: Line has trailing spaces.
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 43.295 s
[INFO] Finished at: 2019-03-09T21:22:18Z
[INFO] Final Memory: 57M/734M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-checkstyle-plugin:3.0.0:check (default-cli) on project commons-text: You have 2 Checkstyle violations. -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
The command "mvn" exited with 1.

Cheers
Bruno

@ameyjadiye
Copy link
Contributor

@aherbert updated the JIRA issue, and was going to merge and add the changes.xml entry, but noticed the broken Travis-CI build.

Could you fix the build, and then if you would like too, merge & resolve the ticket, please? Here's a snippet of the log in Travis:

[INFO] There are 2 errors reported by Checkstyle 8.18 with /home/travis/build/apache/commons-text/checkstyle.xml ruleset.
[ERROR] src/test/java/org/apache/commons/text/similarity/OverlapSimilarityTest.java:[179] (regexp) RegexpSingleline: Line has trailing spaces.
[ERROR] src/test/java/org/apache/commons/text/similarity/OverlapSimilarityTest.java:[192] (regexp) RegexpSingleline: Line has trailing spaces.
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 43.295 s
[INFO] Finished at: 2019-03-09T21:22:18Z
[INFO] Final Memory: 57M/734M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-checkstyle-plugin:3.0.0:check (default-cli) on project commons-text: You have 2 Checkstyle violations. -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
The command "mvn" exited with 1.

Cheers
Bruno

@aherbert this used to happen this with me quite a lot, then I started practicing to simply type mvn which triggers defaultGoal in pom.xml and ensures you nothing will fail in travis-ci. happy coding 😄

@aherbert
Copy link
Contributor Author

I have tried to clean up the history into a single commit.

I have changed the name back to IntersectionSimilarity as it was pointed out to me that Overlap has a specific meaning in the combinatorics on words space, an “overlap” is a specific repeated pattern. Also there is an OverlapCoefficient between sets which is the intersection over the min size of the two sets.

I dropped the computation of the metrics and the union from the IntersectionResult. This class now has no logic but just holds data.

I removed the use of streams and use a classic iteration over the smaller of the two sets of keys to get the intersection.

This is now a generic set similarity which just requires a function to split up a CharSequence. A place to provide such functions, as contained in the example units test, is best left to another block of new functionality.

@aherbert
Copy link
Contributor Author

this used to happen this with me quite a lot, then I started practicing to simply type mvn which triggers defaultGoal in pom.xml and ensures you nothing will fail in travis-ci.

It was my bad as I usually have checkstyle enabled in the IDE. So I have added that for the commons-text project. I must remember to do mvn in the future.

@kinow
Copy link
Member

kinow commented Mar 10, 2019

Not a problem @aherbert , some times I forget to run clirr and break backward compatibility (which is hundreds of times worse than checkstyle 🙂 ). And thanks for the explanation on Overlap back to Intersection. Great to have someone in our team with background and connections that can provide this kind of feedback . Thanks!

@ameyjadiye
Copy link
Contributor

@aherbert , are there more improvement on this ? if everything looks good please merge to trunk so I can make changes from my side. Thank you.

@asfgit asfgit merged commit 639e4ab into apache:master Mar 13, 2019
asfgit pushed a commit that referenced this pull request Mar 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants