
You Don’t Know Jacc(ard). When the Jaccard similarity index isn’t… | by Eleanor Hanna | Jul, 2024



When the Jaccard similarity index isn’t the right tool for the job, and what to do instead

Towards Data Science

I’ve been thinking lately about one of my go-to data science tools, something we use quite a bit at Aampe: the Jaccard index. It’s a similarity metric that you compute by taking the size of the intersection of two sets and dividing it by the size of the union of the two sets. In essence, it’s a measure of overlap.

The formula for the Jaccard similarity index: J(A, B) = |A ∩ B| / |A ∪ B|
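As a minimal sketch (the function name and the convention for two empty sets are my own, not from the post), the index is a one-liner over Python sets:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0  # convention: two empty sets have no overlap to measure
    return len(a & b) / len(a | b)

# Two sets sharing 2 of 4 total elements
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5
```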

For my fellow visual learners:

On the left, two Venn diagrams are stacked on top of each other as if dividing one by the other. In the numerator position, the intersection of the Venn diagram is highlighted. In the denominator position, the union is highlighted. On the right is a similar image, but the Venn diagram has a larger overlap.
Image by author

Many (myself included) have sung the praises of the Jaccard index, because it comes in handy for a lot of use cases where you need to figure out the similarity between two groups of elements. Whether you’ve got a relatively concrete use case like cross-device identity resolution, or something more abstract, like characterizing latent user interest categories based on historical user behavior: it’s really useful to have a tool that quantifies how many elements two things share.

But Jaccard is not a silver bullet. Sometimes it’s more informative when used together with other metrics than when used alone. Sometimes it’s downright misleading.

Let’s take a closer look at a few cases where it’s not quite appropriate, and what you might want to do instead (or alongside).

The problem: The larger one set is than the other (holding the size of the intersection equal), the more it depresses the Jaccard index.

In some cases, you don’t care whether two sets are reciprocally similar. Maybe you just want to know whether Set A mostly intersects with Set B.

Let’s say you’re trying to establish a taxonomy of user interest based on browsing history. You have a log of all the users who visited http://www.luxurygoodsemporium.com and a log of all the users who visited http://superexpensiveyachts.com (neither of which are live links at press time; fingers crossed no one creepy buys these domains in the future).

Say that out of 1,000 users who browsed for super expensive yachts, 900 of them also looked up some luxury goods, but 50,000 users visited the luxury goods site. Intuitively, you might interpret these two domains as similar. Nearly everyone who patronized the yacht domain also went to the luxury goods domain. Seems like we might be detecting a latent dimension of “high-end purchase behavior.”

But because the number of users who were into yachts was so much smaller than the number of users who were into luxury goods, the Jaccard index would end up being very small (0.018), even though the overwhelming majority of the yacht shoppers also browsed luxury goods!

What to do instead: Use the overlap coefficient.

The overlap coefficient is the size of the intersection of two sets divided by the size of the smaller set. Formally:

The formula for the overlap coefficient: overlap(A, B) = |A ∩ B| / min(|A|, |B|)
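A sketch in the same vein (the handling of empty sets is an assumption on my part, since the formula is undefined there):

```python
def overlap_coefficient(a: set, b: set) -> float:
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|)."""
    if not a or not b:
        return 0.0  # assumption: treat an empty set as having no overlap
    return len(a & b) / min(len(a), len(b))

# A is a strict subset of B, so the coefficient is 1.0
# even though Jaccard here would only be 3/8
a = {1, 2, 3}
b = {1, 2, 3, 4, 5, 6, 7, 8}
print(overlap_coefficient(a, b))  # 1.0
```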

Let’s visualize why this might be preferable to Jaccard in some cases, using the most extreme version of the problem: Set A is a subset of Set B.

When Set A is pretty close in size to Set B, you’ve got a decent Jaccard similarity, because the size of the intersection (which is the size of Set A) is close to the size of the union. But as you hold the size of Set A constant and increase the size of Set B, the size of the union increases too, and… the Jaccard index plummets.

The overlap coefficient doesn’t. It stays yoked to the size of the smallest set. That means that even as the size of Set B increases, the size of the intersection (which in this case is the entire size of Set A) will always be divided by the size of Set A.

A line plot, titled “Similarity over various sizes of union when intersection = 1000.” The y axis is labeled “similarity.” The x axis is labeled “size of the union.” The plot shows one function representing Jaccard similarity, which starts at a value of 1 and then sharply plummets as the size of the union increases. The plot shows another function representing the overlap coefficient, which starts at a value of 1 and then remains there as the size of the union increases.
Image by author

Let’s return to our user interest taxonomy example. The overlap coefficient captures what we’re interested in here: the user base for yacht-buying is linked to the luxury goods user base. Maybe the SEO for the yacht website is no good, and that’s why it’s not patronized as much as the luxury goods site. With the overlap coefficient, you don’t have to worry about something like that obscuring the relationship between these domains.

Jaccard similarity vs. the overlap coefficient for the example in the text: Jaccard = 900 / 50,100 ≈ 0.018, while the overlap coefficient = 900 / min(1,000, 50,000) = 900 / 1,000 = 0.9.

Pro tip: if all you have are the sizes of each set and the size of the intersection, you can find the size of the union by summing the sizes of each set and subtracting the size of the intersection. Like this:

|A ∪ B| = |A| + |B| − |A ∩ B|
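Putting the pro tip to work, here’s a small sketch (function name is mine) that computes Jaccard from summary counts alone, using the yacht/luxury-goods numbers from earlier:

```python
def jaccard_from_sizes(size_a: int, size_b: int, size_intersection: int) -> float:
    """Jaccard similarity from summary counts, using
    |A ∪ B| = |A| + |B| - |A ∩ B| for the denominator."""
    size_union = size_a + size_b - size_intersection
    return size_intersection / size_union

# Yacht/luxury-goods example: 900 / (1,000 + 50,000 - 900) = 900 / 50,100
print(round(jaccard_from_sizes(1_000, 50_000, 900), 3))  # 0.018
```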

Further reading: https://medium.com/rapids-ai/similarity-in-graphs-jaccard-versus-the-overlap-coefficient-610e083b877d

The problem: When set sizes are very small, your Jaccard index is lower-resolution, and sometimes that overemphasizes relationships between sets.

Let’s say you work at a start-up that produces mobile games, and you’re developing a recommender system that suggests new games to users based on their previous playing habits. You’ve got two new games out: Mecha-Crusaders of the Cyber Void II: Prisoners of Vengeance, and Freecell.

A focus group probably wouldn’t peg these two as being very similar, but your analysis shows a Jaccard similarity of .4. No great shakes, but it happens to be on the higher end of the other pairwise Jaccards you’re seeing; after all, Bubble Crush and Bubble Exploder only have a Jaccard similarity of .39. Does this mean your cyberpunk RPG and Freecell are more closely related (as far as your recommender is concerned) than Bubble Crush and Bubble Exploder?

Not necessarily. Because you took a closer look at your data, and only 3 unique device IDs were logged playing Mecha-Crusaders, only 4 were logged playing Freecell, and 2 of them just happened to have played both. Meanwhile, Bubble Crush and Bubble Exploder were each visited by hundreds of devices. Because your samples for the two new games are so small, a potentially coincidental overlap makes the Jaccard similarity look much higher than the true population overlap would probably be.

Two Venn diagrams that have almost the same general shape. On the left is a Venn diagram showing the overlap in users playing the Mecha game and Freecell; 1 user plays only the Mecha game, 2 users play only Freecell, and 2 users play both games. On the right is a Venn diagram showing the overlap in users playing Bubble Crush and Bubble Exploder; 450 users play only Bubble Crush, 790 users play only Bubble Exploder, and 780 users play both games.
Image by author
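To see the small-sample problem numerically, we can recompute both Jaccard scores straight from the Venn counts in the figure:

```python
# Venn counts from the figure: (only game A, only game B, both)
mecha_vs_freecell = (1, 2, 2)        # 3 Mecha devices, 4 Freecell devices
bubble_vs_bubble = (450, 790, 780)   # hundreds of devices each

for only_a, only_b, both in (mecha_vs_freecell, bubble_vs_bubble):
    union = only_a + only_b + both
    print(f"{both}/{union} = {both / union:.2f}")
# 2/5 = 0.40 vs. 780/2020 = 0.39: nearly identical scores,
# from wildly different amounts of evidence
```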

What to do instead: Good data hygiene is always something to keep in mind here: you can set a heuristic to wait until you’ve collected a certain sample size before considering a set in your similarity matrix. Like all estimates of statistical power, there’s an element of judgment to this, based on the typical size of the sets you’re working with, but remember the general statistical best practice that larger samples tend to be more representative of their populations.

But another option you have is to log-transform the size of the intersection and the size of the union. This output should only be interpreted when comparing two modified indices to each other.

If you do this for the example above, you get a score pretty close to what you had before for the two new games (0.431). But because you have so many more observations in the Bubble genre of games, the log-transformed intersection and log-transformed union are a lot closer together, which translates to a much higher score.

The modified Jaccard similarity for the two pairs: for the Mecha game and Freecell, log(2) / log(5) ≈ 0.431; for Bubble Crush and Bubble Exploder, log(780) / log(2,020) ≈ 0.875.
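One way to sketch the log-transformed index described above (the function name is mine; natural log is used, though any base gives the same ratio since the base cancels):

```python
import math

def log_jaccard(intersection_size: int, union_size: int) -> float:
    """Log-transformed Jaccard: log|A ∩ B| / log|A ∪ B|.
    Only meaningful when comparing two such scores to each other."""
    return math.log(intersection_size) / math.log(union_size)

# Mecha/Freecell: tiny sets, intersection of 2 out of a union of 5
print(round(log_jaccard(2, 5), 3))       # 0.431
# Bubble games: large sets, intersection of 780 out of a union of 2,020
print(round(log_jaccard(780, 2020), 3))  # 0.875
```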

Caveat: The trade-off here is that you lose some resolution when the union has a lot of elements in it. Adding 100 elements to the intersection of a union with thousands of elements might mean the difference between a regular Jaccard score of .94 and .99. Using the log transform approach might mean that adding 100 elements to the intersection only moves the needle from a score of .998 to .999. It depends on what’s important for your use case!

The problem: You’re comparing two groups of elements, but collapsing the elements into sets results in a loss of signal.

This is why using a Jaccard index to compare two pieces of text is not always a great idea. It can be tempting to look at a pair of documents and want a measure of their similarity based on what tokens are shared between them. But the Jaccard index assumes that the elements in the two groups being compared are unique. Which flattens out word frequency. And in natural language analysis, token frequency is often really important.

Imagine you’re comparing a book about vegetable gardening, the Bible, and a dissertation about the life cycle of the white-tailed deer. All three of these documents might include the token “deer,” but the relative frequency of the “deer” token will vary dramatically between the documents. The much higher frequency of the word “deer” in the dissertation probably has a different semantic impact than the scarce uses of the word “deer” in the other documents. You wouldn’t want a similarity measure to just forget about that signal.

What to do instead: Use cosine similarity. It’s not just for NLP anymore! (But also it’s for NLP.)

Briefly, cosine similarity is a way to measure how similar two vectors are in multidimensional space (regardless of the magnitude of the vectors). The direction a vector points in multidimensional space depends on the frequencies of the dimensions used to define the space, so information about frequency is baked in.

To make it easy to visualize, let’s say there are only two tokens we care about across the three documents: “deer” and “bread.” Each text uses these tokens a different number of times. The frequencies of these tokens become the dimensions that we plot the three texts in, and the texts are represented as vectors in this two-dimensional plane. For instance, the vegetable gardening book mentions deer 3 times and bread 5 times, so we plot a line from the origin to (3, 5).

A graph plotting the frequency of the word “deer” against the frequency of the word “bread” in three texts: the KJV Bible (abridged), a vegetable gardening book, and a dissertation about white-tailed deer. The angle of the vectors representing the Bible and the gardening book is small. The vector representing the dissertation about deer makes a large angle with either one of the other two vectors.
Image by author

Here you want to look at the angles between the vectors. θ1 represents the similarity between the dissertation and the Bible; θ2, the similarity between the dissertation and the vegetable gardening book; and θ3, the similarity between the Bible and the vegetable gardening book.

The angles between the dissertation and each of the other texts are pretty large. We take that to mean that the dissertation is semantically distant from the other two, at least relatively speaking. The angle between the Bible and the gardening book is small relative to each of their angles with the dissertation, so we’d take that to mean there’s less semantic distance between the two of them than from the dissertation.

But we’re talking here about similarity, not distance. Cosine similarity is a transformation of the angle between the two vectors into an index that ranges from 0 to 1*, with the same intuitive pattern as Jaccard: 0 would mean two groups have nothing in common, and the closer you get to 1, the more similar the two groups are.

* Technically, cosine similarity can range from -1 to 1, but we’re using it with frequencies here, and there can be no frequencies less than zero. So we’re restricted to the interval from 0 to 1.
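Here’s a small sketch of cosine similarity over token-count vectors. Only the gardening book’s counts (3 “deer,” 5 “bread”) come from the text; the other two vectors are invented for illustration:

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors are (deer count, bread count); the gardening numbers come
# from the text, the other two are hypothetical
gardening = [3, 5]
bible = [10, 60]        # hypothetical: mostly "bread"
dissertation = [95, 1]  # hypothetical: almost all "deer"

print(round(cosine_similarity(gardening, bible), 2))         # high: small angle
print(round(cosine_similarity(gardening, dissertation), 2))  # low: large angle
```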

Cosine similarity is famously applied to text analysis, like we’ve done above, but it can be generalized to other use cases where frequency is important. Let’s return to the luxury goods and yachts use case. Suppose you don’t merely have a log of which unique users went to each site; you also have counts of the number of times each user visited. Maybe you find that each of the 900 users who went to both websites only went to the luxury goods site once or twice, while they went to the yacht website dozens of times. If we think of each user as a token, and therefore as a different dimension in multidimensional space, a cosine similarity approach might push the yacht-heads a bit further away from the luxury goods buyers. (Note that you can run into scalability issues here, depending on the number of users you’re considering.)

Further reading: https://medium.com/geekculture/cosine-similarity-and-cosine-distance-48eed889a5c4

I still love the Jaccard index. It’s easy to compute and generally pretty intuitive, and I end up using it all the time. So why write a whole blog post dunking on it?

Because no one data science tool can give you a complete picture of your data. Each of these different measures tells you something slightly different. You can get valuable information out of seeing where the outputs of these tools converge and where they differ, as long as you know what the tools are actually telling you.

Philosophically, we’re against one-size-fits-all approaches at Aampe. After all the time we’ve spent studying what makes users unique, we’ve learned the value of leaning into complexity. So we think the wider the array of tools you can use, the better, as long as you know how to use them.


