Discovering high-value patterns in transactions
In this post, I’ll present an alternative to popular techniques in market basket analysis that can help practitioners find high-value patterns rather than just the most frequent ones. We’ll gain some intuition into different pattern mining problems and look at a real-world example. The full code can be found here. All images are created by the author.
I’ve written a more introductory article about pattern mining already; if you’re not familiar with some of the concepts that come up here, feel free to check that one out first.
In short, pattern mining tries to find patterns in data (duuh). Most of the time, this data comes in the form of (multi-)sets or sequences. In my last article, for example, I looked at the sequence of actions that a user performs on a website. In that case, we care about the ordering of the items.
In other cases, such as the one we’ll discuss below, we don’t care about the ordering of the items. We only list all the items that were in the transaction and how often they appeared.
So for example, transaction 1 contained 🥪 3 times and 🍎 once. As we can see, we lose information about the ordering of the items, but in many scenarios (such as the one we’ll discuss below), there is no logical ordering of the items. This is similar to a bag of words in NLP.
Market Basket Analysis (MBA) is a data analysis technique commonly used in retail and marketing to uncover relationships between products that customers tend to purchase together. It aims to identify patterns in customers’ shopping baskets or transactions by analyzing their purchasing behavior. The central idea is to understand the co-occurrence of items in shopping transactions, which helps businesses optimize their strategies for product placement, cross-selling, and targeted marketing campaigns.
Frequent Itemset Mining (FIM) is the process of finding frequent patterns in transaction databases. We can measure the frequency of a pattern (i.e. a set of items) by calculating its support. The support of a pattern X is the number of transactions T in the database D that contain X. That is, we’re simply looking at how often the pattern X appears in the database.
In FIM, we then want to find all the patterns that have a support greater than some threshold (often called minsup). If the support of a pattern is greater than minsup, it is considered frequent.
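To make the definition of support concrete, here is a minimal Python sketch. The toy database and the function names are made up for illustration, and treating the threshold as inclusive (>=) is an assumption; some formulations use a strict comparison instead.

```python
def support(pattern, database):
    """Count the transactions that contain every item of `pattern`."""
    pattern = set(pattern)
    # set(transaction) drops duplicates: FIM only cares about presence.
    return sum(1 for transaction in database if pattern <= set(transaction))


def is_frequent(pattern, database, minsup):
    """Assumption: a pattern counts as frequent once its support reaches minsup."""
    return support(pattern, database) >= minsup


# Hypothetical toy database, not taken from the article's data.
toy_db = [
    ["a", "b"],
    ["a", "a", "b", "c"],
    ["b", "c"],
    ["a", "c"],
]

print(support({"a", "b"}, toy_db))                 # 2
print(is_frequent({"a", "b"}, toy_db, minsup=3))   # False
```

Note how the second transaction contains a twice but still counts only once towards the support.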
Limitations
In FIM, we only look at the existence of an item in a transaction. That is, whether an item appears two times or 200 times doesn’t matter; we simply represent it as a one. But we often have cases (such as MBA) where not only the existence of an item in a transaction is relevant but also how many times it appeared in the transaction.
Another problem is that frequency doesn’t always imply relevance. FIM assumes that all items in the transaction are equally important. However, it’s reasonable to assume that someone buying caviar might be more important for a business than someone buying bread, as caviar is potentially a high ROI/profit item.
These limitations directly bring us to High Utility Itemset Mining (HUIM) and High Utility Quantitative Itemset Mining (HUQIM), which are generalizations of FIM that try to address some of the problems of normal FIM.
Our first generalization is that items can appear more than once in a transaction (i.e. we have a multiset instead of a simple set). As discussed before, in normal itemset mining, we transform the transaction into a set and only look at whether the item exists in the transaction or not. So, for example, the two transactions below would have the same representation.
t1 = [a,a,a,a,a,b] # repr. as {a,b} in FIM
t2 = [a,b] # repr. as {a,b} in FIM
Above, both transactions would be represented as {a,b} in regular FIM. We quickly see that, in some cases, we might miss important details. For example, if a and b were items in a customer’s shopping cart, it would matter a lot whether we have a (e.g. a loaf of bread) five times or only once. Therefore, we represent the transaction as a multiset in which we write down how many times each item appeared.
# multiset representation
t1_ms = {(a,5),(b,1)}
t2_ms = {(a,1),(b,1)}
This is also efficient if the items can appear in large quantities (e.g. 100 or 1000 times). In that case, we need not write down all the a’s or b’s but simply how often they appear.
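In Python, one convenient way to sketch this multiset representation is `collections.Counter` from the standard library. This is only an illustration of the idea, not the format any particular mining library expects.

```python
from collections import Counter

t1 = ["a", "a", "a", "a", "a", "b"]
t2 = ["a", "b"]

# Plain FIM view: duplicates collapse, so both transactions look identical.
print(set(t1) == set(t2))  # True

# Multiset view: Counter keeps the purchase quantity of every item.
t1_ms = Counter(t1)
t2_ms = Counter(t2)
print(t1_ms)  # Counter({'a': 5, 'b': 1})
print(t2_ms)  # Counter({'a': 1, 'b': 1})
```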
The generalization that both the quantitative and non-quantitative methods make is to assign every item in the transaction a utility (e.g. profit or time). Below, we have a table that assigns every possible item a unit profit.
We can then calculate the utility of a specific pattern such as {🥪, 🍎} by summing up the utility of those items in the transactions that contain them. In our example we would have:
(3🥪 * $1 + 1🍎 * $2) +
(1 🥪 * $1 + 2🍎 * $2) = $10
So, we get that this pattern has a utility of $10. With FIM, we had the task of finding frequent patterns. Now, we have to find patterns with high utility. This is mainly because we assume that frequency doesn’t imply importance. In regular FIM, we might have missed rare (infrequent) patterns that provide a high utility (e.g. the diamond), which is not the case with HUIM.
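The same calculation can be sketched in a few lines of Python. The names sandwich, apple and lobster stand in for 🥪, 🍎 and 🦞, and the database below contains only the two transactions whose contents are spelled out in the text (transactions 1 and 3); any other transactions are left out, so this is a toy version rather than the full example.

```python
# Unit profits as used in the worked examples of the text.
UNIT_PROFIT = {"sandwich": 1, "apple": 2, "lobster": 10}

# Transaction number -> {item: quantity}; only transactions 1 and 3 are known.
DB = {
    1: {"sandwich": 3, "apple": 1},
    3: {"sandwich": 1, "lobster": 2, "apple": 2},
}


def pattern_utility(pattern, db, unit_profit):
    """Sum quantity * unit profit of the pattern's items over all
    transactions that contain every item of the pattern."""
    total = 0
    for quantities in db.values():
        if all(item in quantities for item in pattern):
            total += sum(quantities[item] * unit_profit[item] for item in pattern)
    return total


print(pattern_utility({"sandwich", "apple"}, DB, UNIT_PROFIT))  # 10
print(pattern_utility({"sandwich"}, DB, UNIT_PROFIT))           # 4
```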
We also need to define the notion of a transaction utility. This is simply the sum of the utility of all the items in the transaction. For our transaction 3 in the database, this would be
1🥪 * $1 + 2🦞*$10 + 2🍎*$2 = $25
Note that solving this problem and finding all high-utility itemsets is more difficult than regular frequent pattern mining (FPM). This is because utility does not follow the Apriori property.
The Apriori Property
Let X and Y be two patterns occurring in a transaction database D. The Apriori property says that if X is a subset of Y, then the support of X must be at least as big as Y’s.
This means that if a subset of Y is infrequent, Y itself must be infrequent, since its support cannot be larger. Let’s say we have X = {a} and Y = {a,b}. If Y appears 4 times in our database, then X must appear at least 4 times, since X is a subset of Y. This makes sense since we are making the pattern less general / more specific by adding an item, which means that it will match fewer transactions. This property is used in most algorithms as it implies that if {a} is infrequent, all supersets are also infrequent and we can eliminate them from the search space [3].
This property doesn’t hold when we are talking about utility. A superset Y of pattern X could have more or less utility. If we take the example from above, {🥪} has a utility of $4. But this doesn’t mean we cannot look at supersets of this pattern. For example, the superset we looked at, {🥪, 🍎}, has a higher utility of $10. At the same time, a superset of a pattern won’t always have more utility, since it might be that this superset just doesn’t appear very often in the DB.
Concept Behind HUIM
Since we can’t use the Apriori property for HUIM directly, we have to come up with some other upper bound for narrowing down the search space. One such bound is called Transaction-Weighted Utilization (TWU). To calculate it, we sum up the transaction utility of the transactions that contain the pattern X of interest. Any superset Y of X can’t have a higher utility than the TWU. Let’s make this clearer with an example. The TWU of {🥪,🍎} is $30 ($5 from transaction 1 and $25 from transaction 3). When we look at a superset pattern Y such as {🥪, 🦞, 🍎}, we can see that there is no way it could have more utility, since all transactions that have Y in them also have X in them, and Y can contribute at most the full transaction utility in each of them.
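The TWU bound can be checked on a toy version of the example. The sketch below uses sandwich, apple and lobster in place of 🥪, 🍎 and 🦞 and includes only transactions 1 and 3, whose contents are given in the text; the function names are chosen for illustration.

```python
UNIT_PROFIT = {"sandwich": 1, "apple": 2, "lobster": 10}

# Transaction number -> {item: quantity}; only transactions 1 and 3 are known.
DB = {
    1: {"sandwich": 3, "apple": 1},
    3: {"sandwich": 1, "lobster": 2, "apple": 2},
}


def transaction_utility(quantities, unit_profit):
    """Utility of a whole transaction: sum over all of its items."""
    return sum(qty * unit_profit[item] for item, qty in quantities.items())


def twu(pattern, db, unit_profit):
    """Transaction-weighted utilization: total utility of every
    transaction that contains the pattern."""
    return sum(
        transaction_utility(quantities, unit_profit)
        for quantities in db.values()
        if all(item in quantities for item in pattern)
    )


def pattern_utility(pattern, db, unit_profit):
    """Utility of only the pattern's items in the transactions containing it."""
    return sum(
        sum(quantities[item] * unit_profit[item] for item in pattern)
        for quantities in db.values()
        if all(item in quantities for item in pattern)
    )


x = {"sandwich", "apple"}
y = {"sandwich", "lobster", "apple"}  # a superset of x
print(twu(x, DB, UNIT_PROFIT))              # 30
print(pattern_utility(y, DB, UNIT_PROFIT))  # 25, which does not exceed the TWU of x
```

If the TWU of a pattern is already below the minimum utility, none of its supersets can reach the threshold, so that whole branch of the search space can be skipped.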
There are now various algorithms for solving HUIM. All of them receive a minimum utility and output the patterns that have at least that utility. In this case, I’ve used the EFIM algorithm since it is fast and memory efficient.
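To illustrate the input/output contract shared by these algorithms, here is a naive brute-force sketch. It is explicitly not EFIM: it enumerates every possible itemset without any pruning, so it is only usable on tiny examples. The toy data (sandwich, apple, lobster for 🥪, 🍎, 🦞, transactions 1 and 3 only) and all function names are assumptions made for illustration.

```python
from itertools import combinations

UNIT_PROFIT = {"sandwich": 1, "apple": 2, "lobster": 10}
DB = [
    {"sandwich": 3, "apple": 1},                # transaction 1 from the text
    {"sandwich": 1, "lobster": 2, "apple": 2},  # transaction 3 from the text
]


def pattern_utility(pattern, db, unit_profit):
    """Utility of the pattern's items over all transactions containing it."""
    return sum(
        sum(t[item] * unit_profit[item] for item in pattern)
        for t in db
        if all(item in t for item in pattern)
    )


def brute_force_huim(db, unit_profit, min_util):
    """Return {pattern: utility} for every itemset with utility >= min_util."""
    items = sorted({item for t in db for item in t})
    result = {}
    for size in range(1, len(items) + 1):
        for pattern in combinations(items, size):
            utility = pattern_utility(pattern, db, unit_profit)
            if utility >= min_util:
                result[frozenset(pattern)] = utility
    return result


high_utility = brute_force_huim(DB, UNIT_PROFIT, min_util=20)
for pattern, utility in high_utility.items():
    print(sorted(pattern), utility)
```

The number of candidate itemsets doubles with every additional item, which is why practical algorithms rely on upper bounds such as the TWU to prune the search.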
For this article, I’ll work with the Market Basket Analysis dataset from Kaggle (used with permission from the original dataset author).
Above, we can see the distribution of transaction values found in the data. There is a total of around 19,500 transactions with a median transaction value of $526 and 26 distinct items per transaction. In total, there are around 4000 unique items. We can also do an ABC analysis where we put items into different buckets depending on their share of total revenue. We can see that around 500 of the 4000 items make up around 70% of the revenue (A-items). We then have a long right tail of items (around 2250) that make up around 5% of the revenue (C-items).
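An ABC analysis of this kind can be sketched as follows. The cut-offs of 70% and 95% cumulative revenue are an assumption derived from the shares mentioned above (70% for A-items, the last 5% for C-items), and the revenue figures are invented for illustration; the actual analysis may have been done differently.

```python
def abc_classes(revenue_by_item, a_cut=0.70, b_cut=0.95):
    """Assign each item to class A, B or C by cumulative revenue share.

    Items are sorted by revenue (highest first). An item's class depends on
    the share of total revenue already covered by the items ranked before it.
    """
    total = sum(revenue_by_item.values())
    ranked = sorted(revenue_by_item.items(), key=lambda kv: kv[1], reverse=True)
    classes = {}
    covered = 0  # revenue of all higher-ranked items
    for item, revenue in ranked:
        if covered < a_cut * total:
            classes[item] = "A"
        elif covered < b_cut * total:
            classes[item] = "B"
        else:
            classes[item] = "C"
        covered += revenue
    return classes


# Hypothetical revenues per item, not taken from the Kaggle dataset.
revenues = {"w": 60, "x": 25, "y": 11, "z": 4}
print(abc_classes(revenues))  # {'w': 'A', 'x': 'A', 'y': 'B', 'z': 'C'}
```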
Preprocessing
The initial data is in a long format where each row is a line item within a bill. From the BillNo, we can see which transaction the item belongs to.
After some preprocessing, we get the data into the format required by PAMI, which is the Python library we’re going to use for applying the EFIM algorithm.
import pandas as pd

# `data` is assumed to be the long-format DataFrame loaded earlier,
# with the columns BillNo, Itemname and Value
data['item_id'] = pd.factorize(data.Itemname)[0].astype(str)  # map item names to ids
data['Value_Int'] = data['Value'].astype(int).astype(str)
data = data.loc[data.Value_Int != '0']  # exclude items w/o utility

transaction_db = data.groupby('BillNo').agg(
    items=('item_id', lambda x: ' '.join(list(x))),
    total_value=('Value', lambda x: int(x.sum())),
    values=('Value_Int', lambda x: ' '.join(list(x))),
    num_items=('item_id', 'size'),  # assumption: line items per bill, needed by the filter below
)
# filter out long transactions, only use a subset of transactions
transaction_db = transaction_db.loc[transaction_db.num_items < 10].iloc[:1000]
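The three aggregated columns (items, total value, per-item values) mirror the text layout that, to my understanding, PAMI’s utility-based miners read: one transaction per line, with the item list, the transaction utility and the individual utilities separated by colons. The helper below is a hypothetical sketch of that layout, not part of PAMI; check the PAMI documentation for the exact format before relying on it.

```python
def to_utility_line(items, utilities, sep=" "):
    """Serialize one transaction as 'items:total_utility:utilities'.

    Assumed layout: items and utilities are each joined by `sep`,
    and the three fields are joined by ':'.
    """
    if len(items) != len(utilities):
        raise ValueError("every item needs exactly one utility value")
    item_part = sep.join(str(item) for item in items)
    utility_part = sep.join(str(u) for u in utilities)
    return f"{item_part}:{sum(utilities)}:{utility_part}"


# Hypothetical transaction with item ids 12 and 7 and utilities 3 and 4.
print(to_utility_line(["12", "7"], [3, 4]))  # 12 7:7:3 4
```

Writing one such line per bill to a file would give a database file like the `tdb.csv` used in the next step.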
We can then apply the EFIM algorithm.
import PAMI.highUtilityPattern.basic.EFIM as efim

obj = efim.EFIM('tdb.csv', minUtil=1000, sep=' ')
obj.startMine()  # start the mining process
obj.save('out.txt')  # store the patterns in a file
results = obj.getPatternsAsDataFrame()  # get the discovered patterns as a DataFrame
obj.printResults()
The algorithm then returns a list of patterns that meet this minimum utility criterion.