Simo Wu

Hello, welcome to my ‘blog’!

This series of study notes will mainly focus on the algorithmic side of building an online recommendation system.

Suppose you are creating an app in which users can search for and review (grade) restaurants or foods. To give your users a better experience, you want to make suggestions based on each user’s past preferences (people often have no idea what to eat when they need to decide). What can you do with the data you’ve already got?

There are two ways you can play with (mine) the data:

First, you can find and group similar, or more accurately partially similar, users. Since they have roughly the same taste, you can tell your customer that people with the same preferences as you may also like…

Second, you can estimate each customer’s preference with respect to a number of latent factors (like food style, price, location…) based on historical data, and estimate each restaurant’s score on the same latent factors. After that, use the correlation between the two to tell whether the user is likely to like or dislike a certain restaurant.

This time I will focus on the algorithm behind the first model. The main idea is: first, find similar users in a massive dataset, and second, group the similar users.

Jaccard Similarity of sets:

The Jaccard similarity between set S and set T is J(S, T) = |S ∩ T| / |S ∪ T|, that is, the ratio of the size of the intersection of the two sets to the size of their union.
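As a quick illustration, here is a minimal Python sketch of this definition. The two users and their restaurant sets are made up for the example, and returning 0.0 for two empty sets is just a convention I assume here.

```python
def jaccard(s, t):
    """Jaccard similarity |S ∩ T| / |S ∪ T| of two collections treated as sets."""
    s, t = set(s), set(t)
    union = s | t
    if not union:
        # Assumed convention: two empty sets get similarity 0.0
        return 0.0
    return len(s & t) / len(union)

# Hypothetical users and the restaurants they liked
alice = {"a", "b", "c", "d"}
bob = {"c", "d", "e"}
print(jaccard(alice, bob))  # 2 shared restaurants out of 5 in total
```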

Matrix Representation of sets:

Suppose we have 4 sets, S1, S2, S3, S4, and 5 elements, a, b, c, d, e, and each set contains several of the elements. The membership relation can be represented as a matrix:

This is a visualization of the characteristic matrix for a collection of sets. The columns of the matrix correspond to the sets and the rows correspond to the elements. In our case the sets can be users and the elements can be restaurants: a 1 means the user liked/visited the restaurant and a 0 means the opposite.

The characteristic matrix is unlikely to be the way the data is actually stored, because this matrix will usually be sparse (mostly zeros).
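A small sketch of the difference between the two representations: store only the elements each set contains, and build the 0/1 matrix on demand. The membership of S1 to S4 below is my own assumption, chosen so that it agrees with the minhash values quoted in the example further down.

```python
# Sparse storage: each set (user) keeps only the elements (restaurants) it contains.
# The memberships are assumed example data.
sets = {
    "S1": {"a", "d"},
    "S2": {"c"},
    "S3": {"b", "d", "e"},
    "S4": {"a", "c", "d"},
}
elements = ["a", "b", "c", "d", "e"]

def characteristic_matrix(sets, elements):
    """Dense 0/1 matrix: one row per element, one column per set (sorted by name)."""
    names = sorted(sets)
    return [[1 if e in sets[n] else 0 for n in names] for e in elements]

matrix = characteristic_matrix(sets, elements)
names = sorted(sets)
print("   " + "  ".join(names))
for element, row in zip(elements, matrix):
    print(element + "  " + "   ".join(str(v) for v in row))
```

The dictionary stores 9 memberships, while the matrix stores all 20 cells; with millions of users and restaurants the gap becomes enormous.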

Minhashing and Minhash Signature

The matrix is usually big, say millions of columns and hundreds of thousands of rows. To characterize each user (each column, each set, whatever you like, since they all mean the same thing right now) we want to construct a ‘signature’ for each user, composed of several hundred ‘minhash’ values. That is to say, we hash down the users’ purchasing information so that the comparison can be done efficiently.

To minhash a set represented by a column of the characteristic matrix, pick a permutation of the rows. The minhash value of any column is the number of the first row, in the permuted order, in which the column has a 1.

Example: we permute the rows of the characteristic matrix shown earlier into the order b, e, a, d, c, and we get:

so the minhash values are h(S1)=a, h(S2)=c, h(S3)=b, h(S4)=a.
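The example can be replayed in a few lines of Python. Note that the memberships of S1 to S4 are an assumption on my part (one matrix that is consistent with the values above), and `minhash` is just a name I picked for the sketch.

```python
# Assumed memberships, consistent with the minhash values quoted above
sets = {
    "S1": {"a", "d"},
    "S2": {"c"},
    "S3": {"b", "d", "e"},
    "S4": {"a", "c", "d"},
}

def minhash(members, permuted_rows):
    """Return the first row, in permuted order, in which the set has a 1."""
    for row in permuted_rows:
        if row in members:
            return row
    return None  # empty set: no row contains a 1

permutation = list("beadc")
for name in sorted(sets):
    print(name, minhash(sets[name], permutation))
# S1 a
# S2 c
# S3 b
# S4 a
```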

There is a remarkable connection between minhashing and Jaccard similarity of the sets that are minhashed.

The probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets.

This will be a useful result in the upcoming construction.
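To see why this holds, look only at the rows in which at least one of the two columns has a 1. Under a random permutation each of those |S ∪ T| rows is equally likely to come first, and the two minhash values agree exactly when that first row has a 1 in both columns, which is the case for |S ∩ T| of them. The simulation below is a rough empirical check of this; the two sets, the number of trials and the seed are arbitrary choices of mine.

```python
import random

def jaccard(s, t):
    return len(s & t) / len(s | t)

def minhash(members, permuted_rows):
    """First row, in permuted order, that belongs to the set."""
    for row in permuted_rows:
        if row in members:
            return row
    return None

def estimate_agreement(s, t, rows, trials=20000, seed=0):
    """Fraction of random row permutations for which the two minhash values agree."""
    rng = random.Random(seed)
    rows = list(rows)
    agree = 0
    for _ in range(trials):
        rng.shuffle(rows)
        if minhash(s, rows) == minhash(t, rows):
            agree += 1
    return agree / trials

S1 = {"a", "d"}        # assumed example sets
S4 = {"a", "c", "d"}
rows = "abcde"
print("Jaccard similarity:        ", jaccard(S1, S4))  # |{a,d}| / |{a,c,d}| = 2/3
print("Estimated P[h(S1) = h(S4)]:", estimate_agreement(S1, S4, rows))
```

The two printed numbers should be close to each other, and get closer as the number of trials grows.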

Next, we introduce minhash signatures, which are used to downsize the set representation while conserving the set’s information. Again, think of a characteristic matrix M. To represent a set, we pick at random some number n of permutations of the rows of M, for example n=100. Call the minhash functions induced by these permutations h1, h2, …, h100. The minhash signature of a set S is the vector [h1(S), h2(S), …, h100(S)], so we obtain a much smaller piece of data (a vector of size n) to represent the original set S.

Note: computing minhash signatures needs a special technique, because explicitly permuting a matrix with millions of rows is too expensive. In practice the permutations are usually simulated by random hash functions applied to the row numbers.
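One common way to avoid real permutations is to replace each of them with a random hash function on the row numbers and take the minimum hash value over the rows where the column has a 1. Below is a minimal sketch of that idea. The hash family h(x) = (a*x + c) mod p, the prime, the seeds and the example data are all my assumptions, and such hash functions only approximate true random permutations.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1, assumed larger than every row number

def make_hash_functions(n, seed=0):
    """Draw n random functions h(x) = (a*x + c) % PRIME that stand in for permutations."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n)]

def minhash_signature(rows_with_ones, hash_functions):
    """Signature of a non-empty set, given as the row numbers where its column has a 1."""
    return [min((a * row + c) % PRIME for row in rows_with_ones)
            for a, c in hash_functions]

def estimated_similarity(sig1, sig2):
    """Fraction of signature positions in which the two signatures agree."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

# Hypothetical data: two users sharing 40 of 80 restaurants (Jaccard similarity 0.5)
rng = random.Random(1)
rows = rng.sample(range(1_000_000), 80)
u = set(rows[:60])
v = set(rows[20:])

hashes = make_hash_functions(100)
sig_u = minhash_signature(u, hashes)
sig_v = minhash_signature(v, hashes)
print("true similarity:     ", len(u & v) / len(u | v))
print("estimated from n=100:", estimated_similarity(sig_u, sig_v))
```

The estimate is noisy for small n; using more hash functions gives a tighter estimate at the cost of a longer signature.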

So by downsizing the sets, we make the comparison between two sets doable. But even if the number of sets is not too large, the number of pairs to compare can still be huge; remember, we are always dealing with big data. We need another algorithm to deal with this situation.

Locality-Sensitive Hashing for Minhash Signatures

The idea is to ‘hash’ items (in our case, sets) several times, in such a way that similar items are more likely to be hashed into the same bucket than dissimilar items are. We then consider any pair that hashed to the same bucket for any of the hashings to be a candidate pair, and we check only the candidate pairs for similarity. We hope that most of the dissimilar pairs will never hash into the same bucket; those that do are called false positives. We also hope that most truly similar pairs will hash into the same bucket at least once; those that do not are called false negatives.

Locality-sensitive means that ‘close’ items are more likely to be hashed into the same bucket, and ‘distant’ items are less likely to be.

So how do we construct a locality-sensitive hashing for minhash signatures? If we have minhash signatures for the items, an effective way to choose the hashings is to divide the signature matrix into b bands consisting of r rows each (n=b*r). For each band, there is a hash function that hashes vectors of r integers into a large number of buckets. We can use the same hash function for all the bands, but we use a separate bucket array for each band, so columns with the same vector in different bands will not hash to the same bucket. Recall that the probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets.
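The banding step can be sketched as follows. This is a simplified version under my own assumptions: the buckets are a Python dictionary keyed by the band's r values themselves, so two columns share a bucket only if they agree on the whole band, and the tiny signatures are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, b, r):
    """Return the pairs of columns whose signatures agree on all r rows of at least one band.

    signatures: dict mapping a column name to its list of b*r minhash values.
    """
    for name, sig in signatures.items():
        if len(sig) != b * r:
            raise ValueError(f"signature of {name} must have length b*r = {b * r}")
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)  # a separate bucket array for each band
        for name, sig in signatures.items():
            key = tuple(sig[band * r:(band + 1) * r])  # this band's r values
            buckets[key].append(name)
        for names in buckets.values():
            candidates.update(combinations(sorted(names), 2))
    return candidates

# Hypothetical signatures with n = 6, split into b = 3 bands of r = 2 rows
signatures = {
    "u1": [1, 2, 3, 4, 5, 6],
    "u2": [1, 2, 9, 9, 5, 6],  # agrees with u1 in the first and third band
    "u3": [7, 7, 3, 4, 8, 8],  # agrees with u1 in the second band only
    "u4": [0, 0, 0, 0, 0, 0],  # agrees with nobody
}
print(sorted(lsh_candidate_pairs(signatures, b=3, r=2)))
# [('u1', 'u2'), ('u1', 'u3')]
```

Only the two candidate pairs would then be checked for real similarity, instead of all six possible pairs.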

So the probability that the signatures of two sets with Jaccard similarity s agree in all rows of one particular band is s^r.

The probability that the signatures disagree in at least one row of a particular band is 1-s^r.

The probability that the signatures disagree in at least one row of every band is (1-s^r)^b.

The probability that the signatures agree in all the rows of at least one band, and therefore become a candidate pair, is 1-(1-s^r)^b.
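To get a feeling for the numbers, here is a short script that evaluates 1-(1-s^r)^b for a few similarities. The choice b=20, r=5 (so n=100) is only an example.

```python
def candidate_probability(s, r, b):
    """Probability that two sets with Jaccard similarity s become a candidate pair."""
    return 1 - (1 - s ** r) ** b

for s in (0.2, 0.4, 0.6, 0.8):
    p = candidate_probability(s, r=5, b=20)
    print(f"s = {s:.1f}  ->  P(candidate) = {p:.4f}")
```

With these parameters a pair with similarity 0.2 is almost never a candidate (well under 1%), while a pair with similarity 0.8 almost always is (over 99.9%).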

Plotting the probability of becoming a candidate pair as a function of the Jaccard similarity s,

we get an S-curve:

So we can see that similar items are much more likely to become candidate pairs. By choosing different threshold values for selecting candidate pairs, we can reduce either the false negatives or the false positives.

The threshold is roughly where the slope is steepest, near where the probability is 1/2; it is approximately (1/b)^(1/r).
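The sketch below tabulates the approximate threshold (1/b)^(1/r) for a few ways of splitting n=100 into bands (the particular values of b are my choice). It also shows how rough the approximation is: at s = (1/b)^(1/r) the candidate probability equals 1-(1-1/b)^b, which is closer to 0.63 than to exactly 1/2.

```python
def approx_threshold(b, r):
    """Approximate similarity at which the S-curve rises most steeply."""
    return (1 / b) ** (1 / r)

def candidate_probability(s, r, b):
    return 1 - (1 - s ** r) ** b

n = 100
for b in (5, 10, 20, 25, 50):
    r = n // b
    t = approx_threshold(b, r)
    p = candidate_probability(t, r, b)
    print(f"b = {b:2d}, r = {r:2d}: threshold ~ {t:.3f}, P(candidate) there = {p:.3f}")
```

More bands with fewer rows each push the threshold down (fewer false negatives, more false positives); fewer bands with more rows push it up.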

So, let’s summarize how to find similar items:

1 Construct the matrix representation of the customer-restaurant relationship.

2 Pick n, the length of the minhash signatures, and compute the signatures.

3 Choose a threshold t that defines how similar sets have to be in order to be regarded as a desired ‘similar pair’, and pick b and r such that b*r=n and t is approximately (1/b)^(1/r).

4 Construct candidate pairs by applying the locality-sensitive hashing technique.

5 Examine each candidate pair’s signatures, or even check the sets themselves, to see whether they are truly similar.
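Putting the five steps together, a compact end-to-end sketch could look like this. Everything concrete in it is an assumption for illustration: the four users and their visits, the hash family used to simulate permutations, the parameters n=100, b=20, r=5, and the choice to verify candidates against the original sets.

```python
import random
from collections import defaultdict
from itertools import combinations

PRIME = 2_147_483_647  # a prime assumed larger than any restaurant id

def jaccard(s, t):
    return len(s & t) / len(s | t)

def signature(items, hashes):
    """Minhash signature of a non-empty set of integer ids (hash functions stand in for permutations)."""
    return [min((a * x + c) % PRIME for x in items) for a, c in hashes]

def candidate_pairs(signatures, b, r):
    """Pairs whose signatures agree on every row of at least one band."""
    pairs = set()
    for band in range(b):
        buckets = defaultdict(list)
        for user, sig in signatures.items():
            buckets[tuple(sig[band * r:(band + 1) * r])].append(user)
        for users in buckets.values():
            pairs.update(combinations(sorted(users), 2))
    return pairs

def similar_pairs(visits, n=100, b=20, r=5, threshold=0.5, seed=0):
    if b * r != n:
        raise ValueError("b * r must equal n")
    rng = random.Random(seed)
    # Step 2: n random hash functions and one signature per user
    hashes = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(n)]
    sigs = {user: signature(items, hashes) for user, items in visits.items()}
    # Step 4: candidate pairs from banding; step 5: check the sets themselves
    return [(u, v) for u, v in sorted(candidate_pairs(sigs, b, r))
            if jaccard(visits[u], visits[v]) >= threshold]

# Step 1: hypothetical customer -> visited restaurant ids (the sparse form of the matrix)
rng = random.Random(42)
ids = rng.sample(range(1_000_000), 200)
visits = {
    "u1": set(ids[0:40]),
    "u2": set(ids[0:36]) | set(ids[40:44]),  # shares 36 of 44 restaurants with u1
    "u3": set(ids[100:140]),
    "u4": set(ids[150:190]),
}
print(similar_pairs(visits))
```

Step 3 is hidden in the defaults: b=20 and r=5 put the approximate threshold (1/b)^(1/r) at about 0.55. Because LSH is probabilistic, a truly similar pair can occasionally be missed, although for a pair as similar as u1 and u2 that is very unlikely.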

For the restaurant rating problem, we can see restaurants as similar if they were visited or rated highly by many of the same customers, and see customers as similar if they visited or rated highly many of the same restaurants. But we cannot expect the Jaccard similarity of two customers’ sets of purchased items to be high, even when the customers have similar tastes. Likewise, two items whose sets of purchasers have a high Jaccard similarity will be rare. Even a Jaccard similarity like 20% might be unusual enough to identify customers with similar tastes.

When our data consists of ratings rather than binary decisions, we cannot rely simply on sets as representations of customers or items. Some options are:

1 Ignore low-rated customer-restaurant pairs, that is, treat these events as if the customer never visited/rated the restaurant.

2 Split the ratings into two classes, ‘liked’ and ‘hated’.

3 If ratings are 1-to-5 stars, put a restaurant n times in a customer’s set if they rated that restaurant n stars. Then use the Jaccard similarity of bags (different from the Jaccard similarity of sets) to measure the similarity.
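Here is a small sketch of how the three options might be turned into code. Several things are my assumptions: the ratings themselves, the cutoff of 4 stars for ‘liked’, the reading of option 2 as one set element per (restaurant, ‘liked’ or ‘hated’) pair, and the bag similarity convention (minimum counts for the intersection, summed counts for the union, so two identical bags score 1/2). Other conventions exist, for example dividing by the maximum counts instead.

```python
from collections import Counter

# Hypothetical star ratings: restaurant -> stars (1 to 5)
ratings_a = {"r1": 5, "r2": 1, "r3": 4}
ratings_b = {"r1": 4, "r3": 2, "r4": 5}

def liked_set(ratings, min_stars=4):
    """Option 1: keep only highly rated restaurants, drop the rest."""
    return {r for r, stars in ratings.items() if stars >= min_stars}

def liked_hated_set(ratings, min_stars=4):
    """Option 2 (one possible reading): one element per restaurant, tagged 'liked' or 'hated'."""
    return {(r, "liked" if stars >= min_stars else "hated")
            for r, stars in ratings.items()}

def rating_bag(ratings):
    """Option 3: a restaurant rated n stars appears n times in the bag."""
    return Counter(ratings)

def bag_jaccard(x, y):
    """Bag similarity under the assumed convention: sum of min counts / sum of all counts."""
    intersection = sum((x & y).values())  # Counter & Counter takes the minimum of each count
    union = sum(x.values()) + sum(y.values())
    return intersection / union if union else 0.0

print(liked_set(ratings_a), liked_set(ratings_b))
print(liked_hated_set(ratings_a))
print(bag_jaccard(rating_bag(ratings_a), rating_bag(ratings_b)))  # (4 + 2) / (10 + 11)
```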

Next time, I will focus on the clustering part, and on some more sophisticated analysis of locality-sensitive hashing.
