Movie Recommendation with Hierarchical Poisson Factorization in Edward

In this article, we will explore an implementation of the Hierarchical Poisson Factorization (HPF) model in Edward. This implementation uses the MovieLens 2k dataset to learn latent features that try to capture the preferences of users and the attributes of movies. The code is available on GitHub.

Recommender systems

I'm not going to talk about recommender systems (recsys) here. For an overview of recsys and matrix factorization techniques, see [1]. Here, I will discuss the Hierarchical Poisson Factorization model as presented in [2].

Hierarchical Poisson Factorization

The HPF model was developed by David Blei and Prem Gopalan, and it's one of the simplest matrix factorization techniques. The model assumes that each cell of the observed behavior matrix is drawn from a Poisson distribution whose rate is the inner product of the corresponding user preferences and movie attributes. This model has been shown to have many desirable properties.

The generative process of HPF, as described in the original paper, is as follows:

  1. For each user \(u\):

    1. Sample activity \(\xi_u \sim Gamma(a',a'/b')\).
    2. For each component \(k\), sample preference \(\theta_{uk} \sim Gamma(a,\xi_u)\).
  2. For each movie \(v\):

    1. Sample popularity \(\eta_v \sim Gamma(c',c'/d')\).
    2. For each component \(k\), sample attribute \(\beta_{vk} \sim Gamma(c,\eta_v)\).
  3. For each user \(u\) and movie \(v\), sample rating \(r_{uv} \sim Poisson(\theta_u^T\beta_v)\).

The posteriors of the user preferences \(\mathbf{\theta}\) and the movie attributes \(\mathbf{\beta}\) are computed via variational inference [3].
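To make the generative story concrete, here is a tiny NumPy simulation of this process (the sizes and hyperparameter values below are purely illustrative):

import numpy as np

rng = np.random.RandomState(0)
n_users, n_movies, n_comp = 5, 7, 3  # Tiny illustrative sizes.
a = ap = c = cp = 0.3                # Illustrative hyperparameter values.
bp = dp = 1.0

# NumPy's gamma is parameterized by (shape, scale), so a Gamma(shape, rate)
# draw uses scale = 1 / rate.
xi = rng.gamma(ap, bp / ap, size=n_users)                        # activity ~ Gamma(a', a'/b')
theta = rng.gamma(a, 1.0 / xi[:, None], size=(n_users, n_comp))  # preference ~ Gamma(a, xi_u)

eta = rng.gamma(cp, dp / cp, size=n_movies)                      # popularity ~ Gamma(c', c'/d')
beta = rng.gamma(c, 1.0 / eta[:, None], size=(n_movies, n_comp)) # attribute ~ Gamma(c, eta_v)

# Every rating is Poisson with rate theta_u . beta_v.
r = rng.poisson(theta.dot(beta.T))  # shape (n_users, n_movies)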

Dataset

The data used in this post is the MovieLens 2k dataset, published by the GroupLens research group. It was released in the framework of the 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011), held at the 5th ACM Conference on Recommender Systems (RecSys 2011).
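The snippets below assume the data has already been loaded and re-indexed. Here is a minimal sketch of that preparation; the file and column names (user_ratedmovies.dat, movies.dat, userID, movieID, rating, id) are taken from the HetRec 2011 release, so adjust them to your local copy:

import numpy as np
import pandas as pd

# Tab-separated files from the hetrec2011-movielens-2k release.
ratings = pd.read_csv("user_ratedmovies.dat", sep="\t")
meta = pd.read_csv("movies.dat", sep="\t", encoding="latin-1")

# Re-index users and movies to contiguous integers 0..n-1 and 0..m-1.
ratings["uid"] = ratings["userID"].astype("category").cat.codes
ratings["mid"] = ratings["movieID"].astype("category").cat.codes
n = ratings["uid"].nunique()  # Number of users.
m = ratings["mid"].nunique()  # Number of movies.

# Align meta with the contiguous movie index.
movie_ids = ratings["movieID"].astype("category").cat.categories
meta = meta.set_index("id").loc[movie_ids].reset_index(drop=True)
meta.index.name = "movieID"

# Indices and values of the observed cells of the rating matrix,
# in the format expected by tf.gather_nd below.
tidx = ratings[["uid", "mid"]].values
y = ratings["rating"].values.astype(np.float32)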

Probabilistic programming

We will build this model in Edward [4], a "Turing-complete probabilistic programming language". To build a probabilistic model in Edward, all you need to do is specify the set of random variables of your model, instantiate your inference algorithm, and press run. For the model described above, it looks something like the code below. The Jupyter notebook of this article is available on GitHub.

import edward as ed
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from edward.models import Poisson, Gamma
from edward.models import PointMass, Empirical

ed.set_seed(42)

Model

We specify our model.

k = 20        # Number of latent components.
n_iter = 500  # Number of inference iterations.
t = 500       # Number of samples kept by the Empirical approximations.

# n, m (number of users and movies) and tidx (indices of the observed cells
# of the rating matrix) come from the data preparation step above.
act = Gamma(1.0, 1.0, sample_shape=n)   # User activity.
pref = Gamma(1.0, act, sample_shape=k)  # User preferences, shape [k, n].

pop = Gamma(0.1, 0.1, sample_shape=m)   # Movie popularity.
attr = Gamma(1.0, pop, sample_shape=k)  # Movie attributes, shape [k, m].

# Observed ratings: Poisson with rate theta_u . beta_v at the observed indices.
like = Poisson(tf.gather_nd(tf.matmul(pref, attr, transpose_a=True), tidx))

# Posterior approximations. Gibbs writes its samples into the params of the
# Empirical variables, so params must be a tf.Variable (the softplus keeps
# the initial values positive). The PointMass variables are optimized by MAP
# through the softplus, which constrains the point estimates to stay positive.
qact = Empirical(
    tf.Variable(tf.nn.softplus(tf.random_normal([t, n]))),
)
qpref = PointMass(
    tf.nn.softplus(tf.Variable(tf.random_normal([k, n]))),
)
qpop = Empirical(
    tf.Variable(tf.nn.softplus(tf.random_normal([t, m]))),
)
qattr = PointMass(
    tf.nn.softplus(tf.Variable(tf.random_normal([k, m]))),
)

Inference

Thanks to Edward's compositionality of inference, we can compose different inference algorithms for the same model. Here, we will perform inference using Monte Carlo EM (Expectation Maximization), with Gibbs sampling for the E-step over the user activity and movie popularity, and MAP for the M-step over the user preferences and movie attributes.

# E-step: Gibbs sampling of user activity and movie popularity,
# conditioned on the current point estimates of preferences and attributes.
inference_e = ed.Gibbs(
    {act: qact, pop: qpop},
    data={like: y, pref: qpref, attr: qattr},
)
# M-step: MAP point estimates of user preferences and movie attributes,
# conditioned on the current activity and popularity samples.
inference_m = ed.MAP(
    {pref: qpref, attr: qattr},
    data={like: y, act: qact, pop: qpop},
)

inference_e.initialize()
inference_m.initialize(n_iter=n_iter, optimizer="rmsprop")

tf.global_variables_initializer().run()

Once set up, we can now run the model.

loss = np.empty(n_iter, dtype=np.float32)

for i in range(n_iter):
    info_dict_e = inference_e.update()  # E-step: Gibbs sampling.
    info_dict_m = inference_m.update()  # M-step: MAP update.

    loss[i] = info_dict_m["loss"]

    inference_m.print_progress(info_dict_m)

We inspect the loss to see how the model has done.

fig = plt.figure(figsize=(8,6))
ax = fig.add_subplot(111)
ax.plot(loss / loss.max())
ax.set_title("Loss")
ax.set_xlabel("Iteration")
[Figure: Loss]

It seems like the model has converged, so we can save our posteriors for later use in the recommendation process.

sess = ed.get_session()
pref = sess.run(qpref)  # Inferred user preferences, shape [k, n].
attr = sess.run(qattr)  # Inferred movie attributes, shape [k, m].

Results

We've implemented HPF to infer the latent user preferences and movie attributes. We can now use them to make recommendations. Here, the meta dataframe contains basic information about each movie in the form:

meta.head()
movieID  title                        imdbPictureURL                        year  genre
0        Toy story                    http://ia.media-imdb.com/images/...  1995  Adventure, Animation, Children, Comedy, Fantasy
1        Jumanji                      http://ia.media-imdb.com/images/...  1995  Adventure, Children, Fantasy
2        Grumpy Old Men               http://ia.media-imdb.com/images/...  1993  Comedy, Romance
3        Waiting to Exhale            http://ia.media-imdb.com/images/...  1995  Comedy, Drama, Romance
4        Father of the Bride Part II  http://ia.media-imdb.com/images/...  1995  Comedy

The rating matrix contains the actual ratings of users.
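A minimal sketch of how that matrix could be assembled from the ratings dataframe above, using NaN for cells without a rating (the NaN convention is an assumption used by the helper sketches below):

# Dense n x m rating matrix, NaN where a user hasn't rated a movie.
rating = np.full((n, m), np.nan, dtype=np.float32)
rating[ratings["uid"].values, ratings["mid"].values] = ratings["rating"].values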

Heatmap

It's not easy to visualize multi-dimensional data. One way around this is a heatmap. We plot a heatmap of the movie attributes (see the code of the heatmap function in the repository).

heatmap(attr, meta)
[Figure: Heatmap]

As we can see, for each movie, most values are really small. The higher values may capture the salient features of that movie (even though we don't know for sure what these features might be).
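For reference, here is a minimal sketch of what such a heatmap helper could look like (the real function is in the repository; this version assumes only the attr matrix and meta dataframe from above):

def heatmap(attr, meta, n_movies=20):
    """Plot the attribute components of the first few movies."""
    fig = plt.figure(figsize=(10, 6))
    ax = fig.add_subplot(111)
    im = ax.imshow(attr[:, :n_movies].T, aspect="auto", cmap="viridis")
    ax.set_yticks(range(n_movies))
    ax.set_yticklabels(meta["title"].iloc[:n_movies])
    ax.set_xlabel("Component")
    fig.colorbar(im, ax=ax)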

Multidimensional scaling

With the plot above, it's hard to tell which movies are close together or far apart. For that, we can plot some of the movies, with two selected components on the x and y axes.

mds(attr, meta)
[Figure: MDS]

We can do this in three dimensions (with three selected components). We see on the plot that some movies are grouped together. Note that this is just 20 data points (movies). Also, movies close together in these dimensions may be farther apart in other dimensions.
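As with heatmap, the mds function lives in the repository. A minimal sketch, assuming it simply scatters a few movies along two chosen attribute components:

def mds(attr, meta, c1=0, c2=1, n_movies=20):
    """Scatter the first few movies along two attribute components."""
    fig = plt.figure(figsize=(8, 6))
    ax = fig.add_subplot(111)
    ax.scatter(attr[c1, :n_movies], attr[c2, :n_movies])
    for mid in range(n_movies):
        ax.annotate(meta["title"].iloc[mid], (attr[c1, mid], attr[c2, mid]))
    ax.set_xlabel("Component %d" % c1)
    ax.set_ylabel("Component %d" % c2)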

Recommendation

Before recommending a movie to a user, let's see some of the movies rated by that user.

uid = np.random.randint(n)  # Pick a user at random.
rated_by(uid, meta, rating)
[Figure: Rated by]

Now, let's see what the system "thinks" might interest that user.

recommend_to(uid, pref, attr, meta, rating)
[Figure: Recommend to]

The recommendation is done on movies not rated by the user (we assume that a rating is missing because the user hasn't watched (or encountered) the movie yet). I'm not really a movie expert, so I will rely on your comments to discuss the outcome of this recommendation. The movies shown in the "rated by" plot are just a random subset of the movies rated by that user.
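For reference, here is a minimal sketch of the idea behind recommend_to (the real helper is in the repository; this version assumes the NaN-filled rating matrix built earlier and skips the plotting):

def recommend_to(uid, pref, attr, meta, rating, n_movies=5):
    """Rank the movies the user hasn't rated by their expected rating."""
    scores = pref[:, uid].dot(attr)              # Expected rating of every movie.
    unrated = np.isnan(rating[uid])              # Movies this user hasn't rated.
    ranked = np.where(unrated, scores, -np.inf)  # Exclude already-rated movies.
    ids = np.argsort(ranked)[::-1][:n_movies]
    return meta.iloc[ids]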

Besides making recommendations, another feature of a recommender system is suggesting movies similar to a selected one. In the similar_to function, we compute the similarity with one of the simplest distance metrics, the Euclidean distance. As a reminder, the Euclidean distance between two vectors \(u\) and \(v\) is the quantity:

\begin{equation*} d(u, v) = ||u - v||_2 \end{equation*}

The lower the distance, the closer the two vectors. For the similarity suggestion, we can choose a well-known movie. I chose Toy Story (choose your own). It's the first movie in the dataset; see the entries of the meta dataframe at the beginning of this section. The ID of Toy Story is 0 (it's arbitrary). Let's see what the system "thinks" is "similar" to Toy Story.

similar_to(0, attr, meta, rating)  # Similar to Toy Story.
[Figure: Similar to]

I've watched two of the movies suggested here (Finding Nemo and Monsters, Inc.), as well as Toy Story itself, so I'm in a position to say that these movies do share some features. The rating shown below or after each movie's title (in the figure title) is the median rating for that movie.

After plotting this, it occurred to me that Toy Story appears in the suggestions despite the fact that the implementation explicitly discards the movie itself (since a movie is always similar to itself). Below is the part of the code that performs the calculation (see the full code on GitHub).

from numpy.linalg import norm

# mid is the ID of the query movie; n is here the number of suggestions kept.
dist = np.apply_along_axis(norm, 1, attr[:, mid] - attr.T)  # Euclidean distance: lower is closer.
ids = np.argsort(dist)[1:n+1]  # Discard the first hit, the movie itself.

That result makes me think that Toy Story might be duplicated in the dataset. Let's take a closer look at the suggested movies (and check their IDs) based on their Euclidean distance, this time including the movie itself.
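closer_to is a small helper from the repository; a minimal sketch of it, under the same assumptions as above (the attr matrix and meta dataframe):

def closer_to(mid):
    """Meta dataframe augmented with the Euclidean distance to movie `mid`,
    sorted from closest (the movie itself) to farthest."""
    dist = np.apply_along_axis(norm, 1, attr[:, mid] - attr.T)
    info = meta.copy()
    info["distance"] = dist
    return info.sort_values("distance")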

info = closer_to(0)
info.head(6)
movieID  title                     imdbPictureURL                        year  genre                                             distance
0        Toy story                 http://ia.media-imdb.com/images/...  1995  Adventure, Animation, Children, Comedy, Fantasy   0.000000
2827     Toy story                 http://ia.media-imdb.com/images/...  1995  Adventure, Animation, Children, Comedy, Fantasy   0.172951
5900     Finding Nemo              http://ia.media-imdb.com/images/...  2003  Adventure, Animation, Children, Comedy            0.342709
2104     A Bug's Life              http://ia.media-imdb.com/images/...  1998  Adventure, Animation, Children, Comedy            0.346706
4501     Monsters, Inc.            http://ia.media-imdb.com/images/...  2001  Animation, Children, Comedy, Fantasy              0.408800
3658     The Emperor's New Groove  http://ia.media-imdb.com/images/...  2000  Adventure, Animation, Children, Comedy, Fantasy   0.432105

And indeed, Toy Story has a duplicate (ID 2827), and it's interesting to see that the system was able to find it and suggest it as the closest movie to the other copy (ID 0). The system doesn't know a priori the difference between "movie 2827" and "movie 0" (or "movie 68" and "movie 5341", for that matter), but in the attribute subspace it's able to tell that they are much closer to each other than to any other movie.

Conclusion

In this post, we showed an implementation of the Hierarchical Poisson Factorization model in Edward for movie recommendation. Hierarchical models are a powerful class of probabilistic models, as discussed in this blog post by Thomas Wiecki. By enhancing the structure of the plain Poisson factorization model with this hierarchical component, HPF becomes a handy tool in the modeler's toolbox.

Thank you for reading.

References

[1] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix Factorization Techniques for Recommender Systems. IEEE Computer. 2009.

[2] P. Gopalan, J. Hofman, and D. Blei. Scalable Recommendation with Hierarchical Poisson Factorization. Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence. 2015.

[3] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational Inference: A Review for Statisticians. arXiv:1601.00670 [stat.CO]. 2016.

[4] Dustin Tran, Alp Kucukelbir, Adji B. Dieng, Maja Rudolph, Dawen Liang, and David M. Blei. Edward: A Library for Probabilistic Modeling, Inference, and Criticism. arXiv:1610.09787. 2016.
