Wednesday, October 28, 2009

Nonnegative Matrix Factorization

The general problem of matrix factorization is to decompose a matrix X into two matrix factors A and B:

X_{m\times n}=A_{m\times r}B_{r\times n}

This can be accomplished by different methods. One of them is the SVD. In that case X=U\Sigma V^{T}, so A=U\Sigma^{\frac{1}{2}} and B=\Sigma^{\frac{1}{2}}V^{T}.
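
As a quick illustration of the SVD route, here is a minimal NumPy sketch (the matrix X and the rank r are made up for the example, not taken from any of the references below). It truncates the decomposition to rank r and splits \Sigma^{\frac{1}{2}} between the two factors:

    import numpy as np

    X = np.random.rand(8, 5)   # toy "term-document" matrix: m = 8 terms, n = 5 documents
    r = 3                      # dimensionality of the latent space

    # Truncated SVD: X ~ U_r Sigma_r V_r^T
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_r, s_r, Vt_r = U[:, :r], s[:r], Vt[:r, :]

    # Split Sigma^{1/2} between the factors: A = U Sigma^{1/2}, B = Sigma^{1/2} V^T
    A = U_r * np.sqrt(s_r)            # m x r: basis of the latent space
    B = np.sqrt(s_r)[:, None] * Vt_r  # r x n: codifying vectors (may have negative entries)

    print(np.linalg.norm(X - A @ B))  # rank-r approximation error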

This is used to solve different problems such as Latent Semantic Analysis (LSA). In this case, X is a term-document matrix. The columns of A can be interpreted as the elements of a basis for the latent space, which is r-dimensional. The columns of B are the codifying vectors, i.e., they correspond to the representation of the documents in the latent space. The problem with applying SVD to find the latent space is that the codifying vectors can, in general, have negative values. This means that some documents are represented not only by the presence of latent factors, but also by their absence.

To solve this problem, an additional restriction may be imposed on the basis vectors and codifying vectors: all their entries must be nonnegative. This is called Nonnegative Matrix Factorization (NMF). Notice that in the case of SVD, the restriction is that the basis vectors must be orthogonal, which is accomplished by the eigendecomposition. In NMF this restriction does not apply.

There are different ways to find an NMF [2]; the most obvious one is to minimize:

\|X-AB\|^{2}

An alternative objective function is:

D(X|AB)=\sum_{ij}\left(X_{ij}\log\frac{X_{ij}}{(AB)_{ij}}-X_{ij}+(AB)_{ij}\right)

In both cases, the constraint is A,B \ge 0.

Neither objective function is convex in A and B jointly, so no algorithm is guaranteed to find the global optimum. A gradient-descent technique could be applied. Lee and Seung [1,2] proposed an iterative algorithm with multiplicative update rules for A and B that offers a good compromise between speed and ease of implementation.
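
For reference, here is a minimal NumPy sketch of the multiplicative update rules of [2] for the Frobenius objective \|X-AB\|^{2} (the small epsilon in the denominators, the random initialization and the fixed iteration count are my own simplifications, not part of [2]):

    import numpy as np

    def nmf(X, r, n_iter=200, eps=1e-9):
        """Multiplicative updates (Lee & Seung [2]) for min ||X - AB||^2 s.t. A, B >= 0."""
        m, n = X.shape
        A = np.random.rand(m, r)   # nonnegative random initialization
        B = np.random.rand(r, n)
        for _ in range(n_iter):
            B *= (A.T @ X) / (A.T @ A @ B + eps)   # B <- B .* (A^T X) ./ (A^T A B)
            A *= (X @ B.T) / (A @ B @ B.T + eps)   # A <- A .* (X B^T) ./ (A B B^T)
        return A, B

    X = np.random.rand(8, 5)
    A, B = nmf(X, r=3)
    print(np.linalg.norm(X - A @ B))   # reconstruction error decreases with n_iter

Since the updates are purely multiplicative, A and B stay nonnegative as long as X and the initial factors are nonnegative.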

So, why is this important? Well, there are many reasons. First, it has been shown that the latent representation produced by NMF is better than LSI and spectral clustering for tasks such as document clustering [3]. Second, the NMF algorithm is more efficient than SVD, which requires computing an eigendecomposition; this allows NMF to scale even to problems with n and m on the order of millions. Finally, efficient matrix factorization algorithms (not exactly NMF, but related) have been used to successfully address challenging problems such as the Netflix Prize [4,5]. So it is not difficult to foresee an increased interest in this technique and its application to different IR, ML and recommender-system problems.

What does this mean for us? Well, we have to look at it and apply it to the problems that we are addressing. Some of the ideas:
  1. Use NMF to perform combined visual-textual latent semantic indexing. We were working on doing this using Kernel-LSI; however, we had scalability problems. Also, the literature suggests improved performance of NMF over conventional LSI. NMF could be a very interesting alternative.
  2. According to [3], NMF not only performs dimensionality reduction but also does clustering very well. So, it would be interesting to evaluate NMF as an alternative for simultaneous visualization and summarization of document collections (see the clustering sketch after this list).
  3. In order to work efficiently, NMF requires a sparse representation (this is the case for bag-of-features representations of both text and images). However, we have spent a good deal of time working with kernel-based representations. The question is whether it is possible to apply NMF when one only has a kernel matrix. [6] suggests that the answer is yes, but we have to check it carefully.
  4. Also, it is interesting to see how to maintain an NMF when the X matrix changes dynamically (new elements arriving, changes in the object-similarity function, etc.). We were thinking about this problem for dynamic visualization of image collections. The solution was not obvious for SVD, but it is possible that NMF makes the problem easier to tackle.
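
Regarding point 2, here is a minimal sketch of how [3] turns the factorization into a clustering (reusing the nmf() sketch above; the data is made up): normalize the basis vectors, rescale B accordingly, and assign each document to its dominant latent factor.

    import numpy as np

    X = np.random.rand(8, 20)          # made-up term-document matrix, 20 documents
    A, B = nmf(X, r=3)                 # nmf() as sketched above

    # Normalize each basis vector to unit length and compensate in B (as in [3]),
    # then assign each document to the latent factor with the largest coefficient.
    norms = np.linalg.norm(A, axis=0) + 1e-9
    B_scaled = B * norms[:, None]
    labels = np.argmax(B_scaled, axis=0)   # one cluster label per document
    print(labels)
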
Finally, even if you are not interested in the details of NMF, I suggest taking a look at [5]; it is a very interesting read. It describes how the winning team was able to deal with the Netflix problem. There are many lessons to learn from it.

[1] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, October 1999. [Online]. Available: http://dx.doi.org/10.1038/44565
[2] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” Advances in neural information processing systems, pp. 556–562, 2001. [Online]. Available: http://hebb.mit.edu/people/seung/papers/nmfconverge.pdf
[3] W. Xu, X. Liu, and Y. Gong, “Document clustering based on non-negative matrix factorization,” in SIGIR ’03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval. New York, NY, USA: ACM, 2003, pp. 267–273. [Online]. Available: http://portal.acm.org/citation.cfm?id=860485
[4] G. Takács, I. Pilászy, B. Németh, and D. Tikk, “Scalable collaborative filtering approaches for large recommender systems,” J. Mach. Learn. Res., vol. 10, pp. 623–656, 2009. [Online]. Available: http://portal.acm.org/citation.cfm?id=1577069.1577091
[5] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, vol. 42, no. 8, pp. 30–37, 2009. [Online]. Available: http://dx.doi.org/10.1109/MC.2009.263
[6] D. Zhang, Z.-H. Zhou, and S. Chen, “Non-negative matrix factorization on kernels,” 2006, pp. 404–412. [Online]. Available: http://www.springerlink.com/content/am285016220235t5
[7] C. Ding, X. He, and H. D. Simon, “On the equivalence of nonnegative matrix factorization and spectral clustering,” in Proc. SIAM Data Mining Conf, 2005, pp. 606–610. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/down…
[8] C. Ding, T. Li, and W. Peng, “Nonnegative matrix factorization and probabilistic latent semantic indexing: equivalence, chi-square statistic, and a hybrid method,” in AAAI’06: Proceedings of the 21st national conference on Artificial intelligence. AAAI Press, 2006, pp. 342–347. [Online]. Available: http://portal.acm.org/citation.cfm?id=1597594

3 comments:

  1. Reference number 3 is really illustrative and easy to follow. I wondered why you wrote that negative values in the semantic document representation are not a good thing... it's really a more natural design criterion to select non-negative values, as well as non-orthogonal, possibly overlapping directions. This topic is quite interesting.

  2. Actually, [3] discusses the inconvenience of negative values in the representation: "Because it is more natural to consider each document as an additive rather than subtractive mixture of the underlying topics, the linear combination coefficients should all take non-negative values. Furthermore, it is also quite common that the topics comprising a document corpus are not completely independent of each other, and there are some overlaps among them." Basically, the restrictions imposed by NMF seem to be more natural, for the particular problem of document representation, than the restrictions imposed by PCA.

    [1] illustrates the point very well with face images. NMF produces an additive representation of the faces based on localized features, whilst PCA produces a more 'holistic' representation that uses global features.

  3. Interesting!! And you know what? Today I met a postdoc in MMV who worked during her Ph.D. on facial expression detection using NMF. We discussed her work in a very quick chat. I found two interesting things there... First, its obvious relationship with PCA and eigenfaces. Second, and more interesting, the way in which the constraints are modeled to solve the minimization problem, leading to better results each time. For instance, she talked about Discriminant NMF and other more complex variations involving more and more equations... We agreed to discuss the topic in more depth later... ;)
