Keywords: von Neumann’s trace inequality · Generalized Kristof theorem · Suborthonormal · Semiorthonormal · Singular value decomposition
Kristof’s (1969; 1970) theorem for the global maximum of the trace of matrix products gives simple derivations of the least squares solutions for various problems in psychometrics and multivariate analysis. A special but basic case of the theorem, involving two sets of matrix products and yielding a bilinear form, was given by von Neumann (1937, Theorem 1) and is known as his trace inequality; it was introduced into psychometrics by Green Jr (1969, p. 317) based on a comment by Ingram Olkin.
In spite of its great utility, von Neumann’s derivation, which uses sophisticated mathematics, is not easy for applied researchers to follow. Simplifications or elementary derivations of the theorem have been given by, e.g., Kristof (1970) and Mirsky (1975). It is to be noted that these two authors also gave extensions of von Neumann’s trace inequality, in different forms, to cases with more than two sets of matrix products.
While Kristof’s proof of his theorem is based on elementary linear algebra using mostly self-contained material, the proof is long and involved. This may be one of the reasons for the relatively low citation frequency noted by Levin (1979, p. 109): “Kristof’s theorem has not received its due attention in the psychometric literature”. This remark was also cited by Waller (2018, Introduction), who further stated that “Underutilization of this method likely stems, in part, to the mathematical complexity of Kristof’s (1964; 1970) writings” (Abstract).
One of the purposes of this tutorial is to break down Kristof’s long and involved derivation into independent lemmas in order to provide a transparent structure for the proof. Note that the lemmas may also be of interest as general results in elementary linear algebra. The second purpose is to introduce the short derivation of von Neumann’s trace inequality obtained by Mirsky (1975), as mentioned earlier, where Fan’s (1951) lemma is introduced with a self-contained didactic proof. Note that von Neumann’s trace inequality and its extensions have wide applications in various fields, e.g., applied linear algebra, mathematical physics and the hyperelasticity of isotropic materials as well as psychology, as reviewed by Miranda and Thompson (1993), who cited Kristof (1970) among the associated references.
The remainder of this article is organized as follows. In Section 2, some lemmas are introduced for Kristof’s theorem, followed by a didactic derivation of von Neumann’s trace inequality. Section 3 gives didactic proofs of Kristof’s theorem for the 3-fold or tri-linear case and the general case. In Section 4, Ten Berge’s generalized theorem and modifications of the Kristof and Ten Berge theorems are presented. Some applications of these theorems are shown in Section 5. Section 6 gives a discussion. In the appendix, technical details are provided.
In this section, six lemmas and a theorem are shown, with a didactic derivation of von Neumann’s trace inequality in line with the later derivation of Kristof’s theorem. Lemma 1 gives the maximum of the sum of products of two quantities required for von Neumann’s trace inequality, followed by Lemma 1A, corresponding to Kristof (1970, Lemma 1), for the similar maximum of the sum of products of more than two quantities required for the derivation of Kristof’s theorem. Lemma 2 shows that the traces have the same ranges irrespective of the signs and order of the associated diagonal elements, which corresponds to Kristof (1970, Lemma 2).
Lemma 3 is a new independent lemma corresponding to the symmetric condition for a maximized trace in Kristof (1970, (iv) of the proof of Theorem (first version)). Lemma 4 is the second independent lemma for the property that symmetric \(\bf {AD}\) and \(\bf {DA}\) with \(\bf {D}\) being diagonal make \(\bf {A}\) diagonal (Kristof, 1970, (iv) of the proof of Theorem (first version)).
Lemma 5 is the third independent lemma when we have two products of orthonormal and diagonal matrices (a special case of Kristof (1970, (iv) of the proof of Theorem (first version))) for von Neumann’s trace inequality, which was provided to understand the inequality as a special case of Kristof’s theorem. Theorem 1 is for von Neumann’s trace inequality with the derivation similar to the later one for Kristof’s general theorem.
Lemma 1 : The maximum of the sum of products of two quantities (Hardy, Littlewood, and Pólya (1934, 1952, Subsection 10.2); von Neumann (1937, Theorem 1); Simon (2005, Lemma 1.8)). For two sets of m numbers with \({a_1} \ge \cdots \ge {a_m} \ge 0\) and \({b_1} \ge \cdots \ge {b_m} \ge 0\), let \(a_1^*,...,a_m^*\) and \(b_1^*,...,b_m^*\) be arbitrary cases in each set of \(m!\) permutations including possibly the same ones. Then, the maximum of \(\sum \nolimits _{i = 1}^m {a_i^*b_i^*} \) over the permutations is given by \(\sum \nolimits _{i = 1}^m {a_ib_i} \).
Proof. Without loss of generality, consider the maximum of \(\sum \nolimits _{i = 1}^m {a_ib_i^*} \). Suppose that \(b_1^* \ne {b_1}\). Then, exchanging \(b_1^*\) and \(b_k^* = {b_1}\,(k \ne 1)\) in the permutation, \(\sum \nolimits _{i = 1}^m {a_ib_i^*}\) increases or remains unchanged since \(({a_i} - {a_j})({b_i} - {b_j}) \ge 0\) and consequently \({a_i}{b_i} + {a_j}{b_j} - ({a_i}{b_j} + {a_j}{b_i}) \ge 0\,\,(1 \le i \le j \le m)\). Using this possibly exchanged permutation, redefine \(b_1^*,...,b_m^*\). Then, when \(b_2^* \ne {b_2}\), exchange \(b_2^*\) and \(b_k^* = {b_2}\,(k > 2)\). Repeat this process until the possible exchange of \(b_{m - 1}^*\) and \(b_m^* = {b_{m - 1}}\) when \(b_{m - 1}^* \ne {b_{m - 1}}\). The final permutation gives \(\sum \nolimits _{i = 1}^m {{a_i}b_i^*} = \sum \nolimits _{i = 1}^m {{a_i}{b_i}} \). Since the sum never decreases at any exchange in the above process, \(\sum \nolimits _{i = 1}^m {{a_i}{b_i}} \) is no less than the initial \(\sum \nolimits _{i = 1}^m {{a_i}b_i^*} \) for an arbitrary permutation, which gives the required maximum. \(\sqcap \)\(\sqcup \)
The above proof is a “heuristic” one finding the maximum successively. Waller (2018, Topic III) also used a similar “constructive proof” for the above lemma based on the proof of Simon (2005, Lemma 1.8, p. 4), which is elementary and of interest. However, the above heuristic proof seems to be simpler than Waller’s didactic one. Note that “heuristic” is synonymous with “constructive” in this case.
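As a small numerical illustration (a sketch assuming Python with numpy and itertools; the numbers are arbitrary), the following code checks Lemma 1 by brute force over all \(m!\) permutations of one set.

```python
# Brute-force check of Lemma 1 with arbitrary illustrative numbers:
# the maximum of sum_i a_i * b_i^* over all permutations b^* equals sum_i a_i * b_i.
import itertools
import numpy as np

a = np.array([5.0, 3.0, 2.0, 1.0])   # a_1 >= ... >= a_m >= 0
b = np.array([4.0, 2.5, 2.0, 0.5])   # b_1 >= ... >= b_m >= 0

brute_max = max(np.dot(a, perm) for perm in itertools.permutations(b))
print(brute_max == np.dot(a, b))     # True: the similarly ordered pairing is maximal
```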
Lemma 1A: The maximum of the sum of products with an arbitrary number of factors (Kristof, 1970, Lemma 1). For n sets of m numbers with \(a_1^{(j)} \ge \cdots \ge a_m^{(j)} \ge 0\) \((j = 1,...,n;\,\,n \ge 2)\), let \(a_1^{(j)*},...,a_m^{(j)*}\) be an arbitrary case in the j-th set of \(m!\) permutations including possibly the same ones. Then, the maximum of \(\sum \nolimits _{i = 1}^m {a_i^{(1)*} \cdots } \,a_i^{(n)*}\) over the permutations is given by \(\sum \nolimits _{i = 1}^m {a_i^{(1)} \cdots } \,a_i^{(n)}\).
Proof. Consider the case of \(n=3\). For the two sets \(a_1^{(j)*},...,a_m^{(j)*}\,\,(j = 1,2)\) among the three sets, Lemma 1 gives the maximum of the sum of the products as \(\sum \nolimits _{i = 1}^m {a_i^{(1)}} \,a_i^{(2)}\). Then, for the two sets of the \(m\) products \(a_1^{(1)}a_1^{(2)},...,a_m^{(1)}a_m^{(2)}\) and the \(m\) numbers \(a_1^{(3)*},...,a_m^{(3)*}\), the maximum of \(\sum \nolimits _{i = 1}^m {a_i^{(1)}} \,a_i^{(2)}a_i^{(3)*}\) over the \(m!\) permutations in the third set is similarly obtained as \(\sum \nolimits _{i = 1}^m {a_i^{(1)}} \,a_i^{(2)}a_i^{(3)}\). Since, as seen in Lemma 1, any permutation of the maximized arrangement, including permutations in the first two sets, decreases the product sum or leaves it unchanged, \(\sum \nolimits _{i = 1}^m {a_i^{(1)}} \,a_i^{(2)}a_i^{(3)}\) is the global maximum. The cases with \(n \ge 4\) are handled similarly. \(\sqcap \)\(\sqcup \)
Lemma 2 : The same ranges of the traces irrespective of the signs and order of the diagonal elements (Kristof, 1970, Lemma 2). Let \({\bf {\Gamma }}_1^*\) and \({\bf {\Gamma }}_2^*\) be \(m \times m\) diagonal matrices; and \({\bf {\Gamma }}_1\) and \({\bf {\Gamma }}_2\) be those with the corresponding diagonal elements replaced by their absolute values arranged in the weakly descending (non-increasing) order, respectively. Suppose that \({\bf {X}}_1\) and \({\bf {X}}_2\) independently vary over all the \(m \times m\) orthonormal matrices. Then, \({\rm {tr}}({{\bf {X}}_1}{\bf {\Gamma }}_1^*{{\bf {X}}_2}{\bf {\Gamma }}_2^*)\) has the same range as that of \({\rm {tr}}({{\bf {X}}_1}{{\bf {\Gamma }}_1}{{\bf {X}}_2}{{\bf {\Gamma }}_2})\).
Proof. Kristof’s derivation is didactically repeated. Note that \({\bf {\Gamma }}_i\) is obtained by \({{\bf {\Gamma }}_i} = {{\bf {P}}_i}{{\bf {S}}_i}{\bf {\Gamma }}_i^*{\bf {P}}_i^{\rm {T}}\), where \({\bf {S}}_i\) is the signed identity matrix whose pre-multiplication replaces the diagonal elements of \({\bf {\Gamma }}_i^*\) by their absolute values, and \({\bf {P}}_i\) is the permutation matrix giving the weakly descending order mentioned earlier \((i = 1,2)\). Noting that \({\bf {\Gamma }}_i^* = {{\bf {S}}_i}{\bf {P}}_i^{\rm {T}}{{\bf {\Gamma }}_i}{{\bf {P}}_i}\), we have \[\begin {array}{l} {\rm {tr}}({{\bf {X}}_1}{\bf {\Gamma }}_1^*{{\bf {X}}_2}{\bf {\Gamma }}_2^*) = {\rm {tr}}\{ {{\bf {X}}_1}({{\bf {S}}_1}{\bf {P}}_1^{\rm {T}}{{\bf {\Gamma }}_1}{{\bf {P}}_1}){{\bf {X}}_2}({{\bf {S}}_2}{\bf {P}}_2^{\rm {T}}{{\bf {\Gamma }}_2}{{\bf {P}}_2})\} \\ \qquad \qquad \qquad \qquad \quad \;\, = {\rm {tr}}\{ ({{\bf {P}}_2}{{\bf {X}}_1}{{\bf {S}}_1}{\bf {P}}_1^{\rm {T}}){{\bf {\Gamma }}_1}({{\bf {P}}_1}{{\bf {X}}_2}{{\bf {S}}_2}{\bf {P}}_2^{\rm {T}}){{\bf {\Gamma }}_2}\} . \end {array}\] In the last result, \({{\bf {P}}_2}{{\bf {X}}_1}{{\bf {S}}_1}{\bf {P}}_1^{\rm {T}}\) and \({{\bf {P}}_1}{{\bf {X}}_2}{{\bf {S}}_2}{\bf {P}}_2^{\rm {T}}\) are products of orthonormal matrices and consequently orthonormal, with the same variations as those of \({\bf {X}}_1\) and \({\bf {X}}_2\), which shows the required equality of the ranges of \({\rm {tr}}({{\bf {X}}_1}{\bf {\Gamma }}_1^*{{\bf {X}}_2}{\bf {\Gamma }}_2^*)\) and \({\rm {tr}}({{\bf {X}}_1}{{\bf {\Gamma }}_1}{{\bf {X}}_2}{{\bf {\Gamma }}_2})\). \(\sqcap \)\(\sqcup \)
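The trace identity used in this proof can be checked numerically. The following sketch (assuming numpy; the diagonal elements and orthonormal matrices are randomly generated) verifies the identity for one draw.

```python
# Numerical check of tr(X1 G1* X2 G2*) = tr((P2 X1 S1 P1') G1 (P1 X2 S2 P2') G2).
import numpy as np
rng = np.random.default_rng(0)

def rand_orth(m):
    # random m x m orthonormal matrix via the QR decomposition of a Gaussian matrix
    q, _ = np.linalg.qr(rng.standard_normal((m, m)))
    return q

def decompose(g_star):
    # returns S, P and Gamma with Gamma = P S diag(g_star) P' (abs values, descending)
    s = np.diag(np.sign(g_star))
    order = np.argsort(-np.abs(g_star))
    p = np.zeros((len(g_star), len(g_star)))
    p[np.arange(len(g_star)), order] = 1.0
    return s, p, p @ s @ np.diag(g_star) @ p.T

m = 4
g1s, g2s = rng.standard_normal(m), rng.standard_normal(m)   # signed, unordered diagonals
S1, P1, G1 = decompose(g1s)
S2, P2, G2 = decompose(g2s)
X1, X2 = rand_orth(m), rand_orth(m)

lhs = np.trace(X1 @ np.diag(g1s) @ X2 @ np.diag(g2s))
rhs = np.trace((P2 @ X1 @ S1 @ P1.T) @ G1 @ (P1 @ X2 @ S2 @ P2.T) @ G2)
print(np.isclose(lhs, rhs))                                 # True
```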
The following lemma gives a derivation of Kristof’s theorem shown later when \(n = 1\), which will also be used in the cases when \(n = 2, 3,...\) in the derivation of the theorem by induction.
Lemma 3 : A symmetric condition for a maximized trace (Kristof, 1970, (iv) of the proof of Theorem (first version)). Let \(\bf {G}\) be a square matrix of full rank whose singular value decomposition (SVD) is \({\bf {G}} = {\bf {U\Lambda }}{{\bf {V}}^{\rm {T}}}\), where \(\bf {U}\) and \(\bf {V}\) are orthonormal, and \(\bf {\Lambda }\) is a diagonal matrix with positive diagonal elements in a prescribed order. When \({\rm {tr}}({\bf {G}})\) is maximized over all \(\bf {G}\) with the given \(\bf {\Lambda }\), \(\bf {G}\) becomes symmetric.
Proof. Since \({\rm {tr}}({\bf {G}}) = {\rm {tr}}({\bf {U\Lambda }}{{\bf {V}}^{\rm {T}}}) = {\rm {tr}}({{\bf {V}}^{\rm {T}}}{\bf {U\Lambda }})\) with \({{\bf {V}}^{\rm {T}}}{\bf {U}}\) being orthonormal, \({\rm {tr}}({\bf {G}})\) is maximized when \({{\bf {V}}^{\rm {T}}}{\bf {U}}\) is an identity matrix, which indicates that \({\bf {U}} = {\bf {V}}\) and consequently \({\bf {G}} = {\bf {U\Lambda }}{{\bf {U}}^{\rm {T}}}\) is symmetric. \(\sqcap \)\(\sqcup \)
In the above proof, the identity matrix maximizing the trace is didactically obtained in two ways as follows. (i) Note that \({\rm {tr}}({{\bf {V}}^{\rm {T}}}{\bf {U\Lambda }}) = \sum \nolimits _{i = 1}^m {{\bf {v}}_i^{\rm {T}}{{\bf {u}}_i}} {\lambda _i}\), where \({\bf {v}}_i\) and \({\bf {u}}_i\) are the \(i\)-th columns of \(\bf {V}\) and \(\bf {U}\), respectively with \({\bf {v}}_i^{\rm {T}}{{\bf {v}}_i} = {\bf {u}}_i^{\rm {T}}{{\bf {u}}_i} = 1\,\,(i = 1,...,m)\) due to the orthonormality of \(\bf {V}\) and \(\bf {U}\); and \({\lambda _i} > 0\,\,(i = 1,...,m)\) by assumption. Using \({\bf {v}}_i^{\rm {T}}{{\bf {u}}_i} = {\bf {v}}_i^{\rm {T}}{{\bf {u}}_i}/\{ {({\bf {v}}_i^{\rm {T}}{{\bf {v}}_i})^{1/2}}{({\bf {u}}_i^{\rm {T}}{{\bf {u}}_i})^{1/2}}\} \le 1\) \((i = 1,...,m)\) by the Cauchy-Schwarz (C-S) inequality, we have \({\rm {tr}}({{\bf {V}}^{\rm {T}}}{\bf {U\Lambda }}) = \sum \nolimits _{i = 1}^m {{\bf {v}}_i^{\rm {T}}{{\bf {u}}_i}} {\lambda _i}\) \( \le \sum \nolimits _{i = 1}^m {{\lambda _i}} \). Since the maximum in the C-S inequality is obtained when \({{\bf {v}}_i} = {{\bf {u}}_i}\,(i = 1,...,m)\), we have \({{\bf {V}}^{\rm {T}}}{\bf {U}} = {{\bf {V}}^{\rm {T}}}{\bf {V}} = {{\bf {U}}^{\rm {T}}}{\bf {U}} = {{\bf {I}}_m}\) (the \(m \times m\) identity matrix) for maximizing \({\rm {tr}}({{\bf {V}}^{\rm {T}}}{\bf {U\Lambda }})\).
(ii) The second derivation is given without using the C-S inequality. Since \({\bf {W}} = \{ {w_{ij}}\} \equiv {{\bf {V}}^{\rm {T}}}{\bf {U}}\) is orthonormal as seen from \({({{\bf {V}}^{\rm {T}}}{\bf {U}})^{\rm {T}}}({{\bf {V}}^{\rm {T}}}{\bf {U}}) = {{\bf {U}}^{\rm {T}}}{\bf {V}}{{\bf {V}}^{\rm {T}}}{\bf {U}} = {{\bf {U}}^{\rm {T}}}{\bf {U}} = {{\bf {I}}_m}\), we have \(|{w_{ij}}| \le 1\,\,(i,j = 1,...,m)\). Consequently, \({\rm {tr}}({{\bf {V}}^{\rm {T}}}{\bf {U\Lambda }}) = \sum \nolimits _{i = 1}^m {{w_{ii}}{\lambda _i}} \le \sum \nolimits _{i = 1}^m {{\lambda _i}} = {\rm {tr}}({\bf {\Lambda }})\), where the maximum is attained when \({w_{ii}} = 1\,\,(i = 1,...,m)\), which is the case of \({\bf {W}} = {{\bf {V}}^{\rm {T}}}{\bf {U}} = {{\bf {I}}_m}\).
(iii) It is of interest to find that (ii) gives the C-S inequality in (i) via Kristof’s theorem, which will be addressed later as an application of Kristof’s theorem.
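The bound used in (i) and (ii) may be illustrated numerically. The following sketch (assuming numpy; random orthonormal matrices are generated via the QR decomposition) checks that \({\rm {tr}}({{\bf {V}}^{\rm {T}}}{\bf {U\Lambda }}) \le {\rm {tr}}({\bf {\Lambda }})\) and that the bound is attained when \({\bf {U}} = {\bf {V}}\).

```python
# Numerical check of tr(V'U Λ) <= tr(Λ) and its attainment at U = V (random data).
import numpy as np
rng = np.random.default_rng(1)

def rand_orth(m):
    # random m x m orthonormal matrix from the QR decomposition of a Gaussian matrix
    q, _ = np.linalg.qr(rng.standard_normal((m, m)))
    return q

m = 5
lam = np.diag(np.sort(rng.uniform(0.5, 3.0, m))[::-1])          # positive, descending
vals = [np.trace(rand_orth(m).T @ rand_orth(m) @ lam) for _ in range(1000)]
print(max(vals) <= np.trace(lam))                               # True
U = rand_orth(m)
print(np.isclose(np.trace(U.T @ U @ lam), np.trace(lam)))       # True: U = V attains it
```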
Lemma 4 Symmetric \(\bf {AD}\) and \(\bf {DA}\) with diagonal \(\bf {D}\) make \(\bf {A}\) diagonal (Kristof, 1970, (iv) of the proof of Theorem (first version)). Let \({\bf {A}} = \{ {a_{ij}}\} \) and \(\bf {D}\) be \(m \times m\) matrices with \(\bf {D}\) being diagonal. Suppose that the diagonal elements \({d_i}(i = 1,...,m)\) of \(\bf {D}\) are nonzero and \(|{d_i}|\)’s are mutually different. Suppose further that \(\bf {AD}\) and \(\bf {DA}\) are both symmetric. Then, \(\bf {A}\) is diagonal.
Proof. By assumption, \({\bf {AD}} = {({\bf {AD}})^{\rm {T}}} = {\bf {D}}{{\bf {A}}^{\rm {T}}}\) and \({\bf {DA}} = {({\bf {DA}})^{\rm {T}}} = {{\bf {A}}^{\rm {T}}}{\bf {D}}\). From the last equation, we have \({\bf {DA}}{{\bf {D}}^{ - 1}} = {{\bf {A}}^{\rm {T}}}\). Using \({{\bf {A}}^{\rm {T}}} = {\bf {DA}}{{\bf {D}}^{ - 1}}\) on the right-hand side of the first equation \({\bf {AD}} = {\bf {D}}{{\bf {A}}^{\rm {T}}}\) and post-multiplying both sides of the equation by \({\bf {D}}^{ - 1}\), we obtain \({\bf {A}} = {{\bf {D}}^2}{\bf {A}}{{\bf {D}}^{ - 2}}\), which indicates that \[{a_{ij}} = {a_{ij}}d_i^2/d_j^2 \,\,\,(i,j = 1,...,m).\] Since \(d_i^2/d_j^2 \ne 1\) when \(i \ne j\) by assumption, \({a_{ij}} = 0\,\,\,(i \ne j)\) follows. \(\sqcap \)\(\sqcup \)
When \(\bf {A}\) is diagonal, we have \({\bf {AD}} = {\bf {DA}}\), where \(\bf {A}\) and \(\bf {D}\) are said to commute. Using this formulation, a lemma equivalent to Lemma 4 was stated by Kiers and Ten Berge (1989, Lemma 1).
Lemma 5 : Two products of orthonormal and diagonal matrices (a special case of Kristof (1970, (iv) of the proof of Theorem (first version))). Let \({{\bf {X}}_i},\,\,{{\bf {D}}_i}\) and \({\bf {\Delta }}_i\) be \(m \times m\) matrices, where \({\bf {X}}_i\) is orthonormal while \({\bf {D}}_i\) and \({\bf {\Delta }}_i\) are diagonal and of full rank \((i = 1,2)\). Suppose that \[{{\bf {X}}_1}{{\bf {D}}_1}{{\bf {X}}_2} = {{\bf {\Delta }}_1} \ \ \rm {and} \ \ {{\bf {X}}_2}{{\bf {D}}_2}{{\bf {X}}_1} = {{\bf {\Delta }}_2}.\] Then, \({\bf {X}}_1\) and \({\bf {X}}_2\) are \(m \times m\) signed and/or permuted identity matrices whose \(m\) nonzero elements are \(\pm 1\), where permutation indicates a row- or column-wise one. When there is no permutation, \({\bf {X}}_1\) and \({\bf {X}}_2\) are diagonal matrices with their diagonal elements being \(\pm 1\), and \({{\bf {\Delta }}_1}{{\bf {D}}_2} = {{\bf {\Delta }}_2}{{\bf {D}}_1}\).
Proof. Left-multiplying both sides of \({{\bf {X}}_2}{{\bf {D}}_2}{{\bf {X}}_1} = {{\bf {\Delta }}_2}\) by \({{\bf {X}}_1}{{\bf {D}}_1}\) and using \({{\bf {X}}_1}{{\bf {D}}_1}{{\bf {X}}_2} = {{\bf {\Delta }}_1}\), we have \({{\bf {\Delta }}_1}{{\bf {D}}_2}{{\bf {X}}_1} = {{\bf {X}}_1}{{\bf {D}}_1}{{\bf {\Delta }}_2}\), giving \({{\bf {\Delta }}_1}{{\bf {D}}_2} = {{\bf {X}}_1}{{\bf {D}}_1}{{\bf {\Delta }}_2}{\bf {X}}_1^{\rm {T}}\). The last result shows the spectral decomposition of the diagonal matrix \({{\bf {\Delta }}_1}{{\bf {D}}_2}\), which indicates that the orthonormal matrix \({\bf {X}}_1\) is a signed and/or permuted identity matrix and, when there is no permutation, \({{\bf {\Delta }}_1}{{\bf {D}}_2} = {{\bf {\Delta }}_2}{{\bf {D}}_1}\). For \({\bf {X}}_2\), exchanging the subscripts “1” and “2” due to symmetry, we obtain \({{\bf {\Delta }}_2}{{\bf {D}}_1} = {{\bf {X}}_2}{{\bf {D}}_2}{{\bf {\Delta }}_1}{\bf {X}}_2^{\rm {T}}\), indicating the same results as for \({\bf {X}}_1\). \(\sqcap \)\(\sqcup \)
Theorem 1 : Two-fold or bilinear case (\(n\) = 2) (von Neumann (1937, Theorem 1); Kristof (1970, Theorem (first version))). Let \({\bf {\Gamma }}_1^*\) and \({\bf {\Gamma }}_2^*\) be fixed diagonal matrices of full rank with the absolute values of the diagonal elements being mutually different in each matrix. Consider the maximum and minimum of \({\rm {tr}}({{\bf {X}}_1}{\bf {\Gamma }}_1^*{{\bf {X}}_2}{\bf {\Gamma }}_2^*)\) which are attained, where \({\bf {X}}_1\) and \({\bf {X}}_2\) independently vary over all \(m \times m\) orthonormal matrices. Then, \[ - {\rm {tr}}({{\bf {\Gamma }}_1}{{\bf {\Gamma }}_2}) \le {\rm {tr}}({{\bf {X}}_1}{\bf {\Gamma }}_1^*{{\bf {X}}_2}{\bf {\Gamma }}_2^*) \le {\rm {tr}}({{\bf {\Gamma }}_1}{{\bf {\Gamma }}_2}),\] where \({{\bf {\Gamma }}_i} = {\rm {diag}}({\gamma _{i1}},...,{\gamma _{im}})\) is given by \({\bf {\Gamma }}_i^*\) with their diagonal elements replaced by the corresponding absolute values with possible permutation to have the descending order i.e., \({\gamma _{i1}} > \cdots > {\gamma _{im}} > 0\) \((i = 1,2)\).
Proof. By Lemma 2, since the range of \({\rm {tr}}({{\bf {X}}_1}{{\bf {\Gamma }}_1}{{\bf {X}}_2}{{\bf {\Gamma }}_2})( = {\rm {tr}}({{\bf {\Gamma }}_2}{{\bf {X}}_1}{{\bf {\Gamma }}_1}{{\bf {X}}_2}))\) is the same as that of \({\rm {tr}}({{\bf {X}}_1}{\bf {\Gamma }}_1^*{{\bf {X}}_2}{\bf {\Gamma }}_2^*)\), we consider the former range. Due to Lemma 3, when the former trace is maximized, \({{\bf {X}}_1}{{\bf {\Gamma }}_1}{{\bf {X}}_2}{{\bf {\Gamma }}_2}\) and similarly \({{\bf {\Gamma }}_2}{{\bf {X}}_1}{{\bf {\Gamma }}_1}{{\bf {X}}_2}\) become symmetric. Then, using Lemma 4 we find that \({{\bf {X}}_1}{{\bf {\Gamma }}_1}{{\bf {X}}_2}\) is diagonal. By symmetry, with \({\rm {tr}}({{\bf {X}}_1}{{\bf {\Gamma }}_1}{{\bf {X}}_2}{{\bf {\Gamma }}_2})\) \( = {\rm {tr}}({{\bf {X}}_2}{{\bf {\Gamma }}_2}{{\bf {X}}_1}{{\bf {\Gamma }}_1})\), \({{\bf {X}}_2}{{\bf {\Gamma }}_2}{{\bf {X}}_1}\) is also found to be diagonal. By Lemma 5, these two diagonal conditions give the maximum \({\rm {tr}}({{\bf {\Gamma }}_1}{{\bf {\Gamma }}_2})\) of \({\rm {tr}}({{\bf {X}}_1}{{\bf {\Gamma }}_1}{{\bf {X}}_2}{{\bf {\Gamma }}_2})\) when \({\bf {X}}_1\) and \({\bf {X}}_2\) are the same signed identity matrices. That \({\rm {tr}}({{\bf {\Gamma }}_1}{{\bf {\Gamma }}_2})\) is the global maximum among the permuted diagonal elements of \({\bf {\Gamma }}_1\) and \({\bf {\Gamma }}_2\) is shown by Lemma 1. The minimum is given by replacing, e.g., \({\bf {X}}_1\) by \( - {{\bf {X}}_1}\). \(\sqcap \)\(\sqcup \)
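A Monte Carlo illustration of Theorem 1 may be helpful. The following sketch (assuming numpy; the signed and unordered diagonal elements are randomly drawn) confirms that random orthonormal \({\bf {X}}_1\) and \({\bf {X}}_2\) keep the trace within \(\pm {\rm {tr}}({{\bf {\Gamma }}_1}{{\bf {\Gamma }}_2})\).

```python
# Monte Carlo check of von Neumann's trace inequality (Theorem 1) with random data.
import numpy as np
rng = np.random.default_rng(2)

def rand_orth(m):
    q, _ = np.linalg.qr(rng.standard_normal((m, m)))
    return q

m = 4
g1_star, g2_star = rng.standard_normal(m), rng.standard_normal(m)  # signed, unordered
# tr(Γ1 Γ2): products of the absolute values sorted in descending order
bound = np.sum(np.sort(np.abs(g1_star))[::-1] * np.sort(np.abs(g2_star))[::-1])
traces = [np.trace(rand_orth(m) @ np.diag(g1_star) @ rand_orth(m) @ np.diag(g2_star))
          for _ in range(2000)]
print(-bound <= min(traces) and max(traces) <= bound)               # True
```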
Remark 1 The second simple proof of Theorem 1 (Mirsky, 1975), using an associated property of doubly stochastic matrices obtained by Fan (1951, Lemma 1A), will be shown together with Fan’s lemma in the appendix. Note that a doubly stochastic matrix is a square matrix with nonnegative elements whose row sums and column sums are all unity. An example is the matrix consisting of the squared elements of an orthonormal matrix. Mirsky’s proof has been known in the mathematical community as a short and simple derivation of von Neumann’s trace inequality.
Remark 1A In his tutorial, Waller (2018, Equation (21)) explained the result of Theorem 1 using the symmetric condition as given in Lemma 3 with the SVD \({\rm {tr}}({{\bf {X}}_1}{{\bf {\Gamma }}_1}{{\bf {X}}_2}{{\bf {\Gamma }}_2}) = {\rm {tr}}({\bf {P\Delta }}{{\bf {Q}}^{\rm {T}}}) = {\rm {tr}}({\bf {P}}{{\bf {Q}}^{\rm {T}}}{\bf {\Delta }})\), whose optima are attained when \({\bf {P}} = {\bf {Q}}\) or \({\bf {P}} = - {\bf {Q}}\) as \( - {\rm {tr}}({\bf {\Delta }}) \le {\rm {tr}}({{\bf {X}}_1}{{\bf {\Gamma }}_1}{{\bf {X}}_2}{{\bf {\Gamma }}_2}) \le {\rm {tr}}({\bf {\Delta }})\). This result is correct. Then, Waller (2018, Equation (22)) gave inequalities \( - {\rm {tr}}({{\bf {\Gamma }}_1}{{\bf {\Gamma }}_2}) \le {\rm {tr}}({\bf {\Delta }}) \le {\rm {tr}}({{\bf {\Gamma }}_1}{{\bf {\Gamma }}_2})\) using our notation followed by the statement “the bounds by Kristof’s theorem can be achieved”. These are also correct. However, the most important result in Theorem 1 is \({\rm {tr}}({\bf {\Delta }}) = {\rm {tr}}({{\bf {\Gamma }}_1}{{\bf {\Gamma }}_2})\), whose proof has been shown by using Lemma 5 as well as the second one in the appendix.
In this section, an independent lemma in linear algebra is provided, which is an extension corresponding to the result in the proof of Kristof (1970). The tri-linear case is given as Theorem 2 for didactic purposes, followed by a short proof of Kristof’s general theorem using several lemmas.
Lemma 6 : Two products of square, diagonal and orthonormal matrices (an extension of Kristof (1970, (iv) of the proof of Theorem (first version))). Let \({\bf {A}},{\bf {X}},\,\,{{\bf {D}}_i}\) and \({\bf {\Delta }}_i\) be \(m \times m\) matrices of full rank, where \(\bf {X}\) is orthonormal while \({\bf {D}}_i\) and \({\bf {\Delta }}_i\) are diagonal \((i = 1,2)\). Suppose that \[{\bf {A}}{{\bf {D}}_1}{\bf {X}} = {{\bf {\Delta }}_1} \ \ \rm {and} \ \ {\bf {X}}{{\bf {D}}_2}{\bf {A}} = {{\bf {\Delta }}_2}.\] Then, \(\bf {X}\) is an \(m \times m\) signed and/or permuted identity matrix with \(m\) nonzero elements being \(\pm 1\), where permutation indicates a row- or column-wise one. When there is no permutation, \(\bf {X}\) is diagonal with its diagonal elements being \(\pm 1\), and \({{\bf {\Delta }}_1}{{\bf {D}}_2} = {{\bf {\Delta }}_2}{{\bf {D}}_1}\).
Proof. Right-multiplying both sides of \({\bf {X}}{{\bf {D}}_2}{\bf {A}} = {{\bf {\Delta }}_2}\) by \({{\bf {D}}_1}{\bf {X}}\) and using \({\bf {A}}{{\bf {D}}_1}{\bf {X}} = {{\bf {\Delta }}_1}\), we have \({\bf {X}}{{\bf {D}}_2}{{\bf {\Delta }}_1} = {{\bf {\Delta }}_2}{{\bf {D}}_1}{\bf {X}}\), giving \({{\bf {\Delta }}_2}{{\bf {D}}_1} = {\bf {X}}{{\bf {\Delta }}_1}{{\bf {D}}_2}{{\bf {X}}^{\rm {T}}}\). The last result shows the spectral decomposition of the diagonal matrix \({{\bf {\Delta }}_2}{{\bf {D}}_1}\), which indicates that the orthonormal matrix \(\bf {X}\) is a signed and/or permuted identity matrix and, when there is no permutation, \({{\bf {\Delta }}_1}{{\bf {D}}_2} = {{\bf {\Delta }}_2}{{\bf {D}}_1}\). \(\sqcap \)\(\sqcup \)
Remark 2 Lemma 5 is seen as a special case of Lemma 6 when \({\bf {A}} = {{\bf {X}}_1}\) is an orthonormal matrix and \(\bf {X}\) is denoted by \({\bf {X}}_2\). However, in Lemma 5, both \({\bf {X}}_1\) and \({\bf {X}}_2\) were found to be signed and/or permuted identity matrices. Note also that Kristof (1970) dealt with the case when \({\bf {A}} = {{\bf {G}}_1}{{\bf {\Gamma }}_1}{{\bf {G}}_2}{{\bf {\Gamma }}_2} \cdots {{\bf {G}}_{n - 1}}{{\bf {\Gamma }}_{n - 1}}{{\bf {G}}_n}\), \({{\bf {D}}_1} = {{\bf {\Gamma }}_n}\), \({{\bf {D}}_2} = {{\bf {\Gamma }}_{n + 1}}\) and \({\bf {X}} = {{\bf {G}}_{n + 1}}\), where \({\bf {\Gamma }}_i\) and \({\bf {G}}_i\) are diagonal and orthonormal matrices, respectively. This specification was necessary for his derivation by induction, though the involved expression \({{\bf {G}}_1}{{\bf {\Gamma }}_1}{{\bf {G}}_2}{{\bf {\Gamma }}_2} \cdots {{\bf {G}}_{n - 1}}{{\bf {\Gamma }}_{n - 1}}{{\bf {G}}_n}\) may hide the basic structure in Lemma 6.
Theorem 2 : Three-fold or trilinear case (\(n\) = 3) (Kristof, 1970, Theorem (first version)). Let \({\bf {\Gamma }}_i^*\,(i = 1,2,3)\) be fixed diagonal matrices of full rank with the absolute values of the diagonal elements being mutually different in each matrix. Consider the maximum and minimum of \({\rm {tr}}({{\bf {X}}_1}{\bf {\Gamma }}_1^*{{\bf {X}}_2}{\bf {\Gamma }}_2^*{{\bf {X}}_3}{\bf {\Gamma }}_3^*)\) which are attained, where \({{\bf {X}}_i}(i = 1,2,3)\) independently vary over all \(m \times m\) orthonormal matrices. Then, \[ - {\rm {tr}}({{\bf {\Gamma }}_1}{{\bf {\Gamma }}_2}{{\bf {\Gamma }}_3}) \le {\rm {tr}}({{\bf {X}}_1}{\bf {\Gamma }}_1^*{{\bf {X}}_2}{\bf {\Gamma }}_2^*{{\bf {X}}_3}{\bf {\Gamma }}_3^*) \le {\rm {tr}}({{\bf {\Gamma }}_1}{{\bf {\Gamma }}_2}{{\bf {\Gamma }}_3}),\] where \({{\bf {\Gamma }}_i} = {\rm {diag}}({\gamma _{i1}},...,{\gamma _{im}})\) is given by \({\bf {\Gamma }}_i^*\) with their diagonal elements replaced by the corresponding absolute values with possible permutation to have the descending order i.e., \({\gamma _{i1}} > \cdots > {\gamma _{im}} > 0\) \((i = 1,2,3)\).
Proof. As in the proof for Theorem 1, using Lemma 2 in a similar manner with \({{\bf {\Gamma }}_i} = {{\bf {P}}_i}{{\bf {S}}_i}{\bf {\Gamma }}_i^*{\bf {P}}_i^{\rm {T}}\) and unconstrained orthonormal \({{\bf {X}}_i}(i = 1,\,\,2,\,\,3)\), the range of \({\rm {tr}}({{\bf {X}}_1}{{\bf {\Gamma }}_1}{{\bf {X}}_2}{{\bf {\Gamma }}_2}{{\bf {X}}_3}{{\bf {\Gamma }}_3})\) \(( = {\rm {tr}}({{\bf {\Gamma }}_3}{{\bf {X}}_1}{{\bf {\Gamma }}_1}{{\bf {X}}_2}{{\bf {\Gamma }}_2}{{\bf {X}}_3}))\) is found to be the same as that of \({\rm {tr}}({{\bf {X}}_1}{\bf {\Gamma }}_1^*{{\bf {X}}_2}{\bf {\Gamma }}_2^*{{\bf {X}}_3}{\bf {\Gamma }}_3^*)\). Due to Lemma 3, when the former trace is maximized, \({{\bf {X}}_1}{{\bf {\Gamma }}_1}{{\bf {X}}_2}{{\bf {\Gamma }}_2}{{\bf {X}}_3}{{\bf {\Gamma }}_3} \equiv {\bf {B}}{{\bf {\Gamma }}_3}\) and similarly \({{\bf {\Gamma }}_3}{{\bf {X}}_1}{{\bf {\Gamma }}_1}{{\bf {X}}_2}{{\bf {\Gamma }}_2}{{\bf {X}}_3} = {{\bf {\Gamma }}_3}{\bf {B}}\) become symmetric. Then, using Lemma 4 we find that \({\bf {B}} = {{\bf {X}}_1}{{\bf {\Gamma }}_1}{{\bf {X}}_2}{{\bf {\Gamma }}_2}{{\bf {X}}_3} \equiv {\bf {A}}{{\bf {\Gamma }}_2}{{\bf {X}}_3}\) is diagonal. Similarly, since \({\rm {tr}}({{\bf {X}}_1}{{\bf {\Gamma }}_1}{{\bf {X}}_2}{{\bf {\Gamma }}_2}{{\bf {X}}_3}{{\bf {\Gamma }}_3}) = {\rm {tr}}({\bf {A}}{{\bf {\Gamma }}_2}{{\bf {X}}_3}{{\bf {\Gamma }}_3}) = {\rm {tr}}({{\bf {X}}_3}{{\bf {\Gamma }}_3}{\bf {A}}{{\bf {\Gamma }}_2})\), \({{\bf {X}}_3}{{\bf {\Gamma }}_3}{\bf {A}}\) is also diagonal. From Lemma 6, these two diagonal conditions can make \({\bf {X}}_3\) an identity matrix. Then, using Theorem 1, the maximum \({\rm {tr}}({{\bf {X}}_1}{{\bf {\Gamma }}_1}{{\bf {X}}_2}{{\bf {\Gamma }}_2}{{\bf {X}}_3}{{\bf {\Gamma }}_3}) = {\rm {tr}}({{\bf {X}}_1}{{\bf {\Gamma }}_1}{{\bf {X}}_2}{{\bf {\Gamma }}_2}{{\bf {\Gamma }}_3})\) is obtained when \({\bf {X}}_1\) and \({\bf {X}}_2\) are identity matrices as \({\rm {tr}}({{\bf {\Gamma }}_1}{{\bf {\Gamma }}_2}{{\bf {\Gamma }}_3})\). The minimum \( - {\rm {tr}}({{\bf {\Gamma }}_1}{{\bf {\Gamma }}_2}{{\bf {\Gamma }}_3})\) is obtained as in Theorem 1. \(\sqcap \)\(\sqcup \)
Remark 2A In the proof of Theorem 2, the key result is \({\rm {tr}}({{\bf {X}}_1}{{\bf {\Gamma }}_1}{{\bf {X}}_2}{{\bf {\Gamma }}_2}{{\bf {X}}_3}{{\bf {\Gamma }}_3})\) \( = {\rm {tr}}({{\bf {X}}_1}{{\bf {\Gamma }}_1}{{\bf {X}}_2}{{\bf {\Gamma }}_2}{{\bf {\Gamma }}_3})\). Since \({{\bf {\Gamma }}_2}{{\bf {\Gamma }}_3} \equiv {{\bf {\Gamma }}_{2*3}}\) is diagonal, the maximum of \({\rm {tr}}({{\bf {X}}_1}{{\bf {\Gamma }}_1}{{\bf {X}}_2}{{\bf {\Gamma }}_2}{{\bf {\Gamma }}_3})\) \( = {\rm {tr}}({{\bf {X}}_1}{{\bf {\Gamma }}_1}{{\bf {X}}_2}{{\bf {\Gamma }}_{2*3}})\) is obtained by Theorem 1 for the bilinear case. This suggests a heuristic proof for the general case, which is employed in the following result.
Theorem 3 : The \(n\)-fold case (\(n\) = 2, 3,...) (Kristof, 1970, Theorem (first version)). Let \({\bf {\Gamma }}_i^*\,(i = 1,...,n)\) be fixed diagonal matrices. Consider the maximum and minimum of \({\rm {tr}}({{\bf {X}}_1}{\bf {\Gamma }}_1^* \cdots {{\bf {X}}_n}{\bf {\Gamma }}_n^*)\) which are attained, where \({{\bf {X}}_i}(i = 1,...,n)\) independently vary over all orthonormal matrices. Then, \[ - {\rm {tr}}({{\bf {\Gamma }}_1} \cdots {{\bf {\Gamma }}_n}) \le {\rm {tr}}({{\bf {X}}_1}{\bf {\Gamma }}_1^* \cdots {{\bf {X}}_n}{\bf {\Gamma }}_n^*) \le {\rm {tr}}({{\bf {\Gamma }}_1} \cdots {{\bf {\Gamma }}_n}),\] where \({{\bf {\Gamma }}_i} = {\rm {diag}}({\gamma _{i1}},...,{\gamma _{im}})\) is given by \({\bf {\Gamma }}_i^*\) with their diagonal elements replaced by the corresponding absolute values with possible permutation to have the weakly descending order i.e., \({\gamma _{i1}} \ge \cdots \ge {\gamma _{im}} \ge 0\) \((i = 1,...,n)\).
Proof. As in Kristof (1970), first suppose that \({\gamma _{i1}} > \cdots > {\gamma _{im}} > 0\) \((i = 1,...,n)\). Using the result of Theorem 2, increase \(n\) one by one as \(n\) = 4, 5,... When \(n\) = 4, redefine \({\bf {B}} \equiv {{\bf {X}}_1}{{\bf {\Gamma }}_1} \cdots {{\bf {X}}_3}{{\bf {\Gamma }}_3}{{\bf {X}}_4} \equiv {\bf {A}}{{\bf {\Gamma }}_3}{{\bf {X}}_4}\). Then, from Lemmas 3 and 4, \({\bf {B}} = {\bf {A}}{{\bf {\Gamma }}_3}{{\bf {X}}_4}\) becomes diagonal at the maximum. Similarly, \({{\bf {X}}_4}{{\bf {\Gamma }}_4}{\bf {A}}\) is also diagonal. Then, as before, \({\bf {X}}_4\) can become an identity matrix. Using Theorem 2, the maximum of \({\rm {tr}}({{\bf {X}}_1}{{\bf {\Gamma }}_1} \cdots {{\bf {X}}_3}{{\bf {\Gamma }}_3}{{\bf {X}}_4}{{\bf {\Gamma }}_4}) = {\rm {tr}}({{\bf {X}}_1}{{\bf {\Gamma }}_1} \cdots {{\bf {X}}_3}{{\bf {\Gamma }}_3}{{\bf {\Gamma }}_4})\) is given by \({\rm {tr}}({{\bf {\Gamma }}_1} \cdots {{\bf {\Gamma }}_4})\). The minimum is similarly obtained as \( - {\rm {tr}}({{\bf {\Gamma }}_1} \cdots {{\bf {\Gamma }}_4})\). Increasing \(n\) successively one by one, we obtain the required results.
Further, consider the weakly ordered case \({\gamma _{i1}} \ge \cdots \ge {\gamma _{im}} \ge 0\) \((i = 1,...,n)\). As in Kristof (1970, (v) of the proof of Theorem (first version) based on the suggestion by Bary G. Wingersky), let \({\bf {W}} = \,{\rm {diag}}({w_1},...,{w_m})\) with \({w_1} > \cdots > {w_m} > 0\). Redefine \({\bf {\Gamma }}_i\) as \({{\bf {\Gamma }}_i} + \varepsilon {\bf {W}}(i = 1,...,n)\) with \(\varepsilon \, > 0\). Then, we have the same above result since \({\gamma _{i1}} > \cdots > {\gamma _{im}} > 0\). When \(\varepsilon \) approaches zero with fixed \({{\bf {X}}_i}(i = 1,...n)\), the same required result is given by substituting \(\varepsilon = 0\) for \({{\bf {\Gamma }}_i} + \varepsilon {\bf {W}}\) under the limiting condition of \({\gamma _{i1}} \ge \cdots \ge {\gamma _{im}} \ge 0\) \((i = 1,...,n)\). \(\sqcap \)\(\sqcup \)
Remark 3 The heuristic derivation of Theorem 3 is essentially equivalent to that by induction, where the latter was employed by Kristof (1970). The method of successively finding maxima was shown for didactic purposes as well as a direct derivation in Theorem 2 when \(n\) = 2. Since Theorems 1 and 2 are special cases of Theorem 3, the former results also hold under \({\gamma _{i1}} \ge \cdots \ge {\gamma _{im}} \ge 0\). Though when \({\gamma _{i1}} \ge \cdots \ge {\gamma _{im}} = 0\), which occurs in the limiting case of \(\varepsilon = 0\), \({\bf {\Gamma }}_i\) becomes singular while \({{\bf {\Gamma }}_i} + \varepsilon {\bf {W}}\) with \(\varepsilon > 0\) is non-singular, this rank difference does not affect the maximum attained.
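The \(n\)-fold bound can be illustrated in the same manner. The following sketch (assuming numpy; \(n\) = 3 with positive descending diagonal elements chosen arbitrarily) is a Monte Carlo check of Theorem 3.

```python
# Monte Carlo check of the n-fold bound of Theorem 3 for n = 3 with random data.
import numpy as np
rng = np.random.default_rng(3)

def rand_orth(m):
    q, _ = np.linalg.qr(rng.standard_normal((m, m)))
    return q

m, n = 4, 3
gammas = [np.diag(np.sort(rng.uniform(0.1, 2.0, m))[::-1]) for _ in range(n)]
bound = np.trace(gammas[0] @ gammas[1] @ gammas[2])             # tr(Γ1 Γ2 Γ3)

def trace_product():
    prod = np.eye(m)
    for gam in gammas:                                          # X1 Γ1 X2 Γ2 X3 Γ3
        prod = prod @ rand_orth(m) @ gam
    return np.trace(prod)

traces = [trace_product() for _ in range(2000)]
print(-bound <= min(traces) and max(traces) <= bound)           # True
```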
Ten Berge (1983) gave a generalized version of Kristof’s theorem when \({\bf {X}}_i\)’s with \({\rm {rank}}({{\bf {X}}_i}) \equiv r_i^* \le {r_i}\) are \({m_{\,i - 1}} \times {m_{\,i}}\) possibly non-square suborthonormal matrices, i.e., submatrices of orthonormal ones \((i = 1,...,n;\,\,{m_0} \equiv {m_n})\). Let a semiorthonormal matrix be a non-square submatrix of its parent orthonormal one with the same number of rows or columns (not both) as that of the parent. Note that if \({\bf {X}}_i\) is suborthonormal rather than semi- or fully orthonormal, \(r_i^*\) can be 0. For this result, we consider the restricted parent orthonormal matrix taking a block-diagonal form \({\bf {X}}_i^* = \left ( \begin {array}{l} {{\bf {X}}_{i11}}\,\,\,{{\bf {X}}_{i12}}\\ {{\bf {X}}_{i21}}\,\,\,{{\bf {X}}_{i22}} \end {array} \right ) = \left ( \begin {array}{l} {{\bf {X}}_{i11}}\,\,\,{\bf {O}}\\ {\bf {O}}\,\,\,\,\,\,\,{{\bf {X}}_{i22}} \end {array} \right )\), where \({\bf {X}}_i^{*{\rm {T}}}{\bf {X}}_i^* = {\bf {X}}_i^*{\bf {X}}_i^{*{\rm {T}}} = {{\bf {I}}_{{m_{i - 1}} + {m_i}}}\), \({\bf {X}}_{i11}^{\rm {T}}{{\bf {X}}_{i11}} = {{\bf {X}}_{i11}}{\bf {X}}_{i11}^{\rm {T}} = {{\bf {I}}_{{m_{i - 1}}}}\) and \({\bf {X}}_{i22}^{\rm {T}}{{\bf {X}}_{i22}} = {{\bf {X}}_{i22}}{\bf {X}}_{i22}^{\rm {T}} = {{\bf {I}}_{{m_i}}}\). Then, when \({\bf {X}}_i\) is one of the two off-block-diagonal zero submatrices \({{\bf {X}}_{i12}} = {\bf {X}}_{i21}^{\rm {T}} = {\bf {O}}\), the rank of \({\bf {X}}_i\) is zero, i.e., \(r_i^* = 0\). It is assumed that \({\bf {X}}_i^*\) varies unrestrictedly under the block-diagonal form. Though Ten Berge did not fully explain the cases with \(r_i^* < {r_i}\), \(r_i\) is seen as an upper bound of \(r_i^*\), which is given by \({r_i} = \min \{ {m_{\,i - 1}},{m_{\,i}}\} \). Note that \({r_i} = \min \{ {m_{\,i - 1}},{m_{\,i}}\} \) is the smallest upper bound when \({\bf {X}}_i\) varies unrestrictedly. In other words, when \({r_i} < \min \{ {m_{\,i - 1}},{m_{\,i}}\} \), \({\bf {X}}_i\) does not vary unrestrictedly. If \({\bf {X}}_i\) is semi- or fully orthonormal, we have \(r_i^* = {r_i} = \min \{ {m_{\,i - 1}},{m_{\,i}}\} \).
Let \({\bf {\Gamma }}_i^*\) and \({\bf {\Gamma }}_i\) be \({m_{\,i}} \times {m_{\,i}}\) fixed diagonal matrices defined as in Theorem 3 with possibly different \(m_{\,i}\)’s \((i = 1,...,n)\). Let \(r = \min ({r_1},...,{r_n})\) and \(m = \max ({m_1},...,{m_n})\). Define \({\bf {\Delta }}_i\) and \({\bf {E}}_{\,r}\) as the \(m \times m\) diagonal matrices containing \({\bf {\Gamma }}_i\) and \({\bf {I}}_r\) in their upper left corners with zeros elsewhere, respectively. Then, Ten Berge gave the following result.
Theorem 4 : The generalized Kristof theorem (Ten Berge (1983, Theorem 1); see also Kiers and Ten Berge (1989); and Ten Berge (1993, Sections 3.2 and 3.3)). Under the definitions and assumptions given above, when \({\bf {X}}_i\) varies with the condition \({\rm {rank}}({{\bf {X}}_i}) \le {r_i}\) \((i = 1,...,n)\), we have \[ - {\rm {tr}}({{\bf {\Delta }}_1} \cdots {{\bf {\Delta }}_n}{{\bf {E}}_{\,r}}) \le {\rm {tr}}({{\bf {X}}_1}{\bf {\Gamma }}_1^* \cdots {{\bf {X}}_n}{\bf {\Gamma }}_n^*) \le {\rm {tr}}({{\bf {\Delta }}_1} \cdots {{\bf {\Delta }}_n}{{\bf {E}}_{\,r}}).\]
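As a numerical illustration of these bounds (a sketch assuming numpy; the sizes \(m_1\) = 3 and \(m_2\) = 5 and the positive descending diagonal elements are arbitrary choices), the following code draws suborthonormal \({\bf {X}}_1\) and \({\bf {X}}_2\) as submatrices of random orthonormal matrices and checks the inequalities for \(n\) = 2.

```python
# Monte Carlo check of Theorem 4 for n = 2 with hypothetical sizes m1 = 3, m2 = 5.
import numpy as np
rng = np.random.default_rng(4)

m1, m2 = 3, 5                     # X1 is m2 x m1 and X2 is m1 x m2 (m0 = m2 here)
m, r = max(m1, m2), min(m1, m2)   # r = min(r1, r2) with r_i = min(m_{i-1}, m_i)
g1 = np.sort(rng.uniform(0.1, 2.0, m1))[::-1]               # Γ1 (positive, descending)
g2 = np.sort(rng.uniform(0.1, 2.0, m2))[::-1]               # Γ2
delta1 = np.diag(np.concatenate([g1, np.zeros(m - m1)]))    # Δ1: Γ1 padded to m x m
delta2 = np.diag(g2)                                         # Δ2 (already m x m)
e_r = np.diag((np.arange(m) < r).astype(float))              # E_r
bound = np.trace(delta1 @ delta2 @ e_r)

def sub_orth(rows, cols):
    # submatrix of a random orthonormal matrix, hence suborthonormal
    q, _ = np.linalg.qr(rng.standard_normal((max(rows, cols), max(rows, cols))))
    return q[:rows, :cols]

traces = [np.trace(sub_orth(m2, m1) @ np.diag(g1) @ sub_orth(m1, m2) @ np.diag(g2))
          for _ in range(2000)]
print(-bound <= min(traces) and max(traces) <= bound)        # True
```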
For the proof of Theorem 4, Ten Berge (1983) defined \({\bf {Y}}_i\) as the \(m \times m\) matrix containing \({\bf {X}}_i\) in its upper left corner with zeroes elsewhere. Then, he defined its SVD as \({{\bf {Y}}_i} = {{\bf {P}}_i}{{\bf {D}}_i}{\bf {Q}}_i^{\rm {T}}\), where “\({\bf {P}}_i\) and \({\bf {Q}}_i\) are orthonormal and \({\bf {D}}_i\) is diagonal” (loc.cit., p. 521). Note that he employed the SVD using the non-negative singular values rather than the positive ones, where \({\bf {P}}_i\), \({\bf {Q}}_i\) and \({\bf {D}}_i\) are \(m \times m\) square matrices. This is seen in his inequalities (loc.cit., Equation (10)) \[ - {\rm {tr}}({{\bf {D}}_1}{{\bf {\Delta }}_1} \cdots {{\bf {D}}_n}{{\bf {\Delta }}_n}) \le {\rm {tr}}({{\bf {X}}_1}{\bf {\Gamma }}_1^* \cdots {{\bf {X}}_n}{\bf {\Gamma }}_n^*) \le {\rm {tr}}({{\bf {D}}_1}{{\bf {\Delta }}_1} \cdots {{\bf {D}}_n}{{\bf {\Delta }}_n}).\]
For this derivation, he used \({{\bf {Y}}_i} = {{\bf {P}}_i}{{\bf {D}}_i}{\bf {Q}}_i^{\rm {T}}\) and \({\bf {\Delta }}_i^*\) defined similarly to \({\bf {\Delta }}_i\) using \({\bf {\Gamma }}_i^*\) \((i = 1,...,n)\), which gives \[{\rm {tr}}({{\bf {X}}_1}{\bf {\Gamma }}_1^* \cdots {{\bf {X}}_n}{\bf {\Gamma }}_n^*) = {\rm {tr}}({{\bf {Y}}_1}{\bf {\Delta }}_1^* \cdots {{\bf {Y}}_n}{\bf {\Delta }}_n^*) = {\rm {tr}}({{\bf {P}}_1}{{\bf {D}}_1}{\bf {Q}}_1^{\rm {T}}{\bf {\Delta }}_1^* \cdots {{\bf {P}}_n}{{\bf {D}}_n}{\bf {Q}}_n^{\rm {T}}{\bf {\Delta }}_n^*)\] (loc.cit., Equation (9)). He applied Kristof’s theorem to this result supposing that all the \(2n\) \(m \times m\) matrices \({\bf {P}}_i\) and \({{\bf {Q}}_i}\,(i = 1,...,n)\) vary over all the orthonormal matrices, which is required in Kristof’s theorem. Then, he showed his Equation (10) shown earlier.
However, it is found that when \({{\bf {D}}_i} = {\rm {diag}}({d_1},...,{d_m})\) with \({d_1} \ge \cdots \ge {d_{r_i^*}} > 0\) and \({d_{r_i^* + 1}} = \cdots = {d_m} = 0\), the orthonormal matrices \({\bf {P}}_i\) and \({\bf {Q}}_i\) in \({{\bf {Y}}_i} = {{\bf {P}}_i}{{\bf {D}}_i}{\bf {Q}}_i^{\rm {T}}\) should be of the form \[{{\bf {P}}_i} = \left ( \begin {array}{l} {{\bf {P}}_{i1}}\,\,\,{\bf {O}}\\ {\bf {O}}\,\,\,\,{{\bf {P}}_{i2}} \end {array} \right ) \ \ \rm {and} \ \ {{\bf {Q}}_i} = \left ( \begin {array}{l} {{\bf {Q}}_{i1}}\,\,\,{\bf {O}}\\ {\bf {O}}\,\,\,\,{{\bf {Q}}_{i2}} \end {array} \right ),\] where \({\bf {P}}_{i1}\) and \({\bf {Q}}_{i1}\) are \(r_i^* \times r_i^*\) orthonormal submatrices while \({\bf {P}}_{i2}\) and \({\bf {Q}}_{i2}\) are \((m - r_i^*) \times (m - r_i^*)\) similar ones, which vanish when \(r_i^* = m\). This formulation is due to Ten Berge’s special definition of \({\bf {Y}}_i\), whose elements are zero except for the upper left submatrix.
Although \({\bf {P}}_i\) and \({\bf {Q}}_i\) are orthonormal rather than semi- or suborthonormal, their block-diagonal forms impose substantial restrictions on the variations of the orthonormal matrices required by Kristof’s theorem. One severe restriction is that permutations across the two blocks of variables are not available. Consequently, the upper and lower bounds given by Kristof’s theorem may not be attained when the diagonal elements of \({{\bf {D}}_i}{{\bf {\Delta }}_i}\) are not arranged in the weakly descending order. This restriction was not mentioned by Ten Berge, though he did not state that the bounds are attained in his theorem.
The necessity of \({\bf {E}}_r\) in the statement of Theorem 4 is due to the unrestricted rank condition of the diagonal matrix \({\bf {\Gamma }}_i\) in the upper left corner of \({\bf {\Delta }}_i\) employed by Ten Berge. Since the rank of \({\bf {\Delta }}_i\) may be greater than that of \({\bf {D}}_i\), the upper bound in his Equation (10) becomes \[\begin {array}{l} {\rm {tr}}({{\bf {X}}_1}{\bf {\Gamma }}_1^* \cdots {{\bf {X}}_n}{\bf {\Gamma }}_n^*) \le {\rm {tr}}({{\bf {D}}_1}{{\bf {\Delta }}_1} \cdots {{\bf {D}}_n}{{\bf {\Delta }}_n})\\ = {\rm {tr}}({{\bf {D}}_1} \cdots {{\bf {D}}_n}{{\bf {\Delta }}_1} \cdots {{\bf {\Delta }}_n}) \le {\rm {tr}}({{\bf {\Delta }}_1} \cdots {{\bf {\Delta }}_n}{{\bf {E}}_r}), \end {array}\] where the last inequality is due to the range [0, 1] of the singular values of suborthonormal matrices (loc.cit., Lemma 2) yielding \({{\bf {D}}_1} \cdots {{\bf {D}}_n} \le {{\bf {E}}_r}\) in Löwner’s (1934, p. 177) sense. Ten Berge explicitly wrote that ‘the statement that “the limits can be attained” has to be omitted’ (loc.cit., p. 521). It is to be noted that he added that “the limits ... can be attained if the \({\bf {X}}_i\) are varying independently and (except for the rank) unrestrictedly over the set of suborthonormal matrices” (loc.cit., p. 521).
The meaning of the parenthetical phrase “(except for the rank)” is not clear, since when the upper bound \(r_i\) of the rank of \({\bf {X}}_i\) (recall the condition \({\rm {rank}}({{\bf {X}}_i}) = r_i^* \le {r_i}\)) is less than \(\min \{ {m_{\,i - 1}},{m_{\,i}}\} \), \({\bf {X}}_i\) does not vary unrestrictedly but varies over a subset of the suborthonormal matrices satisfying \(r_i^* \le {r_i} < \min \{ {m_{\,i - 1}},{m_{\,i}}\} \). That is, in the subset, \({\bf {X}}_i\) cannot be semi- or fully orthonormal. In other words, in this subset the sum of the squared elements in each row or column of \({\bf {X}}_i\) is smaller than 1. Under this restriction, the optima may not be obtained. Note also that Ten Berge mentioned the typical cases with \(r_i^* = {r_i}\), stating that “this modification does not affect the validity” (loc.cit., p. 521) of his generalized theorem, though the optima may not be attained due to the difficulty of applying Kristof’s theorem using constrained parent orthonormal matrices.
In the following modification with attained optima, fully unconstrained suborthonormal matrices are considered. Let \({\bf {X}}_i\) be the \({m_{\,i - 1}} \times {m_{\,i}}\,\,(i = 1,...,n;\,\,{m_{\,0}} \equiv {m_{\,n}})\) possibly non-square matrix with \({\rm {rank}}({{\bf {X}}_i}) = r_i^* \le {r_i} = \min \{ {m_{\,i - 1}},{m_{\,i}}\} \), which is supposed to vary unrestrictedly and independently over the set of \({m_{\,i - 1}} \times {m_{\,i}}\) suborthonormal matrices in the corresponding \(m \times m\) parent orthonormal matrix with \(m = \max ({m_1},...,{m_n})\) as given earlier. The parent orthonormal matrix is denoted by \({\bf {X}}_i^*\), which includes \({\bf {X}}_i\) as a submatrix.
Let \({\bf {\Gamma }}_i^*\) and \({\bf {\Gamma }}_i\) be \({m_{\,i}} \times {m_{\,i}}\) fixed diagonal matrices defined as in Theorems 3 and 4. In the modification, however, \({\bf {\Gamma }}_i^*\) and \({\bf {\Gamma }}_i\) are assumed to be non-singular without loss of generality. This is seen from the form \({\rm {tr}}({{\bf {X}}_1}{\bf {\Gamma }}_1^*{{\bf {X}}_2}{\bf {\Gamma }}_2^* \cdots {{\bf {X}}_n}{\bf {\Gamma }}_n^*)\) to be optimized later, since when \({\bf {\Gamma }}_i^*\) is singular, \({\bf {\Gamma }}_i^*\) can be redefined by deleting the row(s) and column(s) corresponding to the zero diagonal elements of \({\bf {\Gamma }}_i^*\). Then, in a similar manner, the corresponding column(s) of \({\bf {X}}_i\) and row(s) of \({{\bf {X}}_{i + 1}}\,({{\bf {X}}_{n + 1}} \equiv {{\bf {X}}_1})\) can be deleted without changing the value of \({\rm {tr}}({{\bf {X}}_1}{\bf {\Gamma }}_1^*{{\bf {X}}_2}{\bf {\Gamma }}_2^* \cdots {{\bf {X}}_n}{\bf {\Gamma }}_n^*)\), where \(r_i^*\,(r_{i + 1}^*)\) and \({r_i}\,({r_{i + 1}})\) may be adjusted for the reduced \({{\bf {X}}_i}\,({{\bf {X}}_{i + 1}})\) when necessary.
Theorem 5 : A modified generalized Kristof theorem (a modification of Ten Berge (1983, Theorem 1)). Let \({\bf {X}}_i\), \({\bf {X}}_i^*\), \({\bf {\Gamma }}_i^*\) and \({\bf {\Gamma }}_i\) \((i = 1,...,n)\) be as defined above. Define \({\bf {\Delta }}_i\) as the \(m \times m\) diagonal matrix whose upper left submatrix is \({\bf {\Gamma }}_i\) with zeros elsewhere, as defined earlier. Then, when the parent orthonormal matrices \({\bf {X}}_i^*\) \((i = 1,...,n)\) vary independently and unrestrictedly over the set of orthonormal matrices, we have \[ - {\rm {tr}}({{\bf {\Delta }}_1} \cdots {{\bf {\Delta }}_n}) \le {\rm {tr}}({{\bf {X}}_1}{\bf {\Gamma }}_1^* \cdots {{\bf {X}}_n}{\bf {\Gamma }}_n^*) \le {\rm {tr}}({{\bf {\Delta }}_1} \cdots {{\bf {\Delta }}_n}),\] where the optima are attained.
Proof. Define \({\bf {\Delta }}_i^*\) using \({\bf {\Gamma }}_i^*\) similarly to \({\bf {\Delta }}_i\). Then, we obtain \[{\rm {tr}}({{\bf {X}}_1}{\bf {\Gamma }}_1^* \cdots {{\bf {X}}_n}{\bf {\Gamma }}_n^*) = {\rm {tr}}({\bf {X}}_1^*{\bf {\Delta }}_1^* \cdots {\bf {X}}_n^*{\bf {\Delta }}_n^*).\] Noting that the assumption of the independent and unrestricted variations of \({\bf {X}}_i^*\) \((i = 1,...,n)\) satisfies that of Kristof’s theorem, the required results with the attained optima follow. \(\sqcap \)\(\sqcup \)
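The embedding used in this proof can be illustrated numerically. The following sketch (assuming numpy, \(n\) = 2 and arbitrary sizes \(m_1\) = 3 and \(m_2\) = 5) takes \({\bf {X}}_i\) as the upper left submatrix of a parent orthonormal \({\bf {X}}_i^*\) and confirms that the two traces coincide.

```python
# Check of the embedding identity tr(X1 Γ1* X2 Γ2*) = tr(X1* Δ1* X2* Δ2*) for n = 2.
import numpy as np
rng = np.random.default_rng(5)

def rand_orth(m):
    q, _ = np.linalg.qr(rng.standard_normal((m, m)))
    return q

m1, m2 = 3, 5
m = max(m1, m2)
g1, g2 = rng.standard_normal(m1), rng.standard_normal(m2)        # Γ1*, Γ2* diagonals
d1 = np.diag(np.concatenate([g1, np.zeros(m - m1)]))             # Δ1*
d2 = np.diag(np.concatenate([g2, np.zeros(m - m2)]))             # Δ2*
X1_star, X2_star = rand_orth(m), rand_orth(m)                    # parent orthonormal
X1, X2 = X1_star[:m2, :m1], X2_star[:m1, :m2]                    # m0 = m2 for n = 2
lhs = np.trace(X1 @ np.diag(g1) @ X2 @ np.diag(g2))
rhs = np.trace(X1_star @ d1 @ X2_star @ d2)
print(np.isclose(lhs, rhs))                                      # True
```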
Remark 4 In Theorem 5, the assumption for the variations of \({\bf {X}}_i^*\) automatically gives \({\rm {rank}}({{\bf {X}}_i}) = r_i^* \le {r_i} = \min \{ {m_{\,i - 1}},{m_{\,i}}\} \). The optima are obtained when \(r_i^* = {r_i}\), which indicates that \({\bf {X}}_i\) is semi- or fully orthonormal with non-zero singular value(s) being unity when the optima are attained. This makes the matrix \({\bf {E}}_r\) used in Ten Berge’s theorem unnecessary. Recall that the optima may not be obtained in his theorem since the non-zero singular value(s) of \({\bf {X}}_i\) may be less than unity and since the SVD form of \({\bf {Y}}_i\) restricts the permutation of the diagonal elements of \({\bf {\Gamma }}_i^*\) to have \({\bf {\Gamma }}_i\). In other words, when \({\bf {\Gamma }}_i^* = {{\bf {\Gamma }}_i}\), the latter restriction vanishes. Note that the former restriction corresponds to \(r_i^* < {r_i}\). That is, under this assumption \({\bf {X}}_i\) is suborthonormal (not semi- or fully orthonormal). On the other hand, the case \(r_i^* = {r_i}\), addressed earlier with Ten Berge’s statement, indicates that \({\bf {X}}_i\) is semi- or fully orthonormal. Although generally this case does not satisfy the assumption of the unconstrained variation of the parent orthonormal matrix, which is the assumption in Kristof’s theorem, the restricted variation also gives the same optima as for the unrestricted case since the non-zero singular value(s) are unities as long as \(r_i^* = {r_i}\). For Ten Berge’s generalized theorem, Kiers and Ten Berge (1989, p. 132) stated that “\(r\) is the minimum of the ranks of \({{\bf {\Gamma }}_1},...,{{\bf {\Gamma }}_k}\) and \({{\bf {X}}_1},...,{{\bf {X}}_k}\)”, where \(k = n\). This is misleading and should be corrected as “\(r\) is the minimum of the ranks of \({{\bf {\Gamma }}_1},...,{{\bf {\Gamma }}_k}\) and \({r_1},...,{r_k}\)” since when \({\rm {rank}}({{\bf {X}}_i}) = r_i^* \le {r_i}\), \(r_i^*\) can be smaller than \({r_i} = \min \{ {m_{\,i - 1}},{m_{\,i}}\} \) in the subset of the variation of \({\bf {X}}_i\) unless the restricted case of \(r_i^* = {r_i}\) is used.
Remark 4 indicates the following modification of Kristof’s theorem.
Theorem 6 : A modification of Kristof’s theorem. In Theorem 3 of Kristof’s theorem (first version), redefine the orthonormal matrices \({{\bf {X}}_i}(i = 1,...,n)\) as \[{{\bf {X}}_i} = {\rm {Bdiag}}({{\bf {X}}_{i1}},...,{{\bf {X}}_{i{i_{\rm {B}}}}}),\] where \({\bf {X}}_{ij}\) is the \({i_j} \times {i_j}\) diagonal block \((j = 1,...,{i_{\rm {B}}})\) with \({i_1} + ... + {i_{{i_{\rm {B}}}}} = m\). Suppose that \({{\bf {X}}_i}(i = 1,...,n)\) independently and unrestrictedly vary with \({\rm {rank}}({{\bf {X}}_{ij}}) = {i_j}\) \((j = 1,...,{i_{\rm {B}}})\). Use \({{\bf {\Gamma }}_i}(i = 1,...,n)\) as defined in Kristof’s theorem. Then, we have \[ - {\rm {tr}}({{\bf {\Gamma }}_1} \cdots {{\bf {\Gamma }}_n}) \le {\rm {tr}}({{\bf {X}}_1}{{\bf {\Gamma }}_1} \cdots {{\bf {X}}_n}{{\bf {\Gamma }}_n}) \le {\rm {tr}}({{\bf {\Gamma }}_1} \cdots {{\bf {\Gamma }}_n}),\] where the optima are attained.
Proof. The case when \({\bf {\Gamma }}_i^* = {{\bf {\Gamma }}_i}(i = 1,...,n)\) can be used in Kristof’s theorem. Although \({{\bf {X}}_i}(i = 1,...,n)\) take the block-diagonal forms, the optima with \({{\bf {X}}_i} = {{\bf {I}}_m}\,(i = 1,...,n)\) are attained in the subset of varying \({{\bf {X}}_i}(i = 1,...,n)\) under the block diagonal restriction. \(\sqcap \)\(\sqcup \)
The usage of \({\bf {\Gamma }}_i\) in place of \({\bf {\Gamma }}_i^*\,(i = 1,...,n)\) is due to the lack of permutations across the different diagonal blocks when using the block-diagonal \({\bf {X}}_i\). Note that Kiers and Ten Berge (1989) used the assumption that the \({\bf {\Gamma }}_i\)’s are available when \(n\) = 2, stating that “in most applications the assumptions are satisfied, if they are not satisfied only minor modifications will typically be involved” (pp. 126-127). Note also that when \({i_{\rm {B}}} = 1\), \({\bf {X}}_i\) becomes a usual orthonormal matrix as in Kristof’s theorem, and when \({i_{\rm {B}}} = m\), \({\bf {X}}_i\) is a signed identity matrix whose diagonal elements are \( \pm 1\), where \({\bf {X}}_i\) and its diagonal elements, i.e., \( \pm 1\), are \(m \times m\) and \(1{\kern 1pt} \times 1\) orthonormal matrices with unit singular value(s), respectively.
An extension of Theorems 4 and 5 is to relax \({\bf {\Gamma }}_i\) to an unconstrained square matrix of the same rank as that of \({\bf {\Gamma }}_i\) \((i = 1,...,n)\), which is mathematically immaterial. Under the relaxed condition, using the SVD \({{\bf {\Gamma }}_i} \equiv {{\bf {U}}_i}{{\bf {\Lambda }}_i}{\bf {V}}_i^{\rm {T}}\) with \({\bf {U}}_i^{\rm {T}}{{\bf {U}}_i}\) and \({\bf {V}}_i^{\rm {T}}{{\bf {V}}_i}\) being identity matrices of the same order as \({\rm {rank}}({{\bf {\Gamma }}_i})\) \((i = 1,...,n)\), redefine \({\bf {X}}_1\) and \({\bf {X}}_i\) as \({{\bf {V}}_n}{{\bf {X}}_1}{\bf {U}}_1^{\rm {T}}\) and \({{\bf {V}}_{i - 1}}{{\bf {X}}_i}{\bf {U}}_i^{\rm {T}}\) \((i = 2,...,n)\), respectively. Then, the problem becomes the same as that using the diagonal matrix \({\bf {\Lambda }}_i\) \((i = 1,...,n)\). Note that this unconstrained condition was used in von Neumann’s (1937) trace inequality and Kristof’s (1970) Theorem (second version). Kristof (1970, p. 523) stated that “A distinction of the two versions will not be emphasized”, where the second version uses the unconstrained square matrix \({\bf {\Gamma }}_i\).
As mentioned in the introductory section, the “underutilization” of Kristof’s theorem and its generalizations still seems to hold, in spite of their simplicity, generality and tractability in yielding solutions in various applications. In this section, basic or important cases in multivariate analysis showing advantages of Kristof’s theorem and its generalizations are provided. Examples or applications are shown below mostly in chronological order. Although in Ten Berge’s generalized Kristof theorem the optima may not be attained, all of his three examples in multivariate analysis and an illustration of the Cauchy-Schwarz inequality using his theorem are cases with attained optima.
Maximization of \({\rm {tr}}({\bf {M\Lambda L}})\) or \({\rm {tr}}({{\bf {A}}^{\rm {T}}}{\bf {\Lambda }})\): Green Jr (1969, Appendix B) (\(n\) = 1). In this problem, \({\bf {M}},{\bf {L}}\) and \(\bf {A}\) are fixed \(m \times r,s \times m\) and \(r \times s\,\,(r \ge s)\) matrices, respectively, while \(\bf {\Lambda }\) is the \(r \times s\) matrix varying over all the semiorthonormal matrices when \(r > s\) or all the orthonormal matrices when \(r = s\). This is seen as an extended application of von Neumann’s trace inequality when \(n = 1\) with unconstrained and possibly rectangular \({\bf {A}} = {({\bf {LM}})^{\rm {T}}}\). An application of this problem to obtaining an optimal linear combination with a specified correlation matrix was investigated. This problem will also be addressed later.
Orthogonal procrustes transformation: Kristof’s (1970, Example 1) (\(n\) = 1). This problem is to minimize \({\rm {tr\{ }}({\bf {A}} - {\bf {BT}}){({\bf {A}} - {\bf {BT}})^{\rm {T}}}\} \), where \(\bf {A}\) and \(\bf {B}\) are fixed \(r \times s\,\,(r \ge s)\) matrices while \(\bf {T}\) is an orthonormal matrix to be derived, which reduces to maximizing \({\rm {tr(}}{{\bf {A}}^{\rm {T}}}{\bf {BT}})\), a case with \(n = 1\). Note that \(\bf {T}\) gives permutations with sign changes (reflections) of the columns of \(\bf {B}\) as well as the rotation of \(\bf {B}\). Although the term “procrustes rotation” is usually used as stated by Kristof (1970, p. 523) especially in factor analysis using factor rotation, it is important that \(\bf {T}\) can give permutations and reflections of the columns of \(\bf {B}\).
On the other hand, we have a problem of “procrustes transformation” without rotation. This happens, e.g., in simulations when \(\bf {B}\) is one of the sample loading matrices rotated by a fixed method, e.g., the geomin, which is to be matched to the population geomin-rotated \(\bf {A}\) to see the sampling variation of \(\bf {B}\); here only permutations and reflections of the columns of \(\bf {B}\) are possible while unconstrained rotation is not allowed, since otherwise \(\bf {BT}\) would no longer be a geomin-rotated loading matrix. This problem is seen as a subproblem of Kristof’s theorem when \(n\) = 1 using a constrained orthonormal matrix \(\bf {T}\) considering only permutations and reflections of the columns of \(\bf {B}\). Problems using constrained orthonormal matrices are included in Ten Berge’s (1983) generalized Kristof theorem, as addressed earlier.
At the end of Example 1, Kristof (1970, p. 524) stated that “The generalization of the present problem to allow \({{\bf {A}}^{\rm {T}}}{\bf {B}}\) to be singular is immediate and does not require special discussion.” It is found that the singular case with \({\rm {rank}}({{\bf {A}}^{\rm {T}}}{\bf {B}}) \equiv {s^*} < s \le r\) gives the \(s^*\) positive singular values, say, \({\gamma _1} \ge \cdots \ge {\gamma _{s*}} > 0\). Then, using the SVD \({{\bf {A}}^{\rm {T}}}{\bf {B}} = {\bf {U}}\,{\rm {diag}}({\gamma _1},...,{\gamma _{s*}}){{\bf {V}}^{\rm {T}}}\), we obtain \[\begin {array}{l} {\rm {max\{ tr(}}{{\bf {A}}^{\rm {T}}}{\bf {BT}})\} = \max [{\rm {tr}}\{ {\bf {U}}\,{\rm {diag}}({\gamma _1},...,{\gamma _{s*}}){{\bf {V}}^{\rm {T}}}{\bf {T}}\} ]\\ = \max [{\rm {tr}}\{ \,{\rm {diag}}({\gamma _1},...,{\gamma _{s*}}){{\bf {V}}^{\rm {T}}}{\bf {TU}}\} ] = {\rm {tr}}\{ \,{\rm {diag}}({\gamma _1},...,{\gamma _{s*}})\} \\ = \sum \nolimits _{i = 1}^{s*} {{\gamma _i}} , \end {array}\] where \({{\bf {V}}^{\rm {T}}}{\bf {TU}}\) is a suborthonormal matrix since the product of suborthonormal matrices is suborthonormal (Ten Berge, 1983, Lemma 4). Actually, \(\bf {U}\) and \(\bf {V}\) of full column rank by definition are semiorthonormal (Ten Berge, 1983, Definition 2) while \(\bf {T}\) is orthonormal as well as suborthonormal.
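As a computational sketch of this solution (assuming numpy; \(\bf {A}\) and \(\bf {B}\) are arbitrary random matrices here), the orthonormal matrix \({\bf {T}} = {\bf {V}}{{\bf {U}}^{\rm {T}}}\) obtained from the SVD of \({{\bf {A}}^{\rm {T}}}{\bf {B}}\) attains the maximum \(\sum \nolimits _i {{\gamma _i}} \).

```python
# Orthogonal Procrustes sketch: T = V U' from the SVD A'B = U diag(γ) V' maximizes tr(A'B T).
import numpy as np
rng = np.random.default_rng(6)

r, s = 6, 4
A, B = rng.standard_normal((r, s)), rng.standard_normal((r, s))
U, gam, Vt = np.linalg.svd(A.T @ B)                    # A'B = U diag(γ) V'
T = Vt.T @ U.T                                         # optimal orthonormal transformation
print(np.isclose(np.trace(A.T @ B @ T), gam.sum()))    # True: maximum equals Σ γ_i
```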
Kristof (1970, p. 524) added three examples with \(n\) = 1 (“the trivial case of the theorem” in his terminology) for determinations of e.g., orthogonal matrices with specific properties developed in the 1960s in psychometrics.
The two-sided orthogonal procrustes problem: Kristof’s (1970, Example 2) (\(n\) = 2). This problem based on Schönemann (1968) minimizes \({\rm {tr\{ }}({\bf {B}} - {{\bf {T}}^{\rm {T}}}{\bf {AS}}){({\bf {B}} - {{\bf {T}}^{\rm {T}}}{\bf {AS}})^{\rm {T}}}\} \), where \(\bf {A}\) and \(\bf {B}\) are fixed square matrices while \(\bf {T}\) and \(\bf {S}\) are orthonormal matrices such that \(\bf {A}\) is to be matched to \(\bf {B}\). This problem reduces to maximizing \({\rm {tr(}}{{\bf {T}}^{\rm {T}}}{\bf {AS}}{{\bf {B}}^{\rm {T}}})\), a case with unconstrained \(\bf {T}\) and \(\bf {S}\) when \(n = 2\). The maximum of \({\rm {tr(}}{{\bf {T}}^{\rm {T}}}{\bf {AS}}{{\bf {B}}^{\rm {T}}})\) is obtained as the sum of the products of the corresponding positive singular values of \(\bf {A}\) and \(\bf {B}\), each arranged in descending order.
Multivariate multiple regression: Kristof’s (1970, Example 4) and Ten Berge’s (1983, Application 1) (\(n\) = 2). Kristof’s example included an unnatural restriction of the same numbers of dependent and independent variables, which was removed by Ten Berge’s application. The problem is to minimize \({\rm {tr\{ }}({\bf {Y}} - {\bf {XB}}){({\bf {Y}} - {\bf {XB}})^{\rm {T}}}\} \), where \(\bf {Y}\) is an \(s \times u\) matrix for \(u\) dependent variables and \(\bf {X}\) of full column rank is an \(s \times t\) matrix for \(t\) independent variables. The well-known solution \({\bf {\hat B}} = {({{\bf {X}}^{\rm {T}}}{\bf {X}})^{ - 1}}{{\bf {X}}^{\rm {T}}}{\bf {Y}}\) was obtained without calculus when \(n\) = 2, though the method is somewhat tedious.
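A minimal numerical sketch of this solution (assuming numpy with arbitrary simulated data) is as follows; perturbing \({\bf {\hat B}}\) never decreases the residual trace.

```python
# Check that B_hat = (X'X)^{-1} X'Y minimizes tr{(Y - XB)(Y - XB)'} (simulated data).
import numpy as np
rng = np.random.default_rng(7)

s, t, u = 50, 3, 2                        # s cases, t predictors, u dependent variables
X, Y = rng.standard_normal((s, t)), rng.standard_normal((s, u))
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)               # (X'X)^{-1} X'Y

def loss(B):
    E = Y - X @ B
    return np.trace(E @ E.T)

print(all(loss(B_hat) <= loss(B_hat + 0.01 * rng.standard_normal((t, u)))
          for _ in range(100)))                          # True
```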
Principal components analysis: Ten Berge’s (1983, Application 2) (\(n\) = 1). This application deals with jointly determining \(p\) principal components when \(\bf {R}\) is a \(k \times k\) correlation matrix of rank \(r \le k\). The problem is to maximize the sum of squared loadings \({\rm {tr(}}{{\bf {B}}^{\rm {T}}}{{\bf {R}}^2}{\bf {B}})\), where \(\bf {B}\) is a \(k \times p\) loading matrix with \(p \le r\), subject to the components being uncorrelated with unit variances, i.e., \({{\bf {B}}^{\rm {T}}}{\bf {RB}} = {{\bf {I}}_p}\). Note that an atypical assumption of a possibly singular \(\bf {R}\) is used.
Let \({\bf {R}} = {\bf {K\Lambda }}{{\bf {K}}^{\rm {T}}}\) with \({{\bf {K}}^{\rm {T}}}{\bf {K}} = {{\bf {I}}_r}\) be the SVD using the positive diagonal elements of the \(r \times r\) diagonal matrix \(\bf {\Lambda }\). The solution can be given by maximizing \[\begin {array}{l} {\rm {tr(}}{{\bf {B}}^{\rm {T}}}{{\bf {R}}^2}{\bf {B}}) = {\rm {tr(}}{{\bf {B}}^{\rm {T}}}{\bf {K}}{{\bf {\Lambda }}^2}{{\bf {K}}^{\rm {T}}}{\bf {B}}) = {\rm {tr\{ (}}{{\bf {B}}^{\rm {T}}}{\bf {K}}{{\bf {\Lambda }}^{1/2}}){\bf {\Lambda }}({{\bf {\Lambda }}^{1/2}}{{\bf {K}}^{\rm {T}}}{\bf {B}})\} \\ = {\rm {tr\{ }}({{\bf {\Lambda }}^{1/2}}{{\bf {K}}^{\rm {T}}}{\bf {B}}){\rm {(}}{{\bf {B}}^{\rm {T}}}{\bf {K}}{{\bf {\Lambda }}^{1/2}}){\bf {\Lambda }}\} \end {array}\] with \(n\) = 1, where \(({{\bf {\Lambda }}^{1/2}}{{\bf {K}}^{\rm {T}}}{\bf {B}}){\rm {(}}{{\bf {B}}^{\rm {T}}}{\bf {K}}{{\bf {\Lambda }}^{1/2}})\) is suborthonormal since \[{{\bf {I}}_p} = {{\bf {B}}^{\rm {T}}}{\bf {RB}} = {{\bf {B}}^{\rm {T}}}{\bf {K\Lambda }}{{\bf {K}}^{\rm {T}}}{\bf {B}} = ({{\bf {B}}^{\rm {T}}}{\bf {K}}{{\bf {\Lambda }}^{1/2}})({{\bf {\Lambda }}^{1/2}}{{\bf {K}}^{\rm {T}}}{\bf {B}}).\] Noting that \({{\bf {B}}^{\rm {T}}}{\bf {K}}{{\bf {\Lambda }}^{1/2}}\) is a semiorthonormal matrix, Ten Berge’s generalized Kristof theorem gives \({\rm {max\{ tr(}}{{\bf {B}}^{\rm {T}}}{{\bf {R}}^2}{\bf {B}})\} = \sum \nolimits _{i = 1}^p {{\lambda _i}} \), where \({\lambda _1} \ge \cdots \ge {\lambda _r} > 0\) and the unrotated loading matrix \(\bf {B}\) is the submatrix of \({\bf {K}}{{\bf {\Lambda }}^{ - 1/2}}\) consisting of its first \(p\) columns, so that \({{\bf {B}}^{\rm {T}}}{\bf {K}}{{\bf {\Lambda }}^{1/2}}\) is the submatrix of \({\bf {I}}_r\) consisting of its first \(p\) rows.
The formulation of \({\rm {tr(}}{{\bf {B}}^{\rm {T}}}{{\bf {R}}^2}{\bf {B}})\) using suborthonormal matrices is of interest since the restriction \({{\bf {B}}^{\rm {T}}}{\bf {RB}} = {{\bf {I}}_p}\) is cleverly built into the function to be maximized, without calculus or Lagrange multipliers. While the above example was employed by Ten Berge to show an application of his generalized Kristof theorem, when all the \(k\) components including minor ones are obtained in the usual non-singular case of \(\bf {R}\), the maximum of \({\rm {tr(}}{{\bf {B}}^{\rm {T}}}{{\bf {R}}^2}{\bf {B}})\) is obtained by Kristof’s theorem as \({\rm {tr}}({\bf {\Lambda }})\), where \(\bf {\Lambda }\) is the \(k \times k\) diagonal matrix with positive diagonals. In this case, \({{\bf {\Lambda }}^{1/2}}{{\bf {K}}^{\rm {T}}}{\bf {B}}\) becomes \({\bf {I}}_k\), yielding the well-known unrotated loading matrix \({\bf {B}} = {\bf {K}}{{\bf {\Lambda }}^{ - 1/2}}\).
Canonical correlations: Kristof (1970, General comment (a)) (\(n\) = 1), Ten Berge (1983, Application 3) (\(n\) = 1), Ogasawara (2000, with errata, Theorem 1) (\(n\) = 2), and Waller (2018, pp. 195-196) (\(n\) = 1). Kristof (1970) suggested a formulation of canonical correlations applying his theorem with \(n\) = 1, which was fully described by Ten Berge (1983) using an associated SVD as in principal components. Ogasawara (2000) used Kristof’s theorem with \(n\) = 2 to obtain a lower bound of the mean squared canonical correlations between factors in factor analysis and the corresponding principal components. Waller also showed an application of Kristof’s theorem for canonical correlations using two sets of principal components derived from the two sets of original variables. However, his results are for the case in which the two sets contain the same number of original variables. Note that the use of principal components needs justification when the numbers of original variables in the two sets are unequal or the number of canonical correlations is less than the minimum of these numbers.
The Cauchy-Schwarz inequality: Ten Berge (1983, p. 522) (\(n\) = 2); Paragraph (iii) after Lemma 3 (\(n\) = 1) and Theorem 5 of the current article (\(n\) = 2). This example is simple but impressive in that it well shows the simplicity and generality of Kristof’s theorem and its extensions without calculus. Let \(\bf {x}\) and \(\bf {y}\) be non-zero vectors of order \(k\). Suppose that \(\bf {x}\) and \(\bf {y}\) vary independently and unrestrictedly. Then, \[\begin {array}{l} {{\bf {x}}^{\rm {T}}}{({{\bf {x}}^{\rm {T}}}{\bf {x}})^{ - 1/2}}{\bf {y}}{({{\bf {y}}^{\rm {T}}}{\bf {y}})^{ - 1/2}} = {{\bf {x}}^{\rm {T}}}{({{\bf {x}}^{\rm {T}}}{\bf {x}})^{ - 1/2}}{{\bf {I}}_k}{\bf {y}}{({{\bf {y}}^{\rm {T}}}{\bf {y}})^{ - 1/2}} \cdot 1\\ \qquad = {{\bf {x}}^{\rm {T}}}{({{\bf {x}}^{\rm {T}}}{\bf {x}})^{ - 1/2}}{\bf {\Gamma }}_1^*{\bf {y}}{({{\bf {y}}^{\rm {T}}}{\bf {y}})^{ - 1/2}}{\bf {\Gamma }}_2^*, \end {array}\] where \({\bf {\Gamma }}_1^* = {{\bf {\Gamma }}_1} = {{\bf {I}}_k}\) and \({\bf {\Gamma }}_2^* = {{\bf {\Gamma }}_2} = 1\). Define the \(k \times k\) diagonal matrices \({{\bf {\Delta }}_1} = {{\bf {\Gamma }}_1} = {{\bf {I}}_k}\) and \({{\bf {\Delta }}_2} = {{\bf {E}}_{11}}\), where the first diagonal element of \({\bf {E}}_{11}\) is unity and the remaining elements are zero. Since the vectors \({{\bf {x}}^{\rm {T}}}{({{\bf {x}}^{\rm {T}}}{\bf {x}})^{ - 1/2}}\) and \({{\bf {y}}^{\rm {T}}}{({{\bf {y}}^{\rm {T}}}{\bf {y}})^{ - 1/2}}\) are semiorthonormal, their non-zero singular values are unity. Consequently, applying Ten Berge’s theorem, the maximum is attained as \({\rm {tr}}({{\bf {\Delta }}_1}{{\bf {\Delta }}_2}) = {\rm {tr}}({{\bf {I}}_k}{{\bf {E}}_{11}}) = 1\), with the minimum \(-1\) obtained similarly. Then, we have the Cauchy-Schwarz inequality \( - 1 \le {{\bf {x}}^{\rm {T}}}{({{\bf {x}}^{\rm {T}}}{\bf {x}})^{ - 1/2}}{\bf {y}}{({{\bf {y}}^{\rm {T}}}{\bf {y}})^{ - 1/2}} = \frac {{{{\bf {x}}^{\rm {T}}}{\bf {y}}}}{{{{({{\bf {x}}^{\rm {T}}}{\bf {x}})}^{1/2}}{{({{\bf {y}}^{\rm {T}}}{\bf {y}})}^{1/2}}}} \le 1\).
As addressed in Paragraph (iii) after Lemma 3, the inequality is given via Kristof’s theorem (\(n\) = 1). The same result is obtained by using Theorem 5 of this article. Define the \(k \times k\) parent orthonormal matrices \(\bf {X}\) and \(\bf {Y}\), whose first rows are \({{\bf {x}}^{\rm {T}}}{({{\bf {x}}^{\rm {T}}}{\bf {x}})^{ - 1/2}}\) and \({{\bf {y}}^{\rm {T}}}{({{\bf {y}}^{\rm {T}}}{\bf {y}})^{ - 1/2}}\), respectively, where \(\bf {X}\) and \(\bf {Y}\) independently and unrestrictedly vary over the sets of orthonormal matrices. Then, Theorem 5 gives the maximum of \({{\bf {x}}^{\rm {T}}}{({{\bf {x}}^{\rm {T}}}{\bf {x}})^{ - 1/2}}{\bf {y}}{({{\bf {y}}^{\rm {T}}}{\bf {y}})^{ - 1/2}}\) via \({\rm {tr}}({\bf {X}}{{\bf {I}}_k}{\bf {Y}}{{\bf {E}}_{11}}) \le {\rm {tr}}({{\bf {I}}_k}{{\bf {E}}_{11}}) = 1\), with the minimum \({\rm {tr}}({\bf {X}}{{\bf {I}}_k}{\bf {Y}}{{\bf {E}}_{11}}) \ge - {\rm {tr}}({{\bf {I}}_k}{{\bf {E}}_{11}}) = - 1\) obtained similarly.
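As a toy check (Python/NumPy; the vectors and seed are arbitrary), the normalized inner product indeed lies in \([-1, 1]\), the bound \({\rm {tr}}({{\bf {\Delta }}_1}{{\bf {\Delta }}_2}) = 1\) obtained above.

```python
import numpy as np

rng = np.random.default_rng(3)
k = 7
x = rng.standard_normal(k)
y = rng.standard_normal(k)

ratio = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))   # x^T y / {(x^T x)^{1/2} (y^T y)^{1/2}}
print(-1.0 <= ratio <= 1.0)                                  # True (Cauchy-Schwarz)

Delta1 = np.eye(k)
E11 = np.zeros((k, k)); E11[0, 0] = 1.0                      # first diagonal element unity, others zero
print(np.trace(Delta1 @ E11))                                # 1.0, the upper bound
```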
Generalized linear form: Ten Berge (1993, Equation (48)), Yanai and Takane (2007, Property 11) and Adachi (2020, Theorem A.4.2) (\(n\) = 1). This is the problem of maximizing \({\rm {tr}}({{\bf {X}}^{\rm {T}}}{\bf {A}})\), where \(\bf {X}\) and fixed \(\bf {A}\) are \(p \times q\,\,(p \ge q)\) matrices with the constraint \({{\bf {X}}^{\rm {T}}}{\bf {X}} = {{\bf {I}}_q}\) and \({\rm {rank}}({\bf {A}}) \le q\). To the author’s knowledge, this problem was first solved by Green Jr (1969) as mentioned in the first example. After Kristof (1970), Ten Berge (1993) reformulated the problem as the generalized linear form. He defined the SVD \({\bf {A}} = {\bf {U\Lambda }}{{\bf {V}}^{\rm {T}}}\) with \({{\bf {U}}^{\rm {T}}}{\bf {U}} = {{\bf {V}}^{\rm {T}}}{\bf {V}} = {\bf {V}}{{\bf {V}}^{\rm {T}}} = {{\bf {I}}_q}\), where \(\bf {\Lambda }\) is the diagonal matrix with non-negative diagonal elements, and applied his generalized Kristof theorem with \(n\) = 1. Then, the maximum is given by \[{\rm {max\{ tr}}({{\bf {X}}^{\rm {T}}}{\bf {A}})\} = {\rm {max\{ tr}}({{\bf {X}}^{\rm {T}}}{\bf {U\Lambda }}{{\bf {V}}^{\rm {T}}})\} = {\rm {max\{ tr}}({{\bf {V}}^{\rm {T}}}{{\bf {X}}^{\rm {T}}}{\bf {U\Lambda }})\} = {\rm {tr}}({\bf {\Lambda }})\] since \({{\bf {V}}^{\rm {T}}}{{\bf {X}}^{\rm {T}}}{\bf {U}}\) is suborthonormal. The maximum is attained when \({{\bf {V}}^{\rm {T}}}{{\bf {X}}^{\rm {T}}}{\bf {U}} = {{\bf {I}}_q}\), i.e., \({\bf {X}} = {\bf {U}}{{\bf {V}}^{\rm {T}}}\).
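A minimal numerical sketch (Python/NumPy; the matrix \(\bf {A}\) and dimensions are arbitrary) verifies that \({\bf {X}} = {\bf {U}}{{\bf {V}}^{\rm {T}}}\) satisfies the constraint and attains \({\rm {tr}}({\bf {\Lambda }})\), and that a random column-orthonormal \(\bf {X}\) does not exceed this value.

```python
import numpy as np

rng = np.random.default_rng(4)
p, q = 8, 3
A = rng.standard_normal((p, q))                      # fixed p x q matrix (full column rank here)

U, lam, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(lam) V^T, lam >= 0
X = U @ Vt                                           # maximizer X = U V^T
print(np.allclose(X.T @ X, np.eye(q)))               # constraint X^T X = I_q
print(np.isclose(np.trace(X.T @ A), lam.sum()))      # maximum equals tr(Lambda)

Q, _ = np.linalg.qr(rng.standard_normal((p, q)))     # an arbitrary column-orthonormal X
print(np.trace(Q.T @ A) <= lam.sum() + 1e-12)        # True
```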
Note that the definition of the SVD using the non-negative rather than positive singular values is important considering the case \({\rm {rank}}({\bf {A}}) \equiv {q^*} < q\). That is, in that case, when only the positive singular values are used, \(\bf {U}\) and \(\bf {V}\) become \(p \times {q^*}\) and \(q \times {q^*}\) semiorthonormal matrices, respectively, yielding \({{\bf {X}}^{\rm {T}}}{\bf {X}} \ne {{\bf {I}}_q}\), which does not satisfy the assumption. Note that Green Jr (1969, p. 317) correctly considered the two cases \({q^*} < q\) and \({q^*} = q\) as well as the cases of multiple or equal positive singular values in terms of the uniqueness of \(\bf {U}\), \(\bf {V}\) and \({\bf {U}}{{\bf {V}}^{\rm {T}}}\). For this example, Neudecker’s (2004, Section 2) derivation as “a Kristof-type theorem” with a correction and added explanation will be shown in the appendix.
(a) The trivial case (\(n\) = 1) and the bilinear case i.e., von Neumann’s trace inequality (\(n\) = 2). In the previous section, only the examples of \(n\) = 1 or 2 are shown. Though Kristof (1970) used the term “trivial case” when \(n\) = 1, its applications are meaningful ones as shown earlier. Note that only the derivation of Kristof’s theorem is trivial or self-evident when \(n\) = 1. A case of \(n\) = 4 was provided by Kristof (1970, Example 6) as a generalization of Meredith’s (1964) problem for a multivariate selection of subpopulations from a common parent. However, most of the applications of Kristof’s theorem and Ten Berge’s generalized one seem to be those of \(n\) = 1 or 2. Kiers and Ten Berge (1989, p. 126) stated that “All practical applications we have encountered so far apply to the cases \(k\) = 1 or \(k\) = 2”, where \(k\) is used for \(n\). That said, it is to be noted that Ten Berge (1983, p. 509) stated that “Theorems should be derived in the greatest possible generality”.
(b) Alternative proofs. Proofs in seminal papers tend to be complicated, and alternative simple or short proofs often follow the original discoveries. For von Neumann’s (1937) trace inequality, the elementary alternative proof by Mirsky (1975) with Fan’s (1951) lemma may be the simplest one, as shown in the appendix. In this tutorial, a rephrasing or breakdown of Kristof’s proof of his theorem has been shown. However, the logic is essentially the same as Kristof’s, using induction. Marshall, Olkin, and Arnold (2011, Chapter 20, Theorem B.2) also showed a similar proof by induction, though they stated that “We give an inductive proof that is elementary, though still somewhat lengthy” (p. 791). Finding alternative simple, self-contained and hopefully short proofs of Kristof’s theorem when \(n \ge 3\) is an open problem. Mirsky’s proof uses a doubly stochastic matrix, and the author conjectures that applications or generalizations of that proof to the inequalities with \(n \ge 3\) would be difficult.
(c) Equivalent and inequivalent cases of the Kristof and Ten Berge theorems. As mentioned earlier, Ten Berge (1983, Theorem 2) extended Kristof’s theorem. On the differences between the theorems, he stated that “The most striking difference is that the \({\bf {X}}_i\) are no longer required to be orthonormal. Second, the \({\bf {X}}_i\) need no longer to be square” (p. 521), which are the advantages of Ten Berge’s theorem over Kristof’s claimed by Ten Berge. The third difference is the lack of the attained optima, which is not an advantage. The two claimed advantages may be handled by Kristof’s theorem by considering parent orthonormal matrices, as used by Ten Berge himself and in Theorem 5 of this article. So, the two theorems may be seen as equivalent when the optima are attained. Note that all four examples in Ten Berge (1983) are cases of attained optima. The cases of unattained optima due to \({\rm {rank}}({{\bf {X}}_i}) = r_i^* < {r_i} = \min \{ {m_{\,i - 1}},{m_{\,i}}\} \) are probably theoretical or special ones that rarely, if ever, arise in practice.
It is conjectured that even in this special case, some adjustment accommodating \(r_i^* \le {r_i}\) may be obtained. For instance, consider canonical correlation analysis for two sets of standardized data matrices, i.e., \({\bf {X}}_1\) (\(n \times {r_1}\)) of rank \(r_1^*\) and \({\bf {X}}_2\) (\(n \times {r_2}\)) of rank \(r_2^*\). Then, when only \({r^*} < \min \{ r_1^*,r_2^*\} \) canonical correlations are to be optimally derived in a least squares sense among the \(\min \{ r_1^*,r_2^*\} \) possible ones, a similar problem seems to arise. Actually, when Ten Berge (1983, p. 523) formulated this situation using the coefficient matrices \({\bf {B}}_1\) (\({r_1} \times {r^*}\)) and \({\bf {B}}_2\) (\({r_2} \times {r^*}\)) among other matrices, the maximum of \({\rm {tr}}({\bf {B}}_1^{\rm {T}}{\bf {X}}_1^{\rm {T}}{{\bf {X}}_2}{{\bf {B}}_2})\) was attained.
Adachi, K. (2020). Matrix-based introduction to multivariate data analysis (2nd ed.). Singapore: Springer Singapore.
Fan, K. (1951). Maximum properties and inequalities for the eigenvalues of completely continuous operators. Proceedings of the National Academy of Sciences, 37(11), 760–766.
Green Jr, B. F. (1969). Best linear composites with a specified structure. Psychometrika, 34(3), 301–318.
Hardy, G. H., Littlewood, J. E., & Pólya, G. (1934). Inequalities. New York: Cambridge University Press.
Hardy, G. H., Littlewood, J. E., & Pólya, G. (1952). Inequalities (2nd ed). New York: Cambridge University Press. (Reprinted 1994)
Kiers, H. A. L., & Ten Berge, J. M. F. (1989). Optimality conditions for the trace of certain matrix products. Linear Algebra and its Applications, 126, 125–134.
Kristof, W. (1964). Die beste orthogonale transformation zur gegenseitigen Überführung zweier faktorenmatrizen. Diagnostica, 10, 87–90.
Kristof, W. (1969). A theorem on the trace of certain matrix products and some applications. ETS Research Bulletin Series, RB-69-21. Educational Testing Service: Princeton, NJ. doi: https://doi.org/10.1002/j.2333-8504.1969.tb00400
Kristof, W. (1970). A theorem on the trace of certain matrix products and some applications. Journal of Mathematical Psychology, 7(3), 515–530.
Levin, J. (1979). Applications of a theorem on the traces of certain matrix products. Multivariate Behavioral Research, 14(1), 109–113.
Löwner, C. (1934). Über monotone matrixfunktionen. Mathematische Zeitschrift, 38, 177–216.
Marshall, A. W., Olkin, I., & Arnold, B. C. (2011). Inequalities: theory of majorization and its applications (2nd ed.). New York: Springer.
Meredith, W. (1964). Rotation to achieve factorial invariance. Psychometrika, 29(2), 187–206.
Miranda, H., & Thompson, R. C. (1993). A trace inequality with a subtracted term. Linear Algebra and its Applications, 185, 165–172.
Mirsky, L. (1975). A trace inequality of John von Neumann. Monatshefte für Mathematik, 79(4), 303–306.
Mori, K., & Kurata, H. (2013). Optimal correlation preserving linear predictors of factor scores in factor analysis. Journal of the Japan Statistical Society, 43(1), 79–89.
Neudecker, H. (2004). On best affine unbiased covariance-preserving prediction of factor scores. SORT, 28(1), 27–36.
Ogasawara, H. (2000). Some relationships between factors and components. Psychometrika, 65, 167–185. (errata, 65, 551)
Schönemann, P. H. (1968). On two-sided orthogonal procrustes problems. Psychometrika, 33(1), 19–33.
Simon, B. (2005). Trace ideals and their applications (2nd ed.) (No. 120). Providence, RI: American Mathematical Society.
Ten Berge, J. M. F. (1983). A generalization of Kristof’s theorem on the trace of certain matrix products. Psychometrika, 48(4), 519–523.
Ten Berge, J. M. F. (1993). Least squares optimization in multivariate analysis. Leiden, The Netherlands: DSWO Press, Leiden University. (pdf version with minor corrections, 2005, https://kiers.webhosting.rug.nl/leastsquaresbook.pdf)
Uchida, T. (2023). von Neumann’s trace inequality. Unpublished document available at https://note.com/utaka233/n/n511cbf78f89d. (in Japanese).
von Neumann, J. (1937). Some matrix inequalities and metrization of matric-space. Tomsk University Review, 1, 286–300. Reprinted in A. H. Taub (Ed.), John von Neumann collected works (1962), Vol. IV, pp. 205-219. New York: Pergamon Press.
Waller, N. (2018). An introduction to Kristof’s theorem for solving least-square optimization problems without calculus. Multivariate Behavioral Research, 53(2), 190–198.
Yanai, H., & Takane, Y. (2007). Matrix methods and their applications to factor analysis. In S.-Y. Lee (Ed.), Handbook of latent variable and related models (pp. 345–366). Amsterdam: North-Holland.
Lemma A: Fan’s (1951, Lemma 1A) inequality for the doubly substochastic matrix. Let \({\bf {a}} = {({a_1},...,{a_n})^{\rm {T}}}\) and \({\bf {b}} = {({b_1},...,{b_n})^{\rm {T}}}\) be fixed vectors with \({a_1} \ge \cdots \ge {a_n} \ge 0\) and \({b_1} \ge \cdots \ge {b_n} \ge 0\). Define an \(n \times n\) doubly substochastic matrix \({\bf {P}} = \{ {p_{ij}}\} \) with non-negative elements satisfying \(\sum \nolimits _{j = 1}^n {{p_{ij}}} \le 1\,\,(i = 1,...,n)\) and \(\sum \nolimits _{i = 1}^n {{p_{ij}}} \le 1\,\,(j = 1,...,n)\) (see e.g., Marshall et al. (2011, Section 2.C)). Then, \({{\bf {a}}^{\rm {T}}}{\bf {Pb}} \le {{\bf {a}}^{\rm {T}}}{\bf {b}}\).
Proof (a slight extension of Uchida (2023)). Since \({a_1} \ge \cdots \ge {a_n} \ge 0\) and \({b_1} \ge \cdots \ge {b_n} \ge 0\), there exist non-negative numbers \({c_1},...,{c_n}\) and \({d_1},...,{d_n}\) (e.g., \({c_k} = {a_k} - {a_{k + 1}}\) with \({a_{n + 1}} \equiv 0\), and similarly for \({d_k}\)) such that \({a_i} = \sum \nolimits _{k = i}^n {{c_k}} \) and \({b_i} = \sum \nolimits _{k = i}^n {{d_k}} \). Using these expressions, \[\begin {array}{l} {{\bf {a}}^{\rm {T}}}{\bf {b}} - {{\bf {a}}^{\rm {T}}}{\bf {Pb}} = \sum \nolimits _{i,j = 1}^n {({\delta _{ij}} - {p_{ij}})} {a_i}{b_j} = \sum \nolimits _{i,j = 1}^n {({\delta _{ij}} - {p_{ij}})\sum \nolimits _{k = i}^n {{c_k}\sum \nolimits _{l = j}^n {{d_l}} } } \\ = \sum \nolimits _{k,l = 1}^n {{c_k}{d_l}} \sum \nolimits _{i = 1}^k {\sum \nolimits _{j = 1}^l {({\delta _{ij}} - {p_{ij}})} } \end {array}\] follows, where \(\delta _{ij}\) is the Kronecker delta. In the above expression, define a doubly stochastic (rather than substochastic) matrix \({{\bf {P}}^*} = \{ p_{ij}^*\} \) satisfying \(p_{ij}^* \ge {p_{ij}}\,(i,j = 1,...,n)\); such a \({\bf {P}}^*\) exists (see e.g., Marshall et al. (2011, Section 2.C)). Then, consider the case \(k \ge l\) in the above expression. We obtain \[\begin {array}{l} \sum \nolimits _{i = 1}^k {\sum \nolimits _{j = 1}^l {({\delta _{ij}} - {p_{ij}})} } = \sum \nolimits _{i = 1}^k {\sum \nolimits _{j = 1}^l {{\delta _{ij}}} } - \sum \nolimits _{i = 1}^k {\sum \nolimits _{j = 1}^l {{p_{ij}}} } \\ = \sum \nolimits _{i = 1}^l {{\delta _{ii}}} - \sum \nolimits _{i = 1}^k {\sum \nolimits _{j = 1}^l {{p_{ij}}} } = l - \sum \nolimits _{i = 1}^k {\sum \nolimits _{j = 1}^l {{p_{ij}}} } \\ \ge l - \sum \nolimits _{i = 1}^n {\sum \nolimits _{j = 1}^l {p_{ij}^*} } \ge l - l = 0. \end {array}\] When \(k \le l\), in a similar manner we obtain \(\sum \nolimits _{i = 1}^k {\sum \nolimits _{j = 1}^l {({\delta _{ij}} - {p_{ij}})} } \ge 0\). Since \({c_k}{d_l} \ge 0\), these two inequalities give the required result \({{\bf {a}}^{\rm {T}}}{\bf {b}} - {{\bf {a}}^{\rm {T}}}{\bf {Pb}} \ge 0\). \(\sqcap \)\(\sqcup \)
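The inequality of Lemma A is easy to check numerically. The sketch below (Python/NumPy; the way the random doubly substochastic matrix is generated is an arbitrary choice) shrinks the entries of an orthostochastic (hence doubly stochastic) matrix to obtain a doubly substochastic \(\bf {P}\) and confirms \({{\bf {a}}^{\rm {T}}}{\bf {Pb}} \le {{\bf {a}}^{\rm {T}}}{\bf {b}}\).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6
a = np.sort(rng.random(n))[::-1]        # a_1 >= ... >= a_n >= 0
b = np.sort(rng.random(n))[::-1]        # b_1 >= ... >= b_n >= 0

for _ in range(1000):
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    P_star = Q**2                       # doubly stochastic (rows and columns sum to one)
    P = P_star * rng.random((n, n))     # shrinking entries keeps row/column sums <= 1
    assert a @ P @ b <= a @ b + 1e-12   # Fan's inequality
print("a^T P b <= a^T b held in all trials")
```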
Remark A. The original proof by Fan (1951) is short though not self-contained in that “Abel’s lemma” is used without an associated reference; the author could not identify which of Abel’s formulas is meant. The above proof is a slight extension of the result by Uchida (2023), who dealt only with the doubly stochastic matrix \({\bf {P}}^*\), a special case of the doubly substochastic matrix \(\bf {P}\).
The second proof of Theorem 1 (von Neumann’s trace inequality; Mirsky (1975, Section 3, p. 305)). \({\rm {|}}\,{\rm {tr}}({{\bf {X}}_1}{{\bf {\Gamma }}_1}{{\bf {X}}_2}{{\bf {\Gamma }}_2})|\, \le {\rm {tr}}({{\bf {\Gamma }}_1}{{\bf {\Gamma }}_2})\) is derived. Let \({{\bf {X}}_i} = \{ {x_{(i)jk}}\} \,(j,k = 1,...,m)\) and \({{\bf {\Gamma }}_i} = {\rm {diag}}({\gamma _{(i)1}},...,{\gamma _{(i)m}})\) \((i = 1,\,\,2)\). Then, using Fan’s (1951) inequality for doubly (sub)stochastic matrices, we have \[\begin {array}{l} {\rm {|}}\,{\rm {tr}}({{\bf {X}}_1}{{\bf {\Gamma }}_1}{{\bf {X}}_2}{{\bf {\Gamma }}_2})|\, = \,\,|\sum \nolimits _{j,k = 1}^m {{x_{(1)jk}}{\gamma _{(1)k}}{x_{(2)kj}}} {\gamma _{(2)j}}|\\ \le \sum \nolimits _{j,k = 1}^m {|{x_{(1)jk}}{x_{(2)kj}}|{\gamma _{(1)k}}} {\gamma _{(2)j}}\\ \le \frac {1}{2}\sum \nolimits _{j,k = 1}^m {x_{(1)jk}^2{\gamma _{(1)k}}} {\gamma _{(2)j}} + \frac {1}{2}\sum \nolimits _{j,k = 1}^m {x_{(2)kj}^2{\gamma _{(1)k}}} {\gamma _{(2)j}}\\ \le \frac {1}{2}\sum \nolimits _{j = 1}^m {{\gamma _{(1)j}}} {\gamma _{(2)j}} + \frac {1}{2}\sum \nolimits _{j = 1}^m {{\gamma _{(1)j}}} {\gamma _{(2)j}}{\rm { = }}\,{\rm {tr}}({{\bf {\Gamma }}_1}{{\bf {\Gamma }}_2}), \end {array}\] where the second inequality uses \(|uv| \le ({u^2} + {v^2})/2\) and the last inequality applies Lemma A to the matrices \(\{ x_{(1)jk}^2\} \) and \(\{ x_{(2)kj}^2\} \), which are doubly stochastic by the orthonormality of \({\bf {X}}_1\) and \({\bf {X}}_2\). \(\sqcap \)\(\sqcup \)
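The inequality can also be checked numerically. The sketch below (Python/NumPy; dimensions, diagonals and seed are arbitrary) draws random orthonormal \({\bf {X}}_1, {\bf {X}}_2\) and confirms \(|{\rm {tr}}({{\bf {X}}_1}{{\bf {\Gamma }}_1}{{\bf {X}}_2}{{\bf {\Gamma }}_2})| \le {\rm {tr}}({{\bf {\Gamma }}_1}{{\bf {\Gamma }}_2})\).

```python
import numpy as np

rng = np.random.default_rng(6)
m = 5
G1 = np.diag(np.sort(rng.random(m))[::-1])   # Gamma_1 with descending non-negative diagonals
G2 = np.diag(np.sort(rng.random(m))[::-1])   # Gamma_2 likewise
bound = np.trace(G1 @ G2)

for _ in range(1000):
    X1, _ = np.linalg.qr(rng.standard_normal((m, m)))
    X2, _ = np.linalg.qr(rng.standard_normal((m, m)))
    assert abs(np.trace(X1 @ G1 @ X2 @ G2)) <= bound + 1e-12
print("|tr(X1 G1 X2 G2)| <= tr(G1 G2) held in all trials")
```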
The Kristof-type theorem for correlation preserving predictors of factor scores: Neudecker (2004, Section 2) (\(n\) = 1). Neudecker obtained the same solution as in the first example by Green and the last one reformulated by Ten Berge in Section 5 as “a Kristof-type theorem” in the context of the derivation of correlation preserving predictors of factor scores (for these predictors, see the references in Neudecker (2004) and Mori and Kurata (2013)). He did not mention or use Ten Berge’s theorem, but employed calculus and Lagrange multipliers. Neudecker also used the SVD \({\bf {A}} = {{\bf {U}}_0}{{\bf {\Lambda }}_0}{\bf {V}}_0^{\rm {T}}\), where he employed only the positive singular values, i.e., \({{\bf {\Lambda }}_0} > {\bf {O}}\) in Löwner’s sense.
An advantage of Neudecker’s derivation is that it gives the set of explicit expressions of \(\bf {X}\) maximizing \({\rm {tr}}({{\bf {X}}^{\rm {T}}}{\bf {A}})\) as \({\bf {X}} = {{\bf {U}}_0}{\bf {V}}_0^{\rm {T}} = {\bf {U}}{{\bf {V}}^{\rm {T}}}\) when \({\rm {rank}}({\bf {A}}) = {q^*} = q\) and \({\bf {X}} = {{\bf {U}}_0}{\bf {V}}_0^{\rm {T}} + {\bf {Q}}({{\bf {I}}_q} - {{\bf {V}}_0}{\bf {V}}_0^{\rm {T}})\) with a \(p \times q\) matrix \(\bf {Q}\) when \({q^*} < q\) as “the general solution” (Neudecker, 2004, Equation (2.5)). In these expressions, \[{{\bf {U}}_0}{\bf {V}}_0^{\rm {T}} = {\bf {X}}{\{ {({{\bf {X}}^{\rm {T}}}{\bf {X}})^{1/2}}\} ^ + } = {\bf {X}}{\{ {({{\bf {X}}^{\rm {T}}}{\bf {X}})^ + }\} ^{1/2}} \equiv {\bf {X}}{({{\bf {X}}^{\rm {T}}}{\bf {X}})^{( + )1/2}}\] and \[{{\bf {V}}_0}{\bf {V}}_0^{\rm {T}} = {({{\bf {X}}^{\rm {T}}}{\bf {X}})^{( + )1/2}}{({{\bf {X}}^{\rm {T}}}{\bf {X}})^{1/2}} = {({{\bf {X}}^{\rm {T}}}{\bf {X}})^{1/2}}{({{\bf {X}}^{\rm {T}}}{\bf {X}})^{( + )1/2}},\] where \(( \cdot )^{1/2}\) is the matrix square root of a matrix; and \(( \cdot )^ + \) is the Moore-Penrose generalized (MP \(g\)-) inverse of a possibly rectangular matrix, which is obtained by using the SVD. That is, when the SVD of a matrix is \({\bf {Y}} = {\bf {P\Gamma }}{{\bf {Q}}^{\rm {T}}}\) employing only the positive singular values, we have \({{\bf {Y}}^ + } = {\bf {Q}}{{\bf {\Gamma }}^{ - 1}}{{\bf {P}}^{\rm {T}}}\), which satisfies the conditions of the MP \(g\)-inverse: \({\bf {Y}}{{\bf {Y}}^ + }{\bf {Y}} = {\bf {Y}},\,\,{{\bf {Y}}^ + }{\bf {Y}}{{\bf {Y}}^ + } = {{\bf {Y}}^ + },\,\,{\bf {Y}}{{\bf {Y}}^ + } = {({\bf {Y}}{{\bf {Y}}^ + })^{\rm {T}}}\) and \({{\bf {Y}}^ + }{\bf {Y}} = {({{\bf {Y}}^ + }{\bf {Y}})^{\rm {T}}}\).
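As a small illustration of the SVD-based MP \(g\)-inverse used above, the following sketch (Python/NumPy; the rank-deficient \(\bf {Y}\) is an arbitrary example) builds \({{\bf {Y}}^ + } = {\bf {Q}}{{\bf {\Gamma }}^{ - 1}}{{\bf {P}}^{\rm {T}}}\) from the positive singular values and checks the four defining conditions.

```python
import numpy as np

rng = np.random.default_rng(7)
p, q, r = 6, 4, 2
Y = rng.standard_normal((p, r)) @ rng.standard_normal((r, q))   # p x q matrix of rank r < q

P, g, Qt = np.linalg.svd(Y, full_matrices=False)                # Y = P diag(g) Q^T
pos = g > 1e-10 * g.max()                                       # keep only positive singular values
Y_plus = Qt[pos].T @ np.diag(1.0 / g[pos]) @ P[:, pos].T        # Q Gamma^{-1} P^T

print(np.allclose(Y @ Y_plus @ Y, Y))                # Y Y+ Y = Y
print(np.allclose(Y_plus @ Y @ Y_plus, Y_plus))      # Y+ Y Y+ = Y+
print(np.allclose(Y @ Y_plus, (Y @ Y_plus).T))       # Y Y+ symmetric
print(np.allclose(Y_plus @ Y, (Y_plus @ Y).T))       # Y+ Y symmetric
print(np.allclose(Y_plus, np.linalg.pinv(Y)))        # agrees with numpy's pinv
```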
Neudecker (2004, Equation (2.5)) stated that “\(\bf {Q}\) arbitrary”. This is misleading since, when \(\bf {Q}\) is a zero matrix, the rank of \({\bf {X}} = {{\bf {U}}_0}{\bf {V}}_0^{\rm {T}}\) becomes \({q^*}( < q)\) and \(\bf {X}\) does not satisfy \({{\bf {X}}^{\rm {T}}}{\bf {X}} = {{\bf {I}}_q}\), though it gives the same maximum. Instead, \(\bf {Q}\) should be defined as an arbitrary \(p \times q\) suborthonormal matrix \({\bf {Q}} = {{\bf {U}}_1}{\bf {V}}_1^{\rm {T}}\) of rank \(q - {q^*}\), where \({\bf {U}}_1\) and \({\bf {V}}_1\) are \(p \times (q - {q^*})\) and \(q \times (q - {q^*})\) semiorthonormal matrices, respectively, such that \({\bf {U}}_1^{\rm {T}}{{\bf {U}}_1} = {\bf {V}}_1^{\rm {T}}{{\bf {V}}_1} = {{\bf {I}}_{q - {q^*}}}\), \({\bf {U}}_0^{\rm {T}}{{\bf {U}}_1} = {\bf {O}}\) and \({\bf {V}}_0^{\rm {T}}{{\bf {V}}_1} = {\bf {O}}\). The arbitrary property of \(\bf {Q}\) stated earlier is seen in that, when \({\bf {U}}_1\) and \({\bf {V}}_1\) are replaced by \({{\bf {U}}_1}{\bf {U}}_1^*\) and \({{\bf {V}}_1}{\bf {V}}_1^*\) with \({\bf {U}}_1^*\) and \({\bf {V}}_1^*\) being arbitrary \((q - {q^*}) \times (q - {q^*})\) orthonormal matrices, these can be used with a different \({{\bf {Q}}^*} \equiv {{\bf {U}}_1}{\bf {U}}_1^*{({{\bf {V}}_1}{\bf {V}}_1^*)^{\rm {T}}} \ne {\bf {Q}}\). Note that these arbitrary \({\bf {U}}_1\) and \({\bf {V}}_1\) give \({{\bf {V}}_0}{\bf {V}}_0^{\rm {T}} + {{\bf {V}}_1}{\bf {V}}_1^{\rm {T}} = ({{\bf {V}}_0}:{{\bf {V}}_1}){({{\bf {V}}_0}:{{\bf {V}}_1})^{\rm {T}}} = ({{\bf {V}}_0}:{{\bf {V}}_1}{\bf {V}}_1^*){({{\bf {V}}_0}:{{\bf {V}}_1}{\bf {V}}_1^*)^{\rm {T}}} = {{\bf {I}}_q}\).
Then, it is found that \[\begin {array}{l} {\bf {X}} = {{\bf {U}}_0}{\bf {V}}_0^{\rm {T}} + {\bf {Q}}({{\bf {I}}_q} - {{\bf {V}}_0}{\bf {V}}_0^{\rm {T}}) = {{\bf {U}}_0}{\bf {V}}_0^{\rm {T}} + {{\bf {U}}_1}{\bf {V}}_1^{\rm {T}}({{\bf {I}}_q} - {{\bf {V}}_0}{\bf {V}}_0^{\rm {T}})\\ \,\,\,\,\,\ = {{\bf {U}}_0}{\bf {V}}_0^{\rm {T}} + {{\bf {U}}_1}{\bf {V}}_1^{\rm {T}} = ({{\bf {U}}_0}:{{\bf {U}}_1}){({{\bf {V}}_0}:{{\bf {V}}_1})^{\rm {T}}} \end {array}\] satisfying \[\begin {array}{l} {{\bf {X}}^{\rm {T}}}{\bf {X}} = ({{\bf {V}}_0}:{{\bf {V}}_1}){({{\bf {U}}_0}:{{\bf {U}}_1})^{\rm {T}}}({{\bf {U}}_0}:{{\bf {U}}_1}){({{\bf {V}}_0}:{{\bf {V}}_1})^{\rm {T}}}\\ \,\,\,\,\,\,\,\,\,\,\,\ \ = ({{\bf {V}}_0}:{{\bf {V}}_1}){({{\bf {V}}_0}:{{\bf {V}}_1})^{\rm {T}}} = {{\bf {I}}_q}. \end {array}\] Using the above \(\bf {X}\), the maximum is given by \[\begin {array}{l} {\rm {tr}}({{\bf {A}}^{\rm {T}}}{\bf {X}}) = {\rm {tr}}\{ {({{\bf {U}}_0}{{\bf {\Lambda }}_0}{\bf {V}}_0^{\rm {T}})^{\rm {T}}}({{\bf {U}}_0}:{{\bf {U}}_1}){({{\bf {V}}_0}:{{\bf {V}}_1})^{\rm {T}}}\} \\ = {\rm {tr}}\{ {{\bf {V}}_0}{{\bf {\Lambda }}_0}{\bf {U}}_0^{\rm {T}}({{\bf {U}}_0}{\bf {V}}_0^{\rm {T}} + {{\bf {U}}_1}{\bf {V}}_1^{\rm {T}})\} \\ = {\rm {tr(}}{{\bf {V}}_0}{{\bf {\Lambda }}_0}{\bf {V}}_0^{\rm {T}}) = {\rm {tr(}}{\bf {V}}_0^{\rm {T}}{{\bf {V}}_0}{{\bf {\Lambda }}_0})\\ = {\rm {tr(}}{{\bf {\Lambda }}_0}), \end {array}\] where \({\rm {tr(}}{{\bf {\Lambda }}_0}) = {\rm {tr(}}{\bf {\Lambda }})\) and the singular diagonal matrix \(\bf {\Lambda }\) of rank \(q^*\) was used earlier. Define the semiorthonormal matrix \({\bf {U}} \equiv ({{\bf {U}}_0}:{{\bf {U}}_1})\) and orthonormal \({\bf {V}} \equiv ({{\bf {V}}_0}:{{\bf {V}}_1})\). Then, \({\bf {X}} = {\bf {U}}{{\bf {V}}^{\rm {T}}}\) is equal to that obtained by Ten Berge as shown earlier.
Recall the expression \({\bf {X}} = {{\bf {U}}_0}{\bf {V}}_0^{\rm {T}} + {\bf {Q}}({{\bf {I}}_q} - {{\bf {V}}_0}{\bf {V}}_0^{\rm {T}})\). The matrix \(\bf {Q}\) can be an arbitrary \(p \times q\) semiorthonormal one denoted by \({\bf {U}}^*\) with \({\bf {U}}_0^{\rm {T}}{{\bf {U}}^*} = {\bf {O}}\) and \({{\bf {U}}^{*{\rm {T}}}}{{\bf {U}}^*} = {{\bf {I}}_q}\). This is seen by \[\begin {array}{l} {{\bf {X}}^{\rm {T}}}{\bf {X}} = {\{ {{\bf {U}}_0}{\bf {V}}_0^{\rm {T}} + {{\bf {U}}^*}({{\bf {I}}_q} - {{\bf {V}}_0}{\bf {V}}_0^{\rm {T}})\} ^{\rm {T}}}\{ {{\bf {U}}_0}{\bf {V}}_0^{\rm {T}} + {{\bf {U}}^*}({{\bf {I}}_q} - {{\bf {V}}_0}{\bf {V}}_0^{\rm {T}})\} \\ \,\,\,\,\,\,\,\,\,\,\,\, = {{\bf {V}}_0}{\bf {U}}_0^{\rm {T}}{{\bf {U}}_0}{\bf {V}}_0^{\rm {T}} + ({{\bf {I}}_q} - {{\bf {V}}_0}{\bf {V}}_0^{\rm {T}}){{\bf {U}}^{*{\rm {T}}}}{{\bf {U}}^*}({{\bf {I}}_q} - {{\bf {V}}_0}{\bf {V}}_0^{\rm {T}})\\ \,\,\,\,\,\,\,\,\,\,\, = {{\bf {V}}_0}{\bf {V}}_0^{\rm {T}} + {{\bf {I}}_q} - {{\bf {V}}_0}{\bf {V}}_0^{\rm {T}} = {{\bf {I}}_q}, \end {array}\] satisfying the assumption, where the idempotent property \({({{\bf {I}}_q} - {{\bf {V}}_0}{\bf {V}}_0^{\rm {T}})^2} = {{\bf {I}}_q} - {{\bf {V}}_0}{\bf {V}}_0^{\rm {T}}\) is used. The matrix \(\bf {X}\) gives the same maximum \[\begin {array}{l} {\rm {tr}}({{\bf {A}}^{\rm {T}}}{\bf {X}}) = {\rm {tr}}\left [ {{{\bf {V}}_0}{{\bf {\Lambda }}_0}{\bf {U}}_0^{\rm {T}}\{ {{\bf {U}}_0}{\bf {V}}_0^{\rm {T}} + {{\bf {U}}^*}({{\bf {I}}_q} - {{\bf {V}}_0}{\bf {V}}_0^{\rm {T}})\} } \right ]\\ \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, = {\rm {tr}}({{\bf {V}}_0}{{\bf {\Lambda }}_0}{\bf {U}}_0^{\rm {T}}{{\bf {U}}_0}{\bf {V}}_0^{\rm {T}}) = {\rm {tr}}({{\bf {V}}_0}{{\bf {\Lambda }}_0}{\bf {V}}_0^{\rm {T}})\\ \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\, = {\rm {tr}}({{\bf {\Lambda }}_0}). \end {array}\]
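The rank-deficient case can be illustrated numerically. In the sketch below (Python/NumPy; constructing the semiorthonormal \({\bf {U}}^*\) by projecting a random matrix onto the orthogonal complement of \({\bf {U}}_0\) is one convenient choice), \({\bf {X}} = {{\bf {U}}_0}{\bf {V}}_0^{\rm {T}} + {{\bf {U}}^*}({{\bf {I}}_q} - {{\bf {V}}_0}{\bf {V}}_0^{\rm {T}})\) satisfies \({{\bf {X}}^{\rm {T}}}{\bf {X}} = {{\bf {I}}_q}\) and attains \({\rm {tr}}({{\bf {\Lambda }}_0})\).

```python
import numpy as np

rng = np.random.default_rng(8)
p, q, q_star = 8, 4, 2
A = rng.standard_normal((p, q_star)) @ rng.standard_normal((q_star, q))   # rank q* < q

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U0, L0, V0 = U[:, :q_star], np.diag(s[:q_star]), Vt[:q_star].T   # only positive singular values

# a p x q semiorthonormal U* orthogonal to U0 (one convenient construction)
Z = rng.standard_normal((p, q))
U_star, _ = np.linalg.qr((np.eye(p) - U0 @ U0.T) @ Z)

X = U0 @ V0.T + U_star @ (np.eye(q) - V0 @ V0.T)
print(np.allclose(X.T @ X, np.eye(q)))               # X^T X = I_q
print(np.isclose(np.trace(A.T @ X), np.trace(L0)))   # maximum tr(Lambda_0)
```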
In the expression \({\bf {X}} = {{\bf {U}}_0}{\bf {V}}_0^{\rm {T}} + {\bf {Q}}({{\bf {I}}_q} - {{\bf {V}}_0}{\bf {V}}_0^{\rm {T}})\), the term \({{\bf {V}}_0}{\bf {V}}_0^{\rm {T}}\) is the projection matrix onto the space spanned by the \(q^*\) orthonormal columns of \({\bf {V}}_0\) given by \({{\bf {V}}_0}{({\bf {V}}_0^{\rm {T}}{{\bf {V}}_0})^{ - 1}}{\bf {V}}_0^{\rm {T}} = {{\bf {V}}_0}{\bf {I}}_{q*}^{ - 1}{\bf {V}}_0^{\rm {T}} = {{\bf {V}}_0}{\bf {V}}_0^{\rm {T}}\). Recall that \({{\bf {I}}_q} = {{\bf {V}}_0}{\bf {V}}_0^{\rm {T}} + {{\bf {V}}_1}{\bf {V}}_1^{\rm {T}}\). Then, \({{\bf {I}}_q} - {{\bf {V}}_0}{\bf {V}}_0^{\rm {T}} = {{\bf {V}}_1}{\bf {V}}_1^{\rm {T}}\) is also the projection matrix onto the space spanned by the \(q - {q^*}\) orthonormal columns of \({\bf {V}}_1\). In the term \({\bf {Q}}({{\bf {I}}_q} - {{\bf {V}}_0}{\bf {V}}_0^{\rm {T}}) = {\bf {Q}}{{\bf {V}}_1}{\bf {V}}_1^{\rm {T}}\), each row of \(\bf {Q}\) is projected onto the space spanned by \({\bf {V}}_1\).
An arbitrary property of \({\bf {U}}_1\) and \({\bf {V}}_1\) in \({\bf {X}} = {\bf {U}}{{\bf {V}}^{\rm {T}}} = {{\bf {U}}_0}{\bf {V}}_0^{\rm {T}} + {{\bf {U}}_1}{\bf {V}}_1^{\rm {T}}\) of Ten Berge’s (1983) solution, similar to that of \(\bf {Q}\), is that \({\bf {U}}_1\) and \({\bf {V}}_1\) can be replaced by \({{\bf {U}}_1}{{\bf {W}}_{{\bf {U}}1}}\) and \({{\bf {V}}_1}{{\bf {W}}_{{\bf {V}}1}}\), respectively, where \({\bf {W}}_{{\bf {U}}1}\) and \({\bf {W}}_{{\bf {V}}1}\) are arbitrary \((q - {q^*}) \times (q - {q^*})\) orthonormal matrices, yielding \({{\bf {X}}^*} \equiv {{\bf {U}}_0}{\bf {V}}_0^{\rm {T}} + {{\bf {U}}_1}{{\bf {W}}_{{\bf {U}}1}}{({{\bf {V}}_1}{{\bf {W}}_{{\bf {V}}1}})^{\rm {T}}} \ne {\bf {X}} = {{\bf {U}}_0}{\bf {V}}_0^{\rm {T}} + {{\bf {U}}_1}{\bf {V}}_1^{\rm {T}}\).
For completeness, arbitrary aspects in \({\bf {A}} = {{\bf {U}}_0}{{\bf {\Lambda }}_0}{\bf {V}}_0^{\rm {T}}\) of rank \({q^*} \le q\) are noted. Under the standard definition of \({{\bf {\Lambda }}_0} = {\rm {diag}}({\lambda _1},...,{\lambda _{{q^*}}})\) with \({\lambda _1} \ge \cdots \ge {\lambda _{{q^*}}} > 0\), \({\bf {\Lambda }}_0\) is identified while \({\bf {U}}_0\) and \({\bf {V}}_0\) are identified up to the sign changes (orientations or reflections) of the pairs of their corresponding columns. Further, if some positive singular values are multiple, e.g., \({\lambda _j} = {\lambda _{j + 1}} = \cdots = {\lambda _{j + k - 1}} \equiv {\lambda _{j(k)}}\) with multiplicity \(k\,( > 1)\), we have \[\begin {array}{l} {\bf {A}} = {{\bf {U}}_{0( - k)}}{{\bf {\Lambda }}_{0( - k)}}{\bf {V}}_{0( - k)}^{\rm {T}} + {{\bf {U}}_{0(k)}}{{\bf {\Lambda }}_{0(k)}}{\bf {V}}_{0(k)}^{\rm {T}}\\ \,\,\,\,\, = {{\bf {U}}_{0( - k)}}{{\bf {\Lambda }}_{0( - k)}}{\bf {V}}_{0( - k)}^{\rm {T}} + {{\bf {U}}_{0(k)}}{\lambda _{j(k)}}{{\bf {I}}_k}{\bf {V}}_{0(k)}^{\rm {T}}\\ \,\,\,\,\, = {{\bf {U}}_{0( - k)}}{{\bf {\Lambda }}_{0( - k)}}{\bf {V}}_{0( - k)}^{\rm {T}} + {\lambda _{j(k)}}{{\bf {U}}_{0(k)}}{{\bf {W}}_{(k)}}{({{\bf {V}}_{0(k)}}{{\bf {W}}_{(k)}})^{\rm {T}}}, \end {array}\] where \({\bf {U}}_{0(k)}\) and \({\bf {V}}_{0(k)}\) are the semiorthonormal submatrices of \({\bf {U}}_0\) and \({\bf {V}}_0\), respectively, corresponding to the multiple \(\lambda _{j(k)}\), with \({\bf {\Lambda }}_{0(k)}\) defined similarly; \({\bf {U}}_{0( - k)}\), \({\bf {V}}_{0( - k)}\) and \({\bf {\Lambda }}_{0( - k)}\) are the matrices given by the remaining singular values; and \({\bf {W}}_{(k)}\) is an arbitrary \(k \times k\) orthonormal matrix. The last arbitrary property is similar to the so-called “rotational indeterminacy” in factor analysis.
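A small numerical sketch (Python/NumPy; the matrix with a deliberately repeated singular value and the rotation angle are arbitrary choices) illustrates this indeterminacy: rotating the singular-vector block belonging to a multiple singular value by an orthonormal \({\bf {W}}_{(k)}\) leaves \(\bf {A}\) unchanged.

```python
import numpy as np

rng = np.random.default_rng(9)
p, q = 6, 4
U, _ = np.linalg.qr(rng.standard_normal((p, q)))     # semiorthonormal U_0
V, _ = np.linalg.qr(rng.standard_normal((q, q)))     # orthonormal V_0
lam = np.array([3.0, 2.0, 2.0, 1.0])                 # lambda_2 = lambda_3: multiplicity 2
A = U @ np.diag(lam) @ V.T

theta = 0.7
W = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])      # an arbitrary 2 x 2 orthonormal W_(k)
U2, V2 = U.copy(), V.copy()
U2[:, 1:3] = U[:, 1:3] @ W                           # rotate only the multiple-value block
V2[:, 1:3] = V[:, 1:3] @ W

print(np.allclose(U2 @ np.diag(lam) @ V2.T, A))      # True: A is unchanged
```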