
A data set has two classes and each class follows a two-dimensional Gaussian distribution

ID: 2080217 • Letter: A

Question

A data set has two classes, and each class follows a two-dimensional Gaussian distribution:

C_0: x = [x_1, x_2]^T = [0, 0]^T + n_1
C_1: x = [x_1, x_2]^T = [2, 2]^T + n_2

The noise vectors n_1 and n_2 have mean vectors and covariance matrices

m_{n_1} = [0, 0]^T,  Σ_{n_1} = [[0.25, 0], [0, 0.25]]
m_{n_2} = [0, 0]^T,  Σ_{n_2} = [[1, 0], [0, 1]]

(a) Show the conditional probability distributions p(x|C_i), i = 0, 1.
(b) State the MAP criterion for classification. Using p(C_i|x) = p(x|C_i) p(C_i) / p(x), rewrite the MAP criterion in terms of the likelihood ratio Λ(x), and then in terms of the log-likelihood ratio.
(c) Derive the log-likelihood ratio L(x) = ln [ p(x|C_0) / p(x|C_1) ].
(d) Derive the decision boundary using the MAP criterion, assuming p(C_0) = p(C_1) = 1/2.

Explanation / Answer

In a statistical-classification problem with two classes, a decision boundary or decision surface is a hypersurface that partitions the underlying vector space into two sets, one for each class. The classifier will classify all the points on one side of the decision boundary as belonging to one class and all those on the other side as belonging to the other class.

A decision boundary is the region of a problem space in which the output label of a classifier is ambiguous.[1]

If the decision surface is a hyperplane, then the classification problem is linear, and the classes are linearly separable.

Decision boundaries are not always clear cut. That is, the transition from one class in the feature space to another is not discontinuous, but gradual. This effect is common in fuzzy-logic-based classification algorithms, where membership in one class or another is ambiguous.
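Returning to the specific problem in the question, a standard Gaussian MAP derivation runs roughly as follows (a sketch using the means and covariances given above; the final form of the boundary can of course be written in other equivalent ways).

\[
p(x \mid C_0) = \frac{1}{2\pi(0.25)}\exp\!\Big(-\frac{x_1^2 + x_2^2}{2(0.25)}\Big),
\qquad
p(x \mid C_1) = \frac{1}{2\pi}\exp\!\Big(-\frac{(x_1-2)^2 + (x_2-2)^2}{2}\Big).
\]

The MAP criterion "decide C_0 if p(C_0|x) > p(C_1|x)" becomes, via Bayes' rule,

\[
\Lambda(x) = \frac{p(x \mid C_0)}{p(x \mid C_1)} \;\underset{C_1}{\overset{C_0}{\gtrless}}\; \frac{p(C_1)}{p(C_0)},
\qquad
L(x) = \ln \Lambda(x) \;\underset{C_1}{\overset{C_0}{\gtrless}}\; \ln\frac{p(C_1)}{p(C_0)}.
\]

Substituting the two Gaussian densities,

\[
L(x) = \ln 4 + 4 - \tfrac{3}{2}\,(x_1^2 + x_2^2) - 2\,(x_1 + x_2).
\]

With p(C_0) = p(C_1) = 1/2 the threshold is 0, and L(x) = 0 rearranges to

\[
\Big(x_1 + \tfrac{2}{3}\Big)^2 + \Big(x_2 + \tfrac{2}{3}\Big)^2 = \frac{32 + 6\ln 4}{9} \approx 4.48,
\]

a circle centered at (-2/3, -2/3) with radius about 2.12. Points inside the circle are assigned to C_0 and points outside to C_1; because the two covariance matrices differ, the boundary is quadratic rather than a hyperplane.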

Dimension reduction is widely accepted as an analysis and modeling tool for dealing with high-dimensional spaces. There are several reasons to keep the dimension as low as possible: for instance, to reduce system complexity, to avoid the curse of dimensionality, and to enhance data understanding. In general, dimension reduction can be defined as the search for a low-dimensional linear or nonlinear subspace that preserves some intrinsic properties of the original high-dimensional data. However, different applications have different preferences about which properties should be preserved in the reduction process. We can identify at least three cases:

1. Visualization and exploration, where the challenge is to embed a set of high-dimensional observations into a low-dimensional Euclidean space that preserves as closely as possible their intrinsic global/local metric structure [1], [2] and [3].

2. Regression, in which the goal is to reduce the dimension of the predictor vector with the minimum loss in its capacity to infer the conditional distribution of the response variable [4], [5] and [6].

3. Classification, where we seek reductions that minimize the lowest attainable classification error in the transformed space [7].

Such disparate interpretations can strongly influence the design and choice of an appropriate dimension reduction algorithm for a given task, as far as optimality is concerned.

In this paper we study the problem of dimensionality reduction for classification, which is commonly referred to as feature extraction in the pattern recognition literature [8] and [9]. In particular, we restrict ourselves to linear dimension reduction, i.e., seeking a linear mapping that minimizes the lowest attainable classification error, the Bayes error, in the reduced subspace. A linear mapping is mathematically tractable and computationally simple, with a certain regularization ability that sometimes makes it outperform nonlinear models. In addition, it may be extended nonlinearly, for example, through global coordination of local linear models (e.g., Refs. [10] and [11]) or kernel mapping (e.g., Refs. [12] and [13]).

PCA, ICA and LDA are typical linear dimension reduction techniques used in the pattern recognition community, which simultaneously generate a set of nested subspaces of all possible dimensions. However, they are not directly related to classification accuracy, since their optimality criteria are based on variance, independence and likelihood. Various other dimension reduction methods have also been proposed that aim to better reflect the classification goal by iteratively optimizing some criterion that either approximates or bounds the Bayes error in the reduced subspace [7], [14], [15], [16], [17] and [18]. Such methods invariably assume a given output dimension and usually suffer from local minima. Even if one can find the optimal solution for a given dimension, several questions remain. How much discriminative information is lost in the reduction process? Which dimension should we choose next to get a better reduction? What is the smallest possible subspace that loses nothing from the original space as far as classification accuracy is concerned? Is there any efficient way to estimate this critical subspace other than the brute-force approach of enumerating the optimal subspace for every possible dimension? The motivation for the present work is to explore possible answers to these questions.
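As a concrete contrast between a variance-based criterion and a class-based one, here is a minimal sketch (our own illustration, not from the paper; the scikit-learn calls and synthetic data are assumptions) in which the top PCA direction carries almost no class information while the LDA direction does:

```python
# Sketch: contrast a variance-driven reduction (PCA) with a class-driven one (LDA)
# on synthetic two-class Gaussian data where the class shift lies along the
# lowest-variance coordinate.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
scales = np.linspace(3.0, 0.5, 10)          # per-coordinate standard deviations
shift = np.zeros(10); shift[-1] = 2.0        # class mean shift on the last coordinate
X0 = rng.normal(scale=scales, size=(n, 10))
X1 = rng.normal(scale=scales, size=(n, 10)) + shift
X = np.vstack([X0, X1]); y = np.r_[np.zeros(n), np.ones(n)]

# PCA keeps the highest-variance direction, which here carries little class information.
Z_pca = PCA(n_components=1).fit_transform(X)
# LDA keeps the direction maximizing Fisher's criterion (between/within class scatter).
Z_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)

clf = LogisticRegression()
print("1-D PCA accuracy:", cross_val_score(clf, Z_pca, y, cv=5).mean())
print("1-D LDA accuracy:", cross_val_score(clf, Z_lda, y, cv=5).mean())
```

On data like this the PCA projection classifies at roughly chance level while the LDA projection separates the classes well, which is the sense in which variance-, independence- and likelihood-based criteria need not track classification accuracy.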

For recognition tasks, finding lower-dimensional feature subspaces without loss of discriminative information is especially attractive. We call this process sufficient dimension reduction, borrowing terminology from regression graphics [6]. Knowledge of the smallest sufficient subspace enables the classifier designer to gain a deeper understanding of the problem at hand, and thus to carry out the classification in a more effective manner. However, among existing dimension reduction algorithms, few have formally incorporated the notion of sufficiency [19].

In the first part of this paper, we formulate the concept of a sufficient subspace for classification in terms parallel to those used for regression [6]. Our initial attempt is to explore a potential parallelism between classification and regression on the common problem of sufficient dimension reduction. In the second part, we discuss how to estimate the smallest sufficient subspace, or more formally, the intrinsic discriminative subspace (IDS). Decision boundary analysis (DBA), originally proposed by Lee and Landgrebe in 1993 [19], is a technique that promises, in theory, to recover the true IDS. Unfortunately, the conditions under which their method works appear to be quite restrictive [20]. The main weakness of DBA is its dependence on nonparametric functional estimation in the full-dimensional space, which is a hard problem due to the curse of dimensionality. Similar problems have been observed in average derivative estimation (ADE) [21] and [22], a dimension reduction technique for regression analogous to DBA for classification.

However, the recent discovery and elaboration of kernel methods for classification and regression suggest that learning in very high dimensions is not necessarily a mistake. Several successful algorithms (e.g., Refs. [23], [24] and [25]) rely directly on the intrinsic generalization ability of kernel machines in high-dimensional spaces. In the same spirit, we will show in this paper that the marriage of DBA and kernel methods can lead to a superior reduction algorithm that shares the appealing properties of both. More precisely, we propose to combine DBA with support vector machines (SVM), a powerful kernel-based learning algorithm that has been successfully applied in many applications. The resulting SVM–DBA algorithm overcomes the difficulty DBA faces in small-sample-size situations while retaining DBA's simplicity with respect to IDS estimation. Thanks to the compact representation of the SVM, our algorithm also achieves a significant gain in both estimation accuracy and computational efficiency over previous DBA implementations. From another perspective, the proposed method can be seen as a natural way to reduce the run-time complexity of the SVM itself.
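To make the idea concrete, here is a rough sketch of how a DBA-style subspace estimate can be driven by an SVM decision function. This is our own simplified reading, not the authors' implementation; the RBF kernel, the bisection boundary search, and all function names are illustrative choices.

```python
# Rough DBA-with-SVM sketch: sample points near the SVM decision boundary,
# estimate unit normals there (gradients of the decision function), and
# eigendecompose their average outer product. Eigenvectors with large
# eigenvalues span the estimated discriminative subspace.
import numpy as np
from sklearn.svm import SVC

def _bisect_to_boundary(f, a, b, iters=30):
    # Assumes f(a) and f(b) have opposite signs; returns a point near f(x) = 0.
    for _ in range(iters):
        m = 0.5 * (a + b)
        if f(a) * f(m) <= 0:
            b = m
        else:
            a = m
    return 0.5 * (a + b)

def _numeric_grad(f, x, eps=1e-4):
    # Central-difference gradient of a scalar function f at x.
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        g[j] = (f(x + e) - f(x - e)) / (2.0 * eps)
    return g

def svm_dba_subspace(X, y, n_pairs=200, seed=0):
    """Return (eigenvalues, eigenvectors) of the estimated boundary-normal scatter."""
    svm = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X, y)
    f = lambda x: svm.decision_function(x.reshape(1, -1))[0]
    rng = np.random.default_rng(seed)
    X0, X1 = X[y == 0], X[y == 1]
    M = np.zeros((X.shape[1], X.shape[1]))
    count = 0
    for _ in range(n_pairs):
        a = X0[rng.integers(len(X0))].astype(float)
        b = X1[rng.integers(len(X1))].astype(float)
        if f(a) * f(b) >= 0:       # pair does not straddle the boundary; skip it
            continue
        p = _bisect_to_boundary(f, a, b)
        n_vec = _numeric_grad(f, p)
        norm = np.linalg.norm(n_vec)
        if norm < 1e-12:
            continue
        M += np.outer(n_vec / norm, n_vec / norm)
        count += 1
    M /= max(count, 1)
    eigvals, eigvecs = np.linalg.eigh(M)
    order = np.argsort(eigvals)[::-1]  # sort by decreasing eigenvalue
    return eigvals[order], eigvecs[:, order]
```

Projecting the data onto the leading eigenvectors gives the reduced representation; deciding how many eigenvectors to keep is exactly the "which dimension" question raised earlier.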

2. Brief review of existing linear dimension reduction methods

There are two basic approaches to dimensionality reduction: supervised and unsupervised. In the context of classification, a supervised approach is generally believed to be more effective. However, there is strong evidence that this is not always true (e.g., PCA and ICA might outperform LDA in face identification [26] and [27]). In this paper, we focus on supervised methods. According to the choice of criterion function, we further divide supervised methods into likelihood-based and error-based categories.

LDA is a time-honored reduction tool that maximizes Fisher's criterion (the ratio of between-class to within-class variance). LDA has been shown to be equivalent to the maximum-likelihood solution of a Gaussian model subject to an equal within-class covariance constraint and a reduced-rank constraint on the class centroids [28]. This likelihood-based interpretation of Fisher's criterion has led to several recent proposals. As the name suggests, heteroscedastic discriminant analysis (HDA [29] and [30]) allows unequal within-class covariances. When a diagonal covariance model is assumed, a special case of HDA called the maximum likelihood linear transform (MLLT [31]) can be used to make the diagonal constraint better supported by the data. Mixture discriminant analysis (MDA [32]) and nonparametric discriminant analysis (NDA [33]) extend LDA to non-Gaussian distributions and thus offer greater flexibility. Penalized discriminant analysis (PDA [34]) is designed for situations with highly correlated features, such as sampled time series or gray-scale pixel values, where a spatial smoothness constraint is imposed on the LDA coefficients. However, likelihood-based methods are not directly tied to the classification error. Moreover, though LDA itself can be formulated as a generalized eigenvalue problem, the extensions above often require iterative computation.
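For reference, Fisher's criterion mentioned above has the standard textbook form

\[
J(W) = \frac{\big|W^\top S_B W\big|}{\big|W^\top S_W W\big|},
\qquad
S_W = \sum_{c}\sum_{x_i \in c}(x_i - \mu_c)(x_i - \mu_c)^\top,
\qquad
S_B = \sum_{c} n_c\,(\mu_c - \mu)(\mu_c - \mu)^\top,
\]

and its maximizers are the leading solutions of the generalized eigenvalue problem \(S_B w = \lambda S_W w\), which is why LDA reduces to an eigendecomposition rather than an iterative optimization.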