In general, I am interested in designing efficient algorithms with provable guarantees for solving real-world problems. My primary research area is computational geometry, in particular geometric optimization algorithms in high dimensions. We can often see trends or clusters in data by graphing or plotting, that is, by giving geometric form to data. As data increases in volume and complexity, giving it geometric form and then developing computational geometry algorithms remains a fruitful way to approach data analysis. For example, activity data from a smartphone or fitness tracker can be viewed as a point in thousands of dimensions whose coordinates include all positions, heart rates, etc. from an entire sequence of measurements. For better privacy, we can share summaries (rough position, duration, etc.) as points in tens of dimensions. Points from many people can be clustered to identify similar patterns, and patterns can be matched (with unreliable data identified and discarded) to recognize actions that a digital assistant could take to improve quality of life or health outcomes. Below, we briefly describe some examples and the results obtained from our research.
Clustering is one of the most fundamental problems in computer science and finds numerous applications in many different areas, such as data management, machine learning, bioinformatics, and networking. The common goal of many clustering problems is to partition a set of given data items into a number of clusters so as to minimize the total cost measured by a certain objective function. For example, the popular k-means clustering seeks k mean points to induce a partition of the given data items so that the average squared distance from each data item to its closest mean point is minimized. However, in many real-world applications, data items are often constrained or correlated, and handling such additional constraints requires a great deal of effort. In recent years, considerable attention has been paid to various types of constrained clustering problems, such as l-diversity clustering, r-gather clustering, capacitated clustering, and fair clustering. One key property exhibited by many unconstrained clustering problems is the so-called locality property, which indicates that each cluster lies entirely inside the Voronoi cell of its center. Existing algorithms for these clustering problems often rely on this property. However, due to the additional constraints, the locality property may no longer hold. We present a unified framework called Peeling-and-Enclosing (PnE), based on two standalone geometric techniques called the Simplex Lemma and the Weaker Simplex Lemma, to solve a class of constrained clustering problems without the locality property in Euclidean space, where the dimensionality of the space could be rather high.
Figure: the locality property is destroyed when data items are constrained.
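To make the loss of locality concrete, here is a tiny, hypothetical sketch (the numbers and the capacity-2 constraint are invented for illustration): three collinear points all prefer the same center, but a capacity constraint, as in capacitated clustering, forces one of them across the Voronoi boundary into the farther cluster.

```python
import numpy as np

# two centers and three points on a line
centers = np.array([0.0, 10.0])
points = np.array([1.0, 2.0, 3.0])

# unconstrained assignment: each point goes to its nearest center,
# so every cluster lies inside its center's Voronoi cell (locality holds)
nearest = np.abs(points[:, None] - centers[None, :]).argmin(axis=1)
print(nearest)    # [0 0 0]: all three points prefer center 0

# with a capacity of at most 2 points per cluster, the overflow point
# must be assigned outside its Voronoi cell (locality is destroyed)
capacity = 2
order = np.argsort(np.abs(points - centers[0]))  # closest to center 0 first
assign = nearest.copy()
assign[order[capacity:]] = 1                     # overflow goes to center 1
print(assign)     # [0 0 1]: the point at 3.0 joins the farther center
```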
Given two patterns in Euclidean space, where each pattern is represented by a (weighted) point set, the goal of geometric matching is to compute the mapping between them with the smallest matching cost (e.g., Hausdorff distance or Earth Mover's Distance). If the point sets are not fixed in the space, the problem is called "geometric alignment", which aims to find appropriate transformations (e.g., rigid or affine transformations) and minimize the matching cost simultaneously. Geometric matching and alignment are fundamental optimization problems and find many applications in the real world, such as face recognition and fingerprint alignment. Moreover, with the recent developments in representation learning, a number of machine learning problems can be embedded into Euclidean spaces and reduced to geometric alignment, such as network alignment and unsupervised bilingual learning. We provide a Fully Polynomial Time Approximation Scheme (FPTAS) for solving the geometric alignment problem under rigid transformations.
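As a small illustration of the rigid case, the sketch below is the classic Kabsch/orthogonal-Procrustes least-squares step, not our FPTAS; it assumes the point correspondences are already known, which the alignment problem itself must search over.

```python
import numpy as np

def rigid_align(P, Q):
    """Best rotation R and translation t minimizing ||(P @ R.T + t) - Q||_F,
    given row-wise corresponding point sets P and Q (n x d arrays)."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)          # centroids
    H = (P - cP).T @ (Q - cQ)                        # d x d cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.eye(H.shape[0])
    D[-1, -1] = np.sign(np.linalg.det(Vt.T @ U.T))   # exclude reflections
    R = Vt.T @ D @ U.T
    t = cQ - R @ cP
    return R, t

# sanity check on a synthetic rigid motion
rng = np.random.default_rng(0)
P = rng.random((50, 3))
R_true = np.linalg.qr(rng.normal(size=(3, 3)))[0]
if np.linalg.det(R_true) < 0:                        # make it a proper rotation
    R_true[:, 0] *= -1
Q = P @ R_true.T + np.array([1.0, -2.0, 0.5])
R, t = rigid_align(P, Q)
print(np.allclose(P @ R.T + t, Q))                   # True
```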
We also study a new geometric optimization problem called the "geometric prototype" in Euclidean space. Given a set of point sets, the geometric prototype can be viewed as the "average pattern" minimizing the total matching cost to them. As a general model, the problem finds many applications in the real world, such as computing Wasserstein barycenters and ensemble clustering. To the best of our knowledge, the general geometric prototype problem has yet to be seriously considered by the theory community. To bridge the gap between theory and practice, we show that a small core-set can be obtained to substantially reduce the data size. Consequently, any existing heuristic or algorithm can run on the core-set to achieve a substantial improvement in efficiency.
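To illustrate only the idea, the sketch below uses plain uniform sampling and a simple nearest-neighbor matching cost as stand-ins (the core-set construction and the matching costs in our work are more refined): a candidate prototype can be evaluated on a small reweighted sample instead of the full data.

```python
import numpy as np

rng = np.random.default_rng(0)
# 1,000 patterns, each a point set of 30 points in the plane
patterns = [rng.normal(size=(30, 2)) for _ in range(1000)]

def matching_cost(proto, pts):
    # nearest-neighbor matching cost, a simple stand-in for e.g. EMD
    d = np.linalg.norm(proto[:, None, :] - pts[None, :, :], axis=2)
    return d.min(axis=1).sum()

def total_cost(proto, pats):
    return sum(matching_cost(proto, p) for p in pats)

# hypothetical core-set: a uniform sample of the patterns, reweighted
core = [patterns[i] for i in rng.choice(len(patterns), size=50, replace=False)]
scale = len(patterns) / len(core)

proto = rng.normal(size=(30, 2))            # any candidate prototype
full = total_cost(proto, patterns)          # cost on the full data
est = scale * total_cost(proto, core)       # cheap estimate from the sample
print(full, est)                            # the two should be close
```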
In the era of big data, we are confronted with extremely large amounts of data, and it is important to develop efficient algorithmic techniques to handle the practical issues that arise. Owing to its rapid recent development, machine learning has become a powerful tool for many emerging applications; meanwhile, the quality of the training dataset often plays a key role and seriously affects the final learning result. For example, we can collect tremendous amounts of data (e.g., text or images) from the internet; however, the obtained dataset often contains a significant number of outliers. Since manually removing outliers is very costly, it is necessary in many scenarios to develop efficient algorithms for recognizing outliers automatically. Outlier recognition is a typical unsupervised learning problem where the given data are unlabeled; thus we can only model it as an optimization problem based on some assumption that is reasonable in practice. For instance, it is natural to assume that the inliers (i.e., the normal data) lie in some dense region while the outliers are scattered in the feature space. We study several fundamental optimization problems with outliers, such as clustering, classification, and linear regression, and design provable algorithms for them.
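As one example of the flavor of these problems, consider k-means clustering with z outliers. The sketch below is a common trimmed-Lloyd heuristic baseline, not the provable algorithm from our work: in each round the z points farthest from their nearest centers are set aside before the centers are recomputed.

```python
import numpy as np

def kmeans_with_outliers(X, k, z, iters=20, seed=0):
    """Trimmed-Lloyd heuristic for k-means with z outliers (a baseline
    sketch, not a provable algorithm). X is an (n, d) array."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        nearest, dist = d.argmin(axis=1), d.min(axis=1)
        inliers = np.argsort(dist)[: len(X) - z]   # drop the z farthest points
        for j in range(k):
            members = inliers[nearest[inliers] == j]
            if len(members) > 0:                   # keep old center if cluster empties
                centers[j] = X[members].mean(axis=0)
    outliers = np.argsort(dist)[len(X) - z:]
    return centers, outliers
```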
Truth discovery is an important problem arising in data analytics and related fields such as data mining, databases, and big data. It concerns finding the most trustworthy information in a dataset acquired from a number of unreliable sources. Due to its importance, the problem has been extensively studied in recent years and a number of techniques have been proposed. However, all of them are of a heuristic nature and do not have any quality guarantee. We formulate the problem as a high-dimensional geometric optimization problem, called Entropy-based Geometric Variance. Relying on a number of novel geometric techniques (such as the Log-Partition and Modified Simplex Lemma), we further discover new insights into this problem. We show, for the first time, that the truth discovery problem can be solved with guaranteed solution quality. In particular, we show that it is possible to achieve a (1+eps)-approximation in nearly linear time under some reasonable assumptions. We expect that our algorithm will be useful for other data-related applications.
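For intuition, the classic heuristic scheme for truth discovery alternates between estimating the truths and re-weighting the sources; the sketch below implements that folklore loop, not our entropy-based geometric formulation or its analysis.

```python
import numpy as np

def truth_discovery(claims, iters=50):
    """claims: (num_sources, num_items) array of values reported by each
    source. Alternately estimate the truth as a reliability-weighted
    average per item, then update each source's weight from its distance
    to the current truth (smaller error means larger weight)."""
    m, _ = claims.shape
    w = np.ones(m) / m
    for _ in range(iters):
        truth = (w @ claims) / w.sum()                     # weighted average per item
        err = ((claims - truth) ** 2).sum(axis=1) + 1e-12  # each source's total error
        w = -np.log(err / err.sum())                       # common log-ratio re-weighting
    return truth, w
```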
Computing accurate and robust organizational patterns of chromosome territories inside the cell nucleus is critical for understanding several fundamental genomic processes, such as co-regulation of gene activation, gene silencing, X chromosome inactivation, and abnormal chromosome rearrangement in cancer cells. The use of advanced fluorescence labeling and image processing techniques has enabled researchers to investigate interactions of chromosome territories at high spatial resolution. The resulting high volume of generated data demands high-throughput and automated image analysis methods. We introduce a novel algorithmic tool for investigating association patterns of chromosome territories in a population of cells. Our method takes as input a set of graphs, one for each cell, containing information about the spatial interactions of chromosome territories, and yields a single graph that contains the essential information for the whole population and stands as its structural representative. We formulate this combinatorial problem as a semidefinite program and present novel geometric techniques to solve it efficiently.
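As a much-simplified stand-in for the idea (our actual method solves the semidefinite program rather than thresholding), a representative graph could be built by keeping the interactions that occur in a large fraction of cells; the threshold tau below is an invented parameter.

```python
import numpy as np

def representative_graph(cells, tau=0.5):
    """cells: list of (n, n) 0/1 adjacency matrices, one per cell, where an
    edge means two chromosome territories interact in that cell. Keeps each
    edge present in at least a tau fraction of the cells."""
    freq = np.mean(np.stack(cells), axis=0)   # per-edge frequency across cells
    return (freq >= tau).astype(int)

# synthetic demo: 100 cells, 8 territories each
rng = np.random.default_rng(0)
cells = [(rng.random((8, 8)) < 0.3).astype(int) for _ in range(100)]
cells = [np.triu(a, 1) + np.triu(a, 1).T for a in cells]  # symmetrize
print(representative_graph(cells, tau=0.4))
```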