Patrice Koehl
Department of Computer Science
Genome Center
Room 4319, Genome Center, GBSF
451 East Health Sciences Drive
University of California
Davis, CA 95616
Phone: (530) 754 5121

Current Projects: 2018-

A data-driven approach to characterizing wine regionality: Applications to Malbec wines

Unraveling the regional specificities of Malbec wines from Argentina and Northern California

Hsieh Fushing, Olivia Lee, Constantin Heitkamp, Hildegarde Heymann, Susan E. Ebeler, Roger B. Boulton, Patrice Koehl

In any scientific setting, experiments are devices used to provide insight into the relationships between the features that control a system and the observations made on that system. Analyzing the relationships between features and observations for the set of objects under study can be complicated by additional relationships among the features themselves. In this paper, we propose a new approach to the analysis of an experiment that relies on these relationships instead of trying to circumvent them. The approach follows two steps. We first cluster the objects of the experiment using each feature independently. We then define the distance between two features as the mutual entropy of the clustering results they generate, and cluster the set of features using this distance measure. The result of this clustering is a set of sub-groups of features, such that two features in the same sub-group carry similar, i.e. synergetic, information with respect to the objects of the experiment. The objects are then analyzed separately on the different sub-groups of features, using the recently proposed Data Mechanics approach. We have used this method to analyze the similarities and differences between Malbec wines from Argentina and California, as well as the similarities and differences between sub-regions of those two main wine-producing regions. We report the detection of groups of features that characterize the origins of the different wines included in the study.
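The two steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the gap-based one-dimensional clustering, the use of 1 minus normalized mutual information as the distance between features, the threshold-based grouping, and all function names and toy values are assumptions made for illustration only.

```python
import math
from collections import Counter


def cluster_1d(values, k=2):
    """Cluster objects on one feature by cutting the sorted values at the
    k-1 largest gaps (an assumed stand-in for any per-feature clustering)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    gap_positions = sorted(
        range(1, len(order)),
        key=lambda j: values[order[j]] - values[order[j - 1]],
        reverse=True,
    )[: k - 1]
    cuts = set(gap_positions)
    labels = [0] * len(values)
    current = 0
    for pos, idx in enumerate(order):
        if pos in cuts:
            current += 1
        labels[idx] = current
    return labels


def entropy(labels):
    """Shannon entropy (in nats) of a clustering."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two clusterings of the same objects."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
        for (x, y), c in pab.items()
    )


def feature_distance(a, b):
    """Assumed distance: 1 - normalized mutual information.
    0 when both clusterings define the same partition, close to 1 when
    they are nearly independent."""
    denom = max(entropy(a), entropy(b))
    if denom == 0:
        return 0.0
    return 1.0 - mutual_information(a, b) / denom


def group_features(labelings, threshold=0.5):
    """Group features whose distance is below a threshold (single linkage)."""
    groups = []
    for name in labelings:
        close = [
            g for g in groups
            if any(feature_distance(labelings[name], labelings[o]) <= threshold
                   for o in g)
        ]
        merged = [name]
        for g in close:
            merged = g + merged
            groups.remove(g)
        groups.append(sorted(merged))
    return groups


# Toy data: six objects measured on three hypothetical features.
features = {
    "f1": [1.0, 1.1, 0.9, 5.0, 5.2, 4.9],
    "f2": [10, 11, 9, 30, 31, 29],
    "f3": [1, 7, 1.2, 7.1, 0.9, 7.2],
}
labelings = {name: cluster_1d(vals) for name, vals in features.items()}
groups = group_features(labelings)
print(groups)  # [['f1', 'f2'], ['f3']]
```

Here f1 and f2 split the objects in the same way and end up in one sub-group, while f3 induces a nearly independent partition and forms its own sub-group. Each sub-group would then be analyzed separately.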

All data for this study:

Data Cloud Geometry: generating an ultrametric distance measure on data

DCG++: A data-driven metric for geometric pattern recognition

Jiahui Guan, Hsieh Fushing, Patrice Koehl

Clustering large and complex data sets whose partitions may adopt arbitrary shapes remains a difficult challenge. Part of this challenge comes from the difficulty of defining a similarity measure between data points that captures their underlying geometry. In this project, we propose an algorithm, DCG++, that generates a similarity measure that is both data-driven and ultrametric. DCG++ uses Markov Chain Random Walks to capture the intrinsic geometry of the data, scans possible scales, and combines all this information using a simple procedure that is shown to generate an ultrametric. We validate the effectiveness of this similarity measure on synthetic data with complex geometry, on a real-world data set, and on an image segmentation problem. The experimental results show a significant improvement in performance with the DCG-based ultrametric compared with an empirical distance measure.
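To make the ultrametric property concrete, the sketch below checks the strong triangle inequality d(x, z) <= max(d(x, y), d(y, z)) on a small point set. It does not reproduce DCG++: as a plainly different and much simpler construction, it uses the minimax path distance (the largest step on the best path between two points), which is a standard way to turn an empirical distance into an ultrametric. The function names and the toy points are assumptions for illustration.

```python
import math
from itertools import permutations


def pairwise(points):
    """Empirical (Euclidean) distance matrix between points."""
    n = len(points)
    return [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]


def minimax_ultrametric(points):
    """Minimax path distance: for each pair, the smallest possible value of
    the largest step along any path joining them (Floyd-Warshall variant).
    This is a stand-in construction, not the DCG++ procedure."""
    d = pairwise(points)
    n = len(d)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                d[i][j] = min(d[i][j], max(d[i][k], d[k][j]))
    return d


def is_ultrametric(d, tol=1e-12):
    """Check d(i, j) <= max(d(i, k), d(k, j)) for all distinct triples."""
    n = len(d)
    return all(
        d[i][j] <= max(d[i][k], d[k][j]) + tol
        for i, j, k in permutations(range(n), 3)
    )


# Two well-separated groups of points on a line (toy data).
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
euclid = pairwise(points)
ultra = minimax_ultrametric(points)

print(is_ultrametric(euclid))  # False: d(0, 2) = 2 > max(d(0, 1), d(1, 2)) = 1
print(is_ultrametric(ultra))   # True
print(ultra[0][2], ultra[0][4])  # 1.0 8.0
```

Under the ultrametric, all points within a group are at distance 1 from one another and every cross-group pair is at distance 8, so the two groups separate cleanly at any threshold between those values.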

All programs for this study:
  • DCG.tgz: Compressed archive for source code under LGPL license
  • README: simple README for compiling and running the program

  Page last modified 28 May 2018