Prof. Dr. Katharina Jutta Morik

Machine Learning
TU Dortmund University

  • Dynamic route planning with real-time traffic predictions
    Liebig, T. and Piatkowski, N. and Bockermann, C. and Morik, K.
    Information Systems 64 (2017)
    Situation aware route planning gathers increasing interest as cities become crowded and jammed. We present a system for individual trip planning that incorporates future traffic hazards in routing. Future traffic conditions are computed by a Spatio-Temporal Random Field based on a stream of sensor readings. In addition, our approach estimates traffic flow in areas with low sensor coverage using a Gaussian Process Regression. The conditioning of spatial regression on intermediate predictions of a discrete probabilistic graphical model allows us to incorporate historical data, streamed online data and a rich dependency structure at the same time. We demonstrate the system with a real-world use-case from Dublin city, Ireland. © 2016 Elsevier Ltd
    DOI: 10.1016/j.is.2016.01.007
  • Mining Urban Data (Part C)
    Andrienko, G. and Gunopulos, D. and Ioannidis, Y. and Kalogeraki, V. and Katakis, I. and Morik, K. and Verscheure, O.
    Information Systems 64 (2017)
    Modern cities generate a flood of rich and varied data. New information sources like public transport and wearable devices provide opportunities for novel applications that will improve citizens' quality of life by reducing transportation time, enhancing city planning, and improving air quality to name a few applications. From a data science perspective, data emerging from smart cities give rise to a lot of challenges that constitute a new interdisciplinary field of research. This article introduces the third part of a special issue on the topic ‘Mining Urban Data’ published in the journal Information Systems. © 2016 Elsevier Ltd
    DOI: 10.1016/j.is.2016.09.003
  • The PRIMPING routine—Tiling through proximal alternating linearized minimization
    Hess, S. and Morik, K. and Piatkowski, N.
    Data Mining and Knowledge Discovery 31 (2017)
    Mining and exploring databases should provide users with knowledge and new insights. Tiles of data strive to unveil true underlying structure and distinguish valuable information from various kinds of noise. We propose a novel Boolean matrix factorization algorithm to solve the tiling problem, based on recent results from optimization theory. In contrast to existing work, the new algorithm minimizes the description length of the resulting factorization. This approach is well known for model selection and data compression, but not for finding suitable factorizations via numerical optimization. We demonstrate the superior robustness of the new approach in the presence of several kinds of noise and types of underlying structure. Moreover, our general framework can work with any cost measure having a suitable real-valued relaxation. Thereby, no convexity assumptions have to be met. The experimental results on synthetic data and image data show that the new method identifies interpretable patterns which explain the data almost always better than the competing algorithms. © 2017, The Author(s).
    DOI: 10.1007/s10618-017-0508-z
  • INSIGHT: Dynamic traffic management using heterogeneous urban data
    Panagiotou, N. and Zygouras, N. and Katakis, I. and Gunopulos, D. and Zacheilas, N. and Boutsis, I. and Kalogeraki, V. and Lynch, S. and O’Brien, B. and Kinane, D. and Mareček, J. and Yu, J.Y. and Verago, R. and Daly, E. and Piatkowski, N. and Liebig, T. and Bockermann, C. and Morik, K. and Schnitzler, F. and Weidlich, M. and Gal, A. and Mannor, S. and Stange, H. and Halft, W. and Andrienko, G.
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 9853 LNCS (2016)
    In this demo we present INSIGHT, a system that provides traffic event detection in Dublin by exploiting Big Data and Crowdsourcing techniques. Our system is able to process and analyze input from multiple heterogeneous urban data sources. © Springer International Publishing AG 2016.
    DOI: 10.1007/978-3-319-46131-1_5
  • Integer undirected graphical models for resource-constrained systems
    Piatkowski, N. and Lee, S. and Morik, K.
    Neurocomputing 173 (2016)
    Machine learning on resource-constrained ubiquitous devices suffers from high energy consumption and slow execution. The number of clock cycles that is consumed by arithmetic instructions has an immediate impact on both. In computer systems, the number of consumed cycles depends on particular operations and the types of their operands. We propose a new class of probabilistic graphical models that approximates the full joint probability distribution of discrete multivariate random variables by relying only on integer addition/multiplication and binary bit shift operations. This allows us to sample from high-dimensional generative models and to use structured discriminative classifiers even on computational devices with slow floating point units or in situations where energy has to be saved. While theory and experiments on random synthetic data suggest that hard instances (leading to a large approximation error) exist, experiments on benchmark and real-world data show that the integer models achieve qualitatively the same results as their double-precision counterparts. Moreover, clock cycle consumption on two hardware platforms is regarded, where our results show that resource savings due to integer approximation is even larger on low-end hardware. The integer models consume half of the clock cycles and a small fraction of memory compared to ordinary undirected graphical models. © 2015 Elsevier B.V.
    DOI: 10.1016/j.neucom.2015.01.091
  • Interpretable domain adaptation via optimization over the Stiefel manifold
    Pölitz, C. and Duivesteijn, W. and Morik, K.
    Machine Learning 104 (2016)
    In domain adaptation, the goal is to find common ground between two, potentially differently distributed, data sets. By finding common concepts present in two sets of words pertaining to different domains, one could leverage the performance of a classifier for one domain for use on the other domain. We propose a solution to the domain adaptation task, by efficiently solving an optimization problem through Stochastic Gradient Descent. We provide update rules that allow us to run Stochastic Gradient Descent directly on a matrix manifold: the steps compel the solution to stay on the Stiefel manifold. This manifold encompasses projection matrices of word vectors onto low-dimensional latent feature representations, which allows us to interpret the results: the rotation magnitude of the word vector projection for a given word corresponds to the importance of that word towards making the adaptation. Beyond this interpretability benefit, experiments show that the Stiefel manifold method performs better than state-of-the-art methods. © 2016, The Author(s).
    DOI: 10.1007/s10994-016-5577-5
  • Mining Urban Data (Part B)
    Andrienko, G. and Gunopulos, D. and Ioannidis, Y. and Kalogeraki, V. and Katakis, I. and Morik, K. and Verscheure, O.
    Information Systems 57 (2016)
    Modern cities are flooded with data. New information sources like public transport and wearable devices provide opportunities for novel applications that will improve citizens' quality of life. From a data science perspective, data emerging from smart cities give rise to a lot of challenges that constitute a new interdisciplinary field of research. This article introduces the second part of a special issue on the topic 'Mining Urban Data' published in the journal Information Systems. © 2016 Published by Elsevier Ltd.
    DOI: 10.1016/j.is.2016.01.001
  • Predictive process monitoring based on distributed sensor data
    Wiegand, M. and Stolpe, M. and Deuse, J. and Morik, K.
    At-Automatisierungstechnik 64 (2016)
    This paper presents a concept for predictive process monitoring based on real-time analysis of distributed sensor data with means of machine learning. To that end the paper proposes a systematic procedure for data preparation and analysis allowing for the prediction of final product quality. © 2016 Walter de Gruyter Berlin/Boston.
    DOI: 10.1515/auto-2016-0013
  • Resource-aware steel production through data mining
    Blom, H. and Morik, K.
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 9853 LNCS (2016)
    Today’s steel industry is characterized by overcapacity and increasing competitive pressure. There is a need for continuously improving processes, with a focus on consistent enhancement of efficiency, improvement of quality and thereby better competitiveness. About 70% of steel is produced using the BF-BOF (Blast Furnace-Basic Oxygen Furnace) route worldwide. The BOF is the first step of controlling the composition of the steel and has an impact on all further processing steps and the overall quality of the end product. Multiple sources of process-related variance and overall harsh conditions for sensors and automation systems in general lead to a process complexity that is not easy to model with thermodynamic or metallurgical approaches. In this paper we want to give insight into how to improve the output quality with machine learning based modeling and which constraints and requirements are necessary for an online application in real-time. © Springer International Publishing AG 2016.
    DOI: 10.1007/978-3-319-46131-1_31
  • Sustainable industrial processes by embedded real-time quality prediction
    Stolpe, M. and Blom, H. and Morik, K.
    Studies in Computational Intelligence 645 (2016)
    Sustainability of industrial production focuses on minimizing greenhouse gas emissions and the consumption of materials and energy. The iron and steel production offers an enormous potential for resource savings through production enhancements. This chapter describes how embedding data analysis (data mining, machine learning) enhances steel production such that resources are saved. The steps of embedded data analysis are comprehensively presented giving an overview of related work. The challenges of (steel) production for data analysis are investigated. A framework for processing data streams is used for real-time processing. We have developed new algorithms that learn from aggregated data and from vertically distributed data. Two real-world case studies are described: the prediction of the Basic Oxygen Furnace endpoint and the quality prediction in a hot rolling mill process. Both case studies are not academic prototypes, but truly real-world applications. © Springer International Publishing Switzerland 2016.
    DOI: 10.1007/978-3-319-31858-5_10
  • Data driven science: SIGKDD panel
    Morik, K. and Durrant-Whyte, H. and Hill, G. and Müller, D. and Berger-Wolf, T.
    Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2015-August (2015)
    The panel session, Data Driven Science discusses application and use of knowledge discovery, machine learning and data analytics in science disciplines; in natural, physical, medical and social science; from physics to geology, and from neuroscience to population health. Knowledge discovery methods are finding broad application in all areas of scientific endeavor, to explore experimental data, to discover new models, to propose new scientific theories and ideas. In addition, the availability of ever larger scientific data sets is driving a new data-driven paradigm for modeling of complex phenomena in physical, natural and social sciences. The purpose of this panel is to bring together users of knowledge discovery, machine learning and data analytics methods across the science disciplines, to understand what tools and methods are proving effective in areas such as data exploration and modeling, to uncover common problems that can be addressed in the KDD community, and to explore the emerging data-driven paradigm in science.
    DOI: 10.1145/2783258.2788703
  • Development of a general analysis and unfolding scheme and its application to measure the energy spectrum of atmospheric neutrinos with IceCube: IceCube Collaboration
    Aartsen, M.G. and Ackermann, M. and Adams, J. and Aguilar, J.A. and Ahlers, M. and Ahrens, M. and Altmann, D. and Anderson, T. and Arguelles, C. and Arlen, T.C. and Auffenberg, J. and Bai, X. and Barwick, S.W. and Baum, V. and Beatty, J.J. and Tjus, J.B. and Becker, K.-H. and BenZvi, S. and Berghaus, P. and Berley, D. and Bernardini, E. and Bernhard, A. and Besson, D.Z. and Binder, G. and Bindig, D. and Bissok, M. and Blaufuss, E. and Blumenthal, J. and Boersma, D.J. and Bohm, C. and Bos, F. and Bose, D. and Böser, S. and Botner, O. and Brayeur, L. and Bretz, H.P. and Brown, A.M. and Casey, J. and Casier, M. and Cheung, E. and Chirkin, D. and Christov, A. and Christy, B. and Clark, K. and Classen, L. and Clevermann, F. and Coenders, S. and Cowen, D.F. and Cruz Silva, A.H. and Danninger, M. and Daughhetee, J. and Davis, J.C. and Day, M. and de André, J.P.A.M. and De Clercq, C. and De Ridder, S. and Desiati, P. and de Vries, K.D. and de With, M. and DeYoung, T. and Díaz-Vélez, J.C. and Dunkman, M. and Eagan, R. and Eberhardt, B. and Eichmann, B. and Eisch, J. and Euler, S. and Evenson, P.A. and Fadiran, O. and Fazely, A.R. and Fedynitch, A. and Feintzeig, J. and Felde, J. and Feusels, T. and Filimonov, K. and Finley, C. and Fischer-Wasels, T. and Flis, S. and Franckowiak, A. and Frantzen, K. and Fuchs, T. and Gaisser, T.K. and Gaior, R. and Gallagher, J. and Gerhardt, L. and Gier, D. and Gladstone, L. and Glüsenkamp, T. and Goldschmidt, A. and Golup, G. and Gonzalez, J.G. and Goodman, J.A. and Góra, D. and Grant, D. and Gretskov, P. and Groh, J.C. and Groß, A. and Ha, C. and Haack, C. and Haj Ismail, A. and Hallen, P. and Hallgren, A. and Halzen, F. and Hanson, K. and Hebecker, D. and Heereman, D. and Heinen, D. and Helbing, K. and Hellauer, R. and Hellwig, D. and Hickford, S. and Hill, G.C. and Hoffman, K.D. and Hoffmann, R. and Homeier, A. and Hoshina, K. and Huang, F. and Huelsnitz, W. and Hulth, P.O. and Hultqvist, K. and Hussain, S. and Ishihara, A. and Jacobi, E. and Jacobsen, J. and Jagielski, K. and Japaridze, G.S. and Jero, K. and Jlelati, O. and Jurkovic, M. and Kaminsky, B. and Kappes, A. and Karg, T. and Karle, A. and Kauer, M. and Keivani, A. and Kelley, J.L. and Kheirandish, A. and Kiryluk, J. and Kläs, J. and Klein, S.R. and Köhne, J.H. and Kohnen, G. and Kolanoski, H. and Koob, A. and Köpke, L. and Kopper, C. and Kopper, S. and Koskinen, D.J. and Kowalski, M. and Kriesten, A. and Krings, K. and Kroll, G. and Kroll, M. and Kunnen, J. and Kurahashi, N. and Kuwabara, T. and Labare, M. and Larsen, D.T. and Larson, M.J. and Lesiak-Bzdak, M. and Leuermann, M. and Leute, J. and Lünemann, J. and Madsen, J. and Maggi, G. and Maruyama, R. and Mase, K. and Matis, H.S. and Maunu, R. and McNally, F. and Meagher, K. and Medici, M. and Meli, A. and Meures, T. and Miarecki, S. and Middell, E. and Middlemas, E. and Milke, N. and Miller, J. and Mohrmann, L. and Montaruli, T. and Morse, R. and Nahnhauer, R. and Naumann, U. and Niederhausen, H. and Nowicki, S.C. and Nygren, D.R. and Obertacke, A. and Odrowski, S. and Olivas, A. and Omairat, A. and O’Murchadha, A. and Palczewski, T. and Paul, L. and Penek, Ö. and Pepper, J.A. and Pérez de los Heros, C. and Pfendner, C. and Pieloth, D. and Pinat, E. and Posselt, J. and Price, P.B. and Przybylski, G.T. and Pütz, J. and Quinnan, M. and Rädel, L. and Rameez, M. and Rawlins, K. and Redl, P. and Rees, I. and Reimann, R. and Relich, M. and Resconi, E. and Rhode, W. and Richman, M. and Riedel, B. and Robertson, S. and Rodrigues, J.P. and Rongen, M. and Rott, C. and Ruhe, T. and Ruzybayev, B. and Ryckbosch, D. and Saba, S.M. and Sander, H.-G. and Sandroos, J. and Santander, M. and Sarkar, S. and Schatto, K. and Scheriau, F. and Schmidt, T. and Schmitz, M. and Schoenen, S. and Schöneberg, S. and Schönwald, A. and Schukraft, A. and Schulte, L. and Schulz, O. and Seckel, D. and Sestayo, Y. and Seunarine, S. and Shanidze, R. and Smith, M.W.E. and Soldin, D. and Spiczak, G.M. and Spiering, C. and Stamatikos, M. and Stanev, T. and Stanisha, N.A. and Stasik, A. and Stezelberger, T. and Stokstad, R.G. and Stößl, A. and Strahler, E.A. and Ström, R. and Strotjohann, N.L. and Sullivan, G.W. and Taavola, H. and Taboada, I. and Tamburro, A. and Tepe, A. and Ter-Antonyan, S. and Terliuk, A. and Tešić, G. and Tilav, S. and Toale, P.A. and Tobin, M.N. and Tosi, D. and Tselengidou, M. and Unger, E. and Usner, M. and Vallecorsa, S. and van Eijndhoven, N. and Vandenbroucke, J. and van Santen, J. and Vehring, M. and Voge, M. and Vraeghe, M. and Walck, C. and Wallraff, M. and Weaver, C. and Wellons, M. and Wendt, C. and Westerhoff, S. and Whelan, B.J. and Whitehorn, N. and Wichary, C. and Wiebe, K. and Wiebusch, C.H. and Williams, D.R. and Wissing, H. and Wolf, M. and Wood, T.R. and Woschnagg, K. and Xu, D.L. and Xu, X.W. and Yanez, J.P. and Yodh, G. and Yoshida, S. and Zarzhitsky, P. and Ziemann, J. and Zierke, S. and Zoll, M. and Morik, K.
    European Physical Journal C 75 (2015)
    We present the development and application of a generic analysis scheme for the measurement of neutrino spectra with the IceCube detector. This scheme is based on regularized unfolding, preceded by an event selection which uses a Minimum Redundancy Maximum Relevance algorithm to select the relevant variables and a random forest for the classification of events. The analysis has been developed using IceCube data from the 59-string configuration of the detector. 27,771 neutrino candidates were detected in 346 days of livetime. A rejection of 99.9999 % of the atmospheric muon background is achieved. The energy spectrum of the atmospheric neutrino flux is obtained using the TRUEE unfolding program. The unfolded spectrum of atmospheric muon neutrinos covers an energy range from 100 GeV to 1 PeV. Compared to the previous measurement using the detector in the 40-string configuration, the analysis presented here extends the upper end of the atmospheric neutrino spectrum by more than a factor of two, reaching an energy region that has not been previously accessed by spectral measurements. © 2015, The Author(s).
    DOI: 10.1140/epjc/s10052-015-3330-z
  • Discovering neutrinos through data analytics
    Börner, M. and Rhode, W. and Ruhe, T. and Morik, K.
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 9286 (2015)
    Astrophysical experiments produce Big Data which need efficient and effective data analytics. In this paper we present a general data analysis process which has been successfully applied to data from IceCube, a cubic kilometer neutrino detector located at the geographic South Pole. The goal of the analysis is to separate neutrinos from atmospheric muons within the data to determine the muon neutrino energy spectrum. The presented process covers straight cuts, variable selection, classification, and unfolding. A major challenge in the separation is the unbalanced dataset. The expected signal to background ratio in the initial data (trigger level) is roughly 1:10^6. The overall process was embedded in a multi-fold cross-validation to control its performance. A subsequent regularized unfolding yields the sought after neutrino energy spectrum. © Springer International Publishing Switzerland 2015.
    DOI: 10.1007/978-3-319-23461-8_15
  • Investigation of word senses over time using linguistic corpora
    Pölitz, C. and Bartz, T. and Morik, K. and Störrer, A.
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 9302 (2015)
    Word sense induction is an important method to identify possible meanings of words. Word co-occurrences can group word contexts into semantically related topics. Besides the pure words, temporal information provides another dimension to further investigate the development of the word meanings over time. Large digital corpora of written language, such as those that are held by the CLARIN-D centers, provide excellent possibilities for this kind of linguistic research on authentic language data. In this paper, we investigate the evolution of meanings of words with topic models over time using large digital text corpora. © Springer International Publishing Switzerland 2015.
    DOI: 10.1007/978-3-319-24033-6_22
  • Online analysis of high-volume data streams in astroparticle physics
    Bockermann, C. and Brügge, K. and Buss, J. and Egorov, A. and Morik, K. and Rhode, W. and Ruhe, T.
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 9286 (2015)
    Experiments in high-energy astroparticle physics produce large amounts of data as continuous high-volume streams. Gaining insights from the observed data poses a number of challenges to data analysis at various steps in the analysis chain of the experiments. Machine learning methods have already cleaved their way selectively at some particular stages of the overall data mangling process. In this paper we investigate the deployment of machine learning methods at various stages of the data analysis chain in a gamma-ray astronomy experiment. Aiming at online and real-time performance, we build up on prominent software libraries and discuss the complete cycle of data processing from raw-data capturing to high-level classification using a data-flow based rapid-prototyping environment. In the context of a gamma-ray experiment, we review user requirements in this interdisciplinary setting and demonstrate the applicability of our approach in a real-world setting to provide results from high-volume data streams in real-time performance. © Springer International Publishing Switzerland 2015.
    DOI: 10.1007/978-3-319-23461-8_7
  • Open smartphone data for structured mobility and utilization analysis in ubiquitous systems
    Piatkowski, N. and Streicher, J. and Spinczyk, O. and Morik, K.
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 8940 (2015)
    The development and evaluation of new data mining methods for ubiquitous environments and systems requires real data that were collected from real users. In this work, we present an open smartphone utilization and mobility data set that was generated with several devices and participants during a 4-month study. A particularity of this data set is the inclusion of low-level operating system data. In addition to the description of the data, we also describe the process of collection and the privacy measures we applied. To demonstrate the utility of the data, we evaluate the quality of generative spatio-temporal models for “apps” and network cells, since these are required as a building block in general predictions of the resource consumption of ubiquitous systems. © Springer International Publishing Switzerland 2015
    DOI: 10.1007/978-3-319-14723-9_7
  • Heterogeneous stream processing and crowdsourcing for traffic monitoring: Highlights
    Schnitzler, F. and Artikis, A. and Weidlich, M. and Boutsis, I. and Liebig, T. and Piatkowski, N. and Bockermann, C. and Morik, K. and Kalogeraki, V. and Marecek, J. and Gal, A. and Mannor, S. and Kinane, D. and Gunopulos, D.
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 8726 LNAI (2014)
    We give an overview of an intelligent urban traffic management system. Complex events related to congestions are detected from heterogeneous sources involving fixed sensors mounted on intersections and mobile sensors mounted on public transport vehicles. To deal with data veracity, sensor disagreements are resolved by crowdsourcing. To deal with data sparsity, a traffic model offers information in areas with low sensor coverage. We apply the system to a real-world use-case. © 2014 Springer-Verlag.
    DOI: 10.1007/978-3-662-44845-8_49
  • Robust selection of cancer survival signatures from high-throughput genomic data using two-fold subsampling
    Lee, S. and Rahnenführer, J. and Lang, M. and De Preter, K. and Mestdagh, P. and Koster, J. and Versteeg, R. and Stallings, R.L. and Varesio, L. and Asgharzadeh, S. and Schulte, J.H. and Fielitz, K. and Schwermer, M. and Morik, K. and Schramm, A.
    PLoS ONE 9 (2014)
    Identifying relevant signatures for clinical patient outcome is a fundamental task in high-throughput studies. Signatures, composed of features such as mRNAs, miRNAs, SNPs or other molecular variables, are often non-overlapping, even though they have been identified from similar experiments considering samples with the same type of disease. The lack of a consensus is mostly due to the fact that sample sizes are far smaller than the numbers of candidate features to be considered, and therefore signature selection suffers from large variation. We propose a robust signature selection method that enhances the selection stability of penalized regression algorithms for predicting survival risk. Our method is based on an aggregation of multiple, possibly unstable, signatures obtained with the preconditioned lasso algorithm applied to random (internal) subsamples of a given cohort data, where the aggregated signature is shrunken by a simple thresholding strategy. The resulting method, RS-PL, is conceptually simple and easy to apply, relying on parameters automatically tuned by cross validation. Robust signature selection using RS-PL operates within an (external) subsampling framework to estimate the selection probabilities of features in multiple trials of RS-PL. These probabilities are used for identifying reliable features to be included in a signature. Our method was evaluated on microarray data sets from neuroblastoma, lung adenocarcinoma, and breast cancer patients, extracting robust and relevant signatures for predicting survival risk. Signatures obtained by our method achieved high prediction performance and robustness, consistently over the three data sets. Genes with high selection probability in our robust signatures have been reported as cancer-relevant. The ordering of predictor coefficients associated with signatures was well-preserved across multiple trials of RS-PL, demonstrating the capability of our method for identifying a transferable consensus signature. The software is available as an R package rsig at CRAN (http://cran.r-project.org). © 2014 Lee et al.
    DOI: 10.1371/journal.pone.0108818
  • Anomaly detection in vertically partitioned data by distributed core vector machines
    Stolpe, M. and Bhaduri, K. and Das, K. and Morik, K.
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 8190 LNAI (2013)
    Observations of physical processes suffer from instrument malfunction and noise and demand data cleansing. However, rare events are not to be excluded from modeling, since they can be the most interesting findings. Often, sensors collect features at different sites, so that only a subset is present (vertically distributed data). Transferring all data or a sample to a single location is impossible in many real-world applications due to restricted bandwidth of communication. Finding interesting abnormalities thus requires efficient methods of distributed anomaly detection. We propose a new algorithm for anomaly detection on vertically distributed data. It aggregates the data directly at the local storage nodes using RBF kernels. Only a fraction of the data is communicated to a central node. Through extensive empirical evaluation on controlled datasets, we demonstrate that our method is an order of magnitude more communication efficient than state of the art methods, achieving a comparable accuracy. © 2013 Springer-Verlag.
    DOI: 10.1007/978-3-642-40994-3_21
  • Quality prediction in interlinked manufacturing processes based on supervised & unsupervised machine learning
    Lieber, D. and Stolpe, M. and Konrad, B. and Deuse, J. and Morik, K.
    Procedia CIRP 7 (2013)
    In the context of a rolling mill case study, this paper presents a methodical framework based on data mining for predicting the physical quality of intermediate products in interlinked manufacturing processes. In the first part, implemented data preprocessing and feature extraction components of the Inline Quality Prediction System are introduced. The second part shows how the combination of supervised and unsupervised data mining methods can be applied to identify most striking operational patterns, promising quality-related features and production parameters. The results indicate how sustainable and energy-efficient interlinked manufacturing processes can be achieved by the application of data mining. © 2013 The Authors.
    DOI: 10.1016/j.procir.2013.05.033
  • Some machine learning approaches to the analysis of temporal data
    Morik, K.
    Robustness and Complex Data Structures: Festschrift in Honour of Ursula Gather (2013)
    Investigating time is not restricted to time series analysis, where from a sequence of equidistant measurements the value of the next measurement is predicted. In contrast, many applications have to cope with very large collections of time series data. The tasks range from regression and classification to detecting patterns in the data. By several case studies stemming from several years of research, this chapter illustrates the diversity of temporal phenomena handled in machine learning and data mining on the basis of very large data sets. The path leads from time series classification to the analysis of streaming data. A recurrent theme is the appropriate representation, feature extraction, and feature selection for high performance learning. © Springer-Verlag Berlin Heidelberg 2013.
    DOI: 10.1007/978-3-642-35494-6_17
  • Spatio-temporal random fields: Compressible representation and distributed estimation
    Piatkowski, N. and Lee, S. and Morik, K.
    Machine Learning 93 (2013)
    Modern sensing technology allows us enhanced monitoring of dynamic activities in business, traffic, and home, just to name a few. The increasing amount of sensor measurements, however, brings us the challenge for efficient data analysis. This is especially true when sensing targets can interoperate - in such cases we need learning models that can capture the relations of sensors, possibly without collecting or exchanging all data. Generative graphical models namely the Markov random fields (MRF) fit this purpose, which can represent complex spatial and temporal relations among sensors, producing interpretable answers in terms of probability. The only drawback will be the cost for inference, storing and optimizing a very large number of parameters - not uncommon when we apply them for real-world applications. In this paper, we investigate how we can make discrete probabilistic graphical models practical for predicting sensor states in a spatio-temporal setting. A set of new ideas allows keeping the advantages of such models while achieving scalability. We first introduce a novel alternative to represent model parameters, which enables us to compress the parameter storage by removing uninformative parameters in a systematic way. For finding the best parameters via maximum likelihood estimation, we provide a separable optimization algorithm that can be performed independently in parallel in each graph node. We illustrate that the prediction quality of our suggested method is comparable to those of the standard MRF and a spatio-temporal k-nearest neighbor method, while using much less computational resources. © 2013 The Author(s).
    DOI: 10.1007/s10994-013-5399-7
  • Exon-level expression analyses identify MYCN and NTRK1 as major determinants of alternative exon usage and robustly predict primary neuroblastoma outcome
    Schramm, A. and Schowe, B. and Fielitz, K. and Heilmann, M. and Martin, M. and Marschall, T. and Köster, J. and Vandesompele, J. and Vermeulen, J. and De Preter, K. and Koster, J. and Versteeg, R. and Noguera, R. and Speleman, F. and Rahmann, S. and Eggert, A. and Morik, K. and Schulte, J.H.
    British Journal of Cancer 107 (2012)
    Background: Using mRNA expression-derived signatures as predictors of individual patient outcome has been a goal ever since the introduction of microarrays. Here, we addressed whether analyses of tumour mRNA at the exon level can improve on the predictive power and classification accuracy of gene-based expression profiles using neuroblastoma as a model. Methods: In a patient cohort comprising 113 primary neuroblastoma specimens expression profiling using exon-level analyses was performed to define predictive signatures using various machine-learning techniques. Alternative transcript use was calculated from relative exon expression. Validation of alternative transcripts was achieved using qPCR- and cell-based approaches. Results: Both predictors derived from the gene or the exon levels resulted in prediction accuracies >80% for both event-free and overall survival and proved as independent prognostic markers in multivariate analyses. Alternative transcript use was most prominently linked to the amplification status of the MYCN oncogene, expression of the TrkA/NTRK1 neurotrophin receptor and survival. Conclusion: As exon level-based prediction yields comparable, but not significantly better, prediction accuracy than gene expression-based predictors, gene-based assays seem to be sufficiently precise for predicting outcome of neuroblastoma patients. However, exon-level analyses provide added knowledge by identifying alternative transcript use, which should deepen the understanding of neuroblastoma biology. © 2012 Cancer Research UK.
    DOI: 10.1038/bjc.2012.391
  • Introduction to data mining for sustainability
    Morik, K. and Bhaduri, K. and Kargupta, H.
    Data Mining and Knowledge Discovery 24 (2012)
    Data mining techniques are presented to explore and analyze environmental spatio-temporal data or help to design and operate better sustainable systems. The measurement process needs to be understood, managed, and controlled. The data collections of environmental and engineering approaches to sustainability are challenging data mining in various ways. The high-dimensional data sets are organized into spatial and temporal neighborhoods and the relation between these two orderings needs to be taken into account by the mining algorithms. Many tools have been developed that display the data in different views and allow for interactive analysis. Another approach to sustainability is to enhance the management of human consumption of resources. This engineering approach aims at controlling processes such that natural resources are conserved.
    DOI: 10.1007/s10618-011-0239-5
  • Multi-objective frequent termset clustering
    Morik, K. and Kaspari, A. and Wurst, M. and Skirzynski, M.
    Knowledge and Information Systems 30 (2012)
    Large media collections rapidly evolve in the World Wide Web. In addition to the targeted retrieval as is performed by search engines, browsing and explorative navigation is an important issue. Since the collections grow fast and authors most often do not annotate their web pages according to a given ontology, automatic structuring is in demand as a prerequisite for any pleasant human-computer interface. In this paper, we investigate the problem of finding alternative high-quality structures for navigation in a large collection of high-dimensional data. We express desired properties of frequent termset clustering (FTS) in terms of objective functions. In general, these functions are conflicting. This leads to the formulation of FTS clustering as a multi-objective optimization problem. The optimization is solved by a genetic algorithm. The result is a set of Pareto-optimal solutions. Users may choose their favorite type of a structure for their navigation through a collection or explore the different views given by the different optimal solutions. We explore the capability of the new approach to produce structures that are well suited for browsing on a social bookmarking data set. © 2011 Springer-Verlag London Limited.
    DOI: 10.1007/s10115-011-0431-3
  • Separable approximate optimization of support vector machines for distributed sensing
    Lee, S. and Stolpe, M. and Morik, K.
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 7524 LNAI (2012)
    Sensor measurements from diverse locations connected with possibly low bandwidth communication channels pose a challenge of resource-restricted distributed data analyses. In such settings it would be desirable to perform learning in each location as much as possible, without transferring all data to a central node. Applying support vector machines (SVMs) with nonlinear kernels becomes nontrivial, however. In this paper, we present an efficient optimization scheme for training SVMs over such sensor networks. Our framework performs optimization independently in each node, using only the local features stored in the respective node. We make use of multiple local kernels and explicit approximations to the feature mappings induced by them. Together they allow us to construct a separable surrogate objective that provides an upper bound of the primal SVM objective. A central coordination is also designed to adjust the weights among local kernels for improved prediction, while minimizing communication cost. © 2012 Springer-Verlag.
    DOI: 10.1007/978-3-642-33486-3_25
  • The JARID 1C histone demethylase is upregulated in aggressive neuroblastomas independent of MYCN amplification
    Fielitz, K. and Schowe, B. and Schulte, J. H. and Vandesompele, J. and Mestdagh, P. and Eggert, A. and Morik, K. and Schramm, A.
    Klinische Padiatrie 224 (2012)
    DOI: 10.1055/s-0032-1310494
  • Fast-ensembles of minimum redundancy feature selection
    Schowe, B. and Morik, K.
    Studies in Computational Intelligence 373 (2011)
    Finding relevant subspaces in very high-dimensional data is a challenging task not only for microarray data. The selection of features is to enhance the classification performance, but on the other hand the feature selection must be stable, i.e., the set of features selected should not change when using different subsets of a population. Ensemble methods have succeeded in increasing stability and classification accuracy. However, their runtime prevents them from scaling up to real-world applications. We propose two methods which enhance correlation-based feature selection such that the stability of feature selection comes with little or even no extra runtime. We show the efficiency of the algorithms analytically and empirically on a wide range of datasets. © 2011 Springer-Verlag Berlin Heidelberg.
    DOI: 10.1007/978-3-642-22910-7_5
  • Learning from label proportions by optimizing cluster model selection
    Stolpe, M. and Morik, K.
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 6913 LNAI (2011)
    In a supervised learning scenario, we learn a mapping from input to output values, based on labeled examples. Can we learn such a mapping also from groups of unlabeled observations, only knowing, for each group, the proportion of observations with a particular label? Solutions have real world applications. Here, we consider groups of steel sticks as samples in quality control. Since the steel sticks cannot be marked individually, for each group of sticks it is only known how many sticks of high (low) quality it contains. We want to predict the achieved quality for each stick before it reaches the final production station and quality control, in order to save resources. We define the problem of learning from label proportions and present a solution based on clustering. Our method empirically shows a better prediction performance than recent approaches based on probabilistic SVMs, Kernel k-Means or conditional exponential models. © 2011 Springer-Verlag.
    DOI: 10.1007/978-3-642-23808-6_23
  • Towards adjusting mobile devices to user's behaviour
    Fricke, P. and Jungermann, F. and Morik, K. and Piatkowski, N. and Spinczyk, O. and Stolpe, M. and Streicher, J.
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 6904 (2011)
    Mobile devices are a special class of resource-constrained embedded devices. Computing power, memory, the available energy, and network bandwidth are often severely limited. These constrained resources require extensive optimization of a mobile system compared to larger systems. Any needless operation has to be avoided. Time-consuming operations have to be started early on. For instance, loading files ideally starts before the user wants to access the file. So-called prefetching strategies optimize system's operation. Our goal is to adjust such strategies on the basis of logged system data. Optimization is then achieved by predicting an application's behavior based on facts learned from earlier runs on the same system. In this paper, we analyze system-calls on operating system level and compare two paradigms, namely server-based and device-based learning. The results could be used to optimize the runtime behaviour of mobile devices. © 2011 Springer-Verlag.
    DOI: 10.1007/978-3-642-23599-3_6
  • Accurate prediction of neuroblastoma outcome based on miRNA expression profiles
    Schulte, J.H. and Schowe, B. and Mestdagh, P. and Kaderali, L. and Kalaghatgi, P. and Schlierf, S. and Vermeulen, J. and Brockmeyer, B. and Pajtler, K. and Thor, T. and De Preter, K. and Speleman, F. and Morik, K. and Eggert, A. and Vandesompele, J. and Schramm, A.
    International Journal of Cancer 127 (2010)
    For neuroblastoma, the most common extracranial tumour of childhood, identification of new biomarkers and potential therapeutic targets is mandatory to improve risk stratification and survival rates. MicroRNAs are deregulated in most cancers, including neuroblastoma. In this study, we analysed 430 miRNAs in 69 neuroblastomas by stem-loop RT-qPCR. Prediction of event-free survival (EFS) with support vector machines (SVM) and actual survival times with Cox regression-based models (CASPAR) were highly accurate and were independently validated. SVM-accuracy for prediction of EFS was 88.7% (95% CI: 88.5-88.8%). For CASPAR-based predictions, 5y-EFS probability was 0.19 (95% CI: 0-38%) in the CASPAR-predicted short survival group compared with 0.78 (95% CI: 64-93%) in the CASPAR-predicted long survival group. Both classifiers were validated on an independent test set yielding accuracies of 94.74% (SVM) and 5y-EFS probabilities as 0.25 (95% CI: 0.0-0.55) for short versus 1 ± 0.0 for long survival (CASPAR), respectively. Amplification of the MYCN oncogene was highly correlated with deregulation of miRNA expression. In addition, 37 miRNAs correlated with TrkA expression, a marker of excellent outcome, and 6 miRNAs further analysed in vitro were regulated upon TrkA transfection, suggesting a functional relationship. Expression of the most significant TrkA-correlated miRNA, miR-542-5p, also discriminated between local and metastatic disease and was inversely correlated with MYCN amplification and event-free survival. We conclude that neuroblastoma patient outcome prediction using miRNA expression is feasible and effective. Studies testing miRNA-based predictors in comparison to and in combination with mRNA and aCGH information should be initiated. Specific miRNAs (e.g., miR-542-5p) might be important in neuroblastoma tumour biology, and qualify as potential therapeutic targets. © 2010 UICC.
    DOI: 10.1002/ijc.25436
  • Clustering the web 2.0
    Morik, K. and Wurst, M.
    Studies in Computational Intelligence 263 (2010)
    Ryszard Michalski has been the pioneer of Machine Learning. His conceptual clustering focused on the understandability of clustering results. It is a key requirement if Machine Learning is to serve users successfully. In this chapter, we present two approaches to clustering in the scenario of Web 2.0 with a special concern of understandability in this new context. In contrast to semantic web approaches which advocate ontologies as a common semantics for homogeneous user groups, Web 2.0 aims at supporting heterogeneous user groups where users annotate and organize their content without a reference to a common schema. Hence, the semantics is not made explicit. It can be extracted by Machine Learning, though, hence providing users with new services. © 2010 Springer-Verlag Berlin Heidelberg.
    DOI: 10.1007/978-3-642-05179-1_10
  • Enhancing ubiquitous systems through system call mining
    Morik, K. and Jungermann, F. and Piatkowski, N. and Engel, M.
    Proceedings - IEEE International Conference on Data Mining, ICDM (2010)
    Collecting, monitoring, and analyzing data automatically by well instrumented systems is frequently motivated by human decision-making. However, the same need occurs when system software decisions are to be justified. Compiler optimization or storage management requires several decisions which result in more or less resource consumption, be it energy, memory, or runtime. A magnitude of system data can be collected in order to base decisions of compilers or the operating system on empirical analysis. The challenge of large-scale data is aggravated if system data of small and often mobile systems are collected and analyzed. In contrast to the large data volume, the mobile devices offer only very limited storage and computing capacity. Moreover, if analysis results are put to use at the operating system, the real-time response is at the system level, not on the level of human reaction time. In this paper, small and most often mobile systems (i.e., ubiquitous systems) are instrumented for the collection of system call data. It is investigated whether the sequence and the structure of system calls are to be taken into account by the learning method, or not. A structural learning method, Conditional Random Fields (CRF), is applied using different internal optimization algorithms and feature mappings. Implementing CRF in a massively parallel way using general purpose graphic processor units (GPGPU) points at future ubiquitous systems. © 2010 IEEE.
    DOI: 10.1109/ICDMW.2010.133
  • Learning in Order: Steps of Acquiring the Concept of the Day/Night Cycle
    Morik, K. and Mühlenbrock, M.
    In Order to Learn: How the sequence of topics influences learning (2010)
    This chapter presents a detailed model of children's explanations of where the sun goes at night. Knowledge of the day/night cycle is one of the first relatively complex sets of knowledge that all people acquire. The model shows how children progress through a lattice of possible explanations (a lattice is a partially but not completely ordered set). The task and data modeled offer an excellent basis for the investigation of order effects, with implications for modeling scientific discovery and for learning in general. It shows that some transitions are particularly difficult, that some transitions require using incomplete or incorrect knowledge, and that not all transitions are possible. It also shows that the order of learning can make a large difference in the amount that has to be learned and, perhaps more importantly, unlearned. Better orders provide about a 30% reduction in facts that have to be learned. These findings make suggestions about the instructional complexity that children, and presumably learners in general, can handle, and about the use and importance of intermediate stages of learning. © 2007 by Frank E. Ritter, Josef Nerb, Erno Lehtinen, and Timothy M. O'Shea. All rights reserved.
    DOI: 10.1093/acprof:oso/9780195178845.003.0009
  • Nemoz - A distributed framework for collaborative media organization
    Morik, K.
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 6202 LNAI (2010)
    Multimedia applications have received quite some interest. Embedding them into a framework of ubiquitous computing and peer-to-peer Web 2.0 applications raises research questions of resource-awareness which are not that demanding within a server-based framework. In this chapter, we present Nemoz, a collaborative music organizer based on distributed data and multimedia mining techniques. We introduce the Nemoz platform before focusing on the steps of intelligent collaborative structuring of multimedia collections, namely, feature extraction and distributed data mining. We summarize the characteristics of knowledge discovery in ubiquitous computing that have been handled within the Nemoz project. © 2010 Springer-Verlag.
    DOI: 10.1007/978-3-642-16392-0_12
  • clustering

  • data-driven science

  • gene expression profiling

  • learning systems

  • neuroblastoma

  • sensors

  • spatio-temporal models
