A large amount of data is now available, and much more is being created all the time by sources such as the internet and its services, fixed and mobile sensors, media-collection devices, and telecommunications, along with consumption, financial, and meteorological data, among many others. Every sector, whether industry, agribusiness, health, services, transportation, logistics, urban planning, energy, climatology, or meteorology, is interested in how to exploit this huge volume of data.

The knowledge obtained from this data increases when all available information around a problem domain is taken into account, from both structured and unstructured sources. However, manipulating, managing, and extracting relevant knowledge from this huge volume of data already exceeds conventional storage and processing capacity.

This new scenario has given rise to the concept of Data Science, which covers the main foundations supporting the process of discovering knowledge and generating useful information for decision-making from several sources of large-volume data. This concept is closely related to Data Mining, since the latter focuses on techniques and algorithms for identifying patterns and extracting useful knowledge. The details and validity of such procedures and methods characterize and establish the concept of Data Science.

When a knowledge discovery process must also deal with a huge amount of data of various types, generated by various sources, and with the need to manipulate that data for online or real-time knowledge extraction, a new concept emerges: Big Data. The term refers to the handling, storage, and analysis of massive volumes of data generated and captured at high speed from sources containing various types of data, structured and unstructured but complementary, which when analyzed can yield credible information of great value to society. The 5Vs (volume, velocity, variety, veracity, and value) are the principles that underlie the concept of Big Data. Within this new paradigm, the key challenges are: theoretical foundations, infrastructure, capture, storage, manipulation, search, transfer, mining, analysis, visualization, privacy, and data security.

Extracting knowledge and patterns from massive amounts of data raises three initial challenges: (1) the hardware and software infrastructure becomes unconventional, requiring new architectures and solutions such as Hadoop; (2) data mining, which enables the discovery of patterns, must run within acceptable processing times; and (3) massive data and the results obtained from it must be visualized in a way that supports decision-making.

The Data Science Research Group (DSRgroup) of the Computer Science Department of the Pontifical Catholic University of Minas Gerais was created through the collaboration of two laboratories: the Applied Computational Intelligence Laboratory (LICAP) and the Computer Architecture and Parallel Processing laboratory (CArt). We are interested in extracting knowledge from data, which can be structured, semi-structured, or unstructured. We use theories and techniques drawn from fields such as computing, statistics, mathematics, and information theory to propose new approaches and algorithms for knowledge discovery methodologies, conceptual modeling, data pre-processing, machine learning, statistical learning, formal verification of algorithms, programming for combinatorially explosive problems, and high-performance computing. Currently, we are interested in adapting conventional Data Mining solutions to Big Data problems. Our application domains include Bioinformatics and Health, Education, Social Networks, Weather Forecasting, Stock Prediction, the Steel Industry, Energy, Marketing Optimization, Fraud Detection, Security, and Public Policy.

We are interested in the following topics:

  • Methodology for Knowledge discovery
  • Data mining for conventional and complex data
  • Data mining for sequences
  • Data mining for longitudinal data
  • Data mining for temporal/time series data
  • Pre-processing of data (outlier analysis, missing values, noisy data)
  • Semi-supervised learning
    • Transductive learning
  • Supervised learning
    • Classification
    • Multi-label classification
  • Unsupervised learning
    • Clustering, bi-clustering
    • Categorical data
    • Similarity measures
  • Dimensionality reduction
    • Feature selection
    • Bio-inspired mechanisms
  • Techniques and methods
    • Neural networks
    • Support vector machines
    • Bayesian inference
    • Association rules
  • Quantitative methods for data analysis
    • Behaviour analysis
    • Singular value decomposition
    • Factor analysis
    • Simple and multivariate linear regression
    • Non-linear Regression
    • Descriptive statistics
    • Inferential statistics
    • Hypothesis testing
    • Analysis of variance – ANOVA
  • Formal concept analysis for data mining
    • Conceptual implication rules
    • Minimal set of implication rules
    • Handling of high-dimensional contexts
    • Optimization of FCA algorithms
    • Parallelism of FCA algorithms
    • FCA to represent and analyze social networks
  • Neural Networks for data mining
    • Knowledge extraction from trained neural networks
    • Cause-effect analysis
  • Formal verification of systems and design of algorithms
    • Algorithms related to automata theory and formal languages
    • Algorithms for combinatorial optimization
    • Experimental algorithms and testing of algorithms
    • Optimization of algorithms for data mining
  • Big Data
    • Big-data infrastructure
    • Distributed computing (MPI, GPU, MapReduce)
    • High-performance computing
    • Sampling, balancing
    • Scalable methods
  • Applications
    • Bioinformatics
    • Data mining for social good
    • Education
    • Finance
    • Social networks
    • Community detection
    • User modeling
    • Industry