Multidisciplinary Optimization plays a central role in integrating diverse areas in engineering projects. Computational resources can, however, be a limiting factor in optimization processes. This paper presents a middleware developed to integrate the Synapse software with clusters managed by the Slurm Workload Manager. The middleware facilitates the distributed execution of population-based optimization algorithms (e.g., genetic algorithms), favoring optimization processes. Tests on a heterogeneous cluster validated the solution. In preliminary experiments, the developed solution achieved speedups of up to ten times compared to the use of workstations, the approach previously supported by Synapse.
2023
Deep Learning and Satellite Images for Photovoltaic Power Forecasting: A Case Study
The growing demand for renewable energy resources presents a supply management challenge, as photovoltaic (PV) energy exhibits intermittent generation due to meteorological factors. The unpredictability of these variations leaves power grids vulnerable to instability, quality, and balance issues. In this context, accurate forecasting of PV power generation can improve management through generation planning, allowing for the balancing of different energy sources, which is crucial for achieving widespread PV energy adoption. The rapid development and significant advancements in deep learning present new possibilities for the use of satellite imagery in PV power forecasting. In this work we build and evaluate several deep learning models in the context of PV power forecasting, aiming at 30- and 60-minute horizons. Our models predict the Global Horizontal Irradiance (GHI) component which, due to its strong correlation with PV power generation, can be employed not only to derive the actual PV plant output, but also as a measure of generation potential, regardless of the actual PV plant. The models take as input images from the GOES-16 satellite and ground-based meteorological measurements, the latter also serving as the desired outputs (prediction targets). Several model configurations demonstrated the viability of GHI forecasting based on satellite imagery, with the best models achieving relative root mean squared errors (rRMSE) of 15.6% and 17.2% for the 30-minute and 60-minute forecast horizons, respectively.
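For reference, the relative RMSE reported above can be computed by normalizing the RMSE by the mean observed value (one common convention, assumed here; the toy numbers are illustrative, not the paper's data):

```python
import math

def rrmse(y_true, y_pred):
    """Relative RMSE (%): RMSE normalized by the mean observed value."""
    n = len(y_true)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    return 100.0 * rmse / (sum(y_true) / n)

# Example with toy GHI values (W/m^2):
obs = [500.0, 600.0, 550.0, 450.0]
pred = [480.0, 630.0, 540.0, 470.0]
print(round(rrmse(obs, pred), 2))  # -> 4.04
```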
Clustering Validation with The Area Under Precision-Recall Curves
Confusion matrices and derived metrics provide a comprehensive framework for the evaluation of model performance in machine learning. These are well-known and extensively employed in the supervised learning domain, particularly classification. Surprisingly, such a framework has not been fully explored in the context of clustering validation. Indeed, only recently has this gap been bridged, with the introduction of the Area Under the ROC Curve for Clustering (AUCC), an internal/relative Clustering Validation Index (CVI) that allows for clustering validation in real application scenarios. In this work we explore the Area Under the Precision-Recall Curve (and related metrics) in the context of clustering validation. We show that these are not only appropriate as CVIs, but should also be preferred in the presence of cluster imbalance. We perform a comprehensive evaluation of the proposed and state-of-the-art CVIs on real and simulated data sets. Our observations support a unified validation framework for supervised and unsupervised learning, as they are consistent with existing guidelines established for the evaluation of supervised learning models.
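A rough illustration of the idea (a generic reading, not necessarily the authors' exact formulation): treat each pair of objects as a "positive" if both fall in the same cluster, score pairs by similarity (here, negated Euclidean distance), and take the area under the resulting precision-recall curve, computed below as average precision:

```python
import itertools, math

def average_precision(labels, scores):
    """PR-curve area as average precision: mean of the precision values
    observed at each positive, in decreasing score order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, ap, n_pos = 0, 0.0, sum(labels)
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            tp += 1
            ap += tp / rank
    return ap / n_pos

def pr_cvi(points, clustering):
    """Pair-based PR validity: positives are same-cluster pairs,
    scores are negated pairwise distances (closer = higher score)."""
    labels, scores = [], []
    for i, j in itertools.combinations(range(len(points)), 2):
        labels.append(1 if clustering[i] == clustering[j] else 0)
        scores.append(-math.dist(points[i], points[j]))
    return average_precision(labels, scores)

# Two well-separated clusters should score near 1.0:
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(pr_cvi(pts, [0, 0, 0, 1, 1, 1]))  # -> 1.0
```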
2022
The area under the ROC curve as a measure of clustering quality
The area under the receiver operating characteristics (ROC) Curve, referred to as AUC, is a well-known performance measure in the supervised learning domain. Due to its compelling features, it has been employed in a number of studies to evaluate and compare the performance of different classifiers. In this work, we explore AUC as a performance measure in the unsupervised learning domain, more specifically, in the context of cluster analysis. In particular, we elaborate on the use of AUC as an internal/relative measure of clustering quality, which we refer to as Area Under the Curve for Clustering (AUCC). We show that the AUCC of a given candidate clustering solution has an expected value under a null model of random clustering solutions, regardless of the size of the dataset and, more importantly, regardless of the number or the (im)balance of clusters under evaluation. In addition, we elaborate on the fact that, in the context of internal/relative clustering validation as we consider, AUCC is actually a linear transformation of the Gamma criterion from Baker and Hubert (1975), for which we also formally derive a theoretical expected value for chance clusterings. We also discuss the computational complexity of these criteria and show that, while an ordinary implementation of Gamma can be computationally prohibitive and impractical for most real applications of cluster analysis, its equivalence with AUCC actually unveils a much more efficient algorithmic procedure. Our theoretical findings are supported by experimental results. These results show that, in addition to an effective and robust quantitative evaluation provided by AUCC, visual inspection of the ROC curves themselves can be useful to further assess a candidate clustering solution from a broader, qualitative perspective as well.
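A compact sketch of the AUCC idea (an illustrative reading, not the authors' implementation): rank object pairs by distance and compute the probability that a random same-cluster pair is closer than a random different-cluster pair, i.e., the Mann-Whitney form of the AUC:

```python
import itertools, math

def aucc(points, clustering):
    """AUCC sketch: AUC over object pairs, where same-cluster pairs
    should receive smaller distances than different-cluster pairs."""
    within, between = [], []
    for i, j in itertools.combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        (within if clustering[i] == clustering[j] else between).append(d)
    # Mann-Whitney statistic: P(within < between), ties counted as 1/2.
    wins = sum((w < b) + 0.5 * (w == b) for w in within for b in between)
    return wins / (len(within) * len(between))

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(aucc(pts, [0, 0, 0, 1, 1, 1]))  # well-separated clusters -> 1.0
```

The rank-based form above is also what makes the efficient computation mentioned in the abstract possible, compared to a naive pairwise Gamma implementation.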
Tetris is one of the highest-grossing video games in history and, despite its age, remains quite popular. One of its most acclaimed versions was released in 1989 for the Nintendo Entertainment System (NES) and is often referred to as NES Tetris. This particular version of the game has led to the creation of the Classic Tetris World Championship (CTWC), resulting in growing popularity and alternative modes of gameplay. In one such variant, players aim to clear as many lines as possible, with an additional constraint: piece rotations are not allowed. In this work we build and evaluate agents to play this particular variant of the game based on different metrics that grade board configurations. The relative importance of metrics is determined with Particle Swarm Optimization. Our best results match those of top-performing human players, even though the metrics we employ were not specifically developed for this game variant.
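A minimal sketch of the approach (the metric names and scoring form are illustrative assumptions, not the paper's exact metrics): boards are graded by a weighted sum of hand-crafted metrics, and PSO searches the weight space:

```python
import random

def grade(metrics, weights):
    """Score a board configuration as a weighted sum of its metrics
    (e.g., aggregate height, holes, completed lines)."""
    return sum(w * m for w, m in zip(weights, metrics))

def pso_step(positions, velocities, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One canonical PSO velocity/position update over weight vectors."""
    for i, (x, v) in enumerate(zip(positions, velocities)):
        for d in range(len(x)):
            r1, r2 = random.random(), random.random()
            v[d] = (w * v[d]
                    + c1 * r1 * (pbest[i][d] - x[d])
                    + c2 * r2 * (gbest[d] - x[d]))
            x[d] += v[d]
    return positions, velocities

# Grade two hypothetical boards: metrics = [height, holes, lines_cleared]
weights = [-0.5, -0.8, 1.0]
print(grade([12, 3, 1], weights) < grade([8, 0, 2], weights))  # -> True
```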
Modelagem e Identificação de Dados Epidemiológicos Associados à Pandemia de COVID-19 em Santa Catarina
The novel coronavirus (COVID-19) spread significantly across the globe and became one of the great afflictions of our time, deeply impacting Brazil, which stands among the nations most affected by the disease. The need for technological systems to fight the sanitary crisis thus became even more urgent in the country. In this light, this paper presents a comparative study between two techniques for modeling and forecasting epidemiological data associated with the COVID-19 pandemic in Brazil, specifically in the state of Santa Catarina. Essentially, polynomial Non-Linear Autoregressive models with eXogenous input (NARX) were conceived as a counterpoint to time series modeling with recurrent neural networks of the Long Short-Term Memory (LSTM) variant, for data series corresponding to the numbers of confirmed cases, deaths, recovered patients, and public health system beds occupied by patients affected by the disease. The predictive performance of the models, assessed with traditional performance metrics, showed that, for three of the four time series used for forecasting, the NARX model obtained more satisfactory results.
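To make the NARX structure concrete, here is a sketch of how a polynomial NARX regressor matrix can be assembled from lagged outputs y and an exogenous input u (the specific lags and polynomial terms are illustrative, not those of the paper):

```python
def narx_regressors(y, u, ny=2, nu=1):
    """Build regressor rows [y[t-1], ..., y[t-ny], u[t-1], ..., u[t-nu],
    y[t-1]*u[t-1], 1] and targets y[t] for a polynomial NARX model."""
    rows, targets = [], []
    start = max(ny, nu)
    for t in range(start, len(y)):
        row = [y[t - k] for k in range(1, ny + 1)]
        row += [u[t - k] for k in range(1, nu + 1)]
        row.append(y[t - 1] * u[t - 1])  # one cross (polynomial) term
        row.append(1.0)                  # constant term
        rows.append(row)
        targets.append(y[t])
    return rows, targets

# Toy series: e.g., daily confirmed cases (y) and occupied beds (u).
y = [1.0, 2.0, 4.0, 7.0, 11.0]
u = [0.5, 0.6, 0.7, 0.8, 0.9]
X, t = narx_regressors(y, u)
print(len(X), len(X[0]))  # -> 3 5 (3 training rows, 5 regressors each)
```

The model parameters would then be fitted by least squares on these rows, while the LSTM alternative learns the lag structure implicitly.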
Retrofitting of a two-degrees-of-freedom welding torch displacement system
Electricity consumption is growing rapidly worldwide. Renewable energy resources, such as solar energy, play a crucial role in this scenario, contributing to meeting demand sustainably. Although the share of Photovoltaic (PV) power generation has increased in the past years, PV systems are quite sensitive to climatic and meteorological conditions, leading to undesirable power production variability. In order to improve energy grid stability, reliability, and management, accurate forecasting models that relate operational conditions to power output are needed. In this work we evaluate the performance of regression methods applied to forecast short-term (next-day) energy production of a PV plant. Specifically, we consider five regression methods and different configurations of feature sets. Our results suggest that MLP and SVR provide the best forecasting results, in general. Also, although features based on different solar irradiance levels play a key role in predicting power generation, the use of additional features can improve prediction results.
Comparação de Métodos de Deep Learning Pré-Treinados da Biblioteca OpenCV para Detecção de Pessoas em Ambientes Internos
Camera-based monitoring systems are increasingly ubiquitous in indoor and outdoor environments. The existence of a monitoring system does not guarantee, however, that all collected information is used and/or analyzed. When an interpretation of the images is needed, computer vision is usually employed. In this particular context, Deep Learning methods have received growing attention. Indeed, despite their recent development, some of these methods are available in libraries and software packages in pre-trained form, allowing their application with relative ease. In this work, different Deep Learning methods available in the OpenCV library were compared for the detection and counting of people in indoor environments. The methods were compared with respect to their precision, recall, and detection time. For the application considered, the results obtained suggest that the YOLO (v3) method offers a good compromise between F1 score and recognition time. Accurate and fast people detection may help in the future, for example, in estimating the observed thermal load and the consequent adjustment of air-conditioning systems.
Development of a mobile application for monitoring and controlling a CNC machine using Industry 4.0 concepts
Adriano Fagali Souza, Juliana Martins, Henrique Maiochi, Aline Durrer Patelli Juliani, and Pablo Andretta Jaskowiak
The International Journal of Advanced Manufacturing Technology, Dec 2020
Industry 4.0 comprises a set of technologies that allow the interconnection, monitoring, and controlling of manufacturing processes. Today it represents a key point for modern industry. The current work presents an Industry 4.0 system developed for monitoring and controlling a 5-axis CNC machine center, in real time, through a mobile device, providing important feedback information for users and manufacturers of the machine. Given that response time is crucial in such applications, we conducted an experimental investigation to examine the system latency with distinct database structures, based on SQL and NoSQL. The results suggest that the non-relational structure (NoSQL) presented lower response times and is, thus, best suited for the application at hand. The system allows monitoring and controlling of any CNC machine remotely—given that a middleware for connecting the machine is provided—in real time, presenting new possibilities from the perspectives of machine tool builders and shop floor management.
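The latency comparison described can be sketched as a simple timing harness (the database read is stood in for by a hypothetical placeholder function; names are illustrative):

```python
import time, statistics

def measure_latency(operation, repetitions=100):
    """Return the median wall-clock latency (ms) of a callable."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

# Placeholder standing in for a NoSQL (key-value) read path:
store = {"spindle_speed": 12000}
def nosql_read():
    return store["spindle_speed"]

print(measure_latency(nosql_read) >= 0.0)  # -> True
```

The median (rather than the mean) is used so that occasional scheduling spikes do not dominate the comparison between the SQL and NoSQL back ends.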
2019
Modeling The Thermal Performance Of A Window Type Air-conditioning System With Artificial Neural Networks
RNA-Seq is becoming the standard technology for large-scale gene expression level measurements, as it offers a number of advantages over microarrays. Standards for RNA-Seq data analysis are, however, in their infancy when compared to those of microarrays. Clustering, which is essential for understanding gene expression data, has been widely investigated w.r.t. microarrays. In what concerns the clustering of RNA-Seq data, however, a number of questions remain open, resulting in a lack of guidelines to practitioners. Here we evaluate computational steps relevant for clustering cancer samples via an empirical analysis of 15 mRNA-Seq datasets. Our evaluation considers strategies regarding expression estimates, number of genes after non-specific filtering, and data transformations. We evaluate the performance of four clustering algorithms and twelve distance measures, which are commonly used for gene expression analysis. Results support that clustering cancer samples based on gene-level quantification should be preferred. The use of non-specific filtering leading to a small number of features (1,000) presents, in general, superior results. Data should be log-transformed prior to cluster analysis. Regarding the choice of clustering algorithms, Average-Linkage and k-medoids provide, in general, superior recoveries. Although specific cases can benefit from a careful selection of a distance measure, Symmetric Rank-Magnitude correlation provides consistent and sound results in different scenarios.
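The preprocessing steps highlighted above, log transformation and non-specific filtering down to a fixed number of genes, can be sketched roughly as (variance-based filtering is one common non-specific criterion, assumed here):

```python
import math, statistics

def log_transform(matrix):
    """log2(x + 1) transform of a genes-by-samples expression matrix."""
    return [[math.log2(x + 1.0) for x in gene] for gene in matrix]

def nonspecific_filter(matrix, n_genes):
    """Keep the n_genes rows with the highest variance across samples
    (no class labels are used, hence 'non-specific')."""
    ranked = sorted(matrix, key=lambda g: statistics.variance(g), reverse=True)
    return ranked[:n_genes]

expr = [
    [1.0, 1.1, 0.9, 1.0],    # nearly flat gene
    [0.0, 50.0, 0.0, 60.0],  # highly variable gene
    [5.0, 5.5, 5.2, 5.1],
]
kept = nonspecific_filter(log_transform(expr), n_genes=1)
print(len(kept))  # -> 1 (only the most variable gene survives)
```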
2017
Estratégias De Controle Para Sistemas De Condicionamento De Ar Automotivo
Evaluation and validation are essential tasks for achieving meaningful clustering results. Relative validity criteria are measures usually employed in practice to select and validate clustering solutions, as they enable the evaluation of single partitions and the comparison of partition pairs in relative terms based only on the data under analysis. There is a plethora of relative validity measures described in the clustering literature, thus making it difficult to choose an appropriate measure for a given application. One reason for such a variety is that no single measure can capture all different aspects of the clustering problem and, as such, each of them is prone to fail in particular application scenarios. In the present work, we take advantage of the diversity in relative validity measures from the clustering literature. Previous work showed that when randomly selecting different relative validity criteria for an ensemble (from an initial set of 28 different measures), one can expect with great certainty to only improve results over the worst criterion included in the ensemble. In this paper, we propose a method for selecting measures with minimum effectiveness and some degree of complementarity (from the same set of 28 measures) into ensembles, which show superior performance when compared to any single ensemble member (and not just the worst one) over a variety of different datasets. One can also expect greater stability in terms of evaluation over different datasets, even when considering different ensemble strategies. Our results are based on more than a thousand datasets, synthetic and real, from different sources.
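One simple way to combine relative validity criteria into an ensemble, illustrated here with mean-rank aggregation (an illustrative strategy; the paper's actual selection and combination procedures may differ):

```python
def rank(scores, higher_is_better=True):
    """1-based ranks of candidate partitions under one criterion."""
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=higher_is_better)
    ranks = [0] * len(scores)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def ensemble_rank(criteria_scores):
    """Mean rank of each partition across all criteria (lower = better)."""
    all_ranks = [rank(s) for s in criteria_scores]
    n = len(criteria_scores[0])
    return [sum(r[i] for r in all_ranks) / len(all_ranks) for i in range(n)]

# Three criteria scoring four candidate partitions (higher = better):
scores = [
    [0.9, 0.4, 0.6, 0.2],
    [0.8, 0.5, 0.7, 0.1],
    [0.3, 0.6, 0.9, 0.2],
]
print(ensemble_rank(scores))  # partitions 0 and 2 tie for best mean rank
```

Aggregating ranks rather than raw scores sidesteps the fact that different criteria live on incompatible scales.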
2015
PhD Thesis
On the evaluation of clustering results: measures, ensembles, and gene expression data analysis
Several missing value imputation methods for gene expression data have been proposed in the literature. In the past few years, researchers have been putting a great deal of effort into presenting systematic evaluations of the different imputation algorithms. Initially, most algorithms were assessed with an emphasis on the accuracy of the imputation, using metrics such as the root mean squared error. However, it has become clear that the success of the estimation of the expression value should be evaluated in more practical terms as well. One can consider, for example, the ability of the method to preserve the significant genes in the dataset, or its discriminative/predictive power for classification/clustering purposes.
Decision-tree induction algorithms are widely used in knowledge discovery and data mining, especially in scenarios where model comprehensibility is desired. A variation of the traditional univariate approach is the so-called oblique decision tree, which allows multivariate tests in its non-terminal nodes. Oblique decision trees can model decision boundaries that are oblique to the attribute axes, whereas univariate trees can only perform axis-parallel splits. The vast majority of the oblique and univariate decision-tree induction algorithms employ a top-down strategy for growing the tree, relying on an impurity-based measure for splitting nodes. In this paper, we propose BUTIF—a novel Bottom-Up Oblique Decision-Tree Induction Framework. BUTIF does not rely on an impurity measure for dividing nodes, since the data resulting from each split is known a priori. For generating the initial leaves of the tree and the splitting hyperplanes in its internal nodes, BUTIF allows the adoption of distinct clustering algorithms and binary classifiers, respectively. It is also capable of performing embedded feature selection, which may reduce the number of features in each hyperplane, thus improving model comprehension. Different from virtually every top-down decision-tree induction algorithm, BUTIF does not require the further execution of a pruning procedure in order to avoid overfitting, due to its bottom-up nature that does not overgrow the tree. We compare distinct instances of BUTIF to traditional univariate and oblique decision-tree induction algorithms. Empirical results show the effectiveness of the proposed framework.
On the selection of appropriate distances for gene expression data clustering
Background: Clustering is crucial for gene expression data analysis. As an unsupervised exploratory procedure its results can help researchers to gain insights and formulate new hypotheses about biological data from microarrays. Given different settings of microarray experiments, clustering proves itself as a versatile exploratory tool. It can help to unveil new cancer subtypes or to identify groups of genes that respond similarly to a specific experimental condition. In order to obtain useful clustering results, however, different parameters of the clustering procedure must be properly tuned. Besides the selection of the clustering method itself, determining which distance is going to be employed between data objects is probably one of the most difficult decisions. Results and conclusions: We analyze how different distances and clustering methods interact regarding their ability to cluster gene expression, i.e., microarray data. We study 15 distances along with four common clustering methods from the literature on a total of 52 gene expression microarray datasets. Distances are evaluated on a number of different scenarios including clustering of cancer tissues and genes from short time-series expression data, the two main clustering applications in gene expression. Our results support that the selection of an appropriate distance depends on the scenario at hand. Moreover, in each scenario, given the very same clustering method, significant differences in quality may arise from the selection of distinct distance measures. In fact, the selection of an appropriate distance measure can make the difference between meaningful and poor clustering outcomes, even for a suitable clustering method.
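For concreteness, one widely used correlation-based distance between expression profiles, 1 − Pearson, shown next to Euclidean distance (a generic sketch, not tied to the paper's 15 distances):

```python
import math

def pearson_distance(a, b):
    """1 - Pearson correlation: small when profiles co-vary in shape,
    regardless of their absolute magnitudes."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return 1.0 - cov / (sa * sb)

def euclidean_distance(a, b):
    return math.dist(a, b)

# Two profiles with identical shape but different magnitude:
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [10.0, 20.0, 30.0, 40.0]
print(round(pearson_distance(g1, g2), 6), round(euclidean_distance(g1, g2), 2))
```

The example shows why the choice matters: the two genes are identical to a correlation-based distance (distance ≈ 0) yet far apart under Euclidean distance, so the two measures can lead to very different clusterings.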
One of the most challenging aspects of clustering is validation, which is the objective and quantitative assessment of clustering results. A number of different relative validity criteria have been proposed for the validation of globular clusters. Not all data, however, are composed of globular clusters. Density-based clustering algorithms seek partitions with high density areas of points (clusters, not necessarily globular) separated by low density areas, possibly containing noise objects. In these cases relative validity indices proposed for globular cluster validation may fail. In this paper we propose a relative validation index for density-based, arbitrarily shaped clusters. The index assesses clustering quality based on the relative density connection between pairs of objects. Our index is formulated on the basis of a new kernel density function, which is used to compute the density of objects and to evaluate the within- and between-cluster density connectedness of clustering results. Experiments on synthetic and real world data show the effectiveness of our approach for the evaluation and selection of clustering algorithms and their respective appropriate parameters.
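The building block of such an index, a kernel density estimate per object, can be sketched as follows (a plain Gaussian kernel is used here for illustration; the paper introduces its own kernel density function):

```python
import math

def gaussian_kde(point, data, bandwidth=1.0):
    """Kernel density of `point` as the mean of Gaussian kernels
    centred at every object in `data`."""
    h, d = bandwidth, len(point)
    norm = (2 * math.pi * h * h) ** (d / 2)
    total = sum(math.exp(-math.dist(point, x) ** 2 / (2 * h * h)) for x in data)
    return total / (len(data) * norm)

# A point inside a tight group is denser than an isolated one:
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(gaussian_kde((0.05, 0.05), data) > gaussian_kde((5.0, 5.0), data))  # -> True
```

A density-based index then compares such densities along within-cluster connections against those across cluster boundaries.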
2013
On the Combination of Relative Clustering Validity Criteria
Many different relative clustering validity criteria exist that are very useful as quantitative measures for assessing the quality of data partitions. These criteria are endowed with particular features that may make each of them more suitable for specific classes of problems. Nevertheless, the performance of each criterion is usually unknown a priori by the user. Hence, choosing a specific criterion is not a trivial task. A possible approach to circumvent this drawback consists of combining different relative criteria in order to obtain more robust evaluations. However, this approach has so far been applied in an ad-hoc fashion only; its real potential is actually not well-understood. In this paper, we present an extensive study on the combination of relative criteria considering both synthetic and real datasets. The experiments involved 28 criteria and 4 different combination strategies applied to a varied collection of data partitions produced by 5 clustering algorithms. In total, 427,680 partitions of 972 synthetic datasets and 14,000 partitions of a collection of 400 image datasets were considered. Based on the results, we discuss the shortcomings and possible benefits of combining different relative criteria into a committee.
Proximity Measures for Clustering Gene Expression Microarray Data: A Validation Methodology and a Comparative Analysis
Cluster analysis is usually the first step adopted to unveil information from gene expression microarray data. Besides selecting a clustering algorithm, choosing an appropriate proximity measure (similarity or distance) is of great importance to achieve satisfactory clustering results. Nevertheless, to date, there are no comprehensive guidelines concerning how to choose proximity measures for clustering microarray data. Pearson is the most used proximity measure, whereas characteristics of other ones remain unexplored. In this paper, we investigate the choice of proximity measures for the clustering of microarray data by evaluating the performance of 16 proximity measures in 52 data sets from time course and cancer experiments. Our results support that measures rarely employed in the gene expression literature can provide better results than commonly employed ones, such as Pearson, Spearman, and Euclidean distance. Given that different measures stood out for time course and cancer data evaluations, their choice should be specific to each scenario. To evaluate measures on time-course data, we preprocessed and compiled 17 data sets from the microarray literature in a benchmark along with a new methodology, called Intrinsic Biological Separation Ability (IBSA). Both can be employed in future research to assess the effectiveness of new measures for gene time-course data.
2012
Evaluating Correlation Coefficients for Clustering Gene Expression Profiles of Cancer
Cluster analysis is usually the first step adopted to unveil information from gene expression data. One of its common applications is the clustering of cancer samples, associated with the detection of previously unknown cancer subtypes. Although guidelines have been established concerning the choice of appropriate clustering algorithms, little attention has been given to the subject of proximity measures. Whereas the Pearson correlation coefficient appears as the de facto proximity measure in this scenario, no comprehensive study analyzing other correlation coefficients as alternatives to it has been conducted. Considering such facts, we evaluated five correlation coefficients (along with Euclidean distance) regarding the clustering of cancer samples. Our evaluation was conducted on 35 publicly available datasets covering both (i) intrinsic separation ability and (ii) clustering predictive ability of the correlation coefficients. Our results support that correlation coefficients rarely considered in the gene expression literature may provide competitive results to more generally employed ones. Finally, we show that a recently introduced measure arises as a promising alternative to the commonly employed Pearson, providing competitive and even superior results to it.
2011
MSc Thesis
Estudo de coeficientes de correlação para medidas de proximidade em dados de expressão gênica
An important analysis performed on gene expression data is sample classification, e.g., the classification of different types or subtypes of cancer. Different classifiers have been employed for this challenging task, among which the k-Nearest Neighbors (kNN) classifier stands out for being at the same time very simple and highly flexible in terms of discriminatory power. Although the choice of a dissimilarity measure is essential to kNN, little effort has been undertaken to evaluate how this choice affects its performance in cancer classification. To this end, we compare seven correlation coefficients for cancer classification using kNN. Our comparison suggests that a recently introduced correlation may perform better than commonly used measures. We also show that correlation coefficients rarely considered can provide competitive results when compared to widely used dissimilarity measures.
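The core of such a comparison can be sketched as a kNN classifier parameterized by a dissimilarity; here 1 − Pearson correlation stands in for the correlation-based measures studied (a generic sketch with toy, hypothetical data):

```python
import math
from collections import Counter

def one_minus_pearson(a, b):
    """Correlation-based dissimilarity: 0 for perfectly co-varying profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return 1.0 - cov / (sa * sb)

def knn_predict(train_x, train_y, query, k=3, dissim=one_minus_pearson):
    """Majority vote among the k training profiles least dissimilar
    to the query expression profile."""
    ranked = sorted(zip(train_x, train_y), key=lambda p: dissim(p[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy expression profiles for two hypothetical cancer subtypes:
X = [[1, 2, 3, 4], [2, 3, 4, 5], [1, 2, 3, 5],   # rising profiles: subtype A
     [4, 3, 2, 1], [5, 4, 3, 2], [5, 4, 3, 1]]   # falling profiles: subtype B
y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(X, y, [10, 20, 30, 40]))  # rising shape -> "A"
```

Swapping `dissim` is all it takes to compare different correlation coefficients under the very same classifier, which is the essence of the evaluation described above.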
A bottom-up oblique decision tree induction algorithm
Rodrigo C. Barros, Ricardo Cerri, Pablo A. Jaskowiak, and André C. P. L. F. Carvalho
In 2011 11th International Conference on Intelligent Systems Design and Applications, Mar 2011