Urban Water-Demand (UWD) forecasting is crucial for efficient water management, improving distribution, and supporting environmental sustainability. In tourist destinations with significant seasonal variations in the number of inhabitants (water consumers), accurate water-demand forecasting becomes particularly important. This work evaluates two statistical models for short-term UWD forecasting, namely, Autoregressive Integrated Moving Average (ARIMA) and Seasonal Autoregressive Integrated Moving Average (SARIMA). Two different strategies for model deployment and comparison are considered: (i) a sliding window (SW) approach with one-year (1Y) and two-year (2Y) windows for training; and (ii) an expanding window (EW) approach. The ARIMA model deployed with a two-year (2Y) sliding window achieved the best overall results, followed by the SARIMA model deployed with an expanding window. To place these outcomes in perspective, we compared them with results from related work that applied Machine Learning regression methods to the same data. This comparison suggests that statistical methods provide results that are both competitive and robust in terms of quality for short-term forecasts.
@inproceedings{Stefaniak2024,author={Stefaniak, Antoniel and Jaskowiak, Pablo Andretta and Weihmann, Lucas},title={A Case Study on Water Demand Forecasting in a Coastal Tourist City},booktitle={34th Brazilian Conference on Intelligent Systems (BRACIS)},year={2025},publisher={Springer Nature Switzerland},pages={3--17},isbn={978-3-031-79035-5},doi={10.1007/978-3-031-79035-5_1},}
Acoustic Features and Autoencoders for Fault Detection in Rotating Machines: A Case Study
Traditional Machine Fault Detection (MFD) techniques usually rely on multiple sensor data sources, such as vibration, temperature, force, and audio/acoustic signals. Acoustic signals, in particular, are quite appealing in the context of MFD, as they are often among the first manifestations of machine failure. Furthermore, they are associated with high sensitivity, environmental resilience, and do not require machine interference. Given these compelling characteristics, MFD based exclusively on acoustic signals can be highly beneficial. In this work, we evaluate an unsupervised MFD approach based on Autoencoders (AE) trained exclusively on features extracted from acoustic signals of a rotating machine. The data employed in this work comes from the Machine Fault Database (MaFaulDa), which includes information from vibration and velocity sensors, besides the acoustic measurements. This allows us to compare the performance of the AE models to that of supervised models (such as MLPs) trained on the same acoustic-based feature set, as well as feature sets that incorporate all sensors from MaFaulDa. Our results support that unsupervised MFD based on Autoencoders and acoustic signals is particularly appealing, as it requires only normal machine operation for training. Indeed, we obtained AUC values of 0.86 for the task.
@inproceedings{Bortoni2024,author={Bortoni, Leonardo and Jaskowiak, Pablo Andretta},booktitle={34th Brazilian Conference on Intelligent Systems (BRACIS)},year={2025},isbn={978-3-031-79035-5},doi={10.1007/978-3-031-79035-5_3},publisher={Springer Nature Switzerland},}
2024
Comparison of Face Detection Methods Under the Influence of Lighting Variation
Renan Sakamoto, Benjamin Moreira, and Pablo Jaskowiak
In Anais do XXI Encontro Nacional de Inteligência Artificial e Computacional, 2024
@inproceedings{Sakamoto2024,author={Sakamoto, Renan and Moreira, Benjamin and Jaskowiak, Pablo},title={ Comparison of Face Detection Methods Under the Influence of Lighting Variation},booktitle={Anais do XXI Encontro Nacional de Inteligência Artificial e Computacional},location={Belém/PA},year={2024},keywords={},issn={2763-9061},pages={496--507},publisher={SBC},address={Porto Alegre, RS, Brasil},doi={10.5753/eniac.2024.245258},url={https://sol.sbc.org.br/index.php/eniac/article/view/33819},}
A Case Study on Deep Learning for Photovoltaic Power Forecasting Combining Satellite and Ground Data
The increasing demand for clean energy presents challenges in energy supply management, largely due to its intermittency. Photovoltaic power generation, in particular, is greatly affected by weather factors, which may render power grids susceptible to instability, quality, and balance issues. In this context, photovoltaic power generation forecasting is crucial not only to enhance the management of diverse energy sources through generation planning, but also to ensure widespread adoption of photovoltaic energy. To address the predictability issue in generation, this study investigates the combination of satellite data with meteorological data to predict the energy generation potential of photovoltaic panels at 30-, 60-, 120-, and 180-minute horizons. For this purpose, images from the GOES-16 satellite are used in combination with data from a ground-based weather station located in Florianópolis – Santa Catarina – Brazil. The data is fed to a convolutional neural network, where convolutions are employed to extract features from the satellite images, aiming to establish a relationship with solar irradiation. The output of the convolutional network serves as input for a multilayer perceptron network, which uses the data to predict the Global Horizontal Irradiance (GHI). Our results support that models incorporating satellite images provide forecasts approximately 41% better for the 30-minute horizon and 21% better for the 180-minute horizon, when compared to models without satellite images.
@article{Buzzi2024,author={Buzzi, L. H. and Weihmann, L. and Jaskowiak, P. A.},title={A Case Study on Deep Learning for Photovoltaic Power Forecasting Combining Satellite and Ground Data},journal={Learning \& Nonlinear Models},pages={6--18},publisher={SBIC},year={2024},volume={22},number={2},doi={10.21528/lnlm-vol22-no2-art1},}
Machine learning for water demand forecasting: Case study in a Brazilian coastal city
Jesuino Vieira Filho, Arlan Scortegagna, Amanara Potykytã de Sousa Dias Vieira, and Pablo Andretta Jaskowiak
Water resources management is crucial for human well-being and contemporary socio-economic development. However, the increasing use of water has led to various problems that affect its quality and availability. To address these issues, accurate forecasting of water consumption is essential for the optimal operation of water collection, treatment, and distribution systems. This study aims to compare four machine learning methods for predicting daily urban water demand in a Brazilian coastal tourist city (Guaratuba – Paraná). Historical data from the city’s water distribution system, spanning from 2016 to 2019 (1,461 measurements in total), were considered along with meteorological and calendar data to conduct the investigation. Three time series cross-validation approaches were considered for each method, thus totaling 12 evaluation settings. All models were subjected to hyperparameter optimization and evaluated using appropriate performance metrics from the literature. Results demonstrate the importance of using nonlinear models to predict short-term water demand, highlighting the problem’s complexity. From the compared models, multilayer perceptron provided the best results. Finally, regardless of the model, the best results were obtained by applying an expanding window time series cross-validation, indicating that the more historical data available, the better, in this particular case.
@article{Filho2024,author={Filho, Jesuino Vieira and Scortegagna, Arlan and Vieira, Amanara Potykyt{\~a} de Sousa Dias and Jaskowiak, Pablo Andretta},title={Machine learning for water demand forecasting: Case study in a Brazilian coastal city},journal={Water Practice and Technology},year={2024},month=apr,day={23},pages={wpt2024096},issn={1751-231X},doi={10.2166/wpt.2024.096},url={https://doi.org/10.2166/wpt.2024.096}}
Synapse meets Slurm: Proposta de um Middleware para Paralelização de Algoritmos de Otimização Populacionais
Multidisciplinary Optimization plays a central role in integrating diverse areas in engineering projects. Computational resources can, however, be a limiting factor in optimization processes. This paper presents a middleware developed to integrate the Synapse software with clusters managed by the Slurm Workload Manager. The middleware facilitates the distributed execution of population-based optimization algorithms (e.g., genetic algorithms), benefiting optimization processes. Tests on a heterogeneous cluster validated the solution. In preliminary experiments, the developed solution achieved speedups of up to ten times compared with the use of workstations, the approach supported by Synapse until then.
@inproceedings{GabTanJas2024,author={Gabardo, Arthur Miguel Pereira and Tancredi, Thiago Pontin and Jaskowiak, Pablo Andretta},title={Synapse meets Slurm: Proposta de um Middleware para Paralelização de Algoritmos de Otimização Populacionais},pages={1-4},booktitle={XXIV Escola Regional de Alto Desempenho da Região Sul (ERAD-RS)},publisher={Sociedade Brasileira de Computação (SBC)},address={Florianópolis, SC, Brazil},location={Florianópolis, SC, Brazil},month=apr,year={2024},doi={10.5753/eradrs.2024.238656},}
Advancements in Computational Tools for Integrated Mooring Systems Design: a Review
Mooring systems play a crucial role in restraining displacement, preserving structural integrity, and ensuring the safety of Floating Production Units (FPUs) and their subsystems (e.g., risers). However, designing these systems is complex due to the interaction between various project characteristics, environmental factors, and operational requirements of the floating unit. A comprehensive analysis of multiple design configurations is needed to ascertain their feasibility, resulting in a lengthy process where less than 0.1% of solutions satisfy all design requirements. Computational tools are crucial, being widely employed in the design and verification of offshore systems. Through sophisticated simulations of nonlinear dynamic systems, finite element methods (FEM), and computational fluid dynamics (CFD) analyses, engineers can evaluate the performance and viability of different mooring system configurations. This article presents the Synapse Multidisciplinary Engineering software, which leverages Multidisciplinary Optimization (MDO) methods to simplify and automate the design process, meeting project constraints and requirements. To mitigate the high computational cost of simulations, analyses, and optimizations, Synapse uses machine learning algorithms, surrogate models, and high-performance computing (HPC) cluster infrastructures. New human-machine interfaces (HMI), such as augmented and virtual reality (AR/VR), are being explored to revolutionize the design and visualization of mooring systems. This review highlights these trends in computational tools for the integrated design of mooring systems, emphasizing the continuous evolution and development of these technologies.
@inproceedings{GabTanJas2025,author={Gabardo, Arthur Miguel Pereira and Tancredi, Thiago Pontin and Jaskowiak, Pablo Andretta},title={Advancements in Computational Tools for Integrated Mooring Systems Design: a Review},pages={1-9},booktitle={Proceedings of the XII Congresso Brasileiro De Pesquisa E Desenvolvimento Em Petróleo E Gás},publisher={Associação Brasileira De Pesquisa E Desenvolvimento Em Petróleo E Gás},month=oct,year={2024},url={https://pdpetro.com.br/anais/?idpdpetro=12}}
2023
Deep Learning and Satellite Images for Photovoltaic Power Forecasting: A Case Study
The growing demand for renewable energy resources presents a supply management challenge, as photovoltaic (PV) energy exhibits intermittent generation due to meteorological factors. The unpredictability of these variations leaves power grids vulnerable to instability, quality, and balance issues. In this context, accurate forecasting of PV power generation can improve management through generation planning, allowing for the balancing of different energy sources, which is crucial for achieving widespread PV energy adoption. The rapid development and significant advancements in deep learning present new possibilities for the use of satellite imagery in PV power forecasting. In this work we build and evaluate several deep learning models in the context of PV power forecasting, aiming at 30- and 60-minute horizons. Our models are built for the prediction of the Global Horizontal Irradiance (GHI) component which, due to its strong correlation with PV power generation, can be employed not only to derive the actual PV plant output, but also as a measure of generation potential, regardless of the actual PV plant. The models take as input images from the GOES-16 satellite and ground-based meteorological measurements, which are considered as desired outcomes. Several model configurations demonstrated the viability of GHI forecasting based on satellite imagery, with the best models achieving relative root mean squared errors (rRMSE) of 15.6% and 17.2% for 30-minute and 60-minute forecast horizons, respectively.
@inproceedings{BuzWeiJas2023,author={Buzzi, Luiz Henrique and Weihmann, Lucas and Jaskowiak, Pablo Andretta},title={Deep Learning and Satellite Images for Photovoltaic Power Forecasting: A Case Study},pages={1-8},booktitle={Proceedings of the XVI Brazilian Conference on Computational Intelligence (CBIC 2023)},publisher={Sociedade Brasileira de Inteligência Computacional (SBIC)},address={Salvador, BH, Brazil},location={Salvador, BH, Brazil},month=oct,year={2023},doi={10.21528/CBIC2023-120},}
Clustering Validation with The Area Under Precision-Recall Curves
Confusion matrices and derived metrics provide a comprehensive framework for the evaluation of model performance in machine learning. These are well-known and extensively employed in the supervised learning domain, particularly classification. Surprisingly, such a framework has not been fully explored in the context of clustering validation. Indeed, only recently has this gap been bridged with the introduction of the Area Under the ROC Curve for Clustering (AUCC), an internal/relative Clustering Validation Index (CVI) that allows for clustering validation in real application scenarios. In this work we explore the Area Under the Precision-Recall Curve (and related metrics) in the context of clustering validation. We show that these are not only appropriate as CVIs, but should also be preferred in the presence of cluster imbalance. We perform a comprehensive evaluation of proposed and state-of-the-art CVIs on real and simulated data sets. Our observations point towards a unified validation framework for supervised and unsupervised learning, given that they are consistent with existing guidelines established for the evaluation of supervised learning models.
@article{JasCos2023,author={Jaskowiak, Pablo Andretta and Costa, Ivan Gesteira},title={Clustering Validation with The Area Under Precision-Recall Curves},journal={arXiv e-prints},keywords={Computer Science - Machine Learning},year={2023},month=apr,eid={arXiv:2304.01450},pages={arXiv:2304.01450},doi={10.48550/arXiv.2304.01450},archiveprefix={arXiv},eprint={2304.01450},primaryclass={cs.LG},adsurl={https://ui.adsabs.harvard.edu/abs/2023arXiv230401450A},adsnote={Provided by the SAO/NASA Astrophysics Data System}}
2022
The area under the ROC curve as a measure of clustering quality
The area under the receiver operating characteristics (ROC) Curve, referred to as AUC, is a well-known performance measure in the supervised learning domain. Due to its compelling features, it has been employed in a number of studies to evaluate and compare the performance of different classifiers. In this work, we explore AUC as a performance measure in the unsupervised learning domain, more specifically, in the context of cluster analysis. In particular, we elaborate on the use of AUC as an internal/relative measure of clustering quality, which we refer to as Area Under the Curve for Clustering (AUCC). We show that the AUCC of a given candidate clustering solution has an expected value under a null model of random clustering solutions, regardless of the size of the dataset and, more importantly, regardless of the number or the (im)balance of clusters under evaluation. In addition, we elaborate on the fact that, in the context of internal/relative clustering validation as we consider, AUCC is actually a linear transformation of the Gamma criterion from Baker and Hubert (1975), for which we also formally derive a theoretical expected value for chance clusterings. We also discuss the computational complexity of these criteria and show that, while an ordinary implementation of Gamma can be computationally prohibitive and impractical for most real applications of cluster analysis, its equivalence with AUCC actually unveils a much more efficient algorithmic procedure. Our theoretical findings are supported by experimental results. These results show that, in addition to an effective and robust quantitative evaluation provided by AUCC, visual inspection of the ROC curves themselves can be useful to further assess a candidate clustering solution from a broader, qualitative perspective as well.
@article{JasCosCam2022,author={Jaskowiak, Pablo Andretta and Costa, Ivan Gesteira and Campello, Ricardo J. G. B.},title={The area under the ROC curve as a measure of clustering quality},journal={Data Mining and Knowledge Discovery},year={2022},month=may,day={01},volume={36},number={3},pages={1219-1245},issn={1573-756X},doi={10.1007/s10618-022-00829-0},url={https://link.springer.com/article/10.1007/s10618-022-00829-0}}
Tetris is one of the highest-grossing video games in history and, despite its age, remains quite popular. One of its most acclaimed versions was released in 1989 for the Nintendo Entertainment System (NES) and is often referred to as NES Tetris. This particular version of the game has led to the creation of the Classic Tetris World Championship (CTWC), resulting in growing popularity and alternative modes of gameplay. In one such variant, players aim to clear as many lines as possible, with an additional constraint: piece rotations are not allowed. In this work we build and evaluate agents to play this particular variant of the game based on different metrics that grade board configurations. The relative importance of the metrics is determined with Particle Swarm Optimization. Our best results match those of top-performing human players, even though the metrics we employ were not specifically developed for this game variant.
@inproceedings{SosBirJas2021,author={Soster, Adler and Birken, Michael and Jaskowiak, Pablo Andretta},title={Playing NES Tetris with No Piece Rotations},booktitle={Anais Estendidos do XX Simp\'{o}sio Brasileiro de Jogos e Entretenimento Digital (SBGames 2021)},location={Online},year={2021},issn={2179-2259},pages={339--343},publisher={Sociedade Brasileira de Computa\c{c}\~{a}o (SBC)},address={Porto Alegre, RS, Brasil},doi={10.5753/sbgames_estendido.2021.19664},url={https://sol.sbc.org.br/index.php/sbgames_estendido/article/view/19664}}
Modelagem e Identificação de Dados Epidemiológicos Associados à Pandemia de COVID-19 em Santa Catarina
The novel coronavirus (COVID-19) spread significantly across the globe and became one of the great afflictions of contemporary times, deeply impacting Brazil, which ranks among the nations most affected by the disease. Consequently, the need for technological systems to combat the health crisis became even more urgent in the country. In view of this, this paper presents a comparative study of two techniques for modeling and forecasting epidemiological data associated with the COVID-19 pandemic in Brazil, specifically in the state of Santa Catarina. Essentially, polynomial Non-Linear Autoregressive models with eXogenous input (NARX) were devised as a counterpoint to time series modeling with Long Short-Term Memory (LSTM) recurrent neural networks, for data series corresponding to the numbers of confirmed cases, deaths, recovered patients, and public health system beds occupied by patients with the disease. The predictive performance of the models, assessed with traditional performance metrics, showed that the NARX model obtained more satisfactory results for three of the four time series used for forecasting.
@inproceedings{AnsBriJas2021,author={Anschau, Eduard Hermes and Brito, Alexandro Garro and Jaskowiak, Pablo Andretta},title={Modelagem e Identifica\c{c}\~ao de Dados Epidemiol\'ogicos Associados \`a Pandemia de COVID-19 em Santa Catarina},pages={1-8},booktitle={Anais do 15 Congresso Brasileiro de Intelig\^encia Computacional (CBIC 2021)},editor={Filho, Carmelo Jos\'e Albanez Bastos and Siqueira, Hugo Valadares and Ferreira, Danton Diego and Bertol, Douglas Wildgrube and de Oliveira, Roberto C\'elio Lim\~ao},publisher={Sociedade Brasileira de Inteligência Computacional (SBIC)},address={Joinville, SC, Brazil},year={2021},doi={10.21528/CBIC2021-122},url={https://sbic.org.br/eventos/cbic_2021/cbic2021-122/}}
Retrofitting of a two-degrees-of-freedom welding torch displacement system
@inproceedings{FabCunJas2021,author={Fabri, Jo\~{a}o Victor and Cunha, Tiago Vieira Da and Jaskowiak, Pablo Andretta},title={Retrofitting of a two-degrees-of-freedom welding torch displacement system},pages={1-7},booktitle={Proceedings of the 26th International Congress of Mechanical Engineering (COBEM 2021)},year={2021},doi={10.26678/ABCM.COBEM2021.COB2021-0114},address={Florian\'{o}polis, SC, Brazil}}
2020
Comparative Study of Photovoltaic Power Forecasting Methods
Electricity consumption is growing rapidly worldwide. Renewable energy resources, such as solar energy, play a crucial role in this scenario, helping to satisfy demand sustainably. Although the share of Photovoltaic (PV) power generation has increased in the past years, PV systems are quite sensitive to climatic and meteorological conditions, leading to undesirable power production variability. In order to improve energy grid stability, reliability, and management, accurate forecasting models that relate operational conditions to power output are needed. In this work we evaluate the performance of regression methods applied to forecast short-term (next-day) energy production of a PV plant. Specifically, we consider five regression methods and different configurations of feature sets. Our results suggest that MLP and SVR provide the best forecasting results, in general. Also, although features based on different solar irradiance levels play a key role in predicting power generation, the use of additional features can improve prediction results.
@inproceedings{PelCovSpeJas2020,author={Pelisson, Angelo and Cov\~{o}es, Thiago and Spengler, Anderson and Jaskowiak, Pablo Andretta},title={Comparative Study of Photovoltaic Power Forecasting Methods},booktitle={Anais do XVII Encontro Nacional de Inteligência Artificial e Computacional (ENIAC 2020)},location={Online Event},year={2020},issn={2763-9061},pages={555--566},publisher={Sociedade Brasileira de Computação (SBC)},address={Porto Alegre, RS, Brazil},doi={10.5753/eniac.2020.12159},url={https://sol.sbc.org.br/index.php/eniac/article/view/12159}}
Comparação de Métodos de Deep Learning Pré-Treinados da Biblioteca OpenCV para Detecção de Pessoas em Ambientes Internos
Camera-based monitoring systems are increasingly ubiquitous in indoor and outdoor environments. The existence of a monitoring system does not guarantee, however, that all collected information is used and/or analyzed. When interpretation of the images is required, computer vision is usually employed. In this particular context, Deep Learning methods have received growing attention. Indeed, despite their recent development, some of these methods are available pre-trained in software libraries and packages, allowing them to be applied with relative ease. In this work, different Deep Learning methods available in the OpenCV library were compared for detecting and counting people in indoor environments. The methods were compared in terms of precision, recall, and detection time. For the application considered, the results suggest that the YOLO (v3) method offers a good trade-off between F1 score and recognition time. Accurate and fast people detection may in the future help, for example, in estimating the observed thermal load and consequently adjusting air conditioning systems.
@article{VieJas2020,title={Compara\c{c}\~{a}o de M\'{e}todos de Deep Learning Pr\'{e}-Treinados da Biblioteca OpenCV para Detec\c{c}\~{a}o de Pessoas em Ambientes Internos},volume={18},doi={10.5753/reic.2020.1766},url={https://sol.sbc.org.br/journals/index.php/reic/article/view/1766},number={4},journal={Revista Eletrônica de Inicia\c{c}\~{a}o Cient\'{i}fica em Computa\c{c}\~{a}o},author={Filho, Jesuino Vieira and Jaskowiak, Pablo Andretta},year={2020},month=nov,}
Development of a mobile application for monitoring and controlling a CNC machine using Industry 4.0 concepts
Adriano Fagali Souza, Juliana Martins, Henrique Maiochi, Aline Durrer Patelli Juliani, and Pablo Andretta Jaskowiak
The International Journal of Advanced Manufacturing Technology, Dec 2020
Industry 4.0 comprises a set of technologies that allow the interconnection, monitoring, and controlling of manufacturing processes. Today it represents a key point for modern industry. The current work presents an Industry 4.0 system developed for monitoring and controlling a 5-axis CNC machining center, in real time, through a mobile device, providing important feedback information for users and manufacturers of the machine. Given that response time is crucial in such applications, we conducted an experimental investigation to examine the system latency with distinct database structures, based on SQL and NoSQL. The results suggest that the non-relational structure (NoSQL) presented lower response times and is, thus, best suited for the application at hand. The system allows monitoring and controlling of any CNC machine remotely, given that a middleware for connecting the machine is provided, in real time, presenting new possibilities from the perspectives of machine tool builders and shop floor management.
@article{FagMartMaiJulJas2020,author={de Souza, Adriano Fagali and Martins, Juliana and Maiochi, Henrique and Juliani, Aline Durrer Patelli and Jaskowiak, Pablo Andretta},title={Development of a mobile application for monitoring and controlling a CNC machine using Industry 4.0 concepts},journal={The International Journal of Advanced Manufacturing Technology},year={2020},month=dec,day={01},volume={111},number={9},pages={2545-2552},issn={1433-3015},url={https://link.springer.com/article/10.1007/s00170-020-06245-2},doi={10.1007/s00170-020-06245-2}}
2019
Modeling The Thermal Performance Of A Window Type Air-conditioning System With Artificial Neural Networks
@inproceedings{FabJasLon2019,author={Fabri, Jo\~{a}o Victor and Jaskowiak, Pablo Andretta and da Silva, Diogo Londero},title={Modeling The Thermal Performance Of A Window Type Air-conditioning System With Artificial Neural Networks},pages={1-8},booktitle={Proceedings of the 25th International Congress of Mechanical Engineering (COBEM 2019)},year={2019},address={Uberlandia, MG, Brazil},doi={10.26678/abcm.cobem2019.cob2019-0789}}
2018
Agrupamento De Dados Coletados Sobre A Rugosidade De Uma Amostra De Calçadas Na Cidade De Joinville – Sc – Brasil
@inproceedings{AndJasIslPfu2018,author={Andrade, G. A. M. and Jaskowiak, Pablo Andretta and Isler, C. A. and Pfutzenreuter, A. H.},title={Agrupamento De Dados Coletados Sobre A Rugosidade De Uma Amostra De Cal\c{c}adas Na Cidade De Joinville – Sc – Brasil},pages={686-700},booktitle={Oitavo Congresso Luso-brasileiro Para O Planeamento Urbano, Regional, Integrado E Sustent\'{a}vel (PLURIS 2018)},year={2018},address={Coimbra, Portugal}}
Clustering of RNA-Seq samples: Comparison study on cancer data
RNA-Seq is becoming the standard technology for large-scale gene expression level measurements, as it offers a number of advantages over microarrays. Standards for RNA-Seq data analysis are, however, in their infancy when compared to those of microarrays. Clustering, which is essential for understanding gene expression data, has been widely investigated w.r.t. microarrays. Concerning the clustering of RNA-Seq data, however, a number of questions remain open, resulting in a lack of guidelines for practitioners. Here we evaluate computational steps relevant for clustering cancer samples via an empirical analysis of 15 mRNA-seq datasets. Our evaluation considers strategies regarding expression estimates, number of genes after non-specific filtering, and data transformations. We evaluate the performance of four clustering algorithms and twelve distance measures, which are commonly used for gene expression analysis. Results support that clustering cancer samples based on a gene quantification should be preferred. The use of non-specific filtering leading to a small number of features (1,000) presents, in general, superior results. Data should be log-transformed prior to cluster analysis. Regarding the choice of clustering algorithms, Average-Linkage and k-medoids provide, in general, superior recoveries. Although specific cases can benefit from a careful selection of a distance measure, Symmetric Rank-Magnitude correlation provides consistent and sound results in different scenarios.
@article{JasCosCam2018,title={Clustering of RNA-Seq samples: Comparison study on cancer data},journal={Methods},volume={132},pages={42-49},year={2018},issn={1046-2023},doi={10.1016/j.ymeth.2017.07.023},url={https://www.sciencedirect.com/science/article/pii/S1046202317300476},author={Jaskowiak, Pablo Andretta and Costa, Ivan G. and Campello, Ricardo J. G. B.},keywords={RNA-Seq, Gene expression, Clustering, Cluster analysis, Cancer}}
2017
Estratégias De Controle Para Sistemas De Condicionamento De Ar Automotivo
@inproceedings{JulJasLon2017,author={Juliani, A. D. P. and Jaskowiak, Pablo Andretta and {da Silva}, D. L.},title={Estrat\'{e}gias De Controle Para Sistemas De Condicionamento De Ar Automotivo},pages={1-9},booktitle={Congresso Nacional Das Engenharias Da Mobilidade},year={2017},address={Joinville, SC, Brazil}}
2016
On strategies for building effective ensembles of relative clustering validity criteria
Evaluation and validation are essential tasks for achieving meaningful clustering results. Relative validity criteria are measures usually employed in practice to select and validate clustering solutions, as they enable the evaluation of single partitions and the comparison of partition pairs in relative terms based only on the data under analysis. There is a plethora of relative validity measures described in the clustering literature, thus making it difficult to choose an appropriate measure for a given application. One reason for such a variety is that no single measure can capture all different aspects of the clustering problem and, as such, each of them is prone to fail in particular application scenarios. In the present work, we take advantage of the diversity in relative validity measures from the clustering literature. Previous work showed that when randomly selecting different relative validity criteria for an ensemble (from an initial set of 28 different measures), one can expect with great certainty to only improve results over the worst criterion included in the ensemble. In this paper, we propose a method for selecting measures with minimum effectiveness and some degree of complementarity (from the same set of 28 measures) into ensembles, which show superior performance when compared to any single ensemble member (and not just the worst one) over a variety of different datasets. One can also expect greater stability in terms of evaluation over different datasets, even when considering different ensemble strategies. Our results are based on more than a thousand datasets, synthetic and real, from different sources.
2015
PhD Thesis
On the evaluation of clustering results: measures, ensembles, and gene expression data analysis
@phdthesis{Jaskowiak2015,author={Jaskowiak, Pablo Andretta},title={On the evaluation of clustering results: measures, ensembles, and gene expression data analysis},school={University of S\~{a}o Paulo (USP)},month=nov,url={https://teses.usp.br/teses/disponiveis/55/55134/tde-23032016-111454/pt-br.php},doi={10.11606/T.55.2016.tde-23032016-111454},year={2015}}
Impact of missing data imputation methods on gene expression clustering and classification
Several missing value imputation methods for gene expression data have been proposed in the literature. In the past few years, researchers have been putting a great deal of effort into presenting systematic evaluations of the different imputation algorithms. Initially, most algorithms were assessed with an emphasis on the accuracy of the imputation, using metrics such as the root mean squared error. However, it has become clear that the success of the estimation of the expression value should be evaluated in more practical terms as well. One can consider, for example, the ability of the method to preserve the significant genes in the dataset, or its discriminative/predictive power for classification/clustering purposes.
@article{deSouto2015,author={{de Souto}, Marcilio C. P. and Jaskowiak, Pablo A. and Costa, Ivan G.},title={Impact of missing data imputation methods on gene expression clustering and classification},journal={BMC Bioinformatics},year={2015},month=feb,day={26},volume={16},number={1},pages={64},issn={1471-2105},url={https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0494-3},doi={10.1186/s12859-015-0494-3}}
@inproceedings{JasCam2015,author={Jaskowiak, Pablo A. and Campello, Ricardo J.G.B.},booktitle={2015 Brazilian Conference on Intelligent Systems (BRACIS)},title={A Cluster Based Hybrid Feature Selection Approach},year={2015},pages={43-48},url={https://ieeexplore.ieee.org/document/7423993},doi={10.1109/BRACIS.2015.14}}
2014
A framework for bottom-up induction of oblique decision trees
Rodrigo C. Barros, Pablo A. Jaskowiak, Ricardo Cerri, and Andre C.P.L.F. de Carvalho
Decision-tree induction algorithms are widely used in knowledge discovery and data mining, especially in scenarios where model comprehensibility is desired. A variation of the traditional univariate approach is the so-called oblique decision tree, which allows multivariate tests in its non-terminal nodes. Oblique decision trees can model decision boundaries that are oblique to the attribute axes, whereas univariate trees can only perform axis-parallel splits. The vast majority of the oblique and univariate decision-tree induction algorithms employ a top-down strategy for growing the tree, relying on an impurity-based measure for splitting nodes. In this paper, we propose BUTIF, a novel Bottom-Up Oblique Decision-Tree Induction Framework. BUTIF does not rely on an impurity measure for dividing nodes, since the data resulting from each split is known a priori. For generating the initial leaves of the tree and the splitting hyperplanes in its internal nodes, BUTIF allows the adoption of distinct clustering algorithms and binary classifiers, respectively. It is also capable of performing embedded feature selection, which may reduce the number of features in each hyperplane, thus improving model comprehension. Unlike virtually every top-down decision-tree induction algorithm, BUTIF does not require the further execution of a pruning procedure in order to avoid overfitting, due to its bottom-up nature that does not overgrow the tree. We compare distinct instances of BUTIF to traditional univariate and oblique decision-tree induction algorithms. Empirical results show the effectiveness of the proposed framework.
@article{BarJasCerCar2014,title={A framework for bottom-up induction of oblique decision trees},journal={Neurocomputing},volume={135},pages={3-12},year={2014},issn={0925-2312},doi={10.1016/j.neucom.2013.01.067},url={https://www.sciencedirect.com/science/article/pii/S0925231213011351},author={Barros, Rodrigo C. and Jaskowiak, Pablo A. and Cerri, Ricardo and {de Carvalho}, Andre C.P.L.F.},keywords={Oblique decision trees, Bottom-up induction, Clustering}}
On the selection of appropriate distances for gene expression data clustering
Background: Clustering is crucial for gene expression data analysis. As an unsupervised exploratory procedure, its results can help researchers to gain insights and formulate new hypotheses about biological data from microarrays. Given different settings of microarray experiments, clustering proves itself as a versatile exploratory tool. It can help to unveil new cancer subtypes or to identify groups of genes that respond similarly to a specific experimental condition. In order to obtain useful clustering results, however, different parameters of the clustering procedure must be properly tuned. Besides the selection of the clustering method itself, determining which distance is going to be employed between data objects is probably one of the most difficult decisions. Results and conclusions: We analyze how different distances and clustering methods interact regarding their ability to cluster gene expression, i.e., microarray data. We study 15 distances along with four common clustering methods from the literature on a total of 52 gene expression microarray datasets. Distances are evaluated on a number of different scenarios including clustering of cancer tissues and genes from short time-series expression data, the two main clustering applications in gene expression. Our results support that the selection of an appropriate distance depends on the scenario at hand. Moreover, in each scenario, given the very same clustering method, significant differences in quality may arise from the selection of distinct distance measures. In fact, the selection of an appropriate distance measure can make the difference between meaningful and poor clustering outcomes, even for a suitable clustering method.
@article{JasCamCos14,author={Jaskowiak, Pablo Andretta and Campello, Ricardo J. G. B. and Costa, Ivan G.},doi={10.1186/1471-2105-15-S2-S2},issn={1471-2105},journal={BMC Bioinformatics},keywords={Cluster Analysis,Gene Expression Profiling,Gene Expression Profiling: methods,Humans,Neoplasms,Neoplasms: genetics,Oligonucleotide Array Sequence Analysis,Oligonucleotide Array Sequence Analysis: methods},month=jan,number={Suppl 2},pages={S2},pmid={24564555},title={{On the selection of appropriate distances for gene expression data clustering}},url={http://www.biomedcentral.com/1471-2105/15/S2/S2},volume={15 Suppl 2},year={2014}}
One of the most challenging aspects of clustering is validation, which is the objective and quantitative assessment of clustering results. A number of different relative validity criteria have been proposed for the validation of globular clusters. Not all data, however, are composed of globular clusters. Density-based clustering algorithms seek partitions with high density areas of points (clusters, not necessarily globular) separated by low density areas, possibly containing noise objects. In these cases, relative validity indices proposed for globular cluster validation may fail. In this paper we propose a relative validation index for density-based, arbitrarily shaped clusters. The index assesses clustering quality based on the relative density connection between pairs of objects. Our index is formulated on the basis of a new kernel density function, which is used to compute the density of objects and to evaluate the within- and between-cluster density connectedness of clustering results. Experiments on synthetic and real world data show the effectiveness of our approach for the evaluation and selection of clustering algorithms and their respective appropriate parameters.
2013
On the Combination of Relative Clustering Validity Criteria
Many different relative clustering validity criteria exist that are very useful as quantitative measures for assessing the quality of data partitions. These criteria are endowed with particular features that may make each of them more suitable for specific classes of problems. Nevertheless, the performance of each criterion is usually unknown a priori by the user. Hence, choosing a specific criterion is not a trivial task. A possible approach to circumvent this drawback consists of combining different relative criteria in order to obtain more robust evaluations. However, this approach has so far been applied in an ad hoc fashion only; its real potential is not yet well understood. In this paper, we present an extensive study on the combination of relative criteria considering both synthetic and real datasets. The experiments involved 28 criteria and 4 different combination strategies applied to a varied collection of data partitions produced by 5 clustering algorithms. In total, 427,680 partitions of 972 synthetic datasets and 14,000 partitions of a collection of 400 image datasets were considered. Based on the results, we discuss the shortcomings and possible benefits of combining different relative criteria into a committee.
Proximity Measures for Clustering Gene Expression Microarray Data: A Validation Methodology and a Comparative Analysis
Cluster analysis is usually the first step adopted to unveil information from gene expression microarray data. Besides selecting a clustering algorithm, choosing an appropriate proximity measure (similarity or distance) is of great importance to achieve satisfactory clustering results. Nevertheless, to date, there are no comprehensive guidelines concerning how to choose proximity measures for clustering microarray data. Pearson is the most used proximity measure, whereas characteristics of other ones remain unexplored. In this paper, we investigate the choice of proximity measures for the clustering of microarray data by evaluating the performance of 16 proximity measures on 52 data sets from time course and cancer experiments. Our results support that measures rarely employed in the gene expression literature can provide better results than commonly employed ones, such as Pearson, Spearman, and Euclidean distance. Given that different measures stood out for time course and cancer data evaluations, their choice should be specific to each scenario. To evaluate measures on time-course data, we preprocessed and compiled 17 data sets from the microarray literature in a benchmark along with a new methodology, called Intrinsic Biological Separation Ability (IBSA). Both can be employed in future research to assess the effectiveness of new measures for gene time-course data.
@article{JasCamCos13,author={Jaskowiak, Pablo A. and Campello, Ricardo J. G. B. and Costa, Ivan G.},title={Proximity Measures for Clustering Gene Expression Microarray Data: A Validation Methodology and a Comparative Analysis},journal={IEEE Transactions on Computational Biology and Bioinformatics},issue_date={July 2013},volume={10},number={4},year={2013},issn={1545-5963},pages={845--857},numpages={13},doi={10.1109/TCBB.2013.9},url={https://ieeexplore.ieee.org/document/6461019},acmid={2564679},publisher={IEEE Computer Society Press},address={Los Alamitos, CA, USA},keywords={Proximity measure, distance, similarity, correlation coefficient, clustering, gene expression, cancer, time course}}
2012
Evaluating Correlation Coefficients for Clustering Gene Expression Profiles of Cancer
Cluster analysis is usually the first step adopted to unveil information from gene expression data. One of its common applications is the clustering of cancer samples, associated with the detection of previously unknown cancer subtypes. Although guidelines have been established concerning the choice of appropriate clustering algorithms, little attention has been given to the subject of proximity measures. Whereas the Pearson correlation coefficient appears as the de facto proximity measure in this scenario, no comprehensive study analyzing other correlation coefficients as alternatives to it has been conducted. Considering such facts, we evaluated five correlation coefficients (along with Euclidean distance) regarding the clustering of cancer samples. Our evaluation was conducted on 35 publicly available datasets covering both (i) intrinsic separation ability and (ii) clustering predictive ability of the correlation coefficients. Our results support that correlation coefficients rarely considered in the gene expression literature may provide competitive results to more generally employed ones. Finally, we show that a recently introduced measure arises as a promising alternative to the commonly employed Pearson, providing competitive and even superior results to it.
@inproceedings{JasCamCos2012,author={Jaskowiak, Pablo Andretta and Campello, Ricardo J. G. B. and Costa, Ivan G.},editor={de Souto, Marcilio C. and Kann, Maricel G.},title={Evaluating Correlation Coefficients for Clustering Gene Expression Profiles of Cancer},booktitle={Advances in Bioinformatics and Computational Biology},year={2012},publisher={Springer Berlin Heidelberg},address={Berlin, Heidelberg},pages={120--131},isbn={978-3-642-31927-3},doi={10.1007/978-3-642-31927-3_11},url={https://link.springer.com/chapter/10.1007/978-3-642-31927-3_11}}
2011
MSc Thesis
Estudo de coeficientes de correlação para medidas de proximidade em dados de expressão gênica
@mastersthesis{Jaskowiak2011,author={Jaskowiak, Pablo Andretta},title={Estudo de coeficientes de correlação para medidas de proximidade em dados de expressão gênica},school={University of S\~{a}o Paulo (USP)},month=mar,url={https://teses.usp.br/teses/disponiveis/55/55134/tde-05052011-143134/pt-br.php},year={2011},doi={10.11606/D.55.2011.tde-05052011-143134}}
Comparing Correlation Coefficients As Dissimilarity Measures For Cancer Classification In Gene Expression Data
An important analysis performed on gene expression data is sample classification, e.g., the classification of different types or subtypes of cancer. Different classifiers have been employed for this challenging task, among which the k-Nearest Neighbors (kNN) classifier stands out for being at the same time very simple and highly flexible in terms of discriminatory power. Although the choice of a dissimilarity measure is essential to kNN, little effort has been undertaken to evaluate how this choice affects its performance in cancer classification. To this end, we compare seven correlation coefficients for cancer classification using kNN. Our comparison suggests that a recently introduced correlation coefficient may perform better than commonly used measures. We also show that correlation coefficients rarely considered can provide competitive results when compared to widely used dissimilarity measures.
@inproceedings{JasCam2011,author={Jaskowiak, Pablo Andretta and Campello, Ricardo J. G. B.},title={Comparing Correlation Coefficients As Dissimilarity Measures For Cancer Classification In Gene Expression Data},booktitle={Proceedings Of The 6th Brazilian Symposium On Bioinformatics},year={2011},pages={1--8},}
A bottom-up oblique decision tree induction algorithm
Rodrigo C. Barros, Ricardo Cerri, Pablo A. Jaskowiak, and André C. P. L. F. de Carvalho
In 2011 11th International Conference on Intelligent Systems Design and Applications, Mar 2011
@inproceedings{6121697,author={Barros, Rodrigo C. and Cerri, Ricardo and Jaskowiak, Pablo A. and de Carvalho, Andr\'{e} C. P. L. F.},booktitle={2011 11th International Conference on Intelligent Systems Design and Applications},title={A bottom-up oblique decision tree induction algorithm},year={2011},pages={450-456},doi={10.1109/ISDA.2011.6121697},url={https://ieeexplore.ieee.org/document/6121697},}
2010
A Comparative Study on the Use of Correlation Coefficients for Redundant Feature Elimination
@inproceedings{JasCamCovHru2010,author={Jaskowiak, Pablo Andretta and Campello, Ricardo J. G. B. and Cov\~{o}es, Thiago F. and Hruschka, Eduardo R.},booktitle={2010 Eleventh Brazilian Symposium on Neural Networks (SBRN 2010)},title={A Comparative Study on the Use of Correlation Coefficients for Redundant Feature Elimination},year={2010},pages={13-18},doi={10.1109/SBRN.2010.11},url={https://ieeexplore.ieee.org/document/5715206}}