Multidisciplinary Optimization plays a central role in integrating diverse areas in engineering projects. Computational resources can, however, be a limiting factor in optimization processes. This paper presents a middleware developed to integrate the Synapse software with clusters managed by the Slurm Workload Manager. The middleware facilitates the distributed execution of population-based optimization algorithms (e.g., genetic algorithms), favoring optimization processes. Tests on a heterogeneous cluster validated the solution. In preliminary experiments, the developed solution achieved speedups of up to ten times compared to the use of workstations, the approach previously supported by Synapse.
2023
Deep Learning and Satellite Images for Photovoltaic Power Forecasting: A Case Study
The growing demand for renewable energy resources presents a supply management challenge, as photovoltaic (PV) energy exhibits intermittent generation due to meteorological factors. The unpredictability of these variations leaves power grids vulnerable to instability, quality, and balance issues. In this context, accurate forecasting of PV power generation can improve management through generation planning, allowing for the balancing of different energy sources, which is crucial for achieving widespread PV energy adoption. The rapid development and significant advancements in deep learning present new possibilities for the use of satellite imagery in PV power forecasting. In this work we build and evaluate several deep learning models in the context of PV power forecasting, aiming at 30- and 60-minute horizons. Our models predict the Global Horizontal Irradiance (GHI) component which, due to its strong correlation with PV power generation, can be employed not only to derive the actual PV plant output, but also as a measure of generation potential, regardless of the actual PV plant. The models take as input images from the GOES-16 satellite and ground-based meteorological measurements, the latter also serving as the desired outputs (prediction targets). Several model configurations demonstrated the viability of GHI forecasting based on satellite imagery, with the best models achieving relative root mean squared errors (rRMSE) of 15.6% and 17.2% for the 30-minute and 60-minute forecast horizons, respectively.
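For reference, the relative RMSE reported above can be computed by normalizing the RMSE by the mean observed value (one common convention, assumed here; the toy numbers are illustrative, not the paper's data):

```python
import math

def rrmse(y_true, y_pred):
    """Relative RMSE (%): RMSE normalized by the mean observed value."""
    n = len(y_true)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    return 100.0 * rmse / (sum(y_true) / n)

# Example with toy GHI values (W/m^2):
obs = [500.0, 600.0, 550.0, 450.0]
pred = [480.0, 630.0, 540.0, 470.0]
print(round(rrmse(obs, pred), 2))  # -> 4.04
```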
Clustering Validation with The Area Under Precision-Recall Curves
Confusion matrices and derived metrics provide a comprehensive framework for the evaluation of model performance in machine learning. These are well-known and extensively employed in the supervised learning domain, particularly classification. Surprisingly, such a framework has not been fully explored in the context of clustering validation. Indeed, only recently has this gap been bridged, with the introduction of the Area Under the ROC Curve for Clustering (AUCC), an internal/relative Clustering Validation Index (CVI) that allows for clustering validation in real application scenarios. In this work we explore the Area Under the Precision-Recall Curve (and related metrics) in the context of clustering validation. We show that these are not only appropriate as CVIs, but should also be preferred in the presence of cluster imbalance. We perform a comprehensive evaluation of the proposed and state-of-the-art CVIs on real and simulated data sets. Our observations support a unified validation framework for supervised and unsupervised learning, as they are consistent with existing guidelines established for the evaluation of supervised learning models.
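A rough illustration of the idea (a generic reading, not necessarily the authors' exact formulation): treat each pair of objects as a "positive" if both fall in the same cluster, score pairs by similarity (here, negated Euclidean distance), and take the area under the resulting precision-recall curve, computed below as average precision:

```python
import itertools, math

def average_precision(labels, scores):
    """PR-curve area as average precision: mean of the precision values
    observed at each positive, in decreasing score order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, ap, n_pos = 0, 0.0, sum(labels)
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            tp += 1
            ap += tp / rank
    return ap / n_pos

def pr_cvi(points, clustering):
    """Pair-based PR validity: positives are same-cluster pairs,
    scores are negated pairwise distances (closer = higher score)."""
    labels, scores = [], []
    for i, j in itertools.combinations(range(len(points)), 2):
        labels.append(1 if clustering[i] == clustering[j] else 0)
        scores.append(-math.dist(points[i], points[j]))
    return average_precision(labels, scores)

# Two well-separated clusters should score near 1.0:
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(pr_cvi(pts, [0, 0, 0, 1, 1, 1]))  # -> 1.0
```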
2022
The area under the ROC curve as a measure of clustering quality
The area under the receiver operating characteristics (ROC) Curve, referred to as AUC, is a well-known performance measure in the supervised learning domain. Due to its compelling features, it has been employed in a number of studies to evaluate and compare the performance of different classifiers. In this work, we explore AUC as a performance measure in the unsupervised learning domain, more specifically, in the context of cluster analysis. In particular, we elaborate on the use of AUC as an internal/relative measure of clustering quality, which we refer to as Area Under the Curve for Clustering (AUCC). We show that the AUCC of a given candidate clustering solution has an expected value under a null model of random clustering solutions, regardless of the size of the dataset and, more importantly, regardless of the number or the (im)balance of clusters under evaluation. In addition, we elaborate on the fact that, in the context of internal/relative clustering validation as we consider, AUCC is actually a linear transformation of the Gamma criterion from Baker and Hubert (1975), for which we also formally derive a theoretical expected value for chance clusterings. We also discuss the computational complexity of these criteria and show that, while an ordinary implementation of Gamma can be computationally prohibitive and impractical for most real applications of cluster analysis, its equivalence with AUCC actually unveils a much more efficient algorithmic procedure. Our theoretical findings are supported by experimental results. These results show that, in addition to an effective and robust quantitative evaluation provided by AUCC, visual inspection of the ROC curves themselves can be useful to further assess a candidate clustering solution from a broader, qualitative perspective as well.
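A compact sketch of the AUCC idea (an illustrative reading, not the authors' implementation): rank object pairs by distance and compute the probability that a random same-cluster pair is closer than a random different-cluster pair, i.e., the Mann-Whitney form of the AUC:

```python
import itertools, math

def aucc(points, clustering):
    """AUCC sketch: AUC over object pairs, where same-cluster pairs
    should receive smaller distances than different-cluster pairs."""
    within, between = [], []
    for i, j in itertools.combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        (within if clustering[i] == clustering[j] else between).append(d)
    # Mann-Whitney statistic: P(within < between), ties counted as 1/2.
    wins = sum((w < b) + 0.5 * (w == b) for w in within for b in between)
    return wins / (len(within) * len(between))

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(aucc(pts, [0, 0, 0, 1, 1, 1]))  # well-separated clusters -> 1.0
```

The rank-based form above is also what makes the efficient computation mentioned in the abstract possible, compared to a naive pairwise Gamma implementation.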
Tetris is one of the highest-grossing video games in history and, despite its age, remains quite popular. One of its most acclaimed versions was released in 1989 for the Nintendo Entertainment System (NES) and is often referred to as NES Tetris. This particular version of the game has led to the creation of the Classic Tetris World Championship (CTWC), resulting in growing popularity and alternative modes of gameplay. In one such variant, players aim to clear as many lines as possible, with an additional constraint: piece rotations are not allowed. In this work we build and evaluate agents to play this particular variant of the game based on different metrics that grade board configurations. The relative importance of metrics is determined with Particle Swarm Optimization. Our best results match those of top-performing human players, even though the metrics we employ were not specifically developed for this game variant.
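A minimal sketch of the approach (the metric names and scoring form are illustrative assumptions, not the paper's exact metrics): boards are graded by a weighted sum of hand-crafted metrics, and PSO searches the weight space:

```python
import random

def grade(metrics, weights):
    """Score a board configuration as a weighted sum of its metrics
    (e.g., aggregate height, holes, completed lines)."""
    return sum(w * m for w, m in zip(weights, metrics))

def pso_step(positions, velocities, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One canonical PSO velocity/position update over weight vectors."""
    for i, (x, v) in enumerate(zip(positions, velocities)):
        for d in range(len(x)):
            r1, r2 = random.random(), random.random()
            v[d] = (w * v[d]
                    + c1 * r1 * (pbest[i][d] - x[d])
                    + c2 * r2 * (gbest[d] - x[d]))
            x[d] += v[d]
    return positions, velocities

# Grade two hypothetical boards: metrics = [height, holes, lines_cleared]
weights = [-0.5, -0.8, 1.0]
print(grade([12, 3, 1], weights) < grade([8, 0, 2], weights))  # -> True
```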
Modelagem e Identificação de Dados Epidemiológicos Associados à Pandemia de COVID-19 em Santa Catarina
The novel coronavirus (COVID-19) spread significantly across the globe and became one of the great afflictions of our time, deeply impacting Brazil, which stands among the nations most affected by the disease. The need for technological systems to fight the sanitary crisis thus became even more urgent in the country. In this light, this paper presents a comparative study between two techniques for modeling and forecasting epidemiological data associated with the COVID-19 pandemic in Brazil, specifically in the state of Santa Catarina. Essentially, polynomial Non-Linear Autoregressive models with eXogenous input (NARX) were conceived as a counterpoint to time series modeling with recurrent neural networks of the Long Short-Term Memory (LSTM) variant, for data series corresponding to the numbers of confirmed cases, deaths, recovered patients, and public health system beds occupied by patients affected by the disease. The predictive performance of the models, assessed with traditional performance metrics, showed that, for three of the four time series used for forecasting, the NARX model obtained more satisfactory results.
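To make the NARX structure concrete, here is a sketch of how a polynomial NARX regressor matrix can be assembled from lagged outputs y and an exogenous input u (the specific lags and polynomial terms are illustrative, not those of the paper):

```python
def narx_regressors(y, u, ny=2, nu=1):
    """Build regressor rows [y[t-1], ..., y[t-ny], u[t-1], ..., u[t-nu],
    y[t-1]*u[t-1], 1] and targets y[t] for a polynomial NARX model."""
    rows, targets = [], []
    start = max(ny, nu)
    for t in range(start, len(y)):
        row = [y[t - k] for k in range(1, ny + 1)]
        row += [u[t - k] for k in range(1, nu + 1)]
        row.append(y[t - 1] * u[t - 1])  # one cross (polynomial) term
        row.append(1.0)                  # constant term
        rows.append(row)
        targets.append(y[t])
    return rows, targets

# Toy series: e.g., daily confirmed cases (y) and occupied beds (u).
y = [1.0, 2.0, 4.0, 7.0, 11.0]
u = [0.5, 0.6, 0.7, 0.8, 0.9]
X, t = narx_regressors(y, u)
print(len(X), len(X[0]))  # -> 3 5 (3 training rows, 5 regressors each)
```

The model parameters would then be fitted by least squares on these rows, while the LSTM alternative learns the lag structure implicitly.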
Retrofitting of a two-degrees-of-freedom welding torch displacement system
Electricity consumption is growing rapidly worldwide. Renewable energy resources, such as solar energy, play a crucial role in this scenario, contributing to meeting demand sustainably. Although the share of Photovoltaic (PV) power generation has increased in the past years, PV systems are quite sensitive to climatic and meteorological conditions, leading to undesirable power production variability. In order to improve energy grid stability, reliability, and management, accurate forecasting models that relate operational conditions to power output are needed. In this work we evaluate the performance of regression methods applied to forecast short-term (next-day) energy production of a PV plant. Specifically, we consider five regression methods and different configurations of feature sets. Our results suggest that MLP and SVR provide the best forecasting results, in general. Also, although features based on different solar irradiance levels play a key role in predicting power generation, the use of additional features can improve prediction results.
Comparação de Métodos de Deep Learning Pré-Treinados da Biblioteca OpenCV para Detecção de Pessoas em Ambientes Internos
Camera-based monitoring systems are increasingly ubiquitous in indoor and outdoor environments. The existence of a monitoring system does not guarantee, however, that all collected information is used and/or analyzed. When an interpretation of the images is needed, computer vision is usually employed. In this particular context, Deep Learning methods have received growing attention. Indeed, despite their recent development, some of these methods are available in libraries and software packages in pre-trained form, allowing their application with relative ease. In this work, different Deep Learning methods available in the OpenCV library were compared for the detection and counting of people in indoor environments. The methods were compared with respect to their precision, recall, and detection time. For the application considered, the results obtained suggest that the YOLO (v3) method offers a good compromise between F1 score and recognition time. Accurate and fast people detection may help in the future, for example, in estimating the observed thermal load and the consequent adjustment of air-conditioning systems.
Development of a mobile application for monitoring and controlling a CNC machine using Industry 4.0 concepts
Adriano Fagali Souza, Juliana Martins, Henrique Maiochi, Aline Durrer Patelli Juliani, and Pablo Andretta Jaskowiak
The International Journal of Advanced Manufacturing Technology, Dec 2020
Industry 4.0 comprises a set of technologies that allow the interconnection, monitoring, and controlling of manufacturing processes. Today it represents a key point for modern industry. The current work presents an Industry 4.0 system developed for monitoring and controlling a 5-axis CNC machine center, in real time, through a mobile device, providing important feedback information for users and manufacturers of the machine. Given that response time is crucial in such applications, we conducted an experimental investigation to examine the system latency with distinct database structures, based on SQL and NoSQL. The results suggest that the non-relational structure (NoSQL) presented lower response times and is, thus, best suited for the application at hand. The system allows monitoring and controlling of any CNC machine remotely—given that a middleware for connecting the machine is provided—in real time, presenting new possibilities from the perspectives of machine tool builders and shop floor management.
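The latency comparison described can be sketched as a simple timing harness (the database read is stood in for by a hypothetical placeholder function; names are illustrative):

```python
import time, statistics

def measure_latency(operation, repetitions=100):
    """Return the median wall-clock latency (ms) of a callable."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

# Placeholder standing in for a NoSQL (key-value) read path:
store = {"spindle_speed": 12000}
def nosql_read():
    return store["spindle_speed"]

print(measure_latency(nosql_read) >= 0.0)  # -> True
```

The median (rather than the mean) is used so that occasional scheduling spikes do not dominate the comparison between the SQL and NoSQL back ends.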
2019
Modeling The Thermal Performance Of A Window Type Air-conditioning System With Artificial Neural Networks
RNA-Seq is becoming the standard technology for large-scale gene expression level measurements, as it offers a number of advantages over microarrays. Standards for RNA-Seq data analysis are, however, in their infancy when compared to those of microarrays. Clustering, which is essential for understanding gene expression data, has been widely investigated w.r.t. microarrays. In what concerns the clustering of RNA-Seq data, however, a number of questions remain open, resulting in a lack of guidelines to practitioners. Here we evaluate computational steps relevant for clustering cancer samples via an empirical analysis of 15 mRNA-Seq datasets. Our evaluation considers strategies regarding expression estimates, number of genes after non-specific filtering, and data transformations. We evaluate the performance of four clustering algorithms and twelve distance measures, which are commonly used for gene expression analysis. Results support that clustering cancer samples based on gene-level quantification should be preferred. The use of non-specific filtering leading to a small number of features (1,000) presents, in general, superior results. Data should be log-transformed prior to cluster analysis. Regarding the choice of clustering algorithms, Average-Linkage and k-medoids provide, in general, superior recoveries. Although specific cases can benefit from a careful selection of a distance measure, Symmetric Rank-Magnitude correlation provides consistent and sound results in different scenarios.
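The preprocessing steps highlighted above, log transformation and non-specific filtering down to a fixed number of genes, can be sketched roughly as (variance-based filtering is one common non-specific criterion, assumed here):

```python
import math, statistics

def log_transform(matrix):
    """log2(x + 1) transform of a genes-by-samples expression matrix."""
    return [[math.log2(x + 1.0) for x in gene] for gene in matrix]

def nonspecific_filter(matrix, n_genes):
    """Keep the n_genes rows with the highest variance across samples
    (no class labels are used, hence 'non-specific')."""
    ranked = sorted(matrix, key=lambda g: statistics.variance(g), reverse=True)
    return ranked[:n_genes]

expr = [
    [1.0, 1.1, 0.9, 1.0],    # nearly flat gene
    [0.0, 50.0, 0.0, 60.0],  # highly variable gene
    [5.0, 5.5, 5.2, 5.1],
]
kept = nonspecific_filter(log_transform(expr), n_genes=1)
print(len(kept))  # -> 1 (only the most variable gene survives)
```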
2017
Estratégias De Controle Para Sistemas De Condicionamento De Ar Automotivo
Evaluation and validation are essential tasks for achieving meaningful clustering results. Relative validity criteria are measures usually employed in practice to select and validate clustering solutions, as they enable the evaluation of single partitions and the comparison of partition pairs in relative terms based only on the data under analysis. There is a plethora of relative validity measures described in the clustering literature, thus making it difficult to choose an appropriate measure for a given application. One reason for such a variety is that no single measure can capture all different aspects of the clustering problem and, as such, each of them is prone to fail in particular application scenarios. In the present work, we take advantage of the diversity in relative validity measures from the clustering literature. Previous work showed that when randomly selecting different relative validity criteria for an ensemble (from an initial set of 28 different measures), one can expect with great certainty to only improve results over the worst criterion included in the ensemble. In this paper, we propose a method for selecting measures with minimum effectiveness and some degree of complementarity (from the same set of 28 measures) into ensembles, which show superior performance when compared to any single ensemble member (and not just the worst one) over a variety of different datasets. One can also expect greater stability in terms of evaluation over different datasets, even when considering different ensemble strategies. Our results are based on more than a thousand datasets, synthetic and real, from different sources.
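One simple way to combine relative validity criteria into an ensemble, illustrated here with mean-rank aggregation (an illustrative strategy; the paper's actual selection and combination procedures may differ):

```python
def rank(scores, higher_is_better=True):
    """1-based ranks of candidate partitions under one criterion."""
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=higher_is_better)
    ranks = [0] * len(scores)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def ensemble_rank(criteria_scores):
    """Mean rank of each partition across all criteria (lower = better)."""
    all_ranks = [rank(s) for s in criteria_scores]
    n = len(criteria_scores[0])
    return [sum(r[i] for r in all_ranks) / len(all_ranks) for i in range(n)]

# Three criteria scoring four candidate partitions (higher = better):
scores = [
    [0.9, 0.4, 0.6, 0.2],
    [0.8, 0.5, 0.7, 0.1],
    [0.3, 0.6, 0.9, 0.2],
]
print(ensemble_rank(scores))  # partitions 0 and 2 tie for best mean rank
```

Aggregating ranks rather than raw scores sidesteps the fact that different criteria live on incompatible scales.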
2015
PhD Thesis
On the evaluation of clustering results: measures, ensembles, and gene expression data analysis
Several missing value imputation methods for gene expression data have been proposed in the literature. In the past few years, researchers have been putting a great deal of effort into presenting systematic evaluations of the different imputation algorithms. Initially, most algorithms were assessed with an emphasis on the accuracy of the imputation, using metrics such as the root mean squared error. However, it has become clear that the success of the estimation of the expression value should be evaluated in more practical terms as well. One can consider, for example, the ability of the method to preserve the significant genes in the dataset, or its discriminative/predictive power for classification/clustering purposes.
Decision-tree induction algorithms are widely used in knowledge discovery and data mining, especially in scenarios where model comprehensibility is desired. A variation of the traditional univariate approach is the so-called oblique decision tree, which allows multivariate tests in its non-terminal nodes. Oblique decision trees can model decision boundaries that are oblique to the attribute axes, whereas univariate trees can only perform axis-parallel splits. The vast majority of the oblique and univariate decision-tree induction algorithms employ a top-down strategy for growing the tree, relying on an impurity-based measure for splitting nodes. In this paper, we propose BUTIF—a novel Bottom-Up Oblique Decision-Tree Induction Framework. BUTIF does not rely on an impurity measure for dividing nodes, since the data resulting from each split is known a priori. For generating the initial leaves of the tree and the splitting hyperplanes in its internal nodes, BUTIF allows the adoption of distinct clustering algorithms and binary classifiers, respectively. It is also capable of performing embedded feature selection, which may reduce the number of features in each hyperplane, thus improving model comprehension. Different from virtually every top-down decision-tree induction algorithm, BUTIF does not require the further execution of a pruning procedure in order to avoid overfitting, due to its bottom-up nature that does not overgrow the tree. We compare distinct instances of BUTIF to traditional univariate and oblique decision-tree induction algorithms. Empirical results show the effectiveness of the proposed framework.
On the selection of appropriate distances for gene expression data clustering
Background: Clustering is crucial for gene expression data analysis. As an unsupervised exploratory procedure its results can help researchers to gain insights and formulate new hypotheses about biological data from microarrays. Given different settings of microarray experiments, clustering proves itself as a versatile exploratory tool. It can help to unveil new cancer subtypes or to identify groups of genes that respond similarly to a specific experimental condition. In order to obtain useful clustering results, however, different parameters of the clustering procedure must be properly tuned. Besides the selection of the clustering method itself, determining which distance is going to be employed between data objects is probably one of the most difficult decisions. Results and conclusions: We analyze how different distances and clustering methods interact regarding their ability to cluster gene expression, i.e., microarray data. We study 15 distances along with four common clustering methods from the literature on a total of 52 gene expression microarray datasets. Distances are evaluated on a number of different scenarios including clustering of cancer tissues and genes from short time-series expression data, the two main clustering applications in gene expression. Our results support that the selection of an appropriate distance depends on the scenario at hand. Moreover, in each scenario, given the very same clustering method, significant differences in quality may arise from the selection of distinct distance measures. In fact, the selection of an appropriate distance measure can make the difference between meaningful and poor clustering outcomes, even for a suitable clustering method.
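For concreteness, one widely used correlation-based distance between expression profiles, 1 − Pearson, shown next to Euclidean distance (a generic sketch, not tied to the paper's 15 distances):

```python
import math

def pearson_distance(a, b):
    """1 - Pearson correlation: small when profiles co-vary in shape,
    regardless of their absolute magnitudes."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return 1.0 - cov / (sa * sb)

def euclidean_distance(a, b):
    return math.dist(a, b)

# Two profiles with identical shape but different magnitude:
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [10.0, 20.0, 30.0, 40.0]
print(round(pearson_distance(g1, g2), 6), round(euclidean_distance(g1, g2), 2))
```

The example shows why the choice matters: the two genes are identical to a correlation-based distance (distance ≈ 0) yet far apart under Euclidean distance, so the two measures can lead to very different clusterings.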
One of the most challenging aspects of clustering is validation, which is the objective and quantitative assessment of clustering results. A number of different relative validity criteria have been proposed for the validation of globular clusters. Not all data, however, are composed of globular clusters. Density-based clustering algorithms seek partitions with high density areas of points (clusters, not necessarily globular) separated by low density areas, possibly containing noise objects. In these cases relative validity indices proposed for globular cluster validation may fail. In this paper we propose a relative validation index for density-based, arbitrarily shaped clusters. The index assesses clustering quality based on the relative density connection between pairs of objects. Our index is formulated on the basis of a new kernel density function, which is used to compute the density of objects and to evaluate the within- and between-cluster density connectedness of clustering results. Experiments on synthetic and real world data show the effectiveness of our approach for the evaluation and selection of clustering algorithms and their respective appropriate parameters.
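The building block of such an index, a kernel density estimate per object, can be sketched as follows (a plain Gaussian kernel is used here for illustration; the paper introduces its own kernel density function):

```python
import math

def gaussian_kde(point, data, bandwidth=1.0):
    """Kernel density of `point` as the mean of Gaussian kernels
    centred at every object in `data`."""
    h, d = bandwidth, len(point)
    norm = (2 * math.pi * h * h) ** (d / 2)
    total = sum(math.exp(-math.dist(point, x) ** 2 / (2 * h * h)) for x in data)
    return total / (len(data) * norm)

# A point inside a tight group is denser than an isolated one:
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(gaussian_kde((0.05, 0.05), data) > gaussian_kde((5.0, 5.0), data))  # -> True
```

A density-based index then compares such densities along within-cluster connections against those across cluster boundaries.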
2013
On the Combination of Relative Clustering Validity Criteria
Many different relative clustering validity criteria exist that are very useful as quantitative measures for assessing the quality of data partitions. These criteria are endowed with particular features that may make each of them more suitable for specific classes of problems. Nevertheless, the performance of each criterion is usually unknown a priori by the user. Hence, choosing a specific criterion is not a trivial task. A possible approach to circumvent this drawback consists of combining different relative criteria in order to obtain more robust evaluations. However, this approach has so far been applied in an ad-hoc fashion only; its real potential is actually not well-understood. In this paper, we present an extensive study on the combination of relative criteria considering both synthetic and real datasets. The experiments involved 28 criteria and 4 different combination strategies applied to a varied collection of data partitions produced by 5 clustering algorithms. In total, 427,680 partitions of 972 synthetic datasets and 14,000 partitions of a collection of 400 image datasets were considered. Based on the results, we discuss the shortcomings and possible benefits of combining different relative criteria into a committee.
Proximity Measures for Clustering Gene Expression Microarray Data: A Validation Methodology and a Comparative Analysis
Cluster analysis is usually the first step adopted to unveil information from gene expression microarray data. Besides selecting a clustering algorithm, choosing an appropriate proximity measure (similarity or distance) is of great importance to achieve satisfactory clustering results. Nevertheless, to date, there are no comprehensive guidelines concerning how to choose proximity measures for clustering microarray data. Pearson is the most used proximity measure, whereas characteristics of other ones remain unexplored. In this paper, we investigate the choice of proximity measures for the clustering of microarray data by evaluating the performance of 16 proximity measures in 52 data sets from time course and cancer experiments. Our results support that measures rarely employed in the gene expression literature can provide better results than commonly employed ones, such as Pearson, Spearman, and Euclidean distance. Given that different measures stood out for time course and cancer data evaluations, their choice should be specific to each scenario. To evaluate measures on time-course data, we preprocessed and compiled 17 data sets from the microarray literature in a benchmark along with a new methodology, called Intrinsic Biological Separation Ability (IBSA). Both can be employed in future research to assess the effectiveness of new measures for gene time-course data.
2012
Evaluating Correlation Coefficients for Clustering Gene Expression Profiles of Cancer
Cluster analysis is usually the first step adopted to unveil information from gene expression data. One of its common applications is the clustering of cancer samples, associated with the detection of previously unknown cancer subtypes. Although guidelines have been established concerning the choice of appropriate clustering algorithms, little attention has been given to the subject of proximity measures. Whereas the Pearson correlation coefficient appears as the de facto proximity measure in this scenario, no comprehensive study analyzing other correlation coefficients as alternatives to it has been conducted. Considering such facts, we evaluated five correlation coefficients (along with Euclidean distance) regarding the clustering of cancer samples. Our evaluation was conducted on 35 publicly available datasets covering both (i) intrinsic separation ability and (ii) clustering predictive ability of the correlation coefficients. Our results support that correlation coefficients rarely considered in the gene expression literature may provide competitive results to more generally employed ones. Finally, we show that a recently introduced measure arises as a promising alternative to the commonly employed Pearson, providing competitive and even superior results to it.
2011
MSc Thesis
Estudo de coeficientes de correlação para medidas de proximidade em dados de expressão gênica
An important analysis performed on gene expression data is sample classification, e.g., the classification of different types or subtypes of cancer. Different classifiers have been employed for this challenging task, among which the k-Nearest Neighbors (kNN) classifier stands out for being at the same time very simple and highly flexible in terms of discriminatory power. Although the choice of a dissimilarity measure is essential to kNN, little effort has been undertaken to evaluate how this choice affects its performance in cancer classification. To this end, we compare seven correlation coefficients for cancer classification using kNN. Our comparison suggests that a recently introduced correlation may perform better than commonly used measures. We also show that correlation coefficients rarely considered can provide competitive results when compared to widely used dissimilarity measures.
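The core of such a comparison can be sketched as a kNN classifier parameterized by a dissimilarity; here 1 − Pearson correlation stands in for the correlation-based measures studied (a generic sketch with toy, hypothetical data):

```python
import math
from collections import Counter

def one_minus_pearson(a, b):
    """Correlation-based dissimilarity: 0 for perfectly co-varying profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return 1.0 - cov / (sa * sb)

def knn_predict(train_x, train_y, query, k=3, dissim=one_minus_pearson):
    """Majority vote among the k training profiles least dissimilar
    to the query expression profile."""
    ranked = sorted(zip(train_x, train_y), key=lambda p: dissim(p[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy expression profiles for two hypothetical cancer subtypes:
X = [[1, 2, 3, 4], [2, 3, 4, 5], [1, 2, 3, 5],   # rising profiles: subtype A
     [4, 3, 2, 1], [5, 4, 3, 2], [5, 4, 3, 1]]   # falling profiles: subtype B
y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(X, y, [10, 20, 30, 40]))  # rising shape -> "A"
```

Swapping `dissim` is all it takes to compare different correlation coefficients under the very same classifier, which is the essence of the evaluation described above.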
A bottom-up oblique decision tree induction algorithm
Rodrigo C. Barros, Ricardo Cerri, Pablo A. Jaskowiak, and André C. P. L. F. Carvalho
In 2011 11th International Conference on Intelligent Systems Design and Applications, Mar 2011