Triple-CFN: Restructuring Conceptual Spaces for Enhancing Abstract Reasoning Process (2024)

Ruizhuo Song, Member, IEEE, Beiming Yuan, Frank L. Lewis, Fellow, IEEE. This work was supported by the National Natural Science Foundation of China under Grant 62273036. Corresponding author: Ruizhuo Song (ruizhuosong@ustb.edu.cn). Ruizhuo Song and Beiming Yuan are with the Beijing Engineering Research Center of Industrial Spectrum Imaging, School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing 100083, China (emails: ruizhuosong@ustb.edu.cn and d202310354@xs.ustb.edu.cn). F. L. Lewis is with the UTA Research Institute, The University of Texas at Arlington, Arlington, TX 76019 USA (e-mail: lewis@uta.edu). Ruizhuo Song and Beiming Yuan contributed equally to this work.

Abstract

Abstract reasoning poses significant challenges to artificial intelligence algorithms, demanding a higher level of cognitive ability than that required for perceptual tasks. In this study, we introduce the Triple-CFN method to tackle the Bongard Logo problem, achieving remarkable reasoning accuracy by implicitly reorganizing the conflicting concept spaces of instances. Furthermore, with necessary modifications, the Triple-CFN paradigm has also proven effective on the RPM (Raven’s Progressive Matrices) problem, yielding competitive results. To further enhance Triple-CFN’s performance on the RPM problem, we have upgraded it to the Meta Triple-CFN network, which explicitly constructs the concept space of RPM problems, ensuring high reasoning accuracy while achieving conceptual interpretability. The success of Meta Triple-CFN can be attributed to its paradigm of modeling the concept space, which is tantamount to normalizing reasoning information. Based on this idea, we have introduced the Re-space layer, boosting the performance of both Meta Triple-CFN and Triple-CFN. This paper aims to contribute to the advancement of machine intelligence and pave the way for further breakthroughs in this field by exploring innovative network designs for solving abstract reasoning problems.

Index Terms:

Abstract reasoning, RPM problem, Bongard-logo problem.


I Introduction

Deep neural networks have achieved remarkable success in various domains, including computer vision[1, 3, 2], natural language processing[4, 5, 6], generative models[8, 7, 9], visual question answering[10, 11], and abstract reasoning[12, 13, 14]. The advancement of deep learning in the realm of graphical abstract reasoning is a particularly intriguing and complex research area.

Initially, deep learning was introduced into machine learning, bringing it closer to its original goal of artificial intelligence. It is regarded as learning the inherent patterns and hierarchical representations within sample data, greatly aiding in the interpretation of data types such as text, images, and sound. The ultimate objective is to endow machines with human-like analytical learning capabilities, enabling them to recognize and interpret text, images, and sound.

In the domain of graphical abstract reasoning, the significance of deep learning lies primarily in its ability to tackle complex pattern recognition challenges. Through deep learning, machines can mimic human activities like perception, audition, and cognition, leading to significant strides in artificial intelligence-related technologies.

Moreover, deep learning has yielded numerous achievements in areas like search technologies, data mining, machine learning, machine translation, natural language processing, multimedia learning, speech recognition, recommendations, and personalization. Notably, in speech and image recognition, deep learning has demonstrated remarkable efficacy, with recognition accuracies surpassing preceding technologies.

However, the applications of deep learning extend beyond these. For instance, utilizing the outcomes from upper-level training as initialization parameters for lower-level training processes enhances the efficiency of deep model training. Meanwhile, adopting a layer-wise initialization approach and employing unsupervised learning for training is a pivotal strategy in deep learning.

Collectively, the progression of deep learning in graphical abstract reasoning is an ongoing research sphere that offers substantial support to the development of artificial intelligence. Nevertheless, despite the extensive and profound applications of deep learning, numerous unresolved issues and challenges demand further investigation and exploration.

Notably, following the remarkable accomplishments of deep learning in intelligent visual tasks, machine intelligence is poised to reach even greater heights. The academic community has presented a challenge to deep learning’s abstract reasoning capabilities using graphical reasoning problems. Initially, graphical reasoning entails comprehending and analyzing both global and local characteristics of graphics, posing a significant challenge for deep learning models. Typically, deep learning models extract features by learning from extensive datasets. However, in graphical reasoning problems, the complexity and variability of graphics make it arduous for models to learn effective feature representations.

Secondly, graphical reasoning problems require models to possess reasoning and induction capabilities. This necessitates models to comprehend graphic structures, relationships, and rules and perform reasoning and induction based on this information. However, existing deep learning models often exhibit subpar performance when tackling such problems due to their limited reasoning and induction abilities.

In addition, graphical reasoning problems mandate models to have generalization capabilities. This means models must be adept at handling graphics of various shapes, sizes, and colors while delivering accurate reasoning outcomes. Nevertheless, due to the limited generalization capabilities of deep learning models, they often encounter overfitting or underfitting issues when dealing with such problems.

Lastly, datasets for graphical reasoning problems are typically small-scale, posing challenges for the training of deep learning models. These models require vast amounts of data for training to achieve optimal performance. However, in the context of graphical reasoning problems, the limited dataset size makes it challenging for models to acquire sufficient information for problem-solving. Furthermore, datasets for these problems are often artificially designed, potentially leading to discrepancies between data distributions and real-world scenarios, further complicating model training.

Thus, addressing the challenges posed by graphical reasoning problems to deep learning constitutes a pivotal research direction. This necessitates the design of more effective deep learning models, enhancements in model training methodologies, and optimizations in dataset quality among other aspects.

For instance, Raven's Progressive Matrices (RPM) problems[12] and Bongard problems[13, 14] present learning demands ranging from perception to reasoning. Addressing these demands necessitates advancements in deep learning capabilities to handle abstract reasoning tasks associated with graphical representations effectively.

I-A RAVEN Database as an RPM Problem: Construction and Characteristics

The RAVEN database[16] presents a unique challenge in the realm of Raven's Progressive Matrices (RPM) problems, with each question typically comprising 16 images enriched with geometric entities. Half of these images, specifically 8, form the problem stem while the remaining 8 constitute the answer pool. Subjects are tasked with selecting appropriate images from the answer pool to complete a 3×3 matrix, following a progressive pattern of geometric images along the rows to convey specific abstract concepts.

As illustrated in Figure 1, the construction of a RAVEN problem speaks to its generality and sophistication. Within these problems, certain human-defined concepts within the geometric images, such as "shape" or "color", are deliberately abstracted into bounded, countable, and precise "visual attributes". The notion of "rule" is then employed to delineate the progressive transformation of a finite set of these visual attribute values. However, it is worth noting that some visual attributes remain unconstrained by the rules, potentially acting as distractions for deep model reasoning.

[Figure 1]

To curate a comprehensive RAVEN problem, samples of rules are drawn from a predefined rule pool, guiding the design of visual attribute values. Attributes not bound by these rules are assigned values at random. Subsequently, images are rendered based on the generated attribute information.

The RAVEN database is further diversified into multiple sub-databases, namely: single-rule groups—center single (center), distribute four (G2×2), and distribute nine (G3×3)—and dual-rule groups: in center single out center single (O-IC), up center single down center single (U-D), left center single right center single (L-R), and in distribute four out center single (O-IG). In problems with a singular rule, the progressive transformation of an entity's attributes within the image adheres to one set of rules, while in those with dual rules, two independent rule sets govern this transformation.

I-B PGM Database

The design logic of PGM[17] and RAVEN problems is remarkably similar, with both types of problems represented by a problem stem composed of 8 images and an answer pool formed by another 8 images. Notably, in PGM problems, the concept of "rule" not only describes the progressive pattern of "visual attributes" in the row-wise direction within the matrix but also constrains the progressive pattern in the column-wise direction. An example of a PGM problem is illustrated in Figure 2.

[Figure 2]

Consequently, the difficulty of RPM problems lies not only in the exploration of visual attributes at different levels but also in the induction and learning of the progressive patterns of ”visual attributes.”

I-C Bongard-logo Database

Bongard problems[13] differ significantly from RPM problems in that they are a type of small-sample learning problem. Typically composed of multiple images, these problems divide the images into two groups: a primary group and a secondary group. All images within the primary group express abstract concepts constrained by certain rules, while the images in the secondary group reject these rules to varying degrees. Bongard problems challenge deep learning algorithms to correctly categorize ungrouped images into the appropriate group. Bongard-logo, an instantiation of Bongard problems within the realm of abstract reasoning, poses considerable reasoning difficulties. Each Bongard-logo[14] problem consists of 14 images, with 6 images in the primary group, 6 in the secondary group, and the remaining 2 serving as options for grouping. The images contain numerous geometric shapes, and their arrangements serve as the basis for grouping. Figure 3 illustrates an example Bongard-logo problem. In Figure 3, each Bongard problem is composed of two sets of images: the primary group A and the secondary group B. The primary group A contains 6 images, with the geometric entities within each image following a specific set of rules, while the secondary group B includes 6 images that reject the rules in group A. The task is to determine whether the images in the test set satisfy the rules expressed by group A. The difficulty level varies depending on the problem's structure.

[Figure 3]

Bongard-logo problems are categorized into three types based on conceptual categories: 1) Free Form problems (ff), where each shape is composed of randomly sampled action strokes, with each image potentially containing one or two shapes. 2) Basic Shape problems (ba), where the concept corresponds to identifying one shape category or a combination of two shape categories represented in the given shape patterns. 3) High-level Abstraction problems (hd), designed to test a model’s ability to discover and reason about abstract concepts, such as concavity and convexity, symmetry, among others.

II Related work

II-A RPM solver

In image reasoning problems, discriminative models typically produce outputs in the form of a multi-dimensional vector, with each dimension representing the probability of selecting a certain graphic from given candidate answers as the final solution. This output format provides rich information for subsequent decision-making and analysis. However, traditional discriminative models often face numerous challenges when dealing with complex image reasoning tasks, such as capturing subtle differences and uncovering underlying rules. To address these issues, researchers have proposed a series of innovative models.

Among them, the CoPINet[20] model stands out with its innovative introduction of a contrast module. The primary function of this contrast module is to learn the differences between input graphics, enabling the model, through contrastive learning, to more sensitively capture subtle variations in graphics and thus more accurately determine their attributes during the reasoning process. Additionally, CoPINet incorporates a reasoning module tasked with summarizing potential fundamental rules. By combining contrastive learning with reasoning learning, the CoPINet model has achieved remarkable results in image reasoning problems.

Distinct from CoPINet, the LEN+teacher model[21] relies on a student-teacher architecture to determine the training sequence and make predictions. This architecture facilitates more effective knowledge transfer and model optimization by introducing a teacher model to guide the training of the student model. Specifically, the teacher model leverages its own experience to direct the learning process of the student model, helping it converge more rapidly to better solutions. Through this approach, the LEN+teacher model has yielded impressive outcomes in image reasoning problems.

The DCNet model[22] is notable for its use of a dual-contrast module to accomplish two tasks: comparing rule rows and columns and exploring differences among candidate answers. This dual-contrast mechanism enables DCNet to more comprehensively consider various factors in image reasoning problems, thereby enhancing accuracy and efficiency during the reasoning process.

The NCD model[23] operates in an unsupervised environment and employs methods of introducing pseudo-targets and decentralization. These techniques not only effectively address certain challenges in unsupervised learning but also enhance the model’s generalization capabilities. Specifically, NCD augments the model’s exploration capabilities by introducing pseudo-targets and leverages decentralization methods to reduce the model’s reliance on specific data, thereby bolstering robustness and adaptability.

In the SCL model[24], multiple monitoring mechanisms are applied to sub-graphs within reasoning problems, with the expectation that each branch will focus on specific visual attributes or rules. This multiple monitoring mechanism enhances the model’s flexibility and efficiency when tackling complex image reasoning tasks. Concurrently, SCL leverages relationships between sub-graphs to further strengthen the model’s reasoning capabilities, leading to significant advancements in solving image reasoning problems.

The SAVIR-T model[25] extracts information from within sub-graphs of reasoning problems and relationships between sub-graphs from multiple perspectives, aiming to elevate reasoning effectiveness. This approach enables the efficient capture of diverse information within and between sub-graphs, providing a more comprehensive and accurate foundation for subsequent reasoning processes. Furthermore, SAVIR-T utilizes multi-perspective information fusion methods to further augment the model’s reasoning capabilities, ensuring greater efficiency and accuracy when dealing with intricate image reasoning problems.

RS-Tran[30] adopts a multi-viewpoint and multi-evaluation reasoning approach, which effectively solves the RPM problem and achieves impressive prediction accuracy. Furthermore, by utilizing the accompanying Meta data from RPM tasks for the pre-training of its encoder, RS-Tran has once again made a breakthrough in terms of performance. This pre-training with Meta data enhances the model's ability to capture underlying patterns and relationships within the RPM problems, enabling it to make more accurate predictions and reason more effectively.

CRAB[31] has established a “greenhouse” tailored to its own methodology, which takes the form of a brand-new RAVEN database. This greenhouse, while sacrificing the core challenges inherent in RAVEN—namely, the diversity and uncertainty of answers—has nevertheless enabled CRAB to achieve remarkable outcomes. Within the confines of this meticulously crafted “greenhouse”, CRAB’s Bayesian methodology has demonstrated remarkable proficiency and efficacy. The controlled setting, tailored to optimize the probabilistic framework, has allowed for a profound exploration and exploitation of the inherent strengths of the Bayesian paradigm, thereby facilitating significant advancements in the field. The scientific community eagerly awaits the implications of this innovative approach for future research.

Additionally, research indicates that relatively decoupled perceptual visual features can contribute to improved reasoning performance[26]. These perceptual visual features not only capture fundamental elements and attributes within images but also effectively express relationships and structures among them. By introducing such perceptual visual features into image reasoning problems, significant enhancements can be achieved in both the model’s reasoning performance and efficiency[26].

Symbolic approaches have brought about higher reasoning precision and enhanced model interpretability[27, 28, 29]. These methods bolster the reasoning capabilities and interpretability of models by incorporating symbolic representations and operations. Specifically, symbolic approaches endow models with increased flexibility and efficiency when addressing intricate image reasoning tasks while also enhancing model transparency and interpretability, facilitating a deeper understanding and analysis of the model's decision-making processes.

II-B Bongard-logo solver

In recent years, researchers have been exploring various potential solutions to address the highly challenging Bongard problems, leading to the emergence of three dominant strategies: language-based feature model approaches, methods relying on convolutional neural network models, and techniques involving generated datasets.

Firstly, language-based feature model methods[13], exemplified by the work of Depweg and others, aim to decipher visual characteristics within image information through a formalized linguistic system. They have devised a formal language capable of symbolizing visual elements within images, utilizing logical operators to extract these visual features and transform them into a symbolic visual vocabulary. Subsequently, they employ symbolic language and Bayesian reasoning to tackle BP problems. However, this approach is severely constrained by its symbolic representation, making it difficult to handle BP issues involving intricate abstract concepts. Specifically, the method can only manage basic shape-based BP problems and is unable to represent or process more sophisticated abstract concept types. Additionally, whenever confronted with a novel BP problem, the need to reconstruct an appropriate symbolic system adds complexity and limitation to the method. After filtering out BP problems that cannot be expressed using this visual language, only 39 of the original 100 BP problems remain, with 35 of them being resolvable.

Secondly, convolutional neural network model-based methods[32], as exemplified by Kharagorgiev and Yun, favor the use of deep learning techniques for automated feature extraction from images. Kharagorgiev constructed an image dataset containing simple shapes and utilized a pre-training process to develop a feature extractor. This feature extractor is then employed to extract image features from Bongard problems, facilitating image classification to determine if test images conform to specified rules. Yun adopted a similar approach but placed greater emphasis on utilizing images containing visual characteristics from BP problems for pre-training to extract BP image features, subsequently linking additional classifiers for discrimination. While these methods can automatically extract and learn features from images, their performance is heavily reliant on the quality and quantity of training data.

Thirdly, among the strategies employed is the generation of datasets[14]. In 2020, Nie et al. applied basic CNNs, relational networks like WReN-Bongard, and Meta-learning techniques to the Bongard-Logo database. They endeavored to enhance model generalization by generating substantial volumes of synthetic data. However, their experimental results indicate that the models did not achieve the desired level of performance, potentially due to significant disparities between the generated data and the distribution of real-world problems.

Notably, the PMoC model[15] has emerged as a notable approach, particularly in addressing the challenges posed by the Bongard-Logo problem. This tailored probability model achieves high reasoning accuracy by constructing independent probability models, demonstrating its effectiveness in discerning deeper patterns and inductive reasoning beyond explicit image features. The strength of PMoC lies in its ability to capture the underlying probabilistic relationships within the problem space, enabling more accurate reasoning and pattern recognition. By leveraging the power of probability modeling, PMoC paves the way for more robust and accurate solutions in abstract reasoning tasks.

In conclusion, it is evident that each approach offers distinct advantages and limitations. Language-based feature model methods provide a fresh perspective for comprehending and deciphering BP problems but have limited capabilities in handling complex abstract concepts. Methods based on convolutional neural network models can automatically learn and extract features from images but are constrained by the quality and quantity of training data. While techniques involving generated datasets hold potential for enhancing model generalization, their effectiveness is contingent on the alignment between generated data and real-world problem scenarios. This underscores the need for a more comprehensive and integrated strategy in addressing Bongard problems.

II-C Transformer and Vision Transformer

The Transformer model[4] diverges from conventional RNN and CNN designs, utilizing a fully attentional mechanism for capturing long-range input sequence dependencies. Its core comprises self-attention and feed-forward neural networks, integrated via residual connections and layer normalization to form its encoders and decoders. The self-attention mechanism, analogous to social network influence diffusion, assigns weights based on input sequence position similarities, fostering flexible non-sequential processing. Additionally, the Transformer incorporates encoder-decoder attention, akin to translation dictionary consultation, where the decoder references the encoder’s output to enhance output sequence accuracy.

The Vision Transformer (ViT)[33] is an innovative approach to computer vision tasks that eschews traditional convolutional neural networks in favor of a pure transformer-based architecture. By dividing images into fixed-size patches and treating them as sequences of tokens, ViT leverages the power of self-attention mechanisms to capture long-range dependencies within the image effectively. This shift towards transformers enables ViT to achieve state-of-the-art performance on various vision benchmarks, heralding a new era in computer vision research.
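To make the tokenization step concrete, the following is a minimal PyTorch sketch of ViT-style patch embedding, i.e., the operation by which an image is split into fixed-size patches and turned into a token sequence. The patch size, channel count, and embedding width are illustrative assumptions rather than settings used in this paper.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """ViT-style tokenization: cut an image into fixed-size patches and
    project each patch to an embedding vector (illustrative sizes only)."""

    def __init__(self, patch_size: int = 16, in_ch: int = 1, d: int = 256):
        super().__init__()
        # A strided convolution is the standard way to cut and embed patches in one step.
        self.proj = nn.Conv2d(in_ch, d, kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, in_ch, H, W) with H and W divisible by patch_size
        tokens = self.proj(x)                      # (B, d, H/patch, W/patch)
        return tokens.flatten(2).transpose(1, 2)   # (B, num_patches, d) token sequence
```

A standard Transformer encoder with self-attention is then applied to this token sequence, typically together with positional embeddings, to produce image features.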

II-D Covariance matrix and correlation loss

The covariance matrix stands as a pivotal tool in multivariate statistical analysis, quantifying the relationships between multiple random variables[34]. In the realms of data science and machine learning, the covariance matrix plays a crucial role, facilitating profound insights into data structures and patterns. This matrix not only encapsulates the variances of individual variables but also the covariances between them, offering a comprehensive view of the interdependencies within a dataset. Its applications span from exploratory data analysis and dimensionality reduction to portfolio optimization and principal component analysis, underscoring its widespread significance in diverse domains of modern data analysis.

The covariance matrix serves as a metric to gauge the linear correlation between any two distributions within a set[34]. By treating each dimension of an image representation as an individual distribution and a collection of such representations as a sample from a group of distributions, one can leverage a batch of samples to assess the linear correlation among the dimensions of the image representation. This approach enables a nuanced understanding of the interdependencies between various features within the image data, fostering insights that can inform downstream tasks in image analysis and processing. We calculate the covariance matrix of a multivariate distribution using Formula (1), and then compute the correlation loss of the multivariate distribution using Formula (2).

$$M_{\sigma}(x)=\frac{1}{N-1}\sum_{i=1}^{N}(\mathbf{x}_{i}-\bar{\mathbf{x}})(\mathbf{x}_{i}-\bar{\mathbf{x}})^{\top}\qquad(1)$$

$$L(x)=\frac{1}{d}\sum\left(M_{\sigma}(x)^{2}\cdot(1-I)\right)\qquad(2)$$

Where $I$ denotes the identity matrix and $M_{\sigma}(x)\in R^{d\times d}$. Here $d$ represents the dimensionality of the vectors $\mathbf{x}_{i}$, and $N$ refers to the number of samples involved in the computation, given that the covariance matrix is calculated over a batch of samples.
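As a concrete reference, the following is a minimal PyTorch sketch of Formulas (1) and (2): it computes the sample covariance matrix of a batch of d-dimensional representations and averages the squared off-diagonal entries. The function name and tensor layout are illustrative assumptions, not code from the paper.

```python
import torch

def correlation_loss(x: torch.Tensor) -> torch.Tensor:
    """Correlation loss of Formulas (1)-(2) over a batch of representations.

    x: tensor of shape (N, d), treated as N samples from a d-dimensional distribution.
    """
    n, d = x.shape
    x_centered = x - x.mean(dim=0, keepdim=True)
    # Formula (1): sample covariance matrix, shape (d, d)
    cov = x_centered.t() @ x_centered / (n - 1)
    # Formula (2): average the squared off-diagonal entries (the diagonal is masked out)
    off_diag = cov.pow(2) * (1.0 - torch.eye(d, device=x.device))
    return off_diag.sum() / d
```

Driving this quantity toward zero decorrelates the dimensions of the representation while leaving their individual variances unconstrained.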

II-E The Expectation-Maximization (EM) algorithm

The Expectation-Maximization (EM) algorithm[35] represents a powerful iterative method widely employed in statistics for finding maximum likelihood estimates of parameters in probabilistic models, especially when the data contain missing values or are observed in an incomplete manner. By alternating between an expectation “E” step and a maximization “M” step, the algorithm optimizes the likelihood function, gradually refining parameter estimates until convergence. Its versatility and robustness have made the EM algorithm a cornerstone technique in diverse fields such as machine learning, bioinformatics, and image processing, where complex models and data structures often demand sophisticated estimation methodologies.

Specifically, we employ a function, denoted as $P(X,Z|\,\theta)$, to model the joint distribution of the observed data and their corresponding latent variables, where both $Z$ and $\theta$ are unknown. The process initiates with an arbitrarily assigned initial $\theta$, which is then used to compute the posterior distribution of the latent variables, $P(Z|\,X,\theta)$. Given this posterior distribution, we evaluate the expectation of the complete-data log-likelihood of $X$ and $Z$. Subsequently, $\theta$ is recalculated in a manner that maximizes this expectation. This iterative process of alternating between computing the posterior and re-estimating $\theta$ continues until the likelihood converges to its maximum, yielding an optimal estimation of the parameters and latent variables within the data.
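For illustration only, the following is a minimal NumPy sketch of the alternating E and M steps for a two-component one-dimensional Gaussian mixture. It is a textbook instance of the EM recipe described above, not part of this paper's method; the initialization and iteration count are arbitrary assumptions.

```python
import numpy as np

def em_gmm_1d(x: np.ndarray, n_iter: int = 50):
    """Minimal EM for a two-component 1-D Gaussian mixture (illustration only)."""
    # Arbitrary initial parameters theta = (weights, means, variances)
    w = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()]) + 1e-6
    for _ in range(n_iter):
        # E-step: responsibilities, i.e. the posterior P(Z | X, theta)
        lik = np.stack(
            [w[k] * np.exp(-(x - mu[k]) ** 2 / (2 * var[k])) / np.sqrt(2 * np.pi * var[k])
             for k in range(2)], axis=1)                     # shape (N, 2)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-estimate theta to maximize the expected complete-data log-likelihood
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return w, mu, var
```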

III Methodology

In this section, we propose four methods for the Bongard-Logo[13] and RPM problems[12], namely CFN, Triple-CFN, Meta Triple-CFN, and the Re-space layer. Each incorporates new loss function terms or network structures compared to its predecessor, aiming for progressive improvement on the Bongard-Logo and RPM problems.

The Bongard-Logo problem and the Raven’s Progressive Matrices (RPM) problem are distinct yet equally challenging tests of abstract reasoning. Both tasks require participants to identify and interpret underlying principles or concepts that are not immediately apparent from the surface-level features of the presented materials. These principles represent a more sophisticated level of abstraction than mere pixel configuration patterns or other low-level visual properties. Instead, they often reflect human-centered preconceptions about shape, size, color, spatial relationships, the concave or convex nature of objects, and the completeness of figures. Through their respective problem formulations, the Bongard-Logo and RPM tasks seek to evaluate an individual’s ability to discern and comprehend these subtler, more abstract patterns and principles.

III-A A baseline for Bongard-Logo

Based on higher-dimensional human concepts and preferences, the creators of Bongard-logo problems have categorized the Bongard-logo dataset into four distinct problem types: FF, BA, NV, CM. Consequently, we can abstract the distribution of the primary group (positive instances) within a Bongard-logo problem as $p_{i}(x|\,y)$ and the distribution of the auxiliary group (negative instances) as $q_{i}(x|\,y)$. Here, $y$ denotes the problem's reasoning type, where $y\in\{\text{FF, BA, NV, CM}\}$, while $i$ represents the problem's identifier, with $i\in[1,n]$ and $n$ signifying the total number of problems. For the purpose of conveniently representing data in the Bongard-Logo problem, we denote the Bongard-Logo images as $x_{ij}$. Specifically, $\{x_{ij}|\,j\in[1,6]\}$ represents images in the $i$-th primary group, while $\{x_{ij}|\,j\in[8,13]\}$ represents images in the $i$-th auxiliary group. Additionally, $x_{i7}$ represents the test image to be potentially assigned to the $i$-th primary group, and $x_{i14}$ represents the test image to be potentially assigned to the $i$-th auxiliary group.

To effectively tackle Bongard-Logo problems, we are developing a deep learning algorithm, $f_{\theta}(z|\,x)$, primarily tasked with transforming input samples $x_{ij}$ into latent variables $z_{ij}$. Ideally, the distributional divergence between the latent variable distribution of the primary group, $p_{i}^{\prime}(z|\,y)$, and that of the auxiliary group, $q_{i}^{\prime}(z|\,y)$, should be maximal. However, given the nature of Bongard-logo as a small-sample learning problem, accurately estimating and constraining these two latent variable distributions poses significant challenges. Consequently, directly optimizing the distributional divergence between them may encounter substantial difficulties, thereby making it arduous to train a deep model that exhibits exceptional performance.

In this manuscript, we leverage the InfoNCE loss function[36] as a reasoning loss term for the purpose of training a standard ResNet18 network. The resulting model, denoted as $f_{\theta}(z|\,x)$, possesses the proficiency to tackle either individual or concurrent high-dimensional concept intricacies inherent within the Bongard-logo dataset. Mathematically, the InfoNCE loss function can be formalized as follows:

$$\ell_{\mathbf{InfoNCE}}\left(z_{pos},\tilde{z}_{pos},\{z_{neg_{m}}\}_{m=1}^{M}\right)=-\log\frac{e^{(z_{pos}\cdot\tilde{z}_{pos})/t}}{e^{(z_{pos}\cdot\tilde{z}_{pos})/t}+\sum_{m=1}^{M}e^{(z_{pos}\cdot z_{neg_{m}})/t}}\qquad(3)$$

In the context of this study, $z_{pos}$ and $\tilde{z}_{pos}$ denote the representations of distinct samples belonging to the primary group, whereas $\{z_{neg_{m}}\}$ encompasses the representations of all $M$ samples within the auxiliary group. Here, $M$ signifies the number of auxiliary-group samples within a single Bongard-logo case, and the temperature coefficient $t$ is set to $10^{-3}$. InfoNCE imposes the following constraint on the network: the cosine similarity between vector $z_{pos}$ and vector $\tilde{z}_{pos}$ should be higher than the cosine similarity between vector $z_{pos}$ and the vectors in the set $\{z_{neg_{m}}\}$. This constraint aligns with the underlying logic of Bongard-logo. The feedforward process of the network is illustrated in Figure 4. By utilizing the InfoNCE function, we can avoid estimating the distributions $p_{i}^{\prime}(z|\,y)$ and $q_{i}^{\prime}(z|\,y)$ while simultaneously encouraging the representations of the primary group encoded by the network $f_{\theta}(z|\,x)$ to be more similar and ensuring greater mutual exclusion between the representations of the primary and auxiliary groups. In Figure 4, "$A_{7}^{2}$" represents the number of distinct ordered ways to select 2 elements from a set of 7 elements.

[Figure 4]
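The following is a minimal PyTorch sketch of the InfoNCE reasoning loss in Formula (3). Representations are L2-normalized so that dot products act as cosine similarities, as described above; the function name and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(z_pos: torch.Tensor,
             z_pos_tilde: torch.Tensor,
             z_negs: torch.Tensor,
             t: float = 1e-3) -> torch.Tensor:
    """InfoNCE reasoning loss of Formula (3).

    z_pos, z_pos_tilde: representations of two primary-group images, shape (d,).
    z_negs: representations of the M auxiliary-group images, shape (M, d).
    """
    z_pos = F.normalize(z_pos, dim=-1)
    z_pos_tilde = F.normalize(z_pos_tilde, dim=-1)
    z_negs = F.normalize(z_negs, dim=-1)
    pos_logit = (z_pos * z_pos_tilde).sum() / t           # similarity to the positive
    neg_logits = (z_negs @ z_pos) / t                     # similarities to the negatives, (M,)
    logits = torch.cat([pos_logit.view(1), neg_logits])   # positive placed at index 0
    target = torch.zeros(1, dtype=torch.long, device=logits.device)
    # cross-entropy with target 0 equals -log(exp(pos) / (exp(pos) + sum exp(neg)))
    return F.cross_entropy(logits.unsqueeze(0), target)
```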

Utilizing the aforementioned methodologies, this study conducted rigorous experiments on the four concept databases of Bongard-Logo, both individually and in combination. The comprehensive experimental results are presented in detail in Table I. In this table, the “Bongard-Logo” entry encompasses the consolidated findings obtained by training the model using a combined dataset of all four concepts. Conversely, the “Separated Bongard-Logo” entry delineates the specific outcomes derived from training each concept independently.

The results unequivocally demonstrate that the convolutional deep model, ResNet18, has struggled to distinguish between the four distinct concepts. Convolutional networks are designed to encode image representations primarily based on the configurations of image pixels. This study hypothesizes that the Bongard-Logo cases exhibit distinct classification patterns based on varying concepts. Furthermore, it is speculated that since no additional concept labels were introduced as supervisory signals during training, the network solely relies on the configuration of image pixels to process Bongard-Logo cases, which may lead to internal confusion within the network and ambiguity concerning the concept. This ambiguity underscores the need for further investigation into the design of more robust and conceptually discriminative deep learning models for tackling such complex tasks.

TABLE I: Test Accuracy (%)

Data Set                   FF     BA     CM     NV
Bongard-logo               88.1   97.9   76.0   75.8
Separated Bongard-logo     97.9   99.0   75.0   72.8

III-B Cross-Feature Net (CFN)

Our aspiration is for the network to deduce concepts solely from the pixel configuration patterns inherent in images. Concepts constitute the fundamental basis for the classification of Bongard-Logo problems, making their accurate deduction crucial. Modeling abstract, high-dimensional human concepts often yields intricate and interconnected distributions, posing significant challenges. Given the difficulty in estimating and leveraging conditional distributions such as $p_{i}^{\prime}(z|\,y)$ and $q_{i}^{\prime}(z|\,y)$, along with the complexity of modeling high-dimensional human concepts, this paper attempts to implicitly reconstruct the concepts underlying the Bongard-Logo dataset. We reformulate the problem as $p_{i}(x|\,y^{\prime})$ and $q_{i}(x|\,y^{\prime})$, where $y^{\prime}$ belongs to the set $\{Y_{\alpha}\}$; the range of $\alpha$ remains unknown, and no additional constraints were imposed on it in this study. In this context, each $Y_{\alpha}$ represents a reconstructed concept, potentially distinct from the original concepts. It is our hope that by reconstructing these concepts, or more precisely, by redefining them, the network will be able to more easily recognize the appropriate concept for categorizing Bongard-Logo instances. This recognition will be based primarily on image styles and pixel configuration patterns. Through this reorganization process, we aim to alleviate the confusion that often arises between high-dimensional Bongard-Logo concepts within deep learning models.

As for the implementation on the network, we introduce a sophisticated deep learning algorithm, designated as $g(q,k|\,x)$, aimed at refining the representation of samples within the primary group of a Bongard-Logo problem while distinguishing them from those in the auxiliary group. The algorithm $g(q,k|\,x)$, called Cross-Feature Net, is tailored to extract concepts exclusively based on image styles, thereby laying the foundation for Bongard-Logo solution and categorization.

To elaborate, each image in a Bongard-Logo problem is encoded with a concept vector $q$ ($q\in R^{d}$) and an associated feature vector set $\{k_{\beta}|\,\beta\in[1,m],\ k_{\beta}\in R^{d}\}$. Subsequently, cross-attention mechanisms are employed to analyze the interactions between the conceptual vector $q$ and the style vector set $\{k_{\beta}\}$, with vector $q$ serving as the query and $\{k_{\beta}\}$ as the corresponding key-value pairs. We claim that reinterpreting the underlying concept conveyed by the image through a deep learning model and adjusting the weighting of the image's feature vectors based on this reinterpretation can effectively tackle such challenging problems. This innovative approach obviates the need for an extensive and potentially cumbersome search for optimal concept partitioning strategies.

In detail, we employ ResNet18 as $g_{\omega}(q|\,x)$ to learn the concept vector $q$ for Bongard-Logo images, and ResNet50 as $g_{\theta}(k|\,x)$ to learn the feature vector set $\{k_{\beta}\}$ for the images. The cross-attention mechanism is computed between $q$ and $\{k_{\beta}\}$, yielding the final feature representation $z_{ij}$ of the image $x_{ij}$, which constitutes the core computational process of the Cross-Feature Net. We utilize the new backbone $g(q,k|\,x)$ to replace the ResNet18 used in the aforementioned baseline, with the expectation that the new backbone $g(q,k|\,x)$ will be more suitable for addressing the Bongard-logo problem than ResNet18. We persist in utilizing the InfoNCE loss function for training the network $g(q,k|\,x)$. Additionally, drawing inspiration from the EM clustering algorithm, we alternately train the parameters of $g_{\omega}(q|\,x)$ and $g_{\theta}(k|\,x)$ within the Cross-Feature Net. Since the cross-attention mechanism can be interpreted as a weighted sum of the feature vector set $\{k_{\beta}\}$ based on its similarity to the concept vector $q$, optimizing $g_{\omega}(q|\,x)$ is treated as a process of recalculating distribution centroids, while optimizing $g_{\theta}(k|\,x)$ is considered as maximizing the expected distribution.

In summary, our goal is to utilize a network, denoted as $g_{\omega}(q|\,x)$, for extracting the concepts essential to solving the problem. This extraction process relies exclusively on the pixel configuration patterns within the problem images. Furthermore, we introduce another network, $g_{\theta}(k|\,x)$, designed to craft image representations that align closely with these extracted concepts. To obtain the final representation of the Bongard-Logo images, we compute the attention result between the concepts and these representations. This attention mechanism helps us to focus on the most relevant features. Subsequently, we constrain the resulting representation using the InfoNCE loss function and alternately optimize both $g_{\omega}(q|\,x)$ and $g_{\theta}(k|\,x)$. By adopting a unified framework, $g(q,k|\,x)$, which we refer to as CFN (Cross-Feature Network), we anticipate a substantial enhancement in reasoning precision. The detailed feedforward processes associated with CFN are illustrated in Figure 5.

[Figure 5]
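For concreteness, the following is a minimal PyTorch sketch of the CFN forward pass described above: a ResNet18 branch produces the concept vector q, a ResNet50 branch produces a set of feature vectors {k_beta} (here taken to be the spatial locations of its last feature map), and a cross-attention step with q as the query yields the final representation. The embedding width, the use of spatial locations as {k_beta}, and the 1×1 projection are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18, resnet50

class CFNSketch(nn.Module):
    """Sketch of the Cross-Feature Net g(q, k | x)."""

    def __init__(self, d: int = 256):
        super().__init__()
        self.g_omega = resnet18(num_classes=d)                         # concept vector q
        backbone = resnet50()
        self.g_theta = nn.Sequential(*list(backbone.children())[:-2])  # spatial feature map
        self.proj_k = nn.Conv2d(2048, d, kernel_size=1)                # one k_beta per location
        self.scale = d ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, H, W) Bongard-Logo images (replicated to 3 channels if grayscale)
        q = self.g_omega(x)                                   # (B, d)
        k = self.proj_k(self.g_theta(x))                      # (B, d, h, w)
        k = k.flatten(2).transpose(1, 2)                      # (B, m, d): feature set {k_beta}
        # cross-attention: q is the query, {k_beta} serve as keys and values
        attn = F.softmax((k @ q.unsqueeze(-1)).squeeze(-1) * self.scale, dim=-1)  # (B, m)
        z = (attn.unsqueeze(-1) * k).sum(dim=1)               # weighted sum of {k_beta}
        return z                                              # final representation z_ij
```

The representation z obtained this way is then trained with the InfoNCE loss of Formula (3), optionally alternating the updates of g_omega and g_theta as described above.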

III-C Cross covariance-constrained Feature Net (Triple-CFN)

Afterwards, although imitating the EM algorithm to alternately update the parameters of $g_{\omega}(q|\,x)$ and $g_{\theta}(k|\,x)$ can enhance the model's performance, the alternating training process for $g_{\omega}(q|\,x)$ and $g_{\theta}(k|\,x)$ is tedious and unstable. To address this, our paper makes further contributions. In essence, we designed two networks, $g_{\omega}(q|\,x)$ and $g_{\theta}(k|\,x)$, with the aim of implicitly reinterpreting and effectively solving the Bongard-Logo problem. By maximizing the image representation information carried by $\{k_{\beta}\}$ in the output of $g_{\theta}(k|\,x)$, we can improve both processes mentioned above. To a certain extent, this paper posits that maximizing the information carried by the feature vector set $\{k_{\beta}\}$ can be a viable alternative to the traditional process of maximizing expectation. By adopting this approach, the inherent two-step process arising from the simulation of the EM algorithm[35] within the CFN can be effectively integrated.

Therefore, we introduced the correlation loss based on the covariance matrix as an additional term in the loss function for CFN to decrease the correlation between the dimensions of $\{k_{\beta}\}$. This resulted in the creation of the Cross covariance-constrained Feature Net (Triple-CFN). The coefficient for the newly introduced loss term was set to be 25 times that of the reasoning loss term. Moreover, the practice of alternately updating the parameters of $g_{\omega}(q|\,x)$ and $g_{\theta}(k|\,x)$, while extending training epochs, yields only minimal improvements for the Triple-CFN. Therefore, on the RPM data, we no longer use the strategy of alternating updates between the two networks. Calculating attention results over the output of CNNs can result in attention collapse issues. We posit that reducing the correlation between dimensions in the output representation of CNNs offers a method to alleviate attention collapse. Figure 6 presents a detailed depiction of the feed-forward process within Triple-CFN. The complete loss function on a batch of Bongard-Logo problems for Triple-CFN is as follows:

$$\ell_{\text{Triple-CFN}}=\frac{1}{b}\sum_{i=1}^{b}\sum_{j=1,\tilde{j}\neq j}^{7}\ell_{\mathbf{InfoNCE}}\left(z_{ij},z_{i\tilde{j}},\{z_{ij}\}_{j=8}^{14}\right)+25\cdot\ell_{cov}\left(\{z_{ij}|\,i\in[1,b],\ j\in[1,14]\}\right)\qquad(4)$$

Where $b$ represents the batch size for training. The reasoning loss term $\ell_{\mathbf{InfoNCE}}(\cdot)$ is expressed in Formula (3). The correlation loss term $\ell_{cov}(\cdot)$ represents a specific computation involving the vectors enclosed within the parentheses. In this context, the set of vectors inside $\ell_{cov}(\cdot)$ is treated as samples drawn from a multivariate distribution. Based on these samples, the covariance matrix of the multivariate distribution is calculated using Formula (1), and the correlation loss for Triple-CFN is then calculated using Formula (2).
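For reference, the following is a minimal PyTorch sketch of Formula (4), reusing the `info_nce` and `correlation_loss` sketches given earlier. The tensor layout and the reading of the double index over j and j̃ (all ordered primary-group pairs) are assumptions.

```python
import torch

def triple_cfn_loss(z: torch.Tensor, t: float = 1e-3) -> torch.Tensor:
    """Sketch of the Triple-CFN loss in Formula (4) for one batch of Bongard-Logo problems.

    z: representations of shape (b, 14, d); positions 0..6 hold the primary group and its
    test image, positions 7..13 the auxiliary group and its test image (1-based 1..7 / 8..14).
    """
    b, n, d = z.shape
    reasoning = z.new_zeros(())
    for i in range(b):
        negs = z[i, 7:]                                   # auxiliary-group representations
        for j in range(7):
            for j_tilde in range(7):
                if j_tilde != j:                          # all ordered primary-group pairs
                    reasoning = reasoning + info_nce(z[i, j], z[i, j_tilde], negs, t)
    reasoning = reasoning / b
    # correlation loss over every representation in the batch, weighted by 25
    return reasoning + 25.0 * correlation_loss(z.reshape(b * n, d))
```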

[Figure 6]

III-D Triple-CFN on RPM problem

Particularly when tackling RPM problems, Triple-CFN is able to demonstrate its distinctive utility and value. Specifically, when addressing RPM problems, the core backbone of Triple-CFN remains unchanged, with only slight adjustments made to the design of $g(q,k|\,x)$. These adjustments are necessary operations to enable Triple-CFN to adapt to RPM reasoning rules and inductive biases.

Following the current consensus among RPM solvers, which emphasizes the need for multi-scale or multi-viewpoint feature extraction from RAVEN images[25, 30, 24], Triple-CFN utilizes a Vision Transformer (ViT) for feature extraction while preserving all output vectors as multi-viewpoint features. In this paper, the number of viewpoints is denoted as $L$. The extraction process is illustrated in Figure 7.

[Figure 7]

Subsequently, Triple-CFN processes each viewpoint equally.

From a viewpoint that considers all images in an RPM problem, this paper employs a Multi-layer Perceptron (MLP) with a bottleneck structure to extract information from all minimal reasoning units (three images within a row for RAVEN, and three images within a row or column for PGM[37]), and preserves the extracted unit information as vectors $\{k_{\beta}|\,\beta\in[1,M]\}$, where $\beta$ denotes the index of the minimal reasoning unit and $M$ stands for the total number of minimal reasoning units. Using an MLP to process the sequential images within minimal reasoning units retains their order in a straightforward manner. Combining the unit vectors from the problem stem with $S$ optimizable vectors, we input them into the network $g_{\omega}(q|\,x)$, which utilizes a Transformer-Encoder as its backbone, and obtain $\{q_{\alpha}|\,\alpha\in[1,S]\}$ by extracting all the optimizable vectors from the network output. By encoding multiple concept vectors $\{q_{\alpha}\}$ through the network $g_{\omega}(q|\,x)$, the intention is to enable the network $g(q,k|\,x)$ to solve RPM problems through a multi-evaluation reasoning approach. The calculation process of $g(q,k|\,x)$ is illustrated in Figure 8.

[Figure 8]
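A hedged PyTorch sketch of the encoding steps just described is given below. The module and parameter names are hypothetical, the per-viewpoint ViT features are assumed to be computed beforehand, and the bookkeeping for answer candidates is omitted; only the bottleneck MLP over minimal reasoning units and the Transformer-Encoder $g_{\omega}(q|x)$ with $S$ optimizable vectors are shown.

```python
import torch
import torch.nn as nn

class RPMEncoder(nn.Module):
    """Sketch of the RPM-side encoding pipeline (hypothetical configuration)."""
    def __init__(self, dim=128, n_concepts=3, n_heads=8):
        super().__init__()
        # Bottleneck MLP over one minimal reasoning unit (three panel features, order preserved).
        self.unit_mlp = nn.Sequential(
            nn.Linear(3 * dim, dim // 2), nn.ReLU(), nn.Linear(dim // 2, dim)
        )
        # S optimizable vectors that become the concept vectors {q_alpha}.
        self.concept_tokens = nn.Parameter(torch.randn(n_concepts, dim))
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.g_omega = nn.TransformerEncoder(enc_layer, num_layers=3)

    def forward(self, panel_feats):
        # panel_feats: [B, L, P, dim] per-viewpoint ViT features of the stem panels,
        # arranged so that consecutive triples form minimal reasoning units.
        B, L, P, dim = panel_feats.shape
        units = panel_feats.reshape(B, L, P // 3, 3 * dim)   # [B, L, M, 3*dim]
        k = self.unit_mlp(units)                             # unit vectors {k_beta}: [B, L, M, dim]
        S = self.concept_tokens.shape[0]
        q_tokens = self.concept_tokens.unsqueeze(0).expand(B * L, S, dim)
        seq = torch.cat([k.reshape(B * L, -1, dim), q_tokens], dim=1)
        out = self.g_omega(seq)
        q = out[:, -S:, :].reshape(B, L, S, dim)             # concept vectors {q_alpha}
        return q, k
```

Under these assumptions, such an encoder would be applied once per candidate completion of the matrix, producing the per-viewpoint sets that the scoring head described next consumes.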

After calculating the cross-attention results between $\{q_{\alpha}\}$ and $\{k_{\beta}\}$, a new MLP is used to score these outputs, yielding $S$ scores under one viewpoint. Considering all viewpoints, Triple-CFN encodes $L$ sets of $\{q_{\alpha}\}$ and $\{k_{\beta}\}$ and calculates $L\times S$ scores, which are then averaged to determine the final score. The calculation of the final score is illustrated in Figure 9.

[Figure 9]
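The scoring step can be sketched as follows. This is an illustrative implementation rather than the paper's exact head: cross-attention is computed with the concept vectors $\{q_{\alpha}\}$ as queries and the unit vectors $\{k_{\beta}\}$ as keys and values, an MLP assigns one score per concept, and the $L\times S$ scores are averaged. In practice one such score would be produced per answer candidate, with the cross-entropy reasoning loss applied across candidates.

```python
import torch
import torch.nn as nn

class ViewpointScorer(nn.Module):
    """Sketch of the scoring head: cross-attention + per-concept MLP score,
    averaged over the L x S scores. Names and sizes are illustrative."""
    def __init__(self, dim=128, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=n_heads, batch_first=True)
        self.score_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, q, k):
        # q: [B, L, S, dim] concept vectors, k: [B, L, M, dim] unit vectors.
        B, L, S, dim = q.shape
        q_flat = q.reshape(B * L, S, dim)
        k_flat = k.reshape(B * L, -1, dim)
        attended, _ = self.attn(q_flat, k_flat, k_flat)   # cross-attention, queries = {q_alpha}
        scores = self.score_mlp(attended).squeeze(-1)     # [B*L, S] one score per concept
        return scores.view(B, L, S).mean(dim=(1, 2))      # average over the L x S scores
```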

To enforce constraints on this final score, the Cross-Entropy loss function is employed as the reasoning loss term $\ell_{\text{cross-entropy}}$, and the correlation loss term is still applied to the vector set $\{k_{\beta}\}$. Notably, the coefficient of the reasoning loss term is set to 100 times that of the correlation loss term; this setting differentiates our approach from the one employed by Triple-CFN on Bongard-Logo tasks. Figure 10 provides a detailed illustration of the modifications made to Triple-CFN to adapt it to RPM problems, including where the reasoning loss term $\ell_{\text{cross-entropy}}$ and the correlation loss term $\ell_{cov}$ are applied.

[Figure 10]

III-E Meta Cross-covariance-constrained Feature Net (Meta Triple-CFN)

Our experiments revealed that Triple-CFN's approach of reinterpreting the conceptual space of an abstract reasoning problem offers a slight advantage on the Bongard-Logo problem. Triple-CFN was designed around the hypothesis of conflicts between low-dimensional image styles and high-dimensional human concepts. When the high-dimensional concepts are designed reasonably and effectively, as with the image progression patterns in the RAVEN and PGM problems, using these concepts as supervisory signals to constrain the training of Triple-CFN can lead to more significant contributions from the network.

The RAVEN and PGM problems are accompanied by descriptions of their image progression patterns (Meta data). Previous works have attempted to enhance their RPM solvers with the additional task of learning these progression patterns, aiming to improve reasoning performance and interpretability; however, efforts led by MRNet[19] suggest that such approaches may be counterproductive. This paper argues that the structure of Triple-CFN can balance the reasoning task and the progression-pattern matching task, and can even allow the two tasks to refine each other and grow together, which previous RPM discriminators, including RS-Tran[30] and MRNet, could not achieve. We posit that strengthening the identity of $\{q_{\alpha}\}$ as concept vectors, by utilizing these pattern cue signals, can effectively and reasonably enhance both the interpretability and the reasoning accuracy of Triple-CFN.

In this study, an additional standard Transformer-Encoder is employed to process the Meta data in the RAVEN and PGM problems, depicting a concept space for Triple-CFN. A newly introduced Meta loss term constrains the $\{q_{\alpha}\}$ vectors encoded by Triple-CFN to their corresponding positions in the concept space established by this Transformer-Encoder. Recall that $\{q_{\alpha}\,|\,\alpha\in[1,S]\}$ is the set of $S$ concept vectors encoded by Triple-CFN under a single viewpoint. Since constraints must be imposed on $\{q_{\alpha}\}$ across all viewpoints, this paper computes the average of the $\{q_{\alpha}\}$ obtained from each viewpoint, denoted $\{\overline{q}_{\alpha}\,|\,\alpha\in[1,S]\}$, and imposes constraints on this averaged representation. Regarding the implementation of the concept space for RAVEN, we tokenize the progression-pattern descriptions using the format 'type: XXXX, size: XXXX, color: XXXX, number/position: XXXX'. We then combine each tokenized description with an optimizable vector and process them with standard Transformer-Encoders. The combined optimizable vector, extracted from the attention results of the Transformer-Encoder, serves as the feature of the pattern description. Through this process, we generate feature vectors for all kinds of pattern descriptions and establish the concept space of progression patterns. These feature vectors are referred to as $\{T_{k}\,|\,k\in[1,K]\}$.
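A minimal sketch of how such a concept space might be built is shown below, assuming a toy tokenizer has already mapped each pattern description to integer token ids; the vocabulary size, depth, and readout mechanism are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MetaConceptSpace(nn.Module):
    """Sketch of building the concept-space features {T_k} from Meta-data descriptions."""
    def __init__(self, vocab_size=64, dim=128, num_layers=2, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.readout = nn.Parameter(torch.randn(1, 1, dim))   # optimizable readout vector
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)

    def forward(self, token_ids):
        # token_ids: [K, T] integer tokens for the K pattern descriptions,
        # e.g. tokenized from 'type: ..., size: ..., color: ..., number/position: ...'.
        K = token_ids.shape[0]
        tokens = self.embed(token_ids)                               # [K, T, dim]
        tokens = torch.cat([self.readout.expand(K, -1, -1), tokens], dim=1)
        out = self.encoder(tokens)
        return out[:, 0]                                             # features {T_k}: [K, dim]
```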

Finally, a new Meta loss term, based on the InfoNCE loss, is introduced. It optimizes the cosine similarity between the $\{\overline{q}_{\alpha}\}$ vectors in Triple-CFN and the feature vectors $\{T_{k}\,|\,k\in[1,K]\}$ of the concept space formed from the Meta data. Specifically, it ensures that each $\overline{q}_{\alpha}$ is more similar to its corresponding pattern feature vector $T_{\tilde{k}}$ than to any other vector in $\{T_{k}\,|\,k\in[1,K],\,k\neq\tilde{k}\}$. The number of vectors in $\{\overline{q}_{\alpha}\,|\,\alpha\in[1,S]\}$, namely $S$, is determined by how many optimizable vectors are combined with the logical input of the $g_{\omega}(q|x)$ network and can in principle be set arbitrarily. When Triple-CFN is constrained by Meta data from RPM problems, we typically set $S$ to one more than the number of decoupled progression patterns identified in the Meta data. For instance, when solving PGM problems, $S$ can be set to 3 because of the two decoupled conceptual attributes, "shape" and "line".

The rationale for setting $S$ one higher than the number of decoupled progression patterns is to guarantee an unconstrained vector within the set $\{\overline{q}_{\alpha}\,|\,\alpha\in[1,S]\}$. This vector is not bound by the constraints imposed by the Meta data and can therefore engage autonomously in solving the RPM problem. It serves as a countermeasure against any unforeseen or unreasonable design elements embedded in the Meta data, enhancing the overall robustness and adaptability of Meta Triple-CFN. With the inclusion of the progression-pattern labels and the new loss term, Triple-CFN is transformed into Meta Triple-CFN. The Meta loss term can be expressed as follows:

$$\ell_{\text{Meta}}\big(\{\overline{q}_{\alpha}\,|\,\alpha\in[1,S-1]\},\ \{T_{k}\,|\,k\in[1,K]\}\big)
= -\sum_{\alpha=1,\ \tilde{k}|\alpha}^{S-1}\log\frac{e^{(\overline{q}_{\alpha}\cdot T_{\tilde{k}})/t}}{e^{(\overline{q}_{\alpha}\cdot T_{\tilde{k}})/t}+\sum_{k=1,\,k\neq\tilde{k}}^{K}e^{(\overline{q}_{\alpha}\cdot T_{k})/t}} \qquad (5)$$

The temperature coefficient $t$ in the Meta loss term is set to $10^{-6}$. Note that $T_{\tilde{k}}$ denotes the progression-pattern vector with which $\overline{q}_{\alpha}$ should be aligned. In Formula (5), $\tilde{k}$ is determined by $\alpha$, which binds the respective progression patterns to different vectors in $\{\overline{q}_{\alpha}\,|\,\alpha\in[1,S-1]\}$ and ensures that these vectors jointly align with all the decoupled progression patterns. From another viewpoint, $\{\overline{q}_{\alpha}\,|\,\alpha\in[1,S]\}$ can be regarded as $S$ slots: the Meta loss term embeds the $S-1$ decoupled concepts from the Meta data into these slots while reserving one empty slot to stabilize Triple-CFN. This reserved slot serves as a safeguard against subtle, unreasonable configurations that might appear unexpectedly in the Meta data. The calculation process of $\ell_{\text{Meta}}$ is illustrated in Figure 11.

[Figure 11]
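Formula (5) can be implemented compactly as a cross-entropy over similarity logits, as in the following sketch; the tensor names are assumptions, and the vectors are L2-normalized here to reflect the cosine-similarity description in the text.

```python
import torch
import torch.nn.functional as F

def meta_loss(q_bar, T, target_idx, t=1e-6):
    """Sketch of the Meta loss in Formula (5).

    q_bar:      [S-1, dim] averaged concept vectors constrained by Meta data
                (the reserved free slot is excluded).
    T:          [K, dim] progression-pattern feature vectors {T_k}.
    target_idx: [S-1] index k~ of the pattern each q_bar_alpha should align with.
    """
    q_n = F.normalize(q_bar, dim=-1)            # cosine similarity, per the textual description
    T_n = F.normalize(T, dim=-1)
    logits = (q_n @ T_n.t()) / t                # [S-1, K] similarities over the concept space
    # Cross-entropy with target k~ reproduces the -log softmax term of Formula (5), summed over alpha.
    return F.cross_entropy(logits, target_idx, reduction='sum')
```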

In this advanced framework, the coefficients of the novel Meta loss term and the preexisting Cross-Entropy loss term, which jointly constrain model reasoning, are set equal to each other and 100 times larger than the coefficient of the correlation loss term. Figure 12 illustrates the structure of Meta Triple-CFN in detail. It is worth noting that Meta Triple-CFN is tailored to RPM problems because they provide unambiguous, well-defined auxiliary "rule" supervision signals. Conversely, Bongard-Logo problems exhibit overlapping patterns and concepts, which constitute the source of their difficulty and render Meta Triple-CFN unsuitable for them.

[Figure 12]

Intuitively, providing Meta data directly to a deep neural network as additional supervisory signals should improve its accuracy on abstract reasoning problems. In practice, this is not the case: previous research has mostly shown that introducing Meta data can actually decrease reasoning accuracy[24, 19, 25]. The ingenuity of Triple-CFN lies in its ability to overcome this curse on RPM. RS-Tran has demonstrated a tangible performance improvement through the indirect use of Meta data, namely for pre-training its encoder, but it has not achieved human-interpretable rules alongside exceedingly high reasoning accuracy. Furthermore, in the multiple reasoning steps of RS-Tran, the content of each step must be verified through post-hoc masking experiments, whereas the reasoning steps in Meta Triple-CFN inherently exhibit ex-ante interpretability with respect to progression patterns. Meta Triple-CFN is thus a model that successfully balances both objectives.

III-F Re-space layer

The sources of reasoning difficulty differ between Bongard-Logo and RPM problems: the challenge in Bongard-Logo stems partly from conflicts among high-dimensional concepts at a fundamental level, whereas RPM problems demand multi-level reasoning.

In this paper, both Triple-CFN and Meta Triple-CFN implicitly or explicitly constrain the progression-pattern vectors $\{q_{\alpha}\}$ for RPM problems. We posit that, at its core, the constraint imposed on $\{q_{\alpha}\}$ in Meta Triple-CFN spiritually resembles the code-book approach, albeit implemented through the lens of a linguistic model. Thus, the essence of Meta Triple-CFN lies in its ability to standardize the output of Triple-CFN under the supervision of auxiliary labels. Based on this view, this paper designs a novel normalization method applied to the $\{k_{\beta}\,|\,\beta\in[1,M]\}$ vector group in both Triple-CFN and Meta Triple-CFN.

Specifically, we establish $M$ optimizable vectors for Triple-CFN, which depict a vector space $\{v_{h}\,|\,h\in[1,M]\}$. Cosine similarity is then computed between the minimal-reasoning-unit vectors $\{k_{\beta}\}$ and each optimizable vector, as follows:

$$k'_{\beta h}=\frac{v_{h}\cdot k_{\beta}}{\|v_{h}\|\,\|k_{\beta}\|} \qquad (6)$$
$$k'_{\beta}=\{k'_{\beta h}\,|\,h\in[1,M]\} \qquad (7)$$

The resulting vector $k'_{\beta}$, composed of the $M$ cosine similarities $\{k'_{\beta h}\,|\,h\in[1,M]\}$, represents the coordinates of the minimal-reasoning-unit vector $k_{\beta}$ within the vector space $\{v_{h}\,|\,h\in[1,M]\}$. The original unit vectors $\{k_{\beta}\}$ are replaced by the computed coordinates $\{k'_{\beta}\}$ for subsequent reasoning. The process of the Re-space layer is illustrated in Figure 13.

[Figure 13]
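Formulas (6)-(7) amount to projecting each unit vector onto a learned, normalized basis. The following sketch shows one way this could be written; the module name and default sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReSpaceLayer(nn.Module):
    """Sketch of the Re-space layer (Formulas (6)-(7)): re-expresses each unit
    vector k_beta as its cosine similarities to M optimizable vectors {v_h}."""
    def __init__(self, dim=128, M=128):
        super().__init__()
        self.v = nn.Parameter(torch.randn(M, dim))   # the optimizable space {v_h}

    def forward(self, k):
        # k: [..., dim] minimal-reasoning-unit vectors {k_beta}.
        k_n = F.normalize(k, dim=-1)
        v_n = F.normalize(self.v, dim=-1)
        return k_n @ v_n.t()                         # [..., M] coordinates k'_beta
```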

We posit that this design constitutes an effective normalization technique. During model training, the similarity among the $M$ optimizable vectors is constrained to ensure a richly diverse vector space and to prevent collapse of the Re-space layer output. The constraint is implemented through the following function, used as an additional loss term for Triple-CFN or Meta Triple-CFN:

$$\ell_{\text{Re-space}}\big(\{v_{h}\}_{h=1}^{M}\big)=\sum_{h=1}^{M}-\log\frac{e^{(v_{h}\cdot v_{h})/t}}{e^{(v_{h}\cdot v_{h})/t}+\sum_{\tilde{h}=1,\,\tilde{h}\neq h}^{M}e^{(v_{h}\cdot v_{\tilde{h}})/t}} \qquad (8)$$

Where $t$ is set to $10^{-2}$. The parameter $M$ is set equal to the dimension of $k_{\beta}$, which is 128. When the Re-space layer is incorporated into Triple-CFN or Meta Triple-CFN, the coefficient of this loss term is kept equal to that of the correlation loss term; in other words, the ratio between the Meta loss term, the Cross-Entropy loss term, the correlation loss term, and the Re-space loss term is set to 100:100:1:1. The design of Meta Triple-CFN with the Re-space layer is depicted in Figure 14; the integration of Triple-CFN with the Re-space layer follows analogously.

[Figure 14]
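Since each $v_{h}$ in Formula (8) acts as its own positive, the term reduces to a cross-entropy with the identity as target, as the following sketch shows; the function name and defaults are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def re_space_loss(v, t=1e-2):
    """Sketch of Formula (8): an InfoNCE-style diversity constraint that keeps
    the M optimizable vectors {v_h} dissimilar from one another, preventing
    collapse of the Re-space coordinates."""
    sim = (v @ v.t()) / t                                     # [M, M] pairwise dot products
    targets = torch.arange(v.shape[0], device=v.device)       # positive of v_h is v_h itself
    return F.cross_entropy(sim, targets, reduction='sum')
```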

It is worth emphasizing that the above calculation is not equivalent to mapping the vectors through a matrix and then compressing them with $\tanh$. This design enhances the applicability of Triple-CFN and Meta Triple-CFN to RPM problems.

IV Experiment

All our experiments are implemented in Python using the PyTorch[38] framework.

IV-A Experiment on Bongard-Logo

In this study, we conducted experiments on the Bongard-Logo dataset using the designed CFN and Triple-CFN models. To demonstrate the impact of alternately updating $g_{\theta}(k|x)$ and $g_{\omega}(q|x)$, which mimics the Expectation-Maximization algorithm, we performed ablation experiments; the results are presented in Table II. Our experiments were conducted on a single server equipped with four A100 GPUs. We trained the models using mini-batch gradient descent with a batch size of 120, the Adam[39] optimizer, a learning rate of $10^{-3}$, and a weight decay of $10^{-4}$. It is worth reiterating that Triple-CFN incorporates two loss terms: the reasoning loss term and the correlation loss term. When Triple-CFN is applied to the Bongard-Logo problem, the coefficient ratio between the reasoning loss term, composed of the InfoNCE loss, and the covariance-based correlation loss term is set to 1:25. When addressing the RPM problem, the ratio between the reasoning loss term, formulated as a cross-entropy loss, and the correlation loss term becomes 100:1. Meta Triple-CFN, tailored specifically to the RPM problem, adds a new InfoNCE-based Meta loss term to the components of Triple-CFN; within Meta Triple-CFN, the coefficient ratio among the Meta loss term, the reasoning loss term, and the correlation loss term is 100:100:1. The Re-space layer is an enhancement for both Triple-CFN and Meta Triple-CFN; its integration requires an additional loss term that prevents the output of the Re-space layer from collapsing. Consequently, the coefficient ratio among the Meta loss term, the reasoning loss term, the correlation loss term, and the Re-space loss term is maintained at 100:100:1:1.
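As a summary of the coefficient settings above, a minimal sketch of how the loss terms might be combined during training is given below; the argument names are placeholders for values computed elsewhere (e.g., with the sketches in Section III).

```python
def combined_loss(ce_loss, meta_loss, corr_loss, respace_loss):
    """Weighted sum with the 100:100:1:1 ratio used for Meta Triple-CFN with the
    Re-space layer; for Bongard-Logo the paper instead uses an InfoNCE reasoning
    loss and a 1:25 ratio against the correlation loss."""
    return 100.0 * ce_loss + 100.0 * meta_loss + 1.0 * corr_loss + 1.0 * respace_loss
```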

TABLE II: Accuracy (%) on the Bongard-Logo problem.

Model               | Train | FF   | BA   | CM   | NV
SNAIL               | 59.2  | 56.3 | 60.2 | 60.1 | 61.3
ProtoNet            | 73.3  | 64.6 | 72.4 | 62.4 | 65.4
MetaOptNet          | 75.9  | 60.3 | 71.6 | 65.9 | 67.5
ANIL                | 69.7  | 56.6 | 59.0 | 59.6 | 61.0
Meta-Baseline-SC    | 75.4  | 66.3 | 73.3 | 63.5 | 63.9
Meta-Baseline-MoCo  | 81.2  | 65.9 | 72.2 | 63.9 | 64.7
WReN-Bongard        | 78.7  | 50.1 | 50.9 | 53.8 | 54.3
SBSD                | 83.7  | 75.2 | 91.5 | 71.0 | 74.1
PMoC                | 92.0  | 92.6 | 97.7 | 78.3 | 75.0
CFN                 | 91.2  | 86.5 | 98.1 | 77.0 | 77.5
CFN+EM              | 93.9  | 93.8 | 99.4 | 77.8 | 77.2
Triple-CFN          | 93.2  | 92.0 | 99.2 | 80.8 | 79.1
Triple-CFN+EM       | 95.3  | 94.3 | 99.8 | 80.3 | 80.0

By alternating the updates of $g_{\theta}(k|x)$ and $g_{\omega}(q|x)$, we aimed to simulate the iterative nature of the EM algorithm, which is known for its effectiveness in finding maximum-likelihood estimates in statistical models with latent variables. Our ablation studies revealed that this alternating update strategy improved the performance of CFN on the Bongard-Logo task. As Table II shows, alternating updates between $g_{\theta}(k|x)$ and $g_{\omega}(q|x)$ enhanced the model's performance on the FF and BA problems without significantly affecting its ability to solve the "NV" and "CM" generalization problems. This suggests that simulating the EM process during training, while beneficial, may be somewhat redundant when combined with the already effective cross-attention mechanism; moreover, as CFN is upgraded to Triple-CFN, the contribution of EM diminishes. In addition, compared with PMoC[15], Triple-CFN achieves better performance on multiple quantifiable metrics of the Bongard-Logo dataset while requiring fewer parameters and simpler computation, and it does not necessitate parallel reasoning over multiple perspectives and inferences.

IV-B Experiment on RPM

When confronted with the RAVEN dataset, Triple-CFN demonstrates considerable strength. In this study, we conducted experiments using the same software and hardware configuration as the RS-Tran experiments and replicated their experimental parameters, including batch size, learning rate, and all other factors that could influence model performance, so as to allow the most direct comparison with RS-Tran, currently considered the state-of-the-art model. The accuracy of Triple-CFN on RAVEN and I-RAVEN is recorded in Table III; the results clearly indicate that Triple-CFN performs notably better than RS-Tran.

TABLE III: Test accuracy (%) on RAVEN / I-RAVEN.

Model         | Average   | Center      | 2×2 Grid  | 3×3 Grid  | L-R        | U-D        | O-IC      | O-IG
SAVIR-T [25]  | 94.0/98.1 | 97.8/99.5   | 94.7/98.1 | 83.8/93.8 | 97.8/99.6  | 98.2/99.1  | 97.6/99.5 | 88.0/97.2
SCL [24, 25]  | 91.6/95.0 | 98.1/99.0   | 91.0/96.2 | 82.5/89.5 | 96.8/97.9  | 96.5/97.1  | 96.0/97.6 | 80.1/87.7
MRNet [19]    | 96.6/-    | -/-         | -/-       | -/-       | -/-        | -/-        | -/-       | -/-
RS-TRAN [30]  | 98.4/98.7 | 99.8/100.0  | 99.7/99.3 | 95.4/96.7 | 99.2/100.0 | 99.4/99.7  | 99.9/99.9 | 95.4/95.4
Triple-CFN    | 99.6/99.8 | 100.0/100.0 | 99.7/99.8 | 98.8/99.4 | 99.9/100.0 | 99.9/100.0 | 99.9/99.9 | 99.2/99.2

We subsequently conducted experiments on the PGM dataset with Triple-CFN and the Re-space layer under exactly the same experimental conditions as RS-Tran; the answer-reasoning accuracy is recorded in Table IV, and the accuracy of reasoning about progression patterns is recorded in Table V. Our aim was to demonstrate the superiority of both Triple-CFN and Meta Triple-CFN. It is worth reiterating that Meta Triple-CFN achieves both ex-ante interpretability of the progression patterns and high reasoning accuracy, which is not achievable by RS-Tran or the other previous models in Table IV.

TABLE IV: Test accuracy (%) on PGM.

Model                            | Test Accuracy (%)
SAVIR-T [25]                     | 91.2
SCL [24, 25]                     | 88.9
MRNet [19]                       | 94.5
RS-CNN [30]                      | 82.8
RS-TRAN [30]                     | 97.5
Triple-CFN                       | 97.8
Triple-CFN + Re-space layer      | 98.2
Meta Triple-CFN                  | 98.4
Meta Triple-CFN + Re-space layer | 99.3
TABLE V: Accuracy (%) of progression-pattern reasoning ("shape", "line") and answer selection on PGM.

Model                            | shape | line | answer
Meta Triple-CFN                  | 99.5  | 99.9 | 98.4
Meta Triple-CFN + Re-space layer | 99.7  | 99.9 | 99.3

Integrating the Re-space layer into Triple-CFN or Meta Triple-CFN requires retaining part of the model parameters. Specifically, the parameters of the modules preceding the point where the Re-space layer is inserted must be preserved, while the remaining parameters are randomly initialized. More precisely, the parameters of the Vision Transformer used for image encoding and of the Multi-Layer Perceptron that extracts minimal-reasoning-unit information in (Meta) Triple-CFN are retained, while all other parameters undergo random initialization.

V Conclusion

This paper introduces the novel Triple-CFN approach, tailored specifically for the Bongard-Logo problem. The Triple-CFN’s unique architecture enables it to implicitly reorganize the conceptual space of conflicting Bongard-Logo instances, achieving remarkable performance on this task. Furthermore, the adaptability of the Triple-CFN paradigm is demonstrated through its effective application to the RPM problem, where necessary modifications were made to yield competitive results.

Notably, the well-defined rules, progressive patterns and clear boundaries governing the RPM problem necessitated the development of the Meta Triple-CFN network. This network explicitly structures the problem space for the RPM issue, maintaining interpretability while attaining state-of-the-art performance on the PGM problem.

Overall, this paper contributes to the advancement of machine intelligence by exploring innovative network designs tailored for abstract reasoning tasks. The proposed Triple-CFN and Meta Triple-CFN approaches represent significant steps forward in addressing the challenges posed by the Bongard-Logo and RPM problems, respectively. We believe that our findings will stimulate further research and development in this critical area of artificial intelligence. In essence, Triple-CFN aims to propose a fundamental methodology for tackling abstract reasoning problems, namely the normalization of reasoning information. Both Meta Triple-CFN and the Re-space layer are attempts at normalizing reasoning information, and they have achieved notable improvements in network performance, thereby demonstrating the effectiveness of this approach.

References

  • [1] Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, 246-255 (2009).
  • [2] He, K., Zhang, X., Ren, S., & Sun, J. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 770-778 (2016).
  • [3] Krizhevsky, A., Sutskever, I., & Hinton, G. E. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84-90 (2017).
  • [4] Vaswani, A. et al. Attention is All You Need. In Advances in Neural Information Processing Systems, (2017).
  • [5] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. Bert: Pre-training of Deep Bidirectional Transformers for Language Understanding. Preprint at https://arxiv.org/abs/1810.04805 (2018).
  • [6] Brown, T. et al. Language Models are Few-shot Learners. In Advances in Neural Information Processing Systems, 1877-1901 (2020).
  • [7] Kingma, D. P., & Welling, M. Auto-encoding variational bayes. Preprint at https://arxiv.org/abs/1312.6114 (2014).
  • [8] Goodfellow, I. et al. Generative adversarial networks. Communications of the ACM, 63(11), 139-144 (2020).
  • [9] Ho, J., Jain, A., & Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 33, 6840-6851 (2020).
  • [10] Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., & Parikh, D. VQA: Visual question answering. In IEEE International Conference on Computer Vision, 2425-2433 (2015).
  • [11] Johnson, J., Hariharan, B., Van Der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., & Girshick, R. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. In IEEE Conference on Computer Vision and Pattern Recognition, 2901-2910 (2017).
  • [12] Raven, J. C. Raven's Progressive Matrices. Western Psychological Services (1938).
  • [13] Depeweg, S., Rothkopf, C. A., & Jäkel, F. Solving Bongard Problems with a Visual Language and Pragmatic Reasoning. Preprint at https://arxiv.org/abs/1804.04452 (2018).
  • [14] Nie, W., Yu, Z., Mao, L., Patel, A. B., Zhu, Y., & Anandkumar, A. Bongard-LOGO: A New Benchmark for Human-Level Concept Learning and Reasoning. In Advances in Neural Information Processing Systems, 16468–16480 (2020).
  • [15] Song, R., & Yuan, B. Solving the Bongard-Logo Problem by Modeling a Probabilistic Model. Preprint at https://arxiv.org/abs/2403.03173 (2024).
  • [16] Zhang, C., Gao, F., Jia, B., Zhu, Y., & Zhu, S. C. Raven: A Dataset for Relational and Analogical Visual Reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5317–5327 (2019).
  • [17] Barrett, D., Hill, F., Santoro, A., Morcos, A., & Lillicrap, T. Measuring Abstract Reasoning in Neural Networks. In International Conference on Machine Learning, 511-520 (2018).
  • [18] Hu, S., Ma, Y., Liu, X., Wei, Y., & Bai, S. Stratified Rule-Aware Network for Abstract Visual Reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, 1567-1574 (2021).
  • [19] Benny, Y., Pekar, N., & Wolf, L. Scale-Localized Abstract Reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12557-12565, (2021).
  • [20] Zhang, C., Jia, B., Gao, F., Zhu, Y., Lu, H., & Zhu, S. C. Learning Perceptual Inference by Contrasting. In Proceedings of Advances in Neural Information Processing Systems, (2019).
  • [21] Zheng, K., Zha, Z. J., & Wei, W. Abstract Reasoning with Distracting Features. In Advances in Neural Information Processing Systems, (2019).
  • [22] Zhuo, T., & Kankanhalli, M. Effective Abstract Reasoning with Dual-Contrast Network. In Proceedings of International Conference on Learning Representations, (2020).
  • [23] Zhuo, Tao and Huang, Qiang & Kankanhalli, Mohan. Unsupervised abstract reasoning for raven’s problem matrices. IEEE Transactions on Image Processing, 8332–8341, (2021).
  • [24] Wu, Y., Dong, H., Grosse, R., & Ba, J. The Scattering Compositional Learner: Discovering Objects, Attributes, Relationships in Analogical Reasoning. Preprint at https://arxiv.org/abs/2007.04212 (2020).
  • [25] Sahu, P., Basioti, K., & Pavlovic, V. SAViR-T: Spatially Attentive Visual Reasoning with Transformers. Preprint at https://arxiv.org/abs/2206.09265 (2022).
  • [26] Wei, Q. et al. Raven Solver: From Perception to Reasoning. Information Sciences, 634, 716-729 (2023).
  • [27] Zhang, C., Jia, B., Zhu, S. C., & Zhu, Y. Abstract Spatial-Temporal Reasoning via Probabilistic Abduction and Execution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9736-9746 (2021).
  • [28] Zhang, C., Xie, S., Jia, B., Wu, Y. N., Zhu, S. C., & Zhu, Y. Learning Algebraic Representation for Systematic Generalization. In Proceedings of the European Conference on Computer Vision, (2022).
  • [29] Hersche, M., Zeqiri, M., Benini, L., Sebastian, A., & Rahimi, A. A Neuro-vector-symbolic Architecture for Solving Raven’s Progressive Matrices. Preprint at https://arxiv.org/abs/2203.04571 (2022).
  • [30] Wei, Q., Chen, D., & Yuan, B. Multi-viewpoint and Multi-evaluation with Felicitous Inductive Bias Boost Machine Abstract Reasoning Ability. Preprint at https://arxiv.org/abs/2210.14914 (2022).
  • [31] Shi, F., Li, B., & Xue, X. Abstracting Concept-Changing Rules for Solving Raven's Progressive Matrix Problems. Preprint at https://arxiv.org/abs/2307.07734 (2023).
  • [32] Kharagorgiev, S. Solving Bongard Problems with Deep Learning. k10v.github.io (2020).
  • [33] Dosovitskiy, A. et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Preprint at https://arxiv.org/abs/2010.11929 (2020).
  • [34] Bardes, A., Ponce, J., & LeCun, Y. VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. Preprint at https://arxiv.org/abs/2105.04906 (2021).
  • [35] Dempster, A. P., Laird, N. M., & Rubin, D. B. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 1-22 (1977).
  • [36] Oord, A. V. D., Li, Y., & Vinyals, O. Representation Learning with Contrastive Predictive Coding. Preprint at https://arxiv.org/abs/1807.03748 (2019).
  • [37] Carpenter, P. A., Just, M. A., & Shell, P. What One Intelligence Test Measures: a Theoretical Account of the Processing in the Raven Progressive Matrices Test. Psychological review, 97(3), 404, (1990).
  • [38] Paszke, A. et al. Automatic Differentiation in PyTorch. In NIPS Autodiff Workshop, (2017).
  • [39] Kingma, D. P., & Ba, J. Adam: A Method for Stochastic Optimization. Preprint at https://arxiv.org/abs/1412.6980, (2014).