Triple-CFN: Restructuring Conceptual Spaces for Enhancing Abstract Reasoning Process (2024)

Ruizhuo Song, Member, IEEE, Beiming Yuan, Frank L. Lewis, Fellow, IEEE. This work was supported by the National Natural Science Foundation of China under Grant 62273036. Corresponding author: Ruizhuo Song (ruizhuosong@ustb.edu.cn). Ruizhuo Song and Beiming Yuan are with the Beijing Engineering Research Center of Industrial Spectrum Imaging, School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing 100083, China (emails: ruizhuosong@ustb.edu.cn and d202310354@xs.ustb.edu.cn). F. L. Lewis is with the UTA Research Institute, The University of Texas at Arlington, Arlington, TX 76019 USA (e-mail: lewis@uta.edu). Ruizhuo Song and Beiming Yuan contributed equally to this work.

Abstract

Abstract reasoning poses significant challenges to artificial intelligence algorithms, demanding a higher level of cognitive ability than that required for perceptual tasks. In this study, we introduce the Triple-CFN method to tackle the Bongard Logo problem, achieving remarkable reasoning accuracy by implicitly reorganizing the conflicting concept spaces of instances. Furthermore, with necessary modifications, the Triple-CFN paradigm has also proven effective on the RPM (Raven’s Progressive Matrices) problem, yielding competitive results. To further enhance Triple-CFN’s performance on the RPM problem, we have upgraded it to the Meta Triple-CFN network, which explicitly constructs the concept space of RPM problems, ensuring high reasoning accuracy while achieving conceptual interpretability. The success of Meta Triple-CFN can be attributed to its paradigm of modeling the concept space, which is tantamount to normalizing reasoning information. Based on this idea, we have introduced the Re-space layer, boosting the performance of both Meta Triple-CFN and Triple-CFN. This paper aims to contribute to the advancement of machine intelligence and pave the way for further breakthroughs in this field by exploring innovative network designs for solving abstract reasoning problems.

Index Terms:

Abstract reasoning, RPM problem, Bongard-logo problem.


I Introduction

Deep neural networks have achieved remarkable success in various domains, including computer vision[1, 3, 2], natural language processing[4, 5, 6], generative models[8, 7, 9], visual question answering[10, 11], and abstract reasoning[12, 13, 14]. The advancement of deep learning in the realm of graphical abstract reasoning is a particularly intriguing and complex research area.

Initially, deep learning was introduced into machine learning, bringing it closer to its original goal of artificial intelligence. It is regarded as learning the inherent patterns and hierarchical representations within sample data, greatly aiding in the interpretation of data types such as text, images, and sound. The ultimate objective is to endow machines with human-like analytical learning capabilities, enabling them to recognize and interpret text, images, and sound.

In the domain of graphical abstract reasoning, the significance of deep learning lies primarily in its ability to tackle complex pattern recognition challenges. Through deep learning, machines can mimic human activities like perception, audition, and cognition, leading to significant strides in artificial intelligence-related technologies.

Moreover, deep learning has yielded numerous achievements in areas like search technologies, data mining, machine learning, machine translation, natural language processing, multimedia learning, speech recognition, recommendations, and personalization. Notably, in speech and image recognition, deep learning has demonstrated remarkable efficacy, with recognition accuracies surpassing preceding technologies.

However, the applications of deep learning extend beyond these. For instance, utilizing the outcomes from upper-level training as initialization parameters for lower-level training processes enhances the efficiency of deep model training. Meanwhile, adopting a layer-wise initialization approach and employing unsupervised learning for training is a pivotal strategy in deep learning.

Collectively, the progression of deep learning in graphical abstract reasoning is an ongoing research sphere that offers substantial support to the development of artificial intelligence. Nevertheless, despite the extensive and profound applications of deep learning, numerous unresolved issues and challenges demand further investigation and exploration.

Notably, following the remarkable accomplishments of deep learning in intelligent visual tasks, machine intelligence is poised to reach even greater heights. The academic community has presented a challenge to deep learning’s abstract reasoning capabilities using graphical reasoning problems. Initially, graphical reasoning entails comprehending and analyzing both global and local characteristics of graphics, posing a significant challenge for deep learning models. Typically, deep learning models extract features by learning from extensive datasets. However, in graphical reasoning problems, the complexity and variability of graphics make it arduous for models to learn effective feature representations.

Secondly, graphical reasoning problems require models to possess reasoning and induction capabilities. This necessitates models to comprehend graphic structures, relationships, and rules and perform reasoning and induction based on this information. However, existing deep learning models often exhibit subpar performance when tackling such problems due to their limited reasoning and induction abilities.

In addition, graphical reasoning problems mandate models to have generalization capabilities. This means models must be adept at handling graphics of various shapes, sizes, and colors while delivering accurate reasoning outcomes. Nevertheless, due to the limited generalization capabilities of deep learning models, they often encounter overfitting or underfitting issues when dealing with such problems.

Lastly, datasets for graphical reasoning problems are typically small-scale, posing challenges for the training of deep learning models. These models require vast amounts of data for training to achieve optimal performance. However, in the context of graphical reasoning problems, the limited dataset size makes it challenging for models to acquire sufficient information for problem-solving. Furthermore, datasets for these problems are often artificially designed, potentially leading to discrepancies between data distributions and real-world scenarios, further complicating model training.

Thus, addressing the challenges posed by graphical reasoning problems to deep learning constitutes a pivotal research direction. This necessitates the design of more effective deep learning models, enhancements in model training methodologies, and optimizations in dataset quality among other aspects.

For instance, Raven's Progressive Matrices (RPM) problems[12] and Bongard problems[13, 14] present learning demands ranging from perception to reasoning. Addressing these demands necessitates advancements in deep learning capabilities to handle abstract reasoning tasks associated with graphical representations effectively.

I-A RAVEN Database as an RPM Problem: Construction and Characteristics

The RAVEN database[16] presents a unique challenge in the realm of Raven's Progressive Matrices (RPM) problems, with each question typically comprising 16 images enriched with geometric entities. Half of these images, specifically 8, form the problem stem while the remaining 8 constitute the answer pool. Subjects are tasked with selecting appropriate images from the answer pool to complete a 3×3 matrix, following a progressive pattern of geometric images along the rows to convey specific abstract concepts.

As illustrated in Figure 1, the construction of a RAVEN problem speaks to its generality and sophistication. Within these problems, certain human-defined concepts within the geometric images, such as "shape" or "color", are deliberately abstracted into bounded, countable, and precise "visual attributes". The notion of "rule" is then employed to delineate the progressive transformation of a finite set of these visual attribute values. However, it is worth noting that some visual attributes remain unconstrained by the rules, potentially acting as distractions for deep model reasoning.

[Figure 1]

To curate a comprehensive RAVEN problem, samples of rules are drawn from a predefined rule pool, guiding the design of visual attribute values. Attributes not bound by these rules are assigned values at random. Subsequently, images are rendered based on the generated attribute information.

The RAVEN database is further diversified into multiple sub-databases, namely: single-rule groups—center single (center), distribute four (G2×2), and distribute nine (G3×3)—and dual-rule groups: in center single out center single (O-IC), up center single down center single (U-D), left center single right center single (L-R), and in distribute four out center single (O-IG). In problems with a singular rule, the progressive transformation of an entity's attributes within the image adheres to one set of rules, while in those with dual rules, two independent rule sets govern this transformation.

I-B PGM Database

The design logic of PGM[17] and RAVEN problems is remarkably similar, with both types of problems represented by a problem stem composed of 8 images and an answer pool formed by another 8 images. Notably, in PGM problems, the concept of "rule" not only describes the progressive pattern of "visual attributes" in the row-wise direction within the matrix but also constrains the progressive pattern in the column-wise direction. An example of a PGM problem is illustrated in Figure 2.

[Figure 2]

Consequently, the difficulty of RPM problems lies not only in the exploration of visual attributes at different levels but also in the induction and learning of the progressive patterns of ”visual attributes.”

I-C Bongard-logo Database

Bongard problems[13] differ significantly from RPM problems in that they are a type of small-sample learning problem. Typically composed of multiple images, these problems divide the images into two groups: a primary group and a secondary group. All images within the primary group express abstract concepts constrained by certain rules, while the images in the secondary group reject these rules to varying degrees. Bongard problems challenge deep learning algorithms to correctly categorize ungrouped images into the appropriate group. Bongard-logo, an instantiation of Bongard problems within the realm of abstract reasoning, poses considerable reasoning difficulties. Each Bongard-logo[14] problem consists of 14 images, with 6 images in the primary group, 6 in the secondary group, and the remaining 2 serving as options for grouping. The images contain numerous geometric shapes, and their arrangements serve as the basis for grouping. Figure 3 illustrates an example Bongard-logo problem. In Figure 3, each Bongard problem is composed of two sets of images: the primary group A and the secondary group B. The primary group A contains 6 images, with the geometric entities within each image following a specific set of rules, while the secondary group B includes 6 images that reject the rules in group A. The task is to determine whether the images in the test set satisfy the rules expressed by group A. The difficulty level varies depending on the problem's structure.

[Figure 3]

Bongard-logo problems are categorized into three types based on conceptual categories: 1) Free Form problems (ff), where each shape is composed of randomly sampled action strokes, with each image potentially containing one or two shapes. 2) Basic Shape problems (ba), where the concept corresponds to identifying one shape category or a combination of two shape categories represented in the given shape patterns. 3) High-level Abstraction problems (hd), designed to test a model’s ability to discover and reason about abstract concepts, such as concavity and convexity, symmetry, among others.

II Related work

II-A RPM solver

In image reasoning problems, discriminative models typically produce outputs in the form of a multi-dimensional vector, with each dimension representing the probability of selecting a certain graphic from given candidate answers as the final solution. This output format provides rich information for subsequent decision-making and analysis. However, traditional discriminative models often face numerous challenges when dealing with complex image reasoning tasks, such as capturing subtle differences and uncovering underlying rules. To address these issues, researchers have proposed a series of innovative models.

Among them, the CoPINet[20] model stands out with its innovative introduction of a contrast module. The primary function of this contrast module is to learn the differences between input graphics, enabling the model, through contrastive learning, to more sensitively capture subtle variations in graphics and thus more accurately determine their attributes during the reasoning process. Additionally, CoPINet incorporates a reasoning module tasked with summarizing potential fundamental rules. By combining contrastive learning with reasoning learning, the CoPINet model has achieved remarkable results in image reasoning problems.

Distinct from CoPINet, the LEN+teacher model[21] relies on a student-teacher architecture to determine the training sequence and make predictions. This architecture facilitates more effective knowledge transfer and model optimization by introducing a teacher model to guide the training of the student model. Specifically, the teacher model leverages its own experience to direct the learning process of the student model, helping it converge more rapidly to better solutions. Through this approach, the LEN+teacher model has yielded impressive outcomes in image reasoning problems.

The DCNet model[22] is notable for its use of a dual-contrast module to accomplish two tasks: comparing rule rows and columns and exploring differences among candidate answers. This dual-contrast mechanism enables DCNet to more comprehensively consider various factors in image reasoning problems, thereby enhancing accuracy and efficiency during the reasoning process.

The NCD model[23] operates in an unsupervised environment and employs methods of introducing pseudo-targets and decentralization. These techniques not only effectively address certain challenges in unsupervised learning but also enhance the model’s generalization capabilities. Specifically, NCD augments the model’s exploration capabilities by introducing pseudo-targets and leverages decentralization methods to reduce the model’s reliance on specific data, thereby bolstering robustness and adaptability.

In the SCL model[24], multiple monitoring mechanisms are applied to sub-graphs within reasoning problems, with the expectation that each branch will focus on specific visual attributes or rules. This multiple monitoring mechanism enhances the model’s flexibility and efficiency when tackling complex image reasoning tasks. Concurrently, SCL leverages relationships between sub-graphs to further strengthen the model’s reasoning capabilities, leading to significant advancements in solving image reasoning problems.

The SAVIR-T model[25] extracts information from within sub-graphs of reasoning problems and relationships between sub-graphs from multiple perspectives, aiming to elevate reasoning effectiveness. This approach enables the efficient capture of diverse information within and between sub-graphs, providing a more comprehensive and accurate foundation for subsequent reasoning processes. Furthermore, SAVIR-T utilizes multi-perspective information fusion methods to further augment the model’s reasoning capabilities, ensuring greater efficiency and accuracy when dealing with intricate image reasoning problems.

RS-Tran[30] adopts a multi-viewpoint and multi-evaluation reasoning approach, which effectively solves the RPM problem and achieves impressive prediction accuracy. Furthermore, by utilizing the accompanying Meta data from RPM tasks for the pre-training of its encoder, RS-Tran has once again made a breakthrough in terms of performance. This pre-training with Meta data enhances the model's ability to capture underlying patterns and relationships within the RPM problems, enabling it to make more accurate predictions and reason more effectively.

CRAB[31] has established a “greenhouse” tailored to its own methodology, which takes the form of a brand-new RAVEN database. This greenhouse, while sacrificing the core challenges inherent in RAVEN—namely, the diversity and uncertainty of answers—has nevertheless enabled CRAB to achieve remarkable outcomes. Within the confines of this meticulously crafted “greenhouse”, CRAB’s Bayesian methodology has demonstrated remarkable proficiency and efficacy. The controlled setting, tailored to optimize the probabilistic framework, has allowed for a profound exploration and exploitation of the inherent strengths of the Bayesian paradigm, thereby facilitating significant advancements in the field. The scientific community eagerly awaits the implications of this innovative approach for future research.

Additionally, research indicates that relatively decoupled perceptual visual features can contribute to improved reasoning performance[26]. These perceptual visual features not only capture fundamental elements and attributes within images but also effectively express relationships and structures among them. By introducing such perceptual visual features into image reasoning problems, significant enhancements can be achieved in both the model’s reasoning performance and efficiency[26].

Symbolic approaches have brought about higher reasoning precision and enhanced model interpretability[27, 28, 29]. These methods bolster the reasoning capabilities and interpretability of models by incorporating symbolic representations and operations. Specifically, symbolic approaches endow models with increased flexibility and efficiency when addressing intricate image reasoning tasks while also enhancing model transparency and interpretability, facilitating a deeper understanding and analysis of the model's decision-making processes.

II-B Bongard-logo solver

In recent years, researchers have been exploring various potential solutions to address the highly challenging Bongard problems, leading to the emergence of three dominant strategies: language-based feature model approaches, methods relying on convolutional neural network models, and techniques involving generated datasets.

Firstly, language-based feature model methods[13], exemplified by the work of Depweg and others, aim to decipher visual characteristics within image information through a formalized linguistic system. They have devised a formal language capable of symbolizing visual elements within images, utilizing logical operators to extract these visual features and transform them into a symbolic visual vocabulary. Subsequently, they employ symbolic language and Bayesian reasoning to tackle BP problems. However, this approach is severely constrained by its symbolic representation, making it difficult to handle BP issues involving intricate abstract concepts. Specifically, the method can only manage basic shape-based BP problems and is unable to represent or process more sophisticated abstract concept types. Additionally, whenever confronted with a novel BP problem, the need to reconstruct an appropriate symbolic system adds complexity and limitation to the method. After filtering out BP problems that cannot be expressed using this visual language, only 39 of the original 100 BP problems remain, with 35 of them being resolvable.

Secondly, convolutional neural network model-based methods[32], as exemplified by Kharagorgiev and Yun, favor the use of deep learning techniques for automated feature extraction from images. Kharagorgiev constructed an image dataset containing simple shapes and utilized a pre-training process to develop a feature extractor. This feature extractor is then employed to extract image features from Bongard problems, facilitating image classification to determine if test images conform to specified rules. Yun adopted a similar approach but placed greater emphasis on utilizing images containing visual characteristics from BP problems for pre-training to extract BP image features, subsequently linking additional classifiers for discrimination. While these methods can automatically extract and learn features from images, their performance is heavily reliant on the quality and quantity of training data.

Thirdly, among the strategies employed is the generation of datasets[14]. In 2020, Nie et al. applied basic CNNs, relational networks like WReN-Bongard, and Meta-learning techniques to the Bongard-Logo database. They endeavored to enhance model generalization by generating substantial volumes of synthetic data. However, their experimental results indicate that the models did not achieve the desired level of performance, potentially due to significant disparities between the generated data and the distribution of real-world problems.

Notably, the PMoC model[15] has emerged as a notable approach, particularly in addressing the challenges posed by the Bongard-Logo problem. This tailored probability model achieves high reasoning accuracy by constructing independent probability models, demonstrating its effectiveness in discerning deeper patterns and inductive reasoning beyond explicit image features. The strength of PMoC lies in its ability to capture the underlying probabilistic relationships within the problem space, enabling more accurate reasoning and pattern recognition. By leveraging the power of probability modeling, PMoC paves the way for more robust and accurate solutions in abstract reasoning tasks.

In conclusion, it is evident that each approach offers distinct advantages and limitations. Language-based feature model methods provide a fresh perspective for comprehending and deciphering BP problems but have limited capabilities in handling complex abstract concepts. Methods based on convolutional neural network models can automatically learn and extract features from images but are constrained by the quality and quantity of training data. While techniques involving generated datasets hold potential for enhancing model generalization, their effectiveness is contingent on the alignment between generated data and real-world problem scenarios. This underscores the need for a more comprehensive and integrated strategy in addressing Bongard problems.

II-C Transformer and Vision Transformer

The Transformer model[4] diverges from conventional RNN and CNN designs, utilizing a fully attentional mechanism for capturing long-range input sequence dependencies. Its core comprises self-attention and feed-forward neural networks, integrated via residual connections and layer normalization to form its encoders and decoders. The self-attention mechanism, analogous to social network influence diffusion, assigns weights based on input sequence position similarities, fostering flexible non-sequential processing. Additionally, the Transformer incorporates encoder-decoder attention, akin to translation dictionary consultation, where the decoder references the encoder’s output to enhance output sequence accuracy.

The Vision Transformer (ViT)[33] is an innovative approach to computer vision tasks that eschews traditional convolutional neural networks in favor of a pure transformer-based architecture. By dividing images into fixed-size patches and treating them as sequences of tokens, ViT leverages the power of self-attention mechanisms to capture long-range dependencies within the image effectively. This shift towards transformers enables ViT to achieve state-of-the-art performance on various vision benchmarks, heralding a new era in computer vision research.
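To make the tokenization step concrete, the following is a minimal PyTorch sketch of ViT-style patch embedding, i.e., the operation by which an image is split into fixed-size patches and turned into a token sequence. The patch size, channel count, and embedding width are illustrative assumptions rather than settings used in this paper.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """ViT-style tokenization: cut an image into fixed-size patches and
    project each patch to an embedding vector (illustrative sizes only)."""

    def __init__(self, patch_size: int = 16, in_ch: int = 1, d: int = 256):
        super().__init__()
        # A strided convolution is the standard way to cut and embed patches in one step.
        self.proj = nn.Conv2d(in_ch, d, kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, in_ch, H, W) with H and W divisible by patch_size
        tokens = self.proj(x)                      # (B, d, H/patch, W/patch)
        return tokens.flatten(2).transpose(1, 2)   # (B, num_patches, d) token sequence
```

A standard Transformer encoder with self-attention is then applied to this token sequence, typically together with positional embeddings, to produce image features.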

II-D Covariance matrix and correlation loss

The covariance matrix stands as a pivotal tool in multivariate statistical analysis, quantifying the relationships between multiple random variables[34]. In the realms of data science and machine learning, the covariance matrix plays a crucial role, facilitating profound insights into data structures and patterns. This matrix not only encapsulates the variances of individual variables but also the covariances between them, offering a comprehensive view of the interdependencies within a dataset. Its applications span from exploratory data analysis and dimensionality reduction to portfolio optimization and principal component analysis, underscoring its widespread significance in diverse domains of modern data analysis.

The covariance matrix serves as a metric to gauge the linear correlation between any two distributions within a set[34]. By treating each dimension of an image representation as an individual distribution and a collection of such representations as a sample from a group of distributions, one can leverage a batch of samples to assess the linear correlation among the dimensions of the image representation. This approach enables a nuanced understanding of the interdependencies between various features within the image data, fostering insights that can inform downstream tasks in image analysis and processing. We calculate the covariance matrix of a multivariate distribution using Formula (1), and then compute the correlation loss of the multivariate distribution using Formula (2).

$$M_{\sigma}(x)=\frac{1}{N-1}\sum_{i=1}^{N}(\mathbf{x}_{i}-\bar{\mathbf{x}})(\mathbf{x}_{i}-\bar{\mathbf{x}})^{\top}\qquad(1)$$

$$L(x)=\frac{1}{d}\sum\left(M_{\sigma}(x)^{2}\cdot(1-I)\right)\qquad(2)$$

Where $I$ denotes the identity matrix and $M_{\sigma}(x)\in R^{d\times d}$. Here $d$ represents the dimensionality of the vectors $\mathbf{x}_{i}$, and $N$ refers to the number of samples involved in the computation, given that the covariance matrix is calculated over a batch of samples.
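As a concrete reference, the following is a minimal PyTorch sketch of Formulas (1) and (2): it computes the sample covariance matrix of a batch of d-dimensional representations and averages the squared off-diagonal entries. The function name and tensor layout are illustrative assumptions, not code from the paper.

```python
import torch

def correlation_loss(x: torch.Tensor) -> torch.Tensor:
    """Correlation loss of Formulas (1)-(2) over a batch of representations.

    x: tensor of shape (N, d), treated as N samples from a d-dimensional distribution.
    """
    n, d = x.shape
    x_centered = x - x.mean(dim=0, keepdim=True)
    # Formula (1): sample covariance matrix, shape (d, d)
    cov = x_centered.t() @ x_centered / (n - 1)
    # Formula (2): average the squared off-diagonal entries (the diagonal is masked out)
    off_diag = cov.pow(2) * (1.0 - torch.eye(d, device=x.device))
    return off_diag.sum() / d
```

Driving this quantity toward zero decorrelates the dimensions of the representation while leaving their individual variances unconstrained.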

II-E The Expectation-Maximization (EM) algorithm

The Expectation-Maximization (EM) algorithm[35] represents a powerful iterative method widely employed in statistics for finding maximum likelihood estimates of parameters in probabilistic models, especially when the data contain missing values or are observed in an incomplete manner. By alternating between an expectation “E” step and a maximization “M” step, the algorithm optimizes the likelihood function, gradually refining parameter estimates until convergence. Its versatility and robustness have made the EM algorithm a cornerstone technique in diverse fields such as machine learning, bioinformatics, and image processing, where complex models and data structures often demand sophisticated estimation methodologies.

Specifically, we employ a function, denoted as $P(X,Z|\,\theta)$, to model the joint distribution of the observed data and their corresponding latent variables, where both $Z$ and $\theta$ are unknown. The process initiates with an arbitrarily assigned initial $\theta$, which is then used to compute the posterior distribution of the latent variables, $P(Z|\,X,\theta)$. Given this posterior distribution, we evaluate the expectation of the complete-data log-likelihood of $X$ and $Z$. Subsequently, $\theta$ is recalculated in a manner that maximizes this expectation. This iterative process of alternating between computing the posterior and re-estimating $\theta$ continues until the likelihood converges to its maximum, yielding an optimal estimation of the parameters and latent variables within the data.
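For illustration only, the following is a minimal NumPy sketch of the alternating E and M steps for a two-component one-dimensional Gaussian mixture. It is a textbook instance of the EM recipe described above, not part of this paper's method; the initialization and iteration count are arbitrary assumptions.

```python
import numpy as np

def em_gmm_1d(x: np.ndarray, n_iter: int = 50):
    """Minimal EM for a two-component 1-D Gaussian mixture (illustration only)."""
    # Arbitrary initial parameters theta = (weights, means, variances)
    w = np.array([0.5, 0.5])
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()]) + 1e-6
    for _ in range(n_iter):
        # E-step: responsibilities, i.e. the posterior P(Z | X, theta)
        lik = np.stack(
            [w[k] * np.exp(-(x - mu[k]) ** 2 / (2 * var[k])) / np.sqrt(2 * np.pi * var[k])
             for k in range(2)], axis=1)                     # shape (N, 2)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-estimate theta to maximize the expected complete-data log-likelihood
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return w, mu, var
```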

III Methodology

In this section, we propose four methods for the Bongard-Logo[13] and RPM problems[12], namely CFN, Triple-CFN, Meta Triple-CFN, and the Re-space layer. Each incorporates new loss function terms or network structures compared to its predecessor, aiming for progressive improvement on the Bongard-Logo and RPM problems.

The Bongard-Logo problem and the Raven’s Progressive Matrices (RPM) problem are distinct yet equally challenging tests of abstract reasoning. Both tasks require participants to identify and interpret underlying principles or concepts that are not immediately apparent from the surface-level features of the presented materials. These principles represent a more sophisticated level of abstraction than mere pixel configuration patterns or other low-level visual properties. Instead, they often reflect human-centered preconceptions about shape, size, color, spatial relationships, the concave or convex nature of objects, and the completeness of figures. Through their respective problem formulations, the Bongard-Logo and RPM tasks seek to evaluate an individual’s ability to discern and comprehend these subtler, more abstract patterns and principles.

III-A A baseline for Bongard-Logo

Based on higher-dimensional human concepts and preferences, the creators of Bongard-logo problems have categorized the Bongard-logo dataset into four distinct problem types: FF, BA, NV, CM. Consequently, we can abstract the distribution of the primary group (positive instances) within a Bongard-logo problem as $p_{i}(x|\,y)$ and the distribution of the auxiliary group (negative instances) as $q_{i}(x|\,y)$. Here, $y$ denotes the problem's reasoning type, where $y\in\{\text{FF, BA, NV, CM}\}$, while $i$ represents the problem's identifier, with $i\in[1,n]$ and $n$ signifying the total number of problems. For the purpose of conveniently representing data in the Bongard-Logo problem, we denote the Bongard-Logo images as $x_{ij}$. Specifically, $\{x_{ij}|\,j\in[1,6]\}$ represents images in the $i$-th primary group, while $\{x_{ij}|\,j\in[8,13]\}$ represents images in the $i$-th auxiliary group. Additionally, $x_{i7}$ represents the test image to be potentially assigned to the $i$-th primary group, and $x_{i14}$ represents the test image to be potentially assigned to the $i$-th auxiliary group.

To effectively tackle Bongard-Logo problems, we are developing a deep learning algorithm, $f_{\theta}(z|\,x)$, primarily tasked with transforming input samples $x_{ij}$ into latent variables $z_{ij}$. Ideally, the distributional divergence between the latent variable distribution of the primary group, $p_{i}^{\prime}(z|\,y)$, and that of the auxiliary group, $q_{i}^{\prime}(z|\,y)$, should be maximal. However, given the nature of Bongard-logo as a small-sample learning problem, accurately estimating and constraining these two latent variable distributions poses significant challenges. Consequently, directly optimizing the distributional divergence between them may encounter substantial difficulties, thereby making it arduous to train a deep model that exhibits exceptional performance.

In this manuscript, we leverage the InfoNCE loss function[36] as a reasoning loss term for the purpose of training a standard ResNet18 network. The resulting model, denoted as $f_{\theta}(z|\,x)$, possesses the proficiency to tackle either individual or concurrent high-dimensional concept intricacies inherent within the Bongard-logo dataset. Mathematically, the InfoNCE loss function can be formalized as follows:

$$\ell_{\mathbf{InfoNCE}}\left(z_{pos},\tilde{z}_{pos},\{z_{neg_{m}}\}_{m=1}^{M}\right)=-\log\frac{e^{(z_{pos}\cdot\tilde{z}_{pos})/t}}{e^{(z_{pos}\cdot\tilde{z}_{pos})/t}+\sum_{m=1}^{M}e^{(z_{pos}\cdot z_{neg_{m}})/t}}\qquad(3)$$

In the context of this study, $z_{pos}$ and $\tilde{z}_{pos}$ denote the representations of distinct samples belonging to the primary group, whereas $\{z_{neg_{m}}\}$ encompasses the representations of all $M$ samples within the auxiliary group. Here, $M$ signifies the number of auxiliary-group samples within a single Bongard-logo case, and the temperature coefficient $t$ is set to $10^{-3}$. InfoNCE imposes the following constraint on the network: the cosine similarity between vector $z_{pos}$ and vector $\tilde{z}_{pos}$ should be higher than the cosine similarity between vector $z_{pos}$ and the vectors in the set $\{z_{neg_{m}}\}$. This constraint aligns with the underlying logic of Bongard-logo. The feedforward process of the network is illustrated in Figure 4. By utilizing the InfoNCE function, we can avoid estimating the distributions $p_{i}^{\prime}(z|\,y)$ and $q_{i}^{\prime}(z|\,y)$ while simultaneously encouraging the representations of the primary group encoded by the network $f_{\theta}(z|\,x)$ to be more similar and ensuring greater mutual exclusion between the representations of the primary and auxiliary groups. In Figure 4, "$A_{7}^{2}$" represents the number of distinct ordered ways to select 2 elements from a set of 7 elements.

[Figure 4]
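The following is a minimal PyTorch sketch of the InfoNCE reasoning loss in Formula (3). Representations are L2-normalized so that dot products act as cosine similarities, as described above; the function name and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(z_pos: torch.Tensor,
             z_pos_tilde: torch.Tensor,
             z_negs: torch.Tensor,
             t: float = 1e-3) -> torch.Tensor:
    """InfoNCE reasoning loss of Formula (3).

    z_pos, z_pos_tilde: representations of two primary-group images, shape (d,).
    z_negs: representations of the M auxiliary-group images, shape (M, d).
    """
    z_pos = F.normalize(z_pos, dim=-1)
    z_pos_tilde = F.normalize(z_pos_tilde, dim=-1)
    z_negs = F.normalize(z_negs, dim=-1)
    pos_logit = (z_pos * z_pos_tilde).sum() / t           # similarity to the positive
    neg_logits = (z_negs @ z_pos) / t                     # similarities to the negatives, (M,)
    logits = torch.cat([pos_logit.view(1), neg_logits])   # positive placed at index 0
    target = torch.zeros(1, dtype=torch.long, device=logits.device)
    # cross-entropy with target 0 equals -log(exp(pos) / (exp(pos) + sum exp(neg)))
    return F.cross_entropy(logits.unsqueeze(0), target)
```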

Utilizing the aforementioned methodologies, this study conducted rigorous experiments on the four concept databases of Bongard-Logo, both individually and in combination. The comprehensive experimental results are presented in detail in Table I. In this table, the “Bongard-Logo” entry encompasses the consolidated findings obtained by training the model using a combined dataset of all four concepts. Conversely, the “Separated Bongard-Logo” entry delineates the specific outcomes derived from training each concept independently.

The results unequivocally demonstrate that the convolutional deep model, ResNet18, has struggled to distinguish between the four distinct concepts. Convolutional networks are designed to encode image representations primarily based on the configurations of image pixels. This study hypothesizes that the Bongard-Logo cases exhibit distinct classification patterns based on varying concepts. Furthermore, it is speculated that since no additional concept labels were introduced as supervisory signals during training, the network solely relies on the configuration of image pixels to process Bongard-Logo cases, which may lead to internal confusion within the network and ambiguity concerning the concept. This ambiguity underscores the need for further investigation into the design of more robust and conceptually discriminative deep learning models for tackling such complex tasks.

TABLE I: Test Accuracy (%)

Data Set                   FF     BA     CM     NV
Bongard-logo               88.1   97.9   76.0   75.8
Separated Bongard-logo     97.9   99.0   75.0   72.8

III-B Cross-Feature Net (CFN)

Our aspiration is for the network to deduce concepts solely from the pixel configuration patterns inherent in images. Concepts constitute the fundamental basis for the classification of Bongard-Logo problems, making their accurate deduction crucial. Modeling abstract, high-dimensional human concepts often yields intricate and interconnected distributions, posing significant challenges. Given the difficulty in estimating and leveraging conditional distributions such as $p_{i}^{\prime}(z|\,y)$ and $q_{i}^{\prime}(z|\,y)$, along with the complexity of modeling high-dimensional human concepts, this paper attempts to implicitly reconstruct the concepts underlying the Bongard-Logo dataset. We reformulate the problem as $p_{i}(x|\,y^{\prime})$ and $q_{i}(x|\,y^{\prime})$, where $y^{\prime}$ belongs to the set $\{Y_{\alpha}\}$; the range of $\alpha$ remains unknown, and no additional constraints were imposed on it in this study. In this context, each $Y_{\alpha}$ represents a reconstructed concept, potentially distinct from the original concepts. It is our hope that by reconstructing these concepts, or more precisely, by redefining them, the network will be able to more easily recognize the appropriate concept for categorizing Bongard-Logo instances. This recognition will be based primarily on image styles and pixel configuration patterns. Through this reorganization process, we aim to alleviate the confusion that often arises between high-dimensional Bongard-Logo concepts within deep learning models.

As for the implementation on the network, we introduce a sophisticated deep learning algorithm, designated as $g(q,k|\,x)$, aimed at refining the representation of samples within the primary group of a Bongard-Logo problem while distinguishing them from those in the auxiliary group. The algorithm $g(q,k|\,x)$, called Cross-Feature Net, is tailored to extract concepts exclusively based on image styles, thereby laying the foundation for Bongard-Logo solution and categorization.

To elaborate, each image in a Bongard-Logo problem is encoded with a concept vector $q$ ($q\in R^{d}$) and an associated feature vector set $\{k_{\beta}|\,\beta\in[1,m],\ k_{\beta}\in R^{d}\}$. Subsequently, cross-attention mechanisms are employed to analyze the interactions between the conceptual vector $q$ and the style vector set $\{k_{\beta}\}$, with vector $q$ serving as the query and $\{k_{\beta}\}$ as the corresponding key-value pairs. We claim that reinterpreting the underlying concept conveyed by the image through a deep learning model and adjusting the weighting of the image's feature vectors based on this reinterpretation can effectively tackle such challenging problems. This innovative approach obviates the need for an extensive and potentially cumbersome search for optimal concept partitioning strategies.

In detail, we employ ResNet18 as $g_{\omega}(q|\,x)$ to learn the concept vector $q$ for Bongard-Logo images, and ResNet50 as $g_{\theta}(k|\,x)$ to learn the feature vector set $\{k_{\beta}\}$ for the images. The cross-attention mechanism is computed between $q$ and $\{k_{\beta}\}$, yielding the final feature representation $z_{ij}$ of the image $x_{ij}$, which constitutes the core computational process of the Cross-Feature Net. We utilize the new backbone $g(q,k|\,x)$ to replace the ResNet18 used in the aforementioned baseline, with the expectation that the new backbone $g(q,k|\,x)$ will be more suitable for addressing the Bongard-logo problem than ResNet18. We persist in utilizing the InfoNCE loss function for training the network $g(q,k|\,x)$. Additionally, drawing inspiration from the EM clustering algorithm, we alternately train the parameters of $g_{\omega}(q|\,x)$ and $g_{\theta}(k|\,x)$ within the Cross-Feature Net. Since the cross-attention mechanism can be interpreted as a weighted sum of the feature vector set $\{k_{\beta}\}$ based on its similarity to the concept vector $q$, optimizing $g_{\omega}(q|\,x)$ is treated as a process of recalculating distribution centroids, while optimizing $g_{\theta}(k|\,x)$ is considered as maximizing the expected distribution.

In summary, our goal is to utilize a network, denoted as $g_{\omega}(q|\,x)$, for extracting the concepts essential to solving the problem. This extraction process relies exclusively on the pixel configuration patterns within the problem images. Furthermore, we introduce another network, $g_{\theta}(k|\,x)$, designed to craft image representations that align closely with these extracted concepts. To obtain the final representation of the Bongard-Logo images, we compute the attention result between the concepts and these representations. This attention mechanism helps us to focus on the most relevant features. Subsequently, we constrain the resulting representation using the InfoNCE loss function and alternately optimize both $g_{\omega}(q|\,x)$ and $g_{\theta}(k|\,x)$. By adopting a unified framework, $g(q,k|\,x)$, which we refer to as CFN (Cross-Feature Network), we anticipate a substantial enhancement in reasoning precision. The detailed feedforward processes associated with CFN are illustrated in Figure 5.

[Figure 5]
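For concreteness, the following is a minimal PyTorch sketch of the CFN forward pass described above: a ResNet18 branch produces the concept vector q, a ResNet50 branch produces a set of feature vectors {k_beta} (here taken to be the spatial locations of its last feature map), and a cross-attention step with q as the query yields the final representation. The embedding width, the use of spatial locations as {k_beta}, and the 1×1 projection are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18, resnet50

class CFNSketch(nn.Module):
    """Sketch of the Cross-Feature Net g(q, k | x)."""

    def __init__(self, d: int = 256):
        super().__init__()
        self.g_omega = resnet18(num_classes=d)                         # concept vector q
        backbone = resnet50()
        self.g_theta = nn.Sequential(*list(backbone.children())[:-2])  # spatial feature map
        self.proj_k = nn.Conv2d(2048, d, kernel_size=1)                # one k_beta per location
        self.scale = d ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, H, W) Bongard-Logo images (replicated to 3 channels if grayscale)
        q = self.g_omega(x)                                   # (B, d)
        k = self.proj_k(self.g_theta(x))                      # (B, d, h, w)
        k = k.flatten(2).transpose(1, 2)                      # (B, m, d): feature set {k_beta}
        # cross-attention: q is the query, {k_beta} serve as keys and values
        attn = F.softmax((k @ q.unsqueeze(-1)).squeeze(-1) * self.scale, dim=-1)  # (B, m)
        z = (attn.unsqueeze(-1) * k).sum(dim=1)               # weighted sum of {k_beta}
        return z                                              # final representation z_ij
```

The representation z obtained this way is then trained with the InfoNCE loss of Formula (3), optionally alternating the updates of g_omega and g_theta as described above.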

III-C Cross covariance-constrained Feature Net (Triple-CFN)

Afterwards, although imitating the EM algorithm to alternately update the parameters of $g_{\omega}(q|\,x)$ and $g_{\theta}(k|\,x)$ can enhance the model's performance, the alternating training process for $g_{\omega}(q|\,x)$ and $g_{\theta}(k|\,x)$ is tedious and unstable. To address this, our paper makes further contributions. In essence, we designed two networks, $g_{\omega}(q|\,x)$ and $g_{\theta}(k|\,x)$, with the aim of implicitly reinterpreting and effectively solving the Bongard-Logo problem. By maximizing the image representation information carried by $\{k_{\beta}\}$ in the output of $g_{\theta}(k|\,x)$, we can improve both processes mentioned above. To a certain extent, this paper posits that maximizing the information carried by the feature vector set $\{k_{\beta}\}$ can be a viable alternative to the traditional process of maximizing expectation. By adopting this approach, the inherent two-step process arising from the simulation of the EM algorithm[35] within the CFN can be effectively integrated.

Therefore, we introduced the correlation loss based on the covariance matrix as an additional term in the loss function for CFN to decrease the correlation between the dimensions of $\{k_{\beta}\}$. This resulted in the creation of the Cross covariance-constrained Feature Net (Triple-CFN). The coefficient for the newly introduced loss term was set to be 25 times that of the reasoning loss term. Moreover, the practice of alternately updating the parameters of $g_{\omega}(q|\,x)$ and $g_{\theta}(k|\,x)$, while extending training epochs, yields only minimal improvements for the Triple-CFN. Therefore, on the RPM data, we no longer use the strategy of alternating updates between the two networks. Calculating attention results over the output of CNNs can result in attention collapse issues. We posit that reducing the correlation between dimensions in the output representation of CNNs offers a method to alleviate attention collapse. Figure 6 presents a detailed depiction of the feed-forward process within Triple-CFN. The complete loss function on a batch of Bongard-Logo problems for Triple-CFN is as follows:

$$\ell_{\text{Triple-CFN}}=\frac{1}{b}\sum_{i=1}^{b}\sum_{j=1,\tilde{j}\neq j}^{7}\ell_{\mathbf{InfoNCE}}\left(z_{ij},z_{i\tilde{j}},\{z_{ij}\}_{j=8}^{14}\right)+25\cdot\ell_{cov}\left(\{z_{ij}|\,i\in[1,b],\ j\in[1,14]\}\right)\qquad(4)$$

Where $b$ represents the batch size for training. The reasoning loss term $\ell_{\mathbf{InfoNCE}}(\cdot)$ is expressed in Formula (3). The correlation loss term $\ell_{cov}(\cdot)$ represents a specific computation involving the vectors enclosed within the parentheses. In this context, the set of vectors inside $\ell_{cov}(\cdot)$ is treated as samples drawn from a multivariate distribution. Based on these samples, the covariance matrix of the multivariate distribution is calculated using Formula (1), and the correlation loss for Triple-CFN is then calculated using Formula (2).
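For reference, the following is a minimal PyTorch sketch of Formula (4), reusing the `info_nce` and `correlation_loss` sketches given earlier. The tensor layout and the reading of the double index over j and j̃ (all ordered primary-group pairs) are assumptions.

```python
import torch

def triple_cfn_loss(z: torch.Tensor, t: float = 1e-3) -> torch.Tensor:
    """Sketch of the Triple-CFN loss in Formula (4) for one batch of Bongard-Logo problems.

    z: representations of shape (b, 14, d); positions 0..6 hold the primary group and its
    test image, positions 7..13 the auxiliary group and its test image (1-based 1..7 / 8..14).
    """
    b, n, d = z.shape
    reasoning = z.new_zeros(())
    for i in range(b):
        negs = z[i, 7:]                                   # auxiliary-group representations
        for j in range(7):
            for j_tilde in range(7):
                if j_tilde != j:                          # all ordered primary-group pairs
                    reasoning = reasoning + info_nce(z[i, j], z[i, j_tilde], negs, t)
    reasoning = reasoning / b
    # correlation loss over every representation in the batch, weighted by 25
    return reasoning + 25.0 * correlation_loss(z.reshape(b * n, d))
```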

[Figure 6]

III-D Triple-CFN on RPM problem

Particularly when tackling RPM problems, Triple-CFN is able to demonstrate its distinctive utility and value. Specifically, when addressing RPM problems, the core backbone of Triple-CFN remains unchanged, with only slight adjustments made to the design of $g(q,k|\,x)$. These adjustments are necessary operations to enable Triple-CFN to adapt to RPM reasoning rules and inductive biases.

Following the current consensus among RPM solvers, which emphasizes the need for multi-scale or multi-viewpoint feature extraction from RAVEN images[25, 30, 24], Triple-CFN utilizes a Vision Transformer (ViT) for feature extraction while preserving all output vectors as multi-viewpoint features. In this paper, the number of viewpoints is denoted as $L$. The extraction process is illustrated in Figure 7.

[Figure 7]

Subsequently, Triple-CFN processes each viewpoint equally.

From a viewpoint that considers all images in an RPM problem, this paper employs a Multi-layer Perceptron (MLP) with a bottleneck structure to extract information from all minimal reasoning units (three images within a row for RAVEN, and three images within a row or column for PGM[37]), and preserves the extracted unit information as vectors $\{k_{\beta}|\,\beta\in[1,M]\}$, where $\beta$ denotes the index of the minimal reasoning unit and $M$ stands for the total number of minimal reasoning units. Using an MLP to process the sequential images within minimal reasoning units retains their order in a straightforward manner. Combining the unit vectors from the problem stem with $S$ optimizable vectors, we input them into the network $g_{\omega}(q|\,x)$, which utilizes a Transformer-Encoder as its backbone, and obtain $\{q_{\alpha}|\,\alpha\in[1,S]\}$ by extracting all the optimizable vectors from the network output. By encoding multiple concept vectors $\{q_{\alpha}\}$ through the network $g_{\omega}(q|\,x)$, the intention is to enable the network $g(q,k|\,x)$ to solve RPM problems through a multi-evaluation reasoning approach. The calculation process of $g(q,k|\,x)$ is illustrated in Figure 8.

[Figure 8]
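A hedged PyTorch sketch of the encoding steps just described is given below. The module and parameter names are hypothetical, the per-viewpoint ViT features are assumed to be computed beforehand, and the bookkeeping for answer candidates is omitted; only the bottleneck MLP over minimal reasoning units and the Transformer-Encoder $g_{\omega}(q|x)$ with $S$ optimizable vectors are shown.

```python
import torch
import torch.nn as nn

class RPMEncoder(nn.Module):
    """Sketch of the RPM-side encoding pipeline (hypothetical configuration)."""
    def __init__(self, dim=128, n_concepts=3, n_heads=8):
        super().__init__()
        # Bottleneck MLP over one minimal reasoning unit (three panel features, order preserved).
        self.unit_mlp = nn.Sequential(
            nn.Linear(3 * dim, dim // 2), nn.ReLU(), nn.Linear(dim // 2, dim)
        )
        # S optimizable vectors that become the concept vectors {q_alpha}.
        self.concept_tokens = nn.Parameter(torch.randn(n_concepts, dim))
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.g_omega = nn.TransformerEncoder(enc_layer, num_layers=3)

    def forward(self, panel_feats):
        # panel_feats: [B, L, P, dim] per-viewpoint ViT features of the stem panels,
        # arranged so that consecutive triples form minimal reasoning units.
        B, L, P, dim = panel_feats.shape
        units = panel_feats.reshape(B, L, P // 3, 3 * dim)   # [B, L, M, 3*dim]
        k = self.unit_mlp(units)                             # unit vectors {k_beta}: [B, L, M, dim]
        S = self.concept_tokens.shape[0]
        q_tokens = self.concept_tokens.unsqueeze(0).expand(B * L, S, dim)
        seq = torch.cat([k.reshape(B * L, -1, dim), q_tokens], dim=1)
        out = self.g_omega(seq)
        q = out[:, -S:, :].reshape(B, L, S, dim)             # concept vectors {q_alpha}
        return q, k
```

Under these assumptions, such an encoder would be applied once per candidate completion of the matrix, producing the per-viewpoint sets that the scoring head described next consumes.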

After calculating the cross-attention results between $\{q_{\alpha}\}$ and $\{k_{\beta}\}$, a new MLP is used to score these outputs, yielding $S$ scores under one viewpoint. Considering all viewpoints, Triple-CFN encodes $L$ sets of $\{q_{\alpha}\}$ and $\{k_{\beta}\}$ and calculates $L\times S$ scores, which are then averaged to determine the final score. The calculation of the final score is illustrated in Figure 9.

[Figure 9]
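The scoring step can be sketched as follows. This is an illustrative implementation rather than the paper's exact head: cross-attention is computed with the concept vectors $\{q_{\alpha}\}$ as queries and the unit vectors $\{k_{\beta}\}$ as keys and values, an MLP assigns one score per concept, and the $L\times S$ scores are averaged. In practice one such score would be produced per answer candidate, with the cross-entropy reasoning loss applied across candidates.

```python
import torch
import torch.nn as nn

class ViewpointScorer(nn.Module):
    """Sketch of the scoring head: cross-attention + per-concept MLP score,
    averaged over the L x S scores. Names and sizes are illustrative."""
    def __init__(self, dim=128, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=n_heads, batch_first=True)
        self.score_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, q, k):
        # q: [B, L, S, dim] concept vectors, k: [B, L, M, dim] unit vectors.
        B, L, S, dim = q.shape
        q_flat = q.reshape(B * L, S, dim)
        k_flat = k.reshape(B * L, -1, dim)
        attended, _ = self.attn(q_flat, k_flat, k_flat)   # cross-attention, queries = {q_alpha}
        scores = self.score_mlp(attended).squeeze(-1)     # [B*L, S] one score per concept
        return scores.view(B, L, S).mean(dim=(1, 2))      # average over the L x S scores
```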

To enforce constraints on this final score, the Cross-Entropy loss function is employed as the reasoning loss term $\ell_{\text{cross-entropy}}$, and the correlation loss term is still applied to the vector set $\{k_{\beta}\}$. Notably, the coefficient of the reasoning loss term is set to 100 times that of the correlation loss term; this setting differentiates our approach from the one employed by Triple-CFN on Bongard-Logo tasks. Figure 10 provides a detailed illustration of the modifications made to Triple-CFN to adapt it to RPM problems, including where the reasoning loss term $\ell_{\text{cross-entropy}}$ and the correlation loss term $\ell_{cov}$ are applied.

[Figure 10]

III-E Meta Cross-covariance-constrained Feature Net (Meta Triple-CFN)

Our experiments revealed that Triple-CFN's approach of reinterpreting the conceptual space of an abstract reasoning problem offers a slight advantage on the Bongard-Logo problem. Triple-CFN was designed around the hypothesis of conflicts between low-dimensional image styles and high-dimensional human concepts. When the high-dimensional concepts are designed reasonably and effectively, as with the image progression patterns in the RAVEN and PGM problems, using these concepts as supervisory signals to constrain the training of Triple-CFN can lead to more significant contributions from the network.

The RAVEN and PGM problems are accompanied by descriptions of their image progression patterns (Meta data). Previous works have attempted to enhance their RPM solvers with the additional task of learning these progression patterns, aiming to improve reasoning performance and interpretability; however, efforts led by MRNet[19] suggest that such approaches may be counterproductive. This paper argues that the structure of Triple-CFN can balance the reasoning task and the progression-pattern matching task, and can even allow the two tasks to refine each other and grow together, which previous RPM discriminators, including RS-Tran[30] and MRNet, could not achieve. We posit that strengthening the identity of $\{q_{\alpha}\}$ as concept vectors, by utilizing these pattern cue signals, can effectively and reasonably enhance both the interpretability and the reasoning accuracy of Triple-CFN.

In this study, an additional standard Transformer-Encoder is employed to process the Meta data in the RAVEN and PGM problems, depicting a concept space for Triple-CFN. A newly introduced Meta loss term constrains the $\{q_{\alpha}\}$ vectors encoded by Triple-CFN to their corresponding positions in the concept space established by this Transformer-Encoder. Recall that $\{q_{\alpha}\,|\,\alpha\in[1,S]\}$ is the set of $S$ concept vectors encoded by Triple-CFN under a single viewpoint. Since constraints must be imposed on $\{q_{\alpha}\}$ across all viewpoints, this paper computes the average of the $\{q_{\alpha}\}$ obtained from each viewpoint, denoted $\{\overline{q}_{\alpha}\,|\,\alpha\in[1,S]\}$, and imposes constraints on this averaged representation. Regarding the implementation of the concept space for RAVEN, we tokenize the progression-pattern descriptions using the format 'type: XXXX, size: XXXX, color: XXXX, number/position: XXXX'. We then combine each tokenized description with an optimizable vector and process them with standard Transformer-Encoders. The combined optimizable vector, extracted from the attention results of the Transformer-Encoder, serves as the feature of the pattern description. Through this process, we generate feature vectors for all kinds of pattern descriptions and establish the concept space of progression patterns. These feature vectors are referred to as $\{T_{k}\,|\,k\in[1,K]\}$.
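A minimal sketch of how such a concept space might be built is shown below, assuming a toy tokenizer has already mapped each pattern description to integer token ids; the vocabulary size, depth, and readout mechanism are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MetaConceptSpace(nn.Module):
    """Sketch of building the concept-space features {T_k} from Meta-data descriptions."""
    def __init__(self, vocab_size=64, dim=128, num_layers=2, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.readout = nn.Parameter(torch.randn(1, 1, dim))   # optimizable readout vector
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)

    def forward(self, token_ids):
        # token_ids: [K, T] integer tokens for the K pattern descriptions,
        # e.g. tokenized from 'type: ..., size: ..., color: ..., number/position: ...'.
        K = token_ids.shape[0]
        tokens = self.embed(token_ids)                               # [K, T, dim]
        tokens = torch.cat([self.readout.expand(K, -1, -1), tokens], dim=1)
        out = self.encoder(tokens)
        return out[:, 0]                                             # features {T_k}: [K, dim]
```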

Finally, a new Meta loss term, based on the InfoNCE loss, is introduced. It optimizes the cosine similarity between the $\{\overline{q}_{\alpha}\}$ vectors in Triple-CFN and the feature vectors $\{T_{k}\,|\,k\in[1,K]\}$ of the concept space formed from the Meta data. Specifically, it ensures that each $\overline{q}_{\alpha}$ is more similar to its corresponding pattern feature vector $T_{\tilde{k}}$ than to any other vector in $\{T_{k}\,|\,k\in[1,K],\,k\neq\tilde{k}\}$. The number of vectors in $\{\overline{q}_{\alpha}\,|\,\alpha\in[1,S]\}$, namely $S$, is determined by how many optimizable vectors are combined with the logical input of the $g_{\omega}(q|x)$ network and can in principle be set arbitrarily. When Triple-CFN is constrained by Meta data from RPM problems, we typically set $S$ to one more than the number of decoupled progression patterns identified in the Meta data. For instance, when solving PGM problems, $S$ can be set to 3 because of the two decoupled conceptual attributes, "shape" and "line".

The rationale for setting $S$ one higher than the number of decoupled progression patterns is to guarantee an unconstrained vector within the set $\{\overline{q}_{\alpha}\,|\,\alpha\in[1,S]\}$. This vector is not bound by the constraints imposed by the Meta data and can therefore engage autonomously in solving the RPM problem. It serves as a countermeasure against any unforeseen or unreasonable design elements embedded in the Meta data, enhancing the overall robustness and adaptability of Meta Triple-CFN. With the inclusion of the progression-pattern labels and the new loss term, Triple-CFN is transformed into Meta Triple-CFN. The Meta loss term can be expressed as follows:

$$\ell_{\text{Meta}}\big(\{\overline{q}_{\alpha}\,|\,\alpha\in[1,S-1]\},\ \{T_{k}\,|\,k\in[1,K]\}\big)
= -\sum_{\alpha=1,\ \tilde{k}|\alpha}^{S-1}\log\frac{e^{(\overline{q}_{\alpha}\cdot T_{\tilde{k}})/t}}{e^{(\overline{q}_{\alpha}\cdot T_{\tilde{k}})/t}+\sum_{k=1,\,k\neq\tilde{k}}^{K}e^{(\overline{q}_{\alpha}\cdot T_{k})/t}} \qquad (5)$$

The temperature coefficient $t$ in the Meta loss term is set to $10^{-6}$. Note that $T_{\tilde{k}}$ denotes the progression-pattern vector with which $\overline{q}_{\alpha}$ should be aligned. In Formula (5), $\tilde{k}$ is determined by $\alpha$, which binds the respective progression patterns to different vectors in $\{\overline{q}_{\alpha}\,|\,\alpha\in[1,S-1]\}$ and ensures that these vectors jointly align with all the decoupled progression patterns. From another viewpoint, $\{\overline{q}_{\alpha}\,|\,\alpha\in[1,S]\}$ can be regarded as $S$ slots: the Meta loss term embeds the $S-1$ decoupled concepts from the Meta data into these slots while reserving one empty slot to stabilize Triple-CFN. This reserved slot serves as a safeguard against subtle, unreasonable configurations that might appear unexpectedly in the Meta data. The calculation process of $\ell_{\text{Meta}}$ is illustrated in Figure 11.

[Figure 11]
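Formula (5) can be implemented compactly as a cross-entropy over similarity logits, as in the following sketch; the tensor names are assumptions, and the vectors are L2-normalized here to reflect the cosine-similarity description in the text.

```python
import torch
import torch.nn.functional as F

def meta_loss(q_bar, T, target_idx, t=1e-6):
    """Sketch of the Meta loss in Formula (5).

    q_bar:      [S-1, dim] averaged concept vectors constrained by Meta data
                (the reserved free slot is excluded).
    T:          [K, dim] progression-pattern feature vectors {T_k}.
    target_idx: [S-1] index k~ of the pattern each q_bar_alpha should align with.
    """
    q_n = F.normalize(q_bar, dim=-1)            # cosine similarity, per the textual description
    T_n = F.normalize(T, dim=-1)
    logits = (q_n @ T_n.t()) / t                # [S-1, K] similarities over the concept space
    # Cross-entropy with target k~ reproduces the -log softmax term of Formula (5), summed over alpha.
    return F.cross_entropy(logits, target_idx, reduction='sum')
```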

In this advanced framework, the coefficients of the novel Meta loss term and the preexisting Cross-Entropy loss term, which jointly constrain model reasoning, are set equal to each other and 100 times larger than the coefficient of the correlation loss term. Figure 12 illustrates the structure of Meta Triple-CFN in detail. It is worth noting that Meta Triple-CFN is tailored to RPM problems because they provide unambiguous, well-defined auxiliary "rule" supervision signals. Conversely, Bongard-Logo problems exhibit overlapping patterns and concepts, which constitute the source of their difficulty and render Meta Triple-CFN unsuitable for them.

[Figure 12]

Intuitively, providing Meta data directly to a deep neural network as additional supervisory signals should improve its accuracy on abstract reasoning problems. In practice, this is not the case: previous research has mostly shown that introducing Meta data can actually decrease reasoning accuracy[24, 19, 25]. The ingenuity of Triple-CFN lies in its ability to overcome this curse on RPM. RS-Tran has demonstrated a tangible performance improvement through the indirect use of Meta data, namely for pre-training its encoder, but it has not achieved human-interpretable rules alongside exceedingly high reasoning accuracy. Furthermore, in the multiple reasoning steps of RS-Tran, the content of each step must be verified through post-hoc masking experiments, whereas the reasoning steps in Meta Triple-CFN inherently exhibit ex-ante interpretability with respect to progression patterns. Meta Triple-CFN is thus a model that successfully balances both objectives.

III-F Re-space layer

The sources of reasoning difficulty differ between Bongard-Logo and RPM problems: the challenge in Bongard-Logo stems partly from conflicts among high-dimensional concepts at a fundamental level, whereas RPM problems demand multi-level reasoning.

In this paper, both Triple-CFN and Meta Triple-CFN implicitly or explicitly constrain the progression-pattern vectors $\{q_{\alpha}\}$ for RPM problems. We posit that, at its core, the constraint imposed on $\{q_{\alpha}\}$ in Meta Triple-CFN spiritually resembles the code-book approach, albeit implemented through the lens of a linguistic model. Thus, the essence of Meta Triple-CFN lies in its ability to standardize the output of Triple-CFN under the supervision of auxiliary labels. Based on this view, this paper designs a novel normalization method applied to the $\{k_{\beta}\,|\,\beta\in[1,M]\}$ vector group in both Triple-CFN and Meta Triple-CFN.

Specifically, we establish $M$ optimizable vectors for Triple-CFN, which depict a vector space $\{v_{h}\,|\,h\in[1,M]\}$. Cosine similarity is then computed between the minimal-reasoning-unit vectors $\{k_{\beta}\}$ and each optimizable vector, as follows:

$$k'_{\beta h}=\frac{v_{h}\cdot k_{\beta}}{\|v_{h}\|\,\|k_{\beta}\|} \qquad (6)$$
$$k'_{\beta}=\{k'_{\beta h}\,|\,h\in[1,M]\} \qquad (7)$$

The resulting vector $k'_{\beta}$, composed of the $M$ cosine similarities $\{k'_{\beta h}\,|\,h\in[1,M]\}$, represents the coordinates of the minimal-reasoning-unit vector $k_{\beta}$ within the vector space $\{v_{h}\,|\,h\in[1,M]\}$. The original unit vectors $\{k_{\beta}\}$ are replaced by the computed coordinates $\{k'_{\beta}\}$ for subsequent reasoning. The process of the Re-space layer is illustrated in Figure 13.

[Figure 13]
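Formulas (6)-(7) amount to projecting each unit vector onto a learned, normalized basis. The following sketch shows one way this could be written; the module name and default sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReSpaceLayer(nn.Module):
    """Sketch of the Re-space layer (Formulas (6)-(7)): re-expresses each unit
    vector k_beta as its cosine similarities to M optimizable vectors {v_h}."""
    def __init__(self, dim=128, M=128):
        super().__init__()
        self.v = nn.Parameter(torch.randn(M, dim))   # the optimizable space {v_h}

    def forward(self, k):
        # k: [..., dim] minimal-reasoning-unit vectors {k_beta}.
        k_n = F.normalize(k, dim=-1)
        v_n = F.normalize(self.v, dim=-1)
        return k_n @ v_n.t()                         # [..., M] coordinates k'_beta
```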

We posit that this design constitutes an effective normalization technique. During model training, the similarity among the $M$ optimizable vectors is constrained to ensure a richly diverse vector space and to prevent collapse of the Re-space layer output. The constraint is implemented through the following function, used as an additional loss term for Triple-CFN or Meta Triple-CFN:

$$\ell_{\text{Re-space}}\big(\{v_{h}\}_{h=1}^{M}\big)=\sum_{h=1}^{M}-\log\frac{e^{(v_{h}\cdot v_{h})/t}}{e^{(v_{h}\cdot v_{h})/t}+\sum_{\tilde{h}=1,\,\tilde{h}\neq h}^{M}e^{(v_{h}\cdot v_{\tilde{h}})/t}} \qquad (8)$$

Where $t$ is set to $10^{-2}$. The parameter $M$ is set equal to the dimension of $k_{\beta}$, which is 128. When the Re-space layer is incorporated into Triple-CFN or Meta Triple-CFN, the coefficient of this loss term is kept equal to that of the correlation loss term; in other words, the ratio between the Meta loss term, the Cross-Entropy loss term, the correlation loss term, and the Re-space loss term is set to 100:100:1:1. The design of Meta Triple-CFN with the Re-space layer is depicted in Figure 14; the integration of Triple-CFN with the Re-space layer follows analogously.

[Figure 14]
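Since each $v_{h}$ in Formula (8) acts as its own positive, the term reduces to a cross-entropy with the identity as target, as the following sketch shows; the function name and defaults are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def re_space_loss(v, t=1e-2):
    """Sketch of Formula (8): an InfoNCE-style diversity constraint that keeps
    the M optimizable vectors {v_h} dissimilar from one another, preventing
    collapse of the Re-space coordinates."""
    sim = (v @ v.t()) / t                                     # [M, M] pairwise dot products
    targets = torch.arange(v.shape[0], device=v.device)       # positive of v_h is v_h itself
    return F.cross_entropy(sim, targets, reduction='sum')
```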

It is worth emphasizing that the above calculation is not equivalent to mapping the vectors through a matrix and then compressing them with $\tanh$. This design enhances the applicability of Triple-CFN and Meta Triple-CFN to RPM problems.

IV Experiment

All our experiments are implemented in Python using the PyTorch[38] framework.

IV-A Experiment on Bongard-Logo

In this study, we conducted experiments on the Bongard-Logo dataset using the designed CFN and Triple-CFN models. To demonstrate the impact of alternately updating $g_{\theta}(k|x)$ and $g_{\omega}(q|x)$, which mimics the Expectation-Maximization algorithm, we performed ablation experiments; the results are presented in Table II. Our experiments were conducted on a single server equipped with four A100 GPUs. We trained the models using mini-batch gradient descent with a batch size of 120, the Adam[39] optimizer, a learning rate of $10^{-3}$, and a weight decay of $10^{-4}$. It is worth reiterating that Triple-CFN incorporates two loss terms: the reasoning loss term and the correlation loss term. When Triple-CFN is applied to the Bongard-Logo problem, the coefficient ratio between the reasoning loss term, composed of the InfoNCE loss, and the covariance-based correlation loss term is set to 1:25. When addressing the RPM problem, the ratio between the reasoning loss term, formulated as a cross-entropy loss, and the correlation loss term becomes 100:1. Meta Triple-CFN, tailored specifically to the RPM problem, adds a new InfoNCE-based Meta loss term to the components of Triple-CFN; within Meta Triple-CFN, the coefficient ratio among the Meta loss term, the reasoning loss term, and the correlation loss term is 100:100:1. The Re-space layer is an enhancement for both Triple-CFN and Meta Triple-CFN; its integration requires an additional loss term that prevents the output of the Re-space layer from collapsing. Consequently, the coefficient ratio among the Meta loss term, the reasoning loss term, the correlation loss term, and the Re-space loss term is maintained at 100:100:1:1.
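As a summary of the coefficient settings above, a minimal sketch of how the loss terms might be combined during training is given below; the argument names are placeholders for values computed elsewhere (e.g., with the sketches in Section III).

```python
def combined_loss(ce_loss, meta_loss, corr_loss, respace_loss):
    """Weighted sum with the 100:100:1:1 ratio used for Meta Triple-CFN with the
    Re-space layer; for Bongard-Logo the paper instead uses an InfoNCE reasoning
    loss and a 1:25 ratio against the correlation loss."""
    return 100.0 * ce_loss + 100.0 * meta_loss + 1.0 * corr_loss + 1.0 * respace_loss
```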

TABLE II: Accuracy (%) on the Bongard-Logo problem.

Model               | Train | FF   | BA   | CM   | NV
SNAIL               | 59.2  | 56.3 | 60.2 | 60.1 | 61.3
ProtoNet            | 73.3  | 64.6 | 72.4 | 62.4 | 65.4
MetaOptNet          | 75.9  | 60.3 | 71.6 | 65.9 | 67.5
ANIL                | 69.7  | 56.6 | 59.0 | 59.6 | 61.0
Meta-Baseline-SC    | 75.4  | 66.3 | 73.3 | 63.5 | 63.9
Meta-Baseline-MoCo  | 81.2  | 65.9 | 72.2 | 63.9 | 64.7
WReN-Bongard        | 78.7  | 50.1 | 50.9 | 53.8 | 54.3
SBSD                | 83.7  | 75.2 | 91.5 | 71.0 | 74.1
PMoC                | 92.0  | 92.6 | 97.7 | 78.3 | 75.0
CFN                 | 91.2  | 86.5 | 98.1 | 77.0 | 77.5
CFN+EM              | 93.9  | 93.8 | 99.4 | 77.8 | 77.2
Triple-CFN          | 93.2  | 92.0 | 99.2 | 80.8 | 79.1
Triple-CFN+EM       | 95.3  | 94.3 | 99.8 | 80.3 | 80.0

By alternating the updates of $g_{\theta}(k|x)$ and $g_{\omega}(q|x)$, we aimed to simulate the iterative nature of the EM algorithm, which is known for its effectiveness in finding maximum-likelihood estimates in statistical models with latent variables. Our ablation studies revealed that this alternating update strategy improved the performance of CFN on the Bongard-Logo task. As Table II shows, alternating updates between $g_{\theta}(k|x)$ and $g_{\omega}(q|x)$ enhanced the model's performance on the FF and BA problems without significantly affecting its ability to solve the "NV" and "CM" generalization problems. This suggests that simulating the EM process during training, while beneficial, may be somewhat redundant when combined with the already effective cross-attention mechanism; moreover, as CFN is upgraded to Triple-CFN, the contribution of EM diminishes. In addition, compared with PMoC[15], Triple-CFN achieves better performance on multiple quantifiable metrics of the Bongard-Logo dataset while requiring fewer parameters and simpler computation, and it does not necessitate parallel reasoning over multiple perspectives and inferences.

IV-B Experiment on RPM

When confronted with the RAVEN dataset, Triple-CFN demonstrates considerable strength. In this study, we conducted experiments using the same software and hardware configuration as the RS-Tran experiments and replicated their experimental parameters, including batch size, learning rate, and all other factors that could influence model performance, so as to allow the most direct comparison with RS-Tran, currently considered the state-of-the-art model. The accuracy of Triple-CFN on RAVEN and I-RAVEN is recorded in Table III; the results clearly indicate that Triple-CFN performs notably better than RS-Tran.

TABLE III: Test accuracy (%) on RAVEN / I-RAVEN.

Model         | Average   | Center      | 2×2 Grid  | 3×3 Grid  | L-R        | U-D        | O-IC      | O-IG
SAVIR-T [25]  | 94.0/98.1 | 97.8/99.5   | 94.7/98.1 | 83.8/93.8 | 97.8/99.6  | 98.2/99.1  | 97.6/99.5 | 88.0/97.2
SCL [24, 25]  | 91.6/95.0 | 98.1/99.0   | 91.0/96.2 | 82.5/89.5 | 96.8/97.9  | 96.5/97.1  | 96.0/97.6 | 80.1/87.7
MRNet [19]    | 96.6/-    | -/-         | -/-       | -/-       | -/-        | -/-        | -/-       | -/-
RS-TRAN [30]  | 98.4/98.7 | 99.8/100.0  | 99.7/99.3 | 95.4/96.7 | 99.2/100.0 | 99.4/99.7  | 99.9/99.9 | 95.4/95.4
Triple-CFN    | 99.6/99.8 | 100.0/100.0 | 99.7/99.8 | 98.8/99.4 | 99.9/100.0 | 99.9/100.0 | 99.9/99.9 | 99.2/99.2

We subsequently conducted experiments on the PGM dataset with Triple-CFN and the Re-space layer under exactly the same experimental conditions as RS-Tran; the answer-reasoning accuracy is recorded in Table IV, and the accuracy of reasoning about progression patterns is recorded in Table V. Our aim was to demonstrate the superiority of both Triple-CFN and Meta Triple-CFN. It is worth reiterating that Meta Triple-CFN achieves both ex-ante interpretability of the progression patterns and high reasoning accuracy, which is not achievable by RS-Tran or the other previous models in Table IV.

TABLE IV: Test accuracy (%) on PGM.

Model                            | Test Accuracy (%)
SAVIR-T [25]                     | 91.2
SCL [24, 25]                     | 88.9
MRNet [19]                       | 94.5
RS-CNN [30]                      | 82.8
RS-TRAN [30]                     | 97.5
Triple-CFN                       | 97.8
Triple-CFN + Re-space layer      | 98.2
Meta Triple-CFN                  | 98.4
Meta Triple-CFN + Re-space layer | 99.3
TABLE V: Accuracy (%) of progression-pattern reasoning ("shape", "line") and answer selection on PGM.

Model                            | shape | line | answer
Meta Triple-CFN                  | 99.5  | 99.9 | 98.4
Meta Triple-CFN + Re-space layer | 99.7  | 99.9 | 99.3

Integrating the Re-space layer into Triple-CFN or Meta Triple-CFN requires retaining part of the model parameters. Specifically, the parameters of the modules preceding the point where the Re-space layer is inserted must be preserved, while the remaining parameters are randomly initialized. More precisely, the parameters of the Vision Transformer used for image encoding and of the Multi-Layer Perceptron that extracts minimal-reasoning-unit information in (Meta) Triple-CFN are retained, while all other parameters undergo random initialization.

V Conclusion

This paper introduces the novel Triple-CFN approach, tailored specifically for the Bongard-Logo problem. The Triple-CFN’s unique architecture enables it to implicitly reorganize the conceptual space of conflicting Bongard-Logo instances, achieving remarkable performance on this task. Furthermore, the adaptability of the Triple-CFN paradigm is demonstrated through its effective application to the RPM problem, where necessary modifications were made to yield competitive results.

Notably, the well-defined rules, progressive patterns and clear boundaries governing the RPM problem necessitated the development of the Meta Triple-CFN network. This network explicitly structures the problem space for the RPM issue, maintaining interpretability while attaining state-of-the-art performance on the PGM problem.

Overall, this paper contributes to the advancement of machine intelligence by exploring innovative network designs tailored for abstract reasoning tasks. The proposed Triple-CFN and Meta Triple-CFN approaches represent significant steps forward in addressing the challenges posed by the Bongard-Logo and RPM problems, respectively. We believe that our findings will stimulate further research and development in this critical area of artificial intelligence. In essence, Triple-CFN aims to propose a fundamental methodology for tackling abstract reasoning problems, namely the normalization of reasoning information. Both Meta Triple-CFN and the Re-space layer are attempts at normalizing reasoning information, and they have achieved notable improvements in network performance, thereby demonstrating the effectiveness of this approach.

References

  • [1] Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, 246-255 (2009).
  • [2] He, K., Zhang, X., Ren, S., & Sun, J. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 770-778 (2016).
  • [3] Krizhevsky, A., Sutskever, I., & Hinton, G. E. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84-90 (2017).
  • [4] Vaswani, A. et al. Attention is All You Need. In Advances in Neural Information Processing Systems, (2017).
  • [5] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. Bert: Pre-training of Deep Bidirectional Transformers for Language Understanding. Preprint at https://arxiv.org/abs/1810.04805 (2018).
  • [6] Brown, T. et al. Language Models are Few-shot Learners. In Advances in Neural Information Processing Systems, 1877-1901 (2020).
  • [7] Kingma, D. P., & Welling, M. Auto-encoding variational bayes. Preprint at https://arxiv.org/abs/1312.6114 (2014).
  • [8] Goodfellow, I. et al. Generative adversarial networks. Communications of the ACM, 63(11), 139-144 (2020).
  • [9] Ho, J., Jain, A., & Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 33, 6840-6851 (2020).
  • [10] Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., & Parikh, D. VQA: Visual question answering. In IEEE International Conference on Computer Vision, 2425-2433 (2015).
  • [11] Johnson, J., Hariharan, B., Van Der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., & Girshick, R. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. In IEEE Conference on Computer Vision and Pattern Recognition, 2901-2910 (2017).
  • [12] Raven, J. C. Raven's Progressive Matrices. Western Psychological Services (1938).
  • [13] Depeweg, S., Rothkopf, C. A., & Jäkel, F. Solving Bongard Problems with a Visual Language and Pragmatic Reasoning. Preprint at https://arxiv.org/abs/1804.04452 (2018).
  • [14] Nie, W., Yu, Z., Mao, L., Patel, A. B., Zhu, Y., & Anandkumar, A. Bongard-LOGO: A New Benchmark for Human-Level Concept Learning and Reasoning. In Advances in Neural Information Processing Systems, 16468–16480 (2020).
  • [15] Song, R., & Yuan, B. Solving the Bongard-Logo Problem by Modeling a Probabilistic Model. Preprint at https://arxiv.org/abs/2403.03173 (2024).
  • [16] Zhang, C., Gao, F., Jia, B., Zhu, Y., & Zhu, S. C. Raven: A Dataset for Relational and Analogical Visual Reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5317–5327 (2019).
  • [17] Barrett, D., Hill, F., Santoro, A., Morcos, A., & Lillicrap, T. Measuring Abstract Reasoning in Neural Networks. In International Conference on Machine Learning, 511-520 (2018).
  • [18] Hu, S., Ma, Y., Liu, X., Wei, Y., & Bai, S. Stratified Rule-Aware Network for Abstract Visual Reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, 1567-1574 (2021).
  • [19] Benny, Y., Pekar, N., & Wolf, L. Scale-Localized Abstract Reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12557-12565, (2021).
  • [20] Zhang, C., Jia, B., Gao, F., Zhu, Y., Lu, H., & Zhu, S. C. Learning Perceptual Inference by Contrasting. In Proceedings of Advances in Neural Information Processing Systems, (2019).
  • [21] Zheng, K., Zha, Z. J., & Wei, W. Abstract Reasoning with Distracting Features. In Advances in Neural Information Processing Systems, (2019).
  • [22] Zhuo, T., & Kankanhalli, M. Effective Abstract Reasoning with Dual-Contrast Network. In Proceedings of International Conference on Learning Representations, (2020).
  • [23] Zhuo, Tao and Huang, Qiang & Kankanhalli, Mohan. Unsupervised abstract reasoning for raven’s problem matrices. IEEE Transactions on Image Processing, 8332–8341, (2021).
  • [24] Wu, Y., Dong, H., Grosse, R., & Ba, J. The Scattering Compositional Learner: Discovering Objects, Attributes, Relationships in Analogical Reasoning. Preprint at https://arxiv.org/abs/2007.04212 (2020).
  • [25] Sahu, P., Basioti, K., & Pavlovic, V. SAViR-T: Spatially Attentive Visual Reasoning with Transformers. Preprint at https://arxiv.org/abs/2206.09265 (2022).
  • [26] Wei, Q. et al. Raven Solver: From Perception to Reasoning. Information Sciences, 634, 716-729 (2023).
  • [27] Zhang, C., Jia, B., Zhu, S. C., & Zhu, Y. Abstract Spatial-Temporal Reasoning via Probabilistic Abduction and Execution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9736-9746 (2021).
  • [28] Zhang, C., Xie, S., Jia, B., Wu, Y. N., Zhu, S. C., & Zhu, Y. Learning Algebraic Representation for Systematic Generalization. In Proceedings of the European Conference on Computer Vision, (2022).
  • [29] Hersche, M., Zeqiri, M., Benini, L., Sebastian, A., & Rahimi, A. A Neuro-vector-symbolic Architecture for Solving Raven’s Progressive Matrices. Preprint at https://arxiv.org/abs/2203.04571 (2022).
  • [30] Wei, Q., Chen, D., & Yuan, B. Multi-viewpoint and Multi-evaluation with Felicitous Inductive Bias Boost Machine Abstract Reasoning Ability. Preprint at https://arxiv.org/abs/2210.14914 (2022).
  • [31] Shi, F., Li, B., & Xue, X. Abstracting Concept-Changing Rules for Solving Raven's Progressive Matrix Problems. Preprint at https://arxiv.org/abs/2307.07734 (2023).
  • [32] Kharagorgiev, S. Solving Bongard Problems with Deep Learning. k10v.github.io (2020).
  • [33] Dosovitskiy, A. et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Preprint at https://arxiv.org/abs/2010.11929 (2020).
  • [34] Bardes, A., Ponce, J., & LeCun, Y. VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. Preprint at https://arxiv.org/abs/2105.04906 (2021).
  • [35] Dempster, A. P., Laird, N. M., & Rubin, D. B. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 1-22 (1977).
  • [36] Oord, A. V. D., Li, Y., & Vinyals, O. Representation Learning with Contrastive Predictive Coding. Preprint at https://arxiv.org/abs/1807.03748 (2019).
  • [37] Carpenter, P. A., Just, M. A., & Shell, P. What One Intelligence Test Measures: a Theoretical Account of the Processing in the Raven Progressive Matrices Test. Psychological review, 97(3), 404, (1990).
  • [38] Paszke, A. et al. Automatic Differentiation in PyTorch. In NIPS Autodiff Workshop, (2017).
  • [39] Kingma, D. P., & Ba, J. Adam: A Method for Stochastic Optimization. Preprint at https://arxiv.org/abs/1412.6980, (2014).