From words to visuals: Bridging text and visual insights using MetA-MARC framework for enhanced scholarly article categorization

Bibliographic Details
Published in: Knowledge-Based Systems, Vol. 324, p. 113896
Main Authors: Mitra, Abhijit; Paul, Jayanta; Ahamed, Tanis; Basak, Sagar; Sil, Jaya
Format: Journal Article
Language: English
Published: Elsevier B.V., 03.08.2025
Summary: The rapid growth of technology has led to approximately 28,100 journals disseminating 2.5 million research articles annually, posing significant challenges in locating and categorizing articles of interest. Search engines, citation indexes, and digital libraries often return predominantly irrelevant papers due to limited indexing. Existing classification techniques leveraging content and metadata face challenges such as incomplete data and lack of semantic context. Metadata-based methods frequently rely on statistical metrics that neglect semantic meaning and require subject expertise for threshold setting. To address these issues, we propose the Metadata-Driven Attention-Based Multimodal Academic Research Classifier (MetA-MARC), a framework leveraging the pretrained CLIP (Contrastive Language-Image Pre-training) model to integrate text and image modalities for enhanced scholarly article classification. MetA-MARC captures semantic and contextual meaning by integrating metadata, OCR-extracted features, and images through CLIP. It introduces a novel textual inversion approach that maps images to pseudo-word tokens in the CLIP embedding space for robust multimodal representations. The framework employs FusionWeave, a multimodal fusion network that combines features using concatenation, cross fusion, and attention-based techniques, alongside Modality-Driven Adaptive Re-weighting (MoDAR) to dynamically prioritize relevant features. Experiments on the JUCS, ACM, and proprietary CompScholar datasets demonstrate average accuracies of 0.86, 0.84, and 0.8848, respectively, surpassing state-of-the-art methods by up to 4.05%. These results highlight MetA-MARC’s potential as a robust, adaptive tool for automated scholarly article classification, effectively bridging text and visual modalities.
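The abstract only sketches the textual-inversion step, so the following is a minimal, illustrative PyTorch sketch of the general technique, not the authors' implementation: a pseudo-word embedding in CLIP's text space is optimized so that a prompt containing it aligns with a figure's image embedding. The Hugging Face checkpoint name, the reuse of the rare existing token "sks" as the placeholder (avoiding vocabulary resizing), the prompt template, the cosine-alignment loss, and the blank stand-in image are all assumptions.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Pretrained CLIP backbone (checkpoint choice is an assumption; the abstract
# does not name one).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Freeze everything; only one token-embedding row will be optimized.
for p in model.parameters():
    p.requires_grad_(False)
emb = model.text_model.embeddings.token_embedding
emb.weight.requires_grad_(True)

# Reuse a rare existing token ("sks") as the pseudo-word slot so the sketch
# avoids growing CLIP's vocabulary; the paper may instead learn a new token.
tokenizer = processor.tokenizer
pseudo_id = tokenizer("sks", add_special_tokens=False).input_ids[0]

# In practice this would be a figure extracted from the article; a blank
# image keeps the sketch self-contained.
figure = Image.new("RGB", (224, 224), "white")
with torch.no_grad():
    pixels = processor(images=figure, return_tensors="pt").pixel_values
    img_feat = F.normalize(model.get_image_features(pixel_values=pixels), dim=-1)

# Prompt template containing the pseudo-word (template is an assumption).
prompt = processor(text=["a scholarly figure of sks"], return_tensors="pt", padding=True)

# Plain SGD: rows whose gradient is zeroed never move.
opt = torch.optim.SGD([emb.weight], lr=0.1)
row_mask = torch.arange(emb.weight.shape[0]) != pseudo_id

for step in range(200):
    txt_feat = F.normalize(
        model.get_text_features(
            input_ids=prompt.input_ids, attention_mask=prompt.attention_mask
        ),
        dim=-1,
    )
    loss = 1.0 - (txt_feat * img_feat).sum(-1).mean()  # cosine alignment loss
    opt.zero_grad()
    loss.backward()
    emb.weight.grad[row_mask] = 0.0  # update only the pseudo-token row
    opt.step()
```

After optimization, the prompt containing the pseudo-word acts as a text-side proxy for the image, so image content can be handled by downstream text-based components.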
• Purpose: The paper introduces MetA-MARC, a Metadata-Driven Attention-Based Multimodal Academic Research Classifier, designed to enhance scholarly article classification by integrating text and image modalities. The framework addresses limitations of traditional content-based and metadata-driven approaches, focusing on semantic and contextual accuracy.
• Key Points: CLIP-based learning leverages the pretrained CLIP model for cross-modal similarity, enabling robust integration of text and image features; textual inversion maps images into pseudo-word tokens in the CLIP embedding space, creating semantically rich multimodal representations; FusionWeave combines features using concatenation, cross fusion, and attention-based techniques; MoDAR (Modality-Driven Adaptive Re-weighting) dynamically prioritizes relevant features across modalities (see the fusion sketch after this list).
• Major Contributions: Proposes a novel multimodal classification framework integrating metadata, OCR-extracted text, and image features; demonstrates significant performance improvement, achieving classification accuracies of 0.86 (JUCS), 0.84 (ACM), and 0.8848 (CompScholar); surpasses state-of-the-art methodologies by up to 4.05% in accuracy.
• Novelty: First to utilize textual inversion for enhancing multimodal feature alignment in document classification tasks; introduces a robust fusion strategy through FusionWeave and MoDAR, enabling adaptive multimodal feature representation.
• Outcomes: Validates MetA-MARC on three datasets, demonstrating its robustness and flexibility for multimodal academic classification; captures nuanced semantic information, establishing it as a powerful tool for automated scholarly article categorization.
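The abstract names FusionWeave and MoDAR but does not specify their architecture. The sketch below illustrates the general idea under stated assumptions: three CLIP-space feature vectors (metadata, OCR text, image/pseudo-token) are mixed by cross-attention and re-weighted by a learned per-modality gate before classification. The class name FusionWeaveSketch, the 512-d feature size, the softmax gate, and num_classes=7 are illustrative choices, not the paper's design.

```python
import torch
import torch.nn as nn

class FusionWeaveSketch(nn.Module):
    """Hypothetical fusion head: concatenation + cross-attention + a
    MoDAR-style per-modality gate. Dimensions are assumptions."""

    def __init__(self, dim: int = 512, num_classes: int = 7, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Gate maps the concatenated features to one weight per modality.
        self.gate = nn.Sequential(nn.Linear(3 * dim, 3), nn.Softmax(dim=-1))
        self.classifier = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, num_classes)
        )

    def forward(self, meta, ocr, img):
        # meta, ocr, img: (batch, dim) CLIP-space features for metadata,
        # OCR-extracted text, and the figure/pseudo-token embedding.
        tokens = torch.stack([meta, ocr, img], dim=1)       # (B, 3, dim)
        fused, _ = self.cross_attn(tokens, tokens, tokens)  # cross-modal mixing
        weights = self.gate(tokens.flatten(1))              # (B, 3) modality weights
        reweighted = fused * weights.unsqueeze(-1)          # adaptive re-weighting
        return self.classifier(reweighted.flatten(1))       # class logits

# Usage with random tensors standing in for CLIP embeddings:
head = FusionWeaveSketch()
logits = head(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 7])
```

The gate makes the modality weighting input-dependent, which is the behavior the abstract attributes to MoDAR: an article whose figures are uninformative can lean on metadata and OCR text, and vice versa.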
ISSN: 0950-7051
DOI: 10.1016/j.knosys.2025.113896