ScImage reveals how AI models like GPT-4o excel in some tasks but struggle with spatial reasoning, highlighting the challenges of turning text into scientifically accurate images.
Image Credit: Shutterstock AI—Prompt (Some circles are different sizes, and the largest one is in black). Study: ScImage: How Good Are Multimodal Large Language Models at Scientific Text-to-Image Generation?
In an article recently submitted to the arXiv* server, researchers introduced ScImage, a benchmark to evaluate the ability of multimodal large language models (LLMs) to generate scientific images from textual descriptions. Using diverse input languages, they assessed models such as the generative pre-trained transformer GPT-4o, Llama, and Stable Diffusion on spatial, numeric, and attribute comprehension.
The authors highlighted limitations in generating accurate scientific images, especially for complex prompts, despite some models performing well with more straightforward tasks. Notably, spatial reasoning emerged as the most significant challenge across all models.
Illustration of scientific text-to-image generation. The text shown below is the generation query. Images on the left meet the expectations for general text-to-image tasks, while those on the right highlight the specific requirements of scientific image generation. All figures are from our ScImage experiments.
Background
Artificial Intelligence (AI) has significantly advanced scientific research, aiding tasks from literature reviews to generating research ideas. While tools like Elicit and Grammarly streamline workflows, AI models such as The AI Scientist demonstrate capabilities for end-to-end research generation.
Despite these achievements, AI-driven scientific visualization—a cornerstone of scientific communication—remains underexplored. Existing models struggle with precise spatial, numeric, and attribute representation essential for scientific accuracy.
Benchmarks such as T2I-CompBench, a comprehensive benchmark for open-world compositional text-to-image generation, focus on general image generation but overlook scientific-specific requirements.
To address this gap, the researchers introduced ScImage, a comprehensive benchmark for assessing how well multimodal LLMs generate scientific images from text. The work evaluated models such as GPT-4o, Llama, and DALL-E across languages, output formats, and key comprehension dimensions.
ScImage's dataset construction leveraged a structured approach, incorporating a dictionary and query templates to ensure coverage of diverse scientific tasks. These included specific challenges like generating 3D shapes, bar charts, and parabolic trajectories.
Human evaluation of around 3,000 images highlighted performance gaps in scientific visualization. By establishing standardized metrics, ScImage provided a critical foundation for advancing AI-driven scientific image generation.
ScImage Evaluation Framework
The ScImage framework assessed multimodal LLMs’ ability to generate scientific graphs from textual descriptions, focusing on spatial, numeric, and attribute understanding. Prompts tested the models’ capacity to interpret spatial relationships, numeric details, and attribute binding. Models generated graphs either directly from text or via intermediate code. Pilot tests optimized prompt design for consistent output across models.
Illustration of the three understanding dimensions. The first row shows the individual dimensions of Attribute, Numeric, and Spatial understanding. The second row illustrates the combination of two or three dimensions.
Dataset construction integrated insights from scientific datasets and benchmarks, emphasizing objects' attributes, spatial relations, and numeric requirements.
A structured dictionary and query templates ensured diverse prompts covering two-dimensional (2D)/three-dimensional (3D) shapes, graphs, charts, and more. The dataset included 101 templates and 404 queries, incorporating varied complexities and combinations.
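To make this construction concrete, the snippet below is a minimal sketch, not the authors' code, of how a query template combined with a small dictionary of objects, attributes, and relations could be expanded into prompts. The template wording and dictionary entries are illustrative assumptions only.

```python
# Minimal sketch (not the authors' code) of expanding a query template
# over a small dictionary of objects, attributes, and spatial relations.
from itertools import product

template = "Draw {count} {color} {shape}s arranged {relation} a {reference}."

dictionary = {
    "count": ["two", "three"],                 # numeric dimension
    "color": ["red", "blue"],                  # attribute dimension
    "shape": ["circle", "square"],             # attribute dimension
    "relation": ["above", "to the left of"],   # spatial dimension
    "reference": ["black triangle"],
}

# Expand the template over every combination of dictionary entries.
keys = list(dictionary)
queries = [
    template.format(**dict(zip(keys, values)))
    for values in product(*dictionary.values())
]

for q in queries[:3]:
    print(q)
# e.g. "Draw two red circles arranged above a black triangle."
```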
Evaluation combined human and automatic assessments. Human evaluators, including domain experts, rated correctness, relevance, and scientific style on a 1–5 scale. Agreement scores (Spearman’s correlation, Pearson’s correlation, and weighted Kappa) ranged from 0.52 to 0.80, reflecting high reliability. However, automatic metrics, with a maximum Kendall correlation of 0.26, showed limited alignment with human evaluations, emphasizing the necessity of human assessment.
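The agreement statistics named above can be computed in a few lines. The sketch below uses made-up 1–5 ratings from two hypothetical annotators and assumes quadratic weighting for the kappa, which may differ from the authors' exact setup.

```python
# Agreement statistics between two hypothetical annotators' 1-5 ratings;
# requires scipy and scikit-learn. The ratings here are illustrative only.
from scipy.stats import spearmanr, pearsonr, kendalltau
from sklearn.metrics import cohen_kappa_score

rater_a = [5, 4, 3, 4, 2, 5, 1, 3]   # illustrative correctness scores
rater_b = [4, 4, 3, 5, 2, 5, 2, 3]

rho, _ = spearmanr(rater_a, rater_b)            # rank correlation
r, _ = pearsonr(rater_a, rater_b)               # linear correlation
kappa = cohen_kappa_score(rater_a, rater_b,     # weighted kappa
                          weights="quadratic")  # (weighting scheme assumed)
tau, _ = kendalltau(rater_a, rater_b)           # used for metric-vs-human comparison

print(f"Spearman={rho:.2f} Pearson={r:.2f} Kappa={kappa:.2f} Kendall={tau:.2f}")
```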
Multilingual evaluations revealed that Chinese prompts achieved higher relevance and scientificness than English, while Farsi had the lowest performance across all models.
Experiment and Analysis
The authors evaluated various models for image generation across two modes: direct text-to-image (DALL·E, Stable Diffusion) and text-to-code-to-image (GPT-4o, Llama 3.1 8B, AutomaTikZ). Models generating images via code demonstrated superior performance in scientific and structured contexts but faced higher compilation error rates.
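As a rough illustration of the code-mediated mode, the sketch below assumes a hypothetical generate_code(prompt) helper that returns Python plotting code from an LLM; the exact prompting, sandboxing, and error handling used in ScImage are not specified here, and appending a savefig call is an assumption.

```python
# Hedged sketch of a text-to-code-to-image loop. `generate_code` is a
# hypothetical stand-in for an LLM call that returns plotting code.
import os
import subprocess
import tempfile

def render(code: str, out_png: str, timeout: int = 60) -> bool:
    """Run model-generated plotting code in a subprocess; True if it ran cleanly."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        # Assumes matplotlib-style code; append a save call so a figure lands on disk.
        f.write(code + f"\nimport matplotlib.pyplot as plt\nplt.savefig({out_png!r})\n")
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return result.returncode == 0  # failures feed the compilation-error rate
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.remove(path)

# failures = sum(not render(generate_code(q), f"img_{i}.png")
#                for i, q in enumerate(queries))
```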
GPT-4o excelled across correctness, relevance, and scientific style, significantly outperforming the others, particularly in the text-to-code-to-image mode. However, it still scored below 4.0, indicating room for improvement. Compilation errors affected scores, particularly for Llama 3.1 8B, which had a high failure rate (28% of prompts). AutomaTikZ showed the best compilation success rate.
Specific challenges included generating accurate parabolic trajectories, properly placing liquid in containers, and correctly positioning objects at specific angles on slopes. Models struggled most with spatial understanding. GPT-4o maintained the highest correctness scores for attributes but showed weaknesses in tasks requiring combined numeric, spatial, and attribute understanding.
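For context, the snippet below shows the kind of Python (matplotlib) code a code-generating model would need to emit for a parabolic-trajectory query; the physical parameters are arbitrary illustrative choices, not values from the paper.

```python
# Illustrative example of code a model might need to produce for a
# "parabolic trajectory" query; parameters (v0, angle, g) are arbitrary.
import numpy as np
import matplotlib.pyplot as plt

v0, angle_deg, g = 20.0, 45.0, 9.81          # launch speed (m/s), angle, gravity
theta = np.radians(angle_deg)
t_flight = 2 * v0 * np.sin(theta) / g        # time until the projectile lands
t = np.linspace(0, t_flight, 200)

x = v0 * np.cos(theta) * t                   # horizontal position
y = v0 * np.sin(theta) * t - 0.5 * g * t**2  # vertical position (parabola)

plt.plot(x, y)
plt.xlabel("Horizontal distance (m)")
plt.ylabel("Height (m)")
plt.title("Projectile trajectory")
plt.savefig("trajectory.png")
```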
Visual models like DALL·E and Stable Diffusion underperformed in numerical understanding but excelled in real-life object modeling and 3D shapes.
GPT-4o performed well across multiple languages, with Chinese prompts achieving higher relevance and scientificness than English.
OpenAI o1-preview surpassed GPT-4o in English but lagged in non-English languages. Python code output was superior to TikZ across correctness, relevance, and scientific style, likely due to better training resources. Lower compile error rates also favored Python.
Comparison of Generated Images by Different Models from Various Prompts: Each column in this table presents the side-by-side comparison of images generated by the different models in response to the corresponding prompts. Results for each prompt are shown in each row, demonstrating the diversity of model approaches and styles. Image outputs for the first four columns are generated through the models’ code, while the others are generated directly by prompting the models.
Conclusion
In conclusion, the researchers introduced ScImage, a benchmark for evaluating multimodal LLMs in generating scientific images from textual descriptions.
The authors assessed models like GPT-4o, Llama, and DALL-E across spatial, numeric, and attribute comprehension, highlighting performance gaps, particularly with complex prompts.
Models generating images via code (such as GPT-4o and Llama 3.1) performed better in structured contexts but faced higher compilation error rates. GPT-4o outperformed the others in correctness and scientific style, though it still showed weaknesses in spatial and numeric tasks.
Future improvements should target spatial reasoning, compilation reliability, and stronger open-source alternatives, which currently lag behind proprietary models like GPT-4o. Overall, the research emphasized that closing these gaps, particularly in spatial reasoning and consistency, is essential to advancing AI-driven scientific image generation.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Journal reference:
Preliminary scientific report. Zhang et al. (2024). ScImage: How Good Are Multimodal Large Language Models at Scientific Text-to-Image Generation? arXiv. DOI: 10.48550/arXiv.2412.02368, https://arxiv.org/abs/2412.02368