nature.com

DeepJiandu Dataset for Character Detection and Recognition on Jiandu Manuscript

AbstractIn ancient China, bamboo and wooden slips, known as “Jiandu”, were the primary mediums for recording historical events before the invention of paper. These artifacts are rich in historical data and cultural significance. Accurate identification of characters on Jiandu is essential for deciphering the historical narratives they contain and plays a vital role in processing Jiandu manuscripts. In this study, we introduce the DeepJiandu dataset, specifically designed for the detection and recognition of Jiandu characters. The dataset comprises 7,416 images annotated with 99,852 characters across 2,242 categories. It addresses a variety of complex challenges encountered in Jiandu character recognition, including character degradation, diverse layouts, and variable forms and shapes. The authenticity and reliability of the DeepJiandu dataset render it an effective tool for training and evaluating models geared towards Jiandu character recognition, thereby streamlining the research and organization of Jiandu information.

Background & SummaryBefore the widespread adoption of paper in ancient China, inscribing information on wooden or bamboo slips—collectively referred to as “Jiandu”—constituted the predominant means for documentation1. Archaeological research reveals that the use of Jiandu was especially prominent during the Warring States, Qin, Han, and Wei-Jin periods, spanning from the mid-5th century BCE to the early 4th century CE. These artifacts are a chronological continuation of earlier oracle bone and bronze inscriptions, and they set the stage for the advent of paper manuscripts and engraved texts. The emergence of Jiandu significantly influenced the development and dissemination of Chinese civilization, marking it as a crucial element in both historical and cultural contexts. Jiandu manuscripts, as a vital category of unearthed cultural heritage, hold significant academic and historical value.Nonetheless, the preservation and scholarly examination of Jiandu manuscripts encounter significant obstacles. The materials of Jiandu are inherently fragile and prone to rapid degradation and ink loss, exacerbating the deterioration of the characters inscribed on them. Furthermore, the ropes traditionally used to bind these slips into volumes frequently sustain damage during excavation. Conventional organizational methods are often compromised by disarray and widespread dispersal of materials, complicating the management of Jiandu collections2. Consequently, digitization emerges as a critical strategy for preserving, inheriting, and disseminating this cultural heritage. By digitizing Jiandu manuscripts, advanced computer vision and artificial intelligence techniques can be applied to perform semantic analyses, enhancing the ability of scholars to accurately interpret these ancient texts. This digitization process not only facilitates a deeper understanding of historical narratives but also promotes the study of connections among historical documents.Presently, Jiandu manuscripts unearthed across China exhibit wide geographical distribution, with those from Gansu Province meriting particular attention due to their abundance, relatively good preservation, and the breadth of cultural information they encapsulate. These manuscripts encompass various dimensions of ancient Chinese life during the Qin and Han dynasties (From 221 BCE to 220 CE), including filial piety, religious practices, transportation, culinary traditions, and medicinal knowledge, thus providing a vivid portrayal of the society’s diverse cultural facets during that era. Gansu Province, notable for its rich trove of Jiandu, especially from the Han dynasty, is recognized for having the most substantial collection of high-quality Jiandu manuscripts in the country3. This distinction underscores its pivotal role in research related to Jiandu of the Han dynasty. Consequently, our research team, including principal members from Northwest Normal University and the Gansu Jiandu Museum in Gansu Province, is dedicated to the digitization of these Jiandu artifacts. This initiative aims to safeguard and perpetuate this profound historical and cultural legacy. Some of the digitized Jiandu images are illustrated in Fig. 1.Fig. 1Sample of Jiandu manuscript.Full size imageThe automation of character writing region detection and information extraction in Jiandu research is invaluable. This technology is pivotal for deciphering ancient history and accessing the rich historical data preserved within Jiandu manuscripts. Such processes not only bridge contemporary and ancient scholarly knowledge but also unlock historical mysteries and unearth hidden information from these ancient texts. However, as the primary writing medium in ancient China, Jiandu developed diverse physical materials and forms under different writing environments and historical contexts, endowing Jiandu manuscripts and their characters with the following characteristics:

Indistinct Characters. Historical wear and suboptimal preservation conditions often lead to Jiandu manuscripts exhibiting various degrees of damage, such as cracks, stains, erosion from environmental factors or even intentional damage. These conditions result in blurred or partially incomplete characters, as depicted in Fig. 2(a).Fig. 2Characteristics of Some Jiandu and Jiandu Characters.Full size image

Diverse Layouts. The arrangement of characters within Jiandu manuscripts varies significantly, as illustrated in Fig. 2(b). While some layouts are straightforward, enhancing the ease of character localization, others are intricate, with varying character sizes, complicating the localization process.

Significant Character Variability. There is a considerable diversity in the complexity of strokes and structural differences from contemporary scripts within Jiandu manuscripts. Characters within the same category can show substantial variations in form and structure, as shown in Fig. 2(c). Furthermore, characters from different categories may appear remarkably similar, such as “八,” “人,” and “入,” making them challenging to distinguish, as shown in Fig. 2(d).

Various Writing Styles. The individual styles of scribes introduce significant variation in how characters are inscribed on Jiandu manuscripts. Some characters are written in a clear and well-spaced manner, facilitating their recognition. In contrast, others are inscribed in a flatter and more compact style, which poses challenges for recognition, as shown in Fig. 2(e).

Given the aforementioned characteristics, the detection and recognition of Jiandu characters are fraught with substantial difficulties. The indistinct script, extensive variations in shape, disparate layout styles, and diverse forms of characters considerably complicate the task of identifying and interpreting Jiandu text. Traditionally, these activities have depended heavily on the expertise of trained scholars to manually read, interpret, and annotate Jiandu manuscripts. This process is both time-consuming and expensive.With advancements in artificial intelligence and deep learning technologies, the use of deep learning for intelligent information extraction and character detection and recognition in Jiandu manuscripts is becoming increasingly significant and innovative. This method enhances experts’ ability to comprehend and interpret these ancient texts. Nevertheless, deep learning relies heavily on well-annotated datasets for supervised training. Despite possessing a substantial collection of physical Jiandu manuscripts, the two largest Jiandu museums in China—the Gansu Jiandu Museum and the Changsha Jiandu Museum—face challenges due to the lack of systematic planning and platform development, leading to incomplete digitization and data processing of these manuscripts. This task necessitates not only the expertise of Jiandu specialists but also the involvement of computer science professionals.Currently, there is no comprehensive public Jiandu dataset available, highlighting the critical need for such a resource. Research on historical character datasets has progressed significantly. For instance, the Mongolian Handwritten Character dataset MOLHW4 supports character recognition, character generation, and signature verification. The Oracle-MNIST5 dataset aids in preserving and understanding oracle bone inscriptions, contributing to the broader study of ancient civilizations. The Swedish Historical Handwritten Character dataset CArDIS6 enhances OCR system performance, while the DeepLontar7 dataset facilitates the training and evaluation of character detection in Balinese palm-leaf manuscripts. These datasets have been extensively curated, published, and researched, providing essential support for scholarly endeavors. Nonetheless, the digitization and intelligent analysis of Jiandu manuscripts are still nascent, primarily due to the absence of a dedicated Jiandu dataset. Although some research institutions have curated Jiandu collections8, providing valuable materials for academic research, these resources are limited in scope and quantity. Moreover, while some annotations exist, they have not been organized into datasets suitable for deep learning applications. Consequently, creating a comprehensive Jiandu character dataset is necessary and valuable for advancing research in this field.In order to address the complexities of Jiandu character detection and recognition, support Jiandu collation research, and accelerate progress in Jiandu studies, this paper draws on existing experience in building related text datasets to construct the DeepJiandu dataset-a Jiandu character dataset specifically designed for character detection and recognition, thereby promoting the digitization of Jiandu research.Due to the significant information loss encountered when photographing Jiandu manuscripts with standard RGB cameras, we employed an advanced hyperspectral imaging technique to enhance the fidelity of character restoration. Specifically, visible light photography was used for RGB image capture, and infrared scanning was performed to obtain infrared digital images, as shown in Fig. 3. Because bamboo and wood substrates, as well as the ink used in Jiandu, exhibit unique absorption properties for infrared light, infrared imaging more distinctly reveals the details of the characters. Consequently, we have selected infrared images to constitute the DeepJiandu dataset, which includes 7,416 expert-validated, annotated infrared images. This dataset was meticulously assembled through a series of processes, including image acquisition, character annotation, and validation by Jiandu experts to ensure its credibility and recognition within the Jiandu scholarly community.Fig. 3Example of RGB images and infrared images of Jiandu.Full size imageFigure 4 shows the hyperspectral equipment used to capture Jiandu images and the process of obtaining Jiandu hyperspectral images. The hyperspectral scanner can cover visible light, near-infrared, and partial infrared bands, capturing spectral information beyond the range visible to the human eye. This high-resolution imaging technology enables detailed examinations of Jiandu artifacts, allowing in-depth detection of subtle chemical changes and structural features on the surface of Jiandu, thereby assisting in the recovery and reconstruction of deteriorated characters. Through collaboration with Jiandu cultural heritage conservation units and Jiandu experts, we scanned over 10,000 Jiandu artifacts, obtaining 7,416 Jiandu infrared digital images, which serve as the image data of DeepJiandu dataset.Fig. 4Acquisition of Jiandu manuscripts using hyperspectral camera.Full size imageThe collected Jiandu infrared digital images require annotations for character recognition, which enables semantic analysis through artificial intelligence in downstream applications. Unlike modern standardized Chinese characters, a portion of Jiandu characters omit certain strokes from the standard forms, resulting in partially simplified shapes. Each Jiandu character exhibits slight variations, and these characters are handwritten, introducing significant variability that poses challenges for character recognition. All characters within the Jiandu digital images have been annotated accordingly.To accurately interpret Jiandu characters and analyze the semantic information they convey, it is essential to first locate and detect the regions containing Jiandu characters and subsequently recognize the characters within these regions. Thus, annotations must include both the characters’ positional and semantic information. Figure 5 displays examples of these annotated Jiandu images. The annotation process utilized the LabelImg tool to mark and classify the character positions within the images. We collaborated with Jiandu experts to ensure the accuracy and professionalism of character recognition. Their expertise was employed to guide the annotation of characters in the digital images.Fig. 5Jiandu character annotation on Jiandu manuscript using LabelImg.Full size imageTo assess the suitability of the proposed dataset for deep learning applications, we utilized deep neural networks to evaluate the DeepJiandu dataset. The testing demonstrated promising detection performance and reasonable categorization accuracy across various character classes. Figure 6 shows an example of the character detection results achieved using the DeepJiandu dataset.Fig. 6Example of the detection results on Jiandu manuscripts.Full size imageMethodsThe development of the dataset was structured into five key stages, as illustrated in Fig. 7. These stages include data collection, data preprocessing, data annotation, data partitioning, and data validation.Fig. 7Overview of the processing steps to generate DeepJiandu dataset.Full size imageIn the initial stage, we employed hyperspectral and infrared imaging technologies to scan Jiandu artifacts and acquire Jiandu image data. In collaboration with Jiandu cultural heritage preservation experts, we scanned over 10,000 Jiandu artifacts, ultimately obtained 7,416 Jiandu infrared digital images. These images capture the unique spectral characteristics of the Jiandu manuscripts, providing a foundational dataset for further analysis.In the second stage, we focused on ensuring the high quality of the dataset through comprehensive data preprocessing. This stage involved several key procedures designed to enhance the clarity of the characters in the collected images:

Data Filtering. Initial steps included the detection and removal of any anomalous images, such as those that were damaged or deemed invalid. This was essential to maintain the integrity and quality standards of the dataset.

Data Cleaning. Each digital image was cropped to remove irrelevant background regions and focus on the Jiandu manuscript content. This process ensured that the dataset maintained a consistent focus on the essential features while preserving the original image resolution.

Noise Reduction. A mild Gaussian smoothing filter (kernel size: 3 × 3, σ = 0.6) was applied to reduce excessive sensor and environmental noise in the digital infrared images. By selecting a relatively small kernel and sigma, potential distortion of character edges was minimized while improving overall image clarity.

Image Enhancement. Adjustments were made to contrast, brightness, and other relevant image parameters. These enhancements were critical for bringing out the detailed features of the characters within the images, thereby aiding in their recognition and analysis.

The above steps resulted in a set of standardized Jiandu infrared images, ready for further analysis. Although these preprocessing steps may slightly modify the artifacts’ original appearance, they are essential for ensuring that the critical features of the characters are more discernible, ultimately improving the accuracy of character detection and recognition. These adjustments allow the dataset to provide clearer and more reliable input for downstream analysis while maintaining a balance between historical authenticity and practical usability.The unique properties of Jiandu characters, such as blurriness, distortion, and the presence of missing strokes, pose significant challenges in character recognition. Additionally, the character sample distribution within the dataset is highly imbalanced, with the maximum imbalance ratio reaching 2097:1. Despite these challenges, our preprocessing efforts have successfully enhanced the ink features on Jiandu manuscripts and improved image contrast, thus highlighting crucial character information. These improvements have substantially contributed to enhancing the average precision of our character detection models, showcasing the effectiveness of our preprocessing strategies in addressing the complexities inherent in Jiandu character analysis.In the third stage of our study, we employed the LabelImg tool for annotating characters within Jiandu manuscripts, as depicted in Fig. 5. Jiandu manuscripts exhibit complex handwriting, diverse writing styles, and significant differences from modern characters, making it difficult for the general public to accurately recognize. Therefore, to determine the interpretation of each character, we collaborated with experienced Jiandu scholars to ensure that each character’s corresponding information could be confirmed. We assembled a specialized annotation team consisting of veteran Jiandu experts and annotators with both extensive Jiandu text interpretation experience and computer expertise, enabling them to accurately recognize and annotate Jiandu text. Under the guidance of these experts, the team developed detailed annotation guidelines and procedures. Each character’s annotation underwent multiple expert reviews to guarantee accuracy and consistency. Bounding boxes were used to label the text within Jiandu images as ground truth, ultimately annotating 99,852 Jiandu characters across 2,242 categories. These annotations record each character’s spatial location and text category in the Jiandu images. By annotating the text on Jiandu images, we produced a new Jiandu text dataset specifically for Jiandu character detection and recognition, called DeepJiandu.In the fourth stage, we organized the DeepJiandu dataset, which comprises 7,416 annotated and expert-validated infrared images, into training, testing, and validation sets. Initially, we classified the Jiandu digital images based on criteria such as clarity and character layout. This categorization yielded 6,076 images with clear Jiandu characters and 1,340 images with blurry characters. Further subdivision of the clear character images resulted in 4,832 single-column, 588 double-column, and 656 complex layout images. Similarly, the blurry character images were divided into 1,101 single-column, 122 double-column, and 117 complex layout images. To ensure a balanced and scientifically valid distribution, we adopted a stratified sampling strategy with an 8:1:1 ratio for the training, testing, and validation sets. This approach helped maintain a consistent distribution of categories across each subset, effectively reducing the risk of biases such as model overfitting or underfitting to specific categories. As a result, we compiled a training set comprising 5,922 images, a testing set of 743 images, and a validation set of 751 images. This partitioning facilitates robust model training and evaluation, enhancing the reliability of our findings in Jiandu character recognition studies.To further illustrate the dataset’s composition and the challenges it poses for character detection and recognition, we present Figs. 8, 9, which analyze the distribution of image resolution, and aspect ratios.Fig. 8Image Resolution and Aspect Ratio Distributions.Full size imageFig. 9Image Dimensions and Aspect Ratio Analysis.Full size imageFigure 8 provides a comprehensive overview of the image resolution and aspect ratio distributions in the dataset. Figure 8(a) presents the histogram of image widths, showing that most images have widths between 150 and 400 pixels, with a peak around 200 pixels. This indicates that the majority of Jiandu manuscripts are relatively narrow. However, some images extend beyond 600 pixels, suggesting the presence of wider manuscript layouts. Figure 8(b) illustrates the distribution of image heights, which, unlike width distribution, spans a broader range with two prominent peaks around 1000 pixels and 3200 pixels. This highlights the coexistence of moderately tall and extremely tall Jiandu manuscripts. Figure 8(c) depicts the histogram of aspect ratios, revealing that most images have an aspect ratio below 0.5, reaffirming that Jiandu manuscripts are predominantly tall and narrow. However, a small subset of images exhibits aspect ratios closer to or exceeding 1.0, suggesting horizontally arranged texts or manuscript fragments.Figure 9 examines the relationship between aspect ratios and image dimensions to provide further insights into the dataset’s structural variability. Figure 9(a) presents a scatter plot illustrating the correlation between image widths and aspect ratios. Most data points are concentrated in the lower left, confirming that Jiandu images are typically narrow with low aspect ratios. However, some outliers exhibit greater widths and higher aspect ratios, indicating that certain images deviate from the predominant vertical format. Figure 9(b) visualizes the relationship between image heights and aspect ratios, reinforcing the observation that most images have low aspect ratios while exhibiting substantial height variations. The densely packed points in the lower range indicate that while many images share similar aspect ratios, their absolute heights can vary significantly.These visual analyses provide a deeper understanding of the dataset’s structural characteristics. The predominant features, such as narrow widths, tall heights, and low aspect ratios, introduce unique challenges for deep learning-based character detection and recognition.Following this analysis, we present comprehensive statistical evaluations in Tables 1–4, including image distribution by clarity and layout, character category frequencies, and bounding box dimension statistics.Table 1 Image Composition and Data Splits.Full size tableTable 2 Character Category Distribution.Full size tableTable 3 Bounding Box Dimension Statistics per Category.Full size tableTable 4 Global Statistics of Bounding Box Sizes.Full size tableTable 1 provides an overview of the dataset composition and how it is split into training, validation, and testing sets. We categorize the images into two principal groups (clear vs. blurry), further subdividing them by layout type (single-column, double-column, complex). This tabulation facilitates a clear understanding of the dataset’s overall structure, ensuring a balanced distribution across subsets for more robust model training and evaluation.Table 2 outlines the frequency distribution of each character category in the dataset. For each of the 2,242 categories, we list the number of images in which it appears and the total number of bounding boxes. For frequently occurring categories, a single image may contain multiple bounding boxes of the same character, which can result in a bounding box count higher than the number of images containing that category. In contrast, for rare categories, these two values are often the same because each category appears only once per image. Notably, certain categories (e.g., “□”, “月”, and “十”) occur in numerous images, indicating a higher frequency of appearance, whereas others appear extremely rarely (e.g., “媼”, “螯”, “獒”), illustrating a long-tail distribution. The “□” category specifically denotes unidentifiable characters on the Jiandu manuscripts, ensuring that every character is accounted for even if its structure or meaning cannot be deciphered due to damage or illegibility. This imbalance could pose challenges for recognition models, highlighting the need for potential class rebalancing strategies or data augmentation in future studies.Table 3 illustrates the range of bounding box dimensions for each character category, including the minimum and maximum height, width, and area observed. The substantial variations (e.g., ‘年‘ spanning up to 1129 pixels in height, whereas many other characters remain below 300 pixels) highlight the multi-scale nature of Jiandu manuscripts. This dimension-based analysis underscores the importance of employing detection models with robust multi-scale capabilities to handle both small, intricate characters and larger, more dominant ones.Table 4 summarizes the global bounding box statistics across all annotated characters in the DeepJiandu dataset. While the average height and width hover around 115 and 129 pixels respectively, the standard deviations are notable, reflecting the significant variability in character size. The multi-scale challenge is further confirmed by the minimum and maximum area, which spans from around 1.6k to over 139k pixels2.The final stage is data validation. We conducted a comprehensive evaluation of various character detection and recognition algorithms applied to the DeepJiandu dataset. For character detection, the experiments were performed on an NVIDIA GeForce RTX 3080 Ti with 12 GB of memory (CUDA 11.3) in a Python 3.8 environment with Torch 1.11.0 + cu113. Each detection model was trained for 40 epochs.For character recognition, the experiments were implemented in Python using the PyTorch framework on an NVIDIA GeForce RTX2080 SUPER GPU with 8 GB of memory. The extracted character images were resized to 128 × 128 pixels and processed using the AdamW optimizer (β1 = 0.9, β2 = 0.99) with an initial learning rate of 1 × 10−4, a minimum learning rate of 1 × 10−7, and a CosineAnnealingLR schedule with a 5-epoch warmup. A batch size of 64 was used, and the network was trained for 150 epochs. The performance outcomes of these assessments are detailed in Tables 5, 6.Table 5 The results of DeepJiandu on various character detection models.Full size tableTable 6 The results of DeepJiandu on various character recognition models.Full size tableWe selected a suite of state-of-the-art character detection models from recent years for testing. DBNet9, proposed in 2020, enhances character boundary accuracy through differentiable binarization, particularly excelling in detecting irregular characters in complex scenes. DBNet++10, an extension of DBNet proposed in 2021, further optimizes the architecture or post-processing strategies, improving detection accuracy and speed. Mask-RCNN11, introduced in 2017, incorporates instance segmentation, enabling precise pixel-level detection of character regions, suitable for scenarios demanding high precision. EAST12, introduced in 2017, is a single-stage end-to-end detector that simplifies the process and enhances efficiency, suitable for rapid character detection applications. FCENet13, proposed in 2020, innovatively uses Fourier descriptors to handle curved characters, significantly improving the detection capability of curved characters. PSENet14, introduced in 2019, proposes progressive scale expansion, effectively addressing characters of different sizes, particularly excelling in detecting small characters. PANet15, proposed in 2019, introduces position attention mechanisms, enhancing the robustness of character recognition in complex backgrounds and effectively dealing with noise interference.Experimental results demonstrate that these models achieved promising outcomes on the DeepJiandu dataset. DBNet and DBNet++, leveraging pretrained models, delivered robust detection metrics with a recall of 83.37%, precision of 91.48%, and an F-measure of 86.17%. Conversely, PSENet, though designed for curved character scenes, showed lower efficacy on Jiandu characters, with an F-measure of just 72.39%.We selected several classic character recognition models to test the DeepJiandu dataset. ResNet-18/5016, introduced in 2015, resolved the training difficulty of deep networks by introducing residual connections, providing a powerful and easy-to-train foundational model for character recognition. In 2019, MobileNetV317 significantly reduced model size and computational requirements while maintaining high performance by introducing new features such as SE layers and the Hard Swish activation function, making it suitable for character recognition in resource-constrained environments. EfficientNetV218, launched in 2021, enhanced model efficiency with refined scaling strategies and improved MBConv blocks, achieving high efficacy in character recognition with smaller model footprints. CSPNet19, also from 2019, minimized redundant computations with its cross-stage partial network design, which facilitated faster convergence without sacrificing accuracy. EdgeNeXt20, proposed in 2020, optimized model structures to reduce the computational burden, targeting edge computing, making it suitable for efficient character recognition applications on edge devices, balancing performance and efficiency. ConvNeXt21, introduced in 2021, demonstrated state-of-the-art image recognition performance even without self-attention mechanisms, showing that well-designed convolutional networks can achieve advanced image recognition performance.The evaluation of these models on the DeepJiandu dataset revealed considerable variations in performance. CSPNet notably outperformed others, achieving a top-1 accuracy of 64.08% and a top-5 accuracy of 79.93%. Conversely, ResNet-18 struggled, managing only a top-1 accuracy of 58.82%, likely due to its relatively shallow architecture that may hinder adequate feature extraction. In general, these conventional models performed generally poorly on the task of recognizing Jiandu characters. This can be attributed to some challenges present in the Jiandu dataset, such as blurry, distorted, or incomplete samples. Additionally, the imbalanced distribution of samples in the Jiandu dataset could lead to poorer recognition performance for characters with fewer training samples, thereby affecting overall performance.In conclusion, the inherent complexities of the Jiandu dataset, characterized by blurry, distorted, and incomplete samples alongside an imbalanced distribution of samples, offer substantial opportunities and challenges for research. By strategically refining network architectures, employing more effective training methodologies, and leveraging pre-trained models, future research is poised to enhance both the accuracy and robustness of character detection and recognition tasks using the Jiandu dataset. This dataset is critically important for advancing the field of character recognition, providing a valuable resource for developing more sophisticated analytical tools and methodologies.Data RecordsThe DeepJiandu dataset is available on Science Data Bank22 and is freely accessible to the public. DeepJiandu comprises 7,416 images of Jiandu manuscripts, each paired with a corresponding text file that encapsulates annotation details in the VOC format. This dataset enriches the research community with over 99,852 expertly validated annotations of Jiandu characters. All files are named according to the following format:

BMP images: .bmp, for instance: 1.bmp

XML annotations: .xml, for instance: 1.xml

Annotation files format follows the VOC format, as follow: ImageFileName.bmp Unknown ImageWidth ImageHeight ImageDepth 0 where specifies the name of the image file, denotes the origin of the annotation data, provides dimensions of the annotated image, indicates whether the annotation has been segmented; typically, ‘0’ denotes unsegmented,