Embedded vision for panoptic perception
With the rapid evolution of deep learning, embedded panoptic vision has become a key enabler for intelligent perception in resource-constrained environments. Panoptic perception unifies object detection and segmentation, allowing machines to distinguish objects while understanding the overall scene structure. This ability is crucial in various applications, including robotics, smart surveillance, augmented reality, and industrial automation, where real-time and high-precision scene analysis enhances decision-making.
One of the most advanced applications of panoptic perception is autonomous driving. Modern vehicles equipped with Advanced Driver Assistance Systems (ADAS) and Autonomous Driving Systems (ADS) rely on a deep understanding of their surroundings to ensure safety and efficiency. By leveraging camera-based deep learning models, vehicles can detect objects, segment drivable areas, and recognize lane markings with high precision. This technology enables safer navigation in complex urban environments, handling dynamic obstacles, pedestrians, and varying road conditions.
Beyond road vehicles, panoptic perception is also attracting interest in railroad environments, where autonomous systems can enhance smart rail transport. However, research in panoptic vision for rail applications remains significantly less developed than in the automotive domain. Deep learning is anticipated to play a major role in various railway-specific domains over the medium to long term, including automated inventory, autonomous driving, predictive maintenance, and traffic flow optimization. Its adoption is expected to enhance line capacity, lower life-cycle costs, and increase reliability through predictive maintenance.
Real-time multi-task YOLO model for railway systems
Given the constraints of edge devices, achieving high precision while maintaining real-time performance remains a challenge, especially for ADAS and ADS, where sustaining a frame rate above 30 frames per second (FPS) is imperative. Numerous methods have been proposed for object detection and segmentation as individual tasks, many of which have achieved strong results. Integrating these computer vision tasks into a single model brings challenges, primarily due to the differing feature resolutions they require: segmentation operates at the pixel level, whereas object detection localizes objects at the bounding-box level, whether through one-stage or two-stage approaches. Despite this difference in focus, both tasks require an initial feature-extraction stage over the input image, so they can share a common YOLO-style backbone. Compared to using separate models for each task, integrating distinct necks and heads into a unified model with a shared backbone significantly saves computing resources and reduces inference time.
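To make the real-time budget concrete, the back-of-the-envelope sketch below contrasts running three separate task models against a shared-backbone design. All per-stage latencies are hypothetical illustrative numbers, not measurements from A-YOLOM.

```python
# Hedged sketch: why a shared backbone helps meet the 30 FPS budget.
# The per-stage latencies are hypothetical, for illustration only.

FPS_TARGET = 30
budget_ms = 1000 / FPS_TARGET  # ~33.3 ms available per frame

# Hypothetical per-stage latencies (ms) of one task-specific model:
backbone_ms, neck_ms, head_ms = 18.0, 4.0, 2.0

# Three separate models each re-run the backbone:
separate_total = 3 * (backbone_ms + neck_ms + head_ms)  # 72.0 ms

# A unified model runs the backbone once, duplicating only necks/heads:
shared_total = backbone_ms + 3 * (neck_ms + head_ms)    # 36.0 ms

print(f"Budget: {budget_ms:.1f} ms/frame")
print(f"Separate models: {separate_total:.1f} ms ({1000/separate_total:.1f} FPS)")
print(f"Shared backbone: {shared_total:.1f} ms ({1000/shared_total:.1f} FPS)")
```

Under these assumed numbers, the shared backbone roughly halves per-frame latency, which is exactly the kind of saving needed to approach the 30 FPS target on edge hardware.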
A-YOLOM is a lightweight multi-task model dedicated to the automotive environment. It integrates three tasks into a single unified model: vehicle detection, drivable-area segmentation, and lane-line segmentation. A-YOLOM adopts a single-stage network architecture, as illustrated in the figure below, with an encoder-decoder pipeline: the encoder extracts features from the input image, while the decoder generates predictions for the different tasks. The architecture consists of three main components: a shared backbone, a distinct neck for each task, and task-specific heads. This design lets the model handle multiple tasks simultaneously while reducing computational complexity and inference time.
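The encoder-decoder layout described above can be sketched as follows. The stage functions are placeholders standing in for the real backbone, necks, and heads; they are not actual A-YOLOM layers.

```python
# Structural sketch of A-YOLOM's forward pass: one shared encoder,
# three task-specific decoder branches. All functions are placeholders.

def backbone(image):
    # Shared encoder: extracts features once per frame.
    return {"features": f"feat({image})"}

def detection_branch(feats):
    return f"boxes_from({feats['features']})"

def drivable_branch(feats):
    return f"mask_from({feats['features']})"

def lane_branch(feats):
    return f"lanes_from({feats['features']})"

def a_yolom_forward(image):
    feats = backbone(image)   # shared backbone runs once
    return {                  # distinct necks/heads consume the same features
        "detection": detection_branch(feats),
        "drivable_area": drivable_branch(feats),
        "lane_lines": lane_branch(feats),
    }

out = a_yolom_forward("img0")
```

The key structural point is that all three branches consume the same feature dictionary produced by a single backbone call, rather than each re-encoding the image.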
A-YOLOM model
In order to perform similar tasks in railways, we opt for transfer learning to adapt the A-YOLOM model, originally pre-trained on an automotive dataset, to railroad environments. The early layers of the network, responsible for extracting general features, are frozen, while the layers dedicated to railway-specific elements are adjusted. The weight transfer involves fine-tuning A-YOLOM on the RailSem19 dataset. The aim is to segment rails, tracks, and poles, and to detect vehicles, pedestrians, signs, and signals. To the best of our knowledge, this is the first multi-task model designed specifically for railway environments. We opt for the nano version of A-YOLOM (A-YOLOM(n)) for its lightweight nature (approximately 4.43M parameters) and its real-time capability (39.9 FPS at a batch size of 1).
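The freezing strategy can be sketched as below. The layer names are hypothetical, and the pattern only loosely mirrors PyTorch's `named_parameters()`/`requires_grad` idiom; it is not the actual fine-tuning code.

```python
from dataclasses import dataclass

# Hedged sketch of the freezing step for transfer learning.
# Layer names are hypothetical illustrations, not real A-YOLOM modules.

@dataclass
class Param:
    name: str
    requires_grad: bool = True

def freeze_general_layers(params, frozen_prefixes=("backbone.",)):
    """Freeze early, general-feature layers; leave the task-specific
    necks and heads trainable for railway fine-tuning."""
    for p in params:
        if p.name.startswith(frozen_prefixes):
            p.requires_grad = False
    return params

model_params = [
    Param("backbone.stem.conv"),    # general features -> frozen
    Param("backbone.stage2.conv"),  # general features -> frozen
    Param("neck.seg_rails.conv"),   # railway-specific -> trainable
    Param("head.det.cls"),          # railway-specific -> trainable
]
freeze_general_layers(model_params)
trainable = [p.name for p in model_params if p.requires_grad]
```

After the call, only the neck and head parameters remain trainable, so fine-tuning on RailSem19 updates the railway-specific branches while preserving the general features learned on automotive data.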
In-domain and out-of-domain performances
In-domain experiments assess a model on data drawn from the same distribution or domain as the training data. Consisting of a held-out 20% split of the training set, the data used in these experiments closely matches the conditions, features, and patterns seen during training. The A-YOLOM(n) model achieves recall values (%) of 82.7, 74.1, 49.4, and 53.4 for the vehicle, pedestrian, sign, and signal detection classes, respectively. For segmentation, it obtains mIoU values (%) of 79.4, 90.7, and 66.3 for the rails, tracks, and poles classes.
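For reference, the two reported metrics are straightforward to compute from raw counts. The detection counts below are hypothetical values chosen only to illustrate the formulas; the per-class IoUs reuse the in-domain segmentation numbers above.

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN): fraction of ground-truth instances detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def iou(intersection, union):
    """Intersection over Union for a single class."""
    return intersection / union if union else 0.0

def mean_iou(per_class_ious):
    """mIoU: unweighted mean of the per-class IoU values."""
    return sum(per_class_ious) / len(per_class_ious)

# Hypothetical counts chosen only to reproduce a recall of 82.7%:
r = recall(tp=827, fn=173)            # 0.827

# Averaging the reported in-domain per-class IoUs (rails, tracks, poles):
m = mean_iou([0.794, 0.907, 0.663])   # ~0.788
```

Note that mIoU here is an unweighted average over classes, so a weak class (poles at 66.3%) pulls the mean down regardless of how many pixels it covers.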
In-domain results
Out-of-domain experiments evaluate a model on data drawn from a different distribution or domain than the one it was trained on. These experiments assess the model's ability to generalize to potentially more challenging data not covered by the original training set, and to handle real-world situations where conditions may not match the training environment. Specifically, we used the dataset of the Tunisian National Railway Company (Société Nationale des Chemins de Fer Tunisiens, SNCFT), collected using GoPro cameras mounted on the front end of trains operating on several railway lines. Since the SNCFT dataset is not annotated, we manually annotated a subset of images to obtain quantitative results. The A-YOLOM(n) model demonstrates recall (%) rates of 75.3, 71.9, 43.4, and 50.5 for detecting vehicles, pedestrians, signs, and signals, respectively. In terms of segmentation, it achieves mIoU (%) values of 75.3, 88.4, and 61.8 for the rails, tracks, and poles classes.
Out-of-domain results