SwishFormer for Robust Firmness and Ripeness Recognition of Fruits using Visual Tactile Imagery

1Khalifa University Center for Autonomous Robotic Systems (KUCARS), Khalifa University, UAE 2Department of Electrical, Computer and Biomedical Engineering, Abu Dhabi University, UAE

Avocado ripeness estimation using Hello Robot equipped with DIGIT sensor-based gripper

Abstract

The accurate assessment of fruit ripeness is a critical task in the agricultural industry, as it affects fruit quality, shelf life, and consumer satisfaction. Traditional methods for estimating fruit ripeness rely on subjective human judgment and invasive sampling techniques, which are both destructive and time-consuming. This paper presents a novel method for estimating the firmness and ripeness of fruits using their palpation motion encoded within visual tactile scans. These tactile scans are passed to the proposed SwishFormer model, coupled with a Random Forest head, to predict fruit firmness, which is then used to classify the fruit's ripeness stage. Unlike existing state-of-the-art models, SwishFormer employs the HardSwish activation as a token mixer, which allows it to generate a distinctive set of features from the candidate tactile scans. These rich feature representations are fed to the Random Forest regressor to robustly estimate fruit firmness values, which are in turn used to accurately predict the ripeness level of the fruits. Furthermore, SwishFormer is extensively evaluated on the proposed dataset of palpation visual tactile scans, where it outperforms state-of-the-art works by achieving 4.77%, 4.09%, 13.69%, and 4.65% better performance in terms of MSE, RMSE, R2, and MAE scores, while possessing 2.02 times fewer parameters and 2.09 times fewer GMACs. Additionally, the ripeness recognition performance of the proposed system is thoroughly tested through real-world experiments using a Stretch Robot, where it achieves success rates of 96.6%, 98.3%, and 93.3% for recognizing avocados as underripe, ripe, and overripe, respectively. To the best of our knowledge, this paper introduces the first non-destructive approach to estimate fruit firmness and ripeness using off-the-shelf vision-based tactile information.

Overall Architecture


The DIGIT sensor-based gripper palpates an avocado or kiwi after grasping it. Three consecutive tactile images are fed as input to the proposed SwishFormer model, which generates distinct feature representations. These features are then concatenated and used to predict the fruit's firmness via a Random Forest regressor. The predicted firmness values are then compared against standard fruit firmness thresholds to determine the ripeness levels.
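The pipeline described above can be sketched in a few lines of NumPy. This is a schematic illustration only: the feature extractor stands in for SwishFormer, the linear "regressor" stands in for the trained Random Forest, and the firmness thresholds are placeholder values, not the ones used in the paper.

```python
import numpy as np

# Placeholder firmness thresholds -- illustrative only, not the paper's values.
UNDERRIPE_MIN, RIPE_MIN = 3.0, 1.5

def extract_features(frame):
    """Stand-in for SwishFormer's per-frame embedding (mean per channel)."""
    return frame.reshape(-1, frame.shape[-1]).mean(axis=0)

def predict_firmness(frames, regressor):
    """Concatenate per-frame features, then regress a single firmness value."""
    feats = np.concatenate([extract_features(f) for f in frames])
    return regressor(feats)

def ripeness(firmness):
    """Map predicted firmness to a ripeness class by thresholding."""
    if firmness >= UNDERRIPE_MIN:
        return "underripe"
    if firmness >= RIPE_MIN:
        return "ripe"
    return "overripe"

# Toy usage: three 4x4 RGB tactile frames and a dummy linear "regressor".
frames = [np.full((4, 4, 3), v, dtype=float) for v in (0.2, 0.4, 0.6)]
dummy_regressor = lambda feats: feats.sum()  # stands in for the Random Forest
print(ripeness(predict_firmness(frames, dummy_regressor)))  # -> underripe
```

Only the thresholding step is exact; in the real system the three feature vectors come from SwishFormer and the regression is performed by the trained Random Forest head.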

DIGIT-Based Tactile Gripper

A custom-designed robotic gripper integrated with a DIGIT tactile sensor for non-destructive fruit firmness estimation.

HardSwish Token Mixer

(a) Original Transformer architecture, (b) Metaformer: A general architecture abstracted from the transformer architecture, (c) SwishFormer: The proposed architecture in which HardSwish activation function is used as a token mixer.

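The idea in panel (c) can be illustrated with a minimal NumPy sketch of one MetaFormer-style block in which the token mixer is simply the HardSwish activation. This is a schematic, assuming the generic MetaFormer template (norm, mixer, residual, then norm, channel MLP, residual); the paper's actual implementation details may differ.

```python
import numpy as np

def hardswish(x):
    """HardSwish(x) = x * ReLU6(x + 3) / 6 -- a cheap, piecewise-linear Swish."""
    return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0

def layer_norm(x, eps=1e-5):
    """Normalize each token over its channel dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def swishformer_block(x, w1, w2):
    """One MetaFormer-style block with HardSwish as the token mixer.
    x: (tokens, dim); w1, w2: channel-MLP weights. Schematic sketch only."""
    x = x + hardswish(layer_norm(x))                   # token mixer = HardSwish
    x = x + np.maximum(layer_norm(x) @ w1, 0.0) @ w2   # channel MLP (ReLU)
    return x

# Toy usage: 4 tokens of dimension 8, MLP expansion factor 2.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w1, w2 = rng.normal(size=(8, 16)), rng.normal(size=(16, 8))
y = swishformer_block(x, w1, w2)  # output keeps the (4, 8) token shape
```

Replacing attention with an activation function keeps the block's structure while removing the quadratic token-mixing cost, which is consistent with the reported parameter and GMAC reductions.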

Dataset

Random samples from the proposed dataset. The left column in each pair shows the RGB image of a kiwi or avocado, while the right column shows the corresponding VBTS palpation scan obtained using a DIGIT sensor. The RGB images lack visual cues related to fruit ripeness, whereas the tactile palpation scans encode valuable information for firmness estimation. The dataset also includes ground-truth firmness values measured with a penetrometer, serving as a benchmark for future works. In total, the dataset contains 4,760 sets of frames along with penetrometer readings.


Results

Performance evaluation of the proposed model with state-of-the-art architectures in terms of MSE, RMSE, R2, and MAE. Bold indicates the best performance, while the second-best performance is underlined.

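For reference, the four evaluation metrics in the table are the standard regression scores, which can be computed as follows (a generic NumPy implementation, not the paper's evaluation code):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, R2, and MAE for firmness regression."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)                             # mean squared error
    rmse = np.sqrt(mse)                                 # root mean squared error
    mae = np.mean(np.abs(err))                          # mean absolute error
    ss_res = np.sum(err ** 2)                           # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)      # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                          # coefficient of determination
    return {"MSE": mse, "RMSE": rmse, "R2": r2, "MAE": mae}
```

Lower is better for MSE, RMSE, and MAE; higher (closer to 1) is better for R2.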

YouTube Video

BibTeX

@article{mohsan2025swishformer,
      title={SwishFormer for robust firmness and ripeness recognition of fruits using visual tactile imagery},
      author={Mohsan, Mashood M and Hasanen, Basma B and Hassan, Taimur and Din, Muhayy Ud and Werghi, Naoufel and Seneviratne, Lakmal and Hussain, Irfan},
      journal={Postharvest Biology and Technology},
      volume={225}, 
      pages={113487},
      year={2025},
      publisher={Elsevier}
    }