Hi,
I’m working on an industrial bin picking system using TorchVision Mask R-CNN and I’m facing a problem with inconsistent and incomplete segmentation masks, even for identical objects.
Problem description:
I am detecting flat metal parts (thin sheet metal) in a bin picking scenario. The objects are identical, often overlapping, sometimes partially occluded, and have a reflective surface.
The model performs well in terms of detection (high confidence scores), but the predicted masks are often incomplete (only the visible part of the object), inconsistent between frames, and sometimes contain noise (small false positives / debris).
This creates a major issue: I am unable to extract a stable and repeatable grasp point because the mask shape for the same part changes from frame to frame.
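To illustrate why this matters for grasping, here is a minimal sketch, assuming the grasp point is derived from the mask centroid (my real grasp logic is more involved). The rectangle sizes and the helper names `mask_centroid` and `rectangle_mask` are made up for this example; it only shows that a mask covering just the visible part of a part moves the centroid away from the true centre.

```python
def mask_centroid(mask):
    """Return the (row, col) centroid of the nonzero cells of a 2D 0/1 mask."""
    points = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not points:
        raise ValueError("empty mask")
    n = len(points)
    return (sum(r for r, _ in points) / n, sum(c for _, c in points) / n)


def rectangle_mask(height, width, top, left, bottom, right):
    """Build a 0/1 mask with ones in rows top..bottom-1 and columns left..right-1."""
    return [
        [1 if top <= r < bottom and left <= c < right else 0 for c in range(width)]
        for r in range(height)
    ]


# Hypothetical part: a 10 x 20 pixel rectangle inside a 20 x 40 image.
full_part = rectangle_mask(20, 40, top=5, left=10, bottom=15, right=30)

# Same part, but the right-hand 8 columns are hidden by an overlapping part,
# so the predicted (visible-only) mask is narrower.
visible_part = rectangle_mask(20, 40, top=5, left=10, bottom=15, right=22)

full_centroid = mask_centroid(full_part)
visible_centroid = mask_centroid(visible_part)

# Horizontal error of a centroid-based grasp point caused by the occlusion.
shift_cols = full_centroid[1] - visible_centroid[1]
print(full_centroid, visible_centroid, shift_cols)
```

In this toy case, hiding 8 of the 20 columns moves the centroid by 4 pixels, which is the kind of drift I see in the real system.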
Expected behavior:
For industrial use, I need consistent mask shapes for identical objects, preferably full object segmentation (even when partially occluded, if possible), and stable geometry for downstream processing (grasping).
Current behavior:
Mask R-CNN predicts only the visible parts of objects. For partially occluded items, the mask is incomplete. Confidence remains high (e.g. 0.95–1.0), even for poor-quality masks. Small irrelevant regions are sometimes detected as valid objects.
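For the small irrelevant regions, one workaround I am considering is an area filter on connected regions after inference. The sketch below is only an assumption about how that could look: it works on a plain 0/1 grid in pure Python, `remove_small_regions` and the `min_area` threshold are hypothetical names and values, and it is not part of TorchVision or of my current pipeline.

```python
from collections import deque


def remove_small_regions(mask, min_area):
    """Keep only 4-connected regions of a 2D 0/1 mask with at least min_area cells."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    cleaned = [[0] * width for _ in range(height)]

    for r0 in range(height):
        for c0 in range(width):
            if not mask[r0][c0] or seen[r0][c0]:
                continue
            # Breadth-first search collects one connected region.
            region = []
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            while queue:
                r, c = queue.popleft()
                region.append((r, c))
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    inside = 0 <= nr < height and 0 <= nc < width
                    if inside and mask[nr][nc] and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            # Copy the region to the output only if it is large enough.
            if len(region) >= min_area:
                for r, c in region:
                    cleaned[r][c] = 1
    return cleaned


# Toy mask: one 3 x 3 part plus two isolated single-pixel specks.
noisy = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
]
cleaned = remove_small_regions(noisy, min_area=4)
for row in cleaned:
    print(row)
```

This removes the specks but does nothing for the incomplete masks, so it only addresses part of the problem.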
Setup:
Model: maskrcnn_resnet50_fpn_v2
Framework: TorchVision
Input: 1920×1080 (letterboxed)
Dataset: custom (COCO format)
Objects: flat metal parts (bin picking)
Training: standard TorchVision pipeline
Questions:
1. Is Mask R-CNN in TorchVision expected to always segment only the visible part of an object (no amodal segmentation)?
2. What techniques can improve mask completeness and consistency?
3. Would increasing mask resolution (e.g. from 28×28 to 56×56) help in practice?
4. Are there recommended ways to enforce shape consistency and reduce noise / false positives?
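To make question 3 concrete: as I understand it, the mask head predicts a fixed-size grid per detection that is then resized to the detection box, so each grid cell covers roughly box size divided by grid size in image pixels. The sketch below is just that arithmetic; the box size of 448×224 pixels and the helper name `pixels_per_mask_cell` are made up for illustration and are not measurements from my data.

```python
def pixels_per_mask_cell(box_width_px, box_height_px, mask_size):
    """Approximate image pixels covered by one cell of a mask_size x mask_size grid."""
    return (box_width_px / mask_size, box_height_px / mask_size)


# Hypothetical detection box for one part in the 1920x1080 image.
box_w, box_h = 448, 224

coarse = pixels_per_mask_cell(box_w, box_h, 28)  # current 28x28 mask output
fine = pixels_per_mask_cell(box_w, box_h, 56)    # proposed 56x56 mask output

print("28x28:", coarse)
print("56x56:", fine)
```

For a box of this size, one cell of the 28×28 grid spans 16×8 pixels and one cell of the 56×56 grid spans 8×4 pixels. What I cannot judge is whether this finer grid actually yields more stable edges on reflective parts, or only sharper noise.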
Additional context:
This is a real industrial application (robot bin picking), so consistency is more important than raw detection accuracy. I need repeatable geometry, not just object detection.
TorchVision currently performs best among the tested solutions, but this issue is blocking further system development.
Thanks in advance for any guidance.