Problem with mask #9471

@kalibovers

Hi,

I’m working on an industrial bin picking system using TorchVision Mask R-CNN and I’m facing a problem with inconsistent and incomplete segmentation masks, even for identical objects.

Problem description:
I am detecting flat metal parts (thin sheet metal) in a bin picking scenario. The objects are identical, often overlapping, sometimes partially occluded, and have a reflective surface.

The model performs well in terms of detection (high confidence scores), but the predicted masks are often incomplete (only the visible part of the object), inconsistent between frames, and sometimes contain noise (small false positives / debris).
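To make "inconsistent between frames" measurable, one option is the intersection over union (IoU) between the masks predicted for the same part in two frames. The snippet below is a minimal sketch on toy data; `mask_iou` is a hypothetical helper (not a TorchVision function), and it assumes both masks are already binarized and aligned to the same image size.

```python
import numpy as np


def mask_iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection over union of two binary masks with identical shape."""
    a = mask_a.astype(bool)
    b = mask_b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        # Both masks are empty: treat them as identical (assumed convention).
        return 1.0
    return float(np.logical_and(a, b).sum() / union)


# Toy data: the same part in two frames; the second mask lost its right third.
frame_1 = np.zeros((6, 9), dtype=bool)
frame_1[1:5, 0:9] = True  # 4 rows x 9 columns = 36 pixels

frame_2 = np.zeros((6, 9), dtype=bool)
frame_2[1:5, 0:6] = True  # 4 rows x 6 columns = 24 pixels

iou = mask_iou(frame_1, frame_2)
print(f"frame-to-frame mask IoU: {iou:.3f}")
```

For a static scene, an IoU well below 1.0 between consecutive frames would indicate the kind of instability described above.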

This creates a major issue: I am unable to extract a stable and repeatable grasp point because the mask shape changes every time.
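As an illustration of why an incomplete mask breaks grasp point extraction, the sketch below uses a simple mask centroid as the grasp point. The centroid approach and the helper name `mask_centroid` are assumptions for this example only, not a description of the actual pipeline; the data is synthetic.

```python
import numpy as np


def mask_centroid(mask: np.ndarray):
    """Return the (x, y) centroid of a binary mask, or None if it is empty."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return float(xs.mean()), float(ys.mean())


# Synthetic part: a 20 px wide, 6 px tall rectangle.
full = np.zeros((10, 20), dtype=bool)
full[2:8, 0:20] = True

# Same part, but the right half is occluded and missing from the mask.
partial = full.copy()
partial[:, 10:] = False

c_full = mask_centroid(full)
c_partial = mask_centroid(partial)
shift_px = c_full[0] - c_partial[0]
print(f"full: {c_full}, partial: {c_partial}, x shift: {shift_px} px")
```

In this toy case, losing half of the mask moves the centroid by a quarter of the part length, which is the kind of shift that makes the grasp point non-repeatable.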

Expected behavior:
For industrial use, I need consistent mask shapes for identical objects, preferably full object segmentation (even when partially occluded, if possible), and stable geometry for downstream processing (grasping).

Current behavior:
Mask R-CNN predicts only the visible part of each object, so masks of partially occluded items are incomplete. Confidence stays high (e.g. 0.95–1.0) even for poor-quality masks, and small irrelevant regions are sometimes detected as valid objects.
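Because the score alone does not separate good masks from debris, one possible workaround is to filter detections by mask area after thresholding. The following is a minimal sketch on synthetic data, assuming the soft masks have already been converted to a NumPy array of shape (N, H, W) with values in the 0–1 range; `filter_by_area`, `min_area`, and the 0.5 threshold are illustrative choices, not values from my setup.

```python
import numpy as np


def filter_by_area(masks, scores, min_area, mask_threshold=0.5):
    """Binarize soft masks and drop detections whose area is below min_area.

    masks:  float array of shape (N, H, W), values in [0, 1]
    scores: float array of shape (N,)
    Returns the kept binary masks, the kept scores, and all areas in pixels.
    """
    binary = masks > mask_threshold
    areas = binary.reshape(binary.shape[0], -1).sum(axis=1)
    keep = areas >= min_area
    return binary[keep], scores[keep], areas


# Synthetic detections: all three have high scores, one is tiny debris.
masks = np.zeros((3, 8, 8), dtype=np.float32)
masks[0, 1:7, 1:7] = 0.9  # 6 x 6 = 36 px, plausible part
masks[1, 0:2, 0:2] = 0.8  # 2 x 2 = 4 px, debris
masks[2, 2:6, 2:7] = 0.6  # 4 x 5 = 20 px, partially visible part
scores = np.array([0.99, 0.97, 0.95])

kept_masks, kept_scores, areas = filter_by_area(masks, scores, min_area=10)
print("areas:", areas.tolist(), "kept scores:", kept_scores.tolist())
```

An area filter like this only removes small false positives; it does not fix incomplete masks of occluded parts.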

Setup:
Model: maskrcnn_resnet50_fpn_v2
Framework: TorchVision
Input: 1920x1080 (letterboxed)
Dataset: custom (COCO format)
Objects: flat metal parts (bin picking)
Training: standard TorchVision pipeline

Questions:
1. Is Mask R-CNN in TorchVision expected to always segment only the visible part of an object (no amodal segmentation)?
2. What techniques can improve mask completeness and consistency?
3. Would increasing mask resolution (e.g. from 28×28 to 56×56) help in practice?
4. Are there recommended ways to enforce shape consistency and reduce noise / false positives?

Additional context:
This is a real industrial application (robot bin picking), so consistency is more important than raw detection accuracy. I need repeatable geometry, not just object detection.

TorchVision currently performs best among the tested solutions, but this issue is blocking further system development.

Thanks in advance for any guidance.
