Hash PLINK export filenames by content-affecting selection parameters to prevent collisions #1113

@rehanxt5

Summary

The PLINK export methods currently generate output filenames from only a subset of their input parameters. As a result, different export requests can resolve to the same filename prefix even when they produce different data.

This affects:

  1. biallelic_snps_to_plink()
  2. biallelic_snps_ld_pruned_to_plink()

Problem

At the moment, the filename prefix is built from parameters such as region-level SNP filtering, but it excludes several parameters that also affect the exported dataset contents, including:

  • sample_sets
  • sample_query
  • sample_query_options
  • sample_indices
  • site_mask
  • random_seed

Because these parameters are not reflected in the filename, two calls with different sample-selection inputs can write to the same .bed/.bim/.fam prefix.

Example collision

Two calls such as:

api.biallelic_snps_to_plink(
    output_dir="/plink",
    region="2L",
    n_snps=1000,
    sample_sets=["set_a"],
    random_seed=42,
)

and

api.biallelic_snps_to_plink(
    output_dir="/plink",
    region="2L",
    n_snps=1000,
    sample_sets=["set_b"],
    random_seed=42,
)

can currently generate the same output prefix, despite exporting different sample cohorts.

Why this is a problem

  • Different exports can silently overwrite each other.
  • An exported (or cached) file path does not reliably identify the dataset it contains, so an existing file cannot be assumed to match the current request.
  • The risk increases when users run multiple exports into the same output_dir.
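
One possible direction, as a minimal sketch only: derive a short digest from every content-affecting parameter and append it to the prefix. The function name plink_prefix, the prefix layout, and the 12-character digest length are all hypothetical and do not reflect the existing implementation.

```python
import hashlib
import json


def plink_prefix(output_dir, region, n_snps, **selection_params):
    """Hypothetical helper: build a prefix that changes with any selection parameter."""
    # Serialise with sorted keys so the digest does not depend on keyword order.
    # default=str is a fallback for values that json cannot encode natively.
    payload = json.dumps(selection_params, sort_keys=True, default=str)
    # A truncated SHA-256 digest keeps filenames short; the length is an arbitrary choice.
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return f"{output_dir}/{region}_{n_snps}_{digest}"


# The two calls from the collision example now resolve to different prefixes.
prefix_a = plink_prefix("/plink", "2L", 1000, sample_sets=["set_a"], random_seed=42)
prefix_b = plink_prefix("/plink", "2L", 1000, sample_sets=["set_b"], random_seed=42)
print(prefix_a != prefix_b)
```

In a real implementation, array-like values such as sample_indices would probably need converting to plain lists before hashing, since the string form of a large NumPy array may be abbreviated and so would not capture every element.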
