Skip to content

QueryProcessor::searchJoin can throw Not Implemented during isOrdered checks, aborting join strategy selection #290

@edwardanderson

Description

@edwardanderson

Hello maintainers, and thank you for your work on hdt-cpp.

I would like to report what looks like a bug in QueryProcessor::searchJoin planning.

Summary:

  • QueryProcessor chooses MergeJoin when estimatedNumResults > 200000 and both sides report ordered
  • That gate calls isOrdered directly:
    if (left->estimatedNumResults() > 200000
    && left->isOrdered( left->getVarIndex(joinVar.c_str()) )
    && right->isOrdered( right->getVarIndex(joinVar.c_str()))) {
    root = new MergeJoinBinding((char *) joinVar.c_str(), left, right);
    } else {
    // TODO: Heuristic to use presort+mergejoin when the number of left results is big.
    root = new IndexJoinBinding((char *) joinVar.c_str(), left, right);
    }
  • In some iterator paths, isOrdered/isSorted throws std::logic_error("Not Implemented")
  • The exception propagates and planning fails

I think this may be related to #265, but this report is a distinct failure mode: exception safety during join strategy selection

To reproduce:

// queryprocessor_isordered_repro.cpp
// Minimal standalone reproduction of QueryProcessor's merge-join gate.

#include <exception>
#include <iostream>
#include <stdexcept>
#include <string>

struct BindingLike {
  virtual ~BindingLike() {}
  virtual std::size_t estimatedNumResults() const = 0;
  virtual std::size_t getVarIndex(const char *varName) const = 0;
  virtual bool isOrdered(std::size_t numvar) const = 0;
};

struct ThrowingBinding : public BindingLike {
  // Must exceed the QueryProcessor merge-join threshold (200000).
  std::size_t estimatedNumResults() const override { return 300001; }

  std::size_t getVarIndex(const char * /*varName*/) const override { return 0; }

  bool isOrdered(std::size_t /*numvar*/) const override {
    throw std::logic_error("Not Implemented");
  }
};

static bool bindingIsOrderedSafe(const BindingLike &binding, const char *varName) {
  try {
    std::size_t idx = binding.getVarIndex(varName);
    return binding.isOrdered(idx);
  } catch (...) {
    return false;
  }
}

static std::string plannerDecisionOriginal(const BindingLike &left, const BindingLike &right,
                                           const char *joinVar) {
  // Mirrors upstream QueryProcessor.cpp behaviour.
  if (left.estimatedNumResults() > 200000 &&
      left.isOrdered(left.getVarIndex(joinVar)) &&
      right.isOrdered(right.getVarIndex(joinVar))) {
    return "MergeJoin";
  }
  return "IndexJoin";
}

static std::string plannerDecisionSafe(const BindingLike &left, const BindingLike &right,
                                       const char *joinVar) {
  // Abandon MergeJoin on error
  if (left.estimatedNumResults() > 200000 &&
      bindingIsOrderedSafe(left, joinVar) &&
      bindingIsOrderedSafe(right, joinVar)) {
    return "MergeJoin";
  }
  return "IndexJoin";
}

int main() {
  ThrowingBinding left;
  ThrowingBinding right;
  const char *joinVar = "?offer";

  std::cout << "== Repro: QueryProcessor merge-join gate ==\n";

  try {
    std::string mode = plannerDecisionOriginal(left, right, joinVar);
    std::cout << "Original gate decision: " << mode << "\n";
  } catch (const std::exception &e) {
    std::cout << "Original gate throws: " << e.what() << "\n";
  }

  try {
    std::string mode = plannerDecisionSafe(left, right, joinVar);
    std::cout << "Safe gate decision: " << mode << "\n";
  } catch (const std::exception &e) {
    std::cout << "Safe gate throws: " << e.what() << "\n";
  }

  return 0;
}

... which can be run with:

#!/usr/bin/env bash
set -euo pipefail

ROOT_DIR="$(cd "$(dirname "$0")/.." && pwd)"
REPRO_DIR="$ROOT_DIR/repro"
BIN="$REPRO_DIR/queryprocessor_isordered_repro"
SRC="$REPRO_DIR/queryprocessor_isordered_repro.cpp"

c++ -std=c++11 -O2 "$SRC" -o "$BIN"

"$BIN"

echo
echo "== Source-path proof in hdt-cpp =="
grep -nE "left->isOrdered|right->isOrdered" "$ROOT_DIR/libhdt/src/sparql/QueryProcessor.cpp"
grep -nE "throw std::logic_error\(\"Not Implemented\"\)|isSorted\(" "$ROOT_DIR/libhdt/src/triples/BitmapTriplesIterators.cpp"

Output:

== Repro: QueryProcessor merge-join gate ==
Original gate throws: Not Implemented
Safe gate decision: IndexJoin

== Source-path proof in hdt-cpp ==
149:                                                    && left->isOrdered(     left->getVarIndex(joinVar.c_str()) )
150:                                                    && right->isOrdered( right->getVarIndex(joinVar.c_str()))) {
296:bool BitmapTriplesSearchIterator::isSorted(TripleComponentRole role) {
550:bool MiddleWaveletIterator::isSorted(TripleComponentRole role) {
697:    throw std::logic_error("Not Implemented");
700:bool IteratorY::isSorted(TripleComponentRole role) {
701:    throw std::logic_error("Not Implemented");
983:bool ObjectIndexIterator::isSorted(TripleComponentRole role) {

Current behaviour: an exception from isOrdered propagates out of searchJoin, so the merge-vs-index decision does not complete.

Proposed behaviour: if ordering capability is unavailable, treat that branch as unordered and continue with IndexJoin.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions