Hello maintainers, and thank you for your work on hdt-cpp.
I would like to report what looks like a bug in QueryProcessor::searchJoin planning.
Summary:
- QueryProcessor chooses
MergeJoin when estimatedNumResults > 200000 and both sides report ordered
- That gate calls
isOrdered directly:
|
if (left->estimatedNumResults() > 200000 |
|
&& left->isOrdered( left->getVarIndex(joinVar.c_str()) ) |
|
&& right->isOrdered( right->getVarIndex(joinVar.c_str()))) { |
|
root = new MergeJoinBinding((char *) joinVar.c_str(), left, right); |
|
} else { |
|
// TODO: Heuristic to use presort+mergejoin when the number of left results is big. |
|
root = new IndexJoinBinding((char *) joinVar.c_str(), left, right); |
|
} |
- In some iterator paths,
isOrdered/isSorted throws std::logic_error("Not Implemented")
- The exception propagates and planning fails
I think this may be related to #265, but this report is a distinct failure mode: exception safety during join strategy selection
To reproduce:
// queryprocessor_isordered_repro.cpp
// Minimal standalone reproduction of QueryProcessor's merge-join gate.
#include <exception>
#include <iostream>
#include <stdexcept>
#include <string>
struct BindingLike {
virtual ~BindingLike() {}
virtual std::size_t estimatedNumResults() const = 0;
virtual std::size_t getVarIndex(const char *varName) const = 0;
virtual bool isOrdered(std::size_t numvar) const = 0;
};
struct ThrowingBinding : public BindingLike {
// Must exceed the QueryProcessor merge-join threshold (200000).
std::size_t estimatedNumResults() const override { return 300001; }
std::size_t getVarIndex(const char * /*varName*/) const override { return 0; }
bool isOrdered(std::size_t /*numvar*/) const override {
throw std::logic_error("Not Implemented");
}
};
static bool bindingIsOrderedSafe(const BindingLike &binding, const char *varName) {
try {
std::size_t idx = binding.getVarIndex(varName);
return binding.isOrdered(idx);
} catch (...) {
return false;
}
}
static std::string plannerDecisionOriginal(const BindingLike &left, const BindingLike &right,
const char *joinVar) {
// Mirrors upstream QueryProcessor.cpp behaviour.
if (left.estimatedNumResults() > 200000 &&
left.isOrdered(left.getVarIndex(joinVar)) &&
right.isOrdered(right.getVarIndex(joinVar))) {
return "MergeJoin";
}
return "IndexJoin";
}
static std::string plannerDecisionSafe(const BindingLike &left, const BindingLike &right,
const char *joinVar) {
// Abandon MergeJoin on error
if (left.estimatedNumResults() > 200000 &&
bindingIsOrderedSafe(left, joinVar) &&
bindingIsOrderedSafe(right, joinVar)) {
return "MergeJoin";
}
return "IndexJoin";
}
int main() {
ThrowingBinding left;
ThrowingBinding right;
const char *joinVar = "?offer";
std::cout << "== Repro: QueryProcessor merge-join gate ==\n";
try {
std::string mode = plannerDecisionOriginal(left, right, joinVar);
std::cout << "Original gate decision: " << mode << "\n";
} catch (const std::exception &e) {
std::cout << "Original gate throws: " << e.what() << "\n";
}
try {
std::string mode = plannerDecisionSafe(left, right, joinVar);
std::cout << "Safe gate decision: " << mode << "\n";
} catch (const std::exception &e) {
std::cout << "Safe gate throws: " << e.what() << "\n";
}
return 0;
}
... which can be run with:
#!/usr/bin/env bash
set -euo pipefail
ROOT_DIR="$(cd "$(dirname "$0")/.." && pwd)"
REPRO_DIR="$ROOT_DIR/repro"
BIN="$REPRO_DIR/queryprocessor_isordered_repro"
SRC="$REPRO_DIR/queryprocessor_isordered_repro.cpp"
c++ -std=c++11 -O2 "$SRC" -o "$BIN"
"$BIN"
echo
echo "== Source-path proof in hdt-cpp =="
grep -nE "left->isOrdered|right->isOrdered" "$ROOT_DIR/libhdt/src/sparql/QueryProcessor.cpp"
grep -nE "throw std::logic_error\(\"Not Implemented\"\)|isSorted\(" "$ROOT_DIR/libhdt/src/triples/BitmapTriplesIterators.cpp"
Output:
== Repro: QueryProcessor merge-join gate ==
Original gate throws: Not Implemented
Safe gate decision: IndexJoin
== Source-path proof in hdt-cpp ==
149: && left->isOrdered( left->getVarIndex(joinVar.c_str()) )
150: && right->isOrdered( right->getVarIndex(joinVar.c_str()))) {
296:bool BitmapTriplesSearchIterator::isSorted(TripleComponentRole role) {
550:bool MiddleWaveletIterator::isSorted(TripleComponentRole role) {
697: throw std::logic_error("Not Implemented");
700:bool IteratorY::isSorted(TripleComponentRole role) {
701: throw std::logic_error("Not Implemented");
983:bool ObjectIndexIterator::isSorted(TripleComponentRole role) {
Current behaviour: an exception from isOrdered propagates out of searchJoin, so the merge-vs-index decision does not complete.
Proposed behaviour: if ordering capability is unavailable, treat that branch as unordered and continue with IndexJoin.
Hello maintainers, and thank you for your work on hdt-cpp.
I would like to report what looks like a bug in
QueryProcessor::searchJoinplanning.Summary:
MergeJoinwhenestimatedNumResults > 200000and both sides report orderedisOrdereddirectly:hdt-cpp/libhdt/src/sparql/QueryProcessor.cpp
Lines 148 to 155 in 88110cc
isOrdered/isSortedthrowsstd::logic_error("Not Implemented")I think this may be related to #265, but this report is a distinct failure mode: exception safety during join strategy selection
To reproduce:
... which can be run with:
Output:
Current behaviour: an exception from
isOrderedpropagates out ofsearchJoin, so the merge-vs-index decision does not complete.Proposed behaviour: if ordering capability is unavailable, treat that branch as unordered and continue with
IndexJoin.