
fix: Use FileSystem on getPaths() instead of mapreduce.Job #19418

Open
JWuCines wants to merge 2 commits into apache:master from JWuCines:hdfs-no-mapreduce

Conversation


@JWuCines JWuCines commented May 6, 2026

Fixes #19411.

Description

When using index_parallel with HdfsInputSource on a Kerberized HDFS cluster where the NameNode has KMS configured, the ingestion task unnecessarily attempts to acquire a KMS delegation token. This happens because HdfsInputSource.getPaths() uses FileInputFormat.getSplits() for path/glob expansion, which internally calls TokenCache.obtainTokensForNamenodes(), cascading into KMSClientProvider.getDelegationToken(). Druid's native ingestion authenticates directly via Kerberos TGT and never needs these delegation tokens.

Replaced FileInputFormat with direct FileSystem.globStatus() calls

Replaced the FileInputFormat/Job-based path expansion in HdfsInputSource.getPaths() with direct FileSystem.globStatus() calls. This achieves the same HDFS glob expansion without entering the MapReduce TokenCache code path, eliminating the unnecessary KMS contact.

The inner HdfsFileInputFormat helper class and all org.apache.hadoop.mapreduce imports have been removed. No other file in the druid-hdfs-storage module references the MapReduce API.
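
The shape of the replacement logic can be sketched in plain Java. This analog uses java.nio instead of Hadoop's FileSystem API (the actual patch calls FileSystem.globStatus() and FileSystem.listStatus() against HDFS); the class and method names here are illustrative only, not the code in the PR.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative analog of the new getPaths() flow: expand a glob, skip hidden
// entries, and list matched directories exactly one level deep.
public class GlobExpansionSketch
{
  // Hadoop-style hidden-file rule: names starting with '_' or '.' are skipped.
  static boolean isHidden(Path p)
  {
    final String name = p.getFileName().toString();
    return name.startsWith("_") || name.startsWith(".");
  }

  // Expand one glob pattern under `base`. Matching files are kept directly;
  // matching directories contribute their immediate (non-hidden) files only,
  // mirroring FileInputFormat's non-recursive default.
  static List<Path> expand(Path base, String glob) throws IOException
  {
    final List<Path> result = new ArrayList<>();
    try (DirectoryStream<Path> matches = Files.newDirectoryStream(base, glob)) {
      for (Path match : matches) {
        if (isHidden(match)) {
          continue;
        }
        if (Files.isDirectory(match)) {
          try (DirectoryStream<Path> children = Files.newDirectoryStream(match)) {
            for (Path child : children) {
              if (!isHidden(child) && Files.isRegularFile(child)) {
                result.add(child);
              }
            }
          }
        } else {
          result.add(match);
        }
      }
    }
    Collections.sort(result);
    return result;
  }
}
```

The key property, in both the sketch and the real change, is that no MapReduce Job object is ever constructed, so the TokenCache code path is never entered.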

Preserved FileInputFormat filtering semantics

  • Hidden file filter: Files and directories whose names start with _ or . are excluded, matching Hadoop's FileInputFormat.hiddenFileFilter.
  • Non-recursive directory listing: When a path points to a directory, only immediate files are listed; subdirectories are not traversed. This matches FileInputFormat's default behavior when mapreduce.input.fileinputformat.input.dir.recursive is not set.
  • Comma-separated path splitting: Input path strings are split using org.apache.hadoop.util.StringUtils.split(), preserving the same comma-separation and escape behavior as FileInputFormat.addInputPaths().

Updated documentation

Updated the paths property description in docs/ingestion/input-sources.md to document the non-recursive directory traversal behavior, hidden file filtering, and the use of glob patterns (e.g., **/*.json) for ingesting files from nested directories.

Added unit tests for getPaths() edge cases

Added a new GetPathsTest inner class to HdfsInputSourceTest with seven tests:

  • testGetPathsWithGlobMatchingNoFiles — glob matching no files returns an empty collection
  • testGetPathsFiltersZeroLengthFiles — zero-length files are excluded, non-empty files are included
  • testGetPathsWithMultipleInputPaths — multiple distinct glob patterns are resolved correctly
  • testGetPathsWithCommaSeparatedString — comma-separated path strings are split and resolved
  • testGetPathsFiltersHiddenFiles — files starting with _ or . are excluded from glob results
  • testGetPathsDirectoryListsFilesNonRecursively — subdirectories and hidden files within a directory are skipped
  • testGetPathsSkipsHiddenDirectories — hidden directories matched by a glob are not descended into

Release note

Fixed an issue where HdfsInputSource with index_parallel unnecessarily contacted KMS when using Kerberized HDFS, causing task failures if KMS was unreachable. The fix replaces the internal use of Hadoop MapReduce FileInputFormat for path expansion with direct FileSystem.globStatus() calls, while preserving hidden file filtering and non-recursive directory listing semantics.


Key changed/added classes in this PR
  • HdfsInputSource
  • HdfsInputSourceTest
  • docs/ingestion/input-sources.md

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.

@JWuCines JWuCines force-pushed the hdfs-no-mapreduce branch from 3800037 to 4689a57 on May 6, 2026 at 12:01
@FrankChen021 FrankChen021 (Member) left a comment


| Severity | Findings |
|----------|----------|
| P0       | 0        |
| P1       | 1        |
| P2       | 1        |
| P3       | 0        |
| Total    | 2        |

This is an automated review by Codex GPT-5

```diff
- FileInputFormat.addInputPaths(job, inputPath);
+ final Set<Path> paths = new LinkedHashSet<>();
+ for (final String inputPath : inputPaths) {
+   final Path p = new Path(inputPath);
```

[P1] Comma-separated paths string no longer works

The documented HDFS paths property can be a comma-separated string, and the previous FileInputFormat.addInputPaths(job, inputPath) split that string using Hadoop's path parser. The replacement constructs one Path from the entire string, so a spec like "hdfs://nn/a.json,hdfs://nn/b.json" is now treated as a single path/glob and will fail or match nothing. Split comma-separated string inputs before creating Paths, preserving Hadoop's escaping/brace behavior where possible.
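
The suggested fix amounts to splitting on unescaped commas before constructing each Path. A minimal sketch of such a splitter is below; it approximates the behavior of org.apache.hadoop.util.StringUtils.split(String) (comma separator, backslash escape) but is an illustration, not the exact Hadoop implementation, which also interacts with brace handling in FileInputFormat.getPathStrings().

```java
import java.util.ArrayList;
import java.util.List;

// Escape-aware comma splitter: ',' separates paths, and a preceding
// backslash makes the next character (including ',') literal.
public class PathSplitSketch
{
  static List<String> splitPaths(String paths)
  {
    final List<String> tokens = new ArrayList<>();
    final StringBuilder current = new StringBuilder();
    boolean escaped = false;
    for (int i = 0; i < paths.length(); i++) {
      final char c = paths.charAt(i);
      if (escaped) {
        current.append(c);   // escaped char is taken literally
        escaped = false;
      } else if (c == '\\') {
        escaped = true;      // next char is literal, even if it is ','
      } else if (c == ',') {
        tokens.add(current.toString());
        current.setLength(0);
      } else {
        current.append(c);
      }
    }
    tokens.add(current.toString());
    return tokens;
  }
}
```

With this shape, "hdfs://nn/a.json,hdfs://nn/b.json" yields two paths instead of one malformed glob, and an escaped comma stays inside a single path.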

```diff
- {
-   return false; // prevent generating extra paths
+ if (status.isDirectory()) {
+   final FileStatus[] children = fs.listStatus(status.getPath());
```

[P2] FileInputFormat filtering semantics were dropped

The old FileInputFormat listing applied Hadoop's hidden-file filter, excluding path names starting with _ or ., and only recursed into directories when mapreduce.input.fileinputformat.input.dir.recursive was enabled. The new direct listStatus traversal always recurses and never filters hidden files, so existing directory inputs can start ingesting nested data and non-empty marker/metadata files that were previously skipped. Reintroduce the same path filter and recursion behavior while avoiding the MapReduce token path.



Development

Successfully merging this pull request may close these issues.

HdfsInputSource with index_parallel unnecessarily requires KMS delegation token when using Kerberized HDFS

2 participants