
fix: Use FileSystem on getPaths() instead of mapreduce.Job #19418

Open
JWuCines wants to merge 2 commits into apache:master from JWuCines:hdfs-no-mapreduce

Conversation


@JWuCines JWuCines commented May 6, 2026

Fixes #19411.

Description

When using index_parallel with HdfsInputSource on a Kerberized HDFS cluster where the NameNode has KMS configured, the ingestion task unnecessarily attempts to acquire a KMS delegation token. This happens because HdfsInputSource.getPaths() uses FileInputFormat.getSplits() for path/glob expansion, which internally calls TokenCache.obtainTokensForNamenodes(), cascading into KMSClientProvider.getDelegationToken(). Druid's native ingestion authenticates directly via Kerberos TGT and never needs these delegation tokens.

Replaced FileInputFormat with direct FileSystem.globStatus() calls

Replaced the FileInputFormat/Job-based path expansion in HdfsInputSource.getPaths() with direct FileSystem.globStatus() calls. This achieves the same HDFS glob expansion without entering the MapReduce TokenCache code path, eliminating the unnecessary KMS contact.

The inner HdfsFileInputFormat helper class and all org.apache.hadoop.mapreduce imports have been removed. No other file in the druid-hdfs-storage module references the MapReduce API.
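
The shape of the replacement logic can be sketched in plain Java. This analog uses java.nio instead of Hadoop's FileSystem API (the actual patch calls FileSystem.globStatus() and FileSystem.listStatus() against HDFS); the class and method names here are illustrative only, not the code in the PR.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative analog of the new getPaths() flow: expand a glob, skip hidden
// entries, and list matched directories exactly one level deep.
public class GlobExpansionSketch
{
  // Hadoop-style hidden-file rule: names starting with '_' or '.' are skipped.
  static boolean isHidden(Path p)
  {
    final String name = p.getFileName().toString();
    return name.startsWith("_") || name.startsWith(".");
  }

  // Expand one glob pattern under `base`. Matching files are kept directly;
  // matching directories contribute their immediate (non-hidden) files only,
  // mirroring FileInputFormat's non-recursive default.
  static List<Path> expand(Path base, String glob) throws IOException
  {
    final List<Path> result = new ArrayList<>();
    try (DirectoryStream<Path> matches = Files.newDirectoryStream(base, glob)) {
      for (Path match : matches) {
        if (isHidden(match)) {
          continue;
        }
        if (Files.isDirectory(match)) {
          try (DirectoryStream<Path> children = Files.newDirectoryStream(match)) {
            for (Path child : children) {
              if (!isHidden(child) && Files.isRegularFile(child)) {
                result.add(child);
              }
            }
          }
        } else {
          result.add(match);
        }
      }
    }
    Collections.sort(result);
    return result;
  }
}
```

The key property, in both the sketch and the real change, is that no MapReduce Job object is ever constructed, so the TokenCache code path is never entered.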

Preserved FileInputFormat filtering semantics

  • Hidden file filter: Files and directories whose names start with _ or . are excluded, matching Hadoop's FileInputFormat.hiddenFileFilter.
  • Non-recursive directory listing: When a path points to a directory, only immediate files are listed; subdirectories are not traversed. This matches FileInputFormat's default behavior when mapreduce.input.fileinputformat.input.dir.recursive is not set.
  • Comma-separated path splitting: Input path strings are split using org.apache.hadoop.util.StringUtils.split(), preserving the same comma-separation and escape behavior as FileInputFormat.addInputPaths().

Updated documentation

Updated the paths property description in docs/ingestion/input-sources.md to document the non-recursive directory traversal behavior, hidden file filtering, and the use of glob patterns (e.g., **/*.json) for ingesting files from nested directories.

Added unit tests for getPaths() edge cases

Added a new GetPathsTest inner class to HdfsInputSourceTest with seven tests:

  • testGetPathsWithGlobMatchingNoFiles — glob matching no files returns an empty collection
  • testGetPathsFiltersZeroLengthFiles — zero-length files are excluded, non-empty files are included
  • testGetPathsWithMultipleInputPaths — multiple distinct glob patterns are resolved correctly
  • testGetPathsWithCommaSeparatedString — comma-separated path strings are split and resolved
  • testGetPathsFiltersHiddenFiles — files starting with _ or . are excluded from glob results
  • testGetPathsDirectoryListsFilesNonRecursively — subdirectories and hidden files within a directory are skipped
  • testGetPathsSkipsHiddenDirectories — hidden directories matched by a glob are not descended into

Release note

Fixed an issue where HdfsInputSource with index_parallel unnecessarily contacted KMS when using Kerberized HDFS, causing task failures if KMS was unreachable. The fix replaces the internal use of Hadoop MapReduce FileInputFormat for path expansion with direct FileSystem.globStatus() calls, while preserving hidden file filtering and non-recursive directory listing semantics.


Key changed/added classes in this PR
  • HdfsInputSource
  • HdfsInputSourceTest
  • docs/ingestion/input-sources.md

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.

@JWuCines JWuCines force-pushed the hdfs-no-mapreduce branch from 3800037 to 4689a57 on May 6, 2026 at 12:01
@FrankChen021 FrankChen021 (Member) left a comment


| Severity | Findings |
|----------|----------|
| P0       | 0        |
| P1       | 1        |
| P2       | 1        |
| P3       | 0        |
| Total    | 2        |

This is an automated review by Codex GPT-5

```diff
- FileInputFormat.addInputPaths(job, inputPath);
+ final Set<Path> paths = new LinkedHashSet<>();
+ for (final String inputPath : inputPaths) {
+   final Path p = new Path(inputPath);
```

[P1] Comma-separated paths string no longer works

The documented HDFS paths property can be a comma-separated string, and the previous FileInputFormat.addInputPaths(job, inputPath) split that string using Hadoop's path parser. The replacement constructs one Path from the entire string, so a spec like "hdfs://nn/a.json,hdfs://nn/b.json" is now treated as a single path/glob and will fail or match nothing. Split comma-separated string inputs before creating Paths, preserving Hadoop's escaping/brace behavior where possible.
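
The suggested fix amounts to splitting on unescaped commas before constructing each Path. A minimal sketch of such a splitter is below; it approximates the behavior of org.apache.hadoop.util.StringUtils.split(String) (comma separator, backslash escape) but is an illustration, not the exact Hadoop implementation, which also interacts with brace handling in FileInputFormat.getPathStrings().

```java
import java.util.ArrayList;
import java.util.List;

// Escape-aware comma splitter: ',' separates paths, and a preceding
// backslash makes the next character (including ',') literal.
public class PathSplitSketch
{
  static List<String> splitPaths(String paths)
  {
    final List<String> tokens = new ArrayList<>();
    final StringBuilder current = new StringBuilder();
    boolean escaped = false;
    for (int i = 0; i < paths.length(); i++) {
      final char c = paths.charAt(i);
      if (escaped) {
        current.append(c);   // escaped char is taken literally
        escaped = false;
      } else if (c == '\\') {
        escaped = true;      // next char is literal, even if it is ','
      } else if (c == ',') {
        tokens.add(current.toString());
        current.setLength(0);
      } else {
        current.append(c);
      }
    }
    tokens.add(current.toString());
    return tokens;
  }
}
```

With this shape, "hdfs://nn/a.json,hdfs://nn/b.json" yields two paths instead of one malformed glob, and an escaped comma stays inside a single path.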

```diff
- {
-   return false; // prevent generating extra paths
+ if (status.isDirectory()) {
+   final FileStatus[] children = fs.listStatus(status.getPath());
```

[P2] FileInputFormat filtering semantics were dropped

The old FileInputFormat listing applied Hadoop's hidden-file filter, excluding path names starting with _ or ., and only recursed into directories when mapreduce.input.fileinputformat.input.dir.recursive was enabled. The new direct listStatus traversal always recurses and never filters hidden files, so existing directory inputs can start ingesting nested data and non-empty marker/metadata files that were previously skipped. Reintroduce the same path filter and recursion behavior while avoiding the MapReduce token path.



Development

Successfully merging this pull request may close these issues.

HdfsInputSource with index_parallel unnecessarily requires KMS delegation token when using Kerberized HDFS

2 participants