fix: Use FileSystem on getPaths() instead of mapreduce.Job#19418
JWuCines wants to merge 2 commits into apache:master
Conversation
3800037 to 4689a57
FrankChen021 left a comment:
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 1 |
| P2 | 1 |
| P3 | 0 |
| Total | 2 |
This is an automated review by Codex GPT-5
    FileInputFormat.addInputPaths(job, inputPath);
    final Set<Path> paths = new LinkedHashSet<>();
    for (final String inputPath : inputPaths) {
      final Path p = new Path(inputPath);
[P1] Comma-separated paths string no longer works
The documented HDFS paths property can be a comma-separated string, and the previous FileInputFormat.addInputPaths(job, inputPath) split that string using Hadoop's path parser. The replacement constructs one Path from the entire string, so a spec like "hdfs://nn/a.json,hdfs://nn/b.json" is now treated as a single path/glob and will fail or match nothing. Split comma-separated string inputs before creating Paths, preserving Hadoop's escaping/brace behavior where possible.
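To illustrate the kind of splitting the reviewer is asking for, here is a minimal, self-contained sketch of escape-aware comma splitting, similar in spirit to Hadoop's `org.apache.hadoop.util.StringUtils.split()`. The `PathSpecSplitter` class and its behavior of unescaping `\,` are illustrative assumptions, not code from this PR or from Hadoop itself:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split a comma-separated path spec while honoring
// backslash-escaped commas, in the spirit of Hadoop's StringUtils.split().
// Unlike Hadoop's version, this variant also unescapes "\," to "," for clarity.
public class PathSpecSplitter
{
  public static List<String> split(String spec)
  {
    final List<String> parts = new ArrayList<>();
    final StringBuilder current = new StringBuilder();
    boolean escaped = false;
    for (char c : spec.toCharArray()) {
      if (escaped) {
        current.append(c);      // keep the escaped character literally
        escaped = false;
      } else if (c == '\\') {
        escaped = true;         // next character is taken literally
      } else if (c == ',') {
        parts.add(current.toString());
        current.setLength(0);   // start the next path element
      } else {
        current.append(c);
      }
    }
    parts.add(current.toString());
    return parts;
  }
}
```

With this, `"hdfs://nn/a.json,hdfs://nn/b.json"` yields two separate paths instead of one combined string.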
    {
      return false; // prevent generating extra paths
    if (status.isDirectory()) {
      final FileStatus[] children = fs.listStatus(status.getPath());
[P2] FileInputFormat filtering semantics were dropped
The old FileInputFormat listing applied Hadoop's hidden-file filter, excluding path names starting with _ or ., and only recursed into directories when mapreduce.input.fileinputformat.input.dir.recursive was enabled. The new direct listStatus traversal always recurses and never filters hidden files, so existing directory inputs can start ingesting nested data and non-empty marker/metadata files that were previously skipped. Reintroduce the same path filter and recursion behavior while avoiding the MapReduce token path.
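For reference, the hidden-name rule the reviewer describes (Hadoop's `FileInputFormat.hiddenFileFilter` rejects any path whose final name component starts with `_` or `.`) can be sketched as a small predicate. `HiddenPathFilter` is an illustrative stand-in, not Hadoop's actual `PathFilter` implementation:

```java
// Hypothetical sketch of the hidden-path rule: only the final path
// component decides visibility; "_" and "." prefixes mark a path hidden,
// mirroring the behavior of FileInputFormat.hiddenFileFilter.
public class HiddenPathFilter
{
  public static boolean accept(String pathName)
  {
    final int slash = pathName.lastIndexOf('/');
    final String name = slash >= 0 ? pathName.substring(slash + 1) : pathName;
    return !name.startsWith("_") && !name.startsWith(".");
  }
}
```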
Fixes #19411.
Description
When using `index_parallel` with `HdfsInputSource` on a Kerberized HDFS cluster where the NameNode has KMS configured, the ingestion task unnecessarily attempts to acquire a KMS delegation token. This happens because `HdfsInputSource.getPaths()` uses `FileInputFormat.getSplits()` for path/glob expansion, which internally calls `TokenCache.obtainTokensForNamenodes()`, cascading into `KMSClientProvider.getDelegationToken()`. Druid's native ingestion authenticates directly via Kerberos TGT and never needs these delegation tokens.

Replaced FileInputFormat with direct FileSystem.globStatus() calls
Replaced the `FileInputFormat`/`Job`-based path expansion in `HdfsInputSource.getPaths()` with direct `FileSystem.globStatus()` calls. This achieves the same HDFS glob expansion without entering the MapReduce `TokenCache` code path, eliminating the unnecessary KMS contact.

The inner `HdfsFileInputFormat` helper class and all `org.apache.hadoop.mapreduce` imports have been removed. No other file in the `druid-hdfs-storage` module references the MapReduce API.

Preserved FileInputFormat filtering semantics
- Files whose names start with `_` or `.` are excluded, matching Hadoop's `FileInputFormat.hiddenFileFilter`.
- Directory listing is non-recursive, matching `FileInputFormat`'s default behavior when `mapreduce.input.fileinputformat.input.dir.recursive` is not set.
- Comma-separated path strings are split with `org.apache.hadoop.util.StringUtils.split()`, preserving the same comma-separation and escape behavior as `FileInputFormat.addInputPaths()`.

Updated documentation
Updated the `paths` property description in `docs/ingestion/input-sources.md` to document the non-recursive directory traversal behavior, hidden file filtering, and the use of glob patterns (e.g., `**/*.json`) for ingesting files from nested directories.

Added unit tests for getPaths() edge cases
Added a new `GetPathsTest` inner class to `HdfsInputSourceTest` with seven tests:

- `testGetPathsWithGlobMatchingNoFiles`: glob matching no files returns an empty collection
- `testGetPathsFiltersZeroLengthFiles`: zero-length files are excluded, non-empty files are included
- `testGetPathsWithMultipleInputPaths`: multiple distinct glob patterns are resolved correctly
- `testGetPathsWithCommaSeparatedString`: comma-separated path strings are split and resolved
- `testGetPathsFiltersHiddenFiles`: files starting with `_` or `.` are excluded from glob results
- `testGetPathsDirectoryListsFilesNonRecursively`: subdirectories and hidden files within a directory are skipped
- `testGetPathsSkipsHiddenDirectories`: hidden directories matched by a glob are not descended into

Release note
Fixed an issue where `HdfsInputSource` with `index_parallel` unnecessarily contacted KMS when using Kerberized HDFS, causing task failures if KMS was unreachable. The fix replaces the internal use of Hadoop MapReduce `FileInputFormat` for path expansion with direct `FileSystem.globStatus()` calls, while preserving hidden file filtering and non-recursive directory listing semantics.

Key changed/added classes in this PR
- `HdfsInputSource`
- `HdfsInputSourceTest`
- `docs/ingestion/input-sources.md`
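Taken together, the filtering rules this PR describes (hidden names, non-recursive listing of matched directories, zero-length exclusion) can be modeled with a small, Hadoop-free sketch. `GetPathsModel` and `FileEntry` are hypothetical names for illustration only, not Druid's actual types; glob expansion itself is not modeled:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the getPaths() filtering rules from this PR:
// skip hidden names ('_'/'.'), list matched directories one level deep
// without recursing, and drop zero-length files.
public class GetPathsModel
{
  // Minimal stand-in for a FileStatus-like listing entry.
  public record FileEntry(String name, boolean directory, long length, List<FileEntry> children) {}

  public static List<String> resolve(List<FileEntry> matches)
  {
    final List<String> paths = new ArrayList<>();
    for (FileEntry e : matches) {
      if (hidden(e.name())) {
        continue;  // hidden files and directories are never ingested or descended into
      }
      if (e.directory()) {
        for (FileEntry child : e.children()) {
          // one level only: subdirectories inside a matched directory are skipped
          if (!child.directory() && !hidden(child.name()) && child.length() > 0) {
            paths.add(child.name());
          }
        }
      } else if (e.length() > 0) {
        paths.add(e.name());
      }
    }
    return paths;
  }

  static boolean hidden(String name)
  {
    final int slash = name.lastIndexOf('/');
    final String base = slash >= 0 ? name.substring(slash + 1) : name;
    return base.startsWith("_") || base.startsWith(".");
  }
}
```

Under this model, a matched directory containing a data file, a `_SUCCESS` marker, a subdirectory, and an empty file resolves to just the data file, matching the test expectations listed above.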