AVRO-4249: [java] provide a cache of schema to avoid building #3746
mkeskells wants to merge 3 commits into apache:main
add tests and a benchmark
|
Benchmark results for the PR change. Nothing unexpected: a significant reduction in allocation and CPU time with this change. |
|
Could a maintainer please add the performance label to this PR? |
|
I have fixed the licence issues reported in the build (missing licence header). |
|
I finally got to the bottom of the memory leak I was chasing when I observed the problem that this PR fixes. It is https://issues.apache.org/jira/browse/AVRO-4253, a memory leak which, in my environment, was holding only 200 GB of schemas due to the leak. It is mostly fixed by this PR. |
|
Hey, pardon me! I appreciate the work, and I'll take a closer look soon -- we're going to do a 1.13.0 release just after the next one 1.12.2, and this should be in it! |
What is the purpose of the change
To improve the performance of parsing files.
In an environment where we parse 10k to 100k files, generally very small and all using the same schema or a handful of schemas, we see many TB of garbage generated by parsing duplicate schemas.
This PR is a simple fix that enables a cache to be inserted in the reader, so that a cache lookup can replace the parse when an exact match already exists.
This changes the shape of the API. I leave that to the reviewers to comment on; I am happy to work to their steer, and will write tests once the approach is agreed.
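To make the idea concrete, here is a minimal sketch of such a cache. It is not the code in this PR: the class name `SchemaCache`, its methods, and the choice to key on the exact schema JSON text are all assumptions made for illustration. The parse step is passed in as a function so that the sketch stays self-contained.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/**
 * Hypothetical cache (not the API proposed in this PR): maps the exact
 * schema JSON text to an already parsed schema of type S.
 */
final class SchemaCache<S> {
    // Keyed on the raw schema text, so only exact matches hit the cache.
    private final ConcurrentMap<String, S> parsed = new ConcurrentHashMap<>();
    // The expensive parse step, supplied by the caller.
    private final Function<String, S> parser;

    SchemaCache(Function<String, S> parser) {
        this.parser = parser;
    }

    /** Returns the cached schema for this text, parsing it only on first sight. */
    S get(String schemaJson) {
        return parsed.computeIfAbsent(schemaJson, parser);
    }

    /** Number of distinct schema texts seen so far. */
    int size() {
        return parsed.size();
    }
}
```

With Avro, the parser argument could be something like `json -> new Schema.Parser().parse(json)`. Note that this sketch is unbounded and holds strong references, so a long-running process that sees many distinct schemas would need an eviction policy.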
Verifying this change
I will add tests once the changes to the API have been agreed. There are many ways this change could be implemented, so I don't want to spend the time until the shape is agreed.
Documentation