AVRO-4249: [java] provide a cache of schema to avoid building by mkeskells · Pull Request #3746 · apache/avro

mkeskells · 2026-04-29T22:23:22Z

What is the purpose of the change

To improve the performance of parsing files.
In an environment where we parse 10k to 100k files (generally very small) that all use the same schema, or one of a handful of schemas, we see many TB of garbage generated because the same schemas are parsed repeatedly.

This PR is a simple fix that allows a cache to be inserted into the reader, so that a cache lookup can replace the parse when an exact match already exists.
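
A minimal sketch of the idea, not the code in this PR: the cache is keyed on the exact schema JSON text, so a lookup replaces the parse whenever an identical string has been seen before. The class name `SchemaTextCache` and the generic `parser` function are hypothetical; with Avro the parser function would be something like `json -> new Schema.Parser().parse(json)`.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical cache keyed on the exact schema JSON text (illustration only). */
final class SchemaTextCache<S> {
  private final ConcurrentMap<String, S> cache = new ConcurrentHashMap<>();
  private final Function<String, S> parser;

  SchemaTextCache(Function<String, S> parser) {
    this.parser = parser;
  }

  /** Returns the cached schema for an identical JSON string, parsing only on a miss. */
  S get(String schemaJson) {
    return cache.computeIfAbsent(schemaJson, parser);
  }

  public static void main(String[] args) {
    AtomicInteger parses = new AtomicInteger();
    // Stand-in parser: counts calls instead of building a real schema.
    SchemaTextCache<String> cache = new SchemaTextCache<>(json -> {
      parses.incrementAndGet();
      return "parsed:" + json;
    });
    String json = "{\"type\":\"string\"}";
    cache.get(json);
    cache.get(json); // exact match: served from the cache, no second parse
    System.out.println("parses = " + parses.get());
  }
}
```

Because the key is the raw text, two schemas that differ only in whitespace would count as different entries in this sketch; that matches the "exact match" wording above but may not match the PR's actual behaviour.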

This changes the shape of the API. I leave that to the reviewers to comment on, I am happy to follow their steer, and I will write tests once the approach is agreed.

Verifying this change

I will add tests once the changes to the API have been agreed. There are many ways that this change could be implemented, so I don't want to spend the time until the shape is agreed.

Documentation

Does this pull request introduce a new feature? Yes. I presume there would be some pluggable system setting and an API change. I have proposed a system property.
If yes, how is the feature documented? Not documented as yet.
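
For illustration only, a system-property switch could look like the sketch below. The property name `example.avro.schema.cache` is hypothetical, since this thread does not state the real one; the three values mirror the `cacheType` parameter that appears in the benchmark output in this thread (`NO_CACHE`, `CONCURRENT`, `WEAK`).

```java
import java.util.Locale;

/** Hypothetical cache selection driven by a system property (illustration only). */
final class SchemaCacheConfig {
  /** Values mirror the cacheType parameter in the benchmark output. */
  enum CacheType { NO_CACHE, CONCURRENT, WEAK }

  /** Hypothetical property name; the PR text does not state the real one. */
  static final String PROPERTY = "example.avro.schema.cache";

  /** Reads the property, falling back to NO_CACHE when it is unset or unrecognised. */
  static CacheType fromSystemProperty() {
    String raw = System.getProperty(PROPERTY);
    if (raw == null) {
      return CacheType.NO_CACHE;
    }
    try {
      return CacheType.valueOf(raw.trim().toUpperCase(Locale.ROOT));
    } catch (IllegalArgumentException e) {
      // Unknown value: keep the existing (uncached) behaviour.
      return CacheType.NO_CACHE;
    }
  }
}
```

Under this assumption the cache would be enabled with a flag such as `-Dexample.avro.schema.cache=WEAK`, and leaving the property unset keeps the current behaviour.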

add tests and a benchmark

mkeskells · 2026-05-02T21:53:41Z

Benchmark results for the PR change. Nothing unexpected.

Significant reduction in allocation and CPU time with this change.


Benchmark                                                      (cacheType)  (numFields)  (numRecords)   Mode  Cnt        Score       Error   Units
SchemaCacheEffectTest.benchmarkDataReading                            none            5             1  thrpt   10   160708.485 ± 66123.198   ops/s
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate              none            5             1  thrpt   10     3148.068 ±  1289.209  MB/sec
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate.norm         none            5             1  thrpt   10    20558.844 ±   269.523    B/op
SchemaCacheEffectTest.benchmarkDataReading:gc.count                   none            5             1  thrpt   10       50.000              counts
SchemaCacheEffectTest.benchmarkDataReading:gc.time                    none            5             1  thrpt   10     6083.000                  ms
SchemaCacheEffectTest.benchmarkDataReading                            none            5            10  thrpt   10   130146.190 ± 22400.139   ops/s
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate              none            5            10  thrpt   10     3006.175 ±   514.267  MB/sec
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate.norm         none            5            10  thrpt   10    24253.534 ±    51.819    B/op
SchemaCacheEffectTest.benchmarkDataReading:gc.count                   none            5            10  thrpt   10       21.000              counts
SchemaCacheEffectTest.benchmarkDataReading:gc.time                    none            5            10  thrpt   10     3193.000                  ms
SchemaCacheEffectTest.benchmarkDataReading                            none            5           100  thrpt   10    68919.081 ±  7302.996   ops/s
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate              none            5           100  thrpt   10     4151.588 ±   439.714  MB/sec
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate.norm         none            5           100  thrpt   10    63250.291 ±    59.216    B/op
SchemaCacheEffectTest.benchmarkDataReading:gc.count                   none            5           100  thrpt   10       16.000              counts
SchemaCacheEffectTest.benchmarkDataReading:gc.time                    none            5           100  thrpt   10     2092.000                  ms
SchemaCacheEffectTest.benchmarkDataReading                            none           50             1  thrpt   10    16111.848 ± 20899.702   ops/s
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate              none           50             1  thrpt   10     1271.894 ±  1649.321  MB/sec
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate.norm         none           50             1  thrpt   10    82818.178 ±    85.805    B/op
SchemaCacheEffectTest.benchmarkDataReading:gc.count                   none           50             1  thrpt   10       53.000              counts
SchemaCacheEffectTest.benchmarkDataReading:gc.time                    none           50             1  thrpt   10    12251.000                  ms
SchemaCacheEffectTest.benchmarkDataReading                            none           50            10  thrpt   10    17872.197 ± 16429.837   ops/s
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate              none           50            10  thrpt   10     2001.422 ±  1839.438  MB/sec
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate.norm         none           50            10  thrpt   10   117502.133 ±    82.087    B/op
SchemaCacheEffectTest.benchmarkDataReading:gc.count                   none           50            10  thrpt   10       50.000              counts
SchemaCacheEffectTest.benchmarkDataReading:gc.time                    none           50            10  thrpt   10     9733.000                  ms
SchemaCacheEffectTest.benchmarkDataReading                            none           50           100  thrpt   10    11193.744 ±  1047.678   ops/s
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate              none           50           100  thrpt   10     5074.400 ±   471.679  MB/sec
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate.norm         none           50           100  thrpt   10   475709.595 ±    94.557    B/op
SchemaCacheEffectTest.benchmarkDataReading:gc.count                   none           50           100  thrpt   10       21.000              counts
SchemaCacheEffectTest.benchmarkDataReading:gc.time                    none           50           100  thrpt   10     2190.000                  ms
SchemaCacheEffectTest.benchmarkDataReading                            none          500             1  thrpt   10     1239.421 ±  2417.645   ops/s
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate              none          500             1  thrpt   10      827.063 ±  1606.939  MB/sec
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate.norm         none          500             1  thrpt   10   698100.419 ±    89.206    B/op
SchemaCacheEffectTest.benchmarkDataReading:gc.count                   none          500             1  thrpt   10       44.000              counts
SchemaCacheEffectTest.benchmarkDataReading:gc.time                    none          500             1  thrpt   10    14395.000                  ms
SchemaCacheEffectTest.benchmarkDataReading                            none          500            10  thrpt   10     1809.982 ±  2127.079   ops/s
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate              none          500            10  thrpt   10     1817.693 ±  2130.299  MB/sec
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate.norm         none          500            10  thrpt   10  1052253.600 ±    84.368    B/op
SchemaCacheEffectTest.benchmarkDataReading:gc.count                   none          500            10  thrpt   10       54.000              counts
SchemaCacheEffectTest.benchmarkDataReading:gc.time                    none          500            10  thrpt   10    10312.000                  ms
SchemaCacheEffectTest.benchmarkDataReading                            none          500           100  thrpt   10     1132.882 ±   120.509   ops/s
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate              none          500           100  thrpt   10     4457.240 ±   462.910  MB/sec
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate.norm         none          500           100  thrpt   10  4127075.031 ±   185.786    B/op
SchemaCacheEffectTest.benchmarkDataReading:gc.count                   none          500           100  thrpt   10       20.000              counts
SchemaCacheEffectTest.benchmarkDataReading:gc.time                    none          500           100  thrpt   10     2194.000                  ms
SchemaCacheEffectTest.benchmarkDataReading                            weak            5             1  thrpt   10   857058.952 ± 41590.581   ops/s
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate              weak            5             1  thrpt   10     9055.300 ±   440.460  MB/sec
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate.norm         weak            5             1  thrpt   10    11088.012 ±     0.001    B/op
SchemaCacheEffectTest.benchmarkDataReading:gc.count                   weak            5             1  thrpt   10      106.000              counts
SchemaCacheEffectTest.benchmarkDataReading:gc.time                    weak            5             1  thrpt   10      164.000                  ms
SchemaCacheEffectTest.benchmarkDataReading                            weak            5            10  thrpt   10   558125.787 ± 48281.646   ops/s
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate              weak            5            10  thrpt   10     7804.731 ±   679.401  MB/sec
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate.norm         weak            5            10  thrpt   10    14676.020 ±    44.620    B/op
SchemaCacheEffectTest.benchmarkDataReading:gc.count                   weak            5            10  thrpt   10       99.000              counts
SchemaCacheEffectTest.benchmarkDataReading:gc.time                    weak            5            10  thrpt   10      146.000                  ms
SchemaCacheEffectTest.benchmarkDataReading                            weak            5           100  thrpt   10   122035.807 ± 19252.663   ops/s
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate              weak            5           100  thrpt   10     6238.761 ±   984.634  MB/sec
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate.norm         weak            5           100  thrpt   10    53656.085 ±     0.014    B/op
SchemaCacheEffectTest.benchmarkDataReading:gc.count                   weak            5           100  thrpt   10       89.000              counts
SchemaCacheEffectTest.benchmarkDataReading:gc.time                    weak            5           100  thrpt   10      132.000                  ms
SchemaCacheEffectTest.benchmarkDataReading                            weak           50             1  thrpt   10   381117.522 ± 30016.980   ops/s
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate              weak           50             1  thrpt   10     7444.513 ±   598.525  MB/sec
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate.norm         weak           50             1  thrpt   10    20500.029 ±    44.617    B/op
SchemaCacheEffectTest.benchmarkDataReading:gc.count                   weak           50             1  thrpt   10       91.000              counts
SchemaCacheEffectTest.benchmarkDataReading:gc.time                    weak           50             1  thrpt   10      131.000                  ms
SchemaCacheEffectTest.benchmarkDataReading                            weak           50            10  thrpt   10   120204.520 ± 12184.146   ops/s
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate              weak           50            10  thrpt   10     6324.247 ±   640.856  MB/sec
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate.norm         weak           50            10  thrpt   10    55232.086 ±     0.009    B/op
SchemaCacheEffectTest.benchmarkDataReading:gc.count                   weak           50            10  thrpt   10       82.000              counts
SchemaCacheEffectTest.benchmarkDataReading:gc.time                    weak           50            10  thrpt   10      122.000                  ms
SchemaCacheEffectTest.benchmarkDataReading                            weak           50           100  thrpt   10    17391.522 ±  2058.825   ops/s
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate              weak           50           100  thrpt   10     6848.725 ±   808.000  MB/sec
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate.norm         weak           50           100  thrpt   10   413376.598 ±     0.067    B/op
SchemaCacheEffectTest.benchmarkDataReading:gc.count                   weak           50           100  thrpt   10       94.000              counts
SchemaCacheEffectTest.benchmarkDataReading:gc.time                    weak           50           100  thrpt   10      168.000                  ms
SchemaCacheEffectTest.benchmarkDataReading                            weak          500             1  thrpt   10    69282.529 ±  4269.320   ops/s
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate              weak          500             1  thrpt   10     7729.057 ±   476.657  MB/sec
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate.norm         weak          500             1  thrpt   10   117080.148 ±     0.011    B/op
SchemaCacheEffectTest.benchmarkDataReading:gc.count                   weak          500             1  thrpt   10       91.000              counts
SchemaCacheEffectTest.benchmarkDataReading:gc.time                    weak          500             1  thrpt   10      152.000                  ms
SchemaCacheEffectTest.benchmarkDataReading                            weak          500            10  thrpt   10    15909.052 ±  1503.023   ops/s
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate              weak          500            10  thrpt   10     7142.377 ±   677.104  MB/sec
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate.norm         weak          500            10  thrpt   10   471271.854 ±    40.894    B/op
SchemaCacheEffectTest.benchmarkDataReading:gc.count                   weak          500            10  thrpt   10       88.000              counts
SchemaCacheEffectTest.benchmarkDataReading:gc.time                    weak          500            10  thrpt   10      155.000                  ms
SchemaCacheEffectTest.benchmarkDataReading                            weak          500           100  thrpt   10     1843.039 ±   127.651   ops/s
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate              weak          500           100  thrpt   10     6227.871 ±   427.151  MB/sec
SchemaCacheEffectTest.benchmarkDataReading:gc.alloc.rate.norm         weak          500           100  thrpt   10  3548077.588 ±     0.376    B/op
SchemaCacheEffectTest.benchmarkDataReading:gc.count                   weak          500           100  thrpt   10       77.000              counts
SchemaCacheEffectTest.benchmarkDataReading:gc.time                    weak          500           100  thrpt   10      124.000                  ms
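
To put one pair of rows in perspective, the short calculation below compares the `none` and `weak` results for 5 fields and 1 record, using only scores copied from the table above. The error bars on the `none` run are wide, so the ratios are indicative rather than precise.

```java
import java.util.Locale;

/** Compares two benchmark rows copied from the table above (5 fields, 1 record). */
final class BenchmarkRatio {
  /** Throughput of the cached run relative to the uncached baseline. */
  static double speedup(double baselineOpsPerSec, double cachedOpsPerSec) {
    return cachedOpsPerSec / baselineOpsPerSec;
  }

  /** Percentage reduction in bytes allocated per operation. */
  static double allocationReductionPercent(double baselineBytesPerOp, double cachedBytesPerOp) {
    return 100.0 * (1.0 - cachedBytesPerOp / baselineBytesPerOp);
  }

  public static void main(String[] args) {
    // ops/s for cacheType none vs weak, numFields=5, numRecords=1
    double speedup = speedup(160708.485, 857058.952);
    // gc.alloc.rate.norm (B/op) for the same two rows
    double saved = allocationReductionPercent(20558.844, 11088.012);
    System.out.println(String.format(Locale.ROOT,
        "throughput ratio: %.1f, allocation per op reduced by %.0f%%", speedup, saved));
  }
}
```

For this pair of rows that works out to roughly 5.3 times the throughput and about 46% fewer bytes allocated per operation.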



Benchmark                                                  (cacheType)  (numFields)   Mode  Cnt          Score         Error   Units
SchemaCacheTest.benchmarkSchemaParsing                        NO_CACHE            5  thrpt   10     404585.474 ±    9206.027   ops/s
SchemaCacheTest.benchmarkSchemaParsing:gc.alloc.rate          NO_CACHE            5  thrpt   10       3486.962 ±      79.322  MB/sec
SchemaCacheTest.benchmarkSchemaParsing:gc.alloc.rate.norm     NO_CACHE            5  thrpt   10       9040.017 ±       0.001    B/op
SchemaCacheTest.benchmarkSchemaParsing:gc.count               NO_CACHE            5  thrpt   10         85.000                counts
SchemaCacheTest.benchmarkSchemaParsing:gc.time                NO_CACHE            5  thrpt   10         85.000                    ms
SchemaCacheTest.benchmarkSchemaParsing                        NO_CACHE           50  thrpt   10      53144.889 ±    1996.919   ops/s
SchemaCacheTest.benchmarkSchemaParsing:gc.alloc.rate          NO_CACHE           50  thrpt   10       3062.896 ±     114.963  MB/sec
SchemaCacheTest.benchmarkSchemaParsing:gc.alloc.rate.norm     NO_CACHE           50  thrpt   10      60448.130 ±       0.005    B/op
SchemaCacheTest.benchmarkSchemaParsing:gc.count               NO_CACHE           50  thrpt   10         76.000                counts
SchemaCacheTest.benchmarkSchemaParsing:gc.time                NO_CACHE           50  thrpt   10         79.000                    ms
SchemaCacheTest.benchmarkSchemaParsing                        NO_CACHE          500  thrpt   10       5854.415 ±     404.753   ops/s
SchemaCacheTest.benchmarkSchemaParsing:gc.alloc.rate          NO_CACHE          500  thrpt   10       3150.857 ±     217.770  MB/sec
SchemaCacheTest.benchmarkSchemaParsing:gc.alloc.rate.norm     NO_CACHE          500  thrpt   10     564505.190 ±      25.482    B/op
SchemaCacheTest.benchmarkSchemaParsing:gc.count               NO_CACHE          500  thrpt   10         78.000                counts
SchemaCacheTest.benchmarkSchemaParsing:gc.time                NO_CACHE          500  thrpt   10         76.000                    ms
SchemaCacheTest.benchmarkSchemaParsing                      CONCURRENT            5  thrpt   10  213069364.030 ± 2674481.309   ops/s
SchemaCacheTest.benchmarkSchemaParsing:gc.alloc.rate        CONCURRENT            5  thrpt   10       3250.295 ±      40.752  MB/sec
SchemaCacheTest.benchmarkSchemaParsing:gc.alloc.rate.norm   CONCURRENT            5  thrpt   10         16.000 ±       0.001    B/op
SchemaCacheTest.benchmarkSchemaParsing:gc.count             CONCURRENT            5  thrpt   10         81.000                counts
SchemaCacheTest.benchmarkSchemaParsing:gc.time              CONCURRENT            5  thrpt   10         75.000                    ms
SchemaCacheTest.benchmarkSchemaParsing                      CONCURRENT           50  thrpt   10  212597784.922 ± 3598523.893   ops/s
SchemaCacheTest.benchmarkSchemaParsing:gc.alloc.rate        CONCURRENT           50  thrpt   10       3243.227 ±      54.879  MB/sec
SchemaCacheTest.benchmarkSchemaParsing:gc.alloc.rate.norm   CONCURRENT           50  thrpt   10         16.000 ±       0.001    B/op
SchemaCacheTest.benchmarkSchemaParsing:gc.count             CONCURRENT           50  thrpt   10         80.000                counts
SchemaCacheTest.benchmarkSchemaParsing:gc.time              CONCURRENT           50  thrpt   10         72.000                    ms
SchemaCacheTest.benchmarkSchemaParsing                      CONCURRENT          500  thrpt   10  210281467.875 ± 4954734.313   ops/s
SchemaCacheTest.benchmarkSchemaParsing:gc.alloc.rate        CONCURRENT          500  thrpt   10       3207.787 ±      75.475  MB/sec
SchemaCacheTest.benchmarkSchemaParsing:gc.alloc.rate.norm   CONCURRENT          500  thrpt   10         16.000 ±       0.001    B/op
SchemaCacheTest.benchmarkSchemaParsing:gc.count             CONCURRENT          500  thrpt   10         86.000                counts
SchemaCacheTest.benchmarkSchemaParsing:gc.time              CONCURRENT          500  thrpt   10         76.000                    ms
SchemaCacheTest.benchmarkSchemaParsing                            WEAK            5  thrpt   10   38493704.277 ± 1283442.559   ops/s
SchemaCacheTest.benchmarkSchemaParsing:gc.alloc.rate              WEAK            5  thrpt   10        880.873 ±      29.379  MB/sec
SchemaCacheTest.benchmarkSchemaParsing:gc.alloc.rate.norm         WEAK            5  thrpt   10         24.001 ±       0.001    B/op
SchemaCacheTest.benchmarkSchemaParsing:gc.count                   WEAK            5  thrpt   10         39.000                counts
SchemaCacheTest.benchmarkSchemaParsing:gc.time                    WEAK            5  thrpt   10         25.000                    ms
SchemaCacheTest.benchmarkSchemaParsing                            WEAK           50  thrpt   10   39070048.693 ±  426068.207   ops/s
SchemaCacheTest.benchmarkSchemaParsing:gc.alloc.rate              WEAK           50  thrpt   10        894.238 ±       9.681  MB/sec
SchemaCacheTest.benchmarkSchemaParsing:gc.alloc.rate.norm         WEAK           50  thrpt   10         24.007 ±       0.001    B/op
SchemaCacheTest.benchmarkSchemaParsing:gc.count                   WEAK           50  thrpt   10         40.000                counts
SchemaCacheTest.benchmarkSchemaParsing:gc.time                    WEAK           50  thrpt   10         28.000                    ms
SchemaCacheTest.benchmarkSchemaParsing                            WEAK          500  thrpt   10   37408903.367 ± 3244572.982   ops/s
SchemaCacheTest.benchmarkSchemaParsing:gc.alloc.rate              WEAK          500  thrpt   10        858.073 ±      74.413  MB/sec
SchemaCacheTest.benchmarkSchemaParsing:gc.alloc.rate.norm         WEAK          500  thrpt   10         24.058 ±       0.009    B/op
SchemaCacheTest.benchmarkSchemaParsing:gc.count                   WEAK          500  thrpt   10         38.000                counts
SchemaCacheTest.benchmarkSchemaParsing:gc.time                    WEAK          500  thrpt   10         30.000                    ms
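
The `WEAK` rows suggest a cache that does not pin schemas in memory. How the PR implements it is not shown in this thread; one possible shape, offered purely as an assumption, holds the values through `WeakReference` so that a schema nobody else uses can be garbage collected. The class name `WeakValueSchemaCache` and the generic `parser` function are hypothetical.

```java
import java.lang.ref.WeakReference;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hypothetical weak-valued cache: entries do not keep parsed schemas alive. */
final class WeakValueSchemaCache<S> {
  private final ConcurrentMap<String, WeakReference<S>> cache = new ConcurrentHashMap<>();
  private final Function<String, S> parser;

  WeakValueSchemaCache(Function<String, S> parser) {
    this.parser = parser;
  }

  /** Returns the live cached value, or parses again if it was collected or never seen. */
  S get(String schemaJson) {
    // The holder keeps a strong reference so the value cannot be collected before we return it.
    Object[] holder = new Object[1];
    cache.compute(schemaJson, (key, ref) -> {
      S existing = (ref == null) ? null : ref.get();
      if (existing != null) {
        holder[0] = existing;
        return ref;
      }
      S parsed = parser.apply(key);
      holder[0] = parsed;
      return new WeakReference<>(parsed);
    });
    @SuppressWarnings("unchecked")
    S result = (S) holder[0];
    return result;
  }
}
```

This sketch never removes keys whose values have been collected; a fuller version would purge them, for example with a `ReferenceQueue`.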

mkeskells · 2026-05-02T22:01:56Z

Could a maintainer please add the performance label to this PR?

mkeskells · 2026-05-03T14:47:43Z

I have fixed the licence issues reported in the build (missing licence header).
Is there any way that I can re-run the builds? Or is that only for the maintainers?

mkeskells · 2026-05-03T22:39:51Z

I finally got to the bottom of the memory leak I was chasing when I observed the problem that this PR fixes. It's https://issues.apache.org/jira/browse/AVRO-4253, a memory leak which, in my environment, was holding only 200 GB of Schemas. It is mostly fixed by this PR.

RyanSkraba · 2026-05-06T18:18:17Z

Hey, pardon me! I appreciate the work, and I'll take a closer look soon -- we're going to do a 1.13.0 release just after the next one 1.12.2, and this should be in it!

github-actions bot added the Java (Pull Requests for Java binding) label on Apr 29, 2026

AVRO-4249 provide a cache of schema to avoid building

34e6910

mkeskells force-pushed the AVRO-4249-schema-cache branch from 65f4513 to 34e6910 on April 29, 2026 at 22:30

AVRO-4249 provide a cache of schema to avoid building

9f5b962

add tests and a benchmark

AVRO-4249 add licence

b14e690

mkeskells wants to merge 3 commits into apache:main from mkeskells:AVRO-4249-schema-cache

2 participants
