Slow local file reads when using GetResult.into_stream #693

@ariel-miculas

Describe the bug
See apache/datafusion#21450
Root cause: a separate spawn_blocking call is made for every 8 KiB read from the file, which adds significant context-switch overhead.
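
To give a rough sense of scale, here is a minimal sketch of the arithmetic, assuming one blocking-task dispatch per chunk as described above. `reads_needed` and the 1 GiB example size are hypothetical and only for illustration; this is not code from object_store.

```rust
const KIB: u64 = 1024;
const MIB: u64 = 1024 * KIB;
const GIB: u64 = 1024 * MIB;

/// Number of chunk-sized reads needed to cover `file_len` bytes.
/// Under the assumption that every read is dispatched as its own
/// blocking task, this is also the number of task hand-offs.
fn reads_needed(file_len: u64, chunk_size: u64) -> u64 {
    assert!(chunk_size > 0, "chunk size must be non-zero");
    // Ceiling division without risking overflow on large inputs.
    file_len / chunk_size + u64::from(file_len % chunk_size != 0)
}
```

For a hypothetical 1 GiB file, `reads_needed(GIB, 8 * KIB)` is 131072, while `reads_needed(GIB, 2 * MIB)` is 512, i.e. 256 times fewer dispatches.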

To Reproduce
See apache/datafusion#21446
For the tests I used a c7a.16xlarge EC2 instance and a version of hits.json trimmed down to 51G (the original is 217 GiB), with a warm cache (warmed by running cat hits_50.json > /dev/null).

Expected behavior

A more efficient implementation (e.g. tokio uses a buffer size of 2 MiB when reading files).
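
As an illustration of why the buffer size matters, here is a synchronous, stdlib-only sketch that reads a source to the end with a configurable buffer and counts the read calls. `read_in_chunks` is a hypothetical helper, not part of object_store or tokio; in the async setting described in this issue, each counted call would roughly correspond to one blocking-task dispatch.

```rust
use std::io::{self, Read};

/// Reads `reader` to the end using a buffer of `buf_size` bytes.
/// Returns (total bytes read, number of read calls that returned data).
/// Note: a `buf_size` of 0 makes the first read return 0, so the result is (0, 0).
fn read_in_chunks<R: Read>(mut reader: R, buf_size: usize) -> io::Result<(u64, u64)> {
    let mut buf = vec![0u8; buf_size];
    let mut total = 0u64;
    let mut calls = 0u64;
    loop {
        match reader.read(&mut buf) {
            // A zero-length read signals end of input.
            Ok(0) => break,
            Ok(n) => {
                total += n as u64;
                calls += 1;
            }
            // Retry reads that were interrupted by a signal.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok((total, calls))
}
```

Reading a 4 MiB in-memory buffer through `std::io::Cursor` takes 512 calls with an 8 KiB buffer and 2 calls with a 2 MiB buffer. A real file may return short reads, so the counts there are lower bounds.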

Additional context
apache/datafusion#21478 (comment)

Labels: enhancement
