Commit 2e1fc5c

More documentation on FileSource::table_schema and FileSource::projection (#20242)
## Which issue does this PR close?

- Follow on to #20188

## Rationale for this change

@zhuqi-lucas and @adriangb had some good ideas on how to further improve the documentation on #20188, which I tried to implement in this PR

## What changes are included in this PR?

Add more clarity about what TableSource and FileSource::projection are

## Are these changes tested?

By CI

## Are there any user-facing changes?

Additional documentation
1 parent bdfe987 commit 2e1fc5c

2 files changed

Lines changed: 16 additions & 9 deletions

datafusion/datasource/src/file.rs

Lines changed: 12 additions & 5 deletions
@@ -39,16 +39,17 @@ use datafusion_physical_plan::metrics::ExecutionPlanMetricsSet;
 use datafusion_physical_expr_common::sort_expr::PhysicalSortExpr;
 use object_store::ObjectStore;
 
-/// Helper function to convert any type implementing FileSource to Arc<dyn FileSource>
+/// Helper function to convert any type implementing [`FileSource`] to `Arc<dyn FileSource>`
 pub fn as_file_source<T: FileSource + 'static>(source: T) -> Arc<dyn FileSource> {
     Arc::new(source)
 }
 
-/// file format specific behaviors for elements in [`DataSource`]
+/// File format specific behaviors for [`DataSource`]
 ///
 /// # Schema information
 /// There are two important schemas for a [`FileSource`]:
-/// 1. [`Self::table_schema`] -- the schema for the overall "table"
+/// 1. [`Self::table_schema`] -- the schema for the overall table
+/// (file data plus partition columns)
 /// 2. The logical output schema, comprised of [`Self::table_schema`] with
 /// [`Self::projection`] applied
 ///
@@ -71,13 +72,16 @@ pub trait FileSource: Send + Sync {
     /// Any
     fn as_any(&self) -> &dyn Any;
 
-    /// Returns the table schema for this file source.
+    /// Returns the table schema for the overall table (including partition columns, if any)
     ///
-    /// This always returns the unprojected schema (the full schema of the data)
+    /// This method returns the unprojected schema: the full schema of the data
     /// without [`Self::projection`] applied.
     ///
     /// The output schema of this `FileSource` is this TableSchema
     /// with [`Self::projection`] applied.
+    ///
+    /// Use [`ProjectionExprs::project_schema`] to get the projected schema
+    /// after applying the projection.
     fn table_schema(&self) -> &crate::table_schema::TableSchema;
 
     /// Initialize new type with batch size configuration
@@ -92,6 +96,9 @@ pub trait FileSource: Send + Sync {
 
     /// Return the projection that will be applied to the output stream on top
     /// of [`Self::table_schema`].
+    ///
+    /// Note you can use [`ProjectionExprs::project_schema`] on the table
+    /// schema to get the effective output schema of this source.
     fn projection(&self) -> Option<&ProjectionExprs> {
         None
     }
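
The relationship between the table schema, the projection, and the output schema described in these doc comments can be illustrated with a small standalone sketch. It does not use the DataFusion API: `Field` and `output_schema` are hypothetical stand-ins, the projection is simplified to a list of column indices (the real `ProjectionExprs` is a richer type), and treating a `None` projection as "return the full table schema" is an assumption based on the default `projection()` shown above.

```rust
/// Simplified stand-in for a schema field (hypothetical, not the Arrow type):
/// just a column name and a type name.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: String,
}

impl Field {
    pub fn new(name: &str, data_type: &str) -> Self {
        Field {
            name: name.to_string(),
            data_type: data_type.to_string(),
        }
    }
}

/// Hypothetical helper: derive the output schema from the full (unprojected)
/// table schema and an optional projection given as column indices.
///
/// Assumption: `None` means no projection, so the output schema equals the
/// table schema. Panics if an index is out of bounds.
pub fn output_schema(table_schema: &[Field], projection: Option<&[usize]>) -> Vec<Field> {
    match projection {
        None => table_schema.to_vec(),
        Some(indices) => indices
            .iter()
            .map(|&i| table_schema[i].clone())
            .collect(),
    }
}
```

For a table schema of `id`, `value`, `year`, calling `output_schema(&table, Some(&[2, 0]))` selects `year` then `id`, while passing `None` returns all three columns unchanged.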

datafusion/datasource/src/table_schema.rs

Lines changed: 4 additions & 4 deletions
@@ -20,13 +20,13 @@
 use arrow::datatypes::{FieldRef, SchemaBuilder, SchemaRef};
 use std::sync::Arc;
 
-/// Helper to hold table schema information for partitioned data sources.
+/// The overall schema for potentially partitioned data sources.
 ///
-/// When reading partitioned data (such as Hive-style partitioning), a table's schema
+/// When reading partitioned data (such as Hive-style partitioning), a [`TableSchema`]
 /// consists of two parts:
 /// 1. **File schema**: The schema of the actual data files on disk
-/// 2. **Partition columns**: Columns that are encoded in the directory structure,
-/// not stored in the files themselves
+/// 2. **Partition columns**: Columns whose values are encoded in the directory structure,
+/// but not stored in the files themselves
 ///
 /// # Example: Partitioned Table
 ///
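
The two-part structure described in this doc comment (file schema plus partition columns) can be sketched as follows. This is not the DataFusion `TableSchema` API: `SketchTableSchema` and `partition_values` are hypothetical names, columns are reduced to plain names, and appending the partition columns after the file columns is an assumption made for the illustration.

```rust
/// Hypothetical, simplified model of a table schema for partitioned data:
/// columns stored in the files plus columns encoded in the directory layout.
pub struct SketchTableSchema {
    file_columns: Vec<String>,
    partition_columns: Vec<String>,
}

impl SketchTableSchema {
    pub fn new(file_columns: &[&str], partition_columns: &[&str]) -> Self {
        SketchTableSchema {
            file_columns: file_columns.iter().map(|c| c.to_string()).collect(),
            partition_columns: partition_columns.iter().map(|c| c.to_string()).collect(),
        }
    }

    /// Full table columns: file columns first, then partition columns
    /// (ordering is an assumption of this sketch).
    pub fn table_columns(&self) -> Vec<String> {
        self.file_columns
            .iter()
            .chain(self.partition_columns.iter())
            .cloned()
            .collect()
    }
}

/// Extract Hive-style `key=value` pairs from the directory part of a path.
/// The final path segment is treated as the file name and ignored.
pub fn partition_values(path: &str) -> Vec<(String, String)> {
    let mut segments: Vec<&str> = path.split('/').collect();
    segments.pop(); // drop the file name
    segments
        .into_iter()
        .filter_map(|segment| segment.split_once('='))
        .map(|(key, value)| (key.to_string(), value.to_string()))
        .collect()
}
```

With file columns `id` and `value` and partition columns `year` and `month`, a path such as `data/year=2024/month=01/part-0.parquet` supplies the `year` and `month` values from its directories, while `id` and `value` come from the file contents.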
