Skip to content

Add quote style and trimming to csv writier#20813

Open
xanderbailey wants to merge 11 commits intoapache:mainfrom
xanderbailey:xb/csv_writer_options
Open

Add quote style and trimming to csv writier#20813
xanderbailey wants to merge 11 commits intoapache:mainfrom
xanderbailey:xb/csv_writer_options

Conversation

@xanderbailey
Copy link
Copy Markdown
Contributor

@xanderbailey xanderbailey commented Mar 8, 2026

Which issue does this PR close?

Related arrow-rs PRs apache/arrow-rs#8960 and apache/arrow-rs#9004

Rationale for this change

The CSV writer was missing support for quote_style, ignore_leading_whitespace, and ignore_trailing_whitespace options that are available on the underlying arrow WriterBuilder. This meant users couldn't control quoting behaviour or whitespace trimming when writing CSV files.

What changes are included in this PR?

Adds three new CSV writer options wired through the full stack:

  • quote_style — controls when fields are quoted (Always, Necessary, NonNumeric, Never). Modelled as a protobuf enum (CsvQuoteStyle).
  • ignore_leading_whitespace — trims leading whitespace from string values on write.
  • ignore_trailing_whitespace — trims trailing whitespace from string values on write.

Are these changes tested?

Yes — sqllogictest coverage added in csv_files.slt

Are there any user-facing changes?

Three new format.* options available in COPY TO and CREATE EXTERNAL TABLE for CSV:

  • format.quote_style (string: Always, Necessary, NonNumeric, Never)
  • format.ignore_leading_whitespace (boolean)
  • format.ignore_trailing_whitespace (boolean)

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) common Related to common crate proto Related to proto crate labels Mar 8, 2026
LOCATION 'test_files/scratch/csv_files/quote_style_always.csv'
OPTIONS ('format.has_header' 'true', 'format.quote_style' 'Never');

# All values should have been quoted, but reading them back strips the quotes
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard in SLT to actually test these but I did my best...

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could do something like this:

statement ok
CREATE EXTERNAL TABLE stored_quote_style_nonnumeric (
  whole_file TEXT
) STORED AS CSV
LOCATION 'test_files/scratch/csv_files/quote_style_nonnumeric.csv'
OPTIONS ('format.has_header' 'true', 'format.delimiter' '@');

query T
select * from stored_quote_style_nonnumeric;
----
1,"hello",1.1
2,"world",2.2
3,"comma,value",3.3
  • Pretty much read entire file as a single column, by choosing a delimiter that doesn't appear

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that's a nice idea, I'll change to that!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pub quote_style: i32,
/// Whether to ignore leading whitespace in string values
#[prost(bytes = "vec", tag = "21")]
pub ignore_leading_whitespace: ::prost::alloc::vec::Vec<u8>,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the pattern here for other bools being Vec

pub quote_style: i32,
/// Whether to ignore leading whitespace in string values
#[prost(bytes = "vec", tag = "21")]
pub ignore_leading_whitespace: ::prost::alloc::vec::Vec<u8>,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the pattern here for other bools being Vec

Ok(CsvQuoteStyleProto::Always) => CsvQuoteStyle::Always,
Ok(CsvQuoteStyleProto::NonNumeric) => CsvQuoteStyle::NonNumeric,
Ok(CsvQuoteStyleProto::Never) => CsvQuoteStyle::Never,
_ => CsvQuoteStyle::Necessary,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't error on:

            compression: match proto.compression {
                0 => CompressionTypeVariant::GZIP,
                1 => CompressionTypeVariant::BZIP2,
                2 => CompressionTypeVariant::XZ,
                3 => CompressionTypeVariant::ZSTD,
                _ => CompressionTypeVariant::UNCOMPRESSED,
            },

So made the same true here

@xanderbailey
Copy link
Copy Markdown
Contributor Author

@Jefffrey are you maybe able to take a look here?

Copy link
Copy Markdown
Contributor

@rtyler rtyler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have done some integration testing of this PR since it solves a problem one of my customers was having as well. I think it's good to merge, with a squashing of course 😈

I ended up backporting it to 52.4.0 which applied cleanly as well. I notice that arrow 57.2 actually introduced QuoteStyle which means this change requires arrow-csv 57.2.0 or later

Comment on lines +81 to +92
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Default)]
pub enum CsvQuoteStyle {
/// Quote all fields
Always,
/// Only quote fields when necessary (default)
#[default]
Necessary,
/// Quote all non-numeric fields
NonNumeric,
/// Never quote fields
Never,
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my own understanding is this duplicated from the arrow-csv crate in order to add more trait implementations? I am less familiar with the protocol buffer stuff that happens in Datafusion, but figure there must be a reason for the struct to be ruplicated

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I wasn't 100% sure what to do here but it seemed fine to duplicate?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is in line with trend of having types here be DF versions; for example we have a DF version of parquet writer version as well:

/// Parquet writer version options for controlling the Parquet file format version
///
/// This enum validates parquet writer version values at configuration time,
/// ensuring only valid versions ("1.0" or "2.0") can be set via `SET` commands
/// or proto deserialization.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum DFParquetWriterVersion {
/// Parquet format version 1.0
#[default]
V1_0,
/// Parquet format version 2.0
V2_0,
}

rtyler added a commit to buoyant-data/oxbow that referenced this pull request Apr 1, 2026
Digging deeper into load failures I discovered that while
arrow-csv and other Rust CSV parsers can handle some characters
properly, some RDMS and C-based CSV parsers (older) cannot, without
quoting.

This relies on a backport of apache/datafusion#20813 to the 52.4.0
release in order to support this. It also requires arrow-csv 57.2.0 or
later since that release is the first that `QuoteStyle` exists.
@xanderbailey
Copy link
Copy Markdown
Contributor Author

@alamb could I bother you for a review here please?

builder = builder.with_double_quote(*v)
}
let style = match value.quote_style {
CsvQuoteStyle::Always => QuoteStyle::Always,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: pull this into a From impl where CsvQuoteStyle is defined

Comment thread datafusion/common/src/parsers.rs Outdated
Comment on lines +98 to +106
match s {
"Always" | "ALWAYS" | "always" => Ok(Self::Always),
"Necessary" | "NECESSARY" | "necessary" | "" => Ok(Self::Necessary),
"NonNumeric" | "NON_NUMERIC" | "nonnumeric" => Ok(Self::NonNumeric),
"Never" | "NEVER" | "never" => Ok(Self::Never),
_ => Err(DataFusionError::NotImplemented(format!(
"Unsupported CSV quote style {s}"
))),
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
match s {
"Always" | "ALWAYS" | "always" => Ok(Self::Always),
"Necessary" | "NECESSARY" | "necessary" | "" => Ok(Self::Necessary),
"NonNumeric" | "NON_NUMERIC" | "nonnumeric" => Ok(Self::NonNumeric),
"Never" | "NEVER" | "never" => Ok(Self::Never),
_ => Err(DataFusionError::NotImplemented(format!(
"Unsupported CSV quote style {s}"
))),
}
match s.to_lowercase().as_str() {
"always" => Ok(Self::Always),
"necessary" | "" => Ok(Self::Necessary),
"non_numeric" | "nonnumeric" => Ok(Self::NonNumeric),
"never" => Ok(Self::Never),
_ => Err(DataFusionError::NotImplemented(format!(
"Unsupported CSV quote style {s}"
))),
}

I think it might be a better idea to be permissive like this; also not sure about that empty string parsing as the default 🤔

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll quickly check what spark does here

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spark just has quoteAll which maps to always so I have removed the empty string.

Comment thread datafusion/sqllogictest/test_files/csv_files.slt Outdated
query IT
select * from stored_trim_leading order by id;
----
1 hello
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we expect to see trailing whitespace here? Or is there an issue with the runner on how it treats trailing whitespaces 🤔

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I had to look at the CSV that actually got written...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +81 to +92
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Default)]
pub enum CsvQuoteStyle {
/// Quote all fields
Always,
/// Only quote fields when necessary (default)
#[default]
Necessary,
/// Quote all non-numeric fields
NonNumeric,
/// Never quote fields
Never,
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is in line with trend of having types here be DF versions; for example we have a DF version of parquet writer version as well:

/// Parquet writer version options for controlling the Parquet file format version
///
/// This enum validates parquet writer version values at configuration time,
/// ensuring only valid versions ("1.0" or "2.0") can be set via `SET` commands
/// or proto deserialization.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum DFParquetWriterVersion {
/// Parquet format version 1.0
#[default]
V1_0,
/// Parquet format version 2.0
V2_0,
}

Comment on lines +2945 to 2948
pub ignore_leading_whitespace: Option<bool>, default = None
/// Whether to ignore trailing whitespace in string values when writing CSV.
pub ignore_trailing_whitespace: Option<bool>, default = None
/// Specifies whether newlines in (quoted) values are supported.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering why we have this trend of doing Option<bool> instead of just bool 🤔

What is the default in case of None, and how is that different from just true/false (whichever it is)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also curious about this but didn't want to break the pattern...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it some ser-deser quirk? I'll see if I can figure this out.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I can't find a good reason for this... Would you like me to change or keep the pattern?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps its to do with hierarchy of configs and/or sql parsing 🤔

@alamb do you happen to know a reason for having Option<bool> instead of plain bool for cases where it'll end up being true or false anyway? (i.e. None doesn't represent a third state, but eventually maps to either true or false)

@xanderbailey xanderbailey requested a review from Jefffrey April 16, 2026 10:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate proto Related to proto crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add quote-style parameter for CSV options

3 participants