Add quote style and trimming to csv writier#20813
Add quote style and trimming to csv writier#20813xanderbailey wants to merge 11 commits intoapache:mainfrom
Conversation
| LOCATION 'test_files/scratch/csv_files/quote_style_always.csv' | ||
| OPTIONS ('format.has_header' 'true', 'format.quote_style' 'Never'); | ||
|
|
||
| # All values should have been quoted, but reading them back strips the quotes |
There was a problem hiding this comment.
It's hard in SLT to actually test these but I did my best...
There was a problem hiding this comment.
Could do something like this:
statement ok
CREATE EXTERNAL TABLE stored_quote_style_nonnumeric (
whole_file TEXT
) STORED AS CSV
LOCATION 'test_files/scratch/csv_files/quote_style_nonnumeric.csv'
OPTIONS ('format.has_header' 'true', 'format.delimiter' '@');
query T
select * from stored_quote_style_nonnumeric;
----
1,"hello",1.1
2,"world",2.2
3,"comma,value",3.3- Pretty much read entire file as a single column, by choosing a delimiter that doesn't appear
There was a problem hiding this comment.
Oh that's a nice idea, I'll change to that!
| pub quote_style: i32, | ||
| /// Whether to ignore leading whitespace in string values | ||
| #[prost(bytes = "vec", tag = "21")] | ||
| pub ignore_leading_whitespace: ::prost::alloc::vec::Vec<u8>, |
There was a problem hiding this comment.
Following the pattern here for other bools being Vec
| pub quote_style: i32, | ||
| /// Whether to ignore leading whitespace in string values | ||
| #[prost(bytes = "vec", tag = "21")] | ||
| pub ignore_leading_whitespace: ::prost::alloc::vec::Vec<u8>, |
There was a problem hiding this comment.
Following the pattern here for other bools being Vec
| Ok(CsvQuoteStyleProto::Always) => CsvQuoteStyle::Always, | ||
| Ok(CsvQuoteStyleProto::NonNumeric) => CsvQuoteStyle::NonNumeric, | ||
| Ok(CsvQuoteStyleProto::Never) => CsvQuoteStyle::Never, | ||
| _ => CsvQuoteStyle::Necessary, |
There was a problem hiding this comment.
We don't error on:
compression: match proto.compression {
0 => CompressionTypeVariant::GZIP,
1 => CompressionTypeVariant::BZIP2,
2 => CompressionTypeVariant::XZ,
3 => CompressionTypeVariant::ZSTD,
_ => CompressionTypeVariant::UNCOMPRESSED,
},
So made the same true here
|
@Jefffrey are you maybe able to take a look here? |
rtyler
left a comment
There was a problem hiding this comment.
I have done some integration testing of this PR since it solves a problem one of my customers was having as well. I think it's good to merge, with a squashing of course 😈
I ended up backporting it to 52.4.0 which applied cleanly as well. I notice that arrow 57.2 actually introduced QuoteStyle which means this change requires arrow-csv 57.2.0 or later
| #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Default)] | ||
| pub enum CsvQuoteStyle { | ||
| /// Quote all fields | ||
| Always, | ||
| /// Only quote fields when necessary (default) | ||
| #[default] | ||
| Necessary, | ||
| /// Quote all non-numeric fields | ||
| NonNumeric, | ||
| /// Never quote fields | ||
| Never, | ||
| } |
There was a problem hiding this comment.
For my own understanding is this duplicated from the arrow-csv crate in order to add more trait implementations? I am less familiar with the protocol buffer stuff that happens in Datafusion, but figure there must be a reason for the struct to be ruplicated
There was a problem hiding this comment.
Yeah I wasn't 100% sure what to do here but it seemed fine to duplicate?
There was a problem hiding this comment.
I believe this is in line with trend of having types here be DF versions; for example we have a DF version of parquet writer version as well:
datafusion/datafusion/common/src/parquet_config.rs
Lines 24 to 36 in cc4717a
Digging deeper into load failures I discovered that while arrow-csv and other Rust CSV parsers can handle some characters properly, some RDMS and C-based CSV parsers (older) cannot, without quoting. This relies on a backport of apache/datafusion#20813 to the 52.4.0 release in order to support this. It also requires arrow-csv 57.2.0 or later since that release is the first that `QuoteStyle` exists.
|
@alamb could I bother you for a review here please? |
| builder = builder.with_double_quote(*v) | ||
| } | ||
| let style = match value.quote_style { | ||
| CsvQuoteStyle::Always => QuoteStyle::Always, |
There was a problem hiding this comment.
nit: pull this into a From impl where CsvQuoteStyle is defined
| match s { | ||
| "Always" | "ALWAYS" | "always" => Ok(Self::Always), | ||
| "Necessary" | "NECESSARY" | "necessary" | "" => Ok(Self::Necessary), | ||
| "NonNumeric" | "NON_NUMERIC" | "nonnumeric" => Ok(Self::NonNumeric), | ||
| "Never" | "NEVER" | "never" => Ok(Self::Never), | ||
| _ => Err(DataFusionError::NotImplemented(format!( | ||
| "Unsupported CSV quote style {s}" | ||
| ))), | ||
| } |
There was a problem hiding this comment.
| match s { | |
| "Always" | "ALWAYS" | "always" => Ok(Self::Always), | |
| "Necessary" | "NECESSARY" | "necessary" | "" => Ok(Self::Necessary), | |
| "NonNumeric" | "NON_NUMERIC" | "nonnumeric" => Ok(Self::NonNumeric), | |
| "Never" | "NEVER" | "never" => Ok(Self::Never), | |
| _ => Err(DataFusionError::NotImplemented(format!( | |
| "Unsupported CSV quote style {s}" | |
| ))), | |
| } | |
| match s.to_lowercase().as_str() { | |
| "always" => Ok(Self::Always), | |
| "necessary" | "" => Ok(Self::Necessary), | |
| "non_numeric" | "nonnumeric" => Ok(Self::NonNumeric), | |
| "never" => Ok(Self::Never), | |
| _ => Err(DataFusionError::NotImplemented(format!( | |
| "Unsupported CSV quote style {s}" | |
| ))), | |
| } |
I think it might be a better idea to be permissive like this; also not sure about that empty string parsing as the default 🤔
There was a problem hiding this comment.
I'll quickly check what spark does here
There was a problem hiding this comment.
Spark just has quoteAll which maps to always so I have removed the empty string.
| query IT | ||
| select * from stored_trim_leading order by id; | ||
| ---- | ||
| 1 hello |
There was a problem hiding this comment.
Do we expect to see trailing whitespace here? Or is there an issue with the runner on how it treats trailing whitespaces 🤔
There was a problem hiding this comment.
Yeah I had to look at the CSV that actually got written...
There was a problem hiding this comment.
| #[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, Default)] | ||
| pub enum CsvQuoteStyle { | ||
| /// Quote all fields | ||
| Always, | ||
| /// Only quote fields when necessary (default) | ||
| #[default] | ||
| Necessary, | ||
| /// Quote all non-numeric fields | ||
| NonNumeric, | ||
| /// Never quote fields | ||
| Never, | ||
| } |
There was a problem hiding this comment.
I believe this is in line with trend of having types here be DF versions; for example we have a DF version of parquet writer version as well:
datafusion/datafusion/common/src/parquet_config.rs
Lines 24 to 36 in cc4717a
| pub ignore_leading_whitespace: Option<bool>, default = None | ||
| /// Whether to ignore trailing whitespace in string values when writing CSV. | ||
| pub ignore_trailing_whitespace: Option<bool>, default = None | ||
| /// Specifies whether newlines in (quoted) values are supported. |
There was a problem hiding this comment.
I'm wondering why we have this trend of doing Option<bool> instead of just bool 🤔
What is the default in case of None, and how is that different from just true/false (whichever it is)?
There was a problem hiding this comment.
I was also curious about this but didn't want to break the pattern...
There was a problem hiding this comment.
Is it some ser-deser quirk? I'll see if I can figure this out.
There was a problem hiding this comment.
No I can't find a good reason for this... Would you like me to change or keep the pattern?
There was a problem hiding this comment.
Perhaps its to do with hierarchy of configs and/or sql parsing 🤔
@alamb do you happen to know a reason for having Option<bool> instead of plain bool for cases where it'll end up being true or false anyway? (i.e. None doesn't represent a third state, but eventually maps to either true or false)
Which issue does this PR close?
Related arrow-rs PRs apache/arrow-rs#8960 and apache/arrow-rs#9004
Rationale for this change
The CSV writer was missing support for
quote_style,ignore_leading_whitespace, andignore_trailing_whitespaceoptions that are available on the underlying arrowWriterBuilder. This meant users couldn't control quoting behaviour or whitespace trimming when writing CSV files.What changes are included in this PR?
Adds three new CSV writer options wired through the full stack:
quote_style— controls when fields are quoted (Always,Necessary,NonNumeric,Never). Modelled as a protobuf enum (CsvQuoteStyle).ignore_leading_whitespace— trims leading whitespace from string values on write.ignore_trailing_whitespace— trims trailing whitespace from string values on write.Are these changes tested?
Yes — sqllogictest coverage added in
csv_files.sltAre there any user-facing changes?
Three new
format.*options available in COPY TO and CREATE EXTERNAL TABLE for CSV:format.quote_style(string:Always,Necessary,NonNumeric,Never)format.ignore_leading_whitespace(boolean)format.ignore_trailing_whitespace(boolean)