Fix placeholder spans #1979

Changes from 1 commit
```diff
@@ -1521,6 +1521,11 @@ impl<'a> Tokenizer<'a> {
         match chars.peek() {
             Some(':') => self.consume_and_return(chars, Token::DoubleColon),
             Some('=') => self.consume_and_return(chars, Token::Assignment),
+            Some(c)
+                if self.dialect.supports_colon_placeholder() && c.is_alphabetic() =>
+            {
+                self.tokenize_colon_preceeded_placeholder(chars).map(Some)
+            }
             _ => Ok(Some(Token::Colon)),
         }
     }
```
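The dispatch above can be modeled in isolation. The following is a minimal sketch, not the crate's actual `Tokenizer` (the `Tok` enum and the boolean flag are simplified stand-ins for `Token` and the dialect method):

```rust
use std::iter::Peekable;
use std::str::Chars;

// Simplified stand-in for the crate's Token type.
#[derive(Debug, PartialEq)]
enum Tok {
    DoubleColon,
    Assignment,
    Placeholder(String),
    Colon,
}

// Dispatch on the character that follows an already-consumed ':'.
fn after_colon(chars: &mut Peekable<Chars<'_>>, supports_colon_placeholder: bool) -> Tok {
    match chars.peek() {
        Some(':') => {
            chars.next();
            Tok::DoubleColon
        }
        Some('=') => {
            chars.next();
            Tok::Assignment
        }
        Some(c) if supports_colon_placeholder && c.is_alphabetic() => {
            // Scan the identifier that follows the colon.
            let mut s = String::from(":");
            while let Some(&ch) = chars.peek() {
                if ch.is_alphanumeric() || ch == '_' {
                    s.push(ch);
                    chars.next();
                } else {
                    break;
                }
            }
            Tok::Placeholder(s)
        }
        _ => Tok::Colon,
    }
}

fn main() {
    assert_eq!(after_colon(&mut ":".chars().peekable(), true), Tok::DoubleColon);
    assert_eq!(after_colon(&mut "=".chars().peekable(), true), Tok::Assignment);
    assert_eq!(
        after_colon(&mut "name rest".chars().peekable(), true),
        Tok::Placeholder(":name".into())
    );
    // Dialects without colon placeholders still get a bare Colon token.
    assert_eq!(after_colon(&mut "name".chars().peekable(), false), Tok::Colon);
    println!("ok");
}
```

Note the guard keeps the `::` and `:=` cases ahead of the placeholder case, so existing tokens are unaffected.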
```diff
@@ -1756,6 +1761,30 @@ impl<'a> Tokenizer<'a> {
         }
     }
 
+    /// Tokenizes an identifier followed immediately after a colon,
+    /// aka named query parameter, e.g. `:name`. The next char of the
+    /// processed char stream is to be an alphabetic - panics otherwise.
+    fn tokenize_colon_preceeded_placeholder(
+        &self,
+        chars: &mut State,
+    ) -> Result<Token, TokenizerError> {
+        let mut s = String::with_capacity(16);
+        s.push(':');
+        s.push(chars.next().expect("initial character missing"));
+        while let Some(&ch) = chars.peek() {
+            if ch.is_alphanumeric()
+                || ch == '_'
+                || matches!(ch, '$' if self.dialect.supports_dollar_placeholder())
+            {
+                s.push(ch);
+                chars.next();
+            } else {
+                break;
+            }
+        }
+        Ok(Token::Placeholder(s))
+    }
+
     /// Tokenize dollar preceded value (i.e: a string/placeholder)
     fn tokenize_dollar_preceded_value(&self, chars: &mut State) -> Result<Token, TokenizerError> {
         let mut s = String::new();
```

Review comment on `tokenize_colon_preceeded_placeholder`:

**iffyio:** I think that it would likely be ideal to use existing logic like
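The new scanning loop can be exercised standalone. The following is a sketch of the same logic outside the crate; the function name and the boolean flag are local stand-ins for the method and dialect hook above, not the crate's API:

```rust
// Standalone model of the colon-placeholder scan: ':' plus an alphabetic
// first character, then alphanumerics, '_', and (when the dialect allows
// dollar placeholders) '$'.
fn scan_colon_placeholder(input: &str, supports_dollar: bool) -> Option<String> {
    let mut chars = input.chars().peekable();
    if chars.next() != Some(':') {
        return None;
    }
    // In the real tokenizer the dispatch guard guarantees an alphabetic
    // first character; here we just check it.
    let first = chars.next()?;
    if !first.is_alphabetic() {
        return None;
    }
    let mut s = String::with_capacity(16);
    s.push(':');
    s.push(first);
    while let Some(&ch) = chars.peek() {
        if ch.is_alphanumeric() || ch == '_' || (ch == '$' && supports_dollar) {
            s.push(ch);
            chars.next();
        } else {
            break;
        }
    }
    Some(s)
}

fn main() {
    assert_eq!(scan_colon_placeholder(":foo rest", false), Some(":foo".into()));
    // '$' stays inside the placeholder only when the dialect supports it.
    assert_eq!(scan_colon_placeholder(":foo$bar", true), Some(":foo$bar".into()));
    assert_eq!(scan_colon_placeholder(":foo$bar", false), Some(":foo".into()));
    println!("ok");
}
```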
```diff
@@ -2952,6 +2981,68 @@ mod tests {
         );
     }
 
+    #[test]
+    fn tokenize_colon_placeholder() {
+        #[derive(Debug)]
+        struct TestDialect(bool);
+        impl Dialect for TestDialect {
+            fn supports_colon_placeholder(&self) -> bool {
+                true
+            }
+            fn supports_dollar_placeholder(&self) -> bool {
+                self.0
+            }
+            fn is_identifier_start(&self, ch: char) -> bool {
+                ch.is_alphabetic() || ch == '_'
+            }
+            fn is_identifier_part(&self, ch: char) -> bool {
+                ch.is_alphabetic() || ch.is_ascii_digit() || ch == '_'
+            }
+        }
+
+        let sql = "SELECT :foo FROM bar";
+        let tokens = Tokenizer::new(&TestDialect(false), sql)
+            .tokenize_with_location()
+            .unwrap();
+        assert_eq!(
+            tokens.iter().map(|t| t.token.clone()).collect::<Vec<_>>(),
+            vec![
+                Token::make_keyword("SELECT"),
+                Token::Whitespace(Whitespace::Space),
+                Token::Placeholder(":foo".into()),
+                Token::Whitespace(Whitespace::Space),
+                Token::make_keyword("FROM"),
+                Token::Whitespace(Whitespace::Space),
+                Token::make_word("bar", None)
+            ]
+        );
+        assert_eq!(
+            tokens[2].span,
+            Span::new(Location::of(1, 8), Location::of(1, 12))
+        );
+
+        let sql = "SELECT :foo$bar FROM bar";
+        let tokens = Tokenizer::new(&TestDialect(true), sql)
+            .tokenize_with_location()
+            .unwrap();
+        assert_eq!(
+            tokens.iter().map(|t| t.token.clone()).collect::<Vec<_>>(),
+            vec![
+                Token::make_keyword("SELECT"),
+                Token::Whitespace(Whitespace::Space),
+                Token::Placeholder(":foo$bar".into()),
+                Token::Whitespace(Whitespace::Space),
+                Token::make_keyword("FROM"),
+                Token::Whitespace(Whitespace::Space),
+                Token::make_word("bar", None)
+            ]
+        );
+        assert_eq!(
+            tokens[2].span,
+            Span::new(Location::of(1, 8), Location::of(1, 16))
+        );
+    }
+
     #[test]
     fn tokenize_dollar_placeholder() {
         let sql = String::from("SELECT $$, $$ABC$$, $ABC$, $ABC");
```
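The span assertions in the test follow from simple column arithmetic: the placeholder starts at the 1-based column of the `:` and ends one column past its last character (exclusive end). A sketch with a simplified, single-line stand-in for `Span`/`Location`:

```rust
// Simplified single-line span: 1-based start column, exclusive end column.
// Works for ASCII-only inputs like the test SQL above.
fn placeholder_span(sql: &str, placeholder: &str) -> Option<(u64, u64)> {
    let byte_off = sql.find(placeholder)?;
    // 1-based column of the first character of the token.
    let start = byte_off as u64 + 1;
    Some((start, start + placeholder.chars().count() as u64))
}

fn main() {
    // Matches the expected spans asserted in the tests above:
    // ":foo" occupies columns 8..12, ":foo$bar" columns 8..16.
    assert_eq!(placeholder_span("SELECT :foo FROM bar", ":foo"), Some((8, 12)));
    assert_eq!(placeholder_span("SELECT :foo$bar FROM bar", ":foo$bar"), Some((8, 16)));
    println!("ok");
}
```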
**iffyio:** @xitep could you elaborate on this part? I'm not sure I followed the intent. Is my understanding correct that currently the parser indeed handles e.g. `SELECT :var`? If so, what's the requirement for the tokenizer updates?
**xitep:** hello @iffyio, yes the parser can deal with things like `:var` already, but i'm not sure this is a feature or rather just a side effect of the code pointed out above. here's an example:

note: `:<word>` (which in fact is what i'm after)

however, you'll get the same output when using `"select : var"` or even `"select : /*foobar*/ var"` as input :-/ this is not what i want.

so the intention is to make `:<word>` a first-level token, a placeholder.

does this help?
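The distinction being argued for can be shown with a toy scanner: `:<word>` becomes one token only when the word immediately follows the colon, with no intervening whitespace or comment. An illustrative sketch, not sqlparser code:

```rust
// Toy scan: return Some(placeholder) only when identifier characters
// immediately follow the first ':' in the input.
fn first_placeholder(sql: &str) -> Option<String> {
    let colon = sql.find(':')?;
    let word: String = sql[colon + 1..]
        .chars()
        .take_while(|c| c.is_alphanumeric() || *c == '_')
        .collect();
    if word.is_empty() {
        // "select : var" etc. - colon and word stay separate tokens.
        None
    } else {
        Some(format!(":{word}"))
    }
}

fn main() {
    // ":var" directly after the colon is a first-level placeholder token...
    assert_eq!(first_placeholder("select :var"), Some(":var".into()));
    // ...but whitespace or a comment in between must not produce one.
    assert_eq!(first_placeholder("select : var"), None);
    assert_eq!(first_placeholder("select : /*foobar*/ var"), None);
    println!("ok");
}
```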
**iffyio:** I think that makes sense; it seems that the handling of placeholders isn't very consistent in the parser, as some are done in the tokenizer while others later on. I think it could be a good, more general improvement if we feel that ideally placeholders are managed by the tokenizer; then we can parse and represent them consistently/accordingly.

Did a quick pass through existing support, and I can imagine something like the following representation to merge the different representations explicitly, wdyt?

It would also remove the need for this code as well:

datafusion-sqlparser-rs/src/parser/mod.rs, lines 9637 to 9646 in 15d8bfe
**xitep:** ok, i can try it out and upgrade this PR accordingly. here's one thought to be clarified before i set off with it:

this implies to me that `value` would be stripped of the placeholder demarcation character(s); e.g. for the input `":foo"`, `value` would read `Some("foo")`; for the input `"$$foo$$"` it would read `Some("foo")`; for the input `"$"`, `value` would be `None`. correct?

But then, what would `sqlparser::ast::Value::Placeholder(..)` look like? Clients surely might want to know exactly which placeholder style/kind was used; since `sqlparser` supports various formats, some clients might opt in to reject what they don't want their queries to contain.

all in all, it looks to me that in your proposal we want to end up with `value` being the original, full token as identified in the scanned input (which, naturally, is never `None` or empty), therefore:

any feedback on this thought?
**iffyio:** Yeah, that's correct that the input would be stripped into two, the kind and the value, as in your examples. `Value::Placeholder` I imagine would have the same representation, mirroring `Token::Placeholder`, since the goal becomes that placeholder parsing is handled by the tokenizer, i.e. `Value::Placeholder(Token::Placeholder)`.

I think we might not always have a variable name, hence the `Option` - sqlite for example uses an anonymous `?` in some contexts for placeholders, e.g. `WHERE city IN (?, ?, ?, ?)`. In that scenario the representation becomes `Token::Placeholder { kind: PlaceholderKind::SingleQuestionMark, value: None }`.

ah, heads up: I think a reference `&str` wouldn't be feasible here without introducing lifetimes to the Token/Value enums, which would be quite an invasive change and likely not worth it.
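The representation discussed here could look roughly like the following. This is a hypothetical sketch of the proposal (names such as `PlaceholderKind` come from the discussion and were never merged as API):

```rust
// Hypothetical merged placeholder representation from the discussion:
// `kind` records the demarcation style, `value` the stripped name (if any).
#[derive(Debug, PartialEq)]
enum PlaceholderKind {
    Colon,              // :name
    Dollar,             // $1, $name
    SingleQuestionMark, // anonymous ? (e.g. sqlite: WHERE city IN (?, ?))
}

#[derive(Debug, PartialEq)]
struct Placeholder {
    kind: PlaceholderKind,
    // None for anonymous placeholders such as a bare `?` or `$`.
    value: Option<String>,
}

// Toy classification of an already-scanned placeholder token.
fn classify(tok: &str) -> Option<Placeholder> {
    match tok.chars().next()? {
        ':' => Some(Placeholder {
            kind: PlaceholderKind::Colon,
            value: Some(tok[1..].to_string()),
        }),
        '$' => Some(Placeholder {
            kind: PlaceholderKind::Dollar,
            value: Some(tok[1..].to_string()).filter(|v| !v.is_empty()),
        }),
        '?' if tok.len() == 1 => Some(Placeholder {
            kind: PlaceholderKind::SingleQuestionMark,
            value: None,
        }),
        _ => None,
    }
}

fn main() {
    assert_eq!(
        classify(":foo"),
        Some(Placeholder { kind: PlaceholderKind::Colon, value: Some("foo".into()) })
    );
    assert_eq!(
        classify("?"),
        Some(Placeholder { kind: PlaceholderKind::SingleQuestionMark, value: None })
    );
    // A bare "$" keeps its kind but carries no name, matching the
    // Some("foo") / None stripping examples in the thread.
    assert_eq!(
        classify("$"),
        Some(Placeholder { kind: PlaceholderKind::Dollar, value: None })
    );
    println!("ok");
}
```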
**xitep:** wrt ...

i'm seeing `"$$foo$$"` being treated as a "double quoted string", not as a placeholder. if `dialect.supports_dollar_placeholder() == true`, then `$...` is considered a placeholder, but the dollars at the start and end don't have to be balanced. this makes the `DollarEnclosed` variant essentially void.
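The unbalanced-dollar behavior described can be modeled with a toy scan: once a dialect supports dollar placeholders, `$` plus following word characters is a placeholder, and a trailing `$` is consumed if present but never required, so `$ABC` and `$ABC$` both come out as placeholders. An illustrative sketch, not the crate's tokenizer:

```rust
// Toy dollar-placeholder scan: '$', then word characters, then an
// optional trailing '$' - start and end dollars need not be balanced.
fn scan_dollar_placeholder(input: &str) -> Option<String> {
    let mut chars = input.chars().peekable();
    if chars.next() != Some('$') {
        return None;
    }
    let mut s = String::from("$");
    while let Some(&ch) = chars.peek() {
        if ch.is_alphanumeric() || ch == '_' {
            s.push(ch);
            chars.next();
        } else {
            break;
        }
    }
    if chars.peek() == Some(&'$') {
        // Trailing '$' is folded in when present, but nothing enforces it.
        s.push('$');
    }
    Some(s)
}

fn main() {
    // Both forms scan as placeholders, which is why a separate
    // "DollarEnclosed" kind would carry no real distinction here.
    assert_eq!(scan_dollar_placeholder("$ABC"), Some("$ABC".into()));
    assert_eq!(scan_dollar_placeholder("$ABC$,"), Some("$ABC$".into()));
    println!("ok");
}
```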
**xitep:** alright, after a lot of struggle i came to the conclusion that the approach with a first-class placeholder is not feasible in `sqlparser`. After having the tokenizer and the model change ready, this test (now failing) convinced me that my concern cannot be resolved at the level of the tokenizer, since `sqlparser` needs to be very flexible and determine the meaning only at the level of the parser. you can have a look here at how the change we discussed would have looked. (this would be the consequential change to the parser.)

fortunately, i have found a far smaller change that satisfies my needs and will update this PR.