Update comments / docs for Spanned (apache#1549)

Co-authored-by: Ifeanyi Ubah <[email protected]>
validio-io · Nov 30, 2024 · 4ab3ab9 · 4ab3ab9
1 parent f4f112d
commit 4ab3ab9
Show file tree

Hide file tree

Showing 8 changed files with 306 additions and 91 deletions.
diff --git a/README.md b/README.md
@@ -100,13 +100,18 @@ similar semantics are represented with the same AST. We welcome PRs to fix such
 issues and distinguish different syntaxes in the AST.
 
 
-## WIP: Extracting source locations from AST nodes
+## Source Locations (Work in Progress)
 
-This crate allows recovering source locations from AST nodes via the [Spanned](https://docs.rs/sqlparser/latest/sqlparser/ast/trait.Spanned.html) trait, which can be used for advanced diagnostics tooling. Note that this feature is a work in progress and many nodes report missing or inaccurate spans. Please see [this document](./docs/source_spans.md#source-span-contributing-guidelines) for information on how to contribute missing improvements.
+This crate allows recovering source locations from AST nodes via the [Spanned]
+trait, which can be used for advanced diagnostics tooling. Note that this
+feature is a work in progress and many nodes report missing or inaccurate spans.
+Please see [this ticket] for information on how to contribute missing
+improvements.
 
-```rust
-use sqlparser::ast::Spanned;
+[Spanned]: https://docs.rs/sqlparser/latest/sqlparser/ast/trait.Spanned.html
+[this ticket]: https://github.com/apache/datafusion-sqlparser-rs/issues/1548
 
+```rust
 // Parse SQL
 let ast = Parser::parse_sql(&GenericDialect, "SELECT A FROM B").unwrap();
 
@@ -123,9 +128,9 @@ SQL was first standardized in 1987, and revisions of the standard have been
 published regularly since. Most revisions have added significant new features to
 the language, and as a result no database claims to support the full breadth of
 features. This parser currently supports most of the SQL-92 syntax, plus some
-syntax from newer versions that have been explicitly requested, plus some MSSQL,
-PostgreSQL, and other dialect-specific syntax. Whenever possible, the [online
-SQL:2016 grammar][sql-2016-grammar] is used to guide what syntax to accept.
+syntax from newer versions that have been explicitly requested, plus various
+other dialect-specific syntax. Whenever possible, the [online SQL:2016
+grammar][sql-2016-grammar] is used to guide what syntax to accept.
 
 Unfortunately, stating anything more specific about compliance is difficult.
 There is no publicly available test suite that can assess compliance

diff --git a/docs/source_spans.md b/docs/source_spans.md
diff --git a/src/ast/helpers/attached_token.rs b/src/ast/helpers/attached_token.rs
@@ -19,25 +19,73 @@ use core::cmp::{Eq, Ord, Ordering, PartialEq, PartialOrd};
 use core::fmt::{self, Debug, Formatter};
 use core::hash::{Hash, Hasher};
 
-use crate::tokenizer::{Token, TokenWithSpan};
+use crate::tokenizer::TokenWithSpan;
 
 #[cfg(feature = "serde")]
 use serde::{Deserialize, Serialize};
 
 #[cfg(feature = "visitor")]
 use sqlparser_derive::{Visit, VisitMut};
 
-/// A wrapper type for attaching tokens to AST nodes that should be ignored in comparisons and hashing.
-/// This should be used when a token is not relevant for semantics, but is still needed for
-/// accurate source location tracking.
+/// A wrapper over [`TokenWithSpan`]s that ignores the token and source
+/// location in comparisons and hashing.
+///
+/// This type is used when the token and location is not relevant for semantics,
+/// but is still needed for accurate source location tracking, for example, in
+/// the nodes in the [ast](crate::ast) module.
+///
+/// Note: **All** `AttachedTokens` are equal.
+///
+/// # Examples
+///
+/// Same token, different location are equal
+/// ```
+/// # use sqlparser::ast::helpers::attached_token::AttachedToken;
+/// # use sqlparser::tokenizer::{Location, Span, Token, TokenWithLocation};
+/// // commas @ line 1, column 10
+/// let tok1 = TokenWithLocation::new(
+///   Token::Comma,
+///   Span::new(Location::new(1, 10), Location::new(1, 11)),
+/// );
+/// // commas @ line 2, column 20
+/// let tok2 = TokenWithLocation::new(
+///   Token::Comma,
+///   Span::new(Location::new(2, 20), Location::new(2, 21)),
+/// );
+///
+/// assert_ne!(tok1, tok2); // token with locations are *not* equal
+/// assert_eq!(AttachedToken(tok1), AttachedToken(tok2)); // attached tokens are
+/// ```
+///
+/// Different token, different location are equal 🤯
+///
+/// ```
+/// # use sqlparser::ast::helpers::attached_token::AttachedToken;
+/// # use sqlparser::tokenizer::{Location, Span, Token, TokenWithLocation};
+/// // commas @ line 1, column 10
+/// let tok1 = TokenWithLocation::new(
+///   Token::Comma,
+///   Span::new(Location::new(1, 10), Location::new(1, 11)),
+/// );
+/// // period @ line 2, column 20
+/// let tok2 = TokenWithLocation::new(
+///  Token::Period,
+///   Span::new(Location::new(2, 10), Location::new(2, 21)),
+/// );
+///
+/// assert_ne!(tok1, tok2); // token with locations are *not* equal
+/// assert_eq!(AttachedToken(tok1), AttachedToken(tok2)); // attached tokens are
+/// ```
+/// // period @ line 2, column 20
 #[derive(Clone)]
 #[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
 #[cfg_attr(feature = "visitor", derive(Visit, VisitMut))]
 pub struct AttachedToken(pub TokenWithSpan);
 
 impl AttachedToken {
+    /// Return a new Empty AttachedToken
     pub fn empty() -> Self {
-        AttachedToken(TokenWithSpan::wrap(Token::EOF))
+        AttachedToken(TokenWithSpan::new_eof())
     }
 }
 
@@ -80,3 +128,9 @@ impl From<TokenWithSpan> for AttachedToken {
         AttachedToken(value)
     }
 }
+
+impl From<AttachedToken> for TokenWithSpan {
+    fn from(value: AttachedToken) -> Self {
+        value.0
+    }
+}
diff --git a/src/ast/mod.rs b/src/ast/mod.rs
@@ -596,9 +596,21 @@ pub enum CeilFloorKind {
 
 /// An SQL expression of any type.
 ///
+/// # Semantics / Type Checking
+///
 /// The parser does not distinguish between expressions of different types
-/// (e.g. boolean vs string), so the caller must handle expressions of
-/// inappropriate type, like `WHERE 1` or `SELECT 1=1`, as necessary.
+/// (e.g. boolean vs string). The caller is responsible for detecting and
+/// validating types as necessary (for example  `WHERE 1` vs `SELECT 1=1`)
+/// See the [README.md] for more details.
+///
+/// [README.md]: https://github.com/apache/datafusion-sqlparser-rs/blob/main/README.md#syntax-vs-semantics
+///
+/// # Equality and Hashing Does not Include Source Locations
+///
+/// The `Expr` type implements `PartialEq` and `Eq` based on the semantic value
+/// of the expression (not bitwise comparison). This means that `Expr` instances
+/// that are semantically equivalent but have different spans (locations in the
+/// source tree) will compare as equal.
 #[derive(Debug, Clone, PartialEq, PartialOrd, Eq, Ord, Hash)]
 #[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
 #[cfg_attr(

diff --git a/src/ast/query.rs b/src/ast/query.rs
@@ -282,6 +282,7 @@ impl fmt::Display for Table {
 pub struct Select {
     /// Token for the `SELECT` keyword
     pub select_token: AttachedToken,
+    /// `SELECT [DISTINCT] ...`
     pub distinct: Option<Distinct>,
     /// MSSQL syntax: `TOP (<N>) [ PERCENT ] [ WITH TIES ]`
     pub top: Option<Top>,
@@ -511,7 +512,7 @@ impl fmt::Display for NamedWindowDefinition {
 #[cfg_attr(feature = "serde", derive(Serialize, Deserialize))]
 #[cfg_attr(feature = "visitor", derive(Visit, VisitMut))]
 pub struct With {
-    // Token for the "WITH" keyword
+    /// Token for the "WITH" keyword
     pub with_token: AttachedToken,
     pub recursive: bool,
     pub cte_tables: Vec<Cte>,
@@ -564,7 +565,7 @@ pub struct Cte {
     pub query: Box<Query>,
     pub from: Option<Ident>,
     pub materialized: Option<CteAsMaterialized>,
-    // Token for the closing parenthesis
+    /// Token for the closing parenthesis
     pub closing_paren_token: AttachedToken,
 }
 

diff --git a/src/ast/spans.rs b/src/ast/spans.rs
@@ -21,21 +21,51 @@ use super::{
 
 /// Given an iterator of spans, return the [Span::union] of all spans.
 fn union_spans<I: Iterator<Item = Span>>(iter: I) -> Span {
-    iter.reduce(|acc, item| acc.union(&item))
-        .unwrap_or(Span::empty())
+    Span::union_iter(iter)
 }
 
-/// A trait for AST nodes that have a source span for use in diagnostics.
+/// Trait for AST nodes that have a source location information.
 ///
-/// Source spans are not guaranteed to be entirely accurate. They may
-/// be missing keywords or other tokens. Some nodes may not have a computable
-/// span at all, in which case they return [`Span::empty()`].
+/// # Notes:
+///
+/// Source [`Span`] are not yet complete. They may be missing:
+///
+/// 1. keywords or other tokens
+/// 2. span information entirely, in which case they return [`Span::empty()`].
+///
+/// Note Some impl blocks (rendered below) are annotated with which nodes are
+/// missing spans. See [this ticket] for additional information and status.
+///
+/// [this ticket]: https://github.com/apache/datafusion-sqlparser-rs/issues/1548
+///
+/// # Example
+/// ```
+/// # use sqlparser::parser::{Parser, ParserError};
+/// # use sqlparser::ast::Spanned;
+/// # use sqlparser::dialect::GenericDialect;
+/// # use sqlparser::tokenizer::Location;
+/// # fn main() -> Result<(), ParserError> {
+/// let dialect = GenericDialect {};
+/// let sql = r#"SELECT *
+///   FROM table_1"#;
+/// let statements = Parser::new(&dialect)
+///   .try_with_sql(sql)?
+///   .parse_statements()?;
+/// // Get the span of the first statement (SELECT)
+/// let span = statements[0].span();
+/// // statement starts at line 1, column 1 (1 based, not 0 based)
+/// assert_eq!(span.start, Location::new(1, 1));
+/// // statement ends on line 2, column 15
+/// assert_eq!(span.end, Location::new(2, 15));
+/// # Ok(())
+/// # }
+/// ```
 ///
-/// Some impl blocks may contain doc comments with information
-/// on which nodes are missing spans.
 pub trait Spanned {
-    /// Compute the source span for this AST node, by recursively
-    /// combining the spans of its children.
+    /// Return the [`Span`] (the minimum and maximum [`Location`]) for this AST
+    /// node, by recursively combining the spans of its children.
+    ///
+    /// [`Location`]: crate::tokenizer::Location
     fn span(&self) -> Span;
 }
 

diff --git a/src/lib.rs b/src/lib.rs
@@ -25,6 +25,9 @@
 //! 1. [`Parser::parse_sql`] and [`Parser::new`] for the Parsing API
 //! 2. [`ast`] for the AST structure
 //! 3. [`Dialect`] for supported SQL dialects
+//! 4. [`Spanned`] for source text locations (see "Source Spans" below for details)
+//!
+//! [`Spanned`]: ast::Spanned
 //!
 //! # Example parsing SQL text
 //!
@@ -61,13 +64,67 @@
 //! // The original SQL text can be generated from the AST
 //! assert_eq!(ast[0].to_string(), sql);
 //! ```
-//!
 //! [sqlparser crates.io page]: https://crates.io/crates/sqlparser
 //! [`Parser::parse_sql`]: crate::parser::Parser::parse_sql
 //! [`Parser::new`]: crate::parser::Parser::new
 //! [`AST`]: crate::ast
 //! [`ast`]: crate::ast
 //! [`Dialect`]: crate::dialect::Dialect
+//!
+//! # Source Spans
+//!
+//! Starting with version  `0.53.0` sqlparser introduced source spans to the
+//! AST. This feature provides source information for syntax errors, enabling
+//! better error messages. See [issue #1548] for more information and the
+//! [`Spanned`] trait to access the spans.
+//!
+//! [issue #1548]: https://github.com/apache/datafusion-sqlparser-rs/issues/1548
+//! [`Spanned`]: ast::Spanned
+//!
+//! ## Migration Guide
+//!
+//! For the next few releases, we will be incrementally adding source spans to the
+//! AST nodes, trying to minimize the impact on existing users. Some breaking
+//! changes are inevitable, and the following is a summary of the changes:
+//!
+//! #### New fields for spans (must be added to any existing pattern matches)
+//!
+//! The primary change is that new fields will be added to AST nodes to store the source `Span` or `TokenWithLocation`.
+//!
+//! This will require
+//! 1. Adding new fields to existing pattern matches.
+//! 2. Filling in the proper span information when constructing AST nodes.
+//!
+//! For example, since `Ident` now stores a `Span`, to construct an `Ident` you
+//! must provide now provide one:
+//!
+//! Previously:
+//! ```text
+//! # use sqlparser::ast::Ident;
+//! Ident {
+//!     value: "name".into(),
+//!     quote_style: None,
+//! }
+//! ```
+//! Now
+//! ```rust
+//! # use sqlparser::ast::Ident;
+//! # use sqlparser::tokenizer::Span;
+//! Ident {
+//!     value: "name".into(),
+//!     quote_style: None,
+//!     span: Span::empty(),
+//! };
+//! ```
+//!
+//! Similarly, when pattern matching on `Ident`, you must now account for the
+//! `span` field.
+//!
+//! #### Misc.
+//! - [`TokenWithLocation`] stores a full `Span`, rather than just a source location.
+//!   Users relying on `token.location` should use `token.location.start` instead.
+//!
+//![`TokenWithLocation`]: tokenizer::TokenWithLocation
 
 #![cfg_attr(not(feature = "std"), no_std)]
 #![allow(clippy::upper_case_acronyms)]