Merge pull request #230 from dmaresma/feature/snowflake_schema_comment
fix databrics typo with a k
xnuinside authored Jan 13, 2024
2 parents 7239317 + c726143 commit 8536c65
Showing 17 changed files with 605 additions and 50,317 deletions.
11 changes: 9 additions & 2 deletions CHANGELOG.txt
@@ -1,3 +1,10 @@
**v1.0.2**
### Minor Fixes
1. Fixed the Databricks dialect name typo ('databrics' -> 'databricks').
2. Improved support for equals symbols within COMMENT statements.
3. Snowflake TAG is now available on SCHEMA definitions.
4. Moved the parser's inline regexps into precompiled patterns.

**v1.0.1**
### Minor Fixes
1. When using `normalize_names=True` do not remove `[]` from types like `decimal(21)[]`.
@@ -25,7 +32,7 @@ if you choose the correct output_mode.

### New Dialects support
1. Added new dialects as possible output_modes:
- Databrics SQL like 'databricks',
- Databricks SQL like 'databricks',
- Vertica as 'vertica',
- SqliteFields as 'sqlite',
- PostgreSQL as 'postgres'
@@ -34,7 +41,7 @@ Full list of supported dialects you can find in dict - `supported_dialects`:

`from simple_ddl_parser import supported_dialects`

Currently supported: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databrics', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']
Currently supported: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databricks', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']

If you don't see the dialect you want to use, open an issue with a description and links to the database docs, or use one of the existing dialects.
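
A quick sanity check of the list above, as a minimal sketch (assuming the package is installed from PyPI):

```python
from simple_ddl_parser import supported_dialects

# After this fix the correctly spelled dialect name is registered.
print("databricks" in supported_dialects)  # expected: True
```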

8 changes: 4 additions & 4 deletions README.md
@@ -118,7 +118,7 @@ And you will get output with additional keys 'stored_as', 'location', 'external'

If you run the parser from the command line, add the flag '-o=hql' or '--output-mode=hql' to get the same result.

Possible output_modes: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databrics', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']
Possible output_modes: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databricks', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']

### From python code

@@ -216,7 +216,7 @@ Output will be:
### More details

`DDLParser(ddl).run()`
The .run() method accepts several arguments that change the output. As shown above, the `output_mode` argument lets you set a dialect and get additional output fields for that dialect, for example 'hql'. Possible output_modes: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databrics', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']
The .run() method accepts several arguments that change the output. As shown above, the `output_mode` argument lets you set a dialect and get additional output fields for that dialect, for example 'hql'. Possible output_modes: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databricks', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']

Also, the .run() method takes a `group_by_type` argument (default: False). By default the parser output is a list of dicts, where each dict is one entity from the DDL (table, sequence, type, etc.), and to identify an entity you have to inspect its dict: if 'table_name' is present it is a table, if 'type_name' it is a type, and so on.
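
A minimal sketch combining both arguments (the DDL string and the exact output shape are illustrative, not verbatim library output):

```python
from simple_ddl_parser import DDLParser

ddl = """
CREATE EXTERNAL TABLE users (
    id int,
    name string
) STORED AS PARQUET LOCATION 's3://bucket/users/';
"""

# output_mode adds dialect-specific keys such as 'stored_as' and 'location';
# group_by_type groups entities under keys like 'tables' instead of
# returning one flat list of dicts.
result = DDLParser(ddl).run(output_mode="hql", group_by_type=True)
tables = result["tables"]  # assumption: grouped output exposes a 'tables' key
```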

@@ -513,7 +513,7 @@ if you choose the correct output_mode.

### New Dialects support
1. Added new dialects as possible output_modes:
- Databrics SQL like 'databricks',
- Databricks SQL like 'databricks',
- Vertica as 'vertica',
- SqliteFields as 'sqlite',
- PostgreSQL as 'postgres'
@@ -522,7 +522,7 @@ Full list of supported dialects you can find in dict - `supported_dialects`:

`from simple_ddl_parser import supported_dialects`

Currently supported: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databrics', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']
Currently supported: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databricks', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']

If you don't see the dialect you want to use, open an issue with a description and links to the database docs, or use one of the existing dialects.

8 changes: 4 additions & 4 deletions docs/README.rst
@@ -140,7 +140,7 @@ And you will get output with additional keys 'stored_as', 'location', 'external'
If you run the parser from the command line, add the flag '-o=hql' or '--output-mode=hql' to get the same result.

Possible output_modes: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databrics', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']
Possible output_modes: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databricks', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']

From python code
^^^^^^^^^^^^^^^^
@@ -237,7 +237,7 @@ More details
^^^^^^^^^^^^

``DDLParser(ddl).run()``
The .run() method accepts several arguments that change the output. As shown above, the ``output_mode`` argument lets you set a dialect and get additional output fields for that dialect, for example 'hql'. Possible output_modes: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databrics', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']
The .run() method accepts several arguments that change the output. As shown above, the ``output_mode`` argument lets you set a dialect and get additional output fields for that dialect, for example 'hql'. Possible output_modes: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databricks', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']

Also, the .run() method takes a ``group_by_type`` argument (default: False). By default the parser output is a list of dicts, where each dict is one entity from the DDL (table, sequence, type, etc.), and to identify an entity you have to inspect its dict: if 'table_name' is present it is a table, if 'type_name' it is a type, and so on.

@@ -590,7 +590,7 @@ New Dialects support
#. Added new dialects as possible output_modes:


* Databrics SQL like 'databricks',
* Databricks SQL like 'databricks',
* Vertica as 'vertica',
* SqliteFields as 'sqlite',
* PostgreSQL as 'postgres'
@@ -599,7 +599,7 @@ Full list of supported dialects you can find in dict - ``supported_dialects``\ :

``from simple_ddl_parser import supported_dialects``

Currently supported: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databrics', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']
Currently supported: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databricks', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']

If you don't see the dialect you want to use, open an issue with a description and links to the database docs, or use one of the existing dialects.

2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "simple-ddl-parser"
version = "1.0.1"
version = "1.0.2"
description = "Simple DDL Parser to parse SQL & dialects like HQL, TSQL (MSSQL), Oracle, AWS Redshift, Snowflake, MySQL, PostgreSQL, etc ddl files to json/python dict with full information about columns: types, defaults, primary keys, etc.; sequences, alters, custom types & other entities from ddl."
authors = ["Iuliia Volkova <[email protected]>"]
license = "MIT"
29 changes: 19 additions & 10 deletions simple_ddl_parser/dialects/snowflake.py
@@ -81,14 +81,9 @@ def p_expression_change_tracking(self, p: List) -> None:
p_list = remove_par(list(p))
p[0]["change_tracking"] = p_list[-1]

def p_table_comment(self, p: List) -> None:
"""expr : expr option_comment"""
p[0] = p[1]
if p[2]:
p[0].update(p[2])

def p_table_tag(self, p: List) -> None:
"""expr : expr option_with_tag"""
def p_comment_equals(self, p: List) -> None:
"""expr : expr option_comment
"""
p[0] = p[1]
if p[2]:
p[0].update(p[2])
@@ -98,10 +93,23 @@ def p_option_comment(self, p: List) -> None:
| ID DQ_STRING
| COMMENT ID STRING
| COMMENT ID DQ_STRING
| option_comment_equals
"""
p_list = remove_par(list(p))
if "comment" in p[1].lower():
p[0] = {"comment": p_list[-1]}
p[0] = {"comment": p_list[-1]}

def p_option_comment_equals(self, p: List) -> None:
"""option_comment_equals : STRING
| option_comment_equals DQ_STRING
"""
p_list = remove_par(list(p))
p[0] = str(p_list[-1])

def p_tag(self, p: List) -> None:
"""expr : expr option_with_tag"""
p[0] = p[1]
if p[2]:
p[0].update(p[2])

def p_tag_equals(self, p: List) -> None:
"""tag_equals : id id id_or_string
@@ -136,6 +144,7 @@ def p_option_with_tag(self, p: List) -> None:
| TAG LP id DOT id DOT id RP
| TAG LP multiple_tag_equals RP
| WITH TAG LP id RP
| WITH TAG LP id DOT id DOT id RP
| WITH TAG LP multiple_tag_equals RP
"""
p_list = remove_par(list(p))
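
To exercise the new grammar branches, a hedged example of the Snowflake DDL this PR targets: a schema-level `COMMENT =` plus a fully qualified tag matching the new `WITH TAG LP id DOT id DOT id RP` rule (names and output keys are illustrative):

```python
from simple_ddl_parser import DDLParser

ddl = """
CREATE SCHEMA reporting COMMENT = 'analytics schemas'
    WITH TAG (governance.tags.cost_center);
"""

# output_mode='snowflake' enables the dialect rules added above.
result = DDLParser(ddl).run(output_mode="snowflake", group_by_type=True)
```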
10 changes: 4 additions & 6 deletions simple_ddl_parser/dialects/sql.py
@@ -507,7 +507,6 @@ def set_auth_property_in_schema(self, p: List, p_list: List) -> None:
def p_c_schema(self, p: List) -> None:
"""c_schema : CREATE SCHEMA
| CREATE ID SCHEMA"""

if len(p) == 4:
p[0] = {"remote": True}

@@ -516,21 +515,17 @@ def p_create_schema(self, p: List) -> None:
| c_schema id id id
| c_schema id
| c_schema id DOT id
| c_schema id option_comment
| c_schema id DOT id option_comment
| c_schema IF NOT EXISTS id
| c_schema IF NOT EXISTS id DOT id
| create_schema id id id
| create_schema id id STRING
| create_schema options
"""
p_list = list(p)

p[0] = {}
auth_index = None

if "comment" in p_list[-1]:
p[0].update(p_list[-1])
del p_list[-1]

self.add_if_not_exists(p[0], p_list)
@@ -547,7 +542,10 @@ def p_create_schema(self, p: List) -> None:
if schema_name is None:
schema_name = p_list[auth_index + 1]
else:
schema_name = p_list[-1]
if "=" in p_list:
schema_name = p_list[2]
else:
schema_name = p_list[-1]
p[0]["schema_name"] = schema_name.replace("`", "")

p[0] = self.set_project_in_schema(p[0], p_list, auth_index)
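
The new `"=" in p_list` branch above keeps the schema name from being mistaken for the comment text when the statement carries a `COMMENT = '...'` option; a minimal sketch (the expected output is approximate):

```python
from simple_ddl_parser import DDLParser

ddl = "CREATE SCHEMA new_schema COMMENT = 'schema for staging data';"

result = DDLParser(ddl).run(group_by_type=True)
# Approximate expectation: the schema entry carries both the name and
# the comment, e.g. {'schemas': [{'schema_name': 'new_schema',
#                                 'comment': "'schema for staging data'"}], ...}
```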
20 changes: 10 additions & 10 deletions simple_ddl_parser/output/dialects.py
@@ -138,8 +138,8 @@ class MSSQL(Dialect):


@dataclass
@dialect(name="databrics")
class Databrics(Dialect):
@dialect(name="databricks")
class Databricks(Dialect):
property_key: Optional[str] = field(default=None)


@@ -261,32 +261,32 @@ class CommonDialectsFieldsMixin(Dialect):
)
stored_as: Optional[str] = field(
default=None,
metadata={"output_modes": add_dialects([SparkSQL, HQL, Databrics, Redshift])},
metadata={"output_modes": add_dialects([SparkSQL, HQL, Databricks, Redshift])},
)

row_format: Optional[dict] = field(
default=None,
metadata={"output_modes": add_dialects([SparkSQL, HQL, Databrics, Redshift])},
metadata={"output_modes": add_dialects([SparkSQL, HQL, Databricks, Redshift])},
)
location: Optional[str] = field(
default=None,
metadata={
"output_modes": add_dialects([HQL, SparkSQL, Snowflake, Databrics]),
"output_modes": add_dialects([HQL, SparkSQL, Snowflake, Databricks]),
"exclude_if_not_provided": True,
},
)
fields_terminated_by: Optional[str] = field(
default=None,
metadata={"output_modes": add_dialects([HQL, Databrics])},
metadata={"output_modes": add_dialects([HQL, Databricks])},
)
lines_terminated_by: Optional[str] = field(
default=None, metadata={"output_modes": add_dialects([HQL, Databrics])}
default=None, metadata={"output_modes": add_dialects([HQL, Databricks])}
)
map_keys_terminated_by: Optional[str] = field(
default=None, metadata={"output_modes": add_dialects([HQL, Databrics])}
default=None, metadata={"output_modes": add_dialects([HQL, Databricks])}
)
collection_items_terminated_by: Optional[str] = field(
default=None, metadata={"output_modes": add_dialects([HQL, Databrics])}
default=None, metadata={"output_modes": add_dialects([HQL, Databricks])}
)
clustered_by: Optional[list] = field(
default=None,
@@ -305,7 +305,7 @@ class CommonDialectsFieldsMixin(Dialect):
transient: Optional[bool] = field(
default=False,
metadata={
"output_modes": add_dialects([HQL, Databrics]),
"output_modes": add_dialects([HQL, Databricks]),
"exclude_if_not_provided": True,
},
)
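
With the dataclass renamed, the registered dialect name is now spelled correctly, so `output_mode="databricks"` is the value to pass; a short sketch (the DDL is illustrative):

```python
from simple_ddl_parser import DDLParser

ddl = "CREATE TABLE events (id int) LOCATION 's3://bucket/events/';"

# 'databricks' (with the k) matches the @dialect registration above;
# the old 'databrics' spelling no longer names a registered dialect.
result = DDLParser(ddl).run(output_mode="databricks")
```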
18 changes: 11 additions & 7 deletions simple_ddl_parser/parser.py
@@ -97,6 +97,12 @@ def __init__(
self.block_comments = []
self.comments = []

self.comma_only_str = re.compile(r"((\')|(' ))+(,)((\')|( '))+\B")
self.equal_without_space = re.compile(r"(\b)=")
self.in_comment = re.compile(r"((\")|(\'))+(.)*(--)+(.)*((\")|(\'))+")
self.set_statement = re.compile(r"SET ")
self.skip_regex = re.compile(r"^(GO|USE|INSERT)\b")

def catch_comment_or_process_line(self, code_line: str) -> str:
if self.multi_line_comment:
self.comments.append(self.line)
@@ -113,8 +119,8 @@ def catch_comment_or_process_line(self, code_line: str) -> str:

def pre_process_line(self) -> Tuple[str, List]:
code_line = ""
comma_only_str = r"((\')|(' ))+(,)((\')|( '))+\B"
self.line = re.sub(comma_only_str, "_ddl_parser_comma_only_str", self.line)
self.line = self.comma_only_str.sub("_ddl_parser_comma_only_str", self.line)
self.line = self.equal_without_space.sub(" = ", self.line)
code_line = self.catch_comment_or_process_line(code_line)
if self.line.startswith(OP_COM) and CL_COM not in self.line:
self.multi_line_comment = True
@@ -123,7 +129,7 @@ def pre_process_line(self) -> Tuple[str, List]:
self.line = code_line

def process_in_comment(self, line: str) -> str:
if re.search(r"((\")|(\'))+(.)*(--)+(.)*((\")|(\'))+", line):
if self.in_comment.search(line):
code_line = line
else:
splitted_line = line.split(IN_COM)
@@ -200,7 +206,7 @@ def process_set(self) -> None:
self.tables.append({"name": name, "value": value})

def parse_set_statement(self):
if re.match(r"SET ", self.line.upper()):
if self.set_statement.match(self.line.upper()):
self.set_was_in_line = True
if not self.set_line:
self.set_line = self.line
@@ -224,11 +230,9 @@ def check_new_statement_start(self, line: str) -> bool:
return self.new_statement

def check_line_on_skip_words(self) -> bool:
skip_regex = r"^(GO|USE|INSERT)\b"

self.skip = False

if re.match(skip_regex, self.line.upper()):
if self.skip_regex.match(self.line.upper()):
self.skip = True
return self.skip

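
The change above compiles each regexp once in `__init__` and reuses the compiled pattern for every line, instead of rebuilding it per call; a self-contained sketch of the same idea (class and method names here are illustrative, not the parser's real API):

```python
import re


class LinePreprocessor:
    def __init__(self) -> None:
        # Compile once; the compiled pattern is reused for every line,
        # skipping the pattern-cache lookup that module-level re.sub does.
        self.comma_only_str = re.compile(r"((\')|(' ))+(,)((\')|( '))+\B")
        self.equal_without_space = re.compile(r"(\b)=")

    def pre_process_line(self, line: str) -> str:
        line = self.comma_only_str.sub("_ddl_parser_comma_only_str", line)
        return self.equal_without_space.sub(" = ", line)


print(LinePreprocessor().pre_process_line("COMMENT='x'"))  # COMMENT = 'x'
```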