Merge pull request #230 from dmaresma/feature/snowflake_schema_comment
fix databrics typo with a k
xnuinside authored Jan 13, 2024
2 parents 7239317 + c726143 commit 8536c65
Showing 17 changed files with 605 additions and 50,317 deletions.
11 changes: 9 additions & 2 deletions CHANGELOG.txt
@@ -1,3 +1,10 @@
**v1.0.2**
### Minor Fixes
1. Fixed the Databricks dialect name typo ('databrics' -> 'databricks').
2. Improved support for equals symbols within COMMENT statements.
3. Snowflake TAG is now available on SCHEMA definitions.
4. Moved the parser's inline regexps into precompiled patterns.

**v1.0.1**
### Minor Fixes
1. When using `normalize_names=True` do not remove `[]` from types like `decimal(21)[]`.
@@ -25,7 +32,7 @@ if you choose the correct output_mode.

### New Dialects support
1. Added new dialects as possible output_modes:
- Databrics SQL like 'databricks',
- Databricks SQL like 'databricks',
- Vertica as 'vertica',
- SqliteFields as 'sqlite',
- PostgreSQL as 'postgres'
@@ -34,7 +41,7 @@ Full list of supported dialects you can find in dict - `supported_dialects`:

`from simple_ddl_parser import supported_dialects`

Currently supported: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databrics', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']
Currently supported: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databricks', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']

If you don't see the dialect you want to use, open an issue with a description and links to the database docs, or use one of the existing dialects.
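
A quick sanity check of the list above, as a minimal sketch (assuming the package is installed from PyPI):

```python
from simple_ddl_parser import supported_dialects

# After this fix the correctly spelled dialect name is registered.
print("databricks" in supported_dialects)  # expected: True
```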

8 changes: 4 additions & 4 deletions README.md
@@ -118,7 +118,7 @@ And you will get output with additional keys 'stored_as', 'location', 'external'

If you run the parser from the command line, add the flag '-o=hql' or '--output-mode=hql' to get the same result.

Possible output_modes: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databrics', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']
Possible output_modes: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databricks', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']

### From python code

@@ -216,7 +216,7 @@ Output will be:
### More details

`DDLParser(ddl).run()`
The .run() method accepts several arguments that change the output. As shown above, the `output_mode` argument lets you set a dialect and get additional output fields for that dialect, for example 'hql'. Possible output_modes: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databrics', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']
The .run() method accepts several arguments that change the output. As shown above, the `output_mode` argument lets you set a dialect and get additional output fields for that dialect, for example 'hql'. Possible output_modes: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databricks', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']

Also, the .run() method takes a `group_by_type` argument (default: False). By default the parser output is a list of dicts, where each dict is one entity from the DDL (table, sequence, type, etc.), and to identify an entity you have to inspect its dict: if 'table_name' is present it is a table, if 'type_name' it is a type, and so on.
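
A minimal sketch combining both arguments (the DDL string and the exact output shape are illustrative, not verbatim library output):

```python
from simple_ddl_parser import DDLParser

ddl = """
CREATE EXTERNAL TABLE users (
    id int,
    name string
) STORED AS PARQUET LOCATION 's3://bucket/users/';
"""

# output_mode adds dialect-specific keys such as 'stored_as' and 'location';
# group_by_type groups entities under keys like 'tables' instead of
# returning one flat list of dicts.
result = DDLParser(ddl).run(output_mode="hql", group_by_type=True)
tables = result["tables"]  # assumption: grouped output exposes a 'tables' key
```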

@@ -513,7 +513,7 @@ if you choose the correct output_mode.

### New Dialects support
1. Added new dialects as possible output_modes:
- Databrics SQL like 'databricks',
- Databricks SQL like 'databricks',
- Vertica as 'vertica',
- SqliteFields as 'sqlite',
- PostgreSQL as 'postgres'
@@ -522,7 +522,7 @@ Full list of supported dialects you can find in dict - `supported_dialects`:

`from simple_ddl_parser import supported_dialects`

Currently supported: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databrics', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']
Currently supported: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databricks', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']

If you don't see the dialect you want to use, open an issue with a description and links to the database docs, or use one of the existing dialects.

8 changes: 4 additions & 4 deletions docs/README.rst
@@ -140,7 +140,7 @@ And you will get output with additional keys 'stored_as', 'location', 'external'
If you run the parser from the command line, add the flag '-o=hql' or '--output-mode=hql' to get the same result.

Possible output_modes: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databrics', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']
Possible output_modes: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databricks', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']

From python code
^^^^^^^^^^^^^^^^
@@ -237,7 +237,7 @@ More details
^^^^^^^^^^^^

``DDLParser(ddl).run()``
The .run() method accepts several arguments that change the output. As shown above, the ``output_mode`` argument lets you set a dialect and get additional output fields for that dialect, for example 'hql'. Possible output_modes: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databrics', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']
The .run() method accepts several arguments that change the output. As shown above, the ``output_mode`` argument lets you set a dialect and get additional output fields for that dialect, for example 'hql'. Possible output_modes: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databricks', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']

Also, the .run() method takes a ``group_by_type`` argument (default: False). By default the parser output is a list of dicts, where each dict is one entity from the DDL (table, sequence, type, etc.), and to identify an entity you have to inspect its dict: if 'table_name' is present it is a table, if 'type_name' it is a type, and so on.

@@ -590,7 +590,7 @@ New Dialects support
#. Added new dialects as possible output_modes:


* Databrics SQL like 'databricks',
* Databricks SQL like 'databricks',
* Vertica as 'vertica',
* SqliteFields as 'sqlite',
* PostgreSQL as 'postgres'
@@ -599,7 +599,7 @@ Full list of supported dialects you can find in dict - ``supported_dialects``\ :

``from simple_ddl_parser import supported_dialects``

Currently supported: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databrics', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']
Currently supported: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databricks', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']

If you don't see the dialect you want to use, open an issue with a description and links to the database docs, or use one of the existing dialects.

2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "simple-ddl-parser"
version = "1.0.1"
version = "1.0.2"
description = "Simple DDL Parser to parse SQL & dialects like HQL, TSQL (MSSQL), Oracle, AWS Redshift, Snowflake, MySQL, PostgreSQL, etc ddl files to json/python dict with full information about columns: types, defaults, primary keys, etc.; sequences, alters, custom types & other entities from ddl."
authors = ["Iuliia Volkova <[email protected]>"]
license = "MIT"
29 changes: 19 additions & 10 deletions simple_ddl_parser/dialects/snowflake.py
@@ -81,14 +81,9 @@ def p_expression_change_tracking(self, p: List) -> None:
p_list = remove_par(list(p))
p[0]["change_tracking"] = p_list[-1]

def p_table_comment(self, p: List) -> None:
"""expr : expr option_comment"""
p[0] = p[1]
if p[2]:
p[0].update(p[2])

def p_table_tag(self, p: List) -> None:
"""expr : expr option_with_tag"""
def p_comment_equals(self, p: List) -> None:
"""expr : expr option_comment
"""
p[0] = p[1]
if p[2]:
p[0].update(p[2])
@@ -98,10 +93,23 @@ def p_option_comment(self, p: List) -> None:
| ID DQ_STRING
| COMMENT ID STRING
| COMMENT ID DQ_STRING
| option_comment_equals
"""
p_list = remove_par(list(p))
if "comment" in p[1].lower():
p[0] = {"comment": p_list[-1]}
p[0] = {"comment": p_list[-1]}

def p_option_comment_equals(self, p: List) -> None:
"""option_comment_equals : STRING
| option_comment_equals DQ_STRING
"""
p_list = remove_par(list(p))
p[0] = str(p_list[-1])

def p_tag(self, p: List) -> None:
"""expr : expr option_with_tag"""
p[0] = p[1]
if p[2]:
p[0].update(p[2])

def p_tag_equals(self, p: List) -> None:
"""tag_equals : id id id_or_string
@@ -136,6 +144,7 @@ def p_option_with_tag(self, p: List) -> None:
| TAG LP id DOT id DOT id RP
| TAG LP multiple_tag_equals RP
| WITH TAG LP id RP
| WITH TAG LP id DOT id DOT id RP
| WITH TAG LP multiple_tag_equals RP
"""
p_list = remove_par(list(p))
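
To exercise the new grammar branches, a hedged example of the Snowflake DDL this PR targets: a schema-level `COMMENT =` plus a fully qualified tag matching the new `WITH TAG LP id DOT id DOT id RP` rule (names and output keys are illustrative):

```python
from simple_ddl_parser import DDLParser

ddl = """
CREATE SCHEMA reporting COMMENT = 'analytics schemas'
    WITH TAG (governance.tags.cost_center);
"""

# output_mode='snowflake' enables the dialect rules added above.
result = DDLParser(ddl).run(output_mode="snowflake", group_by_type=True)
```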
10 changes: 4 additions & 6 deletions simple_ddl_parser/dialects/sql.py
@@ -507,7 +507,6 @@ def set_auth_property_in_schema(self, p: List, p_list: List) -> None:
def p_c_schema(self, p: List) -> None:
"""c_schema : CREATE SCHEMA
| CREATE ID SCHEMA"""

if len(p) == 4:
p[0] = {"remote": True}

@@ -516,21 +515,17 @@ def p_create_schema(self, p: List) -> None:
| c_schema id id id
| c_schema id
| c_schema id DOT id
| c_schema id option_comment
| c_schema id DOT id option_comment
| c_schema IF NOT EXISTS id
| c_schema IF NOT EXISTS id DOT id
| create_schema id id id
| create_schema id id STRING
| create_schema options
"""
p_list = list(p)

p[0] = {}
auth_index = None

if "comment" in p_list[-1]:
p[0].update(p_list[-1])
del p_list[-1]

self.add_if_not_exists(p[0], p_list)
@@ -547,7 +542,10 @@ def p_create_schema(self, p: List) -> None:
if schema_name is None:
schema_name = p_list[auth_index + 1]
else:
schema_name = p_list[-1]
if "=" in p_list:
schema_name = p_list[2]
else:
schema_name = p_list[-1]
p[0]["schema_name"] = schema_name.replace("`", "")

p[0] = self.set_project_in_schema(p[0], p_list, auth_index)
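
The new `"=" in p_list` branch above keeps the schema name from being mistaken for the comment text when the statement carries a `COMMENT = '...'` option; a minimal sketch (the expected output is approximate):

```python
from simple_ddl_parser import DDLParser

ddl = "CREATE SCHEMA new_schema COMMENT = 'schema for staging data';"

result = DDLParser(ddl).run(group_by_type=True)
# Approximate expectation: the schema entry carries both the name and
# the comment, e.g. {'schemas': [{'schema_name': 'new_schema',
#                                 'comment': "'schema for staging data'"}], ...}
```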
20 changes: 10 additions & 10 deletions simple_ddl_parser/output/dialects.py
@@ -138,8 +138,8 @@ class MSSQL(Dialect):


@dataclass
@dialect(name="databrics")
class Databrics(Dialect):
@dialect(name="databricks")
class Databricks(Dialect):
property_key: Optional[str] = field(default=None)


@@ -261,32 +261,32 @@ class CommonDialectsFieldsMixin(Dialect):
)
stored_as: Optional[str] = field(
default=None,
metadata={"output_modes": add_dialects([SparkSQL, HQL, Databrics, Redshift])},
metadata={"output_modes": add_dialects([SparkSQL, HQL, Databricks, Redshift])},
)

row_format: Optional[dict] = field(
default=None,
metadata={"output_modes": add_dialects([SparkSQL, HQL, Databrics, Redshift])},
metadata={"output_modes": add_dialects([SparkSQL, HQL, Databricks, Redshift])},
)
location: Optional[str] = field(
default=None,
metadata={
"output_modes": add_dialects([HQL, SparkSQL, Snowflake, Databrics]),
"output_modes": add_dialects([HQL, SparkSQL, Snowflake, Databricks]),
"exclude_if_not_provided": True,
},
)
fields_terminated_by: Optional[str] = field(
default=None,
metadata={"output_modes": add_dialects([HQL, Databrics])},
metadata={"output_modes": add_dialects([HQL, Databricks])},
)
lines_terminated_by: Optional[str] = field(
default=None, metadata={"output_modes": add_dialects([HQL, Databrics])}
default=None, metadata={"output_modes": add_dialects([HQL, Databricks])}
)
map_keys_terminated_by: Optional[str] = field(
default=None, metadata={"output_modes": add_dialects([HQL, Databrics])}
default=None, metadata={"output_modes": add_dialects([HQL, Databricks])}
)
collection_items_terminated_by: Optional[str] = field(
default=None, metadata={"output_modes": add_dialects([HQL, Databrics])}
default=None, metadata={"output_modes": add_dialects([HQL, Databricks])}
)
clustered_by: Optional[list] = field(
default=None,
@@ -305,7 +305,7 @@ class CommonDialectsFieldsMixin(Dialect):
transient: Optional[bool] = field(
default=False,
metadata={
"output_modes": add_dialects([HQL, Databrics]),
"output_modes": add_dialects([HQL, Databricks]),
"exclude_if_not_provided": True,
},
)
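
With the dataclass renamed, the registered dialect name is now spelled correctly, so `output_mode="databricks"` is the value to pass; a short sketch (the DDL is illustrative):

```python
from simple_ddl_parser import DDLParser

ddl = "CREATE TABLE events (id int) LOCATION 's3://bucket/events/';"

# 'databricks' (with the k) matches the @dialect registration above;
# the old 'databrics' spelling no longer names a registered dialect.
result = DDLParser(ddl).run(output_mode="databricks")
```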
18 changes: 11 additions & 7 deletions simple_ddl_parser/parser.py
@@ -97,6 +97,12 @@ def __init__(
self.block_comments = []
self.comments = []

self.comma_only_str = re.compile(r"((\')|(' ))+(,)((\')|( '))+\B")
self.equal_without_space = re.compile(r"(\b)=")
self.in_comment = re.compile(r"((\")|(\'))+(.)*(--)+(.)*((\")|(\'))+")
self.set_statement = re.compile(r"SET ")
self.skip_regex = re.compile(r"^(GO|USE|INSERT)\b")

def catch_comment_or_process_line(self, code_line: str) -> str:
if self.multi_line_comment:
self.comments.append(self.line)
@@ -113,8 +119,8 @@ def catch_comment_or_process_line(self, code_line: str) -> str:

def pre_process_line(self) -> Tuple[str, List]:
code_line = ""
comma_only_str = r"((\')|(' ))+(,)((\')|( '))+\B"
self.line = re.sub(comma_only_str, "_ddl_parser_comma_only_str", self.line)
self.line = self.comma_only_str.sub("_ddl_parser_comma_only_str", self.line)
self.line = self.equal_without_space.sub(" = ", self.line)
code_line = self.catch_comment_or_process_line(code_line)
if self.line.startswith(OP_COM) and CL_COM not in self.line:
self.multi_line_comment = True
@@ -123,7 +129,7 @@ def pre_process_line(self) -> Tuple[str, List]:
self.line = code_line

def process_in_comment(self, line: str) -> str:
if re.search(r"((\")|(\'))+(.)*(--)+(.)*((\")|(\'))+", line):
if self.in_comment.search(line):
code_line = line
else:
splitted_line = line.split(IN_COM)
@@ -200,7 +206,7 @@ def process_set(self) -> None:
self.tables.append({"name": name, "value": value})

def parse_set_statement(self):
if re.match(r"SET ", self.line.upper()):
if self.set_statement.match(self.line.upper()):
self.set_was_in_line = True
if not self.set_line:
self.set_line = self.line
@@ -224,11 +230,9 @@ def check_new_statement_start(self, line: str) -> bool:
return self.new_statement

def check_line_on_skip_words(self) -> bool:
skip_regex = r"^(GO|USE|INSERT)\b"

self.skip = False

if re.match(skip_regex, self.line.upper()):
if self.skip_regex.match(self.line.upper()):
self.skip = True
return self.skip

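
The change above compiles each regexp once in `__init__` and reuses the compiled pattern for every line, instead of rebuilding it per call; a self-contained sketch of the same idea (class and method names here are illustrative, not the parser's real API):

```python
import re


class LinePreprocessor:
    def __init__(self) -> None:
        # Compile once; the compiled pattern is reused for every line,
        # skipping the pattern-cache lookup that module-level re.sub does.
        self.comma_only_str = re.compile(r"((\')|(' ))+(,)((\')|( '))+\B")
        self.equal_without_space = re.compile(r"(\b)=")

    def pre_process_line(self, line: str) -> str:
        line = self.comma_only_str.sub("_ddl_parser_comma_only_str", line)
        return self.equal_without_space.sub(" = ", line)


print(LinePreprocessor().pre_process_line("COMMENT='x'"))  # COMMENT = 'x'
```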