Commit

fix equals with or without space around, typo on databricks name, enforce snowflake tag on schema.
dmaresma committed Jan 13, 2024
1 parent 7239317 commit c726143
Showing 17 changed files with 605 additions and 50,317 deletions.
11 changes: 9 additions & 2 deletions CHANGELOG.txt
@@ -1,3 +1,10 @@
**v1.0.2**
### Minor Fixes
1. fix typo in the Databricks dialect name
2. improve support for equals signs, with or without surrounding spaces, in COMMENT statements
3. Snowflake TAG is now available on SCHEMA definitions
4. precompile the parser's regular expressions
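The equals-sign fix can be illustrated with a minimal standalone sketch (the pattern mirrors the `equal_without_space` attribute this commit adds in `simple_ddl_parser/parser.py`; the sample DDL line is hypothetical):

```python
import re

# Pattern mirroring the one added in parser.py: an '=' glued to the
# preceding word gets spaces inserted around it before parsing.
equal_without_space = re.compile(r"(\b)=")

line = "COMMENT='customer dimension table'"
normalized = equal_without_space.sub(" = ", line)
print(normalized)  # COMMENT = 'customer dimension table'
```

An already-spaced `COMMENT = '...'` is left untouched, because `\b` does not match between a space and `=`, so the normalization is idempotent.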

**v1.0.1**
### Minor Fixes
1. When using `normalize_names=True` do not remove `[]` from types like `decimal(21)[]`.
@@ -25,7 +32,7 @@ if you choose the correct output_mode.

### New Dialects support
1. Added as possible output_modes new Dialects:
- Databrics SQL like 'databricks',
- Databricks SQL like 'databricks',
- Vertica as 'vertica',
- SqliteFields as 'sqlite',
- PostgreSQL as 'postgres'
@@ -34,7 +41,7 @@ Full list of supported dialects you can find in dict - `supported_dialects`:

`from simple_ddl_parser import supported_dialects`

Currently supported: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databrics', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']
Currently supported: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databricks', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']

If you don't see dialect that you want to use - open issue with description and links to Database docs or use one of existed dialects.

8 changes: 4 additions & 4 deletions README.md
@@ -118,7 +118,7 @@ And you will get output with additional keys 'stored_as', 'location', 'external'

If you run parser with command line add flag '-o=hql' or '--output-mode=hql' to get the same result.

Possible output_modes: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databrics', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']
Possible output_modes: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databricks', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']

### From python code

@@ -216,7 +216,7 @@ Output will be:
### More details

`DDLParser(ddl).run()`
.run() method contains several arguments, that impact changing output result. As you can saw upper exists argument `output_mode` that allow you to set dialect and get more fields in output relative to chosen dialect, for example 'hql'. Possible output_modes: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databrics', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']
.run() method contains several arguments, that impact changing output result. As you can saw upper exists argument `output_mode` that allow you to set dialect and get more fields in output relative to chosen dialect, for example 'hql'. Possible output_modes: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databricks', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']

Also in .run() method exists argument `group_by_type` (by default: False). By default output of parser looks like a List with Dicts where each dict == one entity from ddl (table, sequence, type, etc). And to understand that is current entity you need to check Dict like: if 'table_name' in dict - this is a table, if 'type_name' - this is a type & etc.
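The key-based entity detection described above can be sketched as a tiny standalone helper (a hypothetical re-implementation of the rule the README states, not the library's actual code; the `sequence_name` key is an assumed example):

```python
# Hypothetical sketch: with group_by_type=False each parsed dict is
# classified by which name key it carries, as described above.
def classify(entity: dict) -> str:
    if "table_name" in entity:
        return "table"
    if "type_name" in entity:
        return "type"
    if "sequence_name" in entity:  # assumed key, for illustration
        return "sequence"
    return "unknown"

parsed = [{"table_name": "users"}, {"type_name": "mood"}]
print([classify(e) for e in parsed])  # ['table', 'type']
```

With `group_by_type=True` the parser instead returns these entities pre-grouped, so the caller no longer needs such key probing.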

@@ -513,7 +513,7 @@ if you choose the correct output_mode.

### New Dialects support
1. Added as possible output_modes new Dialects:
- Databrics SQL like 'databricks',
- Databricks SQL like 'databricks',
- Vertica as 'vertica',
- SqliteFields as 'sqlite',
- PostgreSQL as 'postgres'
@@ -522,7 +522,7 @@ Full list of supported dialects you can find in dict - `supported_dialects`:

`from simple_ddl_parser import supported_dialects`

Currently supported: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databrics', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']
Currently supported: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databricks', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']

If you don't see dialect that you want to use - open issue with description and links to Database docs or use one of existed dialects.

8 changes: 4 additions & 4 deletions docs/README.rst
@@ -140,7 +140,7 @@ And you will get output with additional keys 'stored_as', 'location', 'external'
If you run parser with command line add flag '-o=hql' or '--output-mode=hql' to get the same result.

Possible output_modes: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databrics', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']
Possible output_modes: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databricks', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']

From python code
^^^^^^^^^^^^^^^^
@@ -237,7 +237,7 @@ More details
^^^^^^^^^^^^

``DDLParser(ddl).run()``
.run() method contains several arguments, that impact changing output result. As you can saw upper exists argument ``output_mode`` that allow you to set dialect and get more fields in output relative to chosen dialect, for example 'hql'. Possible output_modes: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databrics', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']
.run() method contains several arguments, that impact changing output result. As you can saw upper exists argument ``output_mode`` that allow you to set dialect and get more fields in output relative to chosen dialect, for example 'hql'. Possible output_modes: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databricks', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']

Also in .run() method exists argument ``group_by_type`` (by default: False). By default output of parser looks like a List with Dicts where each dict == one entity from ddl (table, sequence, type, etc). And to understand that is current entity you need to check Dict like: if 'table_name' in dict - this is a table, if 'type_name' - this is a type & etc.

@@ -590,7 +590,7 @@ New Dialects support
#. Added as possible output_modes new Dialects:


* Databrics SQL like 'databricks',
* Databricks SQL like 'databricks',
* Vertica as 'vertica',
* SqliteFields as 'sqlite',
* PostgreSQL as 'postgres'
@@ -599,7 +599,7 @@ Full list of supported dialects you can find in dict - ``supported_dialects``\ :

``from simple_ddl_parser import supported_dialects``

Currently supported: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databrics', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']
Currently supported: ['redshift', 'spark_sql', 'mysql', 'bigquery', 'mssql', 'databricks', 'sqlite', 'vertics', 'ibm_db2', 'postgres', 'oracle', 'hql', 'snowflake', 'sql']

If you don't see dialect that you want to use - open issue with description and links to Database docs or use one of existed dialects.

2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "simple-ddl-parser"
version = "1.0.1"
version = "1.0.2"
description = "Simple DDL Parser to parse SQL & dialects like HQL, TSQL (MSSQL), Oracle, AWS Redshift, Snowflake, MySQL, PostgreSQL, etc ddl files to json/python dict with full information about columns: types, defaults, primary keys, etc.; sequences, alters, custom types & other entities from ddl."
authors = ["Iuliia Volkova <[email protected]>"]
license = "MIT"
29 changes: 19 additions & 10 deletions simple_ddl_parser/dialects/snowflake.py
@@ -81,14 +81,9 @@ def p_expression_change_tracking(self, p: List) -> None:
p_list = remove_par(list(p))
p[0]["change_tracking"] = p_list[-1]

def p_table_comment(self, p: List) -> None:
"""expr : expr option_comment"""
p[0] = p[1]
if p[2]:
p[0].update(p[2])

def p_table_tag(self, p: List) -> None:
"""expr : expr option_with_tag"""
def p_comment_equals(self, p: List) -> None:
"""expr : expr option_comment
"""
p[0] = p[1]
if p[2]:
p[0].update(p[2])
@@ -98,10 +93,23 @@ def p_option_comment(self, p: List) -> None:
| ID DQ_STRING
| COMMENT ID STRING
| COMMENT ID DQ_STRING
| option_comment_equals
"""
p_list = remove_par(list(p))
if "comment" in p[1].lower():
p[0] = {"comment": p_list[-1]}
p[0] = {"comment": p_list[-1]}

def p_option_comment_equals(self, p: List) -> None:
"""option_comment_equals : STRING
| option_comment_equals DQ_STRING
"""
p_list = remove_par(list(p))
p[0] = str(p_list[-1])

def p_tag(self, p: List) -> None:
"""expr : expr option_with_tag"""
p[0] = p[1]
if p[2]:
p[0].update(p[2])

def p_tag_equals(self, p: List) -> None:
"""tag_equals : id id id_or_string
@@ -136,6 +144,7 @@ def p_option_with_tag(self, p: List) -> None:
| TAG LP id DOT id DOT id RP
| TAG LP multiple_tag_equals RP
| WITH TAG LP id RP
| WITH TAG LP id DOT id DOT id RP
| WITH TAG LP multiple_tag_equals RP
"""
p_list = remove_par(list(p))
10 changes: 4 additions & 6 deletions simple_ddl_parser/dialects/sql.py
Expand Up @@ -507,7 +507,6 @@ def set_auth_property_in_schema(self, p: List, p_list: List) -> None:
def p_c_schema(self, p: List) -> None:
"""c_schema : CREATE SCHEMA
| CREATE ID SCHEMA"""

if len(p) == 4:
p[0] = {"remote": True}

@@ -516,21 +515,17 @@ def p_create_schema(self, p: List) -> None:
| c_schema id id id
| c_schema id
| c_schema id DOT id
| c_schema id option_comment
| c_schema id DOT id option_comment
| c_schema IF NOT EXISTS id
| c_schema IF NOT EXISTS id DOT id
| create_schema id id id
| create_schema id id STRING
| create_schema options
"""
p_list = list(p)

p[0] = {}
auth_index = None

if "comment" in p_list[-1]:
p[0].update(p_list[-1])
del p_list[-1]

self.add_if_not_exists(p[0], p_list)
@@ -547,7 +542,10 @@ if schema_name is None:
if schema_name is None:
schema_name = p_list[auth_index + 1]
else:
schema_name = p_list[-1]
if "=" in p_list:
schema_name = p_list[2]
else:
schema_name = p_list[-1]
p[0]["schema_name"] = schema_name.replace("`", "")

p[0] = self.set_project_in_schema(p[0], p_list, auth_index)
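The intent of the new branch in `p_create_schema` can be sketched standalone (a hypothetical re-implementation; `p_list` mimics ply's production list, where index 0 is the rule's result slot and the token layout with a trailing comment is assumed as `['c_schema', name, 'comment', '=', ...]`):

```python
# Hypothetical sketch of the schema-name selection added above: when the
# matched tokens contain '=', a COMMENT = '...' clause trails the name,
# so the name is the token at index 2, not the last token.
def pick_schema_name(p_list):
    if "=" in p_list:
        return p_list[2]
    return p_list[-1]

with_comment = [None, "c_schema", "my_schema", "comment", "=", "'demo'"]
plain = [None, "c_schema", "my_schema"]
print(pick_schema_name(with_comment), pick_schema_name(plain))  # my_schema my_schema
```

Before this fix, the last-token fallback would have returned the comment string itself as the schema name whenever an inline `COMMENT =` clause was present.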
20 changes: 10 additions & 10 deletions simple_ddl_parser/output/dialects.py
@@ -138,8 +138,8 @@ class MSSQL(Dialect):


@dataclass
@dialect(name="databrics")
class Databrics(Dialect):
@dialect(name="databricks")
class Databricks(Dialect):
property_key: Optional[str] = field(default=None)


@@ -261,32 +261,32 @@ class CommonDialectsFieldsMixin(Dialect):
)
stored_as: Optional[str] = field(
default=None,
metadata={"output_modes": add_dialects([SparkSQL, HQL, Databrics, Redshift])},
metadata={"output_modes": add_dialects([SparkSQL, HQL, Databricks, Redshift])},
)

row_format: Optional[dict] = field(
default=None,
metadata={"output_modes": add_dialects([SparkSQL, HQL, Databrics, Redshift])},
metadata={"output_modes": add_dialects([SparkSQL, HQL, Databricks, Redshift])},
)
location: Optional[str] = field(
default=None,
metadata={
"output_modes": add_dialects([HQL, SparkSQL, Snowflake, Databrics]),
"output_modes": add_dialects([HQL, SparkSQL, Snowflake, Databricks]),
"exclude_if_not_provided": True,
},
)
fields_terminated_by: Optional[str] = field(
default=None,
metadata={"output_modes": add_dialects([HQL, Databrics])},
metadata={"output_modes": add_dialects([HQL, Databricks])},
)
lines_terminated_by: Optional[str] = field(
default=None, metadata={"output_modes": add_dialects([HQL, Databrics])}
default=None, metadata={"output_modes": add_dialects([HQL, Databricks])}
)
map_keys_terminated_by: Optional[str] = field(
default=None, metadata={"output_modes": add_dialects([HQL, Databrics])}
default=None, metadata={"output_modes": add_dialects([HQL, Databricks])}
)
collection_items_terminated_by: Optional[str] = field(
default=None, metadata={"output_modes": add_dialects([HQL, Databrics])}
default=None, metadata={"output_modes": add_dialects([HQL, Databricks])}
)
clustered_by: Optional[list] = field(
default=None,
@@ -305,7 +305,7 @@ class CommonDialectsFieldsMixin(Dialect):
transient: Optional[bool] = field(
default=False,
metadata={
"output_modes": add_dialects([HQL, Databrics]),
"output_modes": add_dialects([HQL, Databricks]),
"exclude_if_not_provided": True,
},
)
18 changes: 11 additions & 7 deletions simple_ddl_parser/parser.py
Expand Up @@ -97,6 +97,12 @@ def __init__(
self.block_comments = []
self.comments = []

self.comma_only_str = re.compile(r"((\')|(' ))+(,)((\')|( '))+\B")
self.equal_without_space = re.compile(r"(\b)=")
self.in_comment = re.compile(r"((\")|(\'))+(.)*(--)+(.)*((\")|(\'))+")
self.set_statement = re.compile(r"SET ")
self.skip_regex = re.compile(r"^(GO|USE|INSERT)\b")

def catch_comment_or_process_line(self, code_line: str) -> str:
if self.multi_line_comment:
self.comments.append(self.line)
@@ -113,8 +119,8 @@ def catch_comment_or_process_line(self, code_line: str) -> str:

def pre_process_line(self) -> Tuple[str, List]:
code_line = ""
comma_only_str = r"((\')|(' ))+(,)((\')|( '))+\B"
self.line = re.sub(comma_only_str, "_ddl_parser_comma_only_str", self.line)
self.line = self.comma_only_str.sub("_ddl_parser_comma_only_str", self.line)
self.line = self.equal_without_space.sub(" = ", self.line)
code_line = self.catch_comment_or_process_line(code_line)
if self.line.startswith(OP_COM) and CL_COM not in self.line:
self.multi_line_comment = True
@@ -123,7 +129,7 @@ def process_in_comment(self, line: str) -> str:
self.line = code_line

def process_in_comment(self, line: str) -> str:
if re.search(r"((\")|(\'))+(.)*(--)+(.)*((\")|(\'))+", line):
if self.in_comment.search(line):
code_line = line
else:
splitted_line = line.split(IN_COM)
@@ -200,7 +206,7 @@ def process_set(self) -> None:
self.tables.append({"name": name, "value": value})

def parse_set_statement(self):
if re.match(r"SET ", self.line.upper()):
if self.set_statement.match(self.line.upper()):
self.set_was_in_line = True
if not self.set_line:
self.set_line = self.line
@@ -224,11 +230,9 @@ def check_new_statement_start(self, line: str) -> bool:
return self.new_statement

def check_line_on_skip_words(self) -> bool:
skip_regex = r"^(GO|USE|INSERT)\b"

self.skip = False

if re.match(skip_regex, self.line.upper()):
if self.skip_regex.match(self.line.upper()):
self.skip = True
return self.skip

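The `parser.py` changes above move regex compilation out of the hot per-line methods and into `__init__`, so each pattern is compiled once and reused. A minimal sketch of that refactoring style (the class and sample inputs are hypothetical; the pattern strings mirror the diff):

```python
import re

class LinePreprocessor:
    """Sketch of the precompiled-pattern style adopted in parser.py above."""

    def __init__(self):
        # Compiled once per parser instance, reused for every input line.
        self.skip_regex = re.compile(r"^(GO|USE|INSERT)\b")
        self.set_statement = re.compile(r"SET ")

    def should_skip(self, line: str) -> bool:
        # match() anchors at the start, so only leading GO/USE/INSERT skip.
        return bool(self.skip_regex.match(line.upper()))

pre = LinePreprocessor()
print(pre.should_skip("GO"), pre.should_skip("SELECT 1"))  # True False
```

Python's `re` module does cache compiled patterns internally, but explicit `re.compile` in `__init__` avoids repeated cache lookups and keeps all patterns declared in one place.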