Serialize just once random_referenced objects #700

Draft · wants to merge 6 commits into base: main
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -2,7 +2,7 @@ default_language_version:
python: python3
repos:
- repo: https://github.com/ambv/black
rev: 21.4b2
rev: 22.6.0
hooks:
- id: black
- repo: https://github.com/pre-commit/pre-commit-hooks
95 changes: 53 additions & 42 deletions docs/arch/ArchIndex.md
@@ -8,8 +8,6 @@ The Snowfakery interpreter reads a recipe, translates it into internal data stru

Obviously, Snowfakery architecture will be easier to understand in the context of the language itself, so understanding the syntax is a good first step.



## Levels of Looping

Snowfakery recipes are designed to be evaluated over and over again, top to bottom. Each run-through is called
@@ -21,15 +19,15 @@ This is useful for generating chunks of data called _portions_, and then handing

Here is the overall pattern:

| CumulusCI | Snowfakery | Data Loader |
| ------------- |-------------| -------------|
| Generate Data | Start | Wait |
| Load Data | Stop | Start |
| Generate Data | Start | Stop |
| Load Data | Stop | Start |
| Generate Data | Start | Stop |
| Load Data | Finish | Start |
| Load Data | Finished | Finish |
| CumulusCI | Snowfakery | Data Loader |
| ------------- | ---------- | ----------- |
| Generate Data | Start | Wait |
| Load Data | Stop | Start |
| Generate Data | Start | Stop |
| Load Data | Stop | Start |
| Generate Data | Start | Stop |
| Load Data | Finish | Start |
| Load Data | Finished | Finish |

Note that every time you Start and Stop Snowfakery, you generate a whole new Interpreter object, which re-reads the recipe. In some contexts, the new Interpreter object may be in a different process or (theoretically) on a different computer altogether.
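
As a rough sketch only (using the CLI flags shown in the Continuations section below; `load_portion` is a hypothetical placeholder for the data loader, not part of Snowfakery), the alternation in the table might be driven like this:

```python
import subprocess


def load_portion():
    # Hypothetical stand-in for the data loader step (e.g. a CumulusCI load task).
    pass


# Start/Stop #1: generate a portion and snapshot the interpreter state.
subprocess.run(
    ["snowfakery", "foo.yml", "--generate-continuation-file", "/tmp/continue.yml"],
    check=True,
)
load_portion()

# Start/Stop #2: a brand-new Interpreter resumes from the snapshot.
# (Assumption: a real orchestrator would also write a fresh continuation
# file here so that further portions can be chained.)
subprocess.run(
    ["snowfakery", "foo.yml", "--continuation-file", "/tmp/continue.yml"],
    check=True,
)
load_portion()
```
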

@@ -57,9 +55,9 @@ So Snowfakery would run it once, snapshot the "continuation state" and then fan t

When reading Snowfakery code, you must always think about the lifetime of each data structure:

* Will it survive for a single iteration, like local variables? We call these Transients.
* Will it survive for a single continuation, like "FakerData" objects? We could call these Interpreter Managed objects.
* Will it be saved and loaded between continuations, and thus survive across continuations? These are Globals.
- Will it survive for a single iteration, like local variables? We call these Transients.
- Will it survive for a single continuation, like "FakerData" objects? We could call these Interpreter Managed objects.
- Will it be saved and loaded between continuations, and thus survive across continuations? These are Globals.

## The Parser

@@ -76,12 +74,12 @@ is executed once per continuation (or just once if the recipe is not continued).
The Interpreter mediates access between the recipe (represented by the ParseResult) and resources
such as:

* the Output Stream
* Global persistent data that survives continuations by being saved to and loaded from YAML
* Transient persistent data that is discarded and rebuilt (as necessary) after continuation
* The Row History which is used for allowing randomized access to objects for the `random_reference` feature
* Plugins and Providers which extend Snowfakery
* Runtime Object Model objects
- the Output Stream
- Global persistent data that survives continuations by being saved to and loaded from YAML
- Transient persistent data that is discarded and rebuilt (as necessary) after continuation
- The Row History which is used for allowing randomized access to objects for the `random_reference` feature
- Plugins and Providers which extend Snowfakery
- Runtime Object Model objects

On my relatively slow computer it takes 1/25 of a second to initialize an Interpreter from a Recipe once all modules are loaded. It takes about 3/4 of a second to launch an interpreter and load the core required modules.

@@ -97,8 +95,7 @@ For example, a VariableDefinition represents this structure:

```


An ObjectTemplate represents this one:

```
- object: XXX
@@ -128,12 +125,12 @@ id_manager:
Contact: 2
Opportunity: 5
intertable_dependencies:
- field_name: AccountId
table_name_from: Contact
table_name_to: Account
- field_name: AccountId
table_name_from: Opportunity
table_name_to: Account
- field_name: AccountId
table_name_from: Contact
table_name_to: Account
- field_name: AccountId
table_name_from: Opportunity
table_name_to: Account
nicknames_and_tables:
Account: Account
Contact: Contact
@@ -173,28 +170,27 @@ today: 2022-06-06

This also shows the contents of the Globals object. Things we track:

* The last used IDs for various Tables, so we don't generate overlapping IDs
* Inter-table dependencies, so we can generate a CCI mapping file or other output schema that depends on
- The last used IDs for various Tables, so we don't generate overlapping IDs
- Inter-table dependencies, so we can generate a CCI mapping file or other output schema that depends on
relationships
* Mapping from nicknames to tablenames, with tables own names being registered as nicknames for convenience
* Data from specific ("persistent") objects which the user asked to be generated just once and may want to refer to again later
* The current date to allow the `today` function to be consistent even if a process runs across midnight (perhaps we should revisit this)
- Mapping from nicknames to tablenames, with tables' own names being registered as nicknames for convenience
- Data from specific ("persistent") objects which the user asked to be generated just once and may want to refer to again later
- The current date to allow the `today` function to be consistent even if a process runs across midnight (perhaps we should revisit this)

### Transients

If data should be discarded on every iteration (analogous to 'local variables' in a programming language), then it should be stored in the Transients object, which is recreated on every iteration. This object is accessible through the Globals but is not saved to YAML.

### Row History

RowHistory keeps track of the contents of a subset of the rows/objects generated by Snowfakery in a single continuation.

There are a few Recipe patterns enabled by the row history:

* `random_reference` lookups to nicknames
* `random_reference` lookups to objects that have data of interest, such as _another_ `random_reference`
- `random_reference` lookups to nicknames
- `random_reference` lookups to objects that have data of interest, such as _another_ `random_reference`


Row History data structures survive for as long as a single process/interpreter/continuation. A new
continuation gets a new Row History, so it is not possible to use Row History to make links across
continuation boundaries.

@@ -215,11 +211,10 @@ Here is the kind of recipe that might blow up memory:
fields:
ref:
random_reference: target
name:
${{ref.bloat}}
name: ${{ref.bloat}}
```

The second object picks one of 100M unique strings
which are each approx 80M in size. That's a lot of data and
would quickly blow up memory.
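
One mitigation is the lazy-reference pattern implemented by `LazyLoadedObjectReference` (see the `object_rows.py` changes below): row data is not loaded until a field is actually accessed. Here is a minimal, hypothetical sketch of that pattern, with a plain dict standing in for the Row History store:

```python
# Simplified illustration only; the real class loads from the RowHistory database.
class LazyRowReference:
    def __init__(self, tablename: str, row_id: int, row_history: dict):
        self.tablename = tablename
        self.id = row_id
        self._row_history = row_history
        self._data = None

    def __getattr__(self, attrname):
        if attrname.startswith("_"):
            raise AttributeError(attrname)
        if self._data is None:
            # The expensive row is materialized only on first field access.
            self._data = self._row_history[(self.tablename, self.id)]
        return self._data[attrname]


history = {("target", 1): {"id": 1, "bloat": "x" * 80}}
ref = LazyRowReference("target", 1, history)
print(ref.bloat)  # the row's data is loaded only here
```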

@@ -242,8 +237,24 @@ All Fake Data is mediated through the [FakeData](https://github.com/SFDO-Tooling

Snowfakery extends and customizes the set of fake data providers through its [FakeNames](https://github.com/SFDO-Tooling/Snowfakery/search?q=%22class+FakeNames%22) class. For example, Snowfakery's email address provider incorporates the first name and last name of the imaginary person into the email. Snowfakery renames `postcode` to `postalcode` to match Salesforce conventions. Snowfakery adds timezones to date-time fakers.
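
For illustration only (this is not Snowfakery's actual `FakeNames` code), a custom provider that aliases `postcode` as `postalcode` can be registered through Faker's standard extension mechanism:

```python
from faker import Faker
from faker.providers import BaseProvider


class SalesforceStyleProvider(BaseProvider):
    """Illustrative sketch, not Snowfakery's FakeNames class."""

    def postalcode(self):
        # Alias Faker's built-in postcode under the Salesforce-style name.
        return self.generator.postcode()


fake = Faker()
fake.add_provider(SalesforceStyleProvider)
print(fake.postalcode())
```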

## Formulas

Snowfakery `${{formulas}}` are Jinja Templates controlled by a class called the [`JinjaTemplateEvaluatorFactory`](https://github.com/SFDO-Tooling/Snowfakery/search?q=%22class+JinjaTemplateEvaluatorFactory%22). The `Interpreter` object keeps a reference to this class.
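
A minimal sketch of the idea, assuming only standard Jinja2 options rather than the factory's real configuration: an environment whose variable delimiters are `${{` and `}}`.

```python
import jinja2

# Sketch: a Jinja environment using Snowfakery-style ${{ ... }} delimiters.
env = jinja2.Environment(
    variable_start_string="${{",
    variable_end_string="}}",
)
template = env.from_string("Hello ${{ name }}!")
print(template.render(name="World"))  # -> Hello World!
```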

## Continuations

Recall that there are multiple [Levels of Looping](#levels-of-looping). Data which
survives beyond continuation (process) boundaries lives in continuation files.
You can see how that works here:

```sh
$ snowfakery foo.yml --generate-continuation-file /tmp/continue.yml && snowfakery foo.yml --continuation-file /tmp/continue.yml

$ cat /tmp/continue.yml
```

The contents of `/tmp/continue.yml` are specific to a version of Snowfakery and subject
to change over time.

In general, it saves the contents of `just_once` objects and recently created
objects.
68 changes: 40 additions & 28 deletions snowfakery/data_generator_runtime.py
@@ -36,7 +36,7 @@


# save every single object to history. Useful for testing saving of datatypes
SAVE_EVERYTHING = os.environ.get("SF_SAVE_EVERYTHING")
SAVE_EVERYTHING = os.environ.get("SF_SAVE_EVERYTHING", False)


class StoppingCriteria(NamedTuple):
@@ -126,11 +126,18 @@ def __init__(
today: date = None,
name_slots: Mapping[str, str] = None,
):
# these lists start empty and are filled.
# They survive iterations and continuations.
# all of these properties start empty and are filled.
# They all survive iterations and continuations.

# These two are indexed by name
self.persistent_nicknames = {}
self.persistent_objects_by_table = {}

# Not indexed because it is used only to refresh the RowHistory DB
# after continuation
# Otherwise the data is never read or written
self.persistent_random_referenceable_objects = []
Review comment (Contributor): Is there a reason this isn't a dict?

Reply (Contributor Author, @prescod, Jul 12, 2022): Does b671d0f clarify, @jofsky?

self.id_manager = IdManager()
self.intertable_dependencies = OrderedSet()
self.today = today or date.today()
@@ -139,16 +146,25 @@ def __init__(
self.reset_slots()

def register_object(
self, obj: ObjectRow, nickname: Optional[str], persistent_object: bool
self,
obj: ObjectRow,
nickname: Optional[str],
persistent_object: bool,
random_referenced_object: bool,
):
"""Register an object for lookup by object type and (optionally) Nickname"""
if nickname:
# should survive continuations. Somebody will probably `reference:` it
if persistent_object:
self.persistent_nicknames[nickname] = obj
else:
self.transients.nicknamed_objects[nickname] = obj
if persistent_object:
self.persistent_objects_by_table[obj._tablename] = obj

if persistent_object and random_referenced_object:
self.persistent_random_referenceable_objects.append((nickname, obj))

self.transients.last_seen_obj_by_table[obj._tablename] = obj

@property
@@ -214,6 +230,10 @@ def serialize_dict_of_object_rows(dct):
"today": self.today,
"nicknames_and_tables": self.nicknames_and_tables,
"intertable_dependencies": intertable_dependencies,
"persistent_random_referenceable_objects": [
(nn, obj.__getstate__())
for (nn, obj) in self.persistent_random_referenceable_objects
],
}
return state

@@ -233,6 +253,10 @@ def deserialize_dict_of_object_rows(dct):
self.intertable_dependencies = OrderedSet(
Dependency(*dep) for dep in getattr(state, "intertable_dependencies", [])
)
self.persistent_random_referenceable_objects = [
(nickname, hydrate(ObjectRow, v))
for (nickname, v) in state["persistent_random_referenceable_objects"]
]

self.today = state["today"]
persistent_objects_by_table = state.get("persistent_objects_by_table")
@@ -373,26 +397,8 @@ def resave_objects_from_continuation(
):
"""Re-save just_once objects to the local history cache after resuming a continuation"""

# deal with objs known by their nicknames
relevant_objs = [
(obj._tablename, nickname, obj)
for nickname, obj in globals.persistent_nicknames.items()
]
already_saved = set(obj._id for (_, _, obj) in relevant_objs)
# and those known by their tablename, if not already in the list
relevant_objs.extend(
(tablename, None, obj)
for tablename, obj in globals.persistent_objects_by_table.items()
if obj._id not in already_saved
)
# filter out those in tables that are not history-backed
relevant_objs = (
(table, nick, obj)
for (table, nick, obj) in relevant_objs
if table in tables_to_keep_history_for
)
for tablename, nickname, obj in relevant_objs:
self.row_history.save_row(tablename, nickname, obj._values)
for nickname, obj in globals.persistent_random_referenceable_objects:
self.row_history.save_row(obj._tablename, nickname, obj._values)

def execute(self):
RowHistoryCV.set(self.row_history)
@@ -569,19 +575,25 @@ def remember_row(self, tablename: str, nickname: T.Optional[str], row: dict):
self.interpreter.globals.register_intertable_reference(
tablename, fieldvalue._tablename, fieldname
)
if self._should_save(tablename, nickname):
self.interpreter.row_history.save_row(tablename, nickname, row)

def _should_save(self, tablename: str, nickname: T.Optional[str]) -> bool:
history_tables = self.interpreter.tables_to_keep_history_for
should_save: bool = (
return (
(tablename in history_tables)
or (nickname in history_tables)
or SAVE_EVERYTHING
)
if should_save:
self.interpreter.row_history.save_row(tablename, nickname, row)

def register_object(self, obj, name: Optional[str], persistent: bool):
"Keep track of this object in case other objects refer to it."
self.obj = obj
self.interpreter.globals.register_object(obj, name, persistent)
should_save = self._should_save(obj._tablename, name)
# `persistent` means: is it `just_once` and therefore might be
# referred to by `reference` in a future iteration
# `should_save` means it may be referred to by `random_reference`
self.interpreter.globals.register_object(obj, name, persistent, should_save)

@contextmanager
def child_context(self, template):
14 changes: 12 additions & 2 deletions snowfakery/object_rows.py
@@ -70,6 +70,9 @@ def __init__(self, tablename: str, id: int):

class LazyLoadedObjectReference(ObjectReference):
_data = None
yaml_loader = yaml.SafeLoader
yaml_dumper = SnowfakeryDumper
yaml_tag = "!snowfakery_lazyloadedobjectrow"

def __init__(
self,
@@ -85,10 +88,17 @@ def __getattr__(self, attrname):
if attrname.endswith("__"): # pragma: no cover
raise AttributeError(attrname)
if self._data is None:
row_history = RowHistoryCV.get()
self._data = row_history.load_row(self.sql_tablename, self.id)
self._load_data()
return self._data[attrname]

def _load_data(self):
row_history = RowHistoryCV.get()
self._data = row_history.load_row(self.sql_tablename, self.id)

def __reduce_ex__(self, *args, **kwargs):
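# __reduce_ex__ is what pickle/copy call to serialize this object; loading the
# row data first (presumably the intent here) means the serialized state carries
# real field values rather than an unresolved lazy reference.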
self._load_data()
return super().__reduce_ex__(*args, **kwargs)


class SlotState(Enum):
"""The current state of a NicknameSlot.
19 changes: 19 additions & 0 deletions tests/deep-random-nesting.yml
@@ -0,0 +1,19 @@
### This recipe creates reference Account record for PMM data
# Look at examples/salesforce/Account.recipe.yml for more examples.

# Run this like this:

# cci task run generate_and_load_from_yaml --generator_yaml snowfakery_samples/PMM/pmm_0_Account.recipe.yml --num_records 300 --num_records_tablename Account --org qa
# snowfakery snowfakery_samples/PMM/pmm_0_Account.recipe.yml --output-format json --output-file src/foo.json

# Set Macro for Household and Organization Record Type

- object: Account
count: 3
just_once: True

- object: Account
just_once: True
fields:
parent:
random_reference: Account
1 change: 1 addition & 0 deletions tests/test_data_generator.py
@@ -64,6 +64,7 @@ def test_stopping_criteria_with_startids(self, write_row):
nicknames_and_tables: {}
today: 2022-11-03
persistent_nicknames: {}
persistent_random_referenceable_objects: []
"""
generate(
StringIO(yaml),
3 changes: 2 additions & 1 deletion tests/test_embedding.py
@@ -121,7 +121,8 @@ def test_parent_application__streams_instead_of_files(self, generated_rows):
Foo: Foo
persistent_nicknames: {}
persistent_objects_by_table: {}
today: 2021-04-07"""
today: 2021-04-07
persistent_random_referenceable_objects: []"""
)
generate_continuation_file = StringIO()
decls = """[{"sf_object": Opportunity, "api": bulk}]"""