05 Preparing Your Own Data
You can create your own data in an ingestion package that can be shared for others to load into RiB. This page will walk you through the following:
- what to include in an ingestion package
- how dateTime is formatted
- how to populate dataInsertedBy to capture how an item was collected for insertion into RACK
The first step in creating an ingestion package is to understand what one is. An ingestion package is simply a zip file that includes RACK data that can be loaded into a RACK-in-a-Box instance. A good, well-formed ingestion package should include the data model (.owl files), instance data (.csv files), and a manifest file (manifest.yaml). When properly created, everything required to load the ingestion package into RACK is described in the manifest file.
Example Zip File Contents:
Example.zip
├─ manifest.yaml
├─ OwlModels
│  ├─ NewOnt.owl
│  └─ import.yaml
├─ nodegroups
│  ├─ QueryNodegroup.json
│  └─ store_data.csv
└─ InstanceData
   ├─ REQUIREMENT1.csv
   ├─ REQUIREMENT2.csv
   └─ import.yaml
An ingestion package will typically consist of the following:
- Manifest File
- Data Model
- Nodegroups (Optional)
- Instance Data
The manifest for an ingestion package is a simple file that identifies the relevant data model files, instance data, and (optionally) nodegroup files. It also specifies the model and data graphs. It can also reference another manifest file. All file paths are resolved relative to the location of the manifest.
Example.zip/manifest.yaml contents:
name: 'Turnstile ingestion'
footprint:
  model-graphs:
  - http://rack001/model
  data-graphs:
  - http://rack001/do-178c
steps:
- manifest: rack.yaml
- model: ../GE-Ontology/OwlModels/import.yaml
- data: ../RACK-Ontology/ontology/DO-178C/import.yaml

The rack.yaml manifest referenced by the first step looks like:

name: 'RACK ontology'
description: 'Base ontology for assurance case curation'
footprint:
  model-graphs:
  - http://rack001/model
steps:
- model: ../RACK-Ontology/OwlModels/import.yaml
- nodegroups: ../nodegroups/queries
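With the manifest in place, loading the entire package is a single step. Assuming a local RACK-in-a-Box instance and the RACK CLI, the command would be along the lines of:

rack manifest import manifest.yaml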
Data model files are provided in OWL format. Generating the OWL files is outside the scope of this article; typically they are created through the use of another tool such as RITE or SADL. The import.yaml file lists the data model files to be loaded into RACK.
Example.zip/OwlModels/import.yaml contents:
# This file is intended to be used using the rack.py script found
# in RACK-Ontology/scripts/
#
# Script documentation is available in RACK-Ontology/scripts/README.md
files:
- NewOnt.owl
This simply says that the single OWL file, NewOnt.owl, is to be loaded into RACK.
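If loading the pieces individually rather than through the manifest, the data model can be loaded with the RACK CLI; assuming a local RACK-in-a-Box instance, the command would look something like:

rack model import OwlModels/import.yaml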
Nodegroups are provided in JSON format; generating them is outside the scope of this article, but typically the JSON files are created using SemTK. The store_data.csv file defines which nodegroups are to be loaded into RACK, as well as the description data to be included in SemTK's nodegroup store.
Example.zip/nodegroups/store_data.csv contents:
ID, comments, creator, jsonFile, itemType
Query for requirements, Nodegroup to query for requirements from the Ingestion Package, JackBlack, QueryNodegroup.json, PrefabNodeGroup
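Again assuming the RACK CLI, a directory of nodegroups described by a store_data.csv can be loaded with something like:

rack nodegroups import nodegroups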
Instance data is the final part of an ingestion package and is likely to contain the bulk of the data being loaded into RACK. Normally instance data will be in the form of CSV files. The examples in this article use files generated by the Scraping Tool Kit. However, any source of CSV files can be used; just be aware that the load order may be important, and care should be taken with the import.yaml file to load the data in the correct order.
Example.zip/InstanceData/import.yaml contents:
data-graph: "http://rack001/data"
ingestion-steps:
# Phase 1: instance type declarations only
- {class: "http://arcos.rack/REQUIREMENTS#REQUIREMENT", csv: "REQUIREMENT1.csv"}
# Phase 2: only properties and relationships
- {class: "http://arcos.rack/REQUIREMENTS#REQUIREMENT", csv: "REQUIREMENT2.csv"}
More details on this format are available in the RACK CLI documentation, but the short explanation of this file is as follows:
data-graph: "http://rack001/data"
-> Load this data into the specified graph
ingestion-steps:
-> Load the data with the following steps, in order:
- {class: "http://arcos.rack/REQUIREMENTS#REQUIREMENT", csv: "REQUIREMENT1.csv"}
-> Load REQUIREMENT1.csv into the REQUIREMENT class
- {class: "http://arcos.rack/REQUIREMENTS#REQUIREMENT", csv: "REQUIREMENT2.csv"}
-> Load REQUIREMENT2.csv into the REQUIREMENT class
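As with the model and nodegroups, the instance data can also be loaded on its own with the RACK CLI; assuming a local RACK-in-a-Box instance:

rack data import InstanceData/import.yaml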
A couple of items worth noting:
- CSV paths support relative paths from the yaml file location, although best practice would be to create a separate ingestion step if another file location is needed.
- This ingestion uses a two-phase approach: first all the instance type declarations are loaded; then all the properties and relationships are loaded. This is done to minimize order dependencies. If this two-phase approach is not used, a situation can occur where ENTITY objects shadow the true intended objects. As an example, if you were adding two REQUIREMENTs, Sw-R-1 and Sys-R-1, and Sw-R-1 "satisfies" Sys-R-1, then a CSV that describes this could look like:
REQUIREMENT2.csv:
identifier, satisfies_identifier
Sw-R-1, Sys-R-1
Sys-R-1,
If this is loaded into RACK, one of three outcomes occurs depending on how the nodegroup is constructed:
- An error of a failed lookup, when the lookup of satisfies_identifier is an "error if missing".
- The expected two REQUIREMENTs being created, when satisfies_identifier is a "create if missing" and the satisfies node is typed as a REQUIREMENT.
- Three items being created in RACK: two REQUIREMENTs (Sw-R-1 and Sys-R-1) as well as an ENTITY (Sys-R-1), when satisfies_identifier is a "create if missing" and the satisfies node is typed as an ENTITY.
This is because when the lookup of Sys-R-1 is performed during the ingestion of Sw-R-1, it does not exist yet, since it is created by the next row. By doing a two-phase ingestion, all the items are first created with a CSV that contains only the identifiers:
REQUIREMENT1.csv:
identifier
Sw-R-1
Sys-R-1
Then REQUIREMENT2.csv will be ingested correctly regardless of how the nodegroup is constructed, as the lookup of Sys-R-1 while ingesting Sw-R-1 will find the intended REQUIREMENT, since it was created as part of ingesting REQUIREMENT1.csv.
Here is how to provide dateTime information in a CSV data file. Use a value like "Thu Mar 23 03:03:16 PDT 2017" in the CSV file (this is the value used for the generatedAtTime_SoftwareUnitTestResult column in the SoftwareUnitTestResult.csv data file in the Turnstile-Ontology). When this local dateTime value is ingested by SemTK, it is converted to ISO 8601 form with a timezone offset, shown as "2017-03-23T03:03:16-07:00" when an appropriate query is run.
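For reference, a minimal Python illustration (not part of RACK or SemTK, and assuming the fixed PDT offset of UTC-7) of how the CSV value relates to the value reported back by queries:

# Illustrates the relationship between the local dateTime string placed
# in the CSV and the ISO 8601 value SemTK reports after ingestion.
from datetime import datetime, timedelta, timezone

pdt = timezone(timedelta(hours=-7))  # PDT is UTC-7
ingested = datetime(2017, 3, 23, 3, 3, 16, tzinfo=pdt)

print(ingested.strftime("%a %b %d %H:%M:%S PDT %Y"))  # Thu Mar 23 03:03:16 PDT 2017
print(ingested.isoformat())                           # 2017-03-23T03:03:16-07:00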
Additional resources: https://github.com/ge-semtk/semtk/wiki/Ingestion-Type-Handling
dataInsertedBy is a property that is on all ENTITYs, ACTIVITYs, and AGENTs (THINGs within the ontology). It is intended to capture how the item was collected for insertion into RACK. This differs from the more standard relationships to ACTIVITYs in that it relates not to the creation of the data, but rather to the collection of the data.
As an example, when extracting ENTITYs from a PDF, the extraction process should be captured by the dataInsertedBy activity, while the creation of the originating ENTITYs would be captured by wasGeneratedBy or some sub-property of wasGeneratedBy.
The best practice for RACK is that all items should have a dataInsertedBy relationship to an ACTIVITY. This ACTIVITY should have at a minimum:
- identifier
- endedAtTime
- wasAssociatedWith
- description
The AGENT that is the target of wasAssociatedWith should have at a minimum:
- identifier
- description
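For illustration only, a hypothetical pair of CSV files capturing such an ACTIVITY and its AGENT could look like the following (the exact column names depend on the ingestion nodegroups used):

ACTIVITY.csv:
identifier, endedAtTime, wasAssociatedWith_identifier, description
DataIngestion1, Thu Mar 23 03:03:16 PDT 2017, JackBlack, Extraction of requirements from the Requirements Specification PDF

AGENT.csv:
identifier, description
JackBlack, Engineer who ran the extraction tooling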
A single THING may have multiple dataInsertedBy relationships if the item was identified in multiple sources. For example, a REQUIREMENT may be identified as part of an extraction from a Requirements Specification, and it may also be found as trace data when extracting TESTs from a test procedure. In this case the requirement should have a dataInsertedBy relationship to both ACTIVITYs (the extraction from the Requirements Specification and the extraction from the Test Procedure).
Given the usefulness of this dataInsertedBy information, a predefined nodegroup, query Get DataInsertedBy From Guid, is preloaded into RACK that allows one to find this information for a given guid. To run the query in SemTK, simply load the nodegroup and select run; a dialog will be presented that allows you to select the guid for which you wish to get the dataInsertedBy information. This nodegroup can also be run programmatically, just as any other runtime-constrained nodegroup can be.
An important thing to note, especially when running this query programmatically, is that, as described above, a single THING can and often will have multiple dataInsertedBy ACTIVITYs returned, so any handling of the resultant data should account for multiple rows of data.
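As a sketch of programmatic use, the following assumes the semtk3 Python package (from ge-semtk/semtk-python3) and a local RACK-in-a-Box instance; the constraint variable name and GUID below are assumptions to be adapted to your data:

import semtk3

# Assumption: default RACK-in-a-Box host; adjust for your deployment.
semtk3.set_host("http://localhost")

# Constrain the runtime-constrainable guid variable of the stored
# nodegroup ("guid" is an assumed variable name) to the GUID of interest.
constraint = semtk3.build_constraint(
    "guid", semtk3.OP_MATCHES, ["00000000-0000-0000-0000-000000000000"])

table = semtk3.select_by_id(
    "query Get DataInsertedBy From Guid",
    runtime_constraints=[constraint])

# A THING often has several dataInsertedBy ACTIVITYs, so handle
# every returned row, not just the first.
for row in table.get_rows():
    print(row)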
Copyright (c) 2021-2024, General Electric Company, Galois, Inc.
All Rights Reserved
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. FA8750-20-C-0203.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA).
Distribution Statement "A" (Approved for Public Release, Distribution Unlimited)