05 Preparing Your Own Data
You can create your own data in an ingestion package that can be shared for others to load into RiB. This page will walk you through the following:
- what to include in an ingestion package
- how dateTime is formatted
- how to populate dataInsertedBy to capture how an item was collected for insertion into RACK
The first step in creating an ingestion package is to understand what ingestion packages are. An ingestion package is simply a zip file that includes RACK data that can be loaded into a RACK-in-a-box instance. A good, well-formed ingestion package should have no dependencies outside the base RACK Ontology and should include a RACK CLI ingestion script. When properly created, all that should be required to load an ingestion package into RACK is to unzip the file and run the load script.
Example Zip File Contents:
Example.zip
│
├─Load-IngestionExample.sh
│
├─OwlModels
│  ├─NewOnt.owl
│  └─import.yaml
│
├─nodegroups
│  ├─IngestNewOnt.json
│  └─store_data.csv
│
└─InstanceData
   ├─NewOnt1.csv
   ├─REQUIREMENT1.csv
   ├─NewOnt2.csv
   ├─REQUIREMENT2.csv
   └─import.yaml
An ingestion package will typically consist of the following:
- An Ingestion Script
- Ontology Updates (Optional)
- Additional Nodegroups (Optional)
- Instance Data
The ingestion script for an ingestion package is a simple sh script that is used to invoke the RACK CLI to ingest any Ontology, Nodegroups, or Instance Data into RACK. Typically it is best to first ingest any Ontology updates, followed by any Nodegroups, and lastly the Instance Data.
Example.zip/Load-IngestionExample.sh contents:
# determine the folder this script is located in and store it as BASEDIR (exact idiom may vary)
BASEDIR=$(cd "$(dirname "$0")"; pwd)
./ensure-cli-in-PATH.sh
rack model import $BASEDIR/OwlModels/import.yaml
rack nodegroups import $BASEDIR/nodegroups
rack data import --clear $BASEDIR/InstanceData/import.yaml
Those familiar with shell scripting will likely need no explanation, but for others a short description is as follows:
- Find the folder in which this script is located and store it in the variable BASEDIR
- Check that the RACK CLI is available
- Use the RACK CLI to load the ontology specified by an import.yaml file located in the OwlModels subdirectory of BASEDIR
- Use the RACK CLI to load the nodegroups specified by a store_data.csv file located in the nodegroups subdirectory of BASEDIR
- Use the RACK CLI to clear and then load the instance data specified by an import.yaml file located in the InstanceData subdirectory of BASEDIR
- Note: at present, the recommended order of operation is to first clear existing data in RACK, then ingest an entire set of instance data. In the future, we will support the ability to update and "fit" new data into RACK.
This order of loading into RACK is typically important: the loading of instance data may depend on additional Nodegroups, and those nodegroups may depend on the ontology updates. While this may not always be the case, it is best to use this ingestion order to mitigate any risk of unknown dependencies.
Additionally, multiple lines for each RACK CLI step can be included. If multiple lines are used, care should be taken with the ingestion of Instance Data to ensure a correct load order across the lines. Typically data should be ingested from a higher level to a lower level (i.e. System Requirements are ingested before Software Requirements, and Software Requirements are ingested before Source or Testing Data), as sketched below.
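For example, a minimal sketch of splitting the instance data step into multiple CLI calls in that order (the yaml file names here are hypothetical, one import.yaml prepared per level of data; only the first call clears the data graph so the later calls add to it):
rack data import --clear $BASEDIR/InstanceData/import_system_requirements.yaml
rack data import $BASEDIR/InstanceData/import_software_requirements.yaml
rack data import $BASEDIR/InstanceData/import_source_and_tests.yaml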
Ingestion packages may use the ingestion nodegroups and CDRs that are included in each RACK release, or custom ingestion nodegroups: the former for simplicity, the latter for efficiency. To learn more about custom ingestion nodegroups and templates, see the SemTK Wiki and its pages on ingestion.
The included ingestion nodegroups can be found under nodegroups/ingestion and are arranged such that:
- they are grouped in sub-folders by model prefix (roughly matching overlays)
- there is one ingestion nodegroup per class
- there is a sample CSV file found under nodegroups/CDR
A sample is the nodegroup for ingesting AGENTs, called "ingest_AGENT.json".
It is very important to note that in these auto-generated ingestion nodegroups:
- the main class (AGENT) is createIfMissing: it will be looked up and, if found, new data is linked to it; otherwise it will be created.
- the other linked classes are errorIfMissing: instances with the given identifiers must exist or the ingestion will fail.
The corresponding empty csv is "ingest_AGENT.csv". Such files are empty except for the headers; the example below adds a row of annotations describing each column:
| description | identifier | title | definedIn_identifier | actedOnBehalfOf_identifier | dataInsertedBy_identifier |
| --- | --- | --- | --- | --- | --- |
| optional | required | optional | must exist if non-empty | must exist if non-empty | must exist if non-empty |
The columns behave such that:
- identifier should not be empty, as it is the identifier of the instance to be ingested. SemTK uses createIfMissing mode.
- all other columns are optional
- the data properties (in this case: "description" and "title") must be of the correct type; see SemTK Ingestion Type Handling
- columns ending in "_identifier" are links to other instances that have already been ingested. SemTK uses errorIfMissing mode.
The default ingestion nodegroups do a URI lookup with createIfMissing on the identifier of the main class, which means the instance is first looked up, then created if not found. This means the same identifier can appear in multiple rows, and also spread across multiple ingestions. Any columns that are empty are ignored. Any column that is not empty will be ingested against the looked-up or created instance. Multiple values for one property can be ingested by repeating the identifier in two rows and putting each property value in one of the rows. Cardinality is not checked at ingestion-time (but can be checked with the Report tool).
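As an illustration of this behavior (the identifiers and values below are made up, and it is assumed that AGENTs Team-Lead-1 and Team-Lead-2 were ingested earlier), the same identifier can be repeated across rows of ingest_AGENT.csv to attach two actedOnBehalfOf links to one AGENT, with the empty columns in each row simply ignored:
description,identifier,title,definedIn_identifier,actedOnBehalfOf_identifier,dataInsertedBy_identifier
A hypothetical build agent,Agent-1,Build Server,,,
,Agent-1,,,Team-Lead-1,
,Agent-1,,,Team-Lead-2,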
Since errorIfMissing requires that an instance exists before the ingestion step begins, the errorIfMissing on connections to other instances creates a requirement that ingestion files be run in the correct order. If this is burdensome or impractical for your data, custom ingestion templates can be used instead.
Notice in the example above that actedOnBehalfOf connects the AGENT to another AGENT. Since the predicate is errorIfMissing it must exist before this ingestion step begins.
For simplicity and consistency, RACK ingestion scripts usually run each ingestion nodegroup twice: first with a csv containing only identifiers, then with the full csv. Using this ingestion pattern, the instances are ingested with only their identifiers in the first pass. In the second pass they are looked up and always exist, so the data properties can be added and any links to an instance of the same type will already exist.
When there are not any object property links to instances of the same class, this two-step ingestion process is not required.
Ontology updates are provided in OWL format. Generation of the OWL files is outside the scope of this article, but typically they are created through the use of another tool such as SADL. One or more OWL models can be housed in the directory, and the import.yaml file defines what is to be loaded into RACK at ingestion time.
Example.zip/OwlModels/import.yaml contents:
# This file is intended to be used using the rack.py script found
# in RACK-Ontology/scripts/
#
# Script documentation is available in RACK-Ontology/scripts/README.md
files:
- NewOnt.owl
This simply says that the single owl file is being loaded into RACK.
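If the package contained more than one OWL file, the same import.yaml format simply lists each file (the second file name below is hypothetical):
files:
- NewOnt.owl
- NewOnt-extensions.owl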
Additional Nodegroups are provided in JSON format; generation of these files is outside the scope of this article, but typically they are created using SemTK. One or more nodegroups can be present in the directory, and the store_data.csv defines which nodegroups are to be loaded into RACK, as well as the description data to be included in SemTK's Nodegroup store.
Example.zip/nodegroups/store_data.csv contents:
ID, comments, creator, jsonFile
IngestNewOnt, Nodegroup for ingesting NewOnt class from the Ingestion Package Example, Example, IngestNewOnt.json
Instance Data is the final part of an ingestion package and is likely to contain the bulk of the data being loaded into RACK. Normally Instance Data is in the form of CSV files. The examples in this article use files generated by the Scraping Tool Kit; however, any source of CSV can be used. It just must be understood that the load order may be important, and care should be taken with the import.yaml file to load the data in the correct order.
Example.zip/InstanceData/import.yaml contents:
data-graph: "http://rack001/data"
ingestion-steps:
#Phase1: Instance type declarations only
- {nodegroup: "IngestNewOnt", csv: "NewOnt1.csv"}
- {nodegroup: "ingest_REQUIREMENT", csv: "REQUIREMENT1.csv"}
#Phase2: Only properties and relationships
- {nodegroup: "IngestNewOnt", csv: "NewOnt2.csv"}
- {nodegroup: "ingest_REQUIREMENT", csv: "REQUIREMENT2.csv"}
More details on this format are available with the documentation of the RACK CLI, but the short explanation of this file is as follows:
data-graph: "http://rack001/data"
-> Load this data into the specified graph
ingestion-steps:
-> Load the data in the following:
- {nodegroup: "IngestNewOnt", csv: "NewOnt1.csv"}
-> Load NewOnt1.csv using the Nodegroup with an ID of IngestNewOnt
- {nodegroup: "ingest_REQUIREMENT", csv: "REQUIREMENT1.csv"}
-> Load REQUIREMENT1.csv using the Nodegroup with an ID of ingest_REQUIREMENT
- {nodegroup: "IngestNewOnt", csv: "NewOnt2.csv"}
-> Load NewOnt2.csv using the Nodegroup with an ID of IngestNewOnt
- {nodegroup: "ingest_REQUIREMENT", csv: "REQUIREMENT2.csv"}
-> Load REQUIREMENT2.csv using the Nodegroup with an ID of ingest_REQUIREMENT
A couple of items worth noting:
CSV paths may be relative to the location of the yaml file, although best practice would be to create a separate ingestion step in the ingestion script if another file location is needed.
NewOnt and IngestNewOnt use the Ontology and Nodegroups that are added by this ingestion package. If those steps are not performed prior to loading the instance data, an error will occur.
This ingestion uses a two-phase approach: first all the instance type declarations are loaded, then all the properties and relationships are loaded. This two-phase ingestion is done to minimize order dependencies. If this approach is not used, a situation can occur where ENTITY objects shadow the truly intended objects. As an example, if you were adding two REQUIREMENTs, Sw-R-1 and Sys-R-1, where Sw-R-1 "satisfies" Sys-R-1, then a CSV that describes this could look like:
REQUIREMENT2.csv:
identifier, satisfies_identifier
Sw-R-1, Sys-R-1
Sys-R-1,
If this is loaded into RACK, one of three outcomes occurs depending on how the nodegroup is constructed:
- An error from a failed lookup, when the lookup of satisfies_identifier is "error if missing".
- The expected two REQUIREMENTs being created, when satisfies_identifier is "create if missing" and the satisfies node is typed as a REQUIREMENT.
- Three items being created in RACK: two REQUIREMENTs (Sw-R-1 and Sys-R-1) as well as an ENTITY (Sys-R-1), when satisfies_identifier is "create if missing" and the satisfies node is typed as an ENTITY.
This is because when the lookup for Sys-R-1 is performed during the ingestion of Sw-R-1, it does not exist yet, since it is created by the next row. By doing a two-phase ingestion, all the items are first created with a CSV that contains only the identifier:
REQUIREMENT1.csv:
identifier
Sw-R-1
Sys-R-1
Then REQUIREMENT2.csv will be ingested correctly regardless of how the nodegroup is constructed, because the lookup of Sys-R-1 while ingesting Sw-R-1 will find the intended REQUIREMENT, which was created as part of ingesting REQUIREMENT1.csv.
Since ingestion packages are ultimately data plus a shell script containing instructions for loading that data into RACK, shell scripting capabilities can be used to make the data more dynamic and adaptable to the host environment. These examples do not represent the limits of what can be accomplished within ingestion packages; rather, they serve as illustrations.
The first example uses shell script to update a file with a local file reference. Within the CSV files in an ingestion package, a variable expansion tag, {{URLBASE}}, can be defined. The ingestion shell script then replaces this tag with a URL base determined at ingestion time, prior to loading the data into RACK. This allows the FILE entityUrl to be defined with a URL that points to a file included within the ingestion package, so that if someone follows the URL it will take them to the specific file that was included in the ingestion package.
Notes: This assumes that the ingestion package is unzipped to a local hard drive and not removed following the ingestion of the data. Furthermore, while this example makes accommodations for execution on Linux or Windows systems, the approach demonstrated below is limited to the situation where the machine hosting the ingestion package is the same as the one following the URL. Modifications could be made to adapt it to a shared drive or file server, but that would be situation dependent. This example also assumes that you will not move the files after ingestion and then try to re-ingest from the new location, as the script modifies the CSV files in place (i.e. no expansion variables remain). To re-ingest with the files in a new location, the user must replace the modified CSVs with the original versions from the zipped ingestion package (i.e. with the expansion variables). Modifications could be made to address this, but that is beyond the scope of this example.
Example CSV File:
identifier, entityUrl,
fileName,{{URLBASE}}folderFromIngestionPackage/fileName.txt,
This example CSV file uses a tag within the data to identify the place where the expansion variable should be populated. The script below is a section of shell script that needs to be added to the standard ingestion script. It should be included before the ingestion of the CSV files, but after the definition of the $BASEDIR variable.
Example Shell Script:
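# Build a file:// URL base from $BASEDIR; on cygwin/msys, cygpath -m converts it to a Windows-style path with forward slashes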
if test "$OSTYPE" == "cygwin" -o "$OSTYPE" == "msys"; then
URLBASE="file://$(cygpath -m "$BASEDIR")"
else
URLBASE="file://$BASEDIR"
fi
echo "Updating CSV files with URL Base ..."
find "$BASEDIR" -name "*.csv" -exec sed -i -e "s|{{URLBASE}}|$URLBASE|g" {} +
The behavior that is being added is:
- Determine the URL base based on the shell $OSTYPE:
  - if the script is being run in a cygwin or msys shell, the path should use Windows formatting ("c:/"). This is provided by the cygpath utility; the -m option causes the path to use forward slashes, resulting in a valid URL.
  - otherwise just use the file path of $BASEDIR, the location of the ingestion script, which will be a unix-like file path ("/home/username").
- Find all the CSV files under $BASEDIR using the find utility, and for each file use sed to replace {{URLBASE}} with the $URLBASE determined in the first step.
Note: the BASH shell (#!/bin/bash) should be used, as $OSTYPE is not available in the original Bourne shell that is commonly associated with simple shell scripts (#!/bin/sh).
After the execution of these shell commands the CSV files will no longer contain the expansion variable {{URLBASE}}, and the resulting update will be OS dependent:
Example Resulting CSV File (in a windows environment):
identifier, entityUrl,
fileName,file://C:/UnzippedLocation/folderFromIngestionPackage/fileName.txt,
Example Resulting CSV File (in a unix-like environment):
identifier, entityUrl,
fileName,file:///home/username/UnzippedLocation/folderFromIngestionPackage/fileName.txt,
Now that the CSV files have been updated and the expansion variable replaced, regular data ingestion can continue as described above.
Here is how to provide dateTime information in a csv data file. Use a value like "Thu Mar 23 03:03:16 PDT 2017" in the csv file (this is the value used for the column generatedAtTime_SoftwareUnitTestResult in the data file SoftwareUnitTestResult.csv in the Turnstile-Ontology). When this dateTime value, which is in local dateTime format, is ingested in SemTK, it is converted to ISO 8601 format with a UTC offset, shown as "2017-03-23T03:03:16-07:00" when an appropriate query is run.
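For illustration, a minimal csv sketch using this local dateTime format (the identifier value is made up; the column name is the one from the Turnstile example above):
identifier,generatedAtTime_SoftwareUnitTestResult
Unit-Test-Result-1,Thu Mar 23 03:03:16 PDT 2017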
Additional resources: https://github.com/ge-semtk/semtk/wiki/Ingestion-Type-Handling
dataInsertedBy is a property that is on all ENTITYs, ACTIVITYs, and AGENTs (THINGs within the ontology). It is intended to capture how the item was collected for insertion into RACK. This differs from the more standard relationships to ACTIVITYs, as it relates not to the creation of the data but to the collecting of the data.
As an example, when extracting ENTITYs from a PDF, the extraction process should be captured by the dataInsertedBy activity, while the creation of the originating ENTITYs would be captured by wasGeneratedBy or some sub-property of wasGeneratedBy.
The best practice for RACK is that all items should have a dataInsertedBy relationship to an ACTIVITY. This ACTIVITY should have at a minimum:
- identifier
- endedAtTime
- wasAssociatedWith
- description
The AGENT that is the target of wasAssociatedWith should have at a minimum:
- identifier
- description
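As a minimal sketch of what such rows might look like, assuming auto-generated ingest_ACTIVITY and ingest_AGENT nodegroups following the column conventions described earlier (all identifiers and values here are hypothetical):
ingest_ACTIVITY.csv:
identifier,description,endedAtTime,wasAssociatedWith_identifier
ReqSpec-Scrape-2021-01-15,Extraction of requirements from the SRS using the Scraping Tool Kit,Fri Jan 15 16:30:00 EST 2021,ScrapingToolOperator
ingest_AGENT.csv:
identifier,description
ScrapingToolOperator,Engineer who ran the scraping tool
Given the errorIfMissing behavior on linked classes, the AGENT row would need to be ingested before the ACTIVITY row that references it (or the two-pass pattern described earlier used).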
A single THING may have multiple dataInsertedBy relationships if the item was identified in multiple sources. For example, a REQUIREMENT may be identified as part of extracting from a Requirements Specification, and it may also be found as trace data in the extraction of TESTs from a test procedure. In this case the requirement should have a dataInsertedBy relationship to both ACTIVITYs (extraction from the Requirements Specification and extraction from the Test Procedure).
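As a sketch (identifiers are hypothetical), the same REQUIREMENT identifier can simply appear in the CSVs produced by both extraction activities, each row carrying its own dataInsertedBy link; because the default ingestion nodegroups look up the main identifier with createIfMissing, the second ingestion attaches the additional relationship to the existing instance rather than creating a duplicate:
Rows produced by the Requirements Specification extraction:
identifier,dataInsertedBy_identifier
HLR-1,ReqSpec-Extraction
Rows produced by the Test Procedure extraction:
identifier,dataInsertedBy_identifier
HLR-1,TestProcedure-Extraction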
Given the usefulness of this dataInsertedBy information, a predefined nodegroup, "query Get DataInsertedBy From Guid", is preloaded into RACK and allows one to find this information for a given guid. To run the query in SemTK, simply load the nodegroup and select run; a dialog will be presented that allows you to select the guid for which you wish to get the dataInsertedBy information. This nodegroup can also be run programmatically, just as any other runtime-constrained nodegroup can be.
An important thing to note, especially while running this query programmatically, is that, as described above, a single THING can and often will have multiple dataInsertedBy ACTIVITYs returned, so any handling of the resultant data should account for multiple rows of data.
Copyright (c) 2021-2024, General Electric Company, Galois, Inc.
All Rights Reserved
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. FA8750-20-C-0203.
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA).
Distribution Statement "A" (Approved for Public Release, Distribution Unlimited)