Showing 4 changed files with 271 additions and 17 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,135 @@ | ||
.\" Generated by scdoc 1.11.2 | ||
.\" Complete documentation for this program is not available as a GNU info page | ||
.ie \n(.g .ds Aq \(aq | ||
.el .ds Aq ' | ||
.nh | ||
.ad l | ||
.\" Begin generated content: | ||
.TH "infer-schema" "1" "2023-11-19" | ||
.P | ||
.SH NAME | ||
.P | ||
infer-schema - generate JSON Schema from CSV files | ||
.P | ||
.SH SYNOPSIS | ||
.P | ||
\fBinfer-schema\fR file [options] | ||
.P | ||
.SH DESCRIPTION | ||
.P | ||
The infer-schema utility generates JSON Schema (draft 7) corresponding to a CSV | ||
file, such that the CSV file is valid against the generated schema.\& The JSON | ||
Schema is shown on standard output and can be piped or written to a file with | ||
the \fB--output\fR option.\& | ||
.P | ||
The following options are available: | ||
.P | ||
\fB-h\fR, \fB--help\fR | ||
.RS 4 | ||
Show usage information | ||
.P | ||
.RE | ||
\fB--enum-threshold\fR \fIthreshold\fR | ||
.RS 4 | ||
The \fIthreshold\fR of unique items up to which enum categories should be | ||
populated in the JSON schema.\& Default threshold is 10.\& | ||
.P | ||
.RE | ||
\fB--enum-fields\fR \fIfields\fR | ||
.RS 4 | ||
Forces \fIfields\fR (comma-separated) to be classed as an enum, useful for | ||
including fields that do not meet enum \fIthreshold\fR criteria | ||
.P | ||
.RE | ||
\fB--bound-types\fR \fItypes\fR | ||
.RS 4 | ||
Comma-separated \fItypes\fR for which bounds should be encoded into the schema, | ||
default is '\&number,integer'\&, for which minimum / maximum are determined.\& For | ||
strings minLength and maxLength are determined.\& Set \fB--bound-types\fR=none to | ||
disable bound detection.\& Allowed bound types are \fBinteger\fR, \fBnumber\fR and | ||
\fBstring\fR | ||
.P | ||
.RE | ||
\fB--explicit-nulls\fR | ||
.RS 4 | ||
By default, fields that have null and another type are typed as non-required | ||
with the non-null type.\& This setting makes the nulls explicit by typing such | ||
a field as both the non-null type and null.\& | ||
.P | ||
As an example, consider a field '\&count'\& that has the following values | ||
20,NA,30.\& By default, this field will be typed as '\&integer'\& and will not be | ||
required.\& With \fB--explicit-nulls\fR set, this will be typed as [integer, null] | ||
.P | ||
.RE | ||
\fB-o\fR \fIoutput\fR, \fB--output\fR \fIoutput\fR | ||
.RS 4 | ||
Save schema to \fIoutput\fR file | ||
.P | ||
.RE | ||
.SH EXAMPLES | ||
.P | ||
Given this CSV file called \fIdates.\&csv\fR | ||
.P | ||
.nf | ||
.RS 4 | ||
date,num_cases | ||
2022-11-11,4 | ||
2022-11-12,5 | ||
2022-11-13,6 | ||
,10 | ||
2022-11-15,10 | ||
2022-11-16,5 | ||
2022-11-17,3 | ||
2022-11-18,2 | ||
2022-11-19,10 | ||
2022-11-20,11 | ||
2022-11-21,4 | ||
2022-11-22,20 | ||
2022-11-23, | ||
2022-11-24,9 | ||
2022-11-25,4 | ||
2022-11-26,21 | ||
2022-11-27,99 | ||
2022-11-28,59 | ||
2022-11-30,45 | ||
.fi | ||
.RE | ||
.P | ||
Running '\&infer-schema dates.\&csv'\& gives the following output | ||
.P | ||
.nf | ||
.RS 4 | ||
{ | ||
"$schema": "https://json-schema\&.org/draft-07/schema", | ||
"description": "Description of tests/dates\&.csv", | ||
"properties": { | ||
"date": { | ||
"description": "Description for column date", | ||
"format": "date", | ||
"type": "string" | ||
}, | ||
"num_cases": { | ||
"description": "Description for column num_cases", | ||
"maximum": 99, | ||
"minimum": 2, | ||
"type": "integer" | ||
} | ||
}, | ||
"required": [], | ||
"title": "JSON Schema for tests/dates\&.csv" | ||
} | ||
.fi | ||
.RE | ||
.P | ||
Here we see that infer-schema determines minimum and maximum values for integer | ||
columns.\& For strings, minLength and maxLength are determined when \fBstring\fR is | ||
included in \fB--bound-types\fR.\& The setting can be set to \fBnone\fR to turn off | ||
bounds detection.\& | ||
.P | ||
By default, any column with up to 10 (default \fB--enum-threshold\fR) unique values | ||
is considered categorical and expressed as a JSON Schema enum type.\& Columns with | ||
more than 10 unique values can be forced to be of enum type by using \fB--enum-fields\fR.\& | ||
.P | ||
.SH BUGS | ||
.P | ||
Report bugs at \fIhttps://github.\&com/abhidg/infer-schema/issues\fR |
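
The bounds and null handling described in the man page can be sketched in Python, the language of the script in this commit. This is a minimal illustration under stated assumptions, not the tool's actual implementation: the function name `infer_integer_column`, the `NULL_VALUES` set, and the choice to treat empty cells and `NA` as null are all hypothetical.

```python
from typing import Any, Dict, List

# Assumption: empty cells and the literal "NA" count as null values.
NULL_VALUES = {"", "NA"}


def infer_integer_column(values: List[str], explicit_nulls: bool = False) -> Dict[str, Any]:
    """Build a JSON Schema fragment for a column of integer-like strings."""
    non_null = [int(v) for v in values if v.strip() not in NULL_VALUES]
    has_null = len(non_null) < len(values)

    schema: Dict[str, Any] = {"type": "integer"}
    if explicit_nulls and has_null:
        # Mirror --explicit-nulls: list null alongside the non-null type.
        schema["type"] = ["integer", "null"]
    if non_null:
        # Mirror --bound-types=integer: record the observed range.
        schema["minimum"] = min(non_null)
        schema["maximum"] = max(non_null)
    return schema


print(infer_integer_column(["20", "NA", "30"]))
# {'type': 'integer', 'minimum': 20, 'maximum': 30}
print(infer_integer_column(["20", "NA", "30"], explicit_nulls=True))
# {'type': ['integer', 'null'], 'minimum': 20, 'maximum': 30}
```

In the default mode the null only affects whether the column would be listed under `required`, which matches the empty `required` array in the example output.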
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,112 @@ | ||
infer-schema(1) | ||
|
||
# NAME | ||
|
||
infer-schema - generate JSON Schema from CSV files | ||
|
||
# SYNOPSIS | ||
|
||
*infer-schema* file [options] | ||
|
||
# DESCRIPTION | ||
|
||
The infer-schema utility generates JSON Schema (draft 7) corresponding to a CSV | ||
file, such that the CSV file is valid against the generated schema. The JSON | ||
Schema is shown on standard output and can be piped or written to a file with | ||
the *--output* option. | ||
|
||
The following options are available: | ||
|
||
*-h*, *--help* | ||
Show usage information | ||
|
||
*--enum-threshold* _threshold_ | ||
The _threshold_ of unique items up to which enum categories should be | ||
populated in the JSON schema. Default threshold is 10. | ||
|
||
*--enum-fields* _fields_ | ||
Forces _fields_ (comma-separated) to be classed as an enum, useful for | ||
including fields that do not meet enum _threshold_ criteria | ||
|
||
*--bound-types* _types_ | ||
Comma-separated _types_ for which bounds should be encoded into the schema, | ||
default is 'number,integer', for which minimum / maximum are determined. For | ||
strings minLength and maxLength are determined. Set *--bound-types*=none to | ||
disable bound detection. Allowed bound types are *integer*, *number* and | ||
*string* | ||
|
||
*--explicit-nulls* | ||
By default, fields that have null and another type are typed as non-required | ||
with the non-null type. This setting makes the nulls explicit by typing such | ||
a field as both the non-null type and null. | ||
|
||
As an example, consider a field 'count' that has the following values | ||
20,NA,30. By default, this field will be typed as 'integer' and will not be | ||
required. With *--explicit-nulls* set, this will be typed as [integer, null] | ||
|
||
*-o* _output_, *--output* _output_ | ||
Save schema to _output_ file | ||
|
||
# EXAMPLES | ||
|
||
Given this CSV file called _dates.csv_ | ||
|
||
``` | ||
date,num_cases | ||
2022-11-11,4 | ||
2022-11-12,5 | ||
2022-11-13,6 | ||
,10 | ||
2022-11-15,10 | ||
2022-11-16,5 | ||
2022-11-17,3 | ||
2022-11-18,2 | ||
2022-11-19,10 | ||
2022-11-20,11 | ||
2022-11-21,4 | ||
2022-11-22,20 | ||
2022-11-23, | ||
2022-11-24,9 | ||
2022-11-25,4 | ||
2022-11-26,21 | ||
2022-11-27,99 | ||
2022-11-28,59 | ||
2022-11-30,45 | ||
``` | ||
|
||
Running 'infer-schema dates.csv' gives the following output | ||
|
||
``` | ||
{ | ||
"$schema": "https://json-schema.org/draft-07/schema", | ||
"description": "Description of tests/dates.csv", | ||
"properties": { | ||
"date": { | ||
"description": "Description for column date", | ||
"format": "date", | ||
"type": "string" | ||
}, | ||
"num_cases": { | ||
"description": "Description for column num_cases", | ||
"maximum": 99, | ||
"minimum": 2, | ||
"type": "integer" | ||
} | ||
}, | ||
"required": [], | ||
"title": "JSON Schema for tests/dates.csv" | ||
} | ||
``` | ||
|
||
Here we see that infer-schema determines minimum and maximum values for integer | ||
columns. For strings, minLength and maxLength are determined when *string* is | ||
included in *--bound-types*. The setting can be set to *none* to turn off | ||
bounds detection. | ||
|
||
By default, any column with up to 10 (default *--enum-threshold*) unique values | ||
is considered categorical and expressed as a JSON Schema enum type. Columns with | ||
more than 10 unique values can be forced to be of enum type by using *--enum-fields*. | ||
|
||
# BUGS | ||
|
||
Report bugs at https://github.com/abhidg/infer-schema/issues |
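
The enum rule documented above (up to `--enum-threshold` unique values, or forced via `--enum-fields`) could look roughly like the following sketch. The function name `infer_enum`, its parameters, and the decision to skip empty cells are assumptions made for illustration; the script in this commit may decide differently.

```python
from typing import Dict, List, Optional


def infer_enum(values: List[str], threshold: int = 10, force: bool = False) -> Optional[Dict[str, List[str]]]:
    """Return an enum fragment if the column looks categorical, else None."""
    # Assumption: empty cells are nulls and do not count as categories.
    unique = sorted({v for v in values if v != ""})
    if force or len(unique) <= threshold:
        return {"enum": unique}
    return None


# Three rows but only two distinct values: treated as categorical.
print(infer_enum(["yes", "no", "yes"]))
# {'enum': ['no', 'yes']}

# Eleven distinct values exceed the default threshold of 10.
many = [str(i) for i in range(11)]
print(infer_enum(many))
# None

# Forcing the field, as --enum-fields would, overrides the threshold.
print(len(infer_enum(many, force=True)["enum"]))
# 11
```

By this rule neither column of the dates.csv example becomes an enum, since both have more than 10 unique values.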
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,4 @@ | ||
#!/usr/bin/env python3 | ||
import csv | ||
import json | ||
import argparse | ||
|