undatum -- a command-line tool for data processing

undatum (pronounced un-da-tum) is a command-line data processing tool. Its goal is to make CLI interaction with huge datasets as easy as possible. It provides a simple undatum command that lets you convert, split, validate, and calculate frequencies and statistics for data in CSV, JSON lines and BSON files.

Contents

1 Main features
2 Installation
- 2.1 macOS
- 2.2 Linux
- 2.3 Windows, etc.
- 2.4 Python version
3 Usage
- 3.1 Examples
4 Commands
- 4.1 Frequency command
- 4.2 Uniq command
- 4.3 Convert command
- 4.4 Validate command
- 4.5 Headers command
- 4.6 Stats command
- 4.7 Analyze command
- 4.8 Split command
- 4.9 Select command
- 4.10 Flatten command
5 Advanced
- 5.1 Filtering
- 5.2 Data containers
- 5.3 Date detection
6 Data types
- 6.1 JSONl

1 Main features

Common data operations against CSV, JSON lines and BSON files
Built-in data filtering
Support for data compressed with ZIP, XZ, GZ and BZ2
Conversion between CSV, JSONl, BSON, XML, XLS, XLSX, Parquet, AVRO and ORC file types
Low memory footprint
Support for compressed datasets
Advanced statistics calculations
Automatic recognition of date/datetime fields
Data validation
Documentation
Test coverage

2 Installation

2.1 macOS

On macOS, undatum can be installed via Homebrew (recommended):

$ brew install undatum

A MacPorts port is also available:

$ port install undatum

2.2 Linux

Most Linux distributions provide a package that can be installed using the system package manager, for example:

# Debian, Ubuntu, etc.
$ apt install undatum

# Fedora
$ dnf install undatum

# CentOS, RHEL, ...
$ yum install undatum

# Arch Linux
$ pacman -S undatum

2.3 Windows, etc.

A universal installation method (that works on Windows, Mac OS X, Linux, …, and always provides the latest version) is to use pip:

# Make sure we have an up-to-date version of pip and setuptools:
$ pip install --upgrade pip setuptools

$ pip install --upgrade undatum

(If pip installation fails for some reason, you can try easy_install undatum as a fallback.)

2.4 Python version

Python version 3.6 or greater is required.

3 Usage

Synopsis:

$ undatum [flags] [command] inputfile

3.1 Examples

Get headers from a JSONl file using the headers command:

$ undatum headers examples/ausgovdir.jsonl

Analyze a file and generate statistics using the stats command:

$ undatum stats examples/ausgovdir.jsonl

Get the frequency of values for the field GovSystem, using the frequency command, in the list of Russian federal government domains from the govdomains repository:

$ undatum frequency examples/feddomains.csv --fields GovSystem

Get all unique values of the item.type field using the uniq command:

$ undatum uniq --fields item.type examples/ausgovdir.jsonl

Convert an XML file to a JSON lines file on the tag item using the convert command:

$ undatum convert --tagname item examples/ausgovdir.xml examples/ausgovdir.jsonl

Validate the field VendorINN in a data file against the validation rule ru.org.inn using the validate command. Output is statistics only:

$ undatum validate -r ru.org.inn --mode stats --fields VendorINN examples/roszdravvendors_final.jsonl > inn_stats.json

Validate the field VendorINN in a data file against the validation rule ru.org.inn using the validate command. Output is all invalid records:

$ undatum validate -r ru.org.inn --mode invalid --fields VendorINN examples/roszdravvendors_final.jsonl > inn_invalid.json

4 Commands

4.1 Frequency command

Field value frequency calculator. Returns a frequency table for a given field. By default, this command autodetects the delimiter and encoding of CSV files and the encoding of JSON lines files. You can override this by providing the "-d" delimiter and "-e" encoding parameters.

Get frequencies of values for the field GovSystem in the list of Russian federal government domains:

$ undatum frequency examples/feddomains.csv --fields GovSystem
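To make the idea of a frequency table concrete, here is a minimal Python sketch of the same kind of calculation. It is not undatum's implementation: the function name frequency, the share column, and the sample rows are assumptions made for illustration.

```python
import csv
import io
from collections import Counter


def frequency(rows, field):
    """Count how often each value of `field` occurs in an iterable of dict rows."""
    counts = Counter(row.get(field, "") for row in rows)
    total = sum(counts.values())
    # Most frequent values first, each with its share of all rows.
    return [(value, n, n / total) for value, n in counts.most_common()]


# Tiny in-memory CSV standing in for a real file; the values are made up.
sample = io.StringIO("domain,GovSystem\na.example,yes\nb.example,no\nc.example,yes\n")
table = frequency(csv.DictReader(sample), "GovSystem")
for value, n, share in table:
    print(f"{value}\t{n}\t{share:.2%}")
```

Because the rows are consumed one at a time, only the counters stay in memory, which is what keeps this kind of calculation cheap on large files.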

4.2 Uniq command

Returns all unique values of the given field(s). Accepts the parameter fields, a comma-separated list of fields whose unique values should be returned. Provide a single field name to get the unique values of that field, or a list of fields to get their unique combinations. By default, this command autodetects the delimiter and encoding of CSV files and the encoding of JSON lines files. You can override this by providing the "-d" delimiter and "-e" encoding parameters.

Returns all unique values of the field region in the selected JSONl file:

$ undatum uniq --fields region examples/reestrgp_final.jsonl

Returns all unique combinations of the fields status and region in the selected JSONl file:

$ undatum uniq --fields status,region examples/reestrgp_final.jsonl
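The difference between unique values of one field and unique combinations of several fields can be sketched in a few lines of Python. This is only an illustration under assumptions: get_path and unique_values are hypothetical helper names, and undatum may order or format its output differently.

```python
import json


def get_path(record, path):
    """Follow a dotted key such as 'item.type' through nested dicts."""
    value = record
    for part in path.split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(part)
    return value


def unique_values(lines, fields):
    """Yield each distinct combination of `fields`, in first-seen order."""
    seen = set()
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        combo = tuple(get_path(record, field) for field in fields)
        # Serialise the combination so that lists and dicts can be compared too.
        key = json.dumps(combo, sort_keys=True, default=str)
        if key not in seen:
            seen.add(key)
            yield combo
```

Passing ["region"] yields one-element tuples, while ["status", "region"] yields pairs, so two records count as duplicates only when every listed field matches.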

4.3 Convert command

Converts data from one format to another. Supports the most common data file formats. Supported conversions:

XML to JSON lines
CSV to JSON lines
XLS to JSON lines
XLSX to JSON lines
XLS to CSV
CSV to BSON
XLS to BSON
JSON lines to CSV
CSV to Parquet
JSON lines to Parquet

Conversion from XML to JSON lines requires the --tagname flag with the name of the tag that should be converted into a single JSON record.

Converts XML ausgovdir.xml with tag named item to ausgovdir.jsonl

$ undatum convert --tagname item examples/ausgovdir.xml examples/ausgovdir.jsonl

Converts JSON lines file roszdravvendors_final.jsonl to CSV file roszdravvendors_final.csv

$ undatum convert examples/roszdravvendors_final.jsonl examples/roszdravvendors_final.csv

Converts CSV file feddomains.csv to Parquet file feddomains.parquet

$ undatum convert examples/feddomains.csv examples/feddomains.parquet
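As a rough picture of what a JSON lines to CSV conversion involves, the following Python sketch builds a header from the keys it finds and writes one row per record. It assumes flat records, and jsonl_to_csv is a hypothetical name; it is not how undatum itself performs the conversion.

```python
import csv
import io
import json


def jsonl_to_csv(lines, out):
    """Write flat JSON lines records to the text stream `out` as CSV."""
    records = [json.loads(line) for line in lines if line.strip()]
    # The header is the union of all keys, in first-seen order.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    # Records that lack a column get an empty cell.
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)


out = io.StringIO()
jsonl_to_csv(['{"a": 1, "b": 2}', '{"a": 3}'], out)
print(out.getvalue())
```

For simplicity this sketch holds all records in memory while it collects the header; a two-pass version (one pass for keys, one for rows) would avoid that on large files.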

Data format conversion table

=========  ===  =========  ====  ====  ===  ====  ===  =======  ===  ====
From / To  CSV  JSONlines  BSON  JSON  XLS  XLSX  XML  Parquet  ORC  AVRO
=========  ===  =========  ====  ====  ===  ====  ===  =======  ===  ====
CSV             Yes        Yes   No    No   No    No   Yes      Yes  Yes
JSONlines  Yes             No    No    No   No    No   Yes      Yes  No
BSON       No   Yes              No    No   No    No   No       No   No
JSON       No   Yes        No          No   No    No   No       No   No
XLS        No   Yes        Yes   No         No    No   No       No   No
XLSX       No   Yes        Yes   No    No         No   No       No   No
XML        No   Yes        No    No    No   No         No       No   No
Parquet    No   No         No    No    No   No    No            No   No
ORC        No   No         No    No    No   No    No   No            No
AVRO       No   No         No    No    No   No    No   No       No
=========  ===  =========  ====  ====  ===  ====  ===  =======  ===  ====

4.4 Validate command

The validate command checks every value of a field against a validation rule, such as a rule that validates an email address or a URL.

Currently supported rules:

common.email - checks if value is an email address
common.url - checks if value is a URL
ru.org.inn - checks if value is a Russian organization INN identifier
ru.org.ogrn - checks if value is a Russian organization OGRN identifier

Validate the field VendorINN in a data file against the validation rule ru.org.inn using the validate command. Output is all invalid records:

$ undatum validate -r ru.org.inn --mode invalid --fields VendorINN examples/roszdravvendors_final.jsonl > inn_invalid.json
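The following Python sketch shows what a rule such as ru.org.inn might check and how the stats and invalid modes differ. It is written under assumptions: it uses the commonly documented check-digit scheme for 10-digit organization INNs, which may not match undatum's own rule exactly, and the names is_valid_org_inn and validate are hypothetical.

```python
import re

# Weights of the commonly documented checksum for 10-digit organization INNs.
INN10_WEIGHTS = (2, 4, 10, 3, 5, 9, 4, 6, 8)


def is_valid_org_inn(value):
    """Return True if `value` is a 10-digit string with a correct check digit."""
    text = str(value).strip()
    if not re.fullmatch(r"[0-9]{10}", text):
        return False
    digits = [int(ch) for ch in text]
    control = sum(w * d for w, d in zip(INN10_WEIGHTS, digits)) % 11 % 10
    return control == digits[9]


def validate(records, field, rule, mode="stats"):
    """mode='invalid' returns the failing records, mode='stats' returns counts."""
    invalid = [r for r in records if not rule(r.get(field))]
    if mode == "invalid":
        return invalid
    return {
        "total": len(records),
        "invalid": len(invalid),
        "valid": len(records) - len(invalid),
    }
```

A rule here is just a function from a value to True or False, so an email or URL rule could be plugged into the same validate function.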

4.5 Headers command

Returns the field names of the file. Supports CSV, JSON lines and BSON file types. For a CSV file it takes the first line of the file; for JSON lines and BSON files it processes the number of records given by the limit parameter, which defaults to 10000. By default, this command autodetects the delimiter and encoding of CSV files and the encoding of JSON lines files. You can override this by providing the "-d" delimiter and "-e" encoding parameters.

Returns headers of a JSON lines file using the top 10 000 records (default value):

$ undatum headers examples/ausgovdir.jsonl

Returns headers of JSON lines file using top 50 000 records

$ undatum headers --limit 50000 examples/ausgovdir.jsonl
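Since JSON lines records carry their own keys, field names have to be collected by scanning records, which is why a limit is useful. A minimal Python sketch of that scan is shown below; collect_headers is a hypothetical name, the dotted naming of nested keys is an assumption based on field names such as item.type used elsewhere in this document, and every line is assumed to hold a JSON object.

```python
import json
from itertools import islice


def collect_headers(lines, limit=10000):
    """Return dotted field names found in the first `limit` JSON lines records."""
    names = []
    seen = set()

    def walk(obj, prefix=""):
        for key, value in obj.items():
            path = f"{prefix}{key}"
            if isinstance(value, dict):
                # Descend into nested objects and extend the dotted path.
                walk(value, path + ".")
            elif path not in seen:
                seen.add(path)
                names.append(path)

    for line in islice(lines, limit):
        if line.strip():
            walk(json.loads(line))
    return names
```

A larger limit finds fields that appear only in later records, at the cost of reading more of the file.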

4.6 Stats command

Collects statistics about the data in a dataset. Supports BSON, CSV and JSON lines file types.

Returns a table with the following data:

key - name of the key
ftype - data type of the values with this key
is_dictkey - if True, then this key is identified as a dictionary value
is_uniq - if True, the field is identified as unique
n_uniq - number of unique values
share_uniq - share of unique values among all values
minlen - minimum length of the field
maxlen - maximum length of the field
avglen - average length of the field

Returns stats for JSON lines file

$ undatum stats examples/ausgovdir.jsonl
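To illustrate how several of the columns listed above relate to each other, here is a small Python sketch that computes them for one key. It is an approximation under assumptions: field_stats is a hypothetical name, values are compared and measured as strings, and share_uniq is expressed as a fraction between 0 and 1, which may differ from the units undatum prints.

```python
def field_stats(records, key):
    """Summarise the values stored under `key` across a list of flat dict records."""
    values = [str(r[key]) for r in records if key in r]
    if not values:
        return None
    lengths = [len(v) for v in values]
    n_uniq = len(set(values))
    return {
        "key": key,
        "n_uniq": n_uniq,
        "share_uniq": n_uniq / len(values),
        "is_uniq": n_uniq == len(values),
        "minlen": min(lengths),
        "maxlen": max(lengths),
        "avglen": sum(lengths) / len(lengths),
    }
```

A field is reported as unique only when every value is distinct, that is, when share_uniq equals 1.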

Analyzes a JSON lines file, checks whether each field is a date field, and detects its date format:

$ undatum stats --checkdates examples/ausgovdir.jsonl

4.7 Analyze command

Analyzes data format and provides human-readable information.

$ undatum analyze examples/ausgovdir.jsonl

Returned values will include:

Filename - name of the file
File type - type of the file, could be: jsonl, xml, csv, json, bson
Encoding - file encoding
Delimiter - file delimiter if CSV file
File size - size of the file, bytes
Objects count - number of objects in file
Fields - list of file fields

Also, for XML and JSON files:

Multiple tables exist - True or False, whether the XML file contains multiple data tables
Full data key - full path to data key (field with list of objects) in XML file
Short data key - final name of field with objects in XML file

For JSON files:

JSON type - could be "objects list", "objects list with key" and "single object"

For XML, JSON lines and JSON files:

Is flat table? - True if the table is flat and can be converted to CSV, False if it is not convertible

For CSV and JSON lines:

Number of lines - number of lines in the file

4.8 Split command

Splits a dataset into a number of datasets based on the number of records or on a field value. The chunk size parameter -c sets the size of each chunk when the dataset is split by chunk size. To split the dataset by field value, use the --fields parameter.

Split the dataset into chunks of 10000 records. Produces files like filename_1.jsonl, filename_2.jsonl, where filename is the name of the original file without its extension:

$ undatum split -c 10000 examples/ausgovdir.jsonl

Split the dataset into a number of files based on the field item.type. Generates files filename_value1.jsonl, filename_value2.jsonl, etc., where filename is ausgovdir and value1 is a unique value from the item.type field:

$ undatum split --fields item.type examples/ausgovdir.jsonl
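The two splitting rules described above can be sketched in Python as follows. This is an illustration only: chunked, chunk_filenames and split_by_field are hypothetical names, the file naming follows the pattern described in this section, and records are assumed to be flat dicts when splitting by field.

```python
from itertools import islice


def chunked(records, size):
    """Yield successive lists of at most `size` records."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def chunk_filenames(basename, records, size, ext="jsonl"):
    """Pair each chunk with a name like basename_1.jsonl, basename_2.jsonl, ..."""
    for number, chunk in enumerate(chunked(records, size), start=1):
        yield f"{basename}_{number}.{ext}", chunk


def split_by_field(records, field):
    """Group records by the value they hold in `field`."""
    groups = {}
    for record in records:
        groups.setdefault(record.get(field), []).append(record)
    return groups
```

Splitting by chunk size needs only one chunk in memory at a time, whereas this simple grouping by field value keeps every group in memory until the input is exhausted.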

4.9 Select command

Selects or re-orders columns from a file. Supports CSV, JSON lines and BSON.

Returns columns item.title and item.type from ausgovdir.jsonl

$ undatum select --fields item.title,item.type examples/ausgovdir.jsonl

Returns columns item.title and item.type from ausgovdir.jsonl and stores result as selected.jsonl

$ undatum select --fields item.title,item.type -o selected.jsonl examples/ausgovdir.jsonl
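Selecting dotted fields such as item.title from nested records can be pictured with the short Python sketch below. The helper names get_path and select are hypothetical, and the flat output shape (dotted names as keys) is an assumption; undatum may keep the nested structure in its output.

```python
def get_path(record, path):
    """Follow a dotted key such as 'item.title' through nested dicts."""
    value = record
    for part in path.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def select(records, fields):
    """Yield one flat dict per record containing only `fields`, in that order."""
    for record in records:
        yield {field: get_path(record, field) for field in fields}
```

Because the output follows the order of the fields argument, the same function also covers re-ordering columns.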

4.10 Flatten command

Flattens data records and writes them as one value per row.

Returns all columns as flattened key, value pairs:

$ undatum flatten examples/ausgovdir.jsonl
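Flattening a nested record into key, value pairs can be sketched in Python as below. This is an illustration under assumptions: flatten is a hypothetical function, nested keys are joined with dots, and lists are left as single values; undatum's actual output format may differ.

```python
def flatten(record, prefix=""):
    """Yield (dotted_key, value) pairs for every leaf of a nested dict."""
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted path.
            yield from flatten(value, path + ".")
        else:
            yield path, value


for key, value in flatten({"item": {"title": "A", "type": "role"}, "id": 1}):
    print(key, value, sep=",")
```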

5 Advanced

5.1 Filtering

You can filter the records of any file by passing the --filter parameter to any command that supports it.

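Returns columns item.title and item.type for records whose item.type value is role. Note: keys should be surrounded by "`" and text values by "'". In bash and other POSIX shells, escape each backtick with a backslash inside the double-quoted filter, otherwise the shell treats it as command substitution:

$ undatum select --fields item.title,item.type --filter "\`item.type\` == 'role'" examples/ausgovdir.jsonl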

5.2 Data containers

Huge data files are often stored inside compressed containers such as .zip, .gz, .bz2 or .tar.gz files. undatum can process such files directly while keeping memory usage low, but decompression can slow down processing.

Returns headers from the subs_dump_1.jsonl file inside the subs_dump_1.zip file. Requires the --format-in parameter to force the input file type:

$ undatum headers --format-in jsonl subs_dump_1.zip

Extracts unique values of the field countryCode from the XZ-compressed file data.jsonl.xz. Requires the --format-in parameter to force the input file type:

$ undatum uniq -f countryCode --format-in jsonl data.jsonl.xz
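The reason compressed containers can be processed with a small memory footprint is that they can be decompressed as a stream. The Python sketch below shows the idea using only standard library modules; open_compressed and iter_jsonl are hypothetical names, the ZIP archive is assumed to hold a single UTF-8 data file, and .tar.gz archives are not handled. It is not undatum's implementation.

```python
import bz2
import gzip
import io
import json
import lzma
import zipfile


def open_compressed(path):
    """Return a text stream over the data inside a .zip, .gz, .bz2 or .xz file."""
    if path.endswith(".zip"):
        archive = zipfile.ZipFile(path)
        # Assumption: the archive holds a single data file; take the first member.
        member = archive.namelist()[0]
        return io.TextIOWrapper(archive.open(member), encoding="utf-8")
    openers = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}
    for suffix, opener in openers.items():
        if path.endswith(suffix):
            return opener(path, "rt", encoding="utf-8")
    # Not a known container: read it as plain text.
    return open(path, "r", encoding="utf-8")


def iter_jsonl(path):
    """Decompress and parse one record at a time, never the whole file."""
    with open_compressed(path) as stream:
        for line in stream:
            if line.strip():
                yield json.loads(line)
```

The container name alone does not reveal the format of the data inside it, which is consistent with the commands above needing --format-in.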

5.3 Date detection

JSON, JSON lines and CSV files do not support date and datetime data types. If you prepare your data manually, you can define datetime fields in a JSON schema, for example. But if the data is external, you need to identify these fields yourself.

undatum supports date identification via the qddate Python library, which provides automatic date detection.

$ undatum stats --checkdates examples/ausgovdir.jsonl
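To give a feel for what detecting a date field means, here is a deliberately simple Python sketch that tries a handful of formats with the standard library. It does not use qddate and is not how undatum works internally; CANDIDATE_FORMATS and detect_date_format are hypothetical names, and a real detector would recognise many more patterns.

```python
from datetime import datetime

# A few illustrative patterns; a real detector would know many more.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%Y-%m-%dT%H:%M:%S")


def detect_date_format(values):
    """Return the first candidate format that parses every non-empty value."""
    values = [v for v in values if v]
    if not values:
        return None
    for fmt in CANDIDATE_FORMATS:
        try:
            for value in values:
                datetime.strptime(value, fmt)
        except (ValueError, TypeError):
            # At least one value does not match this format; try the next one.
            continue
        return fmt
    return None
```

A field is treated as a date field here only if all of its sampled values parse with the same format, which avoids flagging free-text fields that happen to contain one date-like value.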

6 Data types

6.1 JSONl

JSON lines is an alternative to CSV and JSON files that combines the flexibility of JSON with the ability to process data line by line, without loading everything into memory.
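The line-by-line property is easy to see in code. In the minimal Python sketch below, each line is parsed as an independent JSON document, so only one record is held in memory at a time; iter_records and the sample data are made up for illustration.

```python
import io
import json


def iter_records(stream):
    """Parse one JSON document per line, holding a single record in memory."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# An in-memory stand-in for a .jsonl file; each line is a complete JSON object.
sample = io.StringIO('{"id": 1, "item": {"type": "role"}}\n{"id": 2, "tags": ["a", "b"]}\n')
ids = [record["id"] for record in iter_records(sample)]
print(ids)
```

Unlike CSV, each record can carry nested objects and lists, and unlike a single large JSON array, the file never has to be parsed as a whole.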
