Skip to content

💾 Command-line tool for working with SAS binary files; 🦀 Rust wrapper of ReadStat C library

License

Notifications You must be signed in to change notification settings

curtisalexander/readstat-rs

Repository files navigation

readstat-rs Gitpod ready-to-code

readstat-rs

Command-line tool for working with SAS binary — sas7bdat — files.

Get metadata, preview data, or convert data to csv, feather (or the Arrow IPC format), ndjson, or parquet formats.

🔑 Dependencies

The command-line tool is developed in Rust and is only possible due to the following excellent projects:

The ReadStat library is used to parse and read sas7bdat files, and the arrow2 crate is used to convert the read sas7bdat data into the Arrow memory format. Once in the Arrow memory format, the data can be written to other file formats.

Quickstart

Convert the first 50,000 rows of example.sas7bdat (by performing the read in parallel) to the file example.parquet, overwriting the file if it already exists.

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet --rows 50000 --overwrite --parallel

Install

Download a Release

[Mostly] static binaries for Linux, macOS, and Windows may be found at the Releases page.

Setup

Move the readstat binary to a known directory and add the binary to the user's PATH.

Linux & macOS

Ensure the path to readstat is added to the appropriate shell configuration file.

Windows

For Windows users, path configuration may be found within the Environment Variables menu. Executing the following from the command line opens the Environment Variables menu for the current user.

rundll32.exe sysdm.cpl,EditEnvironmentVariables

Run

Run the binary.

readstat --help

Build

Clone

Ensure submodules are also cloned.

git clone --recurse-submodules https://github.com/curtisalexander/readstat-rs.git

The ReadStat repository is included as a git submodule within this repository. In order to build and link, first a readstat-sys crate is created. Then the readstat binary utilizes readstat-sys as a dependency.

Linux

Install developer tools

# unixodbc-dev needed for full compilation of arrow2
sudo apt install build-essential clang unixodbc-dev

Build

cargo build

macOS

Install developer tools

xcode-select --install

Build

cargo build

Windows

Building on Windows requires LLVM and Visual Studio C++ Build tools be downloaded and installed.

In addition, the path to libclang needs to be set in the environment variable LIBCLANG_PATH. If LIBCLANG_PATH is not set then the readstat-sys build script assumes the needed path to be C:\Program Files\LLVM\lib.

For details see the following.

Build

cargo build

Run

After either building or installing, the binary is invoked using subcommands. Currently, the following subcommands have been implemented:

  • metadata → writes the following to standard out or json
    • row count
    • variable count
    • table name
    • table label
    • file encoding
    • format version
    • bitness
    • creation time
    • modified time
    • compression
    • byte order
    • variable names
    • variable type classes
    • variable types
    • variable labels
    • variable format classes
    • variable formats
    • arrow data types
  • preview → writes the first 10 rows (or optionally the number of rows provided by the user) of parsed data in csv format to standard out
  • data → writes parsed data in csv, feather, ndjson, or parquet format to a file

Metadata

To write metadata to standard out, invoke the following.

readstat metadata /some/dir/to/example.sas7bdat

To write metadata to json, invoke the following. This is useful for reading the metadata programmatically.

readstat metadata /some/dir/to/example.sas7bdat --as-json

Preview Data

To write parsed data (as a csv) to standard out, invoke the following (default is to write the first 10 rows).

readstat preview /some/dir/to/example.sas7bdat

To write the first 100 rows of parsed data (as a csv) to standard out, invoke the following.

readstat preview /some/dir/to/example.sas7bdat --rows 100

Data

📝 The data subcommand includes a parameter for --format, which is the file format that is to be written. Currently, the following formats have been implemented:

  • csv
  • feather
  • ndjson
  • parquet

csv

To write parsed data (as csv) to a file, invoke the following (default is to write all parsed data to the specified file).

The default --format is csv. Thus, the parameter is elided from the below examples.

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.csv

To write the first 100 rows of parsed data (as csv) to a file, invoke the following.

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.csv --rows 100

feather

To write parsed data (as feather) to a file, invoke the following (default is to write all parsed data to the specified file).

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.feather --format feather

To write the first 100 rows of parsed data (as feather) to a file, invoke the following.

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.feather --format feather --rows 100

ndjson

To write parsed data (as ndjson) to a file, invoke the following (default is to write all parsed data to the specified file).

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.ndjson --format ndjson

To write the first 100 rows of parsed data (as ndjson) to a file, invoke the following.

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.ndjson --format ndjson --rows 100

parquet

To write parsed data (as parquet) to a file, invoke the following (default is to write all parsed data to the specified file).

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet

To write the first 100 rows of parsed data (as parquet) to a file, invoke the following.

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet --rows 100

To write parsed data (as parquet) to a file with specific compression settings, invoke the following:

readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet --compression zstd --compression-level 3

Parallelism

The data subcommand includes a parameter for --parallel — if invoked, the reading of a sas7bdat will occur in parallel. If the total rows to process is greater than stream-rows (if unset, the default rows to stream is 10,000), then each chunk of rows is read in parallel. Note that all processors on the user's machine are used with the --parallel option. In the future, may consider allowing the user to throttle this number.

Note that although reading is in parallel, writing is still sequential. Thus, one should only anticipate moderate speed-ups as much of the time is spent writing.

❗ Utilizing the --parallel parameter will increase memory usage — there will be multiple threads simultaneously reading chunks from the sas7bdat. In addition, because all processors are utilized, CPU usage may be maxed out during reading.

⚠️ Also, note that utilizing the --parallel parameter may write rows out of order from the original sas7bdat.

Reader

The preview and data subcommands include a parameter for --reader. The possible values for --reader include the following.

  • mem → Parse and read the entire sas7bdat into memory before writing to either standard out or a file
  • stream (default) → Parse and read at most stream-rows into memory before writing to disk
    • stream-rows may be set via the command line parameter --stream-rows or if elided will default to 10,000 rows

Why is this useful?

  • mem is useful for testing purposes
  • stream is useful for keeping memory usage low for large datasets (and hence is the default)
  • In general, users should not need to deviate from the default — stream — unless they have a specific need
  • In addition, by enabling these options as command line parameters hyperfine may be used to benchmark across an assortment of file sizes

Debug

Debug information is printed to standard out by setting the environment variable RUST_LOG=debug before the call to readstat.

⚠️ This is quite verbose! If using the preview or data subcommand, will write debug information for every single value!

# Linux and macOS
RUST_LOG=debug readstat ...
# Windows PowerShell
$env:RUST_LOG="debug"; readstat ...

Help

For full details run with --help.

readstat --help
readstat metadata --help
readstat preview --help
readstat data --help

Floating Point Values

⚠️ Decimal values are rounded to contain only 14 decimal digits!

For example, the number 1.1234567890123456 created within SAS would be returned as 1.12345678901235 within Rust.

Why does this happen? Is this an implementation error? No, rounding to only 14 decimal digits has been purposely implemented within the Rust code.

As a specific example, when testing with the cars.sas7bdat dataset (which was created originally on Windows), the numeric value 4.6 as observed within SAS was being returned as 4.600000000000001 (15 digits) within Rust. Values created on Windows with an x64 processor are only accurate to 15 digits.

For comparison, the ReadStat binary truncates to 14 decimal places when writing to csv.

Finally, SAS represents all numeric values in floating-point representation which creates a challenge for all parsed numerics!

Sources

Date, Time, and Datetimes

Currently any dates, times, or datetimes in the following SAS formats are parsed and read as dates, times, or datetimes.

⚠️ If the format does not match one of the above SAS formats, or if the value does not have a format applied, then the value will be parsed and read as a numeric value!

Details

SAS stores dates, times, and datetimes internally as numeric values. To distinguish among dates, times, datetimes, or numeric values, a SAS format is read from the variable metadata. If the format matches one of the above SAS formats then the numeric value is converted and read into memory using one of the Arrow types:

If values are read into memory as Arrow date, time, or datetime types, then when they are written — from an arrow2 Chunk to csv, feather, ndjson, or parquet — they are treated as dates, times, or datetimes and not as numeric values.

Finally, more work is planned to handle other SAS dates, times, and datetimes that have SAS formats other than those listed above.

Testing

To perform unit / integration tests, run the following.

cargo test

Datasets

Formally tested (via integration tests) against the following datasets. See the README.md for data sources.

  • ahs2019n.sas7bdat → US Census data
  • all_types.sas7bdat → SAS dataset containing all SAS types
  • cars.sas7bdat → SAS cars dataset
  • hasmissing.sas7bdat → SAS dataset containing missing values
  • intel.sas7bdat
  • messydata.sas7bdat
  • rand_ds.sas7bdat → Created using create_rand_ds.sas
  • rand_ds_largepage_err.sas7bdat → Created using create_rand_ds.sas with BUFSIZE set to 2M
  • rand_ds_largepage_ok.sas7bdat → Created using create_rand_ds.sas with BUFSIZE set to 1M
  • scientific_notation.sas7bdat → Used to test float parsing
  • somedata.sas7bdat
  • somemiss.sas7bdat

Valgrind

To ensure no memory leaks, valgrind may be utilized. For example, to ensure no memory leaks for the test parse_file_metadata_test, run the following from within the readstat directory.

valgrind ./target/debug/deps/parse_file_metadata_test-<hash>
  • ✔️ Linux → successfully builds and runs
  • ✔️ macOS → successfully builds and runs
  • ✔️ Windows → successfully builds and runs

Benchmarking

Benchmarking performed with hyperfine.

This example compares the performance of the Rust binary with the performance of the C binary built from the ReadStat repository. In general, hope that performance is fairly close to that of the C binary.

To run, execute the following from within the readstat directory.

# Windows
hyperfine --warmup 5 "ReadStat_App.exe -f crates\readstat-tests\tests\data\cars.sas7bdat tests\data\cars_c.csv" ".\target\release\readstat.exe data crates\readstat-tests\tests\data\cars.sas7bdat --output crates\readstat-tests\tests\data\cars_rust.csv"

📝 First experiments on Windows are challenging to interpret due to file caching. Need further research into utilizing the --prepare option provided by hyperfine on Windows.

# Linux and macOS
hyperfine --prepare "sync; echo 3 | sudo tee /proc/sys/vm/drop_caches" "readstat -f crates/readstat-tests/tests/data/cars.sas7bdat crates/readstat-tests/tests/data/cars_c.csv" "./target/release/readstat data tests/data/cars.sas7bdat --output crates/readstat-tests/tests/data/cars_rust.csv"

Other, future, benchmarking may be performed now that channels and threads have been developed.

Profiling

Profiling performed with cargo flamegraph.

To run, execute the following from within the readstat directory.

cargo flamegraph --bin readstat -- data tests/data/_ahs2019n.sas7bdat --output tests/data/_ahs2019n.csv

Flamegraph is written to readstat/flamegraph.svg.

📝 Have yet to utilize flamegraphs in order to improve performance.

GitHub Actions

Below is the rough git tag dance to delete and/or add tags to trigger GitHub Actions.

# delete local tag
git tag --delete v0.1.0

# delete remote tag
git push origin --delete v0.1.0

# add and commit local changes
git add .
git commit -m "commit msg"

# push local changes to remote
git push

# add local tag
git tag -a v0.1.0 -m "v0.1.0"

# push local tag to remote
git push origin --tags

Goals

Short Term

Short term, developing the command-line tool was a helpful exercise in binding to a C library using bindgen and the Rust FFI. It definitely required a review of C pointers (and for which I claim no expertise)!

Long Term

The long term goals of this repository are uncertain. Possibilities include:

  • ✔️ Developing a command line tool that performs transformations from sas7bdat to other file types
    • text
      • csv
      • ndjson
    • binary
      • feather
      • parquet
  • ✔️ Developing a command line tool that expands the functionality made available by the readstat command line tool
  • Completing and publishing the readstat-sys crate that binds to ReadStat
  • Developing and publishing a Rust library — readstat — that allows Rust programmers to work with sas7bdat files
    • Implementing a custom serde data format for sas7bdat files (implement serialize first and deserialize later (if possible))

Resources

The following have been incredibly helpful while developing!

About

💾 Command-line tool for working with SAS binary files; 🦀 Rust wrapper of ReadStat C library

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published