Command-line tool for working with SAS binary (`sas7bdat`) files. Get metadata, preview data, or convert data to `csv`, `feather` (the Arrow IPC format), `ndjson`, or `parquet` formats.
The command-line tool is developed in Rust and is only possible due to the following excellent projects:
- The ReadStat C library developed by Evan Miller
- The arrow2 Rust crate developed by Jorge Leitao
The ReadStat library is used to parse and read `sas7bdat` files, and the `arrow2` crate is used to convert the read `sas7bdat` data into the Arrow memory format. Once in the Arrow memory format, the data can be written to other file formats.
Convert the first 50,000 rows of `example.sas7bdat` (performing the read in parallel) to the file `example.parquet`, overwriting the file if it already exists.
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet --rows 50000 --overwrite --parallel
[Mostly] static binaries for Linux, macOS, and Windows may be found at the Releases page.
Move the `readstat` binary to a known directory and add the binary to the user's `PATH`. Ensure the path to `readstat` is added to the appropriate shell configuration file.
For Windows users, path configuration may be found within the Environment Variables menu. Executing the following from the command line opens the Environment Variables menu for the current user.
rundll32.exe sysdm.cpl,EditEnvironmentVariables
Run the binary.
readstat --help
Ensure submodules are also cloned.
git clone --recurse-submodules https://github.com/curtisalexander/readstat-rs.git
The ReadStat repository is included as a git submodule within this repository. In order to build and link, first a `readstat-sys` crate is created. Then the `readstat` binary utilizes `readstat-sys` as a dependency.
Install developer tools
# unixodbc-dev needed for full compilation of arrow2
sudo apt install build-essential clang unixodbc-dev
Build
cargo build
Install developer tools
xcode-select --install
Build
cargo build
Building on Windows requires LLVM and the Visual Studio C++ Build Tools to be downloaded and installed. In addition, the path to `libclang` needs to be set in the environment variable `LIBCLANG_PATH`. If `LIBCLANG_PATH` is not set, then the `readstat-sys` build script assumes the needed path to be `C:\Program Files\LLVM\lib`.
For details see the following.
Build
cargo build
After either building or installing, the binary is invoked using subcommands. Currently, the following subcommands have been implemented:
- `metadata` → writes the following to standard out or json
    - row count
    - variable count
    - table name
    - table label
    - file encoding
    - format version
    - bitness
    - creation time
    - modified time
    - compression
    - byte order
    - variable names
    - variable type classes
    - variable types
    - variable labels
    - variable format classes
    - variable formats
    - arrow data types
- `preview` → writes the first 10 rows (or optionally the number of rows provided by the user) of parsed data in `csv` format to standard out
- `data` → writes parsed data in `csv`, `feather`, `ndjson`, or `parquet` format to a file
To write metadata to standard out, invoke the following.
readstat metadata /some/dir/to/example.sas7bdat
To write metadata to json, invoke the following. This is useful for reading the metadata programmatically.
readstat metadata /some/dir/to/example.sas7bdat --as-json
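As a sketch of such programmatic use, the json output can be loaded with any json library. Note that the field names below (`row_count`, `var_count`) are illustrative placeholders, not the tool's documented schema:

```python
import json

def table_dimensions(metadata_json: str) -> tuple[int, int]:
    """Return (rows, variables) from a json metadata document.

    The keys used here are hypothetical; inspect the actual output of
    `readstat metadata ... --as-json` to find the real field names.
    """
    meta = json.loads(metadata_json)
    return meta["row_count"], meta["var_count"]

# Example with an inline, illustrative payload instead of a live call:
payload = '{"row_count": 1081, "var_count": 13}'
print(table_dimensions(payload))  # (1081, 13)
```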
To write parsed data (as a `csv`) to standard out, invoke the following (default is to write the first 10 rows).
readstat preview /some/dir/to/example.sas7bdat
To write the first 100 rows of parsed data (as a `csv`) to standard out, invoke the following.
readstat preview /some/dir/to/example.sas7bdat --rows 100
📝 The `data` subcommand includes a parameter for `--format`, which is the file format that is to be written. Currently, the following formats have been implemented:
- `csv`
- `feather`
- `ndjson`
- `parquet`
To write parsed data (as `csv`) to a file, invoke the following (default is to write all parsed data to the specified file). The default `--format` is `csv`; thus, the parameter is elided from the examples below.
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.csv
To write the first 100 rows of parsed data (as `csv`) to a file, invoke the following.
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.csv --rows 100
To write parsed data (as `feather`) to a file, invoke the following (default is to write all parsed data to the specified file).
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.feather --format feather
To write the first 100 rows of parsed data (as `feather`) to a file, invoke the following.
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.feather --format feather --rows 100
To write parsed data (as `ndjson`) to a file, invoke the following (default is to write all parsed data to the specified file).
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.ndjson --format ndjson
To write the first 100 rows of parsed data (as `ndjson`) to a file, invoke the following.
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.ndjson --format ndjson --rows 100
To write parsed data (as `parquet`) to a file, invoke the following (default is to write all parsed data to the specified file).
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet
To write the first 100 rows of parsed data (as `parquet`) to a file, invoke the following.
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet --rows 100
To write parsed data (as `parquet`) to a file with specific compression settings, invoke the following:
readstat data /some/dir/to/example.sas7bdat --output /some/dir/to/example.parquet --format parquet --compression zstd --compression-level 3
The `data` subcommand includes a parameter for `--parallel`; if invoked, the reading of a `sas7bdat` will occur in parallel. If the total rows to process is greater than `stream-rows` (if unset, the default rows to stream is 10,000), then each chunk of rows is read in parallel. Note that all processors on the user's machine are used with the `--parallel` option. In the future, the user may be allowed to throttle this number.
Note that although reading occurs in parallel, writing is still sequential. Thus, one should anticipate only moderate speed-ups, as much of the time is spent writing.
❗ Utilizing the `--parallel` parameter will increase memory usage, as multiple threads simultaneously read chunks from the `sas7bdat`. In addition, because all processors are utilized, CPU usage may be maxed out during reading.
❗ The `--parallel` parameter may write rows out of order relative to the original `sas7bdat`.
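The chunking described above can be sketched as follows. This is an illustrative Python model of how a total row count might be split into `stream-rows`-sized chunks; the actual Rust implementation in readstat-rs may differ:

```python
def row_chunks(total_rows: int, stream_rows: int = 10_000) -> list[tuple[int, int]]:
    """Split total_rows into (start, end) half-open chunks of at most
    stream_rows rows; each chunk could then be read by a separate thread."""
    return [
        (start, min(start + stream_rows, total_rows))
        for start in range(0, total_rows, stream_rows)
    ]

# 25,000 rows with the default 10,000-row stream size yields three chunks
print(row_chunks(25_000))  # [(0, 10000), (10000, 20000), (20000, 25000)]
```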
The `preview` and `data` subcommands include a parameter for `--reader`. The possible values for `--reader` include the following.
- `mem` → parse and read the entire `sas7bdat` into memory before writing to either standard out or a file
- `stream` (default) → parse and read at most `stream-rows` into memory before writing to disk
    - `stream-rows` may be set via the command line parameter `--stream-rows` or, if elided, will default to 10,000 rows
Why is this useful?
- `mem` is useful for testing purposes
- `stream` is useful for keeping memory usage low for large datasets (and hence is the default)
- In general, users should not need to deviate from the default, `stream`, unless they have a specific need
- In addition, by enabling these options as command line parameters, hyperfine may be used to benchmark across an assortment of file sizes
Debug information is printed to standard out by setting the environment variable `RUST_LOG=debug` before the call to `readstat`.
# Linux and macOS
RUST_LOG=debug readstat ...
# Windows PowerShell
$env:RUST_LOG="debug"; readstat ...
For full details, run with `--help`.
readstat --help
readstat metadata --help
readstat preview --help
readstat data --help
For example, the number `1.1234567890123456` created within SAS would be returned as `1.12345678901235` within Rust.
Why does this happen? Is this an implementation error? No; rounding to only 14 decimal digits has been purposely implemented within the Rust code.
As a specific example, when testing with the cars.sas7bdat dataset (which was created originally on Windows), the numeric value `4.6` as observed within SAS was being returned as `4.600000000000001` (15 decimal digits) within Rust. Values created on Windows with an x64 processor are only accurate to 15 digits.
For comparison, the ReadStat binary truncates to 14 decimal places when writing to `csv`.
Finally, SAS represents all numeric values in floating-point representation, which creates a challenge for all parsed numerics!
- How SAS Stores Numeric Values
- Accuracy on x64 Windows Processors
- SAS on Windows with x64 processors can only represent 15 digits
- Floating-point arithmetic may give inaccurate results in Excel
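The effect of rounding to 14 decimal digits can be illustrated in Python. This is a sketch of the behavior, not the actual Rust implementation:

```python
# SAS numerics created on x64 Windows are only accurate to 15 digits,
# so a value observed as 4.6 in SAS may parse as 4.600000000000001.
parsed = 4.600000000000001

# Rounding to 14 decimal digits recovers the value as seen in SAS.
print(round(parsed, 14))  # 4.6

# The same rounding trims 1.1234567890123456 to 1.12345678901235.
print(round(1.1234567890123456, 14))
```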
Currently any dates, times, or datetimes in the following SAS formats are parsed and read as dates, times, or datetimes.
- Dates
- Times
- Datetimes
SAS stores dates, times, and datetimes internally as numeric values. To distinguish among dates, times, datetimes, or plain numeric values, a SAS format is read from the variable metadata. If the format matches one of the above SAS formats, then the numeric value is converted and read into memory using an appropriate Arrow date, time, or timestamp type.
If values are read into memory as Arrow date, time, or datetime types, then when they are written from an arrow2 `Chunk` to `csv`, `feather`, `ndjson`, or `parquet`, they are treated as dates, times, or datetimes and not as numeric values.
Finally, more work is planned to handle other SAS dates, times, and datetimes that have SAS formats other than those listed above.
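For background, SAS stores dates as days, and datetimes as seconds, since the SAS epoch of 1960-01-01. A minimal Python sketch of this conversion (illustrative only; readstat-rs performs the equivalent conversion into Arrow types):

```python
from datetime import date, datetime, timedelta, timezone

SAS_EPOCH_DATE = date(1960, 1, 1)
SAS_EPOCH_DATETIME = datetime(1960, 1, 1, tzinfo=timezone.utc)

def sas_date(days: int) -> date:
    """A SAS date is the number of days since 1960-01-01."""
    return SAS_EPOCH_DATE + timedelta(days=days)

def sas_datetime(seconds: float) -> datetime:
    """A SAS datetime is the number of seconds since 1960-01-01 00:00:00."""
    return SAS_EPOCH_DATETIME + timedelta(seconds=seconds)

print(sas_date(0))          # 1960-01-01
print(sas_date(366))        # 1961-01-01 (1960 was a leap year)
print(sas_datetime(86400))  # 1960-01-02 00:00:00+00:00
```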
To perform unit / integration tests, run the following.
cargo test
Formally tested (via integration tests) against the following datasets. See the README.md for data sources.
- `ahs2019n.sas7bdat` → US Census data
- `all_types.sas7bdat` → SAS dataset containing all SAS types
- `cars.sas7bdat` → SAS cars dataset
- `hasmissing.sas7bdat` → SAS dataset containing missing values
- `intel.sas7bdat`
- `messydata.sas7bdat`
- `rand_ds.sas7bdat` → created using create_rand_ds.sas
- `rand_ds_largepage_err.sas7bdat` → created using create_rand_ds.sas with `BUFSIZE` set to `2M`
- `rand_ds_largepage_ok.sas7bdat` → created using create_rand_ds.sas with `BUFSIZE` set to `1M`
- `scientific_notation.sas7bdat` → used to test float parsing
- `somedata.sas7bdat`
- `somemiss.sas7bdat`
To ensure no memory leaks, valgrind may be utilized. For example, to ensure no memory leaks for the test `parse_file_metadata_test`, run the following from within the `readstat` directory.
valgrind ./target/debug/deps/parse_file_metadata_test-<hash>
- ✔️ Linux → successfully builds and runs
- ✔️ macOS → successfully builds and runs
- ✔️ Windows → successfully builds and runs
    - As of ReadStat `1.1.5`, able to build using MSVC in lieu of setting up an msys2 environment
    - Requires `libclang` in order to build, as `libclang` is required by bindgen
Benchmarking performed with hyperfine.
This example compares the performance of the Rust binary with the performance of the C binary built from the ReadStat repository. In general, the hope is that performance is fairly close to that of the C binary.
To run, execute the following from within the `readstat` directory.
# Windows
hyperfine --warmup 5 "ReadStat_App.exe -f crates\readstat-tests\tests\data\cars.sas7bdat tests\data\cars_c.csv" ".\target\release\readstat.exe data crates\readstat-tests\tests\data\cars.sas7bdat --output crates\readstat-tests\tests\data\cars_rust.csv"
📝 First experiments on Windows are challenging to interpret due to file caching. Further research is needed into utilizing the `--prepare` option provided by hyperfine on Windows.
# Linux and macOS
hyperfine --prepare "sync; echo 3 | sudo tee /proc/sys/vm/drop_caches" "readstat -f crates/readstat-tests/tests/data/cars.sas7bdat crates/readstat-tests/tests/data/cars_c.csv" "./target/release/readstat data tests/data/cars.sas7bdat --output crates/readstat-tests/tests/data/cars_rust.csv"
Additional benchmarking may be performed now that channels and threads have been developed.
Profiling performed with cargo flamegraph.
To run, execute the following from within the `readstat` directory.
cargo flamegraph --bin readstat -- data tests/data/_ahs2019n.sas7bdat --output tests/data/_ahs2019n.csv
The flamegraph is written to `readstat/flamegraph.svg`.
📝 Flamegraphs have yet to be utilized to improve performance.
Below is the rough `git tag` dance to delete and/or add tags in order to trigger GitHub Actions.
# delete local tag
git tag --delete v0.1.0
# delete remote tag
git push origin --delete v0.1.0
# add and commit local changes
git add .
git commit -m "commit msg"
# push local changes to remote
git push
# add local tag
git tag -a v0.1.0 -m "v0.1.0"
# push local tag to remote
git push origin --tags
Short term, developing the command-line tool was a helpful exercise in binding to a C library using bindgen and the Rust FFI. It definitely required a review of C pointers (a topic in which I claim no expertise)!
The long term goals of this repository are uncertain. Possibilities include:
- ✔️ Developing a command line tool that performs transformations from `sas7bdat` to other file types
    - text
        - `csv`
        - `ndjson`
    - binary
        - `feather`
        - `parquet`
- ✔️ Developing a command line tool that expands the functionality made available by the readstat command line tool
- Completing and publishing the `readstat-sys` crate that binds to ReadStat
- Developing and publishing a Rust library, `readstat`, that allows Rust programmers to work with `sas7bdat` files
    - Implementing a custom serde data format for `sas7bdat` files (implement serialize first and deserialize later, if possible)
The following have been incredibly helpful while developing!
- How to not RiiR
- Making a *-sys crate
- Rust Closures in FFI
- Rust FFI: Microsoft Flight Simulator SDK
- Stack Overflow answers by Jake Goulding
- ReadStat pull request to add MSVC/Windows support
- jamovi-readstat appveyor.yml file to build ReadStat on Windows
- Arrow documentation for utilizing ArrayBuilders