Releases: eBay/tsv-utils
v1.1.19: Minor updates
NOTE: Unfortunately, the pre-built binaries for v1.1.19 and earlier releases have been lost. Please use the pre-built binaries from the latest release. There is nothing wrong with the old binaries; if you downloaded one earlier, you can continue to use it.
Changes in v1.1.19:
tsv-uniq
- New options for printing only repeated lines: --r|repeated, --a|at-least N.
tsv-pretty
- New option for verbatim output of an initial set of lines: --a|preamble N.
makefile help
- Bug fix in the output.
v1.1.18: Minor updates
NOTE: Unfortunately, the pre-built binaries for v1.1.19 and earlier releases have been lost. Please use the pre-built binaries from the latest release. There is nothing wrong with the old binaries; if you downloaded one earlier, you can continue to use it.
Changes in v1.1.18:
tsv-uniq
- Added a --m|max option to output up to a max number of duplicate lines. The default of course is one.
tsv-sample
- Added PGO support. Small gains, up to 5% depending on sampling method.
- Better unit test diagnostic output on "command line" tests. This simplifies tracking down errors when tests are run on a system like TravisCI. In the past it was necessary to run the test locally to see what failed.
- Bash completion - Fix a tsv-filter option.
- Doc updates - Added a pair of sections to the Tips and Tricks doc. One describing TSV and CSV differences, another giving examples of using dos2unix and iconv to deal with encoding and newline issues.
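The effect of a cap on duplicate output, as described for the --m|max option, can be illustrated with a short Python sketch. This is not the tool's implementation (tsv-uniq is written in D); the function name `uniq_with_max` and its parameters are hypothetical, and whole lines are used as the key for simplicity.

```python
from collections import Counter

def uniq_with_max(lines, max_copies=1):
    """Yield each distinct line at most max_copies times, preserving input order."""
    seen = Counter()
    for line in lines:
        seen[line] += 1
        # Emit the line only while its running count is within the cap.
        if seen[line] <= max_copies:
            yield line

sample = ["a", "b", "a", "a", "c", "b"]
default_result = list(uniq_with_max(sample))            # cap of one: plain de-duplication
two_result = list(uniq_with_max(sample, max_copies=2))  # allow up to two copies of each line
```

With a cap of one this behaves like ordinary de-duplication; raising the cap lets a limited number of repeats through.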
v1.1.17: Output Buffering
NOTE: Pre-built binaries for this release are no longer available. Please use binaries from the latest release.
Changes in v1.1.17:
Most of the tools were switched to use output buffering. This is a performance enhancement that works by buffering small writes into larger blocks before writing to the final output destination, usually stdout. The amount of benefit depends on the tool and the nature of the file being processed. Narrow files (short lines) see the most benefit, and in some cases run 50% faster. More typical gains are 5-20%.
Output buffering logic is in the BufferedOutputRange struct found in common/src/tsvutil.d. The resulting source code in each tool turns out to be quite readable.
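The general idea of collecting small writes into larger blocks can be sketched in Python as follows. This is only an illustration of the technique, not a port of BufferedOutputRange; the class name `BufferedOutput`, the `flush_size` threshold, and the `sink_writes` counter are all assumptions made for the example.

```python
import io

class BufferedOutput:
    """Collect small writes and pass them to the sink in larger blocks."""

    def __init__(self, sink, flush_size=4096):
        self.sink = sink              # any object with a write(str) method
        self.flush_size = flush_size  # hypothetical threshold, in characters
        self.parts = []
        self.pending = 0
        self.sink_writes = 0          # counts writes that actually reach the sink

    def write(self, text):
        self.parts.append(text)
        self.pending += len(text)
        # Only touch the sink once enough data has accumulated.
        if self.pending >= self.flush_size:
            self.flush()

    def flush(self):
        if self.parts:
            self.sink.write("".join(self.parts))
            self.sink_writes += 1
            self.parts = []
            self.pending = 0

sink = io.StringIO()
out = BufferedOutput(sink, flush_size=64)
for i in range(100):
    out.write(f"row{i}\tvalue\n")   # 100 short lines, as in a narrow TSV file
out.flush()                          # push any remaining data before finishing
```

Here 100 short writes reach the sink in far fewer calls, which is the reason narrow files (short lines) tend to benefit most from this kind of buffering.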
v1.1.16: Profile guided optimization; New sampling methods
NOTE: Pre-built binaries for this release are no longer available. Please use binaries from the latest release.
Changes in v1.1.16:
The main changes in this release are the use of Profile Guided Optimization (PGO) and the addition of new sampling methods in tsv-sample.
Profile Guided Optimization - This is a follow-on to the Link Time Optimization work done in v1.1.15. It is based on LDC compiler support for LTO and PGO, including the ability to operate on the application code and the D standard libraries (druntime, phobos) together.
Profile Guided Optimization uses data collected from instrumented builds to better optimize executables. The tsv utilities build process has been updated to generate and use instrumentation for several of the tools. LTO and PGO builds are enabled by options passed to make. The pre-built binaries available from the GitHub releases page are built with LTO and PGO, but they must be enabled explicitly when building from source. See Building with Link Time Optimization and Profile Guided Optimization for details.
PGO results in material performance gains (10% or more) on csv2tsv and tsv-summarize, and smaller gains (2-5%) on several other tools. Considering LTO (v1.1.15) and PGO (v1.1.16) combined, performance gains on five of six measured benchmarks ranged from 8-45% on Linux, and 6-57% on MacOS. Three of the benchmarks saw gains greater than 25% on both platforms.
New sampling methods - Two sampling methods have been added to tsv-sample. One is a simple stream sampling mode that selects a random portion of an input stream based on a sampling rate. The other is a form of sampling known as "distinct" sampling. This selects a random portion of records based on a key in the data. For example, if records contain an IP address, distinct sampling can take all records from 1% of the unique IP addresses. See the tsv-sample reference for details.
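The difference between the two sampling modes can be sketched in Python. This is a minimal illustration under stated assumptions, not the algorithm tsv-sample uses: the function names are hypothetical, and SHA-256 is chosen here only because it gives a stable hash of the key.

```python
import hashlib
import random

def bernoulli_sample(lines, rate, seed=None):
    """Stream sampling: keep each line independently with probability `rate`."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < rate]

def distinct_sample(lines, key_index, rate, delim="\t"):
    """Distinct sampling: keep every line whose key falls in the sampled portion.

    The key is hashed to a 64-bit integer; keys hashing below rate * 2**64 are
    kept, so all records sharing a key are kept or dropped together.
    """
    threshold = rate * 2**64
    kept = []
    for line in lines:
        key = line.split(delim)[key_index]
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        if int.from_bytes(digest[:8], "big") < threshold:
            kept.append(line)
    return kept

# Made-up records: 200 distinct addresses, three records each.
records = [f"10.0.{i // 256}.{i % 256}\tevent{j}" for i in range(200) for j in range(3)]
by_key = distinct_sample(records, key_index=0, rate=0.25)
by_line = bernoulli_sample(records, rate=0.25, seed=1)
```

In `by_key`, each sampled address appears with all three of its records, whereas `by_line` selects records without regard to the key.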
Other changes
- tsv-summarize bug fix: incorrect headers on two operations.
- Windows line ending detection when running on Unix platforms (Issue #96).
- tsv-select performance improvement: Avoid unnecessary memory allocation from std.array.join. A 5% performance improvement and less memory allocation.
v1.1.15: Link Time Optimization
NOTE: Pre-built binaries for this release are no longer available. Please use binaries from the latest release.
Changes in v1.1.15:
This release uses new link-time optimization (LTO) available starting with the LDC 1.5 compiler release. This improves the performance of most of the tools, typically by about 10% over the previous release, and significantly more in some cases. Benchmarks can be found in this slide deck from the Silicon Valley D Meetup, Dec 14, 2017.
Previous releases used Thin LTO on OS X builds. LTO was not used on Linux builds. In the OS X case, LTO was used on the tsv utilities code, but not the code from the D libraries, phobos and druntime.
The LDC 1.5 release supports LTO on both Linux and OS X out of the box, and includes support for building phobos and druntime with LTO.
This release of the tsv utilities adds support for the new LTO capabilities to the makefiles. It is not enabled by default, but can be turned on with make arguments. The prebuilt binaries have been built with LTO turned on. For more information, see Building With LTO.
v1.1.15-beta3: Link Time Optimization
NOTE: Pre-built binaries for this release are no longer available. Please use binaries from the latest release.
Changes in v1.1.15-beta3:
This release uses new link-time optimization (LTO) available starting with the LDC 1.5 compiler release.
v1.1.14 - Documentation updates
NOTE: Pre-built binaries for this release are no longer available. Please use binaries from the latest release.
Changes in v1.1.14:
No functional changes, updates to documentation only.
v1.1.13 - New tool: tsv-pretty
NOTE: Pre-built binaries for this release are no longer available. Please use binaries from the latest release.
Changes in v1.1.13: New tool, tsv-pretty.
tsv-pretty prints TSV data in an aligned fashion for command-line readability. Headers are detected automatically and numeric values aligned. An example, first without formatting:
$ cat sample.tsv
Color Count Ht Wt
Brown 106 202.2 1.5
Canary Yellow 7 106 0.761
Chartreuse 1139 77.02 6.22
Fluorescent Orange 422 1141.7 7.921
Grey 19 140.3 1.03
Now with tsv-pretty, using header underlining and float formatting:
$ tsv-pretty -u -f sample.tsv
Color               Count       Ht     Wt
-----               -----       --     --
Brown                 106   202.20  1.500
Canary Yellow           7   106.00  0.761
Chartreuse           1139    77.02  6.220
Fluorescent Orange    422  1141.70  7.921
Grey                   19   140.30  1.030
v1.1.12: Link Time Optimization on OS X builds
NOTE: Pre-built binaries for this release are no longer available. Please use binaries from the latest release.
Changes in v1.1.12:
Turn on Link Time Optimization (LTO) when using the LDC compiler on OS X. This produces faster executables. The difference is especially notable for the csv2tsv tool, which runs about 20% faster. LTO is used in the pre-built OS X binaries and will be used on OS X source code builds (git clone, dub fetch) when building with the LDC compiler.
OS X directly supports LTO with the system linker provided by XCode (Clang / LLVM). LTO can also be used on Linux, but at present it requires installing and building special linker support. This complicates the build process, which is why it is not used on Linux by this toolset. For more information on LDC's LTO support see http://johanengelen.github.io/ldc/2016/11/10/Link-Time-Optimization-LDC.html.
v1.1.11 - Field ranges
NOTE: Pre-built binaries for this release are no longer available. Please use binaries from the latest release.
Changes in v1.1.11:
Main feature is support for field ranges. Any place a list of fields can be entered, field ranges can be used as well. A field range is a pair of field numbers separated by a hyphen. Reverse order is supported as well. Single field numbers and field ranges can be used together. Some examples:
$ tsv-select --fields 1,2,17-33,10-7 data.tsv
$ tsv-summarize --group-by 3-5 --median 7-17
$ tsv-uniq --fields 7-10 data.tsv
There are also some improvements to error message text.
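How a field list such as 1,2,17-33,10-7 expands into individual field numbers can be sketched in Python. This is an illustrative parser only, not the tools' own code; the function name `parse_field_list` is hypothetical, and validation (for example, rejecting zero or malformed entries) is left out.

```python
def parse_field_list(spec):
    """Expand a comma-separated list of field numbers and ranges into a list of ints."""
    fields = []
    for part in spec.split(","):
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            # A range may run in either direction, e.g. 10-7 yields 10, 9, 8, 7.
            step = 1 if start <= end else -1
            fields.extend(range(start, end + step, step))
        else:
            fields.append(int(part))
    return fields

parsed = parse_field_list("1,2,17-33,10-7")
```

The result keeps the order given on the command line, so a reversed range such as 10-7 lists the fields in descending order.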