Skip to content

agardiner/csv-diff

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

84 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CSV-Diff

CSV-Diff is a small library for performing diffs of tabular data, typically data loaded from CSV files.

Unlike a standard diff that compares line by line, and is sensitive to the ordering of records, CSV-Diff identifies common lines by key field(s), and then compares the contents of the fields in each line.

Data may be supplied in the form of CSV files, or as an array of arrays. More complex usage also allows you to specify XPath expressions to extract tabular data from XML documents for diffing.

The diff process provides a fine level of control over what to diff, and can optionally ignore certain types of changes (e.g. adds, deletes, changes in position etc).

CSV-Diff is particularly well suited to data in parent-child format. Parent- child data does not lend itself well to standard text diffs, as small changes in the organisation of the tree at an upper level can lead to big movements in the position of descendant records. By instead matching records by key, CSV-Diff avoids this issue, while still being able to detect changes in sibling order.

Usage

CSV-Diff is supplied as a gem, and has no dependencies. To use it, simply:

```
gem install csv-diff
```

To compare two CSV files where the field names are in the first row of the file, and the first field contains the unique key for each record, simply use:

```ruby
require 'csv-diff'

diff = CSVDiff.new(file1, file2)
```

The returned diff object can be queried for the differences that exist between the two files, e.g.:

puts diff.summary.inspect   # Summary of the adds, deletes, updates, and moves
puts diff.adds.inspect      # Details of the additions to file2
puts diff.deletes.inspect   # Details of the deletions to file1
puts diff.updates.inspect   # Details of the updates from file1 to file2
puts diff.moves.inspect     # Details of the moves from file1 to file2
puts diff.diffs.inspect     # Details of all differences
puts diff.warnings.inspect  # Any warnings generated during the diff process

Unique Row Identifiers

CSVDiff is preferable over a standard line-by-line diff when row order is significantly impacted by small changes. The classic example is a parent-child file generated by a hierarchy traversal. A simple change in position of a parent member near the root of the hierarchy will have a large impact on the positions of all descendant rows. Consider the following example:

Root
  |- A
  |  |- A1
  |  |- A2
  |
  |- B
     |- B1
     |- B2

A hierarchy traversal of this tree into a parent-child format would generate a CSV as follows:

Root,A
A,A1
A,A2
Root,B
B,B1
B,B2

If the positions of A and B were swapped, a hierarchy traversal would now produce a CSV as follows:

Root,B
B,B1
B,B2
Root,A
A,A1
A,A2

A simple diff using a diff utility would highlight this as 3 additions and 3 deletions. CSVDiff, however, would classify this as 2 moves (a change in sibling position for A and B).

In order to do this, CSVDiff needs to know what field(s) confer uniqueness on each row. In this example, we could use the child field alone (since each member name only appears once); however, this would imply a flat structure, where all rows are children of a single parent. This in turn would cause CSVDiff to classify the above change as a Move (i.e. a change in order) of all 6 rows.

The more correct specification of this file is that column 0 contains a unique parent identifier, and column 1 contains a unique child identifier. CSVDiff can then correctly deduce that there are in fact only two changes in order - the swap in positions of A and B below Root.

Note: If you aren't interested in changes in the order of siblings, then you could use CSVDiff with a :key_field option of column 1, and specify the :ignore_moves option.

Warnings

When processing and diffing files, CSVDiff may encounter problems with the data or the specifications it has been given. It will continue even in the face of problems, but will log details of the problems in a #warnings Array. The number of warnings will also be included in the Hash returned by the #summary method.

Warnings may be raised for any of the following:

  • Missing fields: If the right/to file contains fields that are not present in the left/from file, a warning is raised and the field is ignored for diff purposes.
  • Duplicate keys: If two rows are found that have the same values for the key field(s), a warning is raised, and the duplicate values are ignored.

Examples

The simplest use case is as shown above, where the data to be diffed is in CSV files with the column names as the first record, and where the unique key is the first column in the data. In this case, a diff can be created simply via:

```ruby
diff = CSVDiff.new(file1, file2)
```

Specifying Unique Row Identifiers

Often however, rows are not uniquely identifiable via the first column in the file. In a parent-child hierarchy, for example, combinations of parent and child may be necessary to uniquely identify a row, while in other cases a combination of fields may be needed to derive a natural unique key or identifier for each row. In these cases, it is necessary to indicate to CSVDiff which fields are needed to uniquely identify common rows across the two files. This can be done in several different ways.

:key_field(s)

The first method is using the key_fields option (or key_field if you have only a single key field). Use this option when your data represents a flat structure rather than a parent-child hierarchy or flattened tree. You can specify key_fields using field numbers/column indices (0-based):

```ruby
diff = CSVDiff.new(file1, file2, key_fields: [0, 1])
```

Alternatively, you can use the :key_fields options with column names (provided CSVDiff knows the names of your fields, either via the field_names option or from headers in the file):

```ruby
diff = CSVDiff.new(file1, file2, key_fields: ['First Name', 'Last Name'])
```

:parent_field(s)/:child_field(s)

The second method for identifying the unique identifiers in your file is to use the :parent_fields and :child_fields options. Use this option when your data represents a tree structure flattened to a table in parent-child form.

Using the :parent_fields and :child_fields with field numbers:

```ruby
diff = CSVDiff.new(file1, file2, parent_field: 1, child_fields: [2, 3])
```

Using the :parent_fields and :child_fields with column names:

```ruby
diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'])
```

Using Non-CSV Sources

Data from non-CSV sources can be diffed, as long as it can be supplied as an Array of Arrays:

DATA1 = [
    ['Parent', 'Child', 'Description'],
    ['A', 'A1', 'Account 1'],
    ['A', 'A2', 'Account 2']
]

DATA2 = [
    ['Parent', 'Child', 'Description'],
    ['A', 'A1', 'Account1'],
    ['A', 'A2', 'Account2']
]

diff = CSVDiff.new(DATA1, DATA2, parent_field: 1, child_field: 0)

Data can also be diffed if it is an XML source, although this requires a little more effort to tell CSVDiff how to transform/extract content from the XML document into an array-of-arrays form. It also introduces a dependency on Nokogiri - you will need to install this gem to use CSVDiff with XML sources.

The first step is to use the CSVDiff::XMLSource class to define how to convert your XML content to an array-of-arrays. The XMLSource class is quite flexible, and can be used to convert single or multiple XML sources into a single data set for diffing, and different documents may even have different layouts.

The first step is to create an XMLSource object, which requires a label to identify the type of data it will generate:

xml_source_1 = CSVDiff::XMLSource.new('My Label')

Next, we pass XML documents to this source, and specify XPath expressions for each row and column of data to produce via the process(rec_xpath, field_maps, options) method:

  • An XPath expression is provided to select each node value in the document that will represent a row. Taking an HTML table as an example of something we wanted to parse, your rec_xpath value might be something like the following: '//table/tbody/tr'. This would locate all tables in the document, and create a new row of data in the XMLSource every time a <tr> tag was encountered.
  • A hash of field_maps is then provided to describe how to generate column values for each row of data. The keys to field_maps are the names of the fields to be output, while the values are the epression for how to generate values. Most commonly, this will be another XPath expression that is evaluated in the context of the node returned by the row XPath expression. So continuing our HTML example, we might use './td[0]/text()' as an expression to select the content of the first <td> element within the <tr> representing the current row.
xml_source1.process('//table/tbody/tr',
                    col_A: './td[0]/text()',
                    col_B: './td[1]/text()',
                    col_C: './td[2]/text()')

Finally, to diff two XML sources, we create a CSVDiff object with two XMLSource objects as the source:

diff = CSVDiff.new(xml_source1, xml_source2, key_field: 'col_A')

Specifying Column Names

If your data file does not include column headers, you can specify the names of each column when creating the diff. The names supplied are the keys used in the diff results:

DATA1 = [
    ['A', 'A1', 'Account 1'],
    ['A', 'A2', 'Account 2']
]

DATA2 = [
    ['A', 'A1', 'Account1'],
    ['A', 'A2', 'Account2']
]

diff = CSVDiff.new(DATA1, DATA2, key_fields: [1, 0], field_names: ['Parent', 'Child', 'Description'])

If your data file does contain a header row, but you wish to use your own column names, you can specify the :field_names option and the :ignore_header option to ignore the first row.

Ignoring Fields

If your data contains fields that you aren't interested in, these can be excluded from the diff process using the :ignore_fields option:

diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'],
                   ignore_fields: ['CreatedAt', 'UpdatedAt'])

Filtering Rows

If you need to filter source data before running the diff process, you can use the :include and :exclude options to do so. Both options take a Hash as their value; the hash should have keys that are the field names or indexes (0-based) on which to filter, and whose values are regular expressions or lambdas to be applied to values of the corresponding field. Rows will only be diffed if they satisfy :include conditions, and do not satisfy :exclude conditions.

# Generate a diff of Arsenal home games not refereed by Clattenburg
diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'],
                   include: {HomeTeam: 'Arsenal'}, exclude: {Referee: /Clattenburg/})

# Generate a diff of games played over the Xmas/New Year period
diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'],
                   include: {Date: lambda{ |d| holiday_period.include?(Date.strptime(d, '%y/%m/%d')) } })

Ignoring Certain Changes

CSVDiff identifies Adds, Updates, Moves and Deletes; any of these changes can be selectively ignored, e.g. if you are not interested in Deletes, you can pass the :ignore_deletes option:

diff = CSVDiff.new(file1, file2, parent_field: 'Date', child_fields: ['HomeTeam', 'AwayTeam'],
                   ignore_fields: ['CreatedAt', 'UpdatedAt'],
                   ignore_deletes: true, ignore_moves: true)