Architecture

The algorithm is as follows:

  1. Trim whitespace from each line in both files
  2. Perform a UNIX diff
  3. Map unchanged lines
  4. From the UNIX diff, let leftLines be the lines deleted from the left file and rightLines the lines added to the right file.
  5. Map each leftLine to a rightLine if their distance is below a predefined threshold. The distance combines the Levenshtein distance between the two lines with the cosine similarity of the context around each line.
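
The five steps above can be sketched in Python as follows. This is a minimal illustration, not this project's code: `difflib.SequenceMatcher` stands in for the UNIX diff, and the names `map_lines`, `content_distance` and `context`, the weight `content_weight=0.6`, the `threshold=0.4`, the context `radius=2` and the greedy first-come matching are all assumptions chosen for the example.

```python
import difflib
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def content_distance(a: str, b: str) -> float:
    """Levenshtein distance normalized to [0, 1] by the longer line."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of whitespace-token counts; 0.0 if either is empty."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def context(lines, i, radius=2):
    """Text of up to `radius` lines on each side of line i, excluding line i."""
    return " ".join(lines[max(0, i - radius):i] + lines[i + 1:i + 1 + radius])


def map_lines(left_text, right_text, threshold=0.4, content_weight=0.6):
    # Step 1: trim whitespace from each line in both files.
    left = [line.strip() for line in left_text.splitlines()]
    right = [line.strip() for line in right_text.splitlines()]

    # Step 2: diff the two files (difflib used here in place of UNIX diff).
    matcher = difflib.SequenceMatcher(None, left, right, autojunk=False)

    mapping, left_lines, right_lines = {}, [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Step 3: unchanged lines map directly.
            mapping.update(zip(range(i1, i2), range(j1, j2)))
        else:
            # Step 4: collect deleted (left) and added (right) line indices.
            left_lines.extend(range(i1, i2))
            right_lines.extend(range(j1, j2))

    # Step 5: pair each deleted line with its closest added line, if the
    # combined distance is below the threshold (greedy, one use per line).
    used = set()
    for i in left_lines:
        best_j, best_d = None, threshold
        for j in right_lines:
            if j in used:
                continue
            ctx_sim = cosine_similarity(context(left, i), context(right, j))
            d = (content_weight * content_distance(left[i], right[j])
                 + (1 - content_weight) * (1 - ctx_sim))
            if d < best_d:
                best_j, best_d = j, d
        if best_j is not None:
            mapping[i] = best_j
            used.add(best_j)
    return dict(sorted(mapping.items()))
```

Calling `map_lines(old_source, new_source)` returns a dictionary from 0-based left line indices to 0-based right line indices. Left lines with no entry are treated as deleted, and right lines that never appear as a value are treated as added.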

See this research paper for details about the algorithm.

Differences from the research paper

The paper describes the use of simhash to improve performance. This implementation does not perform this optimization because the performance seems "good enough".
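
For readers unfamiliar with simhash, the sketch below shows the general technique: each text is reduced to a 64-bit fingerprint so that similar texts tend to differ in few bits, and candidates can be compared by Hamming distance before the more expensive Levenshtein and cosine computations. It illustrates simhash in general and is not taken from the paper or from this implementation; the function names, the whitespace tokenization and the choice of BLAKE2b as the token hash are assumptions.

```python
import hashlib

BITS = 64  # fingerprint width (assumed for this sketch)


def simhash(text: str) -> int:
    """64-bit simhash over whitespace-separated tokens."""
    counts = [0] * BITS
    for token in text.split():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for k in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            counts[k] += 1 if (h >> k) & 1 else -1
    # A bit is set in the fingerprint when its votes are net positive.
    return sum(1 << k for k in range(BITS) if counts[k] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

A pre-filter built on this would keep, for each leftLine, only the few rightLines whose fingerprints have the smallest Hamming distance, and compute the full distance for those alone.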

The paper also describes an option to detect line splitting. This is not implemented.