SynDiff - Generalized Syntactic Differ Tool
Wouldn't it be nice to have a generalized syntactic differ?
Definition
Well, let's go in reverse order: differ, syntactic, generalized:
- 'diff' is what you expect it to be. It is a tool that can highlight differences between at least two documents.
- 'syntactic' means that the input is parseable according to an LL(1) grammar.
- 'generalized' means that for any language which is defined by an LL(1) grammar we can diff two sentences that are in the language defined by the grammar.
Back Story
The idea for this tool arose from a problem I encountered during an internship in the summer of 2007. I was working on checked-out code that looked like the following:
... various code ... public funcOne(arguments) { ... lots of complicated code ... } public funcTwo(arguments) { ... lots of complicated code ... } public funcThree(arguments) { ... lots of complicated code ... } ... various code ...
I was working hard to understand the code and so made the following trivial change:
... various code ... public funcTwo(arguments) { ... lots of complicated code ... } public funcThree(arguments) { ... lots of complicated code ... } public funcOne(arguments) { ... lots of complicated code ... } ... various code ...
Later, I made three simple changes and tried to check-in my code. The check-in failed! Why? Because I had made too many changes according to the company's diff-ing tool. I objected because I knew I hadn't changed anything substantial and determined the problem to be the rearranging of functions! I decided that a diff-ing tool which was aware of the syntax and knew which elements did not depend on order would be a better tool.
Simple Example
Consider the following grammars in EBNF form:
Grammar 1: S -> Line S | Line Line -> [a string of zero or more characters followed by a newline character]
Grammar 2: S -> ( Line )* Line -> [a string of zero or more characters followed by a newline character]
Now consider the following fictional text files:
file1.txt: one two three four
file2.txt: three two one four
Then the SynDiff tool would report the following differences in the following two files:
According to grammar 1: Difference in token 'Line' at line 1, column 1: file1.txt: one file2.txt: three Difference in token 'Line' at line 3, column 1: file1.txt: three file2.txt: one
According to grammar 2: No differences to report.
Notice that when the Kleene star operator is used, ordering is assumed to be unimportant. Since ordering of the lines is unimportant in grammar 2, there are no differences between the files according to SynDiff.
Uses
- Version Systems - A version system may be be able to keep a better record of meaningful changes.
- Regression Tests - A syntactic differ may yield a better understanding of what has actually changed.
User Interface
Command line.
Usage: syndiff -[options] file1 file2
Options:
- g - guess the syntax of the files based on their extensions
- j - use the java syntax parser
- t - use the text syntax parser
- p - use the python syntax parser
Drawbacks
- More memory.
- More processing power.
Why do the drawbacks not matter? Because we are constantly building faster and faster computers with more and more memory. Check out Moore's Law for more information.
Feasibility Research
Is it possible? YES! For the following reasons:
- S. Chawathe, A. Rajaraman, H. Garcia-Molina and J. Widom, “Change Detection in Hierarchically Structured Information”, Proceedings of the ACM SIGMOD International Conference on Management of Data, Montreal, June 1996.
- D. S. Hirschberg, “Algorithm for the Longest Common Subsequence Problem”, Journal of the ACM, 24(4): 664-675, October 1977.
- C. M. Hoffmann, M. J. O’Donnell, “Pattern Matching in Trees”, Journal of the ACM, 29: 68-95, 1982.
- E. W. Myers, “An O(ND) Difference Algorithm and Its Variations”, Algorithmica, 1(2): 251-266, 1986.
- S. M. Selkow, “The Tree-to-Tree Editing Problem”, Information Processing Letters, 6(6): 184-186, 1977.
- K. C. Tai, “The Tree-to-Tree Correction Problem”, Journal of the ACM, 26: 485-495, 1979.
- K. Zhang and D. Shasha, “Simple Fast Algorithms for the Editing Distance between Trees and Related Problems”, SIAM Journal of Computing, 18(6): 1245-1262, 1989.
- K. Zhang, R. Statman, D. Shasha, “On the Editing Distance between Unordered Labeled Trees”, Information Processing Letters, 42: 133-139, 1992.
- K. Zhang, “A New Editing based Distance between Unordered Labeled Trees”, Combinatorial Pattern Matching, 1: 254 – 265, 1993.
Tools Used
For development:
For code creation:
For code versioning: