SynDiff - Generalized Syntactic Differ Tool

Wouldn't it be nice to have a generalized syntactic differ?

Definition

Well, let's go in reverse order: differ, syntactic, generalized:

'diff' is what you expect it to be. It is a tool that can highlight differences between at least two documents.
'syntactic' means that the input is parseable according to an LL(1) grammar.
'generalized' means that for any language which is defined by an LL(1) grammar we can diff two sentences that are in the language defined by the grammar.

Back Story

The idea for this tool arose from a problem I encountered during an internship in the summer of 2007. I was working on checked-out code that looked like the following:

... various code ...
public funcOne(arguments) {
  ... lots of complicated code ...
}
public funcTwo(arguments) {
  ... lots of complicated code ...
}
public funcThree(arguments) {
  ... lots of complicated code ...
}
... various code ...

I was working hard to understand the code and so made the following trivial change:

... various code ...
public funcTwo(arguments) {
  ... lots of complicated code ...
}
public funcThree(arguments) {
  ... lots of complicated code ...
}
public funcOne(arguments) {
  ... lots of complicated code ...
}
... various code ...

Later, I made three simple changes and tried to check-in my code. The check-in failed! Why? Because I had made too many changes according to the company's diff-ing tool. I objected because I knew I hadn't changed anything substantial and determined the problem to be the rearranging of functions! I decided that a diff-ing tool which was aware of the syntax and knew which elements did not depend on order would be a better tool.

Simple Example

Consider the following grammars in EBNF form:

Grammar 1:
S -> Line S | Line
Line -> [a string of zero or more characters followed by a newline character]

Grammar 2:
S -> ( Line )*
Line -> [a string of zero or more characters followed by a newline character]

Now consider the following fictional text files:

file1.txt:
one
two
three
four

file2.txt:
three
two
one
four

Then the SynDiff tool would report the following differences in the following two files:

According to grammar 1:
Difference in token 'Line' at line 1, column 1:
file1.txt: one
file2.txt: three
Difference in token 'Line' at line 3, column 1:
file1.txt: three
file2.txt: one

According to grammar 2:
No differences to report.

Notice that when the Kleene star operator is used, ordering is assumed to be unimportant. Since ordering of the lines is unimportant in grammar 2, there are no differences between the files according to SynDiff.

Uses

Version Systems - A version system may be be able to keep a better record of meaningful changes.
Regression Tests - A syntactic differ may yield a better understanding of what has actually changed.

User Interface

Command line.

Usage: syndiff -[options] file1 file2

Options:

g - guess the syntax of the files based on their extensions
j - use the java syntax parser
t - use the text syntax parser
p - use the python syntax parser

Drawbacks

More memory.
More processing power.

Why do the drawbacks not matter? Because we are constantly building faster and faster computers with more and more memory. Check out Moore's Law for more information.

Feasibility Research

Is it possible? YES! For the following reasons:

S. Chawathe, A. Rajaraman, H. Garcia-Molina and J. Widom, “Change Detection in Hierarchically Structured Information”, Proceedings of the ACM SIGMOD International Conference on Management of Data, Montreal, June 1996.

D. S. Hirschberg, “Algorithm for the Longest Common Subsequence Problem”, Journal of the ACM, 24(4): 664-675, October 1977.

C. M. Hoffmann, M. J. O’Donnell, “Pattern Matching in Trees”, Journal of the ACM, 29: 68-95, 1982.

E. W. Myers, “An O(ND) Difference Algorithm and Its Variations”, Algorithmica, 1(2): 251-266, 1986.

S. M. Selkow, “The Tree-to-Tree Editing Problem”, Information Processing Letters, 6(6): 184-186, 1977.

K. C. Tai, “The Tree-to-Tree Correction Problem”, Journal of the ACM, 26: 485-495, 1979.

K. Zhang and D. Shasha, “Simple Fast Algorithms for the Editing Distance between Trees and Related Problems”, SIAM Journal of Computing, 18(6): 1245-1262, 1989.

K. Zhang, R. Statman, D. Shasha, “On the Editing Distance between Unordered Labeled Trees”, Information Processing Letters, 42: 133-139, 1992.

K. Zhang, “A New Editing based Distance between Unordered Labeled Trees”, Combinatorial Pattern Matching, 1: 254 – 265, 1993.

Tools Used

For development:

Eclipse

For code creation:

Javacc
JTB

For code versioning:

Syndiff hosted by Google Code

Trace: » Membership » Recovering Sight to the Blind » Partnership Gathering Sharing » Sermon on Generosity » Room of Grace » Just Courage » Pagan Christianity » Till We Have Faces » christian_community » SynDiff - Generalized Syntactic Differ Tool

projects/syndiff/details.txt · Last modified: 2008/03/31 15:43 by grant