Requirements
Introduction
Significance
syndiff
is primarily intended as a proof-of-concept. We wish to show that it is possible to compare files in a more intelligent manner than simply line-by-line. Current differencing methods are based on the Longest Common Subsequence (LCS) algorithm. If our comparison tool can syntactically “understand” the file it is looking at, it can highlight only the changes that are relevant to the interpretation of the file, and ignore changes that do not alter the underlying meaning of the file. It appears that the LCS algorithm is a special case of the algorithm we plan to implement. We intend to implement syntax-aware differencing for MiniJava, Java, and Python files.
Goals
These goals are presented in order of priority. If only partial aspects of the lowest priority goals are achieved then the project is still to be considered a success.
- The highest priority of this project is to be able to compare MiniJava programs. MiniJava is a subset of Java with many ordered and unordered elements in the syntax of the language. It should allow us to give a rapid proof-of-concept of our ability to syntactically differ files. The subset java language is still fully functional as a programming language and therefore adds considerable validity to the project.
- From MiniJava, we will expand
syndiff
's capabilities to include comparison of regular Java files. This will be a matter of using our previous experience with MiniJava to implement syntax-aware differencing to the whole of the Java language. This expansion will not include the differencing of comments. The focus on achieving this goal is to determine changes in the syntax of a program which will affect its interpretation. - Beyond Java, we will focus on comparing Python programs. This may present a new challenge due to the syntactic value that Python places on whitespace. This goal does not further prove our ability to syntactically differ files beyond our Java demonstrations, but does add valuable functionality to
syndiff
. Again, the focus on achieving this goal is to determine changes in the syntax of a program which will affect its interpretation or compilation. - Having achieved the ability to syntactically difference MiniJava, Java, and Python programs we will add support for differencing comments that may be present in MiniJava files. Support for this feature may be limited as grammars and lexers usually discard comments. Specifically, this may restrict where comments can be for differencing. We should reiterate that the primary goal for
syndiff
is to highlight syntactic differences which will affect the operation of the file. - If, in accomplishing these things, we discover that the implementation of various languages is formulaic, we would like to include some modular ability to accept new languages with ease, based on some standard specification of the language. This goal will probably be accomplished only in the form of a help file which provides information on how to modify the code to support another language. This goal does not require a kind of on-the-fly ability to add further language support to be achieved.
Project Organization
The organization of the syndiff
project is based on an agile development methodology. The three of us (Grant, Robert, and Jason) are considered development peers. Grant is also the client and is therefore responsible for adequately describing the problem. In development, each member will be responsible for testing their own code with unit tests and working together to test the whole program with use-cases. To facilitate the use-cases we will each develop and find “real-world” Minijava, Java, and Python files to difference. The following is a high-level distribution of responsibilities in the syndiff
project:
- Grant will head up the use of Javacc and JTB for lexing, parsing, and tree building due to his familiarity and experience with the tools. This will require tailoring the automatically generated code for efficiency to the differencing algorithm's needs.
- The three of us will work together on the differencing algorithm. This includes finding and reading published papers on the topic of tree differencing. We will work together to make the present research into a practical tool. Our feasibility research indicates that the algorithm lends itself to two subparts.
- Robert will focus on the side of the algorithm that will deal with comparing items in order (much like
diff
does now). This is known as ordered tree comparison. - The three of us will work together on implementing the side of the algorithm that deals with comparing unordered items. This is known as unordered tree comparison.
- Jason will focus on writing an output module to express differences in a predefined and regular manner. This part is nontrivial since wide language support will require handling the various formats of different languages.
Risk Analysis
The chief risk in undertaking a proof-of-concept project such as syndiff
is that the concept may be inherently flawed. We can break this risk down into three parts. First is that the task of differencing two files may be algorithmically slow and provably such. Some preliminary research indicates that certain cases of the problem are known to be NP-Complete. As a result, we could add command line flags that would increase the speed of a run of syndiff
at the expense of accuracy.
Another concern is that syndiff
may require more memory than is acceptable for its intended application (as a replacement for diff
). To mitigate this risk, we could add options to limit the depth of the parse tree generated or length of the file read. This optimization will require, as above, a loss of accuracy.
Finally, we may not be able to combine the comparison of ordered and unordered tree properties in a single algorithm. If this is the case, we could add options to use one algorithm or the other, and take steps to present false positives, but never false negatives.
Resource Requirements
People
Our project has essentially no people requirements outside of the team. Since Grant is both a developer and the client, the team is very self contained. Furthermore, three developers with the tools Javacc and JTB and with the time frame of about 5 weeks for development, test, and deployment should be sufficient for this proof-of-concept project. The people that have worked to develop these tools and publish the research basis are not considered a direct part of our people resources.
Hardware
The hardware requirements for this project are simple. The software that we develop will be designed to run atop the Java Virtual Machine (JVM). For all intents and purposes, the JVM is the hardware that we require for this project. Indirectly then, the JVM has hardware requirements which essentially are met by any off the shelf computer bought today. The JVM provides a hardware abstraction which we will rely on extensively.
Software
The software requirements for this project depend entirely on Open Source Software (OSS). Our development depends on four OSS projects: Javacc, JTB, Eclipse, and the JVM. Each of these project are well established and have reached stable production cycles. Our team feels confident in using these tools for development in this project. Specifically we will use Javacc and JTB for automatic code generation. This code will be tailored to our project as necessary. Eclipse will be used as our development environment. Some research has already shown that OSS plug-ins for Eclipse allow for integration with Subversion (SVN) repositories. Finally, the JVM will be used as the platform for running our software. All of these OSS projects have licenses that allow their using in the above described ways. Furthermore, the projects allow for developing commercial products without a fee.
Data
The data required for this project can be broken into two parts. Data is needed for both initial development and test. Data for initial development is ”.jj” files which will be used with Javacc for parsing and lexing. These initial files can be acquired for free online from the Javacc database of ”.jj” files. Further extension of our tool to other languages will require us to develop these files ourselves. The second set of data required for this project are for test. In test, we will use real-world files from versioning systems in our end-to-end use-cases. These tests will require a small collection of files with various changes which we can use to assess the accuracy of our program. Aside from ”.jj” files and real-world test cases we won't need any kind of database or collection of user information in our program.
Timeline
The timeline for the syndiff
project is as follows:
- 08/02/22 - Subset Implementation
In the subset implementation we will have a functional version of syndiff
which will allow users to syntactically difference minijava files. The project goal for this deadline will be to focus on the algorithm and correctness (from the viewpoint that we wish to remove all false negatives) over performance speed and efficiency. This version of syndiff will also provide command line flags for debug and verbose modes of operation. This subset implementation corresponds to the primary goal and does not provide support for comments. Furthermore, the output module will not be fully complete at this time and so output may only appear in a kind of raw format.
- 08/03/07 - Full-Featured Implementation
The full-featured implementation will provide syntactic differencing capabilities for Java and Python as well as continuing support of MiniJava. This version will work to improve performance as well as removing false positives from the output. This implementation corresponds to achieving the first three goals with some support for the fourth goal. The output format at this stage will be improved but still not finished in design.
- 08/03/14 - Polished Version
The completed version of syndiff
will provide the user with different options for performing the diff. This version will support user help flags as well as flags designed to improve performance (i.e. run only ordered/unordered diff). The output will be clean and formatted in a consistent framework. This version will achieve all five goals according to their original priority.