Software Requirements Specification
Overview
Purpose
Syndiff is intended to expand on common differencing engines. The purpose of Syndiff is to intelligently perform a diff on two files. The goal of our comparison tool is to syntactically “understand” two versions of a file, and highlight only the changes that are relevant to the interpretation of the file, and ignore changes that do not alter the underlying meaning or functionality of the file. The Syndiff algorithm will be developed for the language, MiniJava which is a subset of Java. After this proof-of-concept is complete, support will be extended to Java and Python files. The development of Syndiff would aid software developers, code versioning systems, and regression test systems primarily.
Intended Audience
The intended audience of this document is four fold. It is intended for developers, testers, documentation writers, and users. Our small development team for this project has opted out of a hierarchy of management. Furthermore, people interested in selling or marketing Syndiff should know that our tool is currently licensed as open source software and is free to use.
This document is divided in two main parts. The first part describes functional requirements while the second part describes non-functional requirements. The functional requirements gives the project Overview, Interface, Functional Capabilities, and Data Structures. In the second the part the non-functional requirements describe the Performance, Safety, Quality, Constraints, and Reliability aspects of the project.
Developers should read this document and identify with its goals. They may want to read the appendix section which includes references to feasibility research papers and then the System Features to begin algorithm development. With a grasp of the technical concepts and background, the developer for Syndiff should read the entire document from beginning to end for a high-level understanding of the project requirements. After this, a sufficient understanding of the requirements will be present.
Testers and documentation writers should read this for two reasons. First, they will want to understand the project at a top-level and second they will need to have an intimate understanding of how it works. These project contributors may want to read the non-functional requirements sections first before reading through this entire document. Understanding the main intent of this project will be critical to the development of good documentation and test cases.
Users will be interested in this document in order to gauge interest in the utility of our project. Because our intended user base is primarily other developers who will have technical backgrounds, these people will probably want to read through the System Features before the non-functional requirements sections and then finally the entire document. This will allow users to quickly judge whether the Syndiff project will be useful for their needs.
Overall, this document is intended to give a description of Syndiff, while providing guidelines for the development of the project. If questions arise please direct them to the development team.
Contact Info
- Team name: SynDiff
Please send general questions/comments to ffidnys@gmail.com
Project Team:
- Developer/Client: Grant Jenks
- Email: tnarg.sknej@gmail.com
- Developer: Robert Nakamoto
- Email: otomakansr@ucla.edu
- Developer: Jason Stoops
- Email: jspoots@gmail.com
Please include the string ”[SynDiff]” in the email subject for all project communications.
Interface
User
The user interface for syndiff
will be similar to that of diff
in that it will be command line based. There will be a set of command line options specific to syndiff
which are described below, but beyond that, use will be similar to diff
. This will facilitate the use of syndiff
as a “smarter”, drop-in replacement for diff
. Syntax will be as follows:
''syndiff [__OPTIONS__] __PATH TO FILE1__ __PATH TO FILE2__''
We expect to implement the following command line options:
- -j Specifies Java as the target language.
- -m Specifies MiniJava as the target language.
- -p Specifies Python as the target language.
- -d Prints debug output to standard error.
- -v Prints verbose output to standard out.
- --help Writes usage and basic help information to standard out.
- --version Outputs version information to standard out.
- -p Displays a progress indicator during processing to standard out.
- -o Forces Syndiff to use only the unordered algorithm.
- -u Forces Syndiff to use only the ordered algorithm.
Hardware
Our target platform for syndiff
is the Java VM, and therefore any platform capable of running the Java VM. This abstraction allows us to rapidly deliver a functional, portable, proof-of-concept. We are aware that we sacrifice some speed in using the Java VM. We believe the gains of this platform such as memory management, exception handling, and hardware specifics abstraction far outweigh the speed disadvantage. The Java VM will then take care of the interface between our program and the hardware.
Software
We plan to extensively use Javacc
(Java Compiler Compiler) and JTB
(Java Tree Builder), and to some extent change these tools in developing our application. However, releases of our application will absorb any needed functionality from javacc
and jtb
such that syndiff
is the only application needed by the end user at runtime. Interfacing with other software that builds atop syndiff
's functionality will be accomplished through passing filenames as arguments at startup.
These software tools communicate via .jj files which describe an LL(1) grammar. General purpose versions of these files can be obtained from open source repositories. We will change these grammar as needed for our project. Even though these files will not necessarily be published with our tool, they are an important part of the project.
Communication Protocol + Interface
Syndiff
has no communication protocol with other software. Once the files to be compared are passed and the target grammar specified on the command line, Syndiff
has all the information it needs to run. The application will then print its results in a predefined, regular manner to its standard output, from which other applications may parse the output and do what they may with the results. Aside from the fundamental system interface through the Java VM, there is no communication protocol and interface.
Functional Capabilities
The functional capabilities described in this section are in order of priority from highest to lowest. It is important to note that the first three functionalities are mutually exclusive.
MiniJava Syntactic Differencing
The highest priority of this project is to be able to compare MiniJava programs. MiniJava is a subset of Java, and therefore it will allow us to give a rapid proof-of-concept of our ability to syntactically differ files without getting bogged down in recognizing all of Java.
This function will be triggered by the ”-m” option on the command line.
Specific requirements for the completion of this function are as follows:
- The classes within a MiniJava file should be compared in an unordered way.
- The fields within a MiniJava class should be compared in an unordered way.
- The methods within a MiniJava class should be compared in an unordered way.
- The fields within a MiniJava method should be compared in an unordered way.
- All other parse nodes should be compared in an ordered way.
- Invalid MiniJava programs will be rejected and not compared in any way.
Java Syntactic Differencing
From MiniJava, we will expand syndiff
's capabilities to include comparison of regular Java files. This should just be a matter of expanding our grammar for parsing MiniJava once we have a working algorithm in place. Expanding the grammar and verifying a robust system is non-trivial however and so this, as with the above function, is a high priority, major milestone.
This function will be triggered by the ”-j” option on the command line.
Specific requirements for the completion of this function are as follows:
- The classes within a Java file should be compared in an unordered way.
- The methods within a Java class should be compared in an unordered way.
- All other parse nodes should be compared in an ordered way.
- Invalid Java programs will be rejected and not compared in any way.
Python Syntactic Differencing
Beyond Java, we will focus on comparing Python programs. This may present a new challenge due to the syntactic value Python places on whitespace. This goal does not further prove our ability to syntactically differ files beyond our Java demonstrations, but does add valuable functionality to syndiff
. New challenges surrounding the grammar will be posed by support for this language.
This function will be triggered by the ”-p” option on the command line.
Specific requirements for the completion of this function are as follows:
- Parse nodes, which are dependent on order, will be compared in an ordered way.
- Parse nodes, which do not depend on order, will be compared in an unordered way.
- Invalid Python programs will be rejected and not compared in any way.
Basic Comment Support for MiniJava
Some parts of the MiniJava grammar will be retooled for basic comment support. Most lexers and parsers discard comments which makes this function non-trivial.
There will be no specific trigger for this functionality. This improves the existing MiniJava differencing functionality.
Specific requirements for the completion of this function are as follows:
- Successful differencing of block comments in popular program areas. This does not mean that block comments may be placed anywhere. Rather, block comments will be supported at the beginning of files, classes and methods.
- In addition to block comments, single line comments will be supported in popular places. These places include after statements, at the beginning of functions, and within classes.
Method of Adding Additional Language Support
Finally, of lowest priority, is support for additional languages. LL(*) grammars exist for other languages such as C++, C#, and Fortran 77. Support for these languages should be provided in future implementations of Syndiff. The point of this function is to allow support for the adding of languages. This feature will be implemented either as an additional tool that works with Syndiff or as simply a How-To help file with specific directions.
The stimulus for this functionality is completely up to the user.
Specific requirements for the completion of this function are either part of option one or option two:
Option 1: Document
- Explain basics of constructing grammar file.
- Explain specifically how to change Syndiff code for added language support.
- Explain specifically how to change build system for added language support.
Option 2: Tool
- Provide tool that automatically changes and repackages Syndiff tool.
- Explain how to use tool in a help document.
Data Structures / Elements
The most significant data structures in this project can be broken down into four sections. These sections are Program Start, Tree Building, Differencing, and Output. These sections describe the linear flow of data though the syndiff
program as it is transformed and analyzed. Fortunately, all the data structures described are implementable in Java.
Program Start
The initial Program Start phase includes a data object with various global values. The point of this data object is to present fields which may be set once and then read however many times. It should be an exception to read a value before it has been written and to write a value more than once. This data object will contain run-time values which are parsed from the command line and constants that allow for tuning the program.
The above requirements described a data object which may throw an exception. This requirement means that another data object is necessary which contains the various possible custom exceptions which are developed in this project. The exception classes will allow for better tailored error handling during program execution.
Tree Building
The tree building phase uses many data objects. These data objects are represented by code which will be automatically generated from the Javacc and JTB tools. Fundamentally this code defines a “Node” object which will be the parent class for a wide range of subclasses. The “Node” objects that are currently generated by JTB contain fields which are pointers to other “Node” objects. Our changes to JTB will add sub-node lists which will allow for the programmatic traversal of any syntax tree.
Essentially, the data structure that results from the Tree Building phase is a tree of nodes that represent the parse tree of the input file. This data structure will have a root node called “Goal.” This data structure will also have no cycles so that it may truly be called a tree. Since the tree is a parse tree, it is not necessarily binary as is common in many areas of computer science or unary as is the case in the general differencing algorithm but rather n-ary.
In addition to the “Node” objects which are generated, the code generation tools produce “Visitor” objects which are part of the design pattern for Visitors. This design pattern will require recursion and object oriented features in the language. This shouldn't be a problem since Java has a lot of support for the polymorphism and data encapsulation which is needed in this design pattern.
Differencing
The next phase of program execution is differencing. The algorithm that is currently being considered describes a dynamic programming approach wherein tables of scores are kept in order to find the minimum editing distance between trees. The “Table” objects that are used in the differencing algorithm will work together with the visitor objects that traverse the parse trees to determine differences.
The “Table” object will probably work in two parts. The first part simply records different variations on edits. The second part determines the minimum set of changes from the table. This is important since it may be a portion of the program with non-polynomial space requirements. The previously described phases have either constant or linear space requirements in the size of the input program. This part, however, may be exponential in size.
Output
The last phase of program execution is output. This phase displays the changes calculated by the differencing algorithm. This part is based primarily on a single data structure. The single data structure will be a change list which is the result of the calculations of the differencing algorithm. The change list will include information present in the “Node” such as the name in the grammar and the corresponding string in the file.
Performance
Several research papers (specifically K. Zhang and D. Shasha, “Simple Fast Algorithms for the Editing Distance between Trees and Related Problems”) indicate that finding the edit distance between two unordered, labelled trees is an NP-complete problem. As a result, if syndiff
were given two arbitrary files, it could potentially run extremely slow. However, in practical usage, we expect that the files being compared will be similar enough that we will be able to complete the processing of differences in a reasonable amount of time.
This compels us to come up with “average” situations in which our tool may be used, and measure performance in these situations, since this would provide a more realistic estimate of whether or not the performance of syndiff
is good enough for the tool to be used in the real world. If not, the tool is at least a proof that a program such as syndiff
can understand meaningful differences between two files.
Finally, we are considering adding command line flags to alter the performance behavior of syndiff
. Since the differencing engine will have two major components (for ordered and unordered differencing), we will add switches to force one mode or the other for testing and performance purposes, though such options would sacrifice accuracy and brevity of output.
Safety / Security
Our application is a tool. It is primarily the implementation of an algorithm. Since the code is open source, we make no attempt to hide the underlying implementation of our program. Furthermore, our initial version will have no non-deterministic aspects and therefore is completely predictable. We believe this transparency aids in the security and safety of our program. If anyone can detect a security or safety flaw in the code then they are urged to contact the syndiff
project immediately.
As far as safety is concerned, syndiff
will be safe in the sense that it will not alter the files it is comparing, or any other files for that matter. Syndiff
treats all inputs as read-only and output is only directed to standard out.
Also along these lines, syndiff
will be secure in that it will not create or leave behind intermediate files such that the security of the source files may be compromised. If intermediate files were created, these files might have different permissions than the originals and unauthorized access to the original data may occur. These design choice potentially increases the memory required for syndiff
but we feel the benefit of security is worth it.
Quality
The quality of syndiff
will be directly related to the output it produces, and the time and space resources necessary to produce the output. The greatest concern in quality is not to indicate any incorrect edits. Regardless of time taken or resources required, syndiff
will always produce an edit list that is technically correct, if not of minimal length.
Minimization of the length of the list of differences found is our next concern. We wish to find the minimal number of edits to wind up with files that are syntactically the same.
Finally, we will seek to minimize the time and space requirements, but not at the expense of the above goals, unless behavior-altering command line switches are passed overriding this. Since syndiff
is a proof-of-concept first and a real world tool only if it winds up performing satisfactorially, time and space requirements take a backseat to correctness and minimal edit distance in the default behavior.
Constraints & Limitations
We are constraining our initial language support to Minijava and Java. In this way, we can prove that syntactically differing files is possible, without becoming bogged down in differences in the way that various languages are parsed.
From a performance standpoint, syndiff
may be limited due to the fact that finding the edit distance between two unordered trees is an NP-complete problem. This is something that will be assessed as construction is underway, but is less important than correctness in this proof-of-concept project.
Reliability
The reliability of syndiff may change depending on the arguments passed in. Syndiff may produce different results on two files if flags are used to modify the normal running process of syndiff (i.e. perform diff on unordered or ordered only, memory space limitations, etc). Syndiff, however, will produce consistent results if the same arguments are used while performing the diff. Syndiff will output results required to conform the files. From a legal standpoint, we direct the reader to the GPL version 2.
Documentation
Provided documentation will be divided into two primary sections, project and user documentation.
Project
We have provided an online wiki (http://grantsweb.com/doku.php?id=ideas:syndiff) which gives a synopsis of the syndiff project along with evolving information regarding the status and ideas behind syndiff. We have provided a requirements specification (which can be found on the forementioned wiki) and plan to release a design specification in the coming weeks. Further documentation concerning the project will provide unit test cases and end-to-end tests for the verification of the correctness of syndiff. Also found on the wiki are technical papers which serve as our proof-of-concept and provide detailed analysis of algorithms supporting the feasibility of the project.
User
User documentation will focus on user support in the form of help and examples. Syndiff will provide flags for user help at the command line along with examples showing sample usage to simplify the learning curve. The wiki is another valuable source of information for users of syndiff providing examples. We would like to provide a man page (similar to that of the current command line diff) which would present the above information in a comprehensive, central location.
Appendix A: Feasibility Research
The following papers are results of the original feasibility research done for this project:
- S. Chawathe, A. Rajaraman, H. Garcia-Molina and J. Widom, “Change Detection in Hierarchically Structured Information”, Proceedings of the ACM SIGMOD International Conference on Management of Data, Montreal, June 1996.
- D. S. Hirschberg, “Algorithm for the Longest Common Subsequence Problem”, Journal of the ACM, 24(4): 664-675, October 1977.
- C. M. Hoffmann, M. J. O’Donnell, “Pattern Matching in Trees”, Journal of the ACM, 29: 68-95, 1982.
- E. W. Myers, “An O(ND) Difference Algorithm and Its Variations”, Algorithmica, 1(2): 251-266, 1986.
- S. M. Selkow, “The Tree-to-Tree Editing Problem”, Information Processing Letters, 6(6): 184-186, 1977.
- K. C. Tai, “The Tree-to-Tree Correction Problem”, Journal of the ACM, 26: 485-495, 1979.
- K. Zhang and D. Shasha, “Simple Fast Algorithms for the Editing Distance between Trees and Related Problems”, SIAM Journal of Computing, 18(6): 1245-1262, 1989.
- K. Zhang, R. Statman, D. Shasha, “On the Editing Distance between Unordered Labeled Trees”, Information Processing Letters, 42: 133-139, 1992.
- K. Zhang, “A New Editing based Distance between Unordered Labeled Trees”, Combinatorial Pattern Matching, 1: 254 – 265, 1993.