Grammar Inductor

Status

I've made some progress on this. It's kind of cool. I have three tools now:

Tokenizer
Sequitur
Visualizer

Tokenizer uses JTB to translate a minijava program into its tokens. Sequitur looks for patterns and generates a compression grammar for the input stream. Visualizer rewrites the compression grammar for dot.

On my laptop, I can make nifty pictures running the command:

type test\Factorial.java | java -jar jars\Tokenizer.jar | java -jar jars\Sequitur.jar | Visualizer\Visualizer.exe | dot -Tjpg > test\Factorial.java.jpg

Onfortunately, “dot” requires too much memory for some of the tests. I'm still playing with it though.

The Visualizer.exe is written in F#. I'm planning to write other F# programs to do the grammar analysis.

If you are brave, you can see a sample image here.

English Example

I wanted to provide an English example so that more people might understand what I'm trying to do. It helps me too!

So imagine you start with a set of English sentences and you're trying to learn the English grammar. You've never been told the English grammar but you know the sentences and have some cool tools.

Here's our English sentences:

1. Balloons bounce.
2. The balloon bounced.
3. The big balloon bounced.
4. The big red balloon bounced.
5. The big red balloon bounced quickly.
6. The big red balloon bounced in the driveway.
7. The balloon bounced quickly in the driveway.

The “Tokenizer” tool that I created above works to translate the sentences into their parts of speech:

1. Noun Verb
2. Det Noun Verb
3. Det Adj Noun Verb
4. Det Adj Adj Noun Verb
5. Det Adj Adj Noun Verb Adverb
6. Det Adj Adj Noun Verb Prep Det Noun
7. Det Noun Verb Adverb Prep Det Noun

(Note that this is the English equivalent of what it's doing. The real “Tokenizer” is a lexer for the minijava language.)

Now, I create a pre-processing file that I can give to “Sequitur”, and then “Visualizer”, and then “Dot” and out pops this picture:

This one is actually better:

Notes

Is structural programming all inclusive? Think of the power of recursion. Hierarchy has to be included.

sequence
conditional
recursion

Algorithm Idea:

“Compression and Explanation using Hierarchical Grammars” to induce grammar. Build JavaCC file from this.
Process JavaCC file
Use SynDiff? to find differences between JavaCC files
Use SynDiff? results to “SynMerge” JavaCC files
repeat

Methods of combination:

linear - combine each new sentence example with existing best grammar
exponential - combine new sentences in pairs and then combine those results in pair until best grammar is reached

Algorithm Idea:

tokenize java program1
feed tokened program1 to compression grammar program
tokenize java program2
feed tokened program2 to compression grammar program
run syndiff on tokened program1 and tokened program2
merge grammars according to results of syndiff
repeat how?

Tools Available:

tokenizer for minijava program
grammar builder for minijava tokens

Idea:

use a “genetic algorithm” to blend the grammars

Merging grammars:

find all rules with same name and production terminals
- keep these
find all rules with same production terminals
- rename all these rules appropriately
find all rules with same eventual production terminals
- start as small as possible
- these are a candidate for combination in the grammar via “|”

Theory rules:

lower case letters denote Terminals
upper case letters denote either a Terminal or NonTerminal

Rule 1)
if
R_0 -> A_0, A_1, ... A_N
R_1 -> B_0, B_1, ... B_M
and
{A_x, ... A_y} => c_0, c_1, ... c_n
{B_v, ... B_w} => c_0, c_1, ... c_n
then
R_0 -> A_0, A_1, ... A_x-1 (A_x, ... A_y | B_v, ... B_w) A_y+1 ... A_N
R_1 -> B_0, B_1, ... B_v-1 (A_x, ... A_y | B_v, ... B_w) B_w+1 ... B_M

Rule 2)
if
R_0 -> A_0, ... A_N
R_1 -> B_0, ... B_M
and
{A_0, ... A_N} => c_0, c_1, ... c_n
{B_0, ... B_M} => c_0, c_1, ... c_n
then
R_2 -> R_0 | R_1

Rule 3)
if
R_0 -> A_0, ... A_N | B_0, ... B_M
and
A_x = B_v, A_x+1 = B_v+1, ... A_y = B_w
then
R_0 -> 	(A_0, ... A_x-1 | B_0, ... B_v-1)
	A_x, ... A_y
	(A_y+1, ... A_N | B_w+1, ... B_M)

How do you meaningfully “diff” two grammars? What does it even mean to do so?

Newest Notes

If order into sequitur matters, then does it always build the most compact hierarchical grammar?

Example: 2 3 / 1 2 3 / 1 2 3

Grammar: There will be a NT for “2 3” but none for “1 2” unless ”/ 1 2” is added → what happens then?

Random Idea

This would be super cool: Use a genetic algorithm to pick and choose parts of grammars in order to find the best. So like, “breed” grammars together and keep the ones that are best and continue the process.

Trace: » Talks / Speeches / Testimonies » Books » TERMS OF USE » PRIVACY POLICY » Celebrate Recovery Lesson 15: Victory » SynDiff Project » Automata » Mashup : Google Maps and Resume Data » Open Secret » Grammar Inductor

projects/grammar_inductor.txt · Last modified: 2009/10/23 20:39 by grant