ShingleCloud library is a Java library for approximate string matching. It comprises an implementation of the ShingleCloud algorithm and the Greedy String Tiling algorithm proposed by Michael J. Wise. ShingleCloud was originally developed by Arno Mittelbach under supervision of Lasse Lehmann as part of the LIS.KOM framework. Arno Mittelbach further extended and adapted the algorithm in the course of the Holinshed Project (see below).
The Greedy String Tiling (GST) algorithm proposed by Wise in [1] is a heuristic that tries to approximate the maximal tiling between needle and haystack. This problem belongs, like the Longest Common Subsequence, to the class of NP-hard problems. An effcient solution is therefore unknown and whether one exists an open question. Consequently GST has to cut back on the quality of results in order to increase effciency. However, it still has a worst-case complexity of O(n³).
The algorithm works by searching for matches of a maximum-length marking them and continuing the search ignoring marked tokens. Hence, the algorithm might prefer a larger match containing for example 6 tokens over two smaller matches containing 4 and 3 tokens, which obviously would be the better tiling. It can be proven that the algorithm is optimal if tiles of length 1 are allowed. However, allowing tiles of length 1 is usually unwise, since it reduces the algorithm to a simple word matching algorithm. For a detailed description, please refer to [1].
[1] Wise, Michael J.: String Similarity via Greedy String Tiling and Running Karp-Rabin Matching. (1993)
ShingleCloud belongs the family of n-gram overlap algorithms. The algorithm works by building a hash-table for each n-gram in the needle which is then used to test each n-gram in the haystack. From this a marker string (e.g., [NoMatch, NoMatch, Match, Match, NoMatch, Match]) is constructed which is then further processed.
This set up allows for very flexible matching, which is still very fast (ShingleCloud runs in linear time). In the example, ShingleCloud could, for example, either decide to create two small matches (one with two matching shingles and one with one matching shingle) or to create a larger match ignoring the one not matching shingle in the group of matching shingles.
To give you a general idea on how to work with the two algorithms we have compiled a few examples using different configurations. You should also have a look at the command line interface which was added in version 0.4.
This example uses a standard ShingleCloud match object configured to use 2-grams.
String haystack = "This is some text that you want to search through";
String needle = "some text you want to find";
/* preparing the match object */
ShingleCloud sc = new ShingleCloud(haystack);
sc.setNGramSize(2);
sc.setMinimumNumberOfOnesInMatch(1);
sc.setSortMatchesByRating(false);
/* searching for the needle */
sc.match(needle);
/* displaying results */
System.out.println("ShingleCloud found " + sc.getMatches().size() + " match(es).");
ShingleCloudMatch firstMatch = sc.getMatches().get(0);
System.out.println("The first match consists of " + firstMatch.getNumberOfMatchedShingles() + " shingle(s).");
System.out.println("Its indirect rating is " + firstMatch.getIndirectRating());
System.out.println("The matching shingles were: " + firstMatch.getMatchedShingles());
ShingleCloudMatch secondMatch = sc.getMatches().get(1);
System.out.println("The second match consists of " + secondMatch.getNumberOfMatchedShingles() + " shingle(s).");
System.out.println("Its indirect rating is " + secondMatch.getIndirectRating());
System.out.println("The matching shingles were: " + secondMatch.getMatchedShingles());
This would produce the following output:
ShingleCloud found 2 match(es).
The first match consists of 1 shingle(s).
Its indirect rating is 0.2
The matching shingles were: some text
The second match consists of 2 shingle(s).
Its indirect rating is 0.4
The matching shingles were: you want to
In this example we use a standard GST match object configured to find tiles with a minimum length of 2.
String haystack = "This is some text that you want to search through";
String needle = "some text you want to find";
/* preparing the GST object */
GST gst = new GST(haystack);
gst.setMinimumTileLength(2);
/* performing the match operation */
gst.match(needle);
/* displaying the results */
System.out.println("GST found " + gst.getTiles().size() + " tile(es).");
System.out.println("Containment in the needle " + gst.getContainmentInNeedle());
System.out.println("Containment in the haystack " + gst.getContainmentInHaystack());
GSTTile firstTile = gst.getTiles().get(0);
System.out.println("The first tile consists of " + firstTile.getLength() + " tokens.");
System.out.println("The matching tokens were: " + firstTile.getText());
GSTTile secondTile = gst.getTiles().get(1);
System.out.println("The second tile consists of " + secondTile.getLength() + " tokens.");
System.out.println("The matching tokens were: " + secondTile.getText());
This would produce the following output:
GST found 2 tile(es).
Containment in the needle 0.8333333333333334
Containment in the haystack 0.5
The first tile consists of 2 tokens.
The matching tokens were: some text
The second tile consists of 3 tokens.
The matching tokens were: you want to
This example uses the GSTHighlighter class. This is a utility class that uses the GST object and marks all matches in the haystack with a customizable tag.
String haystack = "This is some text that you want to search through";
String needle = "some text you want to find";
GSTHighlighter highlighter = new GSTHighlighter();
highlighter.setOpeningDelimiter("[[");
highlighter.setClosingDelimiter("]]");
highlighter.setMinimumTileLength(2);
String highlightedText = highlighter.produceHighlightedText(haystack, needle);
System.out.println(highlightedText);
This would generate the following output:
This is [[some text]] that [[you want to]] search through
In case your input data is in an XML format (such as for example XHTML) you can use the XMLHighlighter which uses the GST object in XML mode.
String haystack = "This <italic>is some</italic> text that <marker arg1=\"value1\">you</marker> want to search through";
String needle = "some text you want to find";
XMLHighlighter highlighter = new XMLHighlighter();
highlighter.setOpeningDelimiter("<matchedTile>");
highlighter.setClosingDelimiter("</matchedTile>");
highlighter.setMinimumTileLength(2);
String highlightedText = highlighter.produceHighlightedText(haystack, needle);
System.out.println(highlightedText);
This would generate the following output:
This <italic> is <matchedTile>some </matchedTile></italic><matchedTile> text</matchedTile> that <marker arg1="value1"> <matchedTile>you </matchedTile></marker><matchedTile> want to</matchedTile> to search through
In version 0.4 we have added a simple command line interface that allows you to run shinglecloud and GST. It does not give you all the flexibility but should be sufficient for simple comparisons. If you want to use the command line interface you additionally need the JOpt library. We have tested the stringmatching library with jopt-simple-3.2.jar.
To run shinglecloud from the command line use the following command:
java -classpath jopt-simple-3.2.jar:shinglecloud-0.4.jar de.tud.kom.stringmatching.cli.ShingleCloudCli -h
This displays a help message describing all available parameters. If you, for example, want to compare two files with an n-gram size of 2 use the following.
java -classpath jopt-simple-3.2.jar:shinglecloud-0.4.jar de.tud.kom.stringmatching.cli.ShingleCloudCli --needle=needle.txt --haystack=haystack.txt --ngram=2
You can also directly pass in the search strings and constrain the output to only one containment measure:
java -classpath jopt-simple-3.2.jar:shinglecloud-0.4.jar de.tud.kom.stringmatching.cli.ShingleCloudCli --needledirect="This is the needle that i want to find in my haystack" --haystackdirect="My haystack hopefully contains the needle that i am looking for" --ngram=2 --output=containmentneedle
Which would simply print out: 0.333333. If you add the option minimum1=1, meaning that a match is counted if it contains of a single matching n-gram, then the containment would be 0.5
To run GST from the command line us the following command:
java -classpath jopt-simple-3.2.jar:shinglecloud-0.4.jar de.tud.kom.stringmatching.cli.GstCli -h
This again displays a help message. If you, for example, want to compare two protein sequences (in the example the LCK_HUMAN and the VAV_HUMAN) with a tilelength of 3 and without any preprocessing you could run gst using the following command:
java -classpath jopt-simple-3.2.jar:shinglecloud-0.4.jar de.tud.kom.stringmatching.cli.GstCli --tokenizer=character --tilelength=3 --preprocess=none --needledirect="GCGCSSHPEDDWMENIDVCENCHYPIVPLDGKGTLLIRNGSEVRDPLVTYEGSNPPASPLQDNLVIALHSYEPSHDGDLGFEKGEQLRILEQSGEWWKAQSLTTGQEGFIPFNFVAKANSLEPEPWFFKNLSRKDAERQLLAPGNTHGSFLIRESESTAGSFSLSVRDFDQNQGEVVKHYKIRNLDNGGFYISPRITFPGLHELVRHYTNASDGLCTRLSRPCQTQKPQKPWWEDEWEVPRETLKLVERLGAGQFGEVWMGYYNGHTKVAVKSLKQGSMSPDAFLAEANLMKQLQHQRLVRLYAVVTQEPIYIITEYMENGSLVDFLKTPSGIKLTINKLLDMAAQIAEGMAFIEERNYIHRDLRAANILVSDTLSCKIADFGLARLIEDNEYTAREGAKFPIKWTAPEAINYGTFTIKSDVWSFGILLTEIVTHGRIPYPGMTNPEVIQNLERGYRMVRPDNCPEELYQLMRLCWKERPEDRPTFDYLRSVLEDFFTATEGQYQPQP" --haystackdirect="MELWRQCTHWLIQCRVLPPSHRVTWDGAQVCELAQALRDGVLLCQLLNNLLPHAINLREVNLRPQMSQFLCLKNIRTFLSTCCEKFGLKRSELFEAFDLFDVQDFGKVIYTLSALSWTPIAQNRGIMPFPTEEESVGDEDIYSGLSDQIDDTVEEDEDLYDCVENEEAEGDEIYEDLMRSEPVSMPPKMTEYDKRCCCLREIQQTEEKYTDTLGSIQQHFLKPLQRFLKPQDIEIIFINIEDLLRVHTHFLKEMKEALGTPGAANLYQVFIKYKERFLVYGRYCSQVESASKHLDRVAAAREDVQMKLEECSQRANNGRFTLRDLLMVPMQRVLKYHLLLQELVKHTQEAMEKENLRLALDAMRDLAQCVNEVKRDNETLRQITNFQLSIENLDQSLAHYGRPKIDGELKITSVERRSKMDRYAFLLDKALLICKRRGDSYDLKDFVNLHSFQVRDDSSGDRDNKKWSHMFLLIEDQGAQGYELFFKTRELKKKWMEQFEMAISNIYPENATANGHDFQMFSFEETTSCKACQMLLRGTFYQGYRCHRCRASAHKECLGR
VPPCGRHGQDFPGTMKKDKLHRRAQDKKRNELGLPKMEVFQEYYGLPPPPGAIGPFLRLNPGDIVELTKAEAEQNWWEGRNTSTNEIGWFPCNRVKPYVHGPPQDLSVHLWYAGPMERAGAESILANRSDGTFLVRQRVKDAAEFAISIKYNVEVKHIKIMTAEGLYRITEKKAFRGLTELVEFYQQNSLKDCFKSLDTTLQFPFKEPEKRTISRPAVGSTKYFGTAKARYDFCARDRSELSLKEGDIIKILNKKGQQGWWRGEIYGRVGWFPANYVEEDYSEYC" --output=containmentneedle,containmenthaystack
The protein sequences were taken from Wikiversity (http://en.wikiversity.org/wik/Dot-matrix_methods, visited 04-16-2010).
We have tried to fully document the source code so with the download you'll get the javadoc style documentation. We hope that this together with the examples found here will suffice to get you started. If you have any questions, please feel free to contact us.
In this section we present projects using ShingleCloud library.
The LIS.KOM framework is a framework for capturing metadata for documents from processes running on them. Specifically relations emerging from reuse processes are captured and stored in a structured manner. An early C# implementation of ShingleCloud is used within this project to ensure the validity of captured relations. This is done by calculating the containment of the related artefacts and judging the validity by means of a preset containment threshold.
Holinshed's Chronicles of England, Scotland, and Ireland was the most important single source for contemporary playwrights and poets, above all Shakespeare. The work was first published in 1577. Eleven years later, in 1587, a second revised and largely extended edition was printed.
With the Holinshed Project a team based at the University of Oxford has started the endeavour to create a new annotated edition of the two versions of Holinshed's Chronicles. In a first phase the changes between the two editions were to be traced. The team was primarily interested in how portions of text were re-used, replaced, expanded, deleted, and modified from one edition to another and it was decided that the granularity was to be set at paragraph level. As the two editions are very large (each consists of approximately 20,000 paragraphs) and had been digitized by Early English Books Online, an initial comparison was made programmatically to serve as a starting point which was then refined by experts. The initial comparison was made using the ShingleCloud algorithm together with a custom preprocessing that for example stripped the text of all vowels. Although there were still many stumbling blocks such as illegible portions or completely rewritten paragraphs ShingleCloud achieved a recall of about 92% and a precision of 98%.
For more information on the project visit http://www.cems.ox.ac.uk/holinshed/
The TEI-Comparator is a text-comparison engine designed to compare two XML files where items of a paragraph-like granularity have been moved around, split up or otherwise reorganized. It was developed for the Holinshed Project (v.s) to allow experts to refine the initial comparison created with ShingleCloud. It is a database backed web application for comparing two XML files that allows users to link and annotate paragraphs. It uses the ShingleCloud library to propose matching paragraphs in the other file for a given paragraph (as this involves searching through the entire file and users are hardly willing to wait for several minutes the ShingleCloud algorithm is used for this purpose) and to highlight matching phrases in two linked paragraphs (the XMLHighlighter based on the Greedy String Tiling algorithm is used here).
For more information on the project visit http://tei-comparator.sourceforge.net/
The ShingleCloud library is published under the terms of the GNU LESSER GENERAL PUBLIC LICENSE
(http://www.gnu.org/licenses/lgpl.txt). If you plan on using the library in your project we would be happy if you could drop us a line.
The ShingleCloud library has been maintained by Arno Mittelbach and Lasse Lehmann. Both have leaved KOM in between. For questions, comments, or suggestions regarding ShingleCloud library you can contact Christoph Rensing.