We are developing a plagiarism detection framework, and I need to highlight the possibly plagiarized phrases in a document. The document is first preprocessed with stop-word removal, stemming, and number removal, so it is difficult to map the preprocessed tokens back to the text that should be highlighted. As an example:
Original text: "Extreme programming is one approach of agile software development which emphasizes on frequent releases in short development cycles which are called time boxes. This result in reducing the costs spend for changes, by having multiple short development cycles, rather than one long one. Extreme programming includes pair-wise programming (for code review, unit testing). Also it avoids implementing features which are not included in the current time box, so the schedule creep can be minimized. "
Phrase I want to highlight: Extreme programming includes pair-wise programming
Preprocessed tokens: Extrem program pair-wise program
Is there any way I can highlight the preprocessed tokens in the original document?
Thanks
You could use java.text.AttributedString to annotate the preprocessed tokens in the original document, then apply TextAttribute values to the relevant ranges (which would take effect in the original document).
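A minimal sketch of that idea, assuming you keep each token's character offsets in the original text while preprocessing. The class and method names, the stop-word list, and the lowercase-only `stem` are all stand-ins for your own pipeline, not part of any library:

```java
import java.awt.Color;
import java.awt.font.TextAttribute;
import java.text.AttributedString;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OffsetHighlightSketch {
    /** A preprocessed token plus the character span it came from in the original text. */
    static final class Token {
        final String text;
        final int start;
        final int end;
        Token(String text, int start, int end) { this.text = text; this.start = start; this.end = end; }
    }

    // Hypothetical stop-word list; plug in the one your framework uses.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "is", "of", "on", "in", "for", "which");

    // Letters only, with inner hyphens kept ("pair-wise"); digits never match, so numbers are dropped.
    static final Pattern WORD = Pattern.compile("\\p{L}+(?:-\\p{L}+)*");

    // Stand-in for a real stemmer (assumption): only lowercases the word.
    static String stem(String word) {
        return word.toLowerCase(Locale.ROOT);
    }

    /** Tokenizes, stems and removes stop words, remembering where each token sat in the input. */
    static List<Token> preprocess(String text) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            String stemmed = stem(m.group());
            if (!STOP_WORDS.contains(stemmed)) {
                tokens.add(new Token(stemmed, m.start(), m.end()));
            }
        }
        return tokens;
    }

    /** Returns {start, end} in the original text for the first occurrence of phrase, or null. */
    static int[] findSpan(List<Token> doc, List<Token> phrase) {
        if (phrase.isEmpty()) return null;
        for (int i = 0; i + phrase.size() <= doc.size(); i++) {
            boolean match = true;
            for (int j = 0; j < phrase.size() && match; j++) {
                match = doc.get(i + j).text.equals(phrase.get(j).text);
            }
            if (match) {
                return new int[] {doc.get(i).start, doc.get(i + phrase.size() - 1).end};
            }
        }
        return null;
    }

    /** Builds an AttributedString over the original text with the matched span marked yellow. */
    static AttributedString highlight(String original, String phrase) {
        AttributedString attributed = new AttributedString(original);
        int[] span = findSpan(preprocess(original), preprocess(phrase));
        if (span != null) {
            attributed.addAttribute(TextAttribute.BACKGROUND, Color.YELLOW, span[0], span[1]);
        }
        return attributed;
    }
}
```

Because the match runs on the preprocessed tokens but the span comes from the stored offsets, stop words and punctuation inside the phrase end up highlighted too. The result can be drawn with Graphics2D.drawString(attributed.getIterator(), x, y).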
From a technical point of view, you can either choose or develop a markup language and add annotations or tags to the original document, or create a second file that records all potential plagiarisms.
With markup, your text could look like this:
(with ref referring to some metadata record that describes the original)
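One way such tags could be produced is sketched below in Java. The element name `plag`, the class `MarkupSketch`, and the sample ref value are made up for illustration; only the idea of a ref attribute pointing at a metadata record comes from the description above.

```java
class MarkupSketch {
    /** Wraps text[start, end) in a hypothetical <plag ref="..."> element. */
    static String mark(String text, int start, int end, String ref) {
        return escape(text.substring(0, start))
                + "<plag ref=\"" + escape(ref) + "\">"
                + escape(text.substring(start, end))
                + "</plag>"
                + escape(text.substring(end));
    }

    // Escapes characters that would otherwise be read as markup.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }
}
```

For example, MarkupSketch.mark(text, 0, 50, "src-1") on the third sentence of the sample text would wrap "Extreme programming includes pair-wise programming" in the tag and leave the rest of the sentence untouched.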
You'd be better off using JTextPane or JEditorPane instead of JTextArea.
A text area is a "plain" text component, which means that although it can display text in any font, all of the text is in the same font.
So JTextArea is not a convenient component for text formatting. By contrast, with JTextPane or JEditorPane it's quite easy to change the style of (highlight) any part of the loaded text. See How to Use Editor Panes and Text Panes for details.
Update:
The following code highlights the desired part of your text. It's not exactly what you want: it simply finds the exact phrase in the text.
But I hope that if you apply your algorithms, you can easily modify it to fit your needs.
This example is based on Highlighting Words in a JTextComponent.
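A minimal sketch of that approach, assuming a JTextPane and the standard Swing Highlighter API. The class and method names (PhraseHighlighterSketch, findOccurrences, highlightAll) are illustrative, and the matching is plain exact-phrase search:

```java
import java.awt.Color;
import java.util.ArrayList;
import java.util.List;
import javax.swing.JFrame;
import javax.swing.JScrollPane;
import javax.swing.JTextPane;
import javax.swing.SwingUtilities;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultHighlighter;
import javax.swing.text.Document;
import javax.swing.text.Highlighter;

class PhraseHighlighterSketch {
    /** Returns the start offset of every non-overlapping exact occurrence of phrase in text. */
    static List<Integer> findOccurrences(String text, String phrase) {
        List<Integer> starts = new ArrayList<>();
        if (phrase.isEmpty()) return starts;
        for (int pos = text.indexOf(phrase); pos >= 0;
                pos = text.indexOf(phrase, pos + phrase.length())) {
            starts.add(pos);
        }
        return starts;
    }

    /** Paints a yellow highlight over every exact occurrence of phrase in the pane. */
    static void highlightAll(JTextPane pane, String phrase) throws BadLocationException {
        Highlighter highlighter = pane.getHighlighter();
        highlighter.removeAllHighlights();
        Highlighter.HighlightPainter painter =
                new DefaultHighlighter.DefaultHighlightPainter(Color.YELLOW);
        // Read the text from the Document so offsets match the model the highlighter uses.
        Document doc = pane.getDocument();
        String text = doc.getText(0, doc.getLength());
        for (int start : findOccurrences(text, phrase)) {
            highlighter.addHighlight(start, start + phrase.length(), painter);
        }
    }

    public static void main(String[] args) {
        SwingUtilities.invokeLater(() -> {
            JTextPane pane = new JTextPane();
            pane.setText("Extreme programming is one approach of agile software development. "
                    + "Extreme programming includes pair-wise programming "
                    + "(for code review, unit testing).");
            try {
                highlightAll(pane, "Extreme programming includes pair-wise programming");
            } catch (BadLocationException e) {
                throw new IllegalStateException(e);
            }
            JFrame frame = new JFrame("Highlight sketch");
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.add(new JScrollPane(pane));
            frame.setSize(600, 300);
            frame.setVisible(true);
        });
    }
}
```

To highlight matches found on preprocessed tokens instead, replace findOccurrences with whatever your algorithm produces, as long as it yields start and end offsets into the original text.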