I'm writing my own text editor with syntax highlighting in Java, and at the moment it simply parses and highlights the current line every time the user enters a single character. While presumably not the most efficient way, it's good enough and doesn't cause any noticeable performance issues. In pseudo-Java, this would be the core concept of my code:
public void textUpdated(String wholeText, int updateOffset, int updateLength) {
int lineStart = getFirstLineStart(wholeText, updateOffset);
int lineEnd = getLastLineEnd(wholeText, updateOffset + updateLength);
List<Token> foundTokens = tokenizeText(wholeText, lineStart, lineEnd);
for(Token token : foundTokens) {
highlightText(token.offset, token.length, token.tokenType);
}
}
The real problem lies with multi-line comments. To check if an entered character is inside a multi-line comment, the program would need to parse back to the most recent occurrence of a "/*", while also being aware of whether this occurrence is inside a literal or another comment. This would not be an issue if the amount of text is small, but if the text consists of 20,000 lines of code, it would possibly have to scan and (re)highlight 20,000 lines of code on each key press, which would be very inefficient.
So my ultimate question is: how do I handle multi-line tokens/comments in a syntax highlighter while keeping it efficient?
I tried to do this (for fun) about 10 years ago (or more). Because the code is so old, I don't remember all the details of the code and the logic conditions in the code. All the code here is basically a brute force solution. It in no way attempts to keep the state of each line as suggest by rici.
I'll try to explain the high level concept of the code. Hope some of it makes sense to you.
at the moment it simply parses and highlights the current line every time the user enters a single character.
This is the basic premise of my code as well. However, it does handle pasting multiple lines of code as well.
how do I handle multi-line tokens/comments in a syntax highlighter while keeping it efficient?
In my solution, when you enter "/*"
to start the multi-line comment, I will comment all the following lines of code until I find the end of the comment or the start of the another multi-line comment or the end of the Document. When you then enter the matching "*/"
to end the multi-line comment I will re-highlight the following lines until the next multi-line comment or the end of the Document.
So the amount of highlighting done depends on how much code you have between multi-line comments.
That is a quick overview of how it works. I doubt it is 100% accurate since I've only played with it a little bit. It should be noted this code was written when I was just learning Java, so in no way would I suggest it is the best approach, just the best I knew at the time.
Here is the code for your amusement :)
Just run the code and click on the button to get started.
import java.awt.*;
import java.awt.event.*;
import java.io.*;
import java.net.*;
import java.util.*;
import javax.swing.*;
import javax.swing.event.*;
import javax.swing.text.*;
class SyntaxDocument extends DefaultStyledDocument
{
private DefaultStyledDocument doc;
private Element rootElement;
private boolean multiLineComment;
private MutableAttributeSet normal;
private MutableAttributeSet keyword;
private MutableAttributeSet comment;
private MutableAttributeSet quote;
private Set<String> keywords;
private int lastLineProcessed = -1;
public SyntaxDocument()
{
doc = this;
rootElement = doc.getDefaultRootElement();
putProperty( DefaultEditorKit.EndOfLineStringProperty, "\n" );
normal = new SimpleAttributeSet();
StyleConstants.setForeground(normal, Color.black);
comment = new SimpleAttributeSet();
StyleConstants.setForeground(comment, Color.gray);
StyleConstants.setItalic(comment, true);
keyword = new SimpleAttributeSet();
StyleConstants.setForeground(keyword, Color.blue);
quote = new SimpleAttributeSet();
StyleConstants.setForeground(quote, Color.red);
keywords = new HashSet<String>();
keywords.add( "abstract" );
keywords.add( "boolean" );
keywords.add( "break" );
keywords.add( "byte" );
keywords.add( "byvalue" );
keywords.add( "case" );
keywords.add( "cast" );
keywords.add( "catch" );
keywords.add( "char" );
keywords.add( "class" );
keywords.add( "const" );
keywords.add( "continue" );
keywords.add( "default" );
keywords.add( "do" );
keywords.add( "double" );
keywords.add( "else" );
keywords.add( "extends" );
keywords.add( "false" );
keywords.add( "final" );
keywords.add( "finally" );
keywords.add( "float" );
keywords.add( "for" );
keywords.add( "future" );
keywords.add( "generic" );
keywords.add( "goto" );
keywords.add( "if" );
keywords.add( "implements" );
keywords.add( "import" );
keywords.add( "inner" );
keywords.add( "instanceof" );
keywords.add( "int" );
keywords.add( "interface" );
keywords.add( "long" );
keywords.add( "native" );
keywords.add( "new" );
keywords.add( "null" );
keywords.add( "operator" );
keywords.add( "outer" );
keywords.add( "package" );
keywords.add( "private" );
keywords.add( "protected" );
keywords.add( "public" );
keywords.add( "rest" );
keywords.add( "return" );
keywords.add( "short" );
keywords.add( "static" );
keywords.add( "super" );
keywords.add( "switch" );
keywords.add( "synchronized" );
keywords.add( "this" );
keywords.add( "throw" );
keywords.add( "throws" );
keywords.add( "transient" );
keywords.add( "true" );
keywords.add( "try" );
keywords.add( "var" );
keywords.add( "void" );
keywords.add( "volatile" );
keywords.add( "while" );
}
/*
* Override to apply syntax highlighting after the document has been updated
*/
public void insertString(int offset, String str, AttributeSet a) throws BadLocationException
{
if (str.equals("{"))
str = addMatchingBrace(offset);
super.insertString(offset, str, a);
processChangedLines(offset, str.length());
}
/*
* Override to apply syntax highlighting after the document has been updated
*/
public void remove(int offset, int length) throws BadLocationException
{
super.remove(offset, length);
processChangedLines(offset, 0);
}
/*
* Determine how many lines have been changed,
* then apply highlighting to each line
*/
public void processChangedLines(int offset, int length)
throws BadLocationException
{
String content = doc.getText(0, doc.getLength());
// The lines affected by the latest document update
int startLine = rootElement.getElementIndex(offset);
int endLine = rootElement.getElementIndex(offset + length);
if (startLine > endLine)
startLine = endLine;
// Make sure all comment lines prior to the start line are commented
// and determine if the start line is still in a multi line comment
if (startLine != lastLineProcessed
&& startLine != lastLineProcessed + 1)
{
setMultiLineComment( commentLinesBefore( content, startLine ) );
}
// Do the actual highlighting
for (int i = startLine; i <= endLine; i++)
{
applyHighlighting(content, i);
}
// Resolve highlighting to the next end multi line delimiter
if (isMultiLineComment())
commentLinesAfter(content, endLine);
else
highlightLinesAfter(content, endLine);
}
/*
* Highlight lines when a multi line comment is still 'open'
* (ie. matching end delimiter has not yet been encountered)
*/
private boolean commentLinesBefore(String content, int line)
{
int offset = rootElement.getElement( line ).getStartOffset();
// Start of comment not found, nothing to do
int startDelimiter = lastIndexOf( content, getStartDelimiter(), offset - 2 );
if (startDelimiter < 0)
return false;
// Matching start/end of comment found, nothing to do
int endDelimiter = indexOf( content, getEndDelimiter(), startDelimiter );
if (endDelimiter < offset & endDelimiter != -1)
return false;
// End of comment not found, highlight the lines
doc.setCharacterAttributes(startDelimiter, offset - startDelimiter + 1, comment, false);
return true;
}
/*
* Highlight comment lines to matching end delimiter
*/
private void commentLinesAfter(String content, int line)
{
int offset = rootElement.getElement( line ).getStartOffset();
// End of comment and Start of comment not found
// highlight until the end of the Document
int endDelimiter = indexOf( content, getEndDelimiter(), offset );
if (endDelimiter < 0)
{
endDelimiter = indexOf( content, getStartDelimiter(), offset + 2);
if (endDelimiter < 0)
{
doc.setCharacterAttributes(offset, content.length() - offset + 1, comment, false);
return;
}
}
// Matching start/end of comment found, comment the lines
int startDelimiter = lastIndexOf( content, getStartDelimiter(), endDelimiter );
if (startDelimiter < 0 || startDelimiter >= offset)
{
doc.setCharacterAttributes(offset, endDelimiter - offset + 1, comment, false);
}
}
/*
* Highlight lines to start or end delimiter
*/
private void highlightLinesAfter(String content, int line)
throws BadLocationException
{
int offset = rootElement.getElement( line ).getEndOffset();
// Start/End delimiter not found, nothing to do
int startDelimiter = indexOf( content, getStartDelimiter(), offset );
int endDelimiter = indexOf( content, getEndDelimiter(), offset );
if (startDelimiter < 0)
startDelimiter = content.length();
if (endDelimiter < 0)
endDelimiter = content.length();
int delimiter = Math.min(startDelimiter, endDelimiter);
if (delimiter < offset)
return;
// Start/End delimiter found, reapply highlighting
int endLine = rootElement.getElementIndex( delimiter );
for (int i = line + 1; i <= endLine; i++)
{
Element branch = rootElement.getElement( i );
Element leaf = doc.getCharacterElement( branch.getStartOffset() );
AttributeSet as = leaf.getAttributes();
if ( as.isEqual(comment) )
{
applyHighlighting(content, i);
}
}
}
/*
* Parse the line to determine the appropriate highlighting
*/
private void applyHighlighting(String content, int line)
throws BadLocationException
{
lastLineProcessed = line;
int startOffset = rootElement.getElement( line ).getStartOffset();
int endOffset = rootElement.getElement( line ).getEndOffset() - 1;
int lineLength = endOffset - startOffset;
int contentLength = content.length();
if (endOffset >= contentLength)
endOffset = contentLength - 1;
// check for multi line comments
// (always set the comment attribute for the entire line)
if (endingMultiLineComment(content, startOffset, endOffset)
|| isMultiLineComment()
|| startingMultiLineComment(content, startOffset, endOffset) )
{
doc.setCharacterAttributes(startOffset, endOffset - startOffset + 1, comment, false);
lastLineProcessed = -1;
return;
}
// set normal attributes for the line
doc.setCharacterAttributes(startOffset, lineLength, normal, true);
// check for single line comment
int index = content.indexOf(getSingleLineDelimiter(), startOffset);
if ( (index > -1) && (index < endOffset) )
{
doc.setCharacterAttributes(index, endOffset - index + 1, comment, false);
endOffset = index - 1;
}
// check for tokens
checkForTokens(content, startOffset, endOffset);
}
/*
* Does this line contain the start delimiter
*/
private boolean startingMultiLineComment(String content, int startOffset, int endOffset)
throws BadLocationException
{
int index = indexOf( content, getStartDelimiter(), startOffset );
if ( (index < 0) || (index > endOffset) )
return false;
else
{
setMultiLineComment( true );
return true;
}
}
/*
* Does this line contain the end delimiter
*/
private boolean endingMultiLineComment(String content, int startOffset, int endOffset)
throws BadLocationException
{
int index = indexOf( content, getEndDelimiter(), startOffset );
if ( (index < 0) || (index > endOffset) )
return false;
else
{
setMultiLineComment( false );
return true;
}
}
/*
* We have found a start delimiter
* and are still searching for the end delimiter
*/
private boolean isMultiLineComment()
{
return multiLineComment;
}
private void setMultiLineComment(boolean value)
{
multiLineComment = value;
}
/*
* Parse the line for tokens to highlight
*/
private void checkForTokens(String content, int startOffset, int endOffset)
{
while (startOffset <= endOffset)
{
// skip the delimiters to find the start of a new token
while ( isDelimiter( content.substring(startOffset, startOffset + 1) ) )
{
if (startOffset < endOffset)
startOffset++;
else
return;
}
// Extract and process the entire token
if ( isQuoteDelimiter( content.substring(startOffset, startOffset + 1) ) )
startOffset = getQuoteToken(content, startOffset, endOffset);
else
startOffset = getOtherToken(content, startOffset, endOffset);
}
}
/*
*
*/
private int getQuoteToken(String content, int startOffset, int endOffset)
{
String quoteDelimiter = content.substring(startOffset, startOffset + 1);
String escapeString = getEscapeString(quoteDelimiter);
int index;
int endOfQuote = startOffset;
// skip over the escape quotes in this quote
index = content.indexOf(escapeString, endOfQuote + 1);
while ( (index > -1) && (index < endOffset) )
{
endOfQuote = index + 1;
index = content.indexOf(escapeString, endOfQuote);
}
// now find the matching delimiter
index = content.indexOf(quoteDelimiter, endOfQuote + 1);
if ( (index < 0) || (index > endOffset) )
endOfQuote = endOffset;
else
endOfQuote = index;
doc.setCharacterAttributes(startOffset, endOfQuote - startOffset + 1, quote, false);
return endOfQuote + 1;
}
/*
*
*/
private int getOtherToken(String content, int startOffset, int endOffset)
{
int endOfToken = startOffset + 1;
while ( endOfToken <= endOffset )
{
if ( isDelimiter( content.substring(endOfToken, endOfToken + 1) ) )
break;
endOfToken++;
}
String token = content.substring(startOffset, endOfToken);
if ( isKeyword( token ) )
{
doc.setCharacterAttributes(startOffset, endOfToken - startOffset, keyword, false);
}
return endOfToken + 1;
}
/*
* Assume the needle will be found at the start/end of the line
*/
private int indexOf(String content, String needle, int offset)
{
int index;
while ( (index = content.indexOf(needle, offset)) != -1 )
{
String text = getLine( content, index ).trim();
if (text.startsWith(needle) || text.endsWith(needle))
break;
else
offset = index + 1;
}
return index;
}
/*
* Assume the needle will the found at the start/end of the line
*/
private int lastIndexOf(String content, String needle, int offset)
{
int index;
while ( (index = content.lastIndexOf(needle, offset)) != -1 )
{
String text = getLine( content, index ).trim();
if (text.startsWith(needle) || text.endsWith(needle))
break;
else
offset = index - 1;
}
return index;
}
private String getLine(String content, int offset)
{
int line = rootElement.getElementIndex( offset );
Element lineElement = rootElement.getElement( line );
int start = lineElement.getStartOffset();
int end = lineElement.getEndOffset();
return content.substring(start, end - 1);
}
/*
* Override for other languages
*/
protected boolean isDelimiter(String character)
{
String operands = ";:{}()[]+-/%<=>!&|^~*";
if (Character.isWhitespace( character.charAt(0) )
|| operands.indexOf(character) != -1 )
return true;
else
return false;
}
/*
* Override for other languages
*/
protected boolean isQuoteDelimiter(String character)
{
String quoteDelimiters = "\"'";
if (quoteDelimiters.indexOf(character) < 0)
return false;
else
return true;
}
/*
* Override for other languages
*/
protected boolean isKeyword(String token)
{
return keywords.contains( token );
}
/*
* Override for other languages
*/
protected String getStartDelimiter()
{
return "/*";
}
/*
* Override for other languages
*/
protected String getEndDelimiter()
{
return "*/";
}
/*
* Override for other languages
*/
protected String getSingleLineDelimiter()
{
return "//";
}
/*
* Override for other languages
*/
protected String getEscapeString(String quoteDelimiter)
{
return "\\" + quoteDelimiter;
}
/*
*
*/
protected String addMatchingBrace(int offset) throws BadLocationException
{
StringBuffer whiteSpace = new StringBuffer();
int line = rootElement.getElementIndex( offset );
int i = rootElement.getElement(line).getStartOffset();
while (true)
{
String temp = doc.getText(i, 1);
if (temp.equals(" ") || temp.equals("\t"))
{
whiteSpace.append(temp);
i++;
}
else
break;
}
return "{\n" + whiteSpace.toString() + "\t\n" + whiteSpace.toString() + "}";
}
/*
public void setCharacterAttributes(int offset, int length, AttributeSet s, boolean replace)
{
super.setCharacterAttributes(offset, length, s, replace);
}
*/
public static void main(String a[])
{
EditorKit editorKit = new StyledEditorKit()
{
public Document createDefaultDocument()
{
return new SyntaxDocument();
}
};
// final JEditorPane edit = new JEditorPane()
final JTextPane edit = new JTextPane();
// LinePainter painter = new LinePainter(edit, Color.cyan);
// LinePainter2 painter = new LinePainter2(edit, Color.cyan);
// edit.setEditorKitForContentType("text/java", editorKit);
// edit.setContentType("text/java");
edit.setEditorKit(editorKit);
JButton button = new JButton("Load SyntaxDocument.java");
button.addActionListener( new ActionListener()
{
public void actionPerformed(ActionEvent e)
{
try
{
long startTime = System.currentTimeMillis();
FileReader fr = new FileReader( "SyntaxDocument.java" );
// FileReader fr = new FileReader( "C:\\Java\\j2sdk1.4.2\\src\\javax\\swing\\JComponent.java" );
BufferedReader br = new BufferedReader(fr);
edit.read( br, null );
System.out.println("Load: " + (System.currentTimeMillis() - startTime));
System.out.println("Document contains: " + edit.getDocument().getLength() + " characters");
edit.requestFocus();
}
catch(Exception e2) {}
}
});
JFrame frame = new JFrame("Syntax Highlighting");
frame.getContentPane().add( new JScrollPane(edit) );
frame.getContentPane().add(button, BorderLayout.SOUTH);
frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
frame.setSize(800,300);
frame.setVisible(true);
}
}
Note: this code does not check if the comment delimiters are inside a literal, so that would need to be improved upon.
I don't really expect you to use this code, but I thought it might give you an idea of the performance you might get when using the brute force approach.
One common approach is to save the lexer state at the start of each line. (Typically, the lexer state will be a small integer or enum; for Java-like languages, it would probably be limited to three values: normal, inside multiline comment, and inside multiline string constant.)
A change to a line could change the lexer state at the start of the next line, but it can't change the state at the beginning of the current line, so the retokenisation of the line can be done from the start of the line, using the current line's lexer state as a starting condition. Keeping per-line lexer states makes it easy to handle the case where the cursor is moved to another line, possibly quite some distance away.
If the edit changes the lexer state at the end of the line (which is to say the start of the next line) you could rescan the rest of the file. However, doing so immediately is really annoying for the user because it means that every time they type a quote, the entire scrern gets repainted, because it has become part of a multiline string (for example). Since most of the time, the user wil close the string (or comment), it is usually better to delay the rescan. For example, you might wait until the user moves the cursor or completes the lexical element or some other such signal. Another comon approach is to insert a "ghost" close symbol after the cursor, which will keep the lex in sync. The ghost will be deleted if the user types it explicitly, or deletes it explicitly.
You seem to be keeping the entire program as a single string. IMHO, it's better to keep it as a list of lines, to avoid having to copy the entire string when a character is inserted or deleted. Otherwise, editing very long files becomes really annoying.
Finally, you should never tokenise text which is not visible. Avoiding that will limit the damage of large retokenisations.