GNU Source-highlight 2.0-rc1



1 Introduction

GNU Source-highlight, given a source file, produces a document with syntax highlighting. The colors and styles (bold, italics, underline) can be specified in a configuration file, and other options can be given on the command line. The output format can be HTML, XHTML, or ANSI color escape sequences.

The program already recognizes many programming languages (e.g., C++, Java, Perl) and file formats (e.g., log files, ChangeLog). Since version 2.0, it also allows you to specify your own input source language via a simple syntax described later in this manual (see Language Definitions).

The complete list of languages (indeed, file extensions) natively supported by this version of Source-highlight (2.0-rc1), as reported by --lang-list, is the following:

     Supported languages (file extensions) and associated language definition files
     
     java = java.lang
     cpp = cpp.lang
     c = cpp.lang
     C = cpp.lang
     cc = cpp.lang
     h = cpp.lang
     hh = cpp.lang
     H = cpp.lang
     hpp = cpp.lang
     javascript = javascript.lang
     js = javascript.lang
     prolog = prolog.lang
     pl = prolog.lang
     perl = perl.lang
     pm = perl.lang
     php3 = php3.lang
     php = php3.lang
     python = python.lang
     py = python.lang
     ruby = ruby.lang
     rb = ruby.lang
     flex = flex.lang
     lex = flex.lang
     l = flex.lang
     ll = flex.lang
     bison = bison.lang
     yacc = bison.lang
     y = bison.lang
     yy = bison.lang
     changelog = changelog.lang
     lua = lua.lang
     ml = caml.lang
     caml = caml.lang
     sml = sml.lang
     syslog = syslog.lang
     log = syslog.lang
     
     



2 Installation

See the file INSTALL for detailed building and installation instructions. If you are used to compiling software distributed as sources, you may simply follow the usual procedure, i.e., untar the file you downloaded into a directory and then:

     cd <source code main directory>
     ./configure
     make
     make install

Note: unless you specify a different install directory with the --prefix option of configure (e.g., ./configure --prefix=<your home>), you must be root to run make install.

Files will be installed in the following directories:

     Executables        /prefix/bin
     Docs and samples   /prefix/share/doc/source-highlight
     Conf files         /prefix/share/source-highlight

The default value for prefix is /usr/local, but you may change it with the --prefix option of configure.

NOTICE: Originally, instead of Source-highlight, there were two separate programs, namely GNU java2html and GNU cpp2html. Two shell scripts with the same names are installed together with Source-highlight in order to facilitate the migration (however, their use is deprecated).

2.1 Download

You can download Source-highlight from GNU's ftp site: ftp://ftp.gnu.org/gnu/src-highlite or from one of its mirrors (see http://www.gnu.org/prep/ftp.html).

I do not distribute Windows binaries anymore, since they can easily be built with the Cygnus C/C++ compiler, available at http://www.cygwin.com. However, if you do not feel like downloading that compiler, you can request the binaries from me by e-mail (find my e-mail address at my home page) and I can send them to you. An MS-Windows port of Source-highlight is also available from http://gnuwin32.sourceforge.net.

Archives are digitally signed by me (Lorenzo Bettini) with GNU gpg (http://www.gnupg.org). My GPG public key can be found at my home page (http://www.lorenzobettini.it).

You can also get the patches, if they are available for a particular release (see below for patching from a previous version).

2.2 Anonymous CVS Access

This project's CVS repository can be checked out through anonymous (pserver) CVS with the following commands. When prompted for a password for anoncvs, simply press the Enter key.

     cvs -d:pserver:anoncvs@subversions.gnu.org:/cvsroot/src-highlite login
     
     cvs -z3 -d:pserver:anoncvs@subversions.gnu.org:/cvsroot/src-highlite \
       co src-highlite

Further instructions can be found at the address:

http://savannah.gnu.org/projects/src-highlite.

2.3 What you need to build source-highlight

Since version 2.0 Source-highlight relies on regular expressions as provided by boost (http://www.boost.org), so you need to install at least the regex library from boost. Most GNU/Linux distributions provide this library already in a compiled form.

Source-highlight has been developed under GNU/Linux, using gcc (C++), bison (yacc) and flex (lex), and ported to Win32 with the Cygnus C/C++ compiler, available at http://www.cygwin.com. I used the excellent GNU Autoconf and GNU Automake. I also used Autotools (ftp://ftp.ugcs.caltech.edu/pub/elef/autotools), which creates a starting source tree (according to GNU standards) with the initial autoconf and automake files. Finally, I used GNU gengetopt (http://www.gnu.org/software/gengetopt) for command line parsing.

I have also started to use doublecpp (http://www.lorenzobettini.it/software/doublecpp), which makes it possible to achieve dynamic overloading.

If you want to use a specific version of the Boost regex library, you can use the configure option --with-boost-regex to specify a particular suffix. For instance,

     ./configure --with-boost-regex=boost_regex-gcc-1_31

Apart from the Boost regex library, you do not need the tools above to build source-highlight, because the generated sources are provided; you need them only if you want to develop source-highlight.

2.4 Patching from a previous version

If you downloaded a patch, say source-highlight-1.3-1.3.1.patch.gz (i.e., the patch to go from version 1.3 to version 1.3.1), cd to the directory with the sources of the previous version (source-highlight-1.3) and type:

     gunzip -cd ../source-highlight-1.3-1.3.1.patch.gz | patch -p1

and restart the compilation process (if you had already run configure a simple make should do).

2.5 Using source-highlight with less

This was suggested by Konstantine Serebriany. The script src-hilite-lesspipe.sh will be installed together with source-highlight. You can use the following environment variables:

     export LESSOPEN="| /path/to/src-hilite-lesspipe.sh %s"
     export LESS=' -R '

This way, when you use less to browse a file, if it is a source file handled by source-highlight, it will be automatically highlighted.

2.6 Building .rpm

Christian W. Zuckschwerdt added support for building an .rpm and a .src.rpm. You can issue the following command

     rpm -tb source-highlight-2.0-rc1.tar.gz

for building an .rpm with binaries and

     rpm -ts source-highlight-2.0-rc1.tar.gz

for building a .src.rpm with sources.

2.7 Related Software and Links

Martin Gebert is also implementing a KDE interface to source-highlight programs (and he did a wonderful job!), and it is called ksrc2html; if you want to test it: http://murphy.netsolution-net.de.

CGI support was enabled thanks to Robert Wetzel; I haven't tested it personally yet, so you may ask him directly. Moreover he set up some examples at the page http://www.inf.tu-dresden.de/~rw8/java2.html. If you want to use source-highlight as a CGI program, you have to use the executable source-highlight-cgi.

Moreover, there is also a Java version of java2html; you can find it at http://www.generationjava.com/projects/Java2Html.shtml.



3 Copying Conditions

GNU Source-highlight is free software; you are free to use, share and modify it under the terms of the GNU General Public License that accompanies this software (see COPYING).

GNU source-highlight was written and maintained by Lorenzo Bettini http://www.lorenzobettini.it.



4 Simple Usage

Here are some realistic examples of running source-highlight [1].

Source-highlight only does a lexical analysis of the source code, so the program source is assumed to be correct!

Here's how to run source-highlight (for this example we will use C/C++ input files, but this is valid also for other source-highlight input languages):

     source-highlight --src-lang cpp --out-format html \
         --input <C++ file> \
         --output <html file> options

For input files, apart from the -i (--input) option and standard input redirection, you can simply specify some files on the command line, also using shell wildcards (for instance *.java). In this case the name of each output file will be formed from the name of the source file with a .<ext> appended, where <ext> is the extension chosen according to the specified output format (in this example it would be .html).

If the string STDOUT is passed as the -o (--output) option, the output is forced to standard output in any case.

If -s (--src-lang) is not specified, the source language is inferred from the extension of the input file (this, of course, does not work with standard input redirection).

If -f (--out-format) is not specified, the output will be produced in HTML.
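
The invocation described above can also be assembled from a script. The following is only a minimal Python sketch, not part of source-highlight: the helper names build_command and default_output_name are hypothetical, and only options documented in this manual are used. It builds the argument list and the default output file name; it does not run the program.

```python
def build_command(input_file, output_file, src_lang=None, out_format=None):
    """Build an argument list for source-highlight (hypothetical helper)."""
    cmd = ["source-highlight"]
    if src_lang is not None:
        # Without --src-lang the language is inferred from the file extension.
        cmd += ["--src-lang", src_lang]
    if out_format is not None:
        # Without --out-format the output is produced in HTML.
        cmd += ["--out-format", out_format]
    cmd += ["--input", input_file, "--output", output_file]
    return cmd


def default_output_name(source_file, extension="html"):
    """Output name used when several input files are given: <source>.<ext>."""
    return source_file + "." + extension


print(build_command("main.cpp", "main.html", src_lang="cpp", out_format="html"))
print(default_output_name("Foo.java"))
```

The resulting list can be passed to subprocess.run on a system where source-highlight is installed.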



5 Configuration files

During execution, source-highlight needs some files where it finds directives on how to recognize the source language (if not explicitly specified with --src-lang or --lang-def), on how to format specific source elements (e.g., keywords, comments, etc.), and source language definitions. These files will be explained in the next sections.

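If the directory for such files is not explicitly specified with the command line option --data-dir, these files are searched for first in the current directory and then in the installation directory for configuration files (/prefix/share/source-highlight, see Installation).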

If you want to be sure about which file is used during the execution, you can use the command line option --verbose.
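
The lookup described by the --help text for --data-dir (current directory, then the data dir installation directory) can be sketched as follows. This is an illustration in Python, not the program's actual code: the function name find_config_file is hypothetical, the default installation path assumes the default prefix /usr/local, and the assumption that only the given directory is consulted when --data-dir is specified is mine.

```python
import os


def find_config_file(name, data_dir=None,
                     install_dir="/usr/local/share/source-highlight"):
    """Return the first existing path for a configuration file, or None."""
    if data_dir:
        # Assumption: with --data-dir only that directory is consulted.
        search_dirs = [data_dir]
    else:
        # Otherwise: the current directory first, then the installation directory.
        search_dirs = [os.getcwd(), install_dir]
    for directory in search_dirs:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None


print(find_config_file("lang.map"))
```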

5.1 Output format style

You must specify your options for syntax highlighting in the file tags.j2h. Here's the one that comes with this distribution:

     keyword blue b ;      // for language keywords
     type darkgreen ;      // for basic types
     string red ;          // for strings and chars
     comment brown i ;     // for comments
     number purple ;       // for literal numbers
     preproc darkblue b ;  // for preproc directives (e.g. #include, import)
     symbol darkred ;      // for symbols (e.g. <, >, +)
     function black b;     // for function calls and declarations
     cbracket red;         // for block brackets (e.g. {, })
     
     // line numbers
     linenum black;
     
     // Internet related
     url blue u;
     
     // other elements for ChangeLog and Log files
     date blue b ;
     time darkblue b ;
     ip darkgreen ;
     file darkblue b ;
     name darkgreen ;
     
     // for Prolog, Perl...
     variable darkgreen ;

You can specify your own file (it doesn't have to be named tags.j2h) with the command line option --tags-file, see Invoking source-highlight.

You can also specify the color of normal text by adding this line

     normal darkblue ;

As you can see, the syntax of this file is quite straightforward. The style letters have the following meaning:

     b = bold
     i = italics
     u = underline

You may also specify more than one of these options, separated by commas, e.g.

     keyword blue u, b ;

These are all possible HTML color logical names handled by source-highlight:

      black (#000000)
      red (#FF0000)
      darkred (#990000)
      brown (#660000)
      yellow (#FFCC00)
      cyan (#66FFFF)
      blue (#3333FF)
      pink (#CC33CC)
      purple (#993399)
      orange (#FF6600)
      brightorange (#FF9900)
      green (#33CC00)
      brightgreen (#33FF33)
      darkgreen (#009900)
      teal (#008080)
      gray (#808080)
      darkblue (#000080)

You can see these colors in the file colors.html. You can also use the standard #<number> html syntax for specifying a color.
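
To make the format of the style file concrete, here is a small Python sketch that reads one line of it. It is only an illustration inferred from the examples in this section, not the parser used by source-highlight; the names parse_style_line, COLORS and STYLES are hypothetical.

```python
import re

# Logical color names and their HTML values, as listed above.
COLORS = {
    "black": "#000000", "red": "#FF0000", "darkred": "#990000",
    "brown": "#660000", "yellow": "#FFCC00", "cyan": "#66FFFF",
    "blue": "#3333FF", "pink": "#CC33CC", "purple": "#993399",
    "orange": "#FF6600", "brightorange": "#FF9900", "green": "#33CC00",
    "brightgreen": "#33FF33", "darkgreen": "#009900", "teal": "#008080",
    "gray": "#808080", "darkblue": "#000080",
}
STYLES = {"b": "bold", "i": "italics", "u": "underline"}

# <element> <color> [<style>{, <style>}] ;
LINE = re.compile(r"^\s*(\w+)\s+(#?\w+)\s*([biu](?:\s*,\s*[biu])*)?\s*;")


def parse_style_line(line):
    """Return (element, html_color, [styles]) or None for comments/blank lines."""
    line = line.split("//", 1)[0]  # drop a trailing // comment
    match = LINE.match(line)
    if match is None:
        return None
    element, color, styles = match.groups()
    names = [STYLES[s.strip()] for s in styles.split(",")] if styles else []
    # A color not in the table (e.g. given as #<number>) is passed through as is.
    return element, COLORS.get(color, color), names


print(parse_style_line("keyword blue u, b ;"))
```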

5.2 Language map

This configuration file associates a file extension to a specific language definition file. You can also use such file extension to specify the --src-lang option (see Simple Usage). Source-highlight comes with such a file, called lang.map.

Of course, you can override the settings of this file by writing your own language map file and specifying it with the command line option --lang-map. Moreover, if a file lang.map is present in the current directory, that version will be used. The format of such a file is quite simple:

     extension = language definition file

The default language map is shown in Introduction.
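
As an illustration of how such a map can be used, the following Python sketch parses lines of the form shown above and looks up the language definition file for a given file name. The function names are hypothetical and the code is not taken from source-highlight. Note that the lookup is case-sensitive, which is consistent with the default map listing both c and C, h and H.

```python
def parse_lang_map(text):
    """Parse 'extension = language definition file' lines into a dict."""
    mapping = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or unrelated lines
        extension, lang_file = (part.strip() for part in line.split("=", 1))
        if extension and lang_file:
            mapping[extension] = lang_file
    return mapping


def language_file_for(filename, mapping):
    """Return the language definition file for a file name, or None."""
    _, dot, extension = filename.rpartition(".")
    return mapping.get(extension) if dot else None


SAMPLE = """
java = java.lang
C = cpp.lang
pl = prolog.lang
"""

lang_map = parse_lang_map(SAMPLE)
print(language_file_for("Foo.java", lang_map))
```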

5.3 Language definition files

These files are crucial for source-highlight since they specify the source elements that have to be highlighted. These files also allow you to specify your own language definitions in order to deal with a language that is not handled by source-highlight [2].

I encourage those who write new language definitions or correct/modify existing language definitions to send them to me so that they can be added to the source-highlight distribution!

Since these files require more explanation (which, however, is not necessary for the standard usage of source-highlight), they are explained in detail in a separate part: Language Definitions.



6 Invoking source-highlight

The format for running the source-highlight program is:

     source-highlight option ...

source-highlight supports the following options, shown by the output of source-highlight --help:

     source-highlight
     
     Highlight the syntax of a source file (e.g. Java) into a specific format (e.g.
     HTML)
     
     Usage: source-highlight [OPTIONS]...
     
       -h, --help               Print help and exit
       -V, --version            Print version and exit
       -i, --input=STRING       input file. default std input
       -o, --output=STRING      output file. default std output
       -s, --src-lang=STRING    source language (use --lang-list to get the complete
                                  list).   If not specified, the source language
                                  will be guessed from the file extension.
           --lang-list          list all the supported language and associated
                                  language definition file
       -f, --out-format=STRING  output format (e.g. html, xhtml, esc)
                                  (default=`html')
       -v, --verbose            verbose mode on
       -d, --doc                create html with title and header
           --no-doc             cancel the --doc option even if it is implied (e.g.,
                                  when css is given)
       -c, --css=STRING         use a css for formatting. Implies --doc
       -T, --title=STRING       give a title to the html. Implies --doc
       -t, --tab=INT            specify tab length. default 8
       -H, --header=STRING      file to insert as header
       -F, --footer=STRING      file to insert as footer
           --tags-file=STRING   specify format options  (default=`tags.j2h')
       -n, --line-number        number all output lines
           --line-number-ref    number all output lines and generate an anchor that
                                  can be referred to from another document
           --output-dir=STRING  output directory
           --gen-version        put source-highlight version in the generated file
                                  (default=on)
           --lang-def=STRING    language definition file
           --lang-map=STRING    language map file  (default=`lang.map')
           --data-dir=PATH      directory where language definition files and
                                  language map are searched for.  If not specified
                                  these files are searched for in the current
                                  directory and in the data dir installation
                                  directory

Let us explain some options in detail (apart from those that should be clear from the --help output itself, and those already explained in Simple Usage).

--doc
-d
If you want a complete html document, specify this option (otherwise you just get some text to copy and paste into your own html pages). If you choose this option, the page will have a white background and your source file name as its title.
--no-doc
The --doc option above is implied by other command line options (e.g., --css). If you do not want a complete html document to be created in such cases (e.g., you want to include the output in an existing document that already contains the global CSS style), you can disable it with --no-doc.
--css
-c
Specify the .css file that will be included in the generated html output.
--tab
-t
With this option, tab characters are converted into the specified number of space characters (tabulation points are preserved). This option is automatically selected when generating line numbers.
--output-dir
You can pass more than one input file to source-highlight (see Simple Usage). In this case you cannot specify the output file name, and the output files are automatically generated in the directory from which you invoked the command; if you want them to be generated in a different directory, use this option.
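
The remark about tabulation points for --tab can be illustrated with a short Python sketch. It assumes that "preserved" means each tab is expanded up to the next tab stop rather than into a fixed number of spaces; the function name expand_tabs is hypothetical and the code relies on Python's str.expandtabs, not on source-highlight itself.

```python
def expand_tabs(line, tab_length=8):
    """Replace tabs with spaces so that text stays aligned on tab stops."""
    return line.expandtabs(tab_length)


# The character after the tab lands on column 8 (counting from 0)
# regardless of how many characters precede the tab.
print(repr(expand_tabs("a\tb")))
print(repr(expand_tabs("abc\td", 4)))
```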



7 Language Definitions

Since version 2.0 source-highlight uses a specific syntax to specify source language elements (e.g., keywords, strings, comments, etc.). Before version 2.0, language elements were scanned through Flex. This had the drawback that a new flex file had to be written to deal with a new language; even worse, a new language could not be added “dynamically”: you had to recompile the whole source-highlight program.

Now, instead, language elements are specified in a file, loaded dynamically, through a (hopefully) simple syntax. These definitions are then used internally to create, on the fly, the regular expressions that are used to highlight the elements. In particular, we use the regular expressions provided by the Boost library (see Installation). Thus, when writing a language definition file you will surely have to deal with regular expressions. Of course, we use the regular expression syntax of the Boost regex library; for this syntax we refer to the Boost documentation, http://www.boost.org/libs/regex/doc/syntax.html.

Here we describe this syntax in detail, relying on many examples. This allows a user to easily modify an existing language definition and to create a new one. These files typically have the extension .lang.

Each definition basically associates a regular expression with a language element and defines a name for the language element. This name is used to associate a particular style (e.g., bold face, color, etc.) with the highlighting of such elements. You cannot use names that are the same as keywords of the language definition syntax (e.g., start, as shown later, is a reserved word).

Comments can be given by using #; the rest of the line is considered as a comment.

7.1 Simple definitions

The simplest way to specify language elements is to list the possible alternatives. This is the case, for instance, for keywords. In java.lang you have:

     keyword = "abstract|assert|break|case|catch|class|const",
               "continue|default|do|else|extends|false|final",
               "finally|for|goto|if|implements|instanceof|interface"
     keyword = "native|new|null|private|protected|public|return",
               "static|strictfp|super|switch|synchronized|throw",
               "throws|true|this|transient|try|volatile|while"

The elements must be specified in double quotes. You can separate quoted definitions with commas; within a quoted definition, alternatives are separated with the pipe symbol |. The above definition defines the language element keyword. Each time such an element is found in the source file, it is highlighted with the style for the element with the same name in the output format style file (notice that all elements shown in the examples are taken from the language definition files that come with source-highlight, and there is a style for each of these elements; see Configuration files). If an element is not specified in the output format style file, it is simply not highlighted (so pay attention to typos :-).

From the above example you may have noticed that language element definitions are cumulative, so the second keyword definition does not replace the first one. (Indeed, in some case you may want to actually redefine a language element; this is possible as explained in the following sections.)

Notice that words specified in double quotes have to match exactly in a source file, and they must be isolated (not surrounded by anything but spaces). Thus for instance class is matched as a keyword, but in my_class the substring class is not matched as keyword. From the point of view of regular expressions a string such as class in a double quote simple definition is intended as \<(class)\>.

Special characters have to be escaped with the character \. So for instance if you want to specify the character |, which is normally used to separate alternatives in double quoted strings, you have to specify \|.

Definitions in double quotes are interpreted literally (thus, e.g., a dot . is interpreted as the character . and not as the regular expression wild card). If you want to enjoy the full power of regular expressions to specify a language alternative, you have to use single quoted strings (') instead of double quoted strings.

For instance, the following is the definition for a preprocessor directive in C/C++:

     preproc = '^[[:blank:]]*#([[:blank:]]*[[:word:]]*)'

Notice that the definition 'class' is different from "class", as explained above. Thus, for instance 'class' matches also the sub-expression class inside my_class.
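
The difference between the two kinds of quotes can be tried out with the following Python sketch. It is only an approximation of the semantics described above: Python's re module is used instead of Boost regex, \b stands in for the Boost word delimiters \< and \>, the escaping of | with \ is not handled, and the function names double_quoted and single_quoted are hypothetical.

```python
import re


def double_quoted(alternatives):
    """Sketch of a double-quoted definition: literal text, whole words only."""
    words = [re.escape(word) for word in alternatives.split("|")]
    return re.compile(r"\b(?:" + "|".join(words) + r")\b")


def single_quoted(expression):
    """Sketch of a single-quoted definition: a plain regular expression."""
    return re.compile(expression)


keyword = double_quoted("class|new")
print(bool(keyword.search("class Foo")))                   # isolated word
print(bool(keyword.search("my_class")))                    # part of an identifier
print(bool(single_quoted("class").search("my_class")))     # substring match
```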

7.2 Line wide definitions

It is often useful to define a language element that affects all the remaining characters up to the end of the line. For such definitions, instead of the = you must use the keyword start. For instance, the following is the definition of a single line comment in C++:

     comment start "//"

This says that when the two characters // are encountered in the source file, everything from these characters (included) up to the end of the line will be highlighted according to the style comment.

7.3 Order of definitions

The order of language definitions matters, since it is used during regular expression matching. You have to make sure that, if there are definitions that start with the same characters, the longest expression is specified first in the file. For instance, if you write

     symbol = "/"
     comment start "//"

The first expression will always be matched first, and the second expression will never be matched. The right order is

     comment start "//"
     symbol = "/"
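
The effect of the order can be reproduced with a small Python sketch in which definitions are tried one after the other and the first one that matches wins. This is an assumption about the matching strategy made for the sake of illustration, not the actual implementation; the names first_match, WRONG_ORDER and RIGHT_ORDER are hypothetical.

```python
import re


def first_match(definitions, text):
    """Return (name, matched text) for the first definition matching at the start."""
    for name, pattern in definitions:
        match = re.match(pattern, text)
        if match:
            return name, match.group(0)
    return None


WRONG_ORDER = [("symbol", r"/"), ("comment", r"//.*")]
RIGHT_ORDER = [("comment", r"//.*"), ("symbol", r"/")]

print(first_match(WRONG_ORDER, "// a note"))  # the comment is never reached
print(first_match(RIGHT_ORDER, "// a note"))
print(first_match(RIGHT_ORDER, "/ 2"))
```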

7.4 Delimited definitions

Many elements, for instance strings and multiline comments, are delimited by specific character sequences. The syntax for such an element definition is

     <name> delim <left delimiter> <right delimiter> \
             {escape <escape character>} \
             {multiline} {nested}

The escape specification allows you to specify the escape character that may precede one of the delimiters inside the element. This is optional.

For instance, this is the definition of C-like strings:

     string delim "\"" "\"" escape "\\"

Notice that \ is a special character in definitions, so it has to be escaped. If the escape specification were omitted, the C string "write \"hello\" string" would be highlighted incorrectly (it would be highlighted as the string "write \", the normal character sequence hello\ and the string " string").
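
The incorrect splitting described above can be verified with two regular expressions written for Python's re module. They are hand-written approximations of a delimited definition with and without an escape character, not the expressions that source-highlight generates internally.

```python
import re

# Delimited by double quotes, no escape character.
WITHOUT_ESCAPE = re.compile(r'"[^"]*"')
# Delimited by double quotes, with \ as escape character.
WITH_ESCAPE = re.compile(r'"(?:\\.|[^"\\])*"')

SOURCE = r'"write \"hello\" string"'

print(WITHOUT_ESCAPE.findall(SOURCE))  # two separate "strings"
print(WITH_ESCAPE.findall(SOURCE))     # the whole literal
```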

The option multiline specifies that the element can span multiple lines. For instance, PHP strings are defined as follows:

     string delim "\"" "\"" escape "\\" multiline

The option nested makes source-highlight count the occurrences of the left delimiter inside the element and match each of them with a corresponding right delimiter. For instance, C-like multiline comments are specified as follows:

     comment delim "/*" "*/" multiline nested

If nested were not used, the following nested comment would not be highlighted correctly:

     /*
        This is a /* nested comment */
     */
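
The counting implied by nested can be sketched in Python as follows. The function find_delimited_end is a hypothetical helper written for this illustration; it is not how source-highlight implements the option, it only shows where a delimited element would end with and without counting nested delimiters.

```python
def find_delimited_end(text, start, left, right, nested=False):
    """Return the index just past the delimiter that closes the element at start."""
    if not text.startswith(left, start):
        raise ValueError("no left delimiter at the given position")
    depth = 1
    pos = start + len(left)
    while pos < len(text):
        if nested and text.startswith(left, pos):
            depth += 1            # one more right delimiter is now required
            pos += len(left)
        elif text.startswith(right, pos):
            depth -= 1
            pos += len(right)
            if depth == 0:
                return pos
        else:
            pos += 1
    return None  # the element is not terminated


SOURCE = "/*\n   This is a /* nested comment */\n*/"

end = find_delimited_end(SOURCE, 0, "/*", "*/")
print(repr(SOURCE[end:]))  # what is left outside the comment without nested
print(find_delimited_end(SOURCE, 0, "/*", "*/", nested=True) == len(SOURCE))
```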

As said above, definitions are cumulative, even when they use different syntactic forms. Thus, for instance, the complete definition of C++-style comments is the following:

     comment start "//"
     comment delim "/*" "*/" multiline nested

7.5 Variable definitions

It is possible to define variables to be re-used in many parts in a language definition file. A variable is defined by using

     vardef <name of the variable> = <list of definitions>

Once defined, a variable can be used by prepending the symbol $ to its name. For instance,

     vardef FUNCTION = '(?:[[:alpha:]]|_)[[:word:]]*[[:blank:]]*(?=\()'
     function = $FUNCTION

The capital letters are used only for readability.

It is also possible to concatenate variables and expressions, and reuse variables inside further variable definitions:

     vardef basic_time = '[[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}'
     vardef time = '\<' + $basic_time + '\>'
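
The same kind of reuse can be mimicked in Python by concatenating strings, as in the sketch below. The expressions are translated by hand for Python's re module, which has no POSIX classes such as [[:digit:]] (here \d, \w and [ \t] are used instead) and no \< and \> (here \b is used), so they are approximations of the definitions above, not the expressions used by source-highlight.

```python
import re

BASIC_TIME = r"\d{2}:\d{2}:\d{2}"
TIME = r"\b" + BASIC_TIME + r"\b"  # concatenation, as with + in a vardef

# An identifier followed by optional blanks and an opening parenthesis;
# the lookahead (?=\() keeps the parenthesis out of the match.
FUNCTION = r"(?:[A-Za-z]|_)\w*[ \t]*(?=\()"

print(re.search(TIME, "started at 12:34:56 today").group(0))
print(re.search(FUNCTION, "setTags( $1 )").group(0))
```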

7.6 File Inclusion

It is possible to include other language definition files in a language definition file. The inclusion physically inserts the contents of the included file into the current file during parsing, at the exact point of inclusion (just like #include in C/C++). This is useful for re-using definitions in many files. For instance, C++ comment definitions are given in a file c_comment.lang, and this file is included in the Java and C++ definition files. The same happens for numbers and functions. For instance, the file java.lang contains the following include instructions:

     include "c_comment.lang"
     
     include "number.lang"
     
     keywords ...
     
     include "function.lang"

Notice that the order of inclusion is crucial, since the order of definitions is crucial. If the function definitions were included before the keyword definitions, then the statement if (exp) would be highlighted as a function invocation.
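
Textual inclusion of this kind can be sketched in a few lines of Python. In the sketch the files are kept in a dictionary instead of being read from disk, the contents of the sample files are abbreviated, and the check for circular inclusion is an addition of the sketch; none of this is taken from the source-highlight sources.

```python
import re

INCLUDE = re.compile(r'^\s*include\s+"([^"]+)"\s*$')


def expand_includes(name, files, _stack=()):
    """Return the lines of a definition file with include lines expanded in place."""
    if name in _stack:
        raise ValueError("circular include: " + name)
    lines = []
    for line in files[name].splitlines():
        match = INCLUDE.match(line)
        if match:
            # The included lines appear exactly at the point of inclusion.
            lines.extend(expand_includes(match.group(1), files, _stack + (name,)))
        else:
            lines.append(line)
    return lines


FILES = {
    "java.lang": 'include "c_comment.lang"\nkeyword = "class"\ninclude "function.lang"',
    "c_comment.lang": 'comment start "//"',
    "function.lang": "function = $FUNCTION",
}

for line in expand_includes("java.lang", FILES):
    print(line)
```

The printed order shows why the position of each include matters: the comment definition comes first and the function definition last, after the keywords.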

7.7 State/Environment Definitions

Sometimes you want some source elements to be highlighted only if they are surrounded by other elements. Source-highlight language definitions also provide this feature:

     state|environment <standard definition> begin
       <other definitions>
     end

This structure is recursive (so other state/environment definitions can be given within a state/environment). The meaning of a state/environment is that the definitions within the begin ... end are matched only if the definitions that define the state/environment have been matched. When entering a state/environment, however, the definitions given outside the state/environment are not matched. The difference between state and environment is that in the latter, normal parts of the source language (i.e., those that do not match any definition) are highlighted according to the style of the definition that defines the environment.

As an example, the following defines the multiline nested C comment, and highlights URL and e-mail addresses only when they appear inside a comment (notice that this uses file inclusion):

     environment comment delim "/*" "*/" multiline nested begin
           include "url.lang"
     end

Notice that we used environment because everything else inside a comment has to be formatted according to the comment style.

While states/environments can be avoided for programming language definitions, they are quite important for highlighting files such as logs and ChangeLog files, since elements have to be highlighted only when they appear in a specific position. For instance, for ChangeLog (see changelog.lang), we use a state for highlighting the date, name and e-mail:

     state date start '[[:digit:]]{2,4}-?[[:digit:]]{2}-?[[:digit:]]{2}' begin
       string = '<(?:[[:word:]]*|\.)+@(?:[[:word:]]*|\.)+>'
       url = '(?:[[:word:]]|[[:punct:]])+'
     end

Notice that definitions that appear inside a state/environment have the same scope as the expressions that define the state/environment. While this makes sense for start and delim definitions, it makes less sense for simple definitions (i.e., those that simply list all possible expressions): in fact, such expressions do not define a scope. For such definitions, the semantics is that the state/environment starts after one of the alternatives has been matched. Where does it end? In this case you must exit the state/environment explicitly: a definition inside the state/environment can be marked with the keyword exit, so that, when it is matched, the state/environment is also exited. You can even exit all the enclosing states/environments with exitall. For instance, the following definition highlights a non-empty string following a web method:

     vardef non_empty = '[^[:blank:]]+'
     
     state webmethod = "OPTIONS|GET|HEAD|POST|PUT|DELETE",
               "TRACE|CONNECT|PROPFIND|MKCOL|COPY|MOVE|LOCK|UNLOCK" begin
       string = $non_empty exit
     end

If you ever need such advanced features, you may want to take a look at the log.lang definition file, which defines highlighting for several log files (access logs, Apache logs, etc.).
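
The behaviour of the webmethod state with exit can be imitated with a small Python state machine. This is a simplified sketch under my own assumptions (only a few web methods, Python regular expressions, unmatched characters simply skipped); it is meant to show the idea of entering a state on a match and leaving it on exit, not to reproduce source-highlight's engine.

```python
import re

WEBMETHOD = re.compile(r"\b(?:OPTIONS|GET|HEAD|POST|PUT|DELETE)\b")
NON_EMPTY = re.compile(r"[^ \t]+")


def highlight(line):
    """Return (style, text) pairs for the elements recognized in a line."""
    result = []
    pos = 0
    in_state = False
    while pos < len(line):
        if in_state:
            # Inside the state only the inner definition is tried.
            match = NON_EMPTY.match(line, pos)
            if match:
                result.append(("string", match.group(0)))
                in_state = False  # effect of the exit keyword
                pos = match.end()
                continue
        else:
            match = WEBMETHOD.match(line, pos)
            if match:
                result.append(("webmethod", match.group(0)))
                in_state = True   # entering the state
                pos = match.end()
                continue
        pos += 1  # normal text, not highlighted
    return result


print(highlight("GET /index.html HTTP/1.1"))
```

Only the first non-empty string after the method is highlighted; after the exit, the rest of the line is treated as normal text again.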

7.8 Concluding Remarks

By mixing all these features you can unleash your imagination and define highlighting for complex source languages such as Flex and Bison by writing a few lines of code and re-using existing ones. For instance, Flex and Bison have their own syntax and let you write C/C++ code in specific parts of the source file; e.g., in the following example, the code between the outermost brackets is C++ code and should be highlighted following the C++ language definitions (apart from variables that are prefixed with $):

     globaltags : options { if (...) { setTags( $1 ); } }

This is easy to do (taken from flex.lang):

     state cbracket delim "{" "}" multiline nested begin
       variable = '\$.'
       include "cpp.lang"
     end

Notice that, since we used nested, we can be sure that the C++ language definitions are no longer considered once the last closing } has been matched.



8 Reporting Bugs

If you find a bug in source-highlight, please send electronic mail to

bug-source-highlight at gnu dot org

Include the version number, which you can find by running source-highlight --version. Also include in your message the output that the program produced and the output you expected.

If you have other questions, comments or suggestions about source-highlight, contact the author via electronic mail (find the address at http://www.lorenzobettini.it). The author will try to help you out, although he may not have time to fix your problems.



9 Mailing Lists

The following mailing lists are available:

help-source-highlight at gnu dot org

for generic discussions about the program and for asking for help about it (open mailing list), http://mail.gnu.org/mailman/listinfo/help-source-highlight

info-source-highlight at gnu dot org

for receiving information about new releases and features (read-only mailing list), http://mail.gnu.org/mailman/listinfo/info-source-highlight.

If you want to subscribe to a mailing list just go to the URL and follow the instructions, or send me an e-mail and I'll subscribe you.




Footnotes

[1] Command lines that are too long are split into multiple indented lines separated by a \. Of course, these commands can also be given on one line only.

[2] This is the main difference introduced in version 2.0 with respect to the previous version.