eLyXer Developer Guide
Alex Fernández (elyxer@gmail.com)
1 The Basics
This document should help you get started if you want to understand how eLyXer works, and perhaps extend its functionality. The package (including this guide and all accompanying materials) is licensed under the GPL version 3 or, at your option, any later version. See the LICENSE file for details. Also visit the main page to find out about the latest developments.
In this first section we will outline how eLyXer performs the basic tasks. The next section is devoted to more arcane matters. The third section deals with planned future extensions, and the fourth covers things that will probably not be implemented. Finally there is a FAQ that contains answers to questions asked privately and on the lyx-devel list [7].
1.1 Getting eLyXer
If you are interested in eLyXer from a developer perspective the first thing to do is fetch the code. It is included in the standard distribution, so just navigate to the src/ folder and take a look at the .py Python code files.
For more serious use, or if your distribution did not carry the source code, or perhaps to get the latest copy of the code, you need to install the tool git, created by Linus Torvalds (of Linux fame) [8]. You will also need to have Python installed; versions at or above 2.4 should be fine [9]. The code is hosted in Savannah [1], a GNU project for hosting non-GNU projects. So first you have to fetch the code:
$ git clone git://git.sv.gnu.org/elyxer.git
You should see some output similar to this:
Initialized empty Git repository in /home/user/install/elyxer/.git/
remote: Counting objects: 528, done.
remote: Compressing objects: 100% (157/157), done.
remote: Total 528 (delta 371), reused 528 (delta 371)
Receiving objects: 100% (528/528), 150.00 KiB | 140 KiB/s, done.
Resolving deltas: 100% (371/371), done.
Now enter the directory that git has created.
$ cd elyxer
Your first task is to create the main executable file:
$ ./make
The build system for eLyXer will compile it for you, and even run some basic tests. (We will see later in section 2.5 how this “compilation” is done.) Now you can try it out:
$ cd docs/
$ ../elyxer.py devguide.lyx devguide2.html
You have just created your first eLyXer page! The result is in devguide2.html; to view it in Firefox:
$ firefox-bin devguide2.html
If you want to debug eLyXer then it is better to run it from the source code folder instead of the compiled version. This requires just a small change: instead of elyxer.py, run src/principal.py:
$ ../src/principal.py --debug devguide.lyx devguide2.html
and you will see the internal debug messages.
Note for Windows developers: on Windows eLyXer needs to be invoked using the Python executable, and of course with the slashes changed to backslashes:
> Python ..\elyxer.py devguide.lyx devguide2.html
or for the source code version:
> Python ..\src\principal.py devguide.lyx devguide2.html
In the rest of this section we will delve a little bit into how eLyXer works.
1.2 Containers
The basic code artifact (or ‘class’ in Python talk) is the Container, located in the gen package. Its responsibility is to take a bit of LyX code and generate working HTML code. This includes (with the aid of some helper classes): reading a few lines from a file, converting them to HTML, and writing the resulting lines to a second file.
Figure 1 shows how a Container works. Each type of Container should have a parser and an output, and a list of contents. The parser object receives LyX input and produces a list of contents that is stored in the Container. The output object then converts those contents to a portion of valid HTML code.
Two important class attributes of a Container are:
- start: a string of text containing the LyX command that we are about to process;
- ending: used by some Containers to determine when to stop parsing.
A class called ContainerFactory has the responsibility of creating the appropriate containers, as the strings in their start attributes are found.
The basic method of a Container is:
- process(): called after parsing the LyX text and before outputting the HTML result. Here the Container can alter its contents as needed, once everything has been read and before it is output.
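The life cycle described above (parse, then process(), then output) can be sketched in a few lines. This is only an illustration and not the actual eLyXer code: the classes SketchParser, SketchOutput and SketchContainer, the function convert(), the command \shout and the use of a plain list of lines as input are all assumptions made for the example.

```python
class SketchParser(object):
    "Hypothetical parser that works on a plain list of lines."

    def parseheader(self, lines):
        "The first line, split into words, becomes the header."
        return lines[0].split()

    def parse(self, lines):
        "All remaining lines become the contents."
        return lines[1:]


class SketchOutput(object):
    "Hypothetical output: one HTML paragraph per content line."

    def gethtml(self, container):
        return ['<p>' + line + '</p>\n' for line in container.contents]


class SketchContainer(object):
    "A container owns a parser, an output and a list of contents."

    start = '\\shout'  # invented command, not a real LyX one

    def __init__(self):
        self.parser = SketchParser()
        self.output = SketchOutput()
        self.contents = []

    def process(self):
        "Alter the contents after parsing and before output."
        self.contents = [line.upper() for line in self.contents]


def convert(container, lines):
    "Run the three stages in order: parse, process, output."
    container.header = container.parser.parseheader(lines)
    container.contents = container.parser.parse(lines)
    container.process()
    return container.output.gethtml(container)


shout = SketchContainer()
html = convert(shout, ['\\shout on', 'hello', 'world'])
print(''.join(html))
```

Note how process() sees the complete contents before anything is written, which is what lets a Container rearrange or rewrite them.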
Now we will see each subordinate class in detail.
1.3 Parsers
The package parse contains almost all parsing code; it has been isolated on purpose so that LyX format changes can be tackled just by changing the code in that directory.
A Parser has two main methods: parseheader() and parse().
parseheader(): parses the first line and returns the contents as a list of words. This method is common for all Parsers. For example, for the command '\\emph on' the Parser will return a list ['\\emph', 'on']. This list will end up in the Container as an attribute header.
parse(): parses all the remaining lines of the command. They will end up in the Container as an attribute contents. This method depends on the particular Parser employed.
Basic Parsers reside in the file parser.py. Among them are the following usual classes:
LoneCommand: parses a single line containing a LyX command.
BoundedParser: reads until it finds the ending. For each line found within, the BoundedParser will call the ContainerFactory to recursively parse its contents. The parser then returns everything found inside as a list.
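The two kinds of parser can be sketched as follows. This is a simplified illustration under several assumptions: SketchReader, SketchLoneCommand, SketchBoundedParser and plainfactory() are hypothetical names, the reader is a plain list with an index, and the stand-in factory just returns each line as a string instead of recursively building Containers as the real ContainerFactory would.

```python
class SketchReader(object):
    "Hypothetical line reader: a list of lines and a position."

    def __init__(self, lines):
        self.lines = list(lines)
        self.index = 0

    def currentline(self):
        return self.lines[self.index]

    def nextline(self):
        self.index += 1

    def finished(self):
        return self.index >= len(self.lines)


class SketchParser(object):
    "Common part: the header is the first line split into words."

    def parseheader(self, reader):
        header = reader.currentline().split()
        reader.nextline()
        return header


class SketchLoneCommand(SketchParser):
    "A single line: nothing is left to parse after the header."

    def parse(self, reader):
        return []


class SketchBoundedParser(SketchParser):
    "Reads until the ending, handing each inner element to a factory."

    def __init__(self, ending, factory):
        self.ending = ending
        self.factory = factory

    def parse(self, reader):
        contents = []
        while not reader.finished():
            if reader.currentline().startswith(self.ending):
                reader.nextline()  # consume the ending itself
                break
            contents.append(self.factory(reader))
        return contents


def plainfactory(reader):
    "Stand-in for the ContainerFactory: consume one line and return it."
    line = reader.currentline()
    reader.nextline()
    return line


reader = SketchReader([
    '\\begin_inset CommandInset bibitem',
    'LatexCommand bibitem',
    'key "mykey"',
    '\\end_inset',
    '\\emph on',
])
bounded = SketchBoundedParser('\\end_inset', plainfactory)
insetheader = bounded.parseheader(reader)
insetcontents = bounded.parse(reader)
lone = SketchLoneCommand()
emphheader = lone.parseheader(reader)
emphcontents = lone.parse(reader)
print(insetheader, insetcontents, emphheader)
```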
1.4 Outputs
Common outputs reside in output.py. They have just one method:
gethtml(): processes the contents of a Container and returns a list with file lines. Newlines (\n) must be added manually at the desired points; eLyXer will just merge all lines and write them to file.
Outputs do not, however, inherit from a common class; all you need is an object with a method gethtml(self,container) that processes the Container’s contents (as a list attribute). An output can also use all attributes of a Container to do its job.
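As an illustration of how little an output needs, here is a minimal sketch. The classes SketchParagraph and SketchTagOutput are hypothetical, and the contents are assumed to be plain strings; the only thing taken from the text above is the gethtml(self, container) signature.

```python
class SketchParagraph(object):
    "Hypothetical container: a tag name and a list of plain strings."

    def __init__(self, contents):
        self.tag = 'p'
        self.contents = contents


class SketchTagOutput(object):
    "No base class needed: having a gethtml() method is enough."

    def gethtml(self, container):
        # newlines are added by hand, since lines are merged as they are
        html = ['<' + container.tag + '>\n']
        for line in container.contents:
            html.append(line + '\n')
        html.append('</' + container.tag + '>\n')
        return html


paragraph = SketchParagraph(['one', 'two'])
lines = SketchTagOutput().gethtml(paragraph)
print(''.join(lines))
```

The output reads the tag from the container, which shows how an output can rely on any attribute that the Container chooses to expose.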
1.5 Tutorial: Adding Your Own Container
If you want to add your own Container to the processing you do not need to modify all these files. You just need to create your own source file that includes the Container, the Parser and the output (or reuse existing ones). Once it is added to the types in the ContainerFactory eLyXer will happily start matching it against LyX commands as they are parsed.
There are good examples of parsing commands in just one file in image.py and formula.py. But let us build our own container BibitemInset here. We want to parse the LyX command in listing 1. In the resulting HTML we will generate an anchor: a single tag <a name="mykey"> with fixed text "[ref]".
\begin_inset CommandInset bibitem
LatexCommand bibitem
key "mykey"
\end_inset
Algorithm 1 The LyX command to parse.
We will call the Container BibitemInset, and it will process precisely the inset that we have here. We will place the class in bibitem.py. So this file starts as shown in listing 2.
class BibitemInset(Container):
    "An inset containing a bibitem command"
    start = '\\begin_inset CommandInset bibitem'
    ending = '\\end_inset'
Algorithm 2 Class definition for BibitemInset.
We can use the parser for a bounded command with start and ending, BoundedParser. For the output we will generate a single HTML tag <a>, so the TagOutput() is appropriate. Finally we will set the breaklines attribute to False, so that the output shows the tag in the same line as the contents: <a …>[ref]</a>. Listing 3 shows the constructor.
    def __init__(self):
        self.parser = BoundedParser()
        self.output = TagOutput()
        self.tag = 'a'
        self.breaklines = False
Algorithm 3 Constructor for BibitemInset.
The BoundedParser will automatically parse the header and the contents. In the process() method we will discard the first line with the LatexCommand, and use the key from the second line as the link destination. The class StringContainer holds string constants; in our case we will have to isolate the key by splitting the string around the double quote ", and then use it as the name of the anchor. The contents will be set to the fixed string [ref]. The result is shown in listing 4.
    def process(self):
        # skip first line
        del self.contents[0]
        # parse second line: fixed string
        string = self.contents[0]
        # split around the "
        key = string.contents[0].split('"')[1]
        # make tag and contents
        self.tag = 'a name="' + key + '"'
        string.contents[0] = '[ref] '
Algorithm 4 Processing for BibitemInset.
And then we have to add the new class to the types parsed by the ContainerFactory; this has to be done outside the class definition. The complete file is shown in listing 5.
from parser import *
from output import *
from container import *

class BibitemInset(Container):
    "An inset containing a bibitem command"
    start = '\\begin_inset CommandInset bibitem'
    ending = '\\end_inset'

    def __init__(self):
        self.parser = BoundedParser()
        self.output = TagOutput()
        self.breaklines = False

    def process(self):
        # skip first line
        del self.contents[0]
        # parse second line: fixed string
        string = self.contents[0]
        # split around the "
        key = string.contents[0].split('"')[1]
        # make tag and contents
        self.tag = 'a name="' + key + '"'
        string.contents[0] = '[ref] '

ContainerFactory.types.append(BibitemInset)
Algorithm 5 Full listing for BibitemInset.
The end result of processing the command in listing 1 is a valid anchor:
<a name="mykey">[ref] </a>
The final touch is to make sure that the class is run, importing it in the file gen/factory.py, as shown in listing 6. This ensures that the ContainerFactory will know what to do when it finds an element that corresponds to the BibitemInset.
…
from structure import *
from bibitem import *
from container import *
…
Algorithm 6 Importing the BibitemInset from the factory file.
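To try the process() logic of the tutorial without the rest of eLyXer, the following self-contained sketch replaces the real helper classes with stand-ins. StubStringContainer and StubTagOutput are assumptions written for this example (the real StringContainer and TagOutput may well behave differently), and the contents are filled in by hand with what a bounded parser would presumably leave behind for listing 1.

```python
class StubStringContainer(object):
    "Stand-in: holds one string constant in its contents."

    def __init__(self, text):
        self.contents = [text]


class StubTagOutput(object):
    "Stand-in: wraps the contents in the tag kept by the container."

    def gethtml(self, container):
        separator = ''
        if container.breaklines:
            separator = '\n'
        html = ['<' + container.tag + '>' + separator]
        for element in container.contents:
            for text in element.contents:
                html.append(text + separator)
        # the closing tag repeats only the tag name, not its attributes
        html.append('</' + container.tag.split()[0] + '>' + separator)
        return html


class BibitemInset(object):
    "The tutorial class, minus the real base class and parser."

    def __init__(self):
        self.output = StubTagOutput()
        self.breaklines = False
        # assumed result of parsing the two inner lines of listing 1
        self.contents = [
            StubStringContainer('LatexCommand bibitem'),
            StubStringContainer('key "mykey"'),
        ]

    def process(self):
        # skip first line
        del self.contents[0]
        # parse second line: fixed string
        string = self.contents[0]
        # split around the "
        key = string.contents[0].split('"')[1]
        # make tag and contents
        self.tag = 'a name="' + key + '"'
        string.contents[0] = '[ref] '


inset = BibitemInset()
inset.process()
result = ''.join(inset.output.gethtml(inset))
print(result)
```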
Now this Container is not too refined: the link text is fixed, and additional processing on the bibitem entry would be needed to show consecutive numbers. But in less than 20 lines we have parsed a new LyX command and have output valid, working XHTML code. The actual code is a bit different but follows the same principles; it can be found in src/link.py, in the classes BiblioCite and BiblioEntry, which process bibliography entries and citations (with all our missing bits) in about 50 lines.
2 Advanced Features
This section tackles other, more complex features; all of them are included in current versions.
2.1 Parse Tree
On initialization of the ContainerFactory, a ParseTree is created to quickly pass each incoming LyX command to the appropriate containers, which are created on demand. For example, when the ContainerFactory finds a command:
\emph on
it will create and initialize an EmphaticText object. The ParseTree works with words: it creates a tree where each keyword has its own node. At that node there may be a leaf, which is a Container class, and/or additional branches that point to other nodes. If the tree finds a Container leaf at the last node then it has found the right point; otherwise it must backtrack to the last node with a Container leaf.
Figure 2 shows a piece of the actual parse tree. You can see how if the string to parse is \begin_inset LatexCommand, at the node for the second keyword LatexCommand there is no leaf, just two more branches label and ref. In this case the ParseTree would backtrack to begin_inset, and choose the generic Inset.
Parsing is much faster this way, but there are disadvantages; for one, parsing can only be done using whole words and not prefixes. SGML tags (such as <lyxtabular>) pose particular problems: sometimes they may appear with attributes (as in <lyxtabular version="3">), and in this case the starting word is <lyxtabular without the trailing ’>’ character. So the parse tree removes any trailing ’>’, and the start string would be just <lyxtabular; this way both starting words <lyxtabular> and <lyxtabular are recognized.
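The word-based lookup can be sketched with nested dictionaries. This is an illustration under assumptions, not the real ParseTree: the class SketchParseTree and the four small container classes are hypothetical. Instead of literally walking back up the tree, the sketch remembers the last leaf seen on the way down, which gives the same result as backtracking to the last node with a Container leaf.

```python
class SketchParseTree(object):
    "Word-based tree: each node maps a keyword to a child node."

    LEAF = object()  # key under which a node keeps its container class

    def __init__(self, types):
        self.root = {}
        for containertype in types:
            self.add(containertype)

    def words(self, text):
        "Split into words, dropping any trailing '>' from each one."
        return [word.rstrip('>') for word in text.split()]

    def add(self, containertype):
        "Create one node per keyword of the start string."
        node = self.root
        for word in self.words(containertype.start):
            node = node.setdefault(word, {})
        node[self.LEAF] = containertype

    def find(self, line):
        "Return the class at the deepest matching node that has a leaf."
        node = self.root
        found = None
        for word in self.words(line):
            if word not in node:
                break
            node = node[word]
            if self.LEAF in node:
                found = node[self.LEAF]
        return found


class SketchInset(object):
    start = '\\begin_inset'

class SketchLabel(object):
    start = '\\begin_inset LatexCommand label'

class SketchReference(object):
    start = '\\begin_inset LatexCommand ref'

class SketchTable(object):
    start = '<lyxtabular'


tree = SketchParseTree([SketchInset, SketchLabel, SketchReference, SketchTable])
print(tree.find('\\begin_inset LatexCommand label').__name__)
print(tree.find('\\begin_inset LatexCommand').__name__)
print(tree.find('<lyxtabular version="3">').__name__)
```

Since lookup goes word by word, its cost depends on the number of words in the line and not on the number of registered Container types.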
2.2 Postprocessors
Some post-processing of the resulting HTML page can make the results look much better. The main stage in the postprocessing pipeline inserts a title “Bibliography” before the first bibliographical entry. But more can be added to alter the result. As eLyXer parses a LyX document it automatically numbers all chapters and sections. This is also done in the postprocessor.
The package post contains most postprocessing code, although some postprocessors are located in the classes of their containers for easy access.
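A postprocessing stage like the one that inserts the “Bibliography” title might look as follows. This is a sketch under assumptions: the function addbibliographytitle(), the representation of elements as (kind, html) pairs, the kind names and the <h1> tag are all invented for the example.

```python
def addbibliographytitle(elements):
    "Insert a title before the first bibliography entry, if there is one."
    result = []
    inserted = False
    for kind, html in elements:
        if kind == 'biblio' and not inserted:
            result.append(('title', '<h1>Bibliography</h1>'))
            inserted = True
        result.append((kind, html))
    return result


document = [
    ('text', '<p>See [1] and [2].</p>'),
    ('biblio', '<p>[1] First entry.</p>'),
    ('biblio', '<p>[2] Second entry.</p>'),
]
processed = addbibliographytitle(document)
for kind, html in processed:
    print(html)
```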
2.3 Mathematical Formulae
Formulae in LyX are rendered beautifully into TeX and PDF documents. For HTML the conversion is not so simple. There are basically three options:
- render the formula as an image (GIF or PNG), then import the image;
- export a specific language called MathML;
- or render using a variety of Unicode characters, HTML and CSS wizardry [2].
eLyXer employs the third technique, with varied results. Basic fractions and square roots should render fine, although some issues may still be pending. Complex fractions with several levels do not come out right. (But see subsection 3.3.)
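To give an idea of what the third technique involves, here is a minimal sketch that builds the HTML for a fraction and a square root. The function names and the CSS class names (fraction, numerator, denominator, radicand) are assumptions for the example; the visual layout itself would have to come from matching rules in the CSS file, which are not shown.

```python
def fraction(numerator, denominator):
    "Nested spans; the CSS is expected to stack them vertically."
    return ''.join([
        '<span class="fraction">',
        '<span class="numerator">', numerator, '</span>',
        '<span class="denominator">', denominator, '</span>',
        '</span>',
    ])


def squareroot(radicand):
    "Unicode radical sign, then the radicand in a span for the CSS to style."
    return u'\u221a<span class="radicand">' + radicand + '</span>'


half = fraction('1', '2')
nested = squareroot(fraction('a', 'b + 1'))
print(half)
```

Nesting one construct inside another, as in the last line, is where this approach starts to struggle: the CSS has to cope with boxes of arbitrary height.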
2.4 Baskets
eLyXer supports a few distinct modes of operation. In each incarnation the tasks to do are quite different:
- A pure filter: read from disk and write to disk each Container, keeping no context in memory.
- In-memory processing: read a complete file, process it and write it all to disk.
- TOC generation: output just a table of contents for a LyX document.
- Split document generation: separates each chapter, section or subsection in a different file.
How can it do so many different tasks without changing a lot of code? The answer is in the file gen/basket.py. A Basket is an object that keeps Containers. Once a batch is ready, the Basket outputs them to disk or to some other Basket, but it may decide to just wait a little longer.
The basic Basket is the WriterBasket: it writes everything that it gets to disk immediately and then forgets about it. Some bits of state are kept around, like for example which bibliography entries have been found so far, but the bulk of the memory is released.
Another more complex object is the TOCBasket: it checks if the Container is worthy of appearing in a TOC, and otherwise just discards it. For chapters, sections and so on it converts them to TOC entries and outputs them to disk.
The MemoryBasket does most of its work in memory: it stores all Containers until they have all been read, then does some further processing on them and outputs an improved version of the document, at the cost of using considerably more memory. This allows us for example to generate a list of figures or to set consecutive labels for bibliography entries (instead of just numbering them as they appear in the text).
The most complex kind of Basket is the SplittingBasket: it writes each document part to a separate file, choosing what parts to split depending on the configuration passed in the --splitpart option. It can be taught to create a TOC at the top of each page with an additional --toc option. (Warning: not yet working in version 0.41.)
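The following sketch shows why swapping the Basket is enough to change the mode of operation. It is a simplified illustration, not the code in gen/basket.py: the classes SketchPiece, SketchWriterBasket, SketchTOCBasket and SketchMemoryBasket, the methods add() and finish(), and the use of a list in place of the output file are all assumptions.

```python
class SketchPiece(object):
    "Hypothetical stand-in for a Container: an HTML tag and some text."

    def __init__(self, tag, text):
        self.tag = tag
        self.text = text

    def gethtml(self):
        return '<%s>%s</%s>\n' % (self.tag, self.text, self.tag)


class SketchWriterBasket(object):
    "Writes each piece immediately and keeps nothing."

    def __init__(self):
        self.written = []  # stands in for the output file

    def add(self, piece):
        self.written.append(piece.gethtml())

    def finish(self):
        pass


class SketchTOCBasket(SketchWriterBasket):
    "Turns headings into TOC entries; everything else is discarded."

    def add(self, piece):
        if piece.tag in ('h1', 'h2'):
            self.written.append('<div class="toc">%s</div>\n' % piece.text)


class SketchMemoryBasket(SketchWriterBasket):
    "Stores everything, then writes an improved document at the end."

    def __init__(self):
        SketchWriterBasket.__init__(self)
        self.stored = []

    def add(self, piece):
        self.stored.append(piece)

    def finish(self):
        captions = [p.text for p in self.stored if p.tag == 'caption']
        for piece in self.stored:
            self.written.append(piece.gethtml())
        # only possible because the whole document is in memory
        self.written.append('<p>Figures: %s</p>\n' % ', '.join(captions))


def convert(pieces, basket):
    "The conversion loop is the same for every kind of basket."
    for piece in pieces:
        basket.add(piece)
    basket.finish()
    return ''.join(basket.written)


document = [
    SketchPiece('h1', 'Intro'),
    SketchPiece('p', 'Text'),
    SketchPiece('caption', 'A plot'),
]
print(convert(document, SketchTOCBasket()))
```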
2.5 Distribution
You will notice that in the src/ folder there are several Python files, while in the main directory there is just a big one. The reason is that before distribution the source code is coalesced and placed in the main directory, so that users can run it without worrying about libraries, directories and the like. (They need of course to have Python 2.5 installed.) The tool for the job is a little Python script called coalesce.py that does the dirty work of parsing dependencies and inserting them into the main file. There is also a make Bash script that takes care of permissions and generates the documentation. Just type
$ ./make
at the prompt. It is a primitive way perhaps to generate the “binary” (ok, not really a binary but a distributable Python file), but it works great. It also runs all of the included tests to check that no functionality has been lost from one release to the next — although some issues in a feature can slip undetected if there is no test for them.
The configuration file src/conf/config.py is also generated from another file, in this case src/conf/base.cfg. Changes to the configuration should always go against this latter file; running make afterwards regenerates config.py.
At the moment there is no way to do this packaging on non-Unix operating systems with a single script, e.g. a Windows .bat script. However the steps themselves are trivial.
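For the curious, the idea behind coalescing can be sketched as below. This is not the actual coalesce.py, whose details are not described here: the function coalesce(), the dictionary of sources standing in for the files in src/, and the restriction to lines of the form "from module import *" are assumptions made for the example.

```python
import re

# matches only whole-line star imports such as: from output import *
STARIMPORT = re.compile(r'^from (\w+) import \*$')


def coalesce(name, sources, done=None):
    "Return the lines of module name with local star imports inlined."
    if done is None:
        done = set()
    done.add(name)
    result = []
    for line in sources[name].splitlines():
        match = STARIMPORT.match(line)
        if match and match.group(1) in sources:
            module = match.group(1)
            # each local module is inserted only once
            if module not in done:
                result += coalesce(module, sources, done)
        else:
            # anything else, including other imports, is kept as it is
            result.append(line)
    return result


sources = {
    'main': 'import sys\nfrom container import *\nfrom output import *\nprint("run")',
    'container': 'from output import *\nclass Container(object): pass',
    'output': 'class TagOutput(object): pass',
}
coalesced = coalesce('main', sources)
print('\n'.join(coalesced))
```

Dependencies come out before the code that uses them, so the single resulting file can run on its own.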
If you are willing to send a patch then you should patch against the proper sources in src/ and submit the result to the eLyXer mailing list.
2.6 License and Contributions
eLyXer is published under the GPL, version 3 or later [3]. This basically means that you can modify the code and distribute the result as desired, as long as you publish your modifications under the same license. But consult a lawyer if you want an authoritative opinion.
All contributions will be published under this same license, so if you send them this way you implicitly give your consent. An explicit license grant would be even better and may be required for larger contributions.
The first external patches started arriving in late 2009 and early 2010 (provided by Olivier Ripoll, Geremy Condra and Simon South). You can join in the fun!
3 Future Extensions
The author has plans for the following extensions.
3.1 Templates
Some header and footer content is automatically added to the resulting document. The use of templates might make the job far more flexible.
3.2 Page Segmentors
A page segmentor should build a set of pages and cross-reference them, but generally avoids the complexities of the internal structure. Ideally it uses templates to construct the header and footer. A --splitpart option to eLyXer achieves this result. As a parameter it accepts the depth at which to split pages:
$ elyxer.py --splitpart 1
yields one chapter per page (or one section per page for article classes).
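How a split depth translates into pages can be sketched like this. The function splitpages() and the representation of the document as (depth, text) pairs are assumptions for the example, with depth 1 for the top-level unit (a chapter here), 2 for the next level and None for ordinary content.

```python
def splitpages(parts, splitdepth):
    "Start a new page at every heading at or above the split depth."
    pages = []
    for depth, text in parts:
        startspage = depth is not None and depth <= splitdepth
        if startspage or not pages:
            pages.append([])
        pages[-1].append(text)
    return pages


parts = [
    (1, 'Chapter 1'),
    (None, 'Some text'),
    (2, 'Section 1.1'),
    (1, 'Chapter 2'),
    (2, 'Section 2.1'),
]
bychapter = splitpages(parts, 1)
bysection = splitpages(parts, 2)
print(len(bychapter), len(bysection))
```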
The complete package should implement something like the flow in figure 3. This is the high-level design, but some details have to be filled in.
3.3 MathML
As suggested by Günter Milde and Abdelrazak Younes [4, 5], MathML is by now well supported in Firefox. An option to emit MathML (instead of more-or-less clumsy HTML and CSS code) could be very useful.
3.4 Roadmap
Nearing the end of Q4 2009, LyX integration is almost ready for Linux (Debian) and Windows. The full pipeline (document segmentation and in-memory processing) is also nearing completion. By the end of 2009 both tasks should be finished, marking the release of eLyXer 1.0.
From that point on, full LyX document support is the goal for 2010. By Q1 2010 equation numbering should be correct for all environments; arrays and cases should improve in appearance. Q2 2010 should see clean conversion of the Math guide and the --mathml option. By Q3 2010 all other guides should convert without glitches, and for Q4 2010 all rough edges should be smoothed. All this within the usual constraints: day job, family, etc.
4 Discarded Bits
Not everything that has been proposed or could be done with eLyXer is planned; some extensions have been discarded. However, this basically means that the author is too ignorant to know how to do them right; help (and patches!) towards a sane implementation would be welcome if they fit with the design.
4.1 Spellchecking
LyX can use a spellchecker to verify the words used. However it is not interactive so you may forget to run it before generating a version. It is possible to integrate eLyXer with a spellchecker and verify the spelling before generating the HTML, but it is not clear that it can be done cleanly.
4.2 URL Checking
Another fun possibility is to make eLyXer check all the external URLs embedded in the document. However the Python facilities for URL checking are not very mature, at least with Python 2.5: some of them do not return errors, others throw complex exceptions that have to be parsed… It is far easier to just create the HTML page and use wget (or a similar tool) to recursively check all links in the page.
4.3 Use of lyx2lyx Framework
Abdelrazak Younes suggests using the lyx2lyx framework, which after all already knows about LyX formats [5]. It is an interesting suggestion, but one that for now does not fit well with the design of eLyXer: a standalone tool to convert between two formats, or as Kernighan and Plauger put it, a standalone filter [6]. Long-term maintenance might turn out a bit heavier with this approach though, especially if LyX changes to a different file format in the future.
5 FAQ
Q: I don’t like how your tool outputs my document, what can I do?
A: First make sure that you are using the proper CSS file, i.e. copy the existing docs/lyx.css file to your web page directory. Next try to customize the CSS file to your liking; it is a flexible approach that requires no code changes. Then try changing the code (and submitting the patch back).
Q: Why does your Python code suck so much? You don’t make proper use of most features!
A: Because I’m mostly a novice with little Python culture. If you want to help it suck less, please send mail and enlighten me.
Q: How is the code maintained?
A: It is kept in a git repository. Patches in git format are welcome (but keep in mind that my knowledge of git is even shallower than my Python skills).
Q: I found a bug, what should I do?
Nomenclature
class: A self-contained piece of code that hosts attributes (values) and methods (functions).
filter: A type of program that reads from a file and writes to another file, keeping in memory only what is needed short term.
Bibliography
[1] Free Software Foundation, Inc.: eLyXer summary. https://savannah.nongnu.org/projects/elyxer/
[2] S White: “Math in HTML with CSS”, accessed March 2009. http://www.zipcon.net/~swhite/docs/math/math.html
[3] R S Stallman et al: “GNU GENERAL PUBLIC LICENSE” version 3, 20070629. http://www.gnu.org/copyleft/gpl.html
[4] G Milde: “Re: eLyXer: LyX to HTML converter”, message to list lyx-devel, 20090309. http://www.mail-archive.com/lyx-devel@lists.lyx.org/msg148627.html
[5] A Younes: “Re: eLyXer: LyX to HTML converter”, message to list lyx-devel, 20090309. http://www.mail-archive.com/lyx-devel@lists.lyx.org/msg148634.html
[6] B W Kernighan, P J Plauger: “Software Tools”, ed. Addison-Wesley Professional 1976, p. 35.
[7] Various authors: “lyx-devel mailing list”, accessed November 2009. http://www.mail-archive.com/lyx-devel@lists.lyx.org/
[8] S Chacon: “Git — Download”, accessed November 2009. http://git-scm.com/download
[9] Python community: “Download Python”, accessed November 2009. http://www.python.org/download/
Copyright (C) 2010 Alex Fernández (elyxer@gmail.com)