PpLexer Tutorial¶
The PpLexer module represents the user-side view of pre-processing. This tutorial shows you how to get going.
Setting Up¶
Files to Pre-Process¶
First let’s get some demonstration code to pre-process. You can find this at cpip/demo/ and the directory structure looks like this:
\---demo/
| cpip.py
|
\---proj/
+---src/
| main.cpp
|
+---sys/
| system.h
|
\---usr/
user.h
In proj/ is some source code that includes files from usr/ and sys/.
This tutorial will take you through writing cpip.py to use PpLexer to
pre-process them.
First let's have a look at the source code that we are pre-processing. It is a pretty trivial variation on a common theme but, beware, pre-processing directives abound!
The file demo/proj/src/main.cpp looks like this:
#include "user.h"

int main(char **argv, int argc)
{
#if defined(LANG_SUPPORT) && defined(FRENCH)
    printf("Bonjour tout le monde\n");
#elif defined(LANG_SUPPORT) && defined(AUSTRALIAN)
    printf("Wotcha\n");
#else
    printf("Hello world\n");
#endif
    return 1;
}
That includes a file user.h that can be found at demo/proj/usr/user.h:
#ifndef __USER_H__
#define __USER_H__
#include <system.h>
#define FRENCH
#endif // __USER_H__
In turn that includes a file system.h that can be found at demo/proj/sys/system.h:
#ifndef __SYSTEM_H__
#define __SYSTEM_H__
#define LANG_SUPPORT
#endif // __SYSTEM_H__
Clearly, since the system mandates language support and the user specifies French as their language of choice, you would not expect this to write out "Hello world". Or would you?
Well, you are in the hands of the pre-processor, and that is what CPIP knows all about. First we need to create a PpLexer.
Creating a PpLexer¶
This is the template that we will use for the tutorial; it just takes a single argument from the command line, sys.argv[1]:
import sys

def main():
    print('Processing:', sys.argv[1])
    # Your code here

if __name__ == "__main__":
    main()
Of course this doesn’t do much yet, invoking it just gives:
python cpip.py proj/src/main.cpp
Processing: proj/src/main.cpp
We now need to import and create a PpLexer.PpLexer object, and this takes at least two arguments: firstly the file to pre-process, and secondly an include handler. The latter is needed because the C/C++ standards do not specify how an #include directive is to be processed; that is an implementation issue. So we need to provide a defined implementation of something that can find #include'd files.
CPIP provides several such implementations in the module IncludeHandler, and the one that does what, I guess, most developers expect from a pre-processor is IncludeHandler.CppIncludeStdOs. This class takes at least two arguments: a list of search paths to the user include directories and a list of search paths to the system include directories. With this we can construct a PpLexer object, so our code now looks like this:
import sys
from cpip.core import PpLexer, IncludeHandler

def main():
    print('Processing:', sys.argv[1])
    myH = IncludeHandler.CppIncludeStdOs(
        theUsrDirs=['proj/usr',],
        theSysDirs=['proj/sys',],
    )
    myLex = PpLexer.PpLexer(sys.argv[1], myH)

if __name__ == "__main__":
    main()
This still doesn’t do much yet, invoking it just gives:
python cpip.py proj/src/main.cpp
Processing: proj/src/main.cpp
But, in the absence of errors, it shows that we can construct a PpLexer.
Put the PpLexer to Work¶
To get the PpLexer to do something, we need to call PpLexer.ppTokens(). This method is a generator of preprocessing tokens. Let's just print them out with this code:
import sys
from cpip.core import PpLexer, IncludeHandler

def main():
    print('Processing:', sys.argv[1])
    myH = IncludeHandler.CppIncludeStdOs(
        theUsrDirs=['proj/usr',],
        theSysDirs=['proj/sys',],
    )
    myLex = PpLexer.PpLexer(sys.argv[1], myH)
    for tok in myLex.ppTokens():
        print(tok)

if __name__ == "__main__":
    main()
Invoking it now gives:
$ python cpip.py proj/src/main.cpp
Processing: proj/src/main.cpp
PpToken(t="\n", tt=whitespace, line=False, prev=False, ?=False)
...
PpToken(t="int", tt=identifier, line=True, prev=False, ?=False)
PpToken(t=" ", tt=whitespace, line=False, prev=False, ?=False)
PpToken(t="main", tt=identifier, line=True, prev=False, ?=False)
PpToken(t="(", tt=preprocessing-op-or-punc, line=False, prev=False, ?=False)
PpToken(t="char", tt=identifier, line=True, prev=False, ?=False)
PpToken(t=" ", tt=whitespace, line=False, prev=False, ?=False)
PpToken(t="*", tt=preprocessing-op-or-punc, line=False, prev=False, ?=False)
PpToken(t="*", tt=preprocessing-op-or-punc, line=False, prev=False, ?=False)
PpToken(t="argv", tt=identifier, line=True, prev=False, ?=False)
PpToken(t=",", tt=preprocessing-op-or-punc, line=False, prev=False, ?=False)
PpToken(t=" ", tt=whitespace, line=False, prev=False, ?=False)
PpToken(t="int", tt=identifier, line=True, prev=False, ?=False)
PpToken(t=" ", tt=whitespace, line=False, prev=False, ?=False)
PpToken(t="argc", tt=identifier, line=True, prev=False, ?=False)
PpToken(t=")", tt=preprocessing-op-or-punc, line=False, prev=False, ?=False)
PpToken(t="\n", tt=whitespace, line=False, prev=False, ?=False)
PpToken(t="{", tt=preprocessing-op-or-punc, line=False, prev=False, ?=False)
PpToken(t="\n", tt=whitespace, line=False, prev=False, ?=False)
PpToken(t="\n", tt=whitespace, line=False, prev=False, ?=False)
PpToken(t="printf", tt=identifier, line=True, prev=False, ?=False)
PpToken(t="(", tt=preprocessing-op-or-punc, line=False, prev=False, ?=False)
PpToken(t=""Bonjour tout le monde\n"", tt=string-literal, line=False, prev=False, ?=False)
PpToken(t=")", tt=preprocessing-op-or-punc, line=False, prev=False, ?=False)
PpToken(t=";", tt=preprocessing-op-or-punc, line=False, prev=False, ?=False)
PpToken(t="\n", tt=whitespace, line=False, prev=False, ?=False)
PpToken(t="\n", tt=whitespace, line=False, prev=False, ?=False)
PpToken(t="return", tt=identifier, line=True, prev=False, ?=False)
PpToken(t=" ", tt=whitespace, line=False, prev=False, ?=False)
PpToken(t="1", tt=pp-number, line=False, prev=False, ?=False)
PpToken(t=";", tt=preprocessing-op-or-punc, line=False, prev=False, ?=False)
PpToken(t="\n", tt=whitespace, line=False, prev=False, ?=False)
PpToken(t="}", tt=preprocessing-op-or-punc, line=False, prev=False, ?=False)
PpToken(t="\n", tt=whitespace, line=False, prev=False, ?=False)
The PpLexer is yielding PpToken objects that are interesting in themselves because they carry not only the content but also the type of content (whitespace, punctuation, literals etc.). A simplification is to print only the token value by changing a line in the code from:
print(tok)
To:
print(tok.t)
To give:
Processing: proj/src/main.cpp
int main ( char * * argv , int argc )
{
printf ( "Bonjour tout le monde\n" ) ;
return 1 ;
}
It is definitely pre-processed, and although the output is correct it is rather verbose because of all the whitespace generated by pre-processing (newlines are always the consequence of pre-processing directives).
We can clean this whitespace up very simply by invoking PpLexer.ppTokens() with a suitable argument to reduce spurious whitespace thus: myLex.ppTokens(minWs=True). This minimises whitespace runs to a single space or newline. Our code now looks like this:
import sys
from cpip.core import PpLexer, IncludeHandler

def main():
    print('Processing:', sys.argv[1])
    myH = IncludeHandler.CppIncludeStdOs(
        theUsrDirs=['proj/usr',],
        theSysDirs=['proj/sys',],
    )
    myLex = PpLexer.PpLexer(sys.argv[1], myH)
    for tok in myLex.ppTokens(minWs=True):
        print(tok.t, end=' ')

if __name__ == "__main__":
    main()
Invoking it now gives:
Processing: proj/src/main.cpp
int main ( char * * argv , int argc )
{
printf ( "Bonjour tout le monde\n" ) ;
return 1 ;
}
This is exactly the result that one would expect from pre-processing the original source code.
And Now for Something Completely Different¶
So far, so boring: any pre-processor can do the same. PpLexer, however, can do far more than this. PpLexer keeps track of a large amount of significant pre-processing information, and that is available to you through the PpLexer APIs.
For a moment let's remove the minWs=True from myLex.ppTokens() so that we can inspect the state of the PpLexer at every token (rather than skipping whitespace tokens that might represent pre-processing directives).
File Include Stack¶
Changing the code to this shows the include file hierarchy at every step of the way:

for tok in myLex.ppTokens():
    print(myLex.fileStack)
Gives the following output:
$ python cpip.py proj/src/main.cpp
Processing: proj/src/main.cpp
['proj/src/main.cpp', 'proj/usr/user.h']
['proj/src/main.cpp', 'proj/usr/user.h']
['proj/src/main.cpp', 'proj/usr/user.h', 'proj/sys/system.h']
['proj/src/main.cpp', 'proj/usr/user.h', 'proj/sys/system.h']
['proj/src/main.cpp', 'proj/usr/user.h', 'proj/sys/system.h']
['proj/src/main.cpp', 'proj/usr/user.h', 'proj/sys/system.h']
['proj/src/main.cpp', 'proj/usr/user.h']
['proj/src/main.cpp', 'proj/usr/user.h']
['proj/src/main.cpp', 'proj/usr/user.h']
['proj/src/main.cpp']
...
Conditional State¶
Changing the code to this:

for tok in myLex.ppTokens(condLevel=1):
    print(myLex.condState)
Produces this output:
Processing: proj/src/main.cpp
(True, '')
...
(True, '')
(True, 'defined(LANG_SUPPORT) && defined(FRENCH)')
(True, 'defined(LANG_SUPPORT) && defined(FRENCH)')
(True, 'defined(LANG_SUPPORT) && defined(FRENCH)')
(True, 'defined(LANG_SUPPORT) && defined(FRENCH)')
(True, 'defined(LANG_SUPPORT) && defined(FRENCH)')
(True, 'defined(LANG_SUPPORT) && defined(FRENCH)')
(False, '(!(defined(LANG_SUPPORT) && defined(FRENCH)) && defined(LANG_SUPPORT) && defined(AUSTRALIAN))')
(False, '(!(defined(LANG_SUPPORT) && defined(FRENCH)) && defined(LANG_SUPPORT) && defined(AUSTRALIAN))')
(False, '(!(defined(LANG_SUPPORT) && defined(FRENCH)) && defined(LANG_SUPPORT) && defined(AUSTRALIAN))')
(False, '(!(defined(LANG_SUPPORT) && defined(FRENCH)) && defined(LANG_SUPPORT) && defined(AUSTRALIAN))')
(False, '(!(defined(LANG_SUPPORT) && defined(FRENCH)) && defined(LANG_SUPPORT) && defined(AUSTRALIAN))')
(False, '(!(defined(LANG_SUPPORT) && defined(FRENCH)) && defined(LANG_SUPPORT) && defined(AUSTRALIAN))')
(False, '(!(defined(LANG_SUPPORT) && defined(FRENCH)) && !(defined(LANG_SUPPORT) && defined(AUSTRALIAN)))')
(False, '(!(defined(LANG_SUPPORT) && defined(FRENCH)) && !(defined(LANG_SUPPORT) && defined(AUSTRALIAN)))')
(False, '(!(defined(LANG_SUPPORT) && defined(FRENCH)) && !(defined(LANG_SUPPORT) && defined(AUSTRALIAN)))')
(False, '(!(defined(LANG_SUPPORT) && defined(FRENCH)) && !(defined(LANG_SUPPORT) && defined(AUSTRALIAN)))')
(False, '(!(defined(LANG_SUPPORT) && defined(FRENCH)) && !(defined(LANG_SUPPORT) && defined(AUSTRALIAN)))')
(False, '(!(defined(LANG_SUPPORT) && defined(FRENCH)) && !(defined(LANG_SUPPORT) && defined(AUSTRALIAN)))')
(True, '')
...
(True, '')
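The condition strings above are just boolean expressions over defined() tests. As a pure-Python sketch (no CPIP required) of why only the first branch is live, with the truth values that the headers establish:

```python
# The headers define LANG_SUPPORT (system.h) and FRENCH (user.h);
# AUSTRALIAN is never defined.
LANG_SUPPORT = True
FRENCH = True
AUSTRALIAN = False

# Mirrors the condState strings printed above.
if_branch = LANG_SUPPORT and FRENCH
elif_branch = (not if_branch) and LANG_SUPPORT and AUSTRALIAN
else_branch = (not if_branch) and not (LANG_SUPPORT and AUSTRALIAN)
print(if_branch, elif_branch, else_branch)  # True False False
```

Note how the #elif and #else states carry the negation of every earlier branch; that is exactly what the PpLexer reports.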
State of the PpLexer After Pre-processing¶
A more common use case is to query the PpLexer
after processing the file. The following code example will:
- Capture all tokens as a Translation Unit and write it out with minimal whitespace [lines 11-16].
- Print out a text representation of the file include graph [lines 18-21].
- Print out a text representation of the conditional compilation graph [lines 23-26].
- Print out a text representation of the macro environment as it exists at the end of processing the Translation Unit [lines 28-31].
- Print out a text representation of the macro history for all macros, whether referenced or not, as it exists at the end of processing the Translation Unit [lines 33-36].
Here is the code, named cpip_07.py:
 1  import sys
 2  from cpip.core import PpLexer, IncludeHandler
 3
 4  def main():
 5      print('Processing:', sys.argv[1])
 6      myH = IncludeHandler.CppIncludeStdOs(
 7          theUsrDirs=['proj/usr',],
 8          theSysDirs=['proj/sys',],
 9      )
10      myLex = PpLexer.PpLexer(sys.argv[1], myH)
11      tu = ''.join(tok.t for tok in myLex.ppTokens(minWs=True))
12      print()
13      print(' Translation Unit '.center(75, '='))
14      print(tu)
15      print(' Translation Unit END '.center(75, '='))
16
17      print()
18      print(' File Include Graph '.center(75, '='))
19      print(myLex.fileIncludeGraphRoot)
20      print(' File Include Graph END '.center(75, '='))
21
22      print()
23      print(' Conditional Compilation Graph '.center(75, '='))
24      print(myLex.condCompGraph)
25      print(' Conditional Compilation Graph END '.center(75, '='))
26
27      print()
28      print(' Macro Environment '.center(75, '='))
29      print(myLex.macroEnvironment)
30      print(' Macro Environment END '.center(75, '='))
31
32      print()
33      print(' Macro History '.center(75, '='))
34      print(myLex.macroEnvironment.macroHistory(incEnv=False, onlyRef=False))
35      print(' Macro History END '.center(75, '='))
36
37
38  if __name__ == "__main__":
39      main()
Invoking this code thus:
$ python3 cpip_07.py ../src/main.cpp
Gives this output:
Processing: ../src/main.cpp
============================= Translation Unit ============================
int main(char **argv, int argc)
{
printf("Bonjour tout le monde\n");
return 1;
}
=========================== Translation Unit END ==========================
============================ File Include Graph ===========================
../src/main.cpp [43, 21]: True "" ""
000002: #include ../usr/user.h
../usr/user.h [10, 6]: True "" "['"user.h"', 'CP=None', 'usr=../usr']"
000004: #include ../sys/system.h
../sys/system.h [10, 6]: True "!def __USER_H__" "['<system.h>', 'sys=../sys']"
========================== File Include Graph END =========================
====================== Conditional Compilation Graph ======================
#ifndef __USER_H__ /* True "../usr/user.h" 1 0 */
#ifndef __SYSTEM_H__ /* True "../sys/system.h" 1 4 */
#endif /* True "../sys/system.h" 6 13 */
#endif /* True "../usr/user.h" 7 20 */
#if defined(LANG_SUPPORT) && defined(FRENCH) /* True "../src/main.cpp" 5 69 */
#elif defined(LANG_SUPPORT) && defined(AUSTRALIAN) /* False "../src/main.cpp" 7 110 */
#else /* False "../src/main.cpp" 9 117 */
#endif /* False "../src/main.cpp" 11 124 */
==================== Conditional Compilation Graph END ====================
============================ Macro Environment ============================
#define FRENCH /* ../usr/user.h#5 Ref: 1 True */
#define LANG_SUPPORT /* ../sys/system.h#4 Ref: 2 True */
#define __SYSTEM_H__ /* ../sys/system.h#2 Ref: 0 True */
#define __USER_H__ /* ../usr/user.h#2 Ref: 0 True */
========================== Macro Environment END ==========================
============================== Macro History ==============================
Macro History (all macros):
In scope:
#define FRENCH /* ../usr/user.h#5 Ref: 1 True */
../src/main.cpp 5 38
#define LANG_SUPPORT /* ../sys/system.h#4 Ref: 2 True */
../src/main.cpp 5 13
../src/main.cpp 7 15
#define __SYSTEM_H__ /* ../sys/system.h#2 Ref: 0 True */
#define __USER_H__ /* ../usr/user.h#2 Ref: 0 True */
============================ Macro History END ============================
This is simple to the point of crude, as the PpLexer supplies a far richer data seam than just text. The File Include Graph interface is described in the FileIncludeGraph Tutorial.
Summary¶
There are several ways that you can inspect pre-processing with PpLexer:
- Supplying arguments to PpLexer.ppTokens() such as minWs or incCond.
- Accessing the state of each token as it is generated, such as tok.tt or tok.isCond.
- Accessing the state of the PpLexer, either as each token is generated or once all tokens have been generated, such as PpLexer.condState.
- Creating the PpLexer with user-specified behaviour. This is the subject of the next section.
Advanced PpLexer Construction¶
The PpLexer constructor allows you to change the behaviour of pre-processing in a number of ways. Effectively these are hooks into pre-processing that can:
- Vary how #include'd files are inserted into the Translation Unit.
- Pre-include header files.
- Change the behaviour of PpLexer in unusual circumstances (errors etc.).
- Handle #pragma statements; in this way various compilers can be imitated.
Include Handler¶
When an #include directive is encountered, a compliant implementation is required to search for, and insert into the Translation Unit, the content referenced by the payload of the #include directive.
The standard does not specify how this should be accomplished. In CPIP the how is achieved by an implementation of a cpip.core.IncludeHandler.
An Aside¶
It is entirely acceptable within the standard to have an #include system that does not rely on a file system at all. Perhaps it might rely on a database, like this:
#include "SQL:spam.eggs#1284"
An include handler could take that payload and recover the content from some database rather than the local file system.
Or, more prosaically, an include mechanism such as this:
#include "http://some.url.org/spam/eggs#1284"
That leads to a fairly obvious way of managing that #include
payload.
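A first step for any such handler is to take the payload apart. Here is a minimal pure-Python sketch; the function name splitPayload and the scheme/locator/key terminology are illustrative, not part of CPIP:

```python
def splitPayload(payload):
    """Split an include payload such as "SQL:spam.eggs#1284" into
    (scheme, locator, key) -- a hypothetical convention, not CPIP's."""
    scheme, _, rest = payload.partition(':')
    locator, _, key = rest.partition('#')
    return scheme, locator, key

print(splitPayload('SQL:spam.eggs#1284'))  # ('SQL', 'spam.eggs', '1284')
```

A database-backed handler could then use the scheme to choose a back-end and the locator/key to fetch the content.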
Implementation¶
If you want to create a new include mechanism then you should sub-class the base class cpip.core.IncludeHandler.CppIncludeStd [reference documentation: IncludeHandler].
Sub-classing this requires implementing the following methods:
def initialTu(self, theTuIdentifier):
Given a Translation Unit Identifier this should return a FilePathOrigin instance, or None, for the initial translation unit. As a precaution this should include code to check that the stack of current places is empty. For example:
if len(self._cpStack) != 0: raise ExceptionCppInclude('setTu() with CP stack: %s' % self._cpStack)
def _searchFile(self, theCharSeq, theSearchPath):
Given an HcharSeq/QcharSeq and a search path this should return a FilePathOrigin instance or None.
As examples there are a couple of reference implementations in cpip.core.IncludeHandler:
- cpip.core.IncludeHandler.CppIncludeStdOs: An implementation that behaves as most developers think the #include mechanism works.
- cpip.core.IncludeHandler.CppIncludeStringIO: An implementation that recovers content from a dictionary of in-memory files. This is used a lot within CPIP for unit testing.
Pre-includes¶
The PpLexer can be supplied with an ordered list of file-like objects that are pre-include files. These are processed, in order, before the Initial Translation Unit (ITU) is processed. Macro redefinition rules apply.
For example, CPIPMain.py can take a list of user-defined macros on the command line. It then creates a list with a single pre-include file thus:
import io
from cpip.core import PpLexer
# defines is a list thus:
# ['spam(x)=x+4', 'eggs',]
myStr = '\n'.join(['#define '+' '.join(d.split('=')) for d in defines])+'\n'
myPreIncFiles = [io.StringIO(myStr), ]
# Create other constructor information here...
myLexer = PpLexer.PpLexer(
anItu, # File to pre-process
myIncH, # Include handler
preIncFiles=myPreIncFiles,
)
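To see what that pre-include file actually contains, here is the string construction from the code above run in isolation:

```python
defines = ['spam(x)=x+4', 'eggs']
# Each 'name=value' becomes '#define name value'; a bare name
# becomes '#define name' (a macro defined with an empty replacement).
myStr = '\n'.join(['#define ' + ' '.join(d.split('=')) for d in defines]) + '\n'
print(myStr)
# #define spam(x) x+4
# #define eggs
```

The resulting io.StringIO object is then pre-processed like any other file, so the macros are in scope before the first line of the ITU.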
Diagnostic¶
You can pass a diagnostic object in to the PpLexer; this controls how the lexer responds to various conditions such as warnings, errors etc. The default is for the lexer to create a CppDiagnostic.PreprocessDiagnosticStd.
If you want to create your own then sub-class the cpip.core.CppDiagnostic.PreprocessDiagnosticStd class.
Sub-classing PreprocessDiagnosticStd allows you to override any of the following that might be called by the PpLexer:
- def undefined(self, msg, theLoc=None): Reports when an 'undefined' event happens.
- def partialTokenStream(self, msg, theLoc=None): Reports when a partial token stream exists (e.g. an unclosed comment).
- def implementationDefined(self, msg, theLoc=None): Reports when an 'implementation defined' event happens.
- def error(self, msg, theLoc=None): Reports when an error event happens.
- def warning(self, msg, theLoc=None): Reports when a warning event happens.
- def handleUnclosedComment(self, msg, theLoc=None): Reports when an unclosed comment is seen at EOF.
- def unspecified(self, msg, theLoc=None): Reports when unspecified behaviour is happening, for example the order of evaluation of '#' and '##'.
- def debug(self, msg, theLoc=None): Reports a debug message.
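The overriding pattern is plain Python sub-classing. Here is a sketch using a stand-in base class so it runs on its own; the real base class is cpip.core.CppDiagnostic.PreprocessDiagnosticStd, and CollectingDiagnostic is a hypothetical name:

```python
class _DiagnosticStandIn:
    """Stand-in for PreprocessDiagnosticStd, for illustration only."""
    def warning(self, msg, theLoc=None):
        pass

class CollectingDiagnostic(_DiagnosticStandIn):
    """Collects warnings for later inspection rather than
    reporting them immediately."""
    def __init__(self):
        self.warnings = []

    def warning(self, msg, theLoc=None):
        self.warnings.append((msg, theLoc))

d = CollectingDiagnostic()
d.warning('Unclosed comment', theLoc=(3, 1))
print(d.warnings)  # [('Unclosed comment', (3, 1))]
```

The same shape applies to any of the methods listed above: override, record or act, and optionally delegate to the base class.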
There are a couple of implementations in the CppDiagnostic module that may be of interest:
- cpip.core.CppDiagnostic.PreprocessDiagnosticKeepGoing: A sub-class that does not raise exceptions.
- cpip.core.CppDiagnostic.PreprocessDiagnosticRaiseOnError: A sub-class that raises an exception on a #error directive.
Pragma¶
You can pass in a specialised handler for #pragma
statements [default: None]. This shall sub-class cpip.core.PragmaHandler.PragmaHandlerABC
and can implement:
- A boolean attribute replaceTokens. If True then the tokens following the #pragma statement will be macro replaced by the PpLexer, using the current macro environment, before being passed to this pragma handler.
- A method def pragma(self, theTokS): that takes a non-zero length list of PpTokens, the last of which will be a newline token. Any token this method returns will be yielded as part of the Translation Unit (and thus subject to macro replacement, for example).
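As a sketch of the shape of such a handler, here is a minimal one that silently swallows every #pragma. It uses a stand-in base class so it runs on its own; the real base class is cpip.core.PragmaHandler.PragmaHandlerABC, and PragmaDropHandler is a hypothetical name:

```python
class _PragmaHandlerStandIn:
    """Stand-in for PragmaHandlerABC, for illustration only."""
    replaceTokens = False
    def pragma(self, theTokS):
        raise NotImplementedError

class PragmaDropHandler(_PragmaHandlerStandIn):
    """Ignores every #pragma statement."""
    replaceTokens = False  # no macro replacement of the pragma tokens

    def pragma(self, theTokS):
        # Contribute nothing to the Translation Unit.
        return ''

handler = PragmaDropHandler()
print(handler.pragma(['once', '\n']))  # prints an empty line
```

A handler imitating a particular compiler would instead inspect theTokS and return replacement text for the pragmas it understands.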
Have a look at the core module cpip.core.PragmaHandler
for some example implementations.