PpTokeniser

Performs translation phases 0, 1, 2, 3 on C/C++ source code.

Translation phases from ISO/IEC 9899:1999 (E):

5.1.1.2 Translation phases, paragraph 1: The precedence among the syntax rules of translation is specified by the following phases.

Phase 1. Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.

Phase 2. Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. A source file that is not empty shall end in a new-line character, which shall not be immediately preceded by a backslash character before any such splicing takes place.

Phase 3. The source file is decomposed into preprocessing tokens and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment. Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of white-space characters other than new-line is retained or replaced by one space character is implementation-defined.
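
As a compressed illustration of what these three phases do (behaviour as described above; the in-memory representation shown is illustrative, not CPIP's own):

physical = ['int a??(2??); \\\n', '/* size */\n']
# Phase 1, trigraphs replaced:  ['int a[2]; \\\n', '/* size */\n']
# Phase 2, lines spliced:       'int a[2]; /* size */\n'
# Phase 3, tokenised:           'int', ' ', 'a', '[', '2', ']', ';', ' ', ' ', '\n'
#                               (the comment becomes a single space token and
#                               whitespace sequences are not merged)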

TODO: Do phases 0, 1, 2 as generators, i.e. not in memory?

TODO: Check coverage with a complete but minimal example of every token

TODO: remove self._cppTokType and have it as a return value?

TODO: Remove commented out code.

TODO: Performance of phase 1 processing.

TODO: rename next() as genPpTokens()?

TODO: Performance: rewrite the slice functions to take an integer argument of where in the array to start inspecting for a slice. This avoids calls to ...[x:], e.g. myCharS = myCharS[sliceIdx:] in genLexPptokenAndSeqWs.

cpip.core.PpTokeniser.COMMENT_REPLACEMENT = ' '

Comments are replaced by a single space

cpip.core.PpTokeniser.C_KEYWORDS = ('auto', 'break', 'case', 'char', 'const', 'continue', 'default', 'do', 'double', 'else', 'enum', 'extern', 'float', 'for', 'goto', 'if', 'inline', 'int', 'long', 'register', 'restrict', 'return', 'short', 'signed', 'sizeof', 'static', 'struct', 'switch', 'typedef', 'union', 'unsigned', 'void', 'volatile', 'while', '_Bool', '_Complex', '_Imaginary')

ISO/IEC 9899:1999 (E) 6.4.1 Keywords

cpip.core.PpTokeniser.DIGRAPH_TABLE = {'<:': '[', 'and': '&&', 'not_eq': '!=', '%:%:': '##', '%>': '}', ':>': ']', 'or_eq': '|=', 'bitor': '|', 'not': '!', 'xor_eq': '^=', '%:': '#', 'or': '||', 'bitand': '&', 'xor': '^', 'compl': '~', '<%': '{', 'and_eq': '&='}

Map of digraph (and alternative token) spellings to their primary spellings.

exception cpip.core.PpTokeniser.ExceptionCpipTokeniser

Simple specialisation of an exception class for the preprocessor.

exception cpip.core.PpTokeniser.ExceptionCpipTokeniserUcnConstraint

Specialisation for when universal character name exceeds constraints.

cpip.core.PpTokeniser.LEN_SOURCE_CHARACTER_SET = 96

Size of the source character set: 91 graphic characters plus space, horizontal tab, vertical tab, form feed and new-line.

class cpip.core.PpTokeniser.PpTokeniser(theFileObj=None, theFileId=None, theDiagnostic=None)

Imitates a Preprocessor that conforms to ISO/IEC 14882:1998(E).

Takes an optional file-like object. If theFileObj has a 'name' attribute then that will be used as the file name, otherwise theFileId will be used.
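
A minimal construction sketch, assuming a file-like object without a 'name' attribute so that theFileId supplies the file name:

import io
from cpip.core.PpTokeniser import PpTokeniser

# io.StringIO lacks a 'name' attribute so 'example.h' becomes the file name
myTokeniser = PpTokeniser(theFileObj=io.StringIO('#define ONE 1\n'),
                          theFileId='example.h')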

Implementation note on all _slice...() and __slice...() functions: a _slice...() function takes a buffer-like object and an integer offset as arguments. The buffer-like object is accessed by index so it need only implement __getitem__(). On overrun, or any other out-of-bounds index, the resulting IndexError must be caught by the _slice...() function. In other words len() should not be called on the buffer-like object; if it is (i.e. __len__() is invoked) a TypeError will be raised and propagated out of this class to the caller.

StrTree, for example, conforms to these requirements.

The function is expected to return an integer representing the number of objects that can be consumed from the buffer-like object. If the return value is non-zero the PpTokeniser is side-effected in that self._cppTokType is set to a non-None value. Before doing so a check is made: if self._cppTokType is already non-None an assertion error is raised.

The buffer-like object must not be side-effected by a _slice...() function, regardless of the return value.

So a _slice...() function pattern is:

def _slice...(self, theBuf, theOfs):
    i = theOfs
    try:
        # Only access theBuf with [i] so that __getitem__() is called
        ...theBuf[i]...
        # Success as the absence of an IndexError!
        # So return the length of objects that pass
        # First test and set for type of slice found
        if i > theOfs:
            assert(self._cppTokType is None), '_cppTokType was %s now %s' % (self._cppTokType, ...)
            self._cppTokType = ...
        # NOTE: Return size of slice not the index of the end of the slice
        return i - theOfs
    except IndexError:
        pass
    # Here either return 0 on IndexError or i-theOfs
    return ...

NOTE: Functions starting with __slice... do not trap the IndexError, the caller must do that.
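
For instance, a hypothetical whitespace slicer following this pattern (the function name and the 'whitespace' type string are illustrative, not CPIP's actual code):

def _sliceWhitespace(self, theBuf, theOfs):
    """Returns the count of consecutive whitespace characters in theBuf
    starting at theOfs. Sets self._cppTokType on a non-zero result."""
    i = theOfs
    try:
        # Only index theBuf so that only __getitem__() is required
        while theBuf[i] in ' \t\v\f\n':
            i += 1
    except IndexError:
        pass  # Ran off the end of the buffer, i marks the slice end
    # Test and set the type of slice found (here after the try block,
    # which covers both the IndexError and the normal exit)
    if i > theOfs:
        assert self._cppTokType is None, \
            '_cppTokType was %s now whitespace' % self._cppTokType
        self._cppTokType = 'whitespace'
    # Return the size of the slice, not the index of its end
    return i - theOfs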

TODO: ISO/IEC 14882:1998(E) Escape sequences Table 5?

cppTokType

Returns the type of the last preprocessing-token found by _sliceLexPptoken().

fileLineCol

Returns an instance of FileLineCol for the current physical line and column.

fileLocator

Returns the FileLocation object.

fileName

Returns the ID of the file.

filterHeaderNames(theToks)

Returns a list of 'header-name' tokens from the supplied stream. May raise ExceptionCpipTokeniser if the stream is un-parsable or if theToks contains anything other than whitespace and header-name tokens.

genLexPptokenAndSeqWs(theCharS)

Generates a sequence of PpToken objects. Either:

  • a sequence of whitespace (comments are replaced with a single whitespace).
  • a pre-processing token.

This performs translation phase 3.

NOTE: Whitespace sequences are not merged so ' /**/ ' will generate three tokens, each a PpToken.PpToken(' ', 'whitespace'), i.e. leading whitespace, comment replaced by a single space, trailing whitespace.

So this yields the tokens from translation phase 3 if supplied with the results of translation phase 2.

NOTE: This does not generate ‘header-name’ tokens as these are context dependent i.e. they are only valid in the context of a #include directive.

ISO/IEC 9899:1999 (E) 6.4.7 Header names Para 3 says that: “A header name preprocessing token is recognised only within a #include preprocessing directive.”.

initLexPhase12()

Processes phases one and two and returns the result as a string.
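
A hedged usage sketch combining initLexPhase12() and genLexPptokenAndSeqWs(), with io.StringIO standing in for a real source file:

import io
from cpip.core.PpTokeniser import PpTokeniser

myTokeniser = PpTokeniser(theFileObj=io.StringIO('int x; /* c */\n'),
                          theFileId='example.c')
# Phases 1 and 2 produce a string, phase 3 tokenises it
for aTok in myTokeniser.genLexPptokenAndSeqWs(myTokeniser.initLexPhase12()):
    print(aTok)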

lexPhases_0()

A non-standard phase that just reads the file and returns its contents as a list of lines (including EOL characters). May raise an ExceptionCpipTokeniser if self was created with None or the file is unreadable.

lexPhases_1(theLineS)

ISO/IEC 14882:1998(E) 2.1 Phases of translation [lex.phases], phase one. Takes a list of lines (including EOL characters), replaces trigraphs and returns the new list of lines.

lexPhases_2(theLineS)

ISO/IEC 14882:1998(E) 2.1 Phases of translation [lex.phases], phase two. This joins physical lines to form logical lines. NOTE: This side-effects the supplied lines and returns None.
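
A minimal sketch of the splicing rule itself, independent of CPIP's in-place implementation (which, as noted, mutates the supplied list and returns None):

def spliceLines(theLineS):
    """Joins each physical line ending in backslash-newline to the
    following line, returning the resulting logical lines."""
    result = []
    buf = ''
    for line in theLineS:
        if line.endswith('\\\n'):
            buf += line[:-2]  # Drop the backslash and the newline
        else:
            result.append(buf + line)
            buf = ''
    return result

assert spliceLines(['#define F(x) \\\n', '    (x + 1)\n']) \
    == ['#define F(x)     (x + 1)\n']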

next()

The token generator. On being called this performs translation phases 1, 2 and 3 (unless already done) and then generates pairs of (preprocessing token, token type). The token type is an enumerated integer from LEX_PPTOKEN_TYPES. Preprocessing tokens include sequences of whitespace characters, and these are not necessarily concatenated, i.e. this generator can produce more than one whitespace token in sequence. TODO: Rename this to ppTokens() or similar.
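
Taking the documented pair shape at face value, a hedged usage sketch:

import io
from cpip.core.PpTokeniser import PpTokeniser

myTokeniser = PpTokeniser(theFileObj=io.StringIO('a+b\n'), theFileId='t.c')
# Each yielded item is documented above as a
# (preprocessing token, token type) pair
for aTok, aType in myTokeniser.next():
    print(aType, repr(aTok))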

pLineCol

Returns the current physical (line, column) as integers.

reduceToksToHeaderName(theToks)

This takes a list of PpTokens and returns a list of PpTokens that might have a header-name token type in them. May raise an ExceptionCpipTokeniser if the tokens are not all consumed. This is used at the lexer level for re-interpreting PpTokens in the context of a #include directive.
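
For illustration, the phase 3 tokens for '<stdio.h>' might be re-interpreted like this (a sketch; the token type strings follow the ISO names used elsewhere in this documentation and are assumptions here):

import io
from cpip.core import PpToken
from cpip.core.PpTokeniser import PpTokeniser

myTokeniser = PpTokeniser(theFileObj=io.StringIO(''), theFileId='dummy.c')
toks = [
    PpToken.PpToken('<',     'preprocessing-op-or-punc'),
    PpToken.PpToken('stdio', 'identifier'),
    PpToken.PpToken('.',     'preprocessing-op-or-punc'),
    PpToken.PpToken('h',     'identifier'),
    PpToken.PpToken('>',     'preprocessing-op-or-punc'),
]
# In the context of a #include directive these may reduce to a single
# 'header-name' token for '<stdio.h>'
headerToks = myTokeniser.reduceToksToHeaderName(toks)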

resetTokType()

Erases the memory of the previously seen token type.

substAltToken(tok)

If a PpToken is a digraph this alters its value to its alternative. If not, the supplied token is returned unchanged. There are no side effects on self.
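
The substitution amounts to a lookup in DIGRAPH_TABLE; an illustrative stand-alone sketch (substAlt is hypothetical, the real method operates on PpToken objects):

DIGRAPH_TABLE = {
    '<:': '[', ':>': ']', '<%': '{', '%>': '}', '%:': '#', '%:%:': '##',
    'and': '&&', 'bitand': '&', 'bitor': '|', 'or': '||', 'xor': '^',
    'compl': '~', 'not': '!', 'not_eq': '!=', 'and_eq': '&=',
    'or_eq': '|=', 'xor_eq': '^=',
}

def substAlt(value):
    """Returns the primary spelling for a digraph or alternative token,
    otherwise the value unchanged."""
    return DIGRAPH_TABLE.get(value, value)

assert substAlt('<%') == '{'
assert substAlt('main') == 'main'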

cpip.core.PpTokeniser.TRIGRAPH_PREFIX = '?'

Note: this prefix appears doubled, i.e. every trigraph starts with '??'.

cpip.core.PpTokeniser.TRIGRAPH_SIZE = 3

Well, it is a trigraph.

cpip.core.PpTokeniser.TRIGRAPH_TABLE = {"'": '^', '<': '{', '=': '#', '/': '\\', ')': ']', '(': '[', '-': '~', '!': '|', '>': '}'}

Map of trigraph replacements, keyed by the character that follows the '??' prefix.
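
As an illustration of how these three constants combine during phase 1, a minimal sketch (replaceTrigraphs is hypothetical, not CPIP's implementation):

TRIGRAPH_PREFIX = '?'
TRIGRAPH_SIZE = 3
TRIGRAPH_TABLE = {
    "'": '^', '<': '{', '=': '#', '/': '\\', ')': ']',
    '(': '[', '-': '~', '!': '|', '>': '}',
}

def replaceTrigraphs(line):
    """Replaces each '??X' trigraph with its single-character equivalent."""
    out = []
    i = 0
    while i < len(line):
        if (line[i] == TRIGRAPH_PREFIX
                and line[i+1:i+2] == TRIGRAPH_PREFIX
                and line[i+2:i+3] in TRIGRAPH_TABLE):
            out.append(TRIGRAPH_TABLE[line[i+2]])
            i += TRIGRAPH_SIZE
        else:
            out.append(line[i])
            i += 1
    return ''.join(out)

assert replaceTrigraphs('??=define A(x) x??(0??)') == '#define A(x) x[0]'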