PpTokeniser

Performs translation phases 0, 1, 2, 3 on C/C++ source code.

Translation phases from ISO/IEC 9899:1999 (E):

5.1.1.2 Translation phases. 5.1.1.2-1: The precedence among the syntax rules of translation is specified by the following phases.

Phase 1. Physical source file multibyte characters are mapped, in an implementation defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.

Phase 2. Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. A source file that is not empty shall end in a new-line character, which shall not be immediately preceded by a backslash character before any such splicing takes place.

Phase 3. The source file is decomposed into preprocessing tokens6) and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment. Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of white-space characters other than new-line is retained or replaced by one space character is implementation-defined.
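
As an illustration of the standard's wording only (this is not the module's implementation), the following sketch shows the effect of phases 1 and 2 on a tiny input: a trigraph is replaced and a backslash/new-line splice joins two physical lines into one logical line:

# Illustrative only: the effect of translation phases 1 and 2 on a tiny input,
# expressed with plain string operations rather than PpTokeniser methods.
physical = '??=define MAX(a, b) \\\n    ((a) > (b) ? (a) : (b))\n'

# Phase 1: trigraph replacement, '??=' becomes '#'.
phase1 = physical.replace('??=', '#')

# Phase 2: delete each backslash immediately followed by a new-line,
# splicing two physical lines into one logical line.
phase2 = phase1.replace('\\\n', '')

print(phase2)  # #define MAX(a, b)     ((a) > (b) ? (a) : (b))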

TODO: Do phases 0,1,2 as generators i.e. not in memory?

TODO: Check coverage with a complete but minimal example of every token

TODO: remove self._cppTokType and have it as a return value?

TODO: Remove commented out code.

TODO: Performance of phase 1 processing.

TODO: rename next() as genPpTokens()?

TODO: Perf rewrite slice functions to take an integer argument of where in the array to start inspecting for a slice. This avoids calls to ...[x:] e.g. myCharS = myCharS[sliceIdx:] in genLexPptokenAndSeqWs.

cpip.core.PpTokeniser.CHAR_SET_MAP = {'lex.bool': {'set': {'true', 'false'}}, 'lex.ccon': {'c-char': {'/', 'M', 'P', 'Q', 'k', ')', '!', 'f', 'o', '}', '\x0c', 'S', 'F', 'U', 'C', '0', 'X', 'T', '\t', '*', '\x0b', '3', 's', 'b', 'u', 'B', '2', 'd', '|', 'G', '_', '~', 'W', 'R', '9', ']', 'Y', 'p', 'N', '{', ';', ',', 'H', '@', '#', '`', '6', '?', 'A', '(', '7', 'r', 'y', 'v', 'e', '%', '"', '4', 'q', '-', 'I', '5', '<', 'z', 'g', 'a', '^', 'l', 'c', '.', 'J', 'K', 'V', 'w', 'j', '=', ':', 'm', 't', '1', '8', '&', 'i', '+', '$', ' ', '[', 'D', 'h', 'L', 'x', 'O', 'E', 'Z', '>', 'n'}, 'simple-escape-sequence': {'\\', '?', 't', "'", 'r', 'a', 'f', 'v', '"', 'n', 'b'}, 'c-con_omit': {'\n', '\\', "'"}}, 'lex.key': {'keywords': {'mutable', 'static', 'int', 'false', 'enum', 'signed', 'static_cast', 'dynamic_cast', 'do', 'throw', 'typedef', 'catch', 'try', 'typeid', 'asm', 'protected', 'typename', 'const_cast', 'volatile', 'true', 'else', 'sizeof', 'friend', 'inline', 'operator', 'continue', 'public', 'namespace', 'short', 'goto', 'for', 'export', 'this', 'case', 'return', 'unsigned', 'char', 'bool', 'break', 'union', 'new', 'extern', 'float', 'while', 'default', 'register', 'reinterpret_cast', 'struct', 'auto', 'void', 'private', 'using', 'template', 'delete', 'wchar_t', 'long', 'explicit', 'if', 'double', 'const', 'virtual', 'switch', 'class'}}, 'lex.charset': {'ucn ordinals': {64, 36, 96}, 'whitespace': {'\t', '\x0c', '\n', '\x0b', ' '}, 'source character set': {'/', 'M', 'P', "'", 'Q', 'f', '}', 'S', 'F', 'C', 'X', 'T', '\t', '\x0b', 's', 'b', '2', 'd', '|', '_', 'W', '\\', '9', 'p', 'H', '?', '7', 'v', 'e', '%', '"', '5', '<', 'z', 'g', '^', '\n', 'c', '.', 'J', 'K', 'V', 'w', '1', '8', '[', 'x', 'O', 'E', 'n', 'k', ')', '!', 'o', '\x0c', 'U', '0', '*', '3', 'u', 'B', 'G', '~', 'R', ']', 'Y', 'N', '{', ';', ',', '#', '6', 'A', '(', 'r', 'y', '4', 'q', '-', 'I', 'a', 'l', 'j', '=', ':', 'm', 't', '&', 'i', '+', ' ', 'D', 'h', 'L', 'Z', '>'}}, 'lex.header': {'h-char': {'/', 'M', 'P', "'", 'Q', 'k', ')', '!', 'f', 'o', '}', '\x0c', 'S', 'F', 'U', 'C', '0', 'X', 'T', '\t', '*', '\x0b', '3', 's', 'b', 'u', 'B', '2', 'd', '|', 'G', '_', '~', 'W', '\\', 'R', '9', ']', 'Y', 'p', 'N', '{', ';', ',', 'H', '#', '6', '?', 'A', '(', '7', 'r', 'y', 'v', 'e', '%', '"', '4', 'q', '-', 'I', '5', '<', 'z', 'g', 'a', '^', 'l', 'c', '.', 'J', 'K', 'V', 'w', 'j', '=', ':', 'm', 't', '1', '8', '&', 'i', '+', ' ', '[', 'D', 'h', 'L', 'x', 'O', 'E', 'Z', 'n'}, 'h-char_omit': {'>', '\n'}, 'q-char': {'/', 'M', 'P', "'", 'Q', 'k', ')', '!', 'f', 'o', '}', '\x0c', 'S', 'F', 'U', 'C', '0', 'X', 'T', '\t', '*', '\x0b', '3', 's', 'b', 'u', 'B', '2', 'd', '|', 'G', '_', '~', 'W', '\\', 'R', '9', ']', 'Y', 'p', 'N', '{', ';', ',', 'H', '#', '6', '?', 'A', '(', '7', 'r', 'y', 'v', 'e', '%', '4', 'q', '-', 'I', '5', '<', 'z', 'g', 'a', '^', 'l', 'c', '.', 'J', 'K', 'V', 'w', 'j', '=', ':', 'm', 't', '1', '8', '&', 'i', '+', ' ', '[', 'D', 'h', 'L', 'x', 'O', 'E', 'Z', '>', 'n'}, 'undefined_q_words': {'\\', '//', '/*', "'"}, 'q-char_omit': {'\n', '"'}, 'undefined_h_words': {'\\', '//', '"', '/*', "'"}}, 'lex.ppnumber': {'hexadecimal-digit': {'6', 'A', '7', 'f', 'e', 'F', '4', 'C', '0', '5', '3', 'a', 'c', 'b', '1', '8', '2', 'B', 'd', '9', 'D', 'E'}, 'octal-digit': {'6', '0', '1', '5', '2', '3', '7', '4'}, 'nonzero-digit': {'6', '9', '1', '5', '2', '8', '3', '7', '4'}, 'digit': {'6', '0', '9', '1', '5', '2', '8', '3', '7', '4'}}, 'lex.string': {'s-char': {'/', 'M', 'P', "'", 'Q', 'k', ')', '!', 'f', 'o', '}', '\x0c', 'S', 'F', 
'U', 'C', '0', 'X', 'T', '\t', '*', '\x0b', '3', 's', 'b', 'u', 'B', '2', 'd', '|', 'G', '_', '~', 'W', 'R', '9', ']', 'Y', 'p', 'N', '{', ';', ',', 'H', '@', '#', '`', '6', '?', 'A', '(', '7', 'r', 'y', 'v', 'e', '%', '4', 'q', '-', 'I', '5', '<', 'z', 'g', 'a', '^', 'l', 'c', '.', 'J', 'K', 'V', 'w', 'j', '=', ':', 'm', 't', '1', '8', '&', 'i', '+', '$', ' ', '[', 'D', 'h', 'L', 'x', 'O', 'E', 'Z', '>', 'n'}, 's-char_omit': {'\n', '\\', '"'}}, 'cpp': {'new-line': '\n', 'lparen': '('}, 'lex.fcon': {'sign': {'-', '+'}, 'floating-suffix': {'l', 'f', 'F', 'L'}, 'exponent_prefix': {'e', 'E'}}, 'lex.op': {'operators': {'/=', '/', '?', '*=', '::', '(', '%=', '%>', '<%', '...', 'bitand', ')', '!', '<=', '<<=', 'and_eq', '}', ':>', '%', 'or', '!=', '-', '--', '##', '*', '<:', '%:', '<', '>>=', 'not_eq', '^', '&=', '>>', 'not', '||', '.', '=', ':', '->*', 'or_eq', '+=', '++', '&', '^=', 'and', 'new', '+', 'compl', 'xor', '|', '==', '<<', '|=', 'bitor', '->', '~', 'delete', ']', '%:%:', '[', '.*', '-=', '{', ';', '&&', '>', '>=', ',', 'xor_eq', '#'}}, 'lex.icon': {'unsigned-suffix': {'u', 'U'}, 'long-suffix': {'l', 'L'}}, 'lex.name': {'part_non_digit': {'A', 'M', 'P', 'r', 'k', 'y', 'Q', 'f', 'o', 'v', 'e', 'S', 'F', 'U', 'q', 'C', 'I', 'X', 'T', 'z', 'a', 'g', 'l', 's', 'c', 'j', 'b', 'J', 'w', 'K', 'm', 'V', 't', 'u', 'B', 'i', 'H', 'd', '$', 'G', '_', 'W', 'R', 'D', 'h', 'L', 'Y', 'p', 'x', 'O', 'N', 'E', 'Z', '@', 'n', '`'}}}

Preprocess character sets:

The 'source character set' should be 96 characters, i.e. 91 characters plus 5 whitespace characters. See the assertions below that check the length, if not the content. Note: before jumping to conclusions about how slow this might be, go and look at TestPpTokeniserIsInCharSet. NOTE: whitespace is now handled by the PpWhitespace class and this entry is dynamically added to CHAR_SET_MAP on import: 'whitespace' : set('\t\v\f\n ')
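
A minimal check of that length, assuming the CHAR_SET_MAP keys shown in the dump above and that the dynamic 'whitespace' entry has already been added on import:

from cpip.core import PpTokeniser

# The documented source character set size: 91 characters plus 5 whitespace.
charset = PpTokeniser.CHAR_SET_MAP['lex.charset']['source character set']
assert len(charset) == PpTokeniser.LEN_SOURCE_CHARACTER_SET == 96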

Set of ordinal characters not treated as Universal Character Names, i.e. treated literally. NOTE: ISO/IEC 9899:1999 (E) 6.4.3-2: “A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (back tick), nor one in the range D800 through DFFF inclusive.” (footnote 61)

['lex.header']['undefined_h_words'] are from:
ISO/IEC 9899:1999 (E) 6.4.7 Header names Para 3 i.e. 6.4.7-3

'lex.key': From ISO/IEC 14882:1998(E) 2.11 Keywords [lex.key]. Note that these are of no particular interest to the pre-processor as they do not occur in phases 1 to 4. For example ‘new’ is an operator during preprocessing but is re-interpreted after phase 4 (probably in phase 7) as a keyword.

'lex.op': From ISO/IEC 14882:1998(E) 2.12 Operators and punctuators [lex.operators]. These contain digraphs and “alternative tokens”, e.g. ‘and’, so that ‘and’ will be seen as an operator rather than an identifier. Similarly ‘new’ appears as an operator but is re-interpreted after phase 4 (probably in phase 7) as a keyword.

cpip.core.PpTokeniser.CHAR_SET_STR_TREE_MAP = {'lex.bool': {'set': <cpip.util.StrTree.StrTree object>}, 'lex.key': {'keywords': <cpip.util.StrTree.StrTree object>}, 'lex.op': {'operators': <cpip.util.StrTree.StrTree object>}}

Create StrTree objects for fast look up for words

cpip.core.PpTokeniser.COMMENT_REPLACEMENT = ' '

Comments are replaced by a single space

cpip.core.PpTokeniser.COMMENT_TYPES = ('C comment', 'C++ comment')

All comments

cpip.core.PpTokeniser.COMMENT_TYPE_C = 'C comment'

C comment

cpip.core.PpTokeniser.COMMENT_TYPE_CXX = 'C++ comment'

C++ comment

cpip.core.PpTokeniser.C_KEYWORDS = ('auto', 'break', 'case', 'char', 'const', 'continue', 'default', 'do', 'double', 'else', 'enum', 'extern', 'float', 'for', 'goto', 'if', 'inline', 'int', 'long', 'register', 'restrict', 'return', 'short', 'signed', 'sizeof', 'static', 'struct', 'switch', 'typedef', 'union', 'unsigned', 'void', 'volatile', 'while', '_Bool', '_Complex', '_Imaginary')

‘C’ keywords ISO/IEC 9899:1999 (E) 6.4.1 Keywords

cpip.core.PpTokeniser.DIGRAPH_TABLE = {'or_eq': '|=', 'xor_eq': '^=', 'or': '||', 'and': '&&', '%>': '}', '<%': '{', '<:': '[', '%:': '#', 'bitand': '&', 'compl': '~', 'xor': '^', 'not_eq': '!=', 'and_eq': '&=', '%:%:': '##', 'not': '!', 'bitor': '|', ':>': ']'}

Map of Digraph alternates
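
A small illustration of reading the map (a lookup only; substAltToken(), documented below, is the method that applies it to a PpToken):

from cpip.core.PpTokeniser import DIGRAPH_TABLE

# Alternative-token or digraph spelling -> primary spelling.
print(DIGRAPH_TABLE['and'])   # &&
print(DIGRAPH_TABLE['<%'])    # {
print(DIGRAPH_TABLE['%:%:'])  # ##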

exception cpip.core.PpTokeniser.ExceptionCpipTokeniser

Simple specialisation of an exception class for the preprocessor.

exception cpip.core.PpTokeniser.ExceptionCpipTokeniserUcnConstraint

Specialisation for when universal character name exceeds constraints.

cpip.core.PpTokeniser.LEN_SOURCE_CHARACTER_SET = 96

Size of the source code character set

class cpip.core.PpTokeniser.PpTokeniser(theFileObj=None, theFileId=None, theDiagnostic=None)

Imitates a Preprocessor that conforms to ISO/IEC 14882:1998(E).

Takes an optional file-like object. If theFileObj has a ‘name’ attribute then that will be used as the file name, otherwise theFileId will be used as the file name.
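
A minimal usage sketch, assuming an io.StringIO object is an acceptable file-like object and that next() (documented below) can be iterated as a generator:

import io

from cpip.core.PpTokeniser import PpTokeniser

# Tokenise a small fragment: next() runs translation phases 1, 2 and 3 and
# yields PpToken objects. StringIO has no 'name' attribute so theFileId is
# used as the file name.
source = io.StringIO('#define SPAM 1 /* a comment */\n')
tokeniser = PpTokeniser(theFileObj=source, theFileId='example.h')
for token in tokeniser.next():
    print(token)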

Implementation note: on all _slice...() and __slice...() functions: a _slice...() function takes a buffer-like object and an integer offset as arguments. The buffer-like object is accessed only by index so it need only implement __getitem__(). On overrun or any other out-of-bounds index an IndexError must be caught by the _slice...() function, i.e. len() should not be called on the buffer-like object; or rather, if len() (i.e. __len__()) is called a TypeError will be raised and propagated out of this class to the caller.

StrTree, for example, conforms to these requirements.

The function is expected to return an integer that represents the number of objects that can be consumed from the buffer-like object. If the return value is non-zero the PpTokeniser is side-effected in that self._cppTokType is set to a non-None value. Before doing that a test is made and if self._cppTokType is already non-None then an assertion error is raised.

The buffer-like object should not be side-effected by the _slice...() function regardless of the return value.

So a _slice...() function pattern is:

def _slice...(self, theBuf, theOfs):
    i = theOfs
    try:
        # Only access theBuf with [i] so that __getitem__() is called
        ...theBuf[i]...
        # Success as the absence of an IndexError!
        # So return the length of objects that pass
        # First test and set for type of slice found
        if i > theOfs:
            assert(self._cppTokType is None), '_cppTokType was %s now %s' % (self._cppTokType, ...)
            self._cppTokType = ...
        # NOTE: Return size of slice not the index of the end of the slice
        return i - theOfs
    except IndexError:
        pass
    # Here either return 0 on IndexError or i-theOfs
    return ...

NOTE: Functions starting with __slice... do not trap the IndexError, the caller must do that.
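
For illustration, a hypothetical slice function following this pattern (it is not a method of the real class, and the token type value it uses is only a placeholder) might consume a run of decimal digits:

def _sliceDigits(self, theBuf, theOfs):
    # Hypothetical example of the _slice...() pattern above: consume a run of
    # decimal digits. theBuf is only accessed as theBuf[i] so __getitem__()
    # is sufficient and an IndexError marks the end of the buffer.
    i = theOfs
    try:
        while theBuf[i] in '0123456789':
            i += 1
    except IndexError:
        # Ran off the end of the buffer; whatever was consumed still counts.
        pass
    if i > theOfs:
        # 'pp-number' is a placeholder token type for this sketch.
        assert self._cppTokType is None, \
            '_cppTokType was %s now %s' % (self._cppTokType, 'pp-number')
        self._cppTokType = 'pp-number'
    # Return the size of the slice, not the index of the end of the slice.
    return i - theOfs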

TODO: ISO/IEC 14882:1998(E) Escape sequences Table 5?

_PpTokeniser__sliceCCharCharacter(theBuf, theOfs)

ISO/IEC 14882:1998(E) 2.13.2 Character literals [lex.ccon] - c-char character.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_PpTokeniser__sliceLexKey(theBuf, theOfs=0)

ISO/IEC 14882:1998(E) 2.11 Keywords [lex.key].

_PpTokeniser__sliceLexPpnumberDigit(theBuf, theOfs=0)

ISO/IEC 14882:1998(E) 2.9 Preprocessing numbers [lex.ppnumber] - digit.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_PpTokeniser__sliceLexPpnumberExpSign(theBuf, theOfs=0)

ISO/IEC 14882:1998(E) 2.9 Preprocessing numbers [lex.ppnumber] - exponent and sign. Returns 2 if the buffer at theOfs holds ‘e’ or ‘E’ followed by a sign.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_PpTokeniser__sliceLongSuffix(theBuf, theOfs)

ISO/IEC 14882:1998(E) 2.13.1 Integer literals [lex.icon] - long-suffix.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_PpTokeniser__sliceNondigit(theBuf, theOfs=0)

ISO/IEC 14882:1998(E) 2.10 Identifiers [lex.name] - nondigit:

nondigit: one of
universal-character-name
_ a b c d e f g h i j k l m
n o p q r s t u v w x y z
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z
Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_PpTokeniser__sliceSimpleEscapeSequence(theBuf, theOfs)

ISO/IEC 14882:1998(E) 2.13.2 Character literals [lex.ccon] - simple-escape-sequence.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_PpTokeniser__sliceUniversalCharacterName(theBuf, theOfs=0)

ISO/IEC 14882:1998(E) 2.2 Character sets [lex.charset] - universal-character-name.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_PpTokeniser__sliceUnsignedSuffix(theBuf, theOfs)

ISO/IEC 14882:1998(E) 2.13.1 Integer literals [lex.icon] - unsigned-suffix.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

__init__(theFileObj=None, theFileId=None, theDiagnostic=None)

Constructor. Takes an optional file-like object. If theFileObj has a ‘name’ attribute then that will be used as the file name, otherwise theFileId will be used as the file name.

Parameters:
Returns:

NoneType

__weakref__

list of weak references to the object (if defined)

_convertToLexCharset(theLineS)

Converts a list of lines in-place, expanding characters that are not in the lex.charset source character set to universal-character-names.

ISO/IEC 9899:1999 (E) 6.4.3

Note

ISO/IEC 9899:1999 (E) 6.4.3-2 “A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (back tick), nor one in the range D800 through DFFF inclusive.61).

Note

This side-effects the supplied lines and returns None.

Parameters:theLineS (list([]), list([str])) – The source code.
Returns:NoneType
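
The spelling of a universal-character-name is \uNNNN or \UNNNNNNNN. A standalone sketch of that encoding for a single character (illustrative only; this is not the _convertToLexCharset implementation):

def to_ucn(ch):
    # Illustrative only: spell one character as a universal-character-name,
    # \uNNNN for code points up to FFFF, \UNNNNNNNN above that.
    code = ord(ch)
    if code > 0xFFFF:
        return '\\U%08X' % code
    return '\\u%04X' % code

print(to_ucn(u'\u00e9'))  # \u00E9
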
_rewindFile()

Sets the file to position zero and resets the FileLocator.

Returns:NoneType
_sliceAccumulateOfs(theBuf, theOfs, theFn)

Repeats the function as many times as possible on theBuf from theOfs. An IndexError raised by the function will be caught and not propagated.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – Offset.
  • theFn (method) – Function.
Returns:

int – Length of matching characters found.

_sliceBoolLiteral(theBuf, theOfs)

ISO/IEC 14882:1998(E) 2.13.5 Boolean literals [lex.bool].

_sliceCChar(theBuf, theOfs)

ISO/IEC 14882:1998(E) 2.13.2 Character literals [lex.ccon] - c-char.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceCCharSequence(theBuf, theOfs)

ISO/IEC 14882:1998(E) 2.13.2 Character literals [lex.ccon] - c-char-sequence.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceCharacterLiteral(theBuf, theOfs)

ISO/IEC 14882:1998(E) 2.13.2 Character literals [lex.ccon].

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceDecimalLiteral(theBuf, theOfs=0)

ISO/IEC 14882:1998(E) 2.13.1 Integer literals [lex.icon] - decimal-literal.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceEscapeSequence(theBuf, theOfs)

Returns the length of the slice of theBuf at theOfs that matches the longest escape-sequence, or 0. ISO/IEC 14882:1998(E) 2.13.2 Character literals [lex.ccon] - escape-sequence.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceFloatingLiteral(theBuf, theOfs)

ISO/IEC 14882:1998(E) 2.13.3 Floating literals [lex.fcon]:

floating-literal:
   fractional-constant exponent-part opt floating-suffix opt
   digit-sequence exponent-part floating-suffix opt
Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceFloatingLiteralDigitSequence(theBuf, theOfs)

ISO/IEC 14882:1998(E) 2.13.3 Floating literals [lex.fcon] - digit-sequence.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceFloatingLiteralExponentPart(theBuf, theOfs)

ISO/IEC 14882:1998(E) 2.13.3 Floating literals [lex.fcon] - exponent-part.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceFloatingLiteralFloatingSuffix(theBuf, theOfs)

ISO/IEC 14882:1998(E) 2.13.3 Floating literals [lex.fcon] - floating-suffix.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceFloatingLiteralFractionalConstant(theBuf, theOfs)

ISO/IEC 14882:1998(E) 2.13.3 Floating literals [lex.fcon] - fractional-constant:

fractional-constant:
   digit-sequence opt . digit-sequence
   digit-sequence .

i.e. there are three possibilities: (a) . digit-sequence, (b) digit-sequence ., (c) digit-sequence . digit-sequence

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceFloatingLiteralSign(theBuf, theOfs)

ISO/IEC 14882:1998(E) 2.13.3 Floating literals [lex.fcon] - sign.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceHexQuad(theBuf, theOfs=0)

ISO/IEC 14882:1998(E) 2.2 Character sets [lex.charset] - hex-quad.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceHexadecimalEscapeSequence(theBuf, theOfs)

ISO/IEC 14882:1998(E) 2.13.2 Character literals [lex.ccon] - hexadecimal-escape-sequence:

hexadecimal-escape-sequence:
    \x hexadecimal-digit
    hexadecimal-escape-sequence hexadecimal-digit
Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceHexadecimalLiteral(theBuf, theOfs=0)

ISO/IEC 14882:1998(E) 2.13.1 Integer literals [lex.icon] - hexadecimal-literal.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceIntegerLiteral(theBuf, theOfs=0)

Returns the length of the slice of theBuf at theOfs that matches the longest integer literal, or 0. ISO/IEC 14882:1998(E) 2.13.1 Integer literals [lex.icon].

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceIntegerSuffix(theBuf, theOfs)

ISO/IEC 14882:1998(E) 2.13.1 Integer literals [lex.icon] - integer-suffix:

integer-suffix:
    unsigned-suffix long-suffix opt
    long-suffix unsigned-suffix opt
Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceLexComment(theBuf, theOfs=0)

ISO/IEC 14882:1998(E) 2.7 Comments [lex.comment].

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceLexHeader(theBuf, theOfs=0)

ISO/IEC 14882:1998(E) 2.8 Header names [lex.header]. Might raise an ExceptionCpipUndefinedLocal.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceLexHeaderHchar(theBuf, theOfs)

ISO/IEC 14882:1998(E) 2.8 Header names [lex.header] - h-char character.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceLexHeaderHcharSequence(theBuf, theOfs)

ISO/IEC 14882:1998(E) 2.8 Header names [lex.header] - h-char-sequence. Might raise an ExceptionCpipUndefinedLocal.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceLexHeaderQchar(theBuf, theOfs)

ISO/IEC 14882:1998(E) 2.8 Header names [lex.header] - q-char.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceLexHeaderQcharSequence(theBuf, theOfs)

ISO/IEC 14882:1998(E) 2.8 Header names [lex.header] - q-char-sequence. Might raise an ExceptionCpipUndefinedLocal.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceLexKey(theBuf, theOfs=0)

ISO/IEC 14882:1998(E) 2.11 Keywords [lex.key].

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceLexName(theBuf, theOfs)

ISO/IEC 14882:1998(E) 2.10 Identifiers [lex.name]:

identifier:
    nondigit
    identifier nondigit
    identifier digit
Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceLexOperators(theBuf, theOfs=0)

ISO/IEC 14882:1998(E) 2.12 Operators and punctuators [lex.operators]. i.e. preprocessing-op-or-punc

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceLexPpnumber(theBuf, theOfs=0)

ISO/IEC 14882:1998(E) 2.9 Preprocessing numbers [lex.ppnumber]:

pp-number:
    digit
    . digit
    pp-number digit
    pp-number nondigit
    pp-number e sign
    pp-number E sign
    pp-number .

TODO: Spec says “Preprocessing number tokens lexically include all integral literal tokens (2.13.1) and all floating literal tokens (2.13.3).” But the pp-number list does not specify that.

NOTE: ISO/IEC 9899:1999 Programming languages - C allows ‘p’ and ‘P’ suffixes.

NOTE: The standard appears to allow ‘.1.2.3.4.’

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.
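
A standalone sketch of the pp-number grammar above (not the class's method; universal-character-names are ignored and nondigit is approximated as underscore plus ASCII letters) shows why '.1.2.3.4.' is accepted:

import string

def pp_number_len(s, ofs=0):
    # Standalone sketch of the pp-number grammar quoted above.
    digits = string.digits
    nondigits = '_' + string.ascii_letters  # no universal-character-names
    i = ofs
    # A pp-number starts with a digit, or '.' followed by a digit.
    if i < len(s) and s[i] in digits:
        i += 1
    elif i + 1 < len(s) and s[i] == '.' and s[i + 1] in digits:
        i += 2
    else:
        return 0
    # Then repeatedly: digit | nondigit | 'e' sign | 'E' sign | '.'
    while i < len(s):
        c = s[i]
        if c in 'eE' and i + 1 < len(s) and s[i + 1] in '+-':
            i += 2
        elif c in digits or c in nondigits or c == '.':
            i += 1
        else:
            break
    return i - ofs

print(pp_number_len('.1.2.3.4.'))  # 9, i.e. the whole string matches
print(pp_number_len('1e+5f'))      # 5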

_sliceLexPptoken(theBuf, theOfs)

ISO/IEC 14882:1998(E) 2.4 Preprocessing tokens [lex.pptoken].

ISO/IEC 9899:1999 (E) 6.4 Lexical elements

NOTE: Does not identify header-name tokens. See NOTE on genLexPptokenAndSeqWs()

NOTE: _sliceLexPptokenGeneral is an exclusive search as ‘bitand’ can appear to be both an operator (correct) and an identifier (incorrect). The order of applying functions is therefore highly significant: _sliceLexPpnumber must come before _sliceLexOperators as the leading ‘.’ on a number can be seen as an operator; _sliceCharacterLiteral and _sliceStringLiteral must come before _sliceLexName as the leading ‘L’ on a char/string can be seen as a name.

self._sliceLexOperators has to be after self._sliceLexName as otherwise:

#define complex

gets seen as:

# -> operator
define -> identifier
compl -> operator because of alternative tokens
ex -> identifier
Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceLexPptokenGeneral(theBuf, theOfs, theFuncS)

Applies each function in theFuncS to theBuf at theOfs and returns the longest match.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – Integer offset.
  • theFuncS (tuple([method, method, method, method, method])) – Sequence of functions.
Returns:

int – Length of the longest match.

_sliceLexPptokenWithHeaderName(theBuf, theOfs)

ISO/IEC 14882:1998(E) 2.4 Preprocessing tokens [lex.pptoken].

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

Note

This does identify header-name tokens where possible.

_sliceLiteral(theBuf, theOfs=0)

Returns the length of the slice of theBuf at theOfs that matches the longest literal, or 0. ISO/IEC 14882:1998(E) 2.13 Literals [lex.literal].

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceLongestMatchOfs(theBuf, theOfs, theFnS, isExcl=False)

Returns the length of the longest slice of theBuf from theOfs using the functions theFnS, or 0 if nothing matches.

self._cppTokType is preserved as the one set by the function that gives the longest match. Functions that raise an IndexError will be ignored.

If isExcl is False (the default) then all functions are tested and the longest match is returned.

If isExcl is True then the first function that returns a non-zero value is used.

TODO (maybe): Have slice functions return (size, type) and get rid of self._changeOfTokenTypeIsOk and self._cppTokType

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – Offset.
  • theFnS (tuple([method, method, method, method, method]), tuple([method, method, method]), tuple([method, method])) – Functions.
  • isExcl (bool) – Exclusivity flag.
Returns:

int – The length of the longest match.

_sliceNonWhitespaceSingleChar(theBuf, theOfs=0)

Returns 1 if the first character is non-whitespace, 0 otherwise. ISO/IEC 9899:1999 (E) 6.4-3 and ISO/IEC 14882:1998(E) 2.4.2 state that if the character is ' or " the behaviour is undefined.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceOctalEscapeSequence(theBuf, theOfs)

ISO/IEC 14882:1998(E) 2.13.2 Character literals [lex.ccon] - octal-escape-sequence:

octal-escape-sequence:
    \ octal-digit
    \ octal-digit octal-digit
    \ octal-digit octal-digit octal-digit
Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceOctalLiteral(theBuf, theOfs=0)

ISO/IEC 14882:1998(E) 2.13.1 Integer literals [lex.icon] - octal-literal.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceSChar(theBuf, theOfs)

ISO/IEC 14882:1998(E) 2.13.4 String literals [lex.string] - s-char.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceSCharCharacter(theBuf, theOfs)

ISO/IEC 14882:1998(E) 2.13.4 String literals [lex.string] - s-char character.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceSCharSequence(theBuf, theOfs)

ISO/IEC 14882:1998(E) 2.13.4 String literals [lex.string] - s-char-sequence.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceStringLiteral(theBuf, theOfs)

ISO/IEC 14882:1998(E) 2.13.4 String literals [lex.string].

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_sliceWhitespace(theBuf, theOfs=0)

Returns whitespace length.

Parameters:
  • theBuf (cpip.util.BufGen.BufGen, str) – Buffer or string.
  • theOfs (int) – The starting offset.
Returns:

int – Length of matching characters found.

_translateTrigraphs(theLineS)

ISO/IEC 14882:1998(E) 2.3 Trigraph sequences [lex.trigraphs]

This replaces the trigraphs in-place and updates the FileLocator so that the physical lines and columns can be recovered.

Parameters:theLineS (list([]), list([str])) – Source code lines.
Returns:NoneType
_wordFoundInUpTo(theBuf, theLen, theWord)

Searches the first theLen characters of theBuf for a complete instance of theWord. Returns the index of the find or -1 if none found.

TODO: This looks wrong, buffer = ‘ abc abd’, word = ‘abd’ will return -1

Parameters:
  • theBuf (str) – Buffer or string.
  • theLen (int) – Buffer length.
  • theWord (str) – Word to find.
Returns:

int – The index of the find or -1 if none found.
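
A hedged sketch of a search that would find the later occurrence in the TODO's example (illustrative only and a plain substring search, so it ignores word boundaries; it is not the module's implementation):

def word_found_in_up_to(buf, length, word):
    # Illustrative sketch: index of the first occurrence of word within
    # buf[:length], or -1 if not found. For the TODO's example,
    # buf = ' abc abd', word = 'abd', this returns 5 rather than -1.
    return buf.find(word, 0, length)

print(word_found_in_up_to(' abc abd', 8, 'abd'))  # 5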

_wordsFoundInUpTo(theBuf, theLen, theWordS)

Searches the first theLen characters of theBuf for a complete instance of any word in theWordS. Returns the index of the find or -1 if none found.

Parameters:
  • theBuf (str) – Buffer or string.
  • theLen (int) – Buffer length.
  • theWordS (set([str])) – Set of words, any of these can be found.
Returns:

int – The index of the find or -1 if none found.

cppTokType

Returns the type of the last preprocessing-token found by _sliceLexPptoken().

fileLineCol

Returns an instance of cpip.core.FileLocation.FileLineCol for the current physical line and column.

Returns:cpip.core.FileLocation.FileLineCol([str, int, int]) – The FileLineCol object
fileLocator

Returns the FileLocation object.

fileName

Returns the ID of the file.

filterHeaderNames(theToks)

Returns a list of ‘header-name’ tokens from the supplied stream. May raise an ExceptionCpipTokeniser if the stream is un-parsable or theToks contains anything other than whitespace and header-name tokens.

Parameters:theToks (list([cpip.core.PpToken.PpToken])) – The tokens.
Returns:list([cpip.core.PpToken.PpToken]) – List of ‘header-name’ tokens.
genLexPptokenAndSeqWs(theCharS)

Generates a sequence of PpToken objects. Each is either:

  • a sequence of whitespace (comments are replaced with a single space), or
  • a pre-processing token.

This performs translation phase 3.

NOTE: Whitespace sequences are not merged, so '  /**/ ' will generate three tokens, each a PpToken.PpToken(' ', 'whitespace'), i.e. leading whitespace, the comment replaced by a single space, trailing whitespace.

So this yields the tokens from translation phase 3 if supplied with the results of translation phase 2.

NOTE: This does not generate ‘header-name’ tokens as these are context dependent i.e. they are only valid in the context of a #include directive.

ISO/IEC 9899:1999 (E) 6.4.7 Header names Para 3 says that: “A header name preprocessing token is recognised only within a #include preprocessing directive.”.

Parameters:theCharS (str) – The source code.
Returns:cpip.core.PpToken.PpToken – Sequence of tokens.
Raises:GeneratorExit, IndexError
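
A usage sketch of the whitespace note above, assuming a PpTokeniser constructed without a file can tokenise a supplied string directly; per that note this is expected to yield three separate whitespace tokens:

from cpip.core.PpTokeniser import PpTokeniser

# Leading whitespace, a comment (replaced by a single space) and trailing
# whitespace are not merged into one token.
tokeniser = PpTokeniser()
for token in tokeniser.genLexPptokenAndSeqWs('  /**/ '):
    print(token)
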
initLexPhase12()

Processes phases one and two and returns the result as a string.

Returns:str – The source code after translation phases one and two.
lexPhases_0()

A non-standard phase that just reads the file and returns its contents as a list of lines (including EOL characters).

May raise an ExceptionCpipTokeniser if self has been created with None or the file is unreadable.

Returns:list([]),list([str]) – List of source code lines.
lexPhases_1(theLineS)

ISO/IEC 14882:1998(E) 2.1 Phases of translation [lex.phases] - Phase one. Takes a list of lines (including EOL characters) and replaces trigraphs.

NOTE: This side-effects the supplied lines and returns None.

Parameters:theLineS (list([]), list([str])) – The source code.
Returns:NoneType
lexPhases_2(theLineS)

ISO/IEC 14882:1998(E) 2.1 Phases of translation [lex.phases] - Phase two

This joins physical to logical lines.

NOTE: This side-effects the supplied lines and returns None.

Parameters:theLineS (list([]), list([str])) – The source code.
Returns:NoneType
next()

The token generator. On being called this performs translation phases 1, 2 and 3 (unless already done) and then generates pairs of: (preprocessing token, token type)

Token type is an enumerated integer from LEX_PPTOKEN_TYPES.

Preprocessing tokens include sequences of whitespace characters and these are not necessarily concatenated, i.e. this generator can produce more than one whitespace token in sequence.

TODO: Rename this to ppTokens() or something.

Returns:cpip.core.PpToken.PpToken – Sequence of tokens.
Raises:GeneratorExit, StopIteration
pLineCol

Returns the current physical (line, column) as integers.

Returns:tuple([int, int]) – Physical position.
reduceToksToHeaderName(theToks)

This takes a list of PpTokens and returns a list of PpTokens that might have a header-name token type in them. May raise an ExceptionCpipTokeniser if tokens are not all consumed. This is used at lexer level for re-interpreting PpTokens in the context of a #include directive.

Parameters:theToks (list([cpip.core.PpToken.PpToken])) – The tokens.
Returns:list([cpip.core.PpToken.PpToken]) – List of ‘header-name’ tokens.
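
A hedged sketch of the re-interpretation flow described above (illustrative only: whether this exact token stream is fully consumed, and hence whether ExceptionCpipTokeniser is raised, is not verified here):

from cpip.core.PpTokeniser import PpTokeniser

# Phase-3 tokenisation never produces header-name tokens, so the tokens that
# follow a #include directive are re-interpreted in that context.
tokeniser = PpTokeniser()
tokens = list(tokeniser.genLexPptokenAndSeqWs('<stdio.h>\n'))
header_tokens = tokeniser.reduceToksToHeaderName(tokens)
for token in header_tokens:
    print(token)
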
resetTokType()

Erases the memory of the previously seen token type.

substAltToken(tok)

If a PpToken is a Digraph this alters its value to its alternative. If not the supplied token is returned unchanged.

There are no side effects on self.

cpip.core.PpTokeniser.TRIGRAPH_PREFIX = '?'

Note: this prefix character is doubled, i.e. a trigraph starts with '??'.

cpip.core.PpTokeniser.TRIGRAPH_SIZE = 3

Well it is a Trigraph

cpip.core.PpTokeniser.TRIGRAPH_TABLE = {')': ']', '/': '\\', '-': '~', "'": '^', '>': '}', '(': '[', '=': '#', '!': '|', '<': '{'}

Map of Trigraph alternates after the ?? prefix
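
A small illustration of reading the table (the key is the single character after the doubled '?' prefix):

from cpip.core.PpTokeniser import TRIGRAPH_PREFIX, TRIGRAPH_SIZE, TRIGRAPH_TABLE

# A trigraph is TRIGRAPH_SIZE characters: the doubled prefix plus a key from
# TRIGRAPH_TABLE, e.g. '??=' is replaced by '#'.
trigraph = TRIGRAPH_PREFIX * 2 + '='
assert len(trigraph) == TRIGRAPH_SIZE
print(TRIGRAPH_TABLE[trigraph[-1]])  # #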