McUtils.Parsers
Utilities for writing parsers of structured text.
An entirely standalone package which is used extensively by GaussianInterface.
Three main threads are handled:
- A FileStreamer interface which allows for efficient searching for blocks of text in large files with no pattern matching
- A Regex interface that provides declarative tools for building and manipulating a regular expression as a python tree
- A StringParser/StructuredTypeArray interface that takes the Regex tools and allows for automatic construction of complicated NumPy-backed arrays from the parsed data. Generally works well but the problem is complicated and there are no doubt many unhandled edge cases. This is used extensively with the FileStreamer interface to provide efficient parsing of data from Gaussian .log files by using a streamer to match chunks and a parser to extract data from the matched chunks.
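The first thread, finding a delimited block in a large file without any pattern matching, can be sketched with the standard library alone. This is only an illustration of the idea: find_block and the BEGIN/END tags are hypothetical names, and nothing here is taken from the actual FileStreamer implementation.

```python
import mmap
import os
import tempfile

def find_block(path, start_tag, end_tag, encoding="utf-8"):
    """Return the text between the first start_tag and the next end_tag, or None.

    Hypothetical helper for illustration; it is not part of McUtils.
    """
    start_bytes = start_tag.encode(encoding)
    end_bytes = end_tag.encode(encoding)
    with open(path, "rb") as f:
        # Memory-map the file so the search does not read it all into memory
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = mm.find(start_bytes)
            if start == -1:
                return None
            start += len(start_bytes)
            end = mm.find(end_bytes, start)
            if end == -1:
                return None
            return mm[start:end].decode(encoding)

# Tiny stand-in for a large log file (written as bytes to keep newlines exact)
with tempfile.NamedTemporaryFile("wb", suffix=".log", delete=False) as tmp:
    tmp.write(b"header\nBEGIN\n1.0 2.0 3.0\nEND\nfooter\n")
    tmp_path = tmp.name

block = find_block(tmp_path, "BEGIN\n", "\nEND")
print(repr(block))
os.remove(tmp_path)
```

Plain substring search on literal tags is much cheaper than running a regular expression over the whole file, which is why the matched chunk is only handed to a parser afterwards.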
Examples
RegexPattern
A RegexPattern is a higher-level interface to work with the regular expression (regex) string pattern matching language.
Python provides support for regular expressions through the re module.
Being comfortable with regex is not a requirement for working with RegexPattern but will help explain some of the more confusing design decisions.
There are a bunch of different RegexPattern instances that cover different cases, e.g.
- Word: matches a string of characters that are generally considered text
- PositiveInteger: matches a string of characters that are only digits
- Integer: a PositiveInteger with an optional sign
- Number: matches Integer.PositiveInteger
- VariableName: matches a string of digits or text as long as the first character is a letter
- Optional: represents an optional pattern to match
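For readers who think in raw regex, the descriptions above can be approximated with plain re strings. These are assumed equivalents written for illustration; only \w+ for Word appears in this page's printed output, and the strings McUtils actually generates may differ.

```python
import re

# Assumed approximations; the strings McUtils generates may differ
WORD = r"\w+"
POSITIVE_INTEGER = r"\d+"
INTEGER = r"[+-]?" + POSITIVE_INTEGER
NUMBER = INTEGER + r"\." + POSITIVE_INTEGER
VARIABLE_NAME = r"[A-Za-z]\w*"

def optional(pattern):
    """Wrap a pattern in a non-capturing group that may be absent."""
    return "(?:" + pattern + ")?"

number_ok = bool(re.fullmatch(NUMBER, "-12.5"))
name_ok = bool(re.fullmatch(VARIABLE_NAME, "x1"))
name_bad = bool(re.fullmatch(VARIABLE_NAME, "1x"))
optional_ok = bool(re.fullmatch(optional(INTEGER) + WORD, "abc"))
print(number_ok, name_ok, name_bad, optional_ok)
```

Composing patterns by string concatenation like this gets unwieldy quickly, which is the motivation for building them as a tree of RegexPattern objects instead.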
Capturing/Named
When matching pieces of text it is also important to specify which pieces of text we would like to actually get back out.
For this there are two main RegexPattern instances.
The simplest one is Capturing.
This just specifies that we would like to capture a piece of text.
There is a slightly more sophisticated instance called Named which allows us to attach a name to a group.
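from McUtils.Parsers import RegexPattern, Named, Word
# Match "key=value" pairs, naming each side so it can be retrieved by name
key_value_matcher = RegexPattern([Named(Word, "key"), "=", Named(Word, "value")])
print(key_value_matcher)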
(?P<key>\w+)(?:=)(?P<value>\w+)
This can be used directly to pull info out of files:
import os
import McUtils

# Locate the sample Gaussian log file shipped with the McUtils test data
test_data = os.path.join(os.path.dirname(McUtils.__file__), 'ci', 'tests', 'TestData')
with open(os.path.join(test_data, 'water_OH_scan.log')) as log_dat:
    sample_data = log_dat.read()
# Collect every key=value match and show the first five as dicts
matches = list(key_value_matcher.finditer(sample_data))
for match in matches[:5]:
    print(match.groupdict())
{'key': '0', 'value': 'g09'}
{'key': 'Input', 'value': 'water_OH_scan'}
{'key': 'Output', 'value': 'water_OH_scan'}
{'key': 'Chk', 'value': 'water_OH_scan'}
{'key': 'NProc', 'value': '8'}
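The difference between capturing and naming a group can also be seen with the standard re module alone. The sketch below compiles the pattern string printed by key_value_matcher next to an unnamed variant; the sample text is a hypothetical stand-in for a Gaussian log header, not the contents of water_OH_scan.log.

```python
import re

# Pattern string exactly as printed by print(key_value_matcher)
named = re.compile(r"(?P<key>\w+)(?:=)(?P<value>\w+)")
# The same pattern with plain (unnamed) capturing groups, for comparison
unnamed = re.compile(r"(\w+)(?:=)(\w+)")

# Hypothetical text standing in for the header of a Gaussian log file
sample = "%NProc=8\n%Chk=water_OH_scan\n"

# Named groups come back as dicts, unnamed groups as positional tuples
named_pairs = [m.groupdict() for m in named.finditer(sample)]
unnamed_pairs = [m.groups() for m in unnamed.finditer(sample)]
print(named_pairs)
print(unnamed_pairs)
```

Named groups keep the meaning of each captured piece attached to the data, which matters once the matches are turned into labeled arrays.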
StringParser
A more powerful interface than RegexPattern is available through a StringParser instance.
This provides a wrapper on RegexPattern that handles the process of turning matches into NumPy arrays of the appropriate type.
The actual interface is quite simple, e.g. we can take our matcher from before and use it directly:
from McUtils.Parsers import StringParser
key_vals = StringParser(key_value_matcher).parse_all(sample_data)
print(key_vals)
StructuredTypeArray(shape=[(11493, 0), (11493, 0)], dtype=OrderedDict([('key', StructuredType(<class 'str'>, shape=(None,))), ('value', StructuredType(<class 'str'>, shape=(None,)))]))
This StructuredTypeArray is basically a version of NumPy record arrays, but was written without knowing about them.
A smarter reimplementation of this portion of the parsing process would make use of recarray instead of this custom array type.
That said, getting the raw ndarray objects out is straightforward:
key_vals['key'].array
array(['0', 'Input', 'Output', ..., 'State', 'RMSD', 'PG'], dtype='<U7')
NOTE: 90% of all bugs in the StringParser ecosystem will come from the design of StructuredTypeArray.
The need to be efficient in data handling can lead to some difficult implementation details.
As the data type has organically evolved it has become potentially tough to understand.
A reimplementation based on recarray would potentially solve some issues.
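To make the recarray comparison concrete, here is a minimal sketch of how key/value data of this kind could be held in a NumPy record array. The pairs and the field width U16 are hypothetical choices for illustration, and this is not how StructuredTypeArray is implemented.

```python
import numpy as np

# Hypothetical parsed pairs, shaped like the groupdict() output earlier
pairs = [("Input", "water_OH_scan"), ("NProc", "8"), ("PG", "C01")]

# Structured dtype with one fixed-width unicode field per named group
dtype = np.dtype([("key", "U16"), ("value", "U16")])
records = np.array(pairs, dtype=dtype).view(np.recarray)

keys = records.key        # field access by attribute
values = records["value"]  # field access by name, like key_vals['key']
print(keys)
print(values)
```

One trade-off visible even here is that a record array needs its field shapes and widths fixed up front, whereas the parser only learns them as matches arrive.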
Block Handlers
For efficiency's sake, StringParser objects also provide a block_handlers argument (and handlers can be defined on RegexPattern objects directly).
A handler is a function that can be applied to a parsed piece of text and should directly return a NumPy array so that it can be worked into the returned StructuredTypeArray.
The simplest handlers are already provided for convenience on StringParser, e.g. from GaussianLogComponents.py:
Named(
    Repeating(
        Capturing(Number),
        min=3, max=3,
        prefix=Optional(Whitespace),
        joiner=Whitespace
    ),
    "Coordinates", handler=StringParser.array_handler(dtype=float)
)
Here StringParser.array_handler(dtype=float) provides efficient parsing of data through np.loadtxt with a float as the target dtype.
We also see the prefix and joiner options to RegexPattern in action.
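As a rough picture of what such a handler does, the sketch below feeds a matched block of text straight to np.loadtxt instead of converting each captured number separately. The function array_from_text and the sample coordinates are hypothetical; the real StringParser.array_handler may differ in its signature and behavior.

```python
import io
import numpy as np

def array_from_text(block, dtype=float):
    """Parse whitespace-separated numbers into a 2D array with np.loadtxt.

    Hypothetical stand-in, not the actual StringParser.array_handler.
    """
    return np.loadtxt(io.StringIO(block), dtype=dtype, ndmin=2)

# Hypothetical matched text: three numbers per line, as min=3, max=3 implies
block = " 0.000  0.000  0.117\n 0.000  0.757 -0.471\n"
coords = array_from_text(block)
print(coords.shape)
```

Handing the whole block to one vectorized call avoids a Python-level loop over every captured group, which is where the efficiency gain comes from.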