McUtils.Parsers
Utilities for writing parsers of structured text.
An entirely standalone package which is used extensively by GaussianInterface.
Three main threads are handled:
- A FileStreamer interface which allows for efficient searching for blocks of text in large files with no pattern matching
- A Regex interface that provides declarative tools for building and manipulating a regular expression as a Python tree
- A StringParser/StructuredTypeArray interface that takes the Regex tools and allows for automatic construction of complicated NumPy-backed arrays from the parsed data. This generally works well, but the problem is complicated and there are no doubt many unhandled edge cases. It is used extensively with the FileStreamer interface to provide efficient parsing of data from Gaussian .log files, using a streamer to match chunks and a parser to extract data from the matched chunks.
Members
Examples
RegexPattern
A RegexPattern is a higher-level interface to work with the regular expression (regex) string pattern matching language.
Python provides support for regular expressions through the re module.
Being comfortable with regex is not a requirement for working with RegexPattern but will help explain some of the more confusing design decisions.
There are a bunch of different RegexPattern instances that cover different cases, e.g.
- Word: matches a string of characters that are generally considered text
- PositiveInteger: matches a string of characters that are only digits
- Integer: a PositiveInteger with an optional sign
- Number: matches Integer.PositiveInteger
- VariableName: matches a string of digits or text, as long as the first character is a letter
- Optional: represents an optional pattern to match
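As a quick illustration, these primitives compose into larger patterns and print out as ordinary regex strings. This is a minimal sketch; the import path is an assumption based on the package layout, and the composed pattern here is invented for illustration.
from McUtils.Parsers import RegexPattern, Number, Optional, Whitespace

# a row of three whitespace-separated numbers with optional leading whitespace
row_pattern = RegexPattern(
    (Number, Number, Number),
    prefix=Optional(Whitespace),
    joiner=Whitespace
)
print(row_pattern)  # prints the regex string the composed pattern compiles to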
Capturing/Named
When matching pieces of text, it is also important to specify which pieces we would like to actually get back out.
For this there are two main RegexPattern instances.
The simplest one is Capturing.
This just specifies that we would like to capture a piece of text.
There is a slightly more sophisticated instance called Named which allows us to attach a name to a group.
key_value_matcher = RegexPattern([Named(Word, "key"), "=", Named(Word, "value")])
print(key_value_matcher)
(?P<key>\w+)(?:=)(?P<value>\w+)
This can be used directly to pull information out of files (here os and McUtils are assumed to already be imported):
test_data = os.path.join(os.path.dirname(McUtils.__file__), 'ci', 'tests', 'TestData')
with open(os.path.join(test_data, 'water_OH_scan.log')) as log_dat:
sample_data = log_dat.read()
matches = list(key_value_matcher.finditer(sample_data))
for match in matches[:5]:
print(match.groupdict())
{'key': '0', 'value': 'g09'}
{'key': 'Input', 'value': 'water_OH_scan'}
{'key': 'Output', 'value': 'water_OH_scan'}
{'key': 'Chk', 'value': 'water_OH_scan'}
{'key': 'NProc', 'value': '8'}
StringParser
A more powerful interface than RegexPattern is provided by a StringParser instance.
This provides a wrapper on RegexPattern that handles the process of turning matches into NumPy arrays of the appropriate type.
The actual interface is quite simple; e.g., we can take our matcher from before and use it directly:
key_vals = StringParser(key_value_matcher).parse_all(sample_data)
print(key_vals)
StructuredTypeArray(shape=[(11493, 0), (11493, 0)], dtype=OrderedDict([('key', StructuredType(<class 'str'>, shape=(None,))), ('value', StructuredType(<class 'str'>, shape=(None,)))]))
This StructuredTypeArray is basically a homegrown version of NumPy record arrays,
written without knowing that they existed.
A smarter reimplementation of this portion of the parsing process would make use of recarray instead of this custom array type.
That said, getting the raw ndarray objects out is straightforward:
key_vals['key'].array
array(['0', 'Input', 'Output', ..., 'State', 'RMSD', 'PG'], dtype='<U7')
NOTE: 90% of the bugs in the StringParser ecosystem will come from the design of StructuredTypeArray.
The need for efficient data handling leads to some difficult implementation details,
and because the data type has evolved organically it can be tough to understand.
A reimplementation based on recarray would potentially resolve some of these issues.
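To make the recarray comparison concrete, here is an illustrative sketch of how the parsed columns above could be repackaged as a standard NumPy record array; this is not part of the McUtils API, just a hypothetical example built from the key_vals object shown earlier.
import numpy as np

# hypothetical sketch: repackage the parsed string columns as a NumPy record array
rec = np.rec.fromarrays(
    [key_vals['key'].array, key_vals['value'].array],
    names=['key', 'value']
)
print(rec['key'][:5])  # field access then mirrors key_vals['key'].array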
Block Handlers
For efficiency's sake, StringParser objects also provide a block_handlers argument (and handlers can be defined on RegexPatterns directly).
A handler is a function that can be applied to a parsed piece of text and should directly return a NumPy array so that it can be worked into the returned StructuredTypeArray.
The simplest handlers are already provided for convenience on StringParser; e.g., from GaussianLogComponents.py:
Named(
Repeating(
Capturing(Number),
min = 3, max = 3,
prefix=Optional(Whitespace),
joiner = Whitespace
),
"Coordinates", handler=StringParser.array_handler(dtype=float)
)
Here StringParser.array_handler(dtype=float) provides efficient parsing of data through np.loadtxt with a float as the target dtype.
We also see the prefix and joiner options to RegexPattern in action.
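As a rough usage sketch, a handler-equipped pattern like the one above can be handed to a StringParser and run over a block of coordinate text. The sample text and variable names here are invented for illustration, and feeding a bare Named pattern straight to StringParser is an assumption based on the other examples in this section.
coords_pattern = Named(
    Repeating(
        Capturing(Number),
        min=3, max=3,
        prefix=Optional(Whitespace),
        joiner=Whitespace
    ),
    "Coordinates",
    handler=StringParser.array_handler(dtype=float)
)
sample_block = """
  0.000000   0.000000   0.116573
  0.000000   0.759337  -0.466292
"""
# each matched block is passed through the handler, which parses it via np.loadtxt
coords = StringParser(coords_pattern).parse_all(sample_block)
print(coords["Coordinates"].array)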
Before we can run our examples, we should get a bit of setup out of the way. Since these examples were harvested from the unit tests, not all pieces will be necessary for all situations.
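A minimal version of that setup might look like the following; the wildcard import and the source of the test utilities are assumptions, since the exact imports depend on how the test suite is configured.
import os
import numpy as np

import McUtils
from McUtils.Parsers import *  # RegexPattern, StringParser, Named, Capturing, etc.

# TestCase and TestManager come from the project's testing utilities;
# their exact import location is omitted here as it depends on the test framework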
All tests are wrapped in a test class:
class ParserTests(TestCase):
RegexGroups
def test_RegexGroups(self):
# tests whether we capture subgroups or not (by default _not_)
test_str = "1 2 3 4 a b c d "
pattern = RegexPattern(
(
Capturing(
Repeating(
Capturing(Repeating(PositiveInteger, 2, 2, suffix=Optional(Whitespace)))
)
),
Repeating(Capturing(ASCIILetter), suffix=Whitespace)
)
)
self.assertEquals(len(pattern.search(test_str).groups()), 2)
OptScan
def test_OptScan(self):
eigsPattern = RegexPattern(
(
"Eigenvalues --",
Repeating(Capturing(Number), suffix=Optional(Whitespace))
),
joiner=Whitespace
)
coordsPattern = RegexPattern(
(
Capturing(VariableName),
Repeating(Capturing(Number), suffix=Optional(Whitespace))
),
prefix=Whitespace,
joiner=Whitespace
)
full_pattern = RegexPattern(
(
Named(eigsPattern,
"Eigenvalues"
#parser=lambda t: np.array(Number.findall(t), 'float')
),
Named(Repeating(coordsPattern, suffix=Optional(Newline)), "Coordinates")
),
joiner=Newline
)
with open(TestManager.test_data('scan_params_test.txt')) as test:
test_str = test.read()
parser = StringParser(full_pattern)
parse_res = parser.parse_all(test_str)
parse_single = parser.parse(test_str)
parse_its = list(parser.parse_iter(test_str))
self.assertEquals(parse_res.shape, [(4, 5), [(4, 32), (4, 32, 5)]])
self.assertIsInstance(parse_res["Coordinates"][1].array, np.ndarray)
self.assertEquals(int(parse_res["Coordinates"][1, 0].sum()), 3230)
XYZ
def test_XYZ(self):
with open(TestManager.test_data('test_100.xyz')) as test:
test_str = test.read()
# print(
# "\n".join(test_str.splitlines()[:15]),
# "\n",
# XYZParser.regex.search(test_str),
# file=sys.stderr
# )
res = XYZParser.parse_all(
test_str
)
# print(
# res["Atoms"],
# file=sys.stderr
# )
atom_coords = res["Atoms"].array[1].array
self.assertIsInstance(atom_coords, np.ndarray)
self.assertEquals(atom_coords.shape, (100, 13, 3))
BasicParse
def test_BasicParse(self):
regex = RegexPattern(
(
Named(PositiveInteger, "NumAtoms"),
Named(
Repeating(Any, min = None), "Comment", dtype=str
),
Named(
Repeating(
Capturing(
Repeating(Capturing(Number), 3, 3, prefix = Whitespace, suffix = Optional(Whitespace)),
handler= StringParser.array_handler(shape = (None, 3))
),
suffix = Optional(Newline)
),
"Atoms"
)
),
"XYZ",
joiner=Newline
)
with open(TestManager.test_data('coord_parse.txt')) as test:
test_str = test.read()
res = StringParser(regex).parse(test_str)
comment_string = res["Comment"].array[0]
self.assertTrue('comment' in comment_string)
self.assertEquals(res['Atoms'].array.shape, (4, 3))