Getting Data in to Your Code

We’ve already talked a bit about the power of good data formatting. It makes everyone’s life better. We think it’d be good to get a bit more concrete about what that means for us, though. We also know that you’re not always going to have clean data to work with, and being able to deal with the mess is a useful skill.

But let’s start with the first one.

Clean Data

If you can use the built in NumPy formats, do so.

Otherwise, we argued that you should “just use JSON” when working with your data. Let’s take a look at what that kind of thing can mean. JSON is a nice format. It’s designed to transmit data on the web. And it looks almost exactly like a python dictionary. Here’s how we could store results from a simulation

{
  "simulation_name": "My Simulation",
  "simulator_version": "0.0.1",
  "timings": {
    "seconds_elapsed": 1000,
    "time_step": 0.1,
    "total_steps": 10000000
    },
  "energies":  [0.0, 1.2, -1.2, 1.2, 0.5, 2.8, ...]
}

The python json module can load this in easily. You can extend it to add units, coordinates, etc. It’s not the world’s most efficient format, but we also don’t work with all that much data. If we need a speed boost, there’s a hybrid JSON/data table format that we’ll talk about in the next section.

The one thing we’ll want to make sure of is that when we load in our data we convert that "energies" array to a proper NumPy array so we can work with it more easily.

A Note on np.loadtxt()

Sometimes you can’t work with JSON, but instead someone (probably using Fortran) gave you a .txt, .csv, or .dat file that looks like

 1.213123  3.294932 -2.321323 ...
-2.321323  0.349876  5.193273 ...
...

and you just need to load that in. In this case, using loadtxt is the quickest way to get the data into python, and then you can save that as in one of the NumPy formats.

Dirty Data

Unfortunately, not all data is clean. Many people started writing code in 1980, when Fortran’s 80 character line length was hip & cool. Many people also got with the times and adapted to improvements in computation that make everyone’s lives better. Many scientists did not. And particularly Gaussian, the electronic structure software makers, did not.

So unfortunately we’re still stuck with many, many situations where our data looks like

Mulliken charges and spin densities with hydrogens summed into heavy atoms:
               1          2
     1  C    0.284937  -0.047805
     2  C   -0.084087   0.975289
     5  O   -0.198441   0.044417
     6  O   -0.009235   0.004233
     8  C    0.009751   0.022752
    12  C   -0.002925   0.001113
 Electronic spatial extent (au):  <R**2>=            586.1312
 Charge=              0.0000 electrons
 Dipole moment (field-independent basis, Debye):
    X=              0.8917    Y=             -0.4708    Z=              1.6277  Tot=              1.9147
 Quadrupole moment (field-independent basis, Debye-Ang):
   XX=            -37.2983   YY=            -36.8835   ZZ=            -38.3391
   XY=              2.0489   XZ=             -3.3046   YZ=             -1.1258
 Traceless Quadrupole moment (field-independent basis, Debye-Ang):
   XX=              0.2087   YY=              0.6235   ZZ=             -0.8321
   XY=              2.0489   XZ=             -3.3046   YZ=             -1.1258
 Octapole moment (field-independent basis, Debye-Ang**2):
  XXX=            -16.5453  YYY=             -5.3136  ZZZ=             -3.4279  XYY=             -5.5706
  XXY=             -6.4237  XXZ=              3.4940  XZZ=             -3.5574  YZZ=              0.7124
  YYZ=              0.6046  XYZ=              1.7525
 Hexadecapole moment (field-independent basis, Debye-Ang**3):
 XXXX=           -335.5914 YYYY=           -215.5172 ZZZZ=           -195.0758 XXXY=             20.4569
 XXXZ=             -9.7915 YYYX=              1.7567 YYYZ=             -1.2146 ZZZX=             -2.3024
 ZZZY=              1.8767 XXYY=            -89.8989 XXZZ=            -95.5782 YYZZ=            -67.4313
 XXYZ=             -0.8395 YYXZ=             -0.0618 ZZXY=              1.2617
 N-N= 2.598308771361D+02 E-N=-1.240434554405D+03  KE= 3.068245376935D+02
  Exact polarizability:  62.069  -0.830  55.437  -2.686  -0.888  55.152
 Approx polarizability:  66.871  -1.448  60.217  -2.414   0.418  61.197
                          Isotropic Fermi Contact Couplings
        Atom                 a.u.       MegaHertz       Gauss      10(-4) cm-1
     1  C(13)             -0.01947     -21.88400      -7.80875      -7.29972
     2  C(13)              0.04515      50.75819      18.11179      16.93111
     3  H(1)              -0.01412     -63.09256     -22.51300     -21.04541
     4  H(1)              -0.01377     -61.55167     -21.96317     -20.53143
     5  O(17)              0.03457     -20.95714      -7.47803      -6.99055
     6  O(17)              0.00489      -2.96727      -1.05880      -0.98978
     7  H(1)              -0.00009      -0.40514      -0.14456      -0.13514
     8  C(13)              0.02456      27.61551       9.85390       9.21154
     9  H(1)               0.00363      16.22441       5.78927       5.41188
    10  H(1)              -0.00047      -2.09733      -0.74838      -0.69960
    11  H(1)               0.00051       2.25983       0.80636       0.75380
    12  C(13)             -0.00035      -0.39272      -0.14013      -0.13100
    13  H(1)              -0.00015      -0.65043      -0.23209      -0.21696
    14  H(1)               0.00070       3.11829       1.11269       1.04015
    15  H(1)              -0.00019      -0.85732      -0.30591      -0.28597

and in fact, we’re likely to have this type of block repeated many times over a file.

So what do we do with data like this? Well one option is to use copy everything out by hand. If you’ve got thousands of calculations, I’m gonna say that’s a no go.

Instead, unfortunately, we’re gonna be stuck doing what’s called parsing (etym.: from the French parser or _“to weep tears of frustration”). Parsing is a very, very broad topic and there are tons of ways to do it. We’re gonna briefly hit on two of the big ones in python

  1. File searching
  2. String splitting

The first of these isn’t so terrible, although this will get a little bit technical. (Don’t worry if you don’t follow totally, most of the parsers you’ll need have already been written and you just need to know how to extend them) We’ll assume that block (and many others like it) is inside a file called my_amazing_calc.log. Let’s also say we want to get that the line with dipole moment from it. If we could take the subsection of the string between the line

 Dipole moment (field-independent basis, Debye):

and

 Quadrupole moment (field-independent basis, Debye-Ang):

we’d be most of the way there. So how would we do it? Well first we need to know how to search a file in python. We’re going to work off the assumption that the file is very big and we don’t want to make python read it all at once. And so the way we do this is by using the .find method on a python memory-mapped file. (I told you this would get technical) Basically this allows us to pretend a file is a string. Once we’ve found the our opening block, we track where in the file that was (using .tell), and then look for the closing tag. Once we’ve got both of those, we can read what was between them.

We’ll wrap this up in a little class to make it easier to manage

import mmap
class FileSearcher:
    """
    Represents a file from which we'll stream blocks of data by finding tags and parsing what's between them
    """
    def __init__(self, file, encoding="utf-8"):
        self.file = file
        self.encoding = encoding # files all have encodings, usually the data is utf-8 or ASCII (the two are intercompatible)
    def __enter__(self): # allows us to use the `with` keyword
        self.fstream = open(self.file, mode='rb')
        self.stream = mmap.mmap(self.fstream.fileno(), 0)
        return self
    def __exit__(self, exc_type, exc_val, exc_tb): # allows us to use the `with` keyword
        self.stream.close()
        self.fstream.close()
    def read_block(self, opening, closing):
        """reads the block between `opening` and `closing`"""
        open_tag = opening.encode(self.encoding) # we need to encode our tag the same way we
        open_pos = self.stream.find(open_tag)
        new_pos = open_pos + len(open_tag)
        self.stream.seek(new_pos) # move to the end of the tag and search from there
        close_tag = closing.encode(self.encoding)
        close_pos = self.stream.find(close_tag)
        # read between the positions, then move to the end of the tag
        block = self.stream.read(new_pos - close_pos)
        new_pos = close_pos + len(close_tag)
        self.stream.seek(new_pos)
        return block.decode(self.encoding)

which we can then use like

>>> with FileSearcher('my_amazing_calc.log') as file:
>>>    block = file.read_block(
>>>        "Dipole moment (field-independent basis, Debye):",
>>>        "Quadrupole moment (field-independent basis, Debye-Ang):"
>>>        )
>>> print(block)
"""
    X=              0.8917    Y=             -0.4708    Z=              1.6277  Tot=              1.9147
 """

If we want to be safe about things, e.g. if the strings we’re looking for aren’t in the file, we’ll need to add a number of safety checks. If you want to see what that looks like, we have an example here.

Our block reader is also set up so that we can get multiple strings out just by using the same read_block multiple times withing the same with block.

At this point, we get to do some nice string manipulation to get the actual dipole moment out. The way we’ll do this is by splitting the string every time we see =. Python strings are very kind and support a .split method to make this easy.

So we’d do

>>> dip_block = """
>>>    X=              0.8917    Y=             -0.4708    Z=              1.6277  Tot=              1.9147
>>> """
>>> dip_chunks = dip_block.split("="); print(dip_chunks)
["""
    X""", "              0.8917    Y", "             -0.4708    Z", "              1.6277  Tot", """              1.9147
 """]

and we’ll note first that we only want the middle three of these and second that for each of these we can then split on the whitespace and pick the second component. So to get out the real dipole components we’d do

>>> real_dip_chunks = [chunk.split()[1] for chunk in dip_chunks[1:-2]]; print(real_dip_chunks)
["0.8917", "-0.4708", "1.6277"]

and finally to get our dipole moment we use the python float function to convert these to python numbers like

>>> dip_mom = [float(num) for num in real_dip_chunks]; print(dip_mom)
[0.8917, -0.4708, 1.6277]

Now imagine you want a different property from that log file. You get to figure this process out all over again for that one. And for the next. And the next.

Hopefully this makes it clear why we’ve got such a preference for clean data formats.

Just keep in mind, if you can’t use NumPy, just use JSON. But really, if you can use NumPy use NumPy.

A Note on Regular Expressions

Python has support for regular expressions (or regex). These are incredibly powerful and basically are a pattern language for strings. We wrote up a little regular expression builder so that if you the string from above

"""
    X=              0.8917    Y=             -0.4708    Z=              1.6277  Tot=              1.9147
 """

you can write out it out as a regular expression like

RegexPattern((
    "X=", Whitespace, Capturing(Number), Whitespace,
    "Y=", Whitespace, Capturing(Number), Whitespace,
    "Z=", Whitespace, Capturing(Number)
))

which will convert to the rather less-readable regular expression

"X=\\s+(\\d+)\\s+Y=\\s+(\\d+)\\s+Z=\\s+(\\d+)"

and allow you to get the three numbers out. If you’d like to learn more, we’re happy to share.

Grep

If you’re working outside of python, you can also use the program grep to do searches. This can be highly efficient, but grep does its pattern matching based on regular expressions, too. If you do want to use grep, you use it like

grep <pattern> <file>

and grep will spit out the matches for you. Some people like to just use grep and manual cleaning in Excel to get their data out of files. We’re pretty lazy, so we tend to stick to the programmatic way, as we can write the code once and keep on reusing it.

A Final note on Data Formats

Python is a powerful language with native support for a surprising number of file types. So if you find yourself with a lot of data in a weird format, before you spend days writing a parser for it, take a few minutes and google if python has any kind of native support for it or if there’s some simple library out there to help you out. Similarly, if you’re stuck, OpenBabel is always a good option to convert your data into a format you can use.

Next: Getting Data out of your Program
Previous: Data Formats

Got questions? Ask them on the McCoy Group Stack Overflow


Edit on GitHub