Getting Data out of Your Code

We’ve dealt with the process of getting data into your programs, but sometimes you need to get it out, too. Maybe your calculation took a long time and you don’t want to have to rerun it. Maybe you’d like to use the results of this calculation as an intermediate in other ones. The possibilities are myriad. And so how do we want to do this?

Well, as we’ve been saying, if you can, use NumPy. If you can’t that’s where things get interesting.

A Note on np.savetxt()

As we noted before, sometimes life throws a wrench in your plans to do the right thing, and someone (usually using Fortran) needs tabular data. If that’s the case, the NumPy savetxt function is going to be your friend.

Just Use JSON

The theme of pretty much everything we’ve been saying up until now is that if you can’t use NumPy, just use JSON. So what does this look like in practice?

Well let’s imagine we ran a discrete variable representation calculation and we’ve got a bunch of stuff we want to save

dvr_name = 'my_dvr'
dvr_type = 'ColbertMiller1D'
dvr_range = [-1, 1]
dvr_points = 500
dvr_grid = ... # an vector from the calculation
dvr_pe = ... # a potential energy matrix
dvr_ke = ... # a kinetic energy matrix
dvr_engs = ... # vector of energies
dvr_wfns = ... # matrix of wavefunction values

Now we want to dump all of this to a JSON file.

Unfortunately, NumPy has no built-in way to write stuff to JSON, so we can’t just say

import json

with open("my_dvr_results.json", "w") as res:
    json.dump(
        dict(
            name = dvr_name,
            type = dvr_type,
            range = dvr_range,
            points = dvr_points,
            grid = dvr_grid,
            potential_energy = dvr_pe,
            kinetic_energy = dvr_ke,
            energies = dvr_engs,
            wavefunctions = dvr_wfns
            ),
        res
    )

instead we need to do one of three things, convert to regular python lists, write a hybrid format, or use HDF5.

array.tolist()

The first, and the one we recommend for most cases, is to use the .tolist() function supported by NumPy arrays to convert them into regular python lists. JSON can deal with regular python lists.

That would look like

import json

with open("my_dvr_results.json", "w") as res:
    json.dump(
        dict(
            name = dvr_name,
            type = dvr_type,
            range = dvr_range,
            points = dvr_points,
            grid = dvr_grid.tolist(),
            potential_energy = dvr_pe.tolist(),
            kinetic_energy = dvr_ke.tolist(),
            energies = dvr_engs.tolist(),
            wavefunctions = dvr_wfns.tolist()
            ),
        res
    )

Hybrid Formats

The second option is to use a hybrid format. This is going to be most useful when we’re working with very large arrays and the memory penalty of tolist is too high. These formats will store all metadata in JSON format, but allow arrays to be written out in a more efficient format, which NumPy has support for.

Here’s one example of how this can work. Our layout will be

{
    ...,
    arrays: [
        {key: 'array_1', dtype: 'float64', shape: [120, 5, 3]},
        {key: 'array_2', dtype: 'int32', shape: [500]},
        ...
    ]
}
flattened array 1 
flattened array 2
...

where we’ve written the metadata out over multiple lines, but for ease of processing we’ll actually leave it as one line.

We can perform this like

import json

meta = {
    'name': 'my_data',
    'range': [-1, 1],
    'npoints': len(grid),
    'arrays': [
        {'key':'grid', 'shape':grid.shape, 'dtype':grid.dtype.name},
        {'key':'energies', 'shape':engs.shape, 'dtype':engs.dtype.name},
        {'key':'wavefunctions', 'shape':wfns.shape, 'dtype':wfns.dtype.name}
    ]
}

with open('my_dvr_results.jsnp') as file:
    json.dump(meta, file)
    for array in [grid, engs, wfns]: # must align with what's in the meta data
        file.write("\n")
        array.flatten().tofile(file, sep=" ")
with open('my_dvr_results.jsnp') as file: # just so we can see what's inside
    print(file.read())
{"name": "my_data", "range": [-1, 1], "npoints": 50, "arrays": [{"key": "grid", "shape": [50], "dtype": "float64"}, {"key": "energies", "shape": [5], "dtype": "float64"}, {"key": "wavefunctions", "shape": [50, 5], "dtype": "float64"}]}
0.21901924207629442 0.2845811641290318 0.35014308618176915 0.41570500823450646 0.4812669302872438 0.5468288523399811 0.6123907743927185 0.6779526964454559 0.7435146184981932 0.8090765405509306 0.874638462603668 0.9402003846564053 1.0057623067091426 1.0713242287618798 1.1368861508146173 1.2024480728673548 1.268009994920092 1.3335719169728293 1.3991338390255668 1.464695761078304 1.5302576831310415 1.5958196051837787 1.6613815272365162 1.7269434492892535 1.7925053713419907 1.8580672933947282 1.9236292154474655 1.989191137500203 2.05475305955294 2.1203149816056777 2.185876903658415 2.251438825711152 2.3170007477638896 2.382562669816627 2.448124591869364 2.5136865139221016 2.579248435974839 2.6448103580275766 2.7103722800803136 2.775934202133051 2.8414961241857886 2.9070580462385256 2.972619968291263 3.0381818903440005 3.103743812396738 3.169305734449475 3.2348676565022125 3.30042957855495 3.365991500607687 3.4315534226604245
0.009001295964747488 0.027003887894241883 0.04500647982373659 0.06300907175323109 0.08101166368272575
-1.6297059196185867e-18 -2.1238603683495063e-17 -1.9600717185203036e-16 -1.3731601478408739e-15 -8.218185917980731e-15 -4.227914540010922e-17 -5.084342248937425e-16 -4.332144517493014e-15 -3.002680951714255e-14 -1.7897819993578514e-13 -9.059740244995207e-16 -1.0543155141422734e-14 -8.611109629014152e-14 -5.698475466077956e-13 -3.2404706891417078e-12 -1.7171629954202116e-14 -1.9092558963231228e-13 -1.4889229044500666e-12 -9.40214519164098e-12 -5.098064790049858e-11 -2.846932711821115e-13 -3.018161112932004e-12 -2.2424111219488243e-11 -1.3478927106839047e-10 -6.95067122466062e-10 -4.129171172995085e-12 -4.164009671710429e-11 -2.940051363784702e-10 -1.6777664190098841e-09 -8.205015616928294e-09 -5.239333858522271e-11 -5.012599259980382e-10 -3.3540088695643536e-09 -1.8117105418786575e-08 -8.376076604111632e-08 -5.815883186681261e-10 -5.263432211396559e-09 -3.327149414785951e-08 -1.6954856246750568e-07 -7.384017445826375e-07 -5.647821592968985e-09 -4.8192586037549805e-08 -2.867863628349973e-07 -1.3735044168554768e-06 -5.611658815615956e-06 -4.79812813409137e-08 -3.846084822713215e-07 -2.146043916755955e-06 ... 1.3735044166976815e-06 -5.61165881554646e-06 -5.815882009044431e-10 5.2634320539981354e-09 -3.3271493829883616e-08 1.695485625130424e-07 -7.38401744783632e-07 -5.239331244898199e-11 5.012601383956994e-10 -3.354008387025528e-09 1.8117105545311923e-08 -8.37607662871476e-08 -4.129388444613319e-12 4.164022492814449e-11 -2.940047539953715e-10 1.6777664070224127e-09 -8.205015798783275e-09 -2.848981448658778e-13 3.018074203228939e-12 -2.2424040720969595e-11 1.347890895594096e-10 -6.95067270195125e-10 -1.720736448735197e-14 1.9076352665703666e-13 -1.4890047052317808e-12 9.402089435454086e-12 -5.098067554187757e-11 -9.335616305004049e-16 1.0374985115028238e-14 -8.613933684963118e-14 5.699183398999688e-13 -3.2405077703636008e-12 -7.337023374629885e-17 3.9588334188743134e-16 -4.302858625647169e-15 3.011676891027034e-14 -1.7903129531564725e-13 1.423766939174147e-17 -1.5833142601610604e-17 -1.778352319409878e-16 1.4410574036154364e-15 -8.225059630431985e-15

When we ingest this, we’ll do so using NumPy’s fromfile method as well as the readline method of a file.

This will look like

import json, numpy as np

with open('my_dvr_results.jsnp') as file:
    meta_line = file.readline()
    file_data = json.loads(meta_line)
    for arr_spec in file_data['arrays']:
        shp = arr_spec['shape']
        num_elements = int(np.prod(shp))
        arr = np.fromfile(file, dtype=arr_spec['dtype'], count=num_elements, sep=" ")
        file_data[arr_spec['key']] = arr.reshape(shp)
print(file_data) # just so we can confirm the right stuff is coming out
{'name': 'my_data',
 'range': [-1, 1],
 'npoints': 50,
 'arrays': [{'key': 'grid', 'shape': [50], 'dtype': 'float64'},
  {'key': 'energies', 'shape': [5], 'dtype': 'float64'},
  {'key': 'wavefunctions', 'shape': [50, 5], 'dtype': 'float64'}],
 'grid': array([0.21901924, 0.28458116, 0.35014309, 0.41570501, 0.48126693,
        0.54682885, 0.61239077, 0.6779527 , 0.74351462, 0.80907654,
        0.87463846, 0.94020038, 1.00576231, 1.07132423, 1.13688615,
        1.20244807, 1.26800999, 1.33357192, 1.39913384, 1.46469576,
        1.53025768, 1.59581961, 1.66138153, 1.72694345, 1.79250537,
        1.85806729, 1.92362922, 1.98919114, 2.05475306, 2.12031498,
        2.1858769 , 2.25143883, 2.31700075, 2.38256267, 2.44812459,
        2.51368651, 2.57924844, 2.64481036, 2.71037228, 2.7759342 ,
        2.84149612, 2.90705805, 2.97261997, 3.03818189, 3.10374381,
        3.16930573, 3.23486766, 3.30042958, 3.3659915 , 3.43155342]),
 'energies': array([0.0090013 , 0.02700389, 0.04500648, 0.06300907, 0.08101166]),
 'wavefunctions': array([[-1.62970592e-18, -2.12386037e-17, -1.96007172e-16,
         -1.37316015e-15, -8.21818592e-15],
        [-4.22791454e-17, -5.08434225e-16, -4.33214452e-15,
         -3.00268095e-14, -1.78978200e-13],
        [-9.05974024e-16, -1.05431551e-14, -8.61110963e-14,
         -5.69847547e-13, -3.24047069e-12],
        [-1.71716300e-14, -1.90925590e-13, -1.48892290e-12,
         -9.40214519e-12, -5.09806479e-11],
        ...
        [-1.72073645e-14,  1.90763527e-13, -1.48900471e-12,
          9.40208944e-12, -5.09806755e-11],
        [-9.33561631e-16,  1.03749851e-14, -8.61393368e-14,
          5.69918340e-13, -3.24050777e-12],
        [-7.33702337e-17,  3.95883342e-16, -4.30285863e-15,
          3.01167689e-14, -1.79031295e-13],
        [ 1.42376694e-17, -1.58331426e-17, -1.77835232e-16,
          1.44105740e-15, -8.22505963e-15]])}

Another option, and one that will also help cut down on the size of all of the files, is to use the fact that all modern languages support reading .zip files, either natively or, for Fortran, through a DLL. Rather than put all of these arrays in the same .txt file, we can put them in separate ones which can be loaded with NumPy’s loadtxt method and whose names are defined in a data.json file inside the ZIP. Here’s a sample implementation for writing one of these hybrid files

import zipfile, json, numpy as np, tempfile as tf, os

meta = {
    'name': 'my_data',
    'range': [-1, 1],
    'npoints': len(grid),
    'arrays': [
        {'key':'grid', 'shape':grid.shape, 'dtype':grid.dtype.name},
        {'key':'energies', 'shape':engs.shape, 'dtype':engs.dtype.name},
        {'key':'wavefunctions', 'shape':wfns.shape, 'dtype':wfns.dtype.name}
    ]
}

with zipfile.ZipFile('my_dvr_results.zip', 'w') as res:
    with tf.TemporaryDirectory() as tmp_dir:
        meta_dump = os.path.join(tmp_dir, 'data.json')
        with open(meta_dump, 'w') as dump:
            json.dump(meta, dump)
        res.write(meta_dump, arcname='data.json')
        for array, spec in zip([grid, engs, wfns], meta['arrays']):
            arr_dump = os.path.join(tmp_dir, spec['key'])
            np.savetxt(arr_dump, array.flatten(), newline=" ")
            res.write(arr_dump, arcname=spec['key'])

and here’s how you’d read the file that was built

import zipfile, json, numpy as np

with zipfile.ZipFile('my_dvr_results.zip') as file:
    with file.open('data.json') as header:
        file_data = json.load(header)
    for arr_spec in file_data['arrays']:
        name = arr_spec['key']
        shp = arr_spec['shape']
        num_elements = int(np.prod(shp))
        with file.open(name) as arr_dat:
            arr = np.loadtxt(arr_dat, dtype=arr_spec['dtype'])
            file_data[name] = arr.reshape(shp)

just for interest’s sake, memory-wise the first format seems to be a bit more compact, but the latter obviously has the benefit that it is easier to decompose. The second can also potentially benefit from changing up the compression and compressionlevel on the ZipFile object.

HDF5

If you’ve got to move between platforms (esp. if you’ve got to use Fortran or C++) and JSON isn’t cutting it, HDF5 might worth the plunge.

This isn’t Data Presentation

Note that we’ve focused on data export, not presentation. In general, you don’t want to mix the two. A table for a paper shouldn’t be how you save your results. If you do want to learn about data presentation, we’ve got a section on plotting that you should check out.

Next: NumPy Data Files
Previous: Getting Data into your Program

Got questions? Ask them on the McCoy Group Stack Overflow


Edit on GitHub