July 27, 2018
In “Python Data Science and the Rocket MultiValue Database ( Part 1 of 3 )” I provided an introduction to NumPy and showed how to convert a NumPy array to a u2py.DynArray.
In this section I will go a bit further and show:
- How to write the NumPy data to a Rocket MultiValue file
- How to read back the MultiValue data and instantiate a NumPy array
- What Pandas is
- How to move a NumPy array to a Pandas DataFrame
How to write the NumPy data to a Rocket MultiValue file
Before we can begin, we need to create a Rocket MultiValue file and some dictionary items. For brevity, I will copy dictionary items into the new file's dictionary rather than creating them by hand. ( In part three of the series, I complete our journey by creating a Python class for persisting your data science objects into the MultiValue database. )
UniData Example:
:CREATE.FILE U2DS 3 11
Create file D_U2DS, modulo/3,blocksize/1024
Hash type = 0
Create file U2DS, modulo/11,blocksize/1024
Hash type = 3
Added "@ID", the default record for UniData to DICT U2DS.
:COPY FROM DICT VOC TO DICT U2DS F1 F2 F3 F4
4 records copied
Note that you will need to modify the new dictionary items to be defined as MultiValued. ( Change attribute 7 from S to M )
Now that you have a file, let’s go into Python, build a NumPy array, and store it in the MultiValue file.
: PYTHON
python> import u2py
python> import numpy as np
python> import pandas as pd
Start with a Numpy Array ( built with simple sample data )
python> theData = [ [ 101, 102, 103, 104 ], [ 201, 202, 203, 204], [ 301, 302, 303, 304 ], [ 401, 402, 403, 404 ] ]
python> theData
[[101, 102, 103, 104], [201, 202, 203, 204], [301, 302, 303, 304], [401, 402, 403, 404]]
python> myArray = np.array( theData )
python> myArray
array([[101, 102, 103, 104],
[201, 202, 203, 204],
[301, 302, 303, 304],
[401, 402, 403, 404]])
We can modify our 4×4 Numpy Array prior to persisting it to our MultiValue database.
python> np.transpose(myArray)
array([[101, 201, 301, 401],
       [102, 202, 302, 402],
       [103, 203, 303, 403],
       [104, 204, 304, 404]])
For our example we will put the transposed array data back into a Python nested list
python> asNestedList = np.transpose(myArray).tolist()
python> asNestedList
[[101, 201, 301, 401], [102, 202, 302, 402], [103, 203, 303, 403], [104, 204, 304, 404]]
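As a side note on the step above: np.transpose does not change the original array, and tolist() hands back plain Python integers rather than NumPy scalars. The following is a minimal, self-contained sketch of that behaviour using the same sample data; the snake_case variable names are mine and do not appear in the session.

```python
import numpy as np

# Same sample data as in the walkthrough.
the_data = [[101, 102, 103, 104],
            [201, 202, 203, 204],
            [301, 302, 303, 304],
            [401, 402, 403, 404]]
my_array = np.array(the_data)

# np.transpose returns a transposed view; my_array keeps its orientation.
transposed = np.transpose(my_array)

# For a 2-D array, .T is shorthand for the same operation.
same_as_t = my_array.T

# tolist() converts the NumPy integers to plain Python ints.
as_nested_list = transposed.tolist()

print(as_nested_list[0])      # first row of the transposed data
print(my_array[0].tolist())   # first row of the untouched original
```

If you want the transposed data to replace the original, assign the result back to the same name instead of relying on an in-place change.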
Here I’ll write the data to the MultiValue file.
Since the u2py.DynArray can be instantiated from a Python nested list, we can create a dynamic array, and store it in the file we created earlier.
python> rec = u2py.DynArray(asNestedList)
python> rec
<u2py.DynArray value=b'101\xfd201\xfd301\xfd401\xfe102\xfd202\xfd302\xfd402\xfe103\xfd203\xfd303\xfd403\xfe104\xfd204\xfd304\xfd404'>
python> file = u2py.File("U2DS")
python> file.write("mike", rec)
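The value of the DynArray is a single byte string in which byte 0xFE (the field mark) separates fields and byte 0xFD (the value mark) separates the values inside a field. To make that layout easier to see, here is a small pure-Python illustration that builds and splits such a string. It is only a sketch of the delimiter scheme: the encode and decode helpers are hypothetical names of mine, they do not call u2py, and they are not how u2py itself is implemented.

```python
# Illustration only: mimics the delimiter layout of a MultiValue record.
# These helpers are hypothetical and are not part of u2py.
FIELD_MARK = b"\xfe"   # separates fields (attributes)
VALUE_MARK = b"\xfd"   # separates values inside one field


def encode(nested):
    """Join a nested list into one delimited byte string."""
    fields = [
        VALUE_MARK.join(str(value).encode("ascii") for value in field)
        for field in nested
    ]
    return FIELD_MARK.join(fields)


def decode(raw):
    """Split a delimited byte string back into a nested list of strings."""
    return [
        [value.decode("ascii") for value in field.split(VALUE_MARK)]
        for field in raw.split(FIELD_MARK)
    ]


nested = [[101, 201, 301, 401], [102, 202, 302, 402]]
raw = encode(nested)
print(raw)
print(decode(raw))
```

Notice that decoding yields strings, not integers. The record carries no type information, which is why data read back from the file needs an explicit conversion before numeric work.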
Now I’ll verify the data made it to the file.
python> u2py.run("LIST U2DS F1 F2 F3 F4")
LIST U2DS F1 F2 F3 F4 10:48:56 Jul 06 2018 1
U2DS...... F1........ F2............. F3............. F4.............
mike       101        102             103             104
           201        202             203             204
           301        302             303             304
           401        402             403             404
1 record listed
How to read back the MultiValue data, and instantiate a NumPy array
The next step in our example is to extract the data from the MultiValue database for use in further data science processing.
python> myDynArray = file.read("mike")
python> myDynArray
<u2py.DynArray value=b'101\xfd201\xfd301\xfd401\xfe102\xfd202\xfd302\xfd402\xfe103\xfd203\xfd303\xfd403\xfe104\xfd204\xfd304\xfd404'>
python> myNestedList = myDynArray.to_list()
As mentioned earlier, you can instantiate a numpy array from a nested list.
python> npArray = np.array(myNestedList)
python> npArray
array([['101', '201', '301', '401'],
       ['102', '202', '302', '402'],
       ['103', '203', '303', '403'],
       ['104', '204', '304', '404']],
      dtype='<U3')
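The dtype of '<U3' shows that the values came back as three-character strings rather than integers. If you need numbers for further processing, one option is NumPy's astype. This is a minimal sketch that starts from a literal nested list standing in for the data read from the file; the variable names are my own.

```python
import numpy as np

# Stand-in for the nested list of strings read back from the file.
my_nested_list = [['101', '201', '301', '401'],
                  ['102', '202', '302', '402'],
                  ['103', '203', '303', '403'],
                  ['104', '204', '304', '404']]

np_strings = np.array(my_nested_list)   # unicode string dtype
np_numbers = np_strings.astype(int)     # new array with an integer dtype

print(np_strings.dtype)
print(np_numbers.dtype)
print(np_numbers.sum())   # arithmetic now works on the converted array
```

astype returns a new array and leaves the string array as it was, so you can keep both if the original text form is still useful.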
Introduction to Pandas
Pandas is an open source Python library widely used in data science. It loads data into an easy-to-use tabular structure, the DataFrame, on which you can run operations over large data sets.
Since we started our discussion with NumPy arrays, we will instantiate our Pandas DataFrame from the NumPy array. While NumPy handles the array of values, Pandas also lets you define the column headers:
python> pdDataFrame = pd.DataFrame(npArray, columns=['f1','f2','f3','f4'])
python> pdDataFrame
    f1   f2   f3   f4
0  101  201  301  401
1  102  202  302  402
2  103  203  303  403
3  104  204  304  404
You now have a Pandas DataFrame to examine. The values attribute of the DataFrame exposes its data as a NumPy array, which can be used like any other NumPy array, including being converted to a nested Python list.
python> pdDataFrame.values
array([['101', '201', '301', '401'],
       ['102', '202', '302', '402'],
       ['103', '203', '303', '403'],
       ['104', '204', '304', '404']], dtype=object)
Note that we can also get the values as a Nested List:
python> pdDataFrame.values.tolist()
[['101', '201', '301', '401'], ['102', '202', '302', '402'], ['103', '203', '303', '403'], ['104', '204', '304', '404']]
We can also extract the column names in the same way:
python> pdDataFrame.columns.tolist()
['f1', 'f2', 'f3', 'f4']
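Taken together, the nested list of values and the list of column names are enough to rebuild the DataFrame later. The sketch below shows one possible round trip, with a conversion to integers along the way. It uses DataFrame.to_numpy(), which the Pandas documentation recommends over the values attribute; the variable names and the choice of int64 are assumptions of mine.

```python
import numpy as np
import pandas as pd

# Stand-in for the string array built from the MultiValue record.
np_array = np.array([['101', '201', '301', '401'],
                     ['102', '202', '302', '402'],
                     ['103', '203', '303', '403'],
                     ['104', '204', '304', '404']])

df = pd.DataFrame(np_array, columns=['f1', 'f2', 'f3', 'f4'])

# Convert every column from string to 64-bit integer.
df_numeric = df.astype("int64")

# Pull out the two pieces needed to rebuild the frame.
rows = df_numeric.to_numpy().tolist()
cols = df_numeric.columns.tolist()

# Rebuild a DataFrame from the plain Python lists.
rebuilt = pd.DataFrame(rows, columns=cols)
print(rebuilt)
print(rebuilt["f1"].sum())
```

Keeping the rows and the column names as plain Python lists is convenient here because a nested list is also what u2py.DynArray accepts.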
In “Python Data Science and the Rocket MultiValue Database ( Part 3 of 3 )” I will show some of the things you can do with Pandas, and create a simple object for managing the storage and retrieval of the data to a Rocket MultiValue Database.