D3 and mvBase


 Large JSON record to d3

Ted Hurlbut posted 07-15-2022 15:14
All,

Currently we are extracting JSON data character by character and saving the data in D3. That works fine for small records, but now we are looking at converting a record with a size of 3,147,865. Character-by-character manipulation of a string that large is very cumbersome. Is there a JSON-to-D3 program out there that could handle this, or does anyone have any ideas?

Ted Hurlbut
ROCKETEER Brian Cram
Hi, Ted. My JSON parser has the same issue. I know of no JSON to D3 converter out there.

I had the same problem with an XML to D3 converter I wrote years ago. A D3 engineer was attempting to use it and found it slow as well. It's probably the way that I'm parsing the string, which is to take the leading character of the string, then chop it off to create a new string, rinse, repeat. I was going to try incrementing a string position and retrieving the character at position X (like WORKSTRING[X,1]) to see if that speeds things up. I'll give that a try. If that doesn't speed things up, maybe there's a way to put the string in real memory and use some C function to find the initial memory address and increment the address. I'll run that past Engineering.
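
To illustrate what I mean, the two styles look roughly like this (untested sketch; the "parse c here" comments stand in for the real parsing logic):

* Style 1 - chop the leading character off each time. Every assignment
* copies the remainder of the string, so cost grows roughly with the
* square of the string length.
jsonstring = '{"name":"value"}'
workstring = jsonstring
loop while len(workstring) do
   c = workstring[1,1] ;* parse c here
   workstring = workstring[2,len(workstring)]
repeat
*
* Style 2 - leave the string alone and walk an index; no copies made.
slen = len(jsonstring)
for x = 1 to slen
   c = jsonstring[x,1] ;* parse c here
next x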
ROCKETEER Brian Cram
After I pressed "Post" I had another thought. Is this JSON string stored in the Unix file system? If so, I have another idea: use the %open and %read stuff to read the JSON object a line (or a few characters) at a time and parse those strings. Are you familiar with those %C and similar functions?
Thomas Iwanowicz
We are using jq (on Linux) to parse from PICK, with the saved JSON in a Linux directory.

https://stedolan.github.io/jq/

You'll have to read the manual, but here is an example of pulling in a set of messages and reading data from it to store or use in D3:

execute \!jq '.messages | length' /path/to/file/\:jsonfile capturing numMessages
for i = 1 to numMessages
execute \!jq '.messages[\:i - 1:\].body' /path/to/file/\:jsonfile capturing messageBody
execute \!jq '.messages[\:i - 1:\].direction' /path/to/file/\:jsonfile capturing messageDirection
execute \!jq '.messages[\:i - 1:\].to' /path/to/file/\:jsonfile capturing messageTo
execute \!jq '.messages[\:i - 1:\].from' /path/to/file/\:jsonfile capturing messageFrom
execute \!jq '.messages[\:i - 1:\].status' /path/to/file/\:jsonfile capturing messageStatus
execute \!jq '.messages[\:i - 1:\].date_sent' /path/to/file/\:jsonfile capturing messageDateSent
execute \!jq '.messages[\:i - 1:\].sid' /path/to/file/\:jsonfile capturing messageSid
next i
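
If executing jq once per field gets slow on larger message lists, everything can be pulled in a single call and split in D3; roughly like this (untested sketch, assuming the same field names and path as above):

* One jq call per file instead of one per field; @tsv emits one
* tab-separated line per message, and capturing returns each output
* line as an attribute.
execute \!jq -r '.messages[] | [.body, .direction, .to, .from, .status, .date_sent, .sid] | @tsv' /path/to/file/\:jsonfile capturing rawLines
numMessages = dcount(rawLines, @am)
for i = 1 to numMessages
   line = rawLines<i>
   messageBody      = field(line, char(9), 1)
   messageDirection = field(line, char(9), 2)
   messageTo        = field(line, char(9), 3)
   messageFrom      = field(line, char(9), 4)
   messageStatus    = field(line, char(9), 5)
   messageDateSent  = field(line, char(9), 6)
   messageSid       = field(line, char(9), 7)
next i
* Note: @tsv escapes embedded tabs/newlines as \t and \n, so bodies
* containing those would need an extra unescape pass.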
Ted Hurlbut
Thanks, Brian, for the input.

Our initial processing of a record took over 3 hours. We modified the program to dynamically create a dimensioned array and then break up the record into 5000-byte increments. We did have to do some additional processing to merge 2 elements of the array when necessary. By making these changes, we were able to do the entire record in 40 seconds, which included writing over 10,000 D3 records to 8 different files.
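
In outline, the chunking looked something like this (a simplified sketch, not the actual program; jsonrec stands for the JSON record already read into a variable):

* Break a large string into 5000-byte pieces of a dimensioned array.
chunk = 5000
reclen = len(jsonrec)
nchunks = int((reclen - 1) / chunk) + 1
dim piece(nchunks)
for i = 1 to nchunks
   piece(i) = jsonrec[(i - 1) * chunk + 1, chunk]
next i
* When a value straddles a chunk boundary, the two neighboring
* elements get merged before parsing, e.g.:
* piece(i) = piece(i) : piece(i + 1) ; piece(i + 1) = ''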
ROCKETEER Brian Cram
I forgot to mention that dimensioned array thing. I've done that before with large strings; it was part of the solution I was envisioning that also used the "read line" %c functions (I forget which one it is). Anyhow, glad it's sorted.
ROCKETEER Brian Cram
Here are two programs that show how to read/write large O/S files a line at a time. These happen to be D3 Windows examples, but you get the idea:


* readBigTxtFile
*
include dm,bp,unix.h fcntl.h
fname = 'c:\users\public\downloads\stream.txt'
stream = (char*)%fopen(fname, 'r')
if stream = 0 then stop "Cannot open ":fname
blen = 128
*
loop
gosub GetNextLine
until line = "" do
convert char(13) to "" in line
crt line
for x = 1 to blen; crt seq(line[x,1])'r#4':; next x; crt ;* debug: dump the character code of each byte in the buffer
* crt "Do your line parsing and system updates here"
repeat
w = %fclose((char*)stream)
stop
*
GetNextLine:
* line is a fixed-length C-style buffer filled by %fgets
char line[blen]
pointer = (char*)%fgets(line, blen, (char*)stream)
if pointer = 0 then line = "";* else line = field(line, char(10), 1)
return
*
* End of source

======================================================

* writeBigTxtFile
*
include dm,bp,unix.h fcntl.h
fname = 'c:\users\public\downloads\customers.txt'
stream = (char*)%fopen(fname, 'w')
if stream = 0 then stop "Cannot open ":fname
open 'customers' to f.customers else abort 201,'customers'
select f.customers
*
loop
readnext id else id = "\\eof"
until id = "\\eof" do
read line from f.customers, id then
convert char(9) to "" in line
convert @am:@vm to char(9):" " in line
gosub PutNextLine
end
repeat
w = %fclose((char*)stream)
stop
*
PutNextLine:
line := char(13):char(10)
error = %fputs(line, (char*)stream)
if error < 0 then crt "write error"
return
*
* End of source
Lance McMillin
Hi Ted,
It is a coincidence that I was just recently asked to look at JSON files and to possibly parse them.
I modified some other code I use to read large files so it could process JSON files.
The code should place each key-value pair, object delimiter and array delimiter on a separate line in a dynamic array for further processing.
There could be other string swaps needed to get each item onto its own line.
My goal with this was to get the key-value objects into their own attributes.
Once that is done you should be able to navigate down through the array, keeping track of which object and/or array level you are at.
If you want the keys and values in separate attributes, just add another string swap to convert ":" to ":crlf".

Of course you can create and convert to a dimensioned array for faster processing.

The file I processed was just over 18MB and it processed to a dynamic array with just under 400K attributes in just under 2 seconds.

I would be interested to see the code you are using to convert to D3 files.

Hope this works for you.


****** THIS PROGRAM MUST BE COMPILED WITH THE "COMPILE" COMMAND ******

INCLUDE DM,BP,UNIX.H FCNTL.H ;! Needed for "C" functions to work correctly.

output_folder = '/u/home/lance/json/'
json_filespec = output_folder : 'cigna.json'
outfilename = 'cignatest.json'

Initialize:
a = @AM
v = @VM
lf = CHAR(10) ;! LINE FEED character
cr = CHAR(13) ;! CARRIAGE RETURN character
buflen = 20000 ;! Serial Read buffer length (Tried values of 10k, 15k and 30k - 20k seems to be the best)
exit_flag = 0 ;! Control exit from LOOP statement
extra = '' ;! Hold partial record information between reads
groupptr = 0 ;! Pointer used to store info in the dimensioned array. Info that could go back to the calling program.
ln = 0 ;! Temporary line counter used in debugging
results = '' ;! Hold return data to calling program
stime = SYSTEM( 12 )

DIM groups( 80000 ) ;! Create dimensioned array. Temporary holder before saving data to file
MAT groups = '' ;! Initialize dimensioned array

Read_File:
file_handle = %OPEN( json_filespec, O$RDONLY ) ;! Open file

bytesread = 0
IF file_handle LT 0 THEN ;! Check for bad file handle
results = 'ERROR: OPEN file error in sub (PRE_PROCESS_EPIC_FILE): ' : json_filespec ;! Build file open error message
END ELSE
LOOP UNTIL exit_flag ;! Beginning of READ loop
readbuffer = SPACE( buflen ) ;! Initialize readbuffer
readbytes = %read( file_handle, readbuffer, buflen ) ;! Perform read
readbuffer = SWAP( readbuffer, lf, '' ) ;! Remove all line feed characters in readbuffer
readbuffer = SWAP( readbuffer, cr, '' ) ;! Remove all carriage return characters in readbuffer
BEGIN CASE
CASE readbytes = -1 ;! Problem with read
results = 'ERROR: Read error: ' : SYSTEM( 0 )
exit_flag = 1 ;! Set exit flag to exit loop
CASE readbytes EQ buflen ;! Process current read buffer - NOT EOF
lines = extra : readbuffer ;! Prepend remaining previous readbuffer to current readbuffer
GOSUB Convert2Attribute
linecount = COUNT( lines, a ) ;! Get count of lines based on attribute marks
extrapos = INDEX( lines, a, linecount ) + 1 ;! Determine start point of partial message group
extra = lines[ extrapos, 10000] ;! Save last partial/whole read buffer
lines[ extrapos, 10000 ] = '' ;! Clear portion of lines now held in extra for next pass
groupptr += 1
groups( groupptr ) = lines
CASE readbytes NE buflen ;! Process last read buffer - EOF
lines = extra : readbuffer[ 1, readbytes ] ;! Prepend remaining previous data segment to current data segment
GOSUB Convert2Attribute
linecount = COUNT( lines, a ) ;! Get count of lines
groupptr += 1
groups( groupptr ) = lines
exit_flag = 1 ;! Set exit flag to exit loop
END CASE
REPEAT
junk = %close( file_handle ) ;! Close the O/S file descriptor
END

results = groups ;! Convert to dynamic array

FixLastLine:
last_attr = DCOUNT( results, a ) ;! Get attribute of last line
last = TRIM( results< last_attr > ) ;! Remove any leading/trailing/redundant spaces
DEL results< last_attr > ;! Remove last line
FOR ctr = 1 TO LEN( last ) ;! Loop through length of last line
results< -1 > = last[ ctr, 1 ] ;! Add new line for each character in last line
NEXT ctr

etime = SYSTEM( 12 )

Display2Screen:
*ln = DCOUNT( results, a )
*FOR x = 1 TO ln
*CRT x, results< x >[ 1, 100 ]
*NEXT x

WriteResults:
CRT 'Unique Lines: ' : DCOUNT( results, a ), 'Elapsed Time: ' : (etime - stime)/1000
OPEN output_folder TO TempFolder ELSE STOP
WRITE results ON TempFolder, outfilename

STOP

Convert2Attribute:
! These conversions were created based on a test json file I was working with.
! It is possible that there may be other strings that need to be converted to get each
! key-value pair, object delimiter and array delimiter on their own line

lines = SWAP( lines, \","\, \",\ : a : \"\ )
lines = SWAP( lines, \{"\, \{\ : a : \"\ )
lines = SWAP( lines, \":[{\, \":\ : a : \[\ : a : \{\ )
lines = SWAP( lines, \":{\, \":\ : a : \{\ )
lines = SWAP( lines, \"}],"\, \"\ : a : \}\ : a : \],\ : a : \"\ )
lines = SWAP( lines, \"}},{\, \"\ : a : \}\ : a : \},\ : a : \{\ )
lines = SWAP( lines, \"}\, \"\ : a : \}\ )

RETURN
ROCKETEER Brian Cram
I like Lance's example. The salient thing to take away is the concept of %OPEN and %READ, which is where he's getting his speed: processing the string a buffer at a time instead of setting up one large string variable in the program to parse. We all already know the benefit of dimensioned arrays over dynamic arrays.

One thing to remind everybody: you can re-dimension an array that's already in use. Say you did DIM VAR(1000) and you come to the point where you're going to populate the 1001st element. Simply DIM VAR(2000) and continue. Variables work in DIM statements as well: ARR.MAX += 100; DIM VAR(ARR.MAX).
PARTNER Robert Hattori
Just wanted to second Brian's post about re-dimensioning. I used to break input by the original dimension size and then stitch things together (or process that chunk of input), but have since converted some processes to re-dimension so I could work with the entire input all at once in a single array when processing.

One thing to note (but it wouldn't really apply if you dimension to the exact size of your input): when you re-dim with data already in your array, you obviously can't do MAT myArray = "" to initialize, as it would wipe out the contents. If you need to initialize, I usually just do a for/next loop from prevArraySize+1 to newArraySize and set myArray(i) = "".
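
For example (illustrative sketch; input.rec stands in for whatever dynamic array is being loaded):

* Grow a dimensioned array on the fly and initialize only the new slots.
arr.max = 1000
dim var(arr.max)
mat var = ''
used = 0
lines = dcount(input.rec, @am)
for n = 1 to lines
   used += 1
   if used > arr.max then
      prev.max = arr.max
      arr.max += 1000
      dim var(arr.max) ;* elements 1 through prev.max keep their contents
      for i = prev.max + 1 to arr.max
         var(i) = '' ;* only the new slots need initializing
      next i
   end
   var(used) = input.rec<n>
next n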

--Robert
Ted Hurlbut
Thank you all for your input on our JSON issue. Great feedback and suggestions. I remember WAY back when we were not able to dimension on the fly. That was a great improvement. Many things have improved over the past 40-plus years.