Questions on using large XML files with Net Express

February 15, 2013
Dominique Sacre

Problem:

Using the Net Express XML I-O syntax to read an XML document that has the form:

<containing-doc>
<childdoc1></childdoc1>
<childdoc2></childdoc2>
...
</containing-doc>,

where there can be 100,000 or more instances of the <childdocn> element.

When this file is opened for input, and a READ filename is executed, the entire document will be read into memory in the form of an XML DOM tree.

If the document being read is very large, say 60 MB, this READ operation could take up to an hour or so, and it may use up all the available virtual memory on your computer.

This would result in a Windows error message about the system's Virtual Memory being too low and the program would then crash.

Resolution:

There are a few ways that you can work around this problem.

1. Increase the amount of Virtual Memory that is available to your system. This can be done under Control Panel>System>Advanced>Performance>Settings. This is probably the least viable solution as it would still take a very long time to open and read the contents of the entire file at once.

2. If you have control over the XML document format, remove the <containing-doc> start and end tags and leave only the <childdoc1></childdoc1><childdoc2></childdoc2>... elements in the document.

In this format, a READ filename operation reads only one <childdocn> element at a time, so you can handle the input data sequentially. You can repeat the READ filename operation until you reach end of file, which is indicated by an XML-FILE-STATUS value of -7.

If you need to have all child elements in memory at once, so that you can update them, or do a READ NEXT KEY IS to read through all elements using an alternate key, then you cannot use this method. You would have to use the <containing-doc> element and read all the <childdocn> elements at once.

3. You can use the CBL_OPEN_FILE and CBL_READ_FILE byte stream routines to read the data directly from the XML document in manageable blocks of 100KB or more. You can then use INSPECT and STRING statements to parse the data read, find the first <childdocn> start tag and the last </childdocn> end tag in the data buffer, and extract everything between them (inclusive) into another buffer, such as XML-BUFFER PIC X(100000).

In the SELECT statement for your XML-enabled file you can specify ASSIGN TO ADDRESS OF XML-BUFFER. Then, when you open the file, you can do a READ NEXT through each of the <childdocn> elements in XML-BUFFER until you reach end of file. When you have finished processing this buffer of data, close the XML-enabled file and reposition the starting offset for the next CBL_READ_FILE to point to the byte immediately following the last </childdocn> end tag that you extracted into XML-BUFFER. You can then perform the next CBL_READ_FILE operation and repeat this process until the end of the XML document has been reached.

This method should be used when you do not have any control of the format of the XML input document.

If you need to have all child elements in memory at once, so that you can update them, or do a READ NEXT KEY IS to read through all elements using an alternate key, then you cannot use this method. You would have to use the <containing-doc> element, and read all the <childdocn> elements at once.
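The buffer handling in method 3 can be sketched as follows. This is only an illustration of the logic, written in Python rather than COBOL so that it is easy to try out; it is not Net Express code. It assumes every child element has the same, non-nested tag name (the hypothetical childdoc), and the function name iter_child_blocks and its parameters are made up for this example. Where the COBOL program would reposition the file offset for the next CBL_READ_FILE, the sketch simply carries the unprocessed bytes over to the next block, which has the same effect.

```python
import io


def iter_child_blocks(stream, tag, block_size=100_000):
    """Yield byte strings that each contain only complete <tag> elements.

    Assumption: <tag> elements are not nested inside each other.
    """
    start_tag = b"<" + tag            # matches "<childdoc>" or "<childdoc attr=...>"
    end_tag = b"</" + tag + b">"
    pending = b""                     # bytes after the last complete element so far
    while True:
        block = stream.read(block_size)
        if not block:
            break                     # end of document; leftover is the container end tag
        data = pending + block
        first = data.find(start_tag)  # first child start tag in the buffer
        last = data.rfind(end_tag)    # last child end tag in the buffer
        if first == -1 or last < first:
            pending = data            # no complete element yet, read more data
            continue
        cut = last + len(end_tag)
        yield data[first:cut]         # this is what would go into XML-BUFFER
        pending = data[cut:]          # resume right after the last end tag


# Usage: process a document in 64-byte blocks.
doc = (b"<containing-doc>"
       + b"".join(b"<childdoc><id>%d</id></childdoc>" % i for i in range(50))
       + b"</containing-doc>")
blocks = list(iter_child_blocks(io.BytesIO(doc), b"childdoc", block_size=64))
```

Each yielded block starts at a child start tag and ends at a child end tag, so it can be parsed on its own without the containing element.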

4. Use an XML document splitter program that would allow you to create many smaller XML documents from the single large one. The splitter would have to save the header information of the original XML document so that it could replicate this information in each of the smaller files.

Your program could then read each small file in succession, and process its data individually from the other files. This would limit the amount of memory that is required at one time because the program would only need to create a small XML DOM tree.
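A splitter of the kind described in method 4 could look roughly like the sketch below. It is a Python illustration using the standard library's streaming parser, not a Net Express utility, and the names split_xml, children_per_file and write_part are hypothetical. It repeats the containing element and its attributes in every output document; an XML declaration, DOCTYPE or namespace prefixes would not be carried over verbatim and would need extra handling.

```python
import io
import xml.etree.ElementTree as ET


def split_xml(source, children_per_file, write_part):
    """Stream `source` and pass each small document (bytes) to `write_part`.

    Returns the number of small documents produced.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)           # start event of the containing element
    depth = 0                         # nesting depth below the containing element
    batch = []                        # completed child elements not yet written
    part_no = 0

    def flush():
        nonlocal part_no
        # Replicate the containing element (tag and attributes) around the batch.
        wrapper = ET.Element(root.tag, dict(root.attrib))
        for child in batch:
            child.tail = None         # drop whitespace between child elements
        wrapper.extend(batch)
        part_no += 1
        write_part(part_no, ET.tostring(wrapper))
        batch.clear()

    for event, elem in context:
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth == 0:                # a direct child of the container just ended
            root.remove(elem)         # keep the in-memory tree small
            batch.append(elem)
            if len(batch) == children_per_file:
                flush()
    if batch:
        flush()                       # write the final, possibly shorter, document
    return part_no


# Usage: split five child elements into documents of at most two each.
src = io.BytesIO(b'<containing-doc version="1">'
                 + b"".join(b"<childdoc><id>%d</id></childdoc>" % i for i in range(5))
                 + b"</containing-doc>")
parts = {}
count = split_xml(src, 2, lambda no, data: parts.__setitem__(no, data))
```

In a real splitter, write_part would write each document to its own file; memory use is bounded by the batch size rather than by the size of the original document.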

Old KB# 7057
