Rocket U2 | UniVerse & UniData

  • 1.  5g file better to be hashed or dynamic

    Posted 20 days ago

    I have a 5 gig file that has 7 million records. The record IDs are like this: 7060834*11, 5425655*84, 1931679*19. Should this be a hashed file or a dynamic file? Is there a set of rules that should be used to help with the choice?

    Gary Rhodes
    Universe Developer
    NPW Companies
    Hialeah, FL, USA

  • 2.  RE: 5g file better to be hashed or dynamic

    Posted 19 days ago

    Aren't dynamic files also hashed?

    I think a good general rule when deciding is to consider how much change is expected in the size of the file.

    Marcus Rhodes

  • 3.  RE: 5g file better to be hashed or dynamic

    Posted 19 days ago

    Static and dynamic files use the same code and file structure, except that dynamic files move the overflow area from the end of the file to a separate file, the OVER.30. A dynamic file has an additional record header flag marking the "last record in the primary group," which tells the code to switch to the OVER.30 file to continue processing the group. There is also a record header flag marking the "first record in the overflow group."

    Selecting the wrong group size can be problematic for dynamic files as well as static hashed files. UniVerse stores oversize records (those that, along with the record header, do not fit within the group) in a series of group blocks in the overflow area. For dynamic files, the LARGE.RECORD value, by default 70 percent of the group size, reduces the effective size of the group when determining whether a record will fit in a group.
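    As a rough illustration of that test (not UniVerse's actual internals), here is a Python sketch. The 2K group size and 70 percent figure come from this thread; the 12-byte record header overhead is a made-up assumption for the example:

```python
# Sketch of the dynamic-file large-record test described above.
# GROUP_SIZE and LARGE_RECORD_PCT follow the figures in this post;
# HEADER_BYTES is an illustrative assumption, not a UniVerse internal.

GROUP_SIZE = 2048          # 2K default separation for dynamic files
LARGE_RECORD_PCT = 70      # LARGE.RECORD value quoted above
HEADER_BYTES = 12          # hypothetical per-record header overhead

def is_oversize(record_bytes: int,
                group_size: int = GROUP_SIZE,
                large_record_pct: int = LARGE_RECORD_PCT) -> bool:
    """Return True if the record plus its header exceeds the
    effective group size and would be stored in overflow blocks."""
    effective = group_size * large_record_pct // 100
    return record_bytes + HEADER_BYTES > effective

# A 2048-byte group at 70% gives an effective size of 1433 bytes,
# so a 1500-byte record goes to overflow while a 500-byte one fits.
```

    Raising the separation (and with it the effective group size) is what moves records out of the costly oversize path described below.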

    Referring back to "The Hitchhiker's Guide to the UniVerse", we see that oversize records can take 3 to 5 times longer to process than records that fit within an appropriately sized (larger) group. You must balance that against the time spent reading a larger group for a small record.

    Part of the reason oversize record processing can take 3 times longer is that the blocks are written 3 times: the whole oversize record is tossed onto the free chain, then the blocks needed for the new record are allocated, and then the record is written into those blocks. (Note that there is a small window where the original record exists only in the free chain. If the updated record is larger and needs to allocate a new group block, and that allocation fails in a disk-full situation, the record can vanish from the file! This was more likely in the old days of expensive, smaller disks, but a bug report was closed and ignored.)

    Writing a number of overflow blocks can also aggravate the AIX jfs2 I/O subsystem behavior tuned by j2_nPagesPerWriteBehindCluster. With the default value of 32 4K blocks per cluster, the file is treated as a series of 128K clusters. The behavior assumes sequential activity on the file: to assist the sync daemon by flushing dirty disk blocks that are unlikely to be used by the application again, any write to a cluster queues the dirty blocks in the next-lower cluster for flushing. You cannot access those blocks until they have been flushed and are clean again. That can add significant elapsed time from access conflicts, as well as increase CPU time for applications actively waiting for a block to become available.

    One result of oversize writes using the free list is that a lock is held on group zero during the write. That means only one oversize write per file can happen at a time!

    Another factor in oversize record handling is that the allocation sequence for oversize record blocks runs from high to low file addresses. This defeats any operating system read-ahead processing.

    The default separation for dynamic files is 2K. The resulting LARGE.RECORD size of 1619 bytes in a 2K group of a 64BIT file is often too small for good performance. Starting with UniVerse 11.3, the separation can be set at file creation time; prior to 11.3, RESIZE could alter dynamic file separation.

    The GENERAL and SEQ.NUM hashing algorithms are similar to the type 18 and type 2 hashing algorithms, with adjustments to deal with the hashing working on a modulo of the next power of 2 above the current modulus. As you know, an even modulus is suboptimal for hashing, and a power of 2 is worse: record keys can set up a lot of harmonics with the hashing algorithm when the modulo is a power of 2. In the early days, when the hashing algorithms really WERE 2 and 18, I found a poorly performing file where only one in six groups contained records!
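    The harmonic effect is easy to demonstrate with a toy example. This Python sketch uses the integer key itself as the "hash," which is far cruder than GENERAL or SEQ.NUM, but the effect is the same in kind: keys sharing a common factor with a power-of-2 modulus land in only a fraction of the groups, while a nearby prime modulus spreads them evenly.

```python
# Toy demonstration of power-of-2 modulus harmonics. The keys share
# a common stride of 1000, which is not unusual for generated IDs.
keys = [1000 * i for i in range(1, 101)]

power_of_two = {k % 16 for k in keys}   # modulus 16 = 2**4
prime        = {k % 17 for k in keys}   # nearby prime modulus

print(len(power_of_two), "of 16 groups used")   # only 2 groups hit
print(len(prime), "of 17 groups used")          # all 17 groups hit
```

    With modulus 16, every key lands in group 0 or 8 (one in eight groups used); with modulus 17, all groups receive records.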

    A way to test is to create a test dynamic file (use MINIMUM.MODULUS to minimize splitting overhead during testing). You could copy the records to the new file, but the file may contain sensitive data. Instead, make an interim file with the record ID and LEN(RECORD) as the record data. You can then use this to write records of spaces of those lengths into your test dynamic file and observe the result with ANALYZE.FILE test STATS.
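    In practice you would write this in UniVerse BASIC against real files; as a language-neutral sketch of the data involved (the record contents and IDs here are made up for illustration):

```python
# Sketch of the test methodology above, in Python for illustration.
# The real version would read the live file and write the interim
# and test dynamic files in UniVerse BASIC.

def capture_lengths(records: dict) -> dict:
    """Interim file: record ID -> LEN(RECORD); no sensitive data kept."""
    return {rec_id: len(data) for rec_id, data in records.items()}

def dummy_records(lengths: dict) -> dict:
    """Records of spaces with the captured lengths, safe to load
    into the test dynamic file for ANALYZE.FILE ... STATS."""
    return {rec_id: " " * n for rec_id, n in lengths.items()}

live = {"7060834*11": "some confidential data",
        "5425655*84": "more confidential data here"}
lengths = capture_lengths(live)
test_file = dummy_records(lengths)
# Each dummy record has the same ID and length as the original but
# contains only spaces, so the hashing and group-fill behavior match.
```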

    Mark A Baldridge
    Principal Consultant
    Thought Mirror
    Nacogdoches, Texas United States

  • 4.  RE: 5g file better to be hashed or dynamic

    Posted 19 days ago

    As Marcus has pointed out, the degree of change is one (important) factor to consider. Another is: do you have system maintenance periods available in which you can resize the file without impacting users?

    I would also consider whether you could make this a distributed file. This is a logical file made up of a number of smaller physical files. It is particularly useful where you run some or most queries against the individual part files and only query the full distributed file when you need all the data. My guess is that you would need to re-key all the items to move to a distributed file structure, which may not be the easiest thing to do in a production environment.
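    To sketch the idea (the routing scheme here is hypothetical, not a recommendation): a distributed file needs a partitioning algorithm that maps each record ID to a part file, for example routing on the numeric suffix after the "*" in IDs like 7060834*11.

```python
# Hypothetical partitioning sketch for a distributed file. The
# number of part files and the suffix-based routing rule are made-up
# assumptions for illustration only.

N_PARTS = 4  # assumed number of part files

def part_for(record_id: str, n_parts: int = N_PARTS) -> int:
    """Return the 1-based part-file number for a record ID of the
    form 'nnnnnnn*mm', routing on the suffix after the '*'."""
    suffix = int(record_id.split("*")[1])
    return suffix % n_parts + 1

# IDs from the original question spread across the part files:
# part_for("7060834*11") -> 4, part_for("5425655*84") -> 1
```

    The catch, as noted above, is that every existing record would have to carry a key the partitioning algorithm can route on, hence the re-keying effort.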



    Brian Speirs
    Senior Analyst - Information Systems
    Rush Flat Ltd
    Wellington NZ