I am extracting incremental data from production files to be tarred up and ftped to another physical site. I have created small type 18 "work" files for each one of these and want them to be as small as possible to conserve on disk space and ftp time. After they are transported to the other physical site, they would be used to update the master file(s) via a COPY command, which I would not think would be greatly impacted by a tightly compacted file.
I have been using HASH-AID and ANALYZE-FILE to see what each chosen group distribution looks like, but am not sure what I am shooting for. When we analyze files for production usage, I strive to keep the group distribution skewed to the left (25%, 50%) buckets. Since this need is purely size related, should I be looking for the modulo that skews heavily towards the right (200%) without hitting the "full" bucket?
I shall appreciate any feedback.
Nelson
------------------------------
Nelson Schroth
president
C3 Completeshop LLC
------------------------------
Usually, at the start we do this:
Separation = ((int(estimated record size + key size) * 4) / 512) + 1, rounded up to 4, 8, 16, 32, ...
Modulo = estimated number of records / 4
Type = 18
After loading the file, we check the type/modulo/separation with HASH.HELP and resize to the advised values (taking care to use a prime number for the modulo). You can automate this.
To transport a U2 static file, compress it (zip); that way you avoid shipping the empty space.
I hope this helps.
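As a rough sketch of that starting arithmetic (Python used only as a compact stand-in for a BASIC program or paragraph; the function names, the next_prime helper, and applying the prime rounding to the initial modulo are assumptions of this sketch, not part of any U2 tool):

def next_prime(n):
    # Smallest prime >= n; in practice the TCL PRIME command does this step.
    def is_prime(x):
        if x < 2:
            return False
        d = 2
        while d * d <= x:
            if x % d == 0:
                return False
            d += 1
        return True
    while not is_prime(n):
        n += 1
    return n

def initial_sizing(avg_record_bytes, avg_key_bytes, est_records):
    # Separation = ((record size + key size) * 4) / 512 + 1, rounded up to 4, 8, 16, 32, ...
    raw_sep = ((avg_record_bytes + avg_key_bytes) * 4) // 512 + 1
    sep = 4
    while sep < raw_sep:
        sep *= 2
    # Modulo = estimated record count / 4, nudged up to a prime.
    modulo = next_prime(max(1, est_records // 4))
    return sep, modulo

# Example: 600-byte records with 20-byte keys, about 10,000 records expected.
print(initial_sizing(600, 20, 10000))   # -> (8, 2503)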
------------------------------
Manu Fernandes
------------------------------
Two pence.
------------------------------
Manu Fernandes
------------------------------
Did you figure it out? About the time you posted, I was delving into some testing of my own, since I'm a recovering dynamic file junkie.
For my tests, I used a hashed source file, HFX, and four hashed test files named HF2, HF4, HF8, and HF16; all files were type 2 and 64-bit, and the numbers in the test file names are their separations. The source file modulo was based on the estimated number of records into the future, multiplied by the average record size in bytes, divided by the group buffer size that matches my disk block size of 4096 bytes (separation = 8). After that, I used the TCL PRIME command to get the next highest prime number. The test files started from the same estimated byte size as the source file, using the same formula with each file's own separation, so the physical sizes were very close in bytes but the modulos and separations differed. I was mostly interested in the optimal separation in my case. Initially, all the files had plenty of free space, about 40%, so they didn't start out overflowed.
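In rough terms, the sizing worked out like this (a Python sketch; the numbers below are made up for illustration and the PRIME step is approximated by a small helper):

def next_prime(n):
    def is_prime(x):
        if x < 2:
            return False
        d = 2
        while d * d <= x:
            if x % d == 0:
                return False
            d += 1
        return True
    while not is_prime(n):
        n += 1
    return n

est_records = 1000000          # estimated record count into the future (made up)
avg_record_bytes = 300         # average record size in bytes (made up)
est_bytes = est_records * avg_record_bytes

# Source file: group buffer matches the 4096-byte disk block, i.e. separation 8.
source_modulo = next_prime(est_bytes // (8 * 512))

# Test files: same estimated bytes, each with its own separation.
test_modulos = {sep: next_prime(est_bytes // (sep * 512)) for sep in (2, 4, 8, 16)}

print(source_modulo, test_modulos)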
There were two test categories that ran several Universe and OS file operation tests, including SELECT, SELECT WITH, COPY, ls, du, compress, and uncompress. The tests ran via a custom VOC paragraph that ran other paragraphs three times in a row in order to ascertain the average run times. Test 2 used a 25% smaller estimated byte size. Measurements were done using the TCL DATE command before and after each file command; the downside of this method is that it's limited to whole seconds. If the test record count was too small, all the tests appeared to run at the same speed, so my record counts and physical file sizes were rather large compared to what I'm used to working with. The large files and data set took a long time to run, so a quiet system was essential, and hard to find, at the same time.
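To show why second-granularity timing needs big files, here is the same run-three-times-and-average idea as a Python sketch, with a placeholder run_test() standing in for one file operation (this is not the actual VOC paragraph):

import time

def run_test():
    pass  # placeholder for one SELECT / COPY / ls / du / compress step

def average_seconds(runs=3):
    total = 0
    for _ in range(runs):
        start = int(time.time())   # whole seconds, like the TCL DATE command
        run_test()
        total += int(time.time()) - start
    return total / runs

# With a small data set every run reports 0 or 1 seconds and the averages all
# look alike; the files have to be large enough to spread the timings out.
print(average_seconds())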
Many of the Universe file command tests ran best with a separation that was double the OS disk block size, i.e. 8192 bytes (separation = 16) in this case. The most costly operation, with the biggest differences between separations, was the COPY command. I measured the cost difference between the first and second place finishers of each test as a percentage gain, and the COPY gain between first and second was greater than all the others combined. Coincidentally, the separation that performed best with COPY was 8 (4096 bytes), which matches the disk block size.
I'm thinking that unused space in a compressed hashed file is not really an issue to pick at too much.
Also, I found that running TCL "phantom HASH.AID filename 2,18,1 * *" gives a &PH& record that includes ANALYZE FILE output for types 2-18 and populates info in HASH.AID.FILE records that you can LIST or LPTR.
Happy Trails!
------------------------------
Mike Bojaczko
PROGRAMMER ANALYST
United States
------------------------------