Gidday all, 

If the first and only release of Git for z/OS you installed on your system is 2.26 or higher you can ignore this post and exit now.  But if you installed releases prior to this, you COULD have problems when moving to 2.26. 

Prior to 2.26 the z/OS version of Git encoded its database in ISO8859-1 (an 8-bit superset of ASCII), but from 2.26 onwards the database is encoded in UTF-8. This means repos created with the older releases have to be re-encoded. Rocket has updated Git to do the re-encoding automatically, but depending on what characters are used in your repo, the automatic re-encoding can produce false results, as seen here.

Below is the output of a "git diff" between the master branch and a selected commit, e45b11. This was using Git for z/OS release 2.14.


Below is the output of the same command run after the master branch has been re-encoded as recommended by Rocket's Git for z/OS 2.26 migration guide. (Note that the output has been truncated; it's actually 805 lines.)


The output should be the same, but because commit e45b11 has not been re-encoded the "git diff" shows the difference between the newly encoded master branch and the ISO8859-1 encoded commit e45b11 too.

I believe the only way to guarantee consistency with your repo is to re-encode every commit  !!!  But.........

When converting from IBM-1047 (EBCDIC) to ISO8859-1 (ASCII) and from IBM-1047 to UTF-8 there are 128 characters that convert to exactly the same characters. e.g. An EBCDIC "A" (X'C1') converts to ASCII "A" (X'41') and UTF-8 "A" (X'41'). If your repo consists only of these 128 characters then re-encoding will produce the same result. In other words, you don't have to re-encode your repo.

If, however, your repo uses one or more of the other 128 characters, the mapping of IBM-1047 to ISO8859-1 and IBM-1047 to UTF-8 is very different. e.g. An EBCDIC X'04' converts to ASCII X'9C' but in UTF-8 it converts to X'C29C' (yes - 2 bytes) and this is why your repo needs to be re-encoded.
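To see the one-byte versus two-byte difference for yourself, here is a minimal sketch using "iconv" and "od". It only shows the ISO8859-1 to UTF-8 step (not the IBM-1047 side), and it assumes your iconv accepts the code set names ISO8859-1 and UTF-8. The helper name to_hex is my own invention.

```shell
# Hex-dump whatever arrives on stdin as one lowercase string.
to_hex() {
  od -An -tx1 | tr -d ' \n'
}

# The character X'9C' as a single ISO8859-1 byte (octal 234).
latin1_hex=$(printf '\234' | to_hex)

# The same character after conversion to UTF-8.
utf8_hex=$(printf '\234' | iconv -f ISO8859-1 -t UTF-8 | to_hex)

# "A" sits in the shared 7-bit range, so conversion leaves it alone.
a_hex=$(printf 'A' | iconv -f ISO8859-1 -t UTF-8 | to_hex)

echo "ISO8859-1: $latin1_hex  UTF-8: $utf8_hex  A: $a_hex"
```

The first two values should differ (one byte versus two), while the "A" stays a single X'41' byte either way.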

The best way to see if your repo is affected is to search for the characters that map differently in UTF-8. I have attached a file consisting of 128 "srchfor" statements that can be used in the "Statements Dsn" parameter in ISPF 3.15 to search your repo for these characters. If any of these characters are found then I recommend a total re-encoding of your repository.

So how do you re-encode the entire repo? Based on my research and experimentation, I'm recommending the use of the "git fast-export"/"git fast-import" commands.

The fast-export unloads the entire repo into a single file.
The fast-import reads the file recreating the entire repo encoded in UTF-8.

As long as the single file doesn't change, the commit IDs will be the same as the IDs in the original repo.

The steps required are:

1) CD to the repo that needs re-encoding ("cd /u/user/currdir")
2) Do the git fast-export ("git fast-export --all >/tmp/fastexport-output")
3) Create a new directory ("mkdir newdir")
4) CD to the directory ("cd newdir")
5) Initialize a git repo ("git init")
6) Perform fast-import ("git fast-import </tmp/fastexport-output")
7) Create the working directory ("git checkout master")
8) Reconnect any remote connections ("git remote add origin <REMOTE_URL>")
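The steps above can be sketched as one shell function. This is only an illustration: the function name reencode_repo and its arguments are hypothetical, the branch defaults to master as in the steps, and the remote is left for you to add by hand afterwards.

```shell
# reencode_repo SRC DEST DUMPFILE [BRANCH]
# Hypothetical helper: unload SRC with fast-export, then rebuild it in DEST.
reencode_repo() {
  src=$1
  dest=$2
  dump=$3
  branch=${4:-master}

  # Unload every ref of the old repo into a single file.
  ( cd "$src" && git fast-export --all >"$dump" ) || return 1

  # Create the new repo and load the unmodified export file into it.
  mkdir -p "$dest" || return 1
  ( cd "$dest" &&
    git init &&
    git fast-import <"$dump" &&
    git checkout "$branch" ) || return 1
}
```

Usage would be something like: reencode_repo /u/user/currdir /u/user/newdir /tmp/fastexport-output, followed by "git remote add origin <REMOTE_URL>" inside the new directory.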

I did this on my largest repo processed by Git for z/OS, which has been going since May 2017, and the unload file was 750MB, so make sure you have plenty of room available for the temporary file that's created by fast-export.

I performed a "git log" on both repos after the import and compared the result. Got a 100% match.
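One way to script that comparison is sketched below. It compares the sorted commit IDs from "git log --all" rather than the full log text, which is a simplification of what I did by eye. The function names list_commits and same_history are made up for this example.

```shell
# Print every commit ID reachable from any ref, sorted and de-duplicated.
list_commits() {
  git -C "$1" log --all --format=%H | sort -u
}

# Succeed only if both repos list a non-empty, identical set of commit IDs.
same_history() {
  a=$(list_commits "$1")
  b=$(list_commits "$2")
  [ -n "$a" ] && [ "$a" = "$b" ]
}
```

For example: same_history /u/user/currdir /u/user/newdir && echo "100% match"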

I hope this is a help to anyone who has yet to do the conversion.





------------------------------
Gary Freestone
Systems Programmer
Kyndryl Inc
Mt Helen Australia
------------------------------
This is essentially what we did. However, I am surprised that you didn't have issues with the UTF-8 2-byte characters and the length of your blobs and commit texts in the export file. The length will be longer when going from ISO8859-1 to UTF-8 for characters over x'7F'.

To change the lengths, we wrote a program that parses the export file, calculates the new lengths and imports it again. 

With this method above you could save some disk room by piping stdout of fast-export to fast-import.
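A minimal sketch of the piped variant, assuming the export stream needs no editing on the way through (so it does not cover the length recalculation described above). The function name pipe_reencode is hypothetical.

```shell
# pipe_reencode SRC DEST
# Stream fast-export straight into fast-import, with no temporary file.
pipe_reencode() {
  src=$1
  dest=$2

  mkdir -p "$dest" || return 1
  ( cd "$dest" && git init ) || return 1

  # Each side runs in its own subshell so the two "cd"s don't interfere.
  ( cd "$src" && git fast-export --all ) | ( cd "$dest" && git fast-import )
}
```

The trade-off is that there is no file left over to inspect or re-run if the import fails part-way.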

Lastly, this will change the hashes of any files that contain characters over x'7F' and therefore all commit hashes for all the following commits.

------------------------------
Adam Martin Britt
IT-Architect
BEC
Randers SV Denmark
------------------------------
Hi Adam, 

All the files in my working directory are encoded in IBM-1047, and when Git 2.14 moved them into its cache they were encoded in ISO8859-1. When I ran the fast-export it read the ISO8859-1 encoded files and created an export file encoded in IBM-1047 (although it was not tagged that way - it wasn't tagged at all). There were no extra UTF-8 bytes yet because so far I haven't done anything with UTF-8.

The exported file was input to the fast-import without change. After the import, internally the files that needed to be encoded with 2-byte characters should be bigger. The commit hashes of the newly imported repo are exactly the same, as I didn't change fast-export's output file.

I certainly understand why the commit hashes are different if you did edit the fast-export output file.  You can't change the input to Git and expect the same hashes, as this would go against Git's integrity. 

I'm interested in knowing what GitHub did when you pushed the repo with totally different commit hashes?

------------------------------
Gary Freestone
Systems Programmer
Kyndryl Inc
Mt Helen Australia
------------------------------
Hi Gary,

That makes sense with the code pages and makes for an easy fix. It didn't seem to work that way for us, but it could be our own limitations. Our situation is a little worse because our default codepage in USS is IBM-1047, but everything else on the mainframe is IBM-277. SSH sessions, OMVS sessions and the hex values of a file all display differently, with the last one being the only thing I can trust (as long as the input to the hex viewer wasn't iconv'd before). USS is converting stuff behind the scenes, and it makes it difficult to understand what is going on. It was configured this way a couple of decades ago, so there is no easy way to change it without changing thousands of configuration files in USS.

The commit hashes will necessarily change if the internal storage of blobs (source code) has changed, which it does here if you have ISO8859-1 characters over x'7F'. If all of the hashes are the same, then the export/import did nothing. You can verify the internal encoding in git by unzipping the blob object and looking at its hex values for a file with ISO8859-1 characters over x'7F'. Files with ISO8859-1 characters under x'7F' are indistinguishable from files with UTF-8 characters under x'7F'.
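As a rough sketch of that check, "git cat-file blob" can stand in for unzipping the object by hand, since it writes out the stored blob content. This assumes nothing converts the piped output on its way to "od" (worth double-checking on z/OS with file tagging and auto-conversion in play). The function name blob_hex is hypothetical.

```shell
# blob_hex REPO REV PATH
# Print the stored bytes of PATH at REV as one lowercase hex string.
blob_hex() {
  git -C "$1" cat-file blob "$2:$3" | od -An -tx1 | tr -d ' \n'
}
```

For a file containing a character over x'7F', a single byte such as 9c in the output suggests ISO8859-1 storage, while a pair such as c29c suggests UTF-8.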

We don't use GitHub, but the problem you mention still stands. In our case, we did a force push and everyone who uses the repo needed to re-clone. In our situation, this was manageable without informing most of our users because of the tooling we've built around it and the stage of our migration to git, but I realize that this is unique to us.

------------------------------
Adam Martin Britt
IT-Architect
BEC
Randers SV Denmark
------------------------------
Thanks for your update.

------------------------------
Gary Freestone
Systems Programmer
Kyndryl Inc
Mt Helen Australia
------------------------------