Open-source Languages & Tools for z/OS

  • 1.  Problem with "git show" not returning the correct output

    PARTNER
    Posted 04-07-2022 03:17
    A command like this "git show a325acf:TEXT/TEST.txt >/tmp/test.txt" should extract the file TEXT/TEST.txt from commit a325acf and save it to /tmp/test.txt. My TEXT/TEST.txt file happens to contain all hex characters from X'00' to X'FF' (excluding X'15'). If I perform a SuperC comparison between my file and the extracted file (/tmp/test.txt) from the git show command, I'd expect the files to be the same. But that is not what I am getting (see extract below).
    Some of the characters, on the white lines, are now two bytes and not one.

    Now I understand that converting IBM-1047 to UTF-8 will translate some characters from a single byte to a two byte representation,  but I would expect that when translating from UTF-8 encoding to IBM-1047 the two bytes would be translated back to one. 

    This is certainly what happens when I perform a "git checkout" or "git restore", but not with "git show".
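
    For readers who want to see the expected round trip in isolation, here is a minimal Python sketch. It is an illustration only, not what git does internally. Python ships no IBM-1047 codec, so cp037 (a closely related EBCDIC code page that differs from IBM-1047 in a handful of code points, such as the square brackets) is used as a stand-in; that substitution is an assumption.

```python
# Assumption: cp037 stands in for IBM-1047, which Python does not ship.
EBCDIC = "cp037"

# Every byte value from X'00' to X'FF' except X'15', as in the test file.
original = bytes(b for b in range(256) if b != 0x15)

# EBCDIC -> UTF-8: some single bytes become two-byte sequences.
utf8 = original.decode(EBCDIC).encode("utf-8")

# UTF-8 -> EBCDIC: the two-byte sequences collapse back to one byte.
restored = utf8.decode("utf-8").encode(EBCDIC)

print(len(original), len(utf8))  # the UTF-8 copy is longer
print(restored == original)      # a proper round trip is lossless
```

    A correct conversion in both directions returns the original bytes, which is the behaviour seen with "git checkout" and "git restore".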



    ------------------------------
    Gary Freestone
    Systems Programmer
    Kyndryl Inc
    Mt Helen VIC AU
    ------------------------------


  • 2.  RE: Problem with "git show" not returning the correct output

    ROCKETEER
    Posted 04-08-2022 12:12
    Edited by Harrison Kaiser 04-08-2022 12:15

    Hi Gary,

    I posted a lengthy reply here previously but it seems to have not made it through.

    Here it is in a nutshell:
    If you are using the zos-working-tree-encoding attribute, then the file is EBCDIC in the working tree but UTF-8 in the index. This enables you to use tools like Jira, Bitbucket, and GitHub, which all expect ASCII or a superset like ISO8859-1 or UTF-8, and it enables you to view those files on other non-EBCDIC platforms. git show is ignoring zos-working-tree-encoding. This isn't a bug, because zos-working-tree-encoding is very nearly an alias for working-tree-encoding, a community git feature with the same behavior. In community git, git show also ignores that encoding, and for a good reason: it is designed to show the so-called "blobs" in the index, which don't have an encoding tag associated with them but are, in your case, UTF-8. The working-tree-encoding feature kicks in only when checking out a file into the working tree.

    You have a few options if you'd like to have git show work as you expect. First, you can stop using the zos-working-tree-encoding feature altogether. Git won't tag files for you at all; they will remain untagged, and they will need to be untagged when you add or commit them. Otherwise, git will tell you that it can't commit tagged files without having the zos-working-tree-encoding attribute set, and for good reason: it won't be able to restore those tags without it. That would make the index EBCDIC encoded, which would then have the knock-on effect that tools like Jira and Bitbucket won't be able to properly view those files, but git show would then work. I do not recommend this at all.

    Your second option is to move to a UTF-8 encoding for your files. This might not be possible for you, as I don't know your constraints, but it could vastly simplify code management across multiple platforms and languages. UTF-8 is becoming a nearly universal standard. For source code management, it would have the downside of there being a moment in time when you converted, and it would be important to ensure that you don't lose any information due to a bad conversion. A result of this would be that git show works as you expected: the blob in the index would match the working tree, and they would both be UTF-8.

    Lastly, you could use the working-tree-encoding, flawed as it is, as is.

    Let us know how you decide to use it.



    ------------------------------
    Harrison Kaiser
    Software Engineer I
    Rocket Internal - All Brands
    ------------------------------



  • 3.  RE: Problem with "git show" not returning the correct output

    PARTNER
    Posted 04-11-2022 11:34
    Harrison, 

    Thanks for the information. I understand what you are saying. I do disagree that "UTF-8 is becoming a nearly universal standard", at least on the mainframe: elsewhere yes, but not on z/OS. Here there must be billions if not trillions of lines of code all living in the world of EBCDIC with RECFM=FB and LRECL=80.

    Also, I don't believe git show is giving me UTF-8, as explained in my testing below:

    I have a file consisting of 7 characters of EBCDIC, C'ABC.123' or X'C1C2C304F1F2F3' (the . is an undisplayable character, namely X'04'). If I push this file to GitHub and bring it back down to my PC, the UTF-8 encoding of the file is X'414243C29C313233', or ASCII characters C'ABC..123'. The one byte of EBCDIC X'04' has been encoded in UTF-8 as X'C29C'.

    If the z/OS version of "git show" is supposed to show a UTF-8 encoded file, then I would expect these characters, X'414243C29C313233', but that is not what I am seeing. The "git show" is giving X'C1C2C36204F1F2F3', which is some EBCDIC bastardized version of UTF-8. We can see the first and last 3 characters have been converted to EBCDIC, but the X'04', which is encoded as X'C29C' in UTF-8 on my PC, is now X'6204'. To me this is not UTF-8 encoding, and it's not EBCDIC; it's a Frankenstein file that I can't do anything with.

    As an experiment I created a UTF-8 encoded file on z/OS consisting of X'414243C29C313233'  and performed an "iconv -f utf-8 -t IBM-1047" and I got the original 7 character file of X'C1C2C304F1F2F3' .   If I run "iconv -f utf-8 -t IBM-1047" on the franken file I end up with 3 characters being X'C24237'.

    I guess what I am saying is, if "git show" is going to supply a UTF-8 file, then it should be a real UTF-8 file. If it's going to supply an "EBCDICized" version of the objects, then it should do a full conversion and not leave it in an unusable state.




    ------------------------------
    Gary Freestone
    Systems Programmer
    Kyndryl Inc
    Mt Helen VIC AU
    ------------------------------



  • 4.  RE: Problem with "git show" not returning the correct output

    ROCKETEER
    Posted 04-11-2022 13:10
    Edited by Harrison Kaiser 04-11-2022 13:19
    Hi Gary,

    First off you are correct about UTF-8 only becoming a "universal standard" off the mainframe. I should have been more explicit about that. I definitely have strong personal opinions about what encoding source code should be in after 2+ years of dealing with issues exactly like this one, but at the same time, I recognize that moving from EBCDIC to UTF-8 for 800 million lines of COBOL code is not the easiest thing in the world.

    Here is what appears to be happening: git show's output is going through an autoconversion from ISO-8859-1 to IBM-1047, even though the output is UTF-8.

    Let's assume that X'414243C29C313233' is the actual binary stored in the index, which it appears to be, since that's what shows up on your PC. If we improperly assume that binary to be ISO-8859-1 and convert it to IBM-1047, then we get the same Frankenstein output:

    C2 ISO-8859-1 -> 62 EBCDIC
    9C ISO-8859-1 -> 04 EBCDIC
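
    The mapping above can be checked offline with a short Python sketch. This only models the suspected mis-conversion; it says nothing about how the z/OS port of git actually produces its output. Python has no IBM-1047 codec, so cp037 is used as a stand-in (an assumption; the two code pages agree on every byte in this example).

```python
# Assumption: cp037 stands in for IBM-1047, which Python does not ship.
EBCDIC = "cp037"

# The 7-byte EBCDIC file from the working tree: 'ABC', X'04', '123'.
working_tree = bytes.fromhex("C1C2C304F1F2F3")

# What the index would hold after a correct EBCDIC -> UTF-8 conversion.
blob = working_tree.decode(EBCDIC).encode("utf-8")

# The suspected fault: the UTF-8 bytes are treated as ISO-8859-1
# and converted byte for byte to EBCDIC.
franken = blob.decode("iso-8859-1").encode(EBCDIC)

print(blob.hex().upper())     # 414243C29C313233
print(franken.hex().upper())  # C1C2C36204F1F2F3
```

    The second line printed matches the bytes reported from "git show", which is consistent with the ISO-8859-1 assumption but does not prove where the conversion happens.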

    This might occur if autoconversion is on and is assuming that the output is ISO-8859-1. You may be able to force a UTF-8 output from git show by turning off autoconversion:

    `_BPXK_AUTOCVT=OFF git show...`

    Without that autoconversion, the content displayed on the USS terminal will be UTF-8 and thus not legible, since the tty expects EBCDIC. If you redirect that output to a UTF-8 file or through iconv -f UTF-8 -t IBM-1047, I would expect that you get back your EBCDIC file. The community edition of git has no conception of the possibility of autoconversion on its outputs. Since I didn't work on our port of git personally, I don't know what, if anything, we added to handle autoconv in this case. If you are able to control autoconv in the manner described above, then it's likely we didn't add anything. If you can't control it, then this is almost certainly a bug we introduced. In either case, I'll be making a ticket for us internally so we can resolve this issue.
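
    If the output really is UTF-8 that was passed through an ISO-8859-1 to IBM-1047 conversion, then in principle the damage is reversible after the fact. The Python sketch below shows the idea under that assumption; the helper name undo_autoconversion is hypothetical, and cp037 again stands in for IBM-1047 because Python has no IBM-1047 codec.

```python
# Assumption: cp037 stands in for IBM-1047, which Python does not ship.
EBCDIC = "cp037"


def undo_autoconversion(data: bytes) -> bytes:
    """Hypothetical helper: reverse a suspected ISO-8859-1 -> EBCDIC
    conversion, then convert the recovered UTF-8 to EBCDIC properly."""
    # Step 1: EBCDIC -> ISO-8859-1 restores the original UTF-8 bytes.
    utf8 = data.decode(EBCDIC).encode("iso-8859-1")
    # Step 2: interpret those bytes as UTF-8 and encode them as EBCDIC.
    return utf8.decode("utf-8").encode(EBCDIC)


franken = bytes.fromhex("C1C2C36204F1F2F3")  # bytes reported from "git show"
print(undo_autoconversion(franken).hex().upper())  # C1C2C304F1F2F3
```

    On z/OS the same two steps would presumably correspond to chaining iconv -f IBM-1047 -t ISO8859-1 into iconv -f UTF-8 -t IBM-1047, but that pipeline is untested here and depends on how autoconversion treats the intermediate data.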

    Thanks for writing back to clarify my understanding of git show.

    If you are unable to force the output of git to be UTF-8 then let us know, as that would certainly be a bug.

    ------------------------------
    Harrison Kaiser
    Software Engineer I
    Rocket Internal - All Brands
    ------------------------------



  • 5.  RE: Problem with "git show" not returning the correct output

    PARTNER
    Posted 04-12-2022 20:26
    Edited by Gary Freestone 04-12-2022 20:26
    Hi Harrison, 

    No such luck forcing the output to UTF-8  as seen in the attached output.   The output is the same with and without _BPXK_AUTOCVT=OFF  


    Regards,

    ------------------------------
    Gary Freestone
    Systems Programmer
    Kyndryl Inc
    Mt Helen VIC AU
    ------------------------------



  • 6.  RE: Problem with "git show" not returning the correct output

    ROCKETEER
    Posted 04-13-2022 15:34
    Thanks for the feedback. This is definitely a bug that we will be looking into.

    ------------------------------
    Harrison Kaiser
    Software Engineer I
    Rocket Internal - All Brands
    ------------------------------



  • 7.  RE: Problem with "git show" not returning the correct output

    PARTNER
    Posted 06-07-2022 02:15
    Hi @Harrison Kaiser,

    Any update on this issue?

    ------------------------------
    Gary Freestone
    Systems Programmer
    Kyndryl Inc
    Mt Helen VIC AU
    ------------------------------