Hi all

I'm trying to make sense of the encoding of strings coming and going between Basic and Python.

Our Universe database stores its data using OEM encoding, so to pass data to Python we do a series of transformations inside Basic to convert them to UTF-8.

Now I want to pass these OEM strings, as they are, directly into Python and do the conversion there, but I'm having some trouble.

ALL.CHAR.STR=''
ALL.CHAR.STR=ALL.CHAR.STR:CHAR(1):CHAR(2):CHAR(3):CHAR(4):CHAR(5):CHAR(6):CHAR(7):CHAR(8):CHAR(9):CHAR(10):CHAR(11):CHAR(12):CHAR(13):CHAR(14):CHAR(15)
* rest of characters...
ALL.CHAR.STR=ALL.CHAR.STR:CHAR(241):CHAR(242):CHAR(243):CHAR(244):CHAR(245):CHAR(246):CHAR(247):CHAR(248):CHAR(249):CHAR(250):CHAR(251):CHAR(252):CHAR(253):CHAR(254):CHAR(255)

RESPUESTA=PyCallFunction('encoding_test','test_encoding', ALL.CHAR.STR)

On the Python side I'm doing this

def test_encoding(universe_str: str) -> str:
    for c in universe_str:
        print(f'{ord(c)}')
    return 'OK'

This prints the following values:

1
2
3
4
...
125
126
127
65533
65533
...
65533
65533

How can I get the characters in Python and convert them to UTF-8?



------------------------------
Héctor Cortiguera
Quiter Servicios Informaticos SL
------------------------------

Hi Héctor,

Since you're going to Python, it might be easier and more efficient to do the conversion there. Strings in Python are already Unicode, so PyCallFunction may already be doing a conversion. If it is, it's a simple matter of using the Python string encode method:

utf8_string = universe_str.encode('utf-8')

If it's coming in as a byte string, you can first decode it to Unicode and then encode it as UTF-8 like this:

unicode_string  = universe_str.decode('ascii')

utf8_string = unicode_string.encode('utf-8')

If you have a normal Python string that you want to convert back to ASCII to pass back to Universe:

ascii_string   = normal_unicode_python_string.encode('ascii')
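
Which of the calls above applies depends on whether the value arrives as str or as bytes. A minimal sketch, assuming a hypothetical helper named to_utf8_bytes (it is not part of u2py or of anything in this thread), that handles both cases:

```python
def to_utf8_bytes(value, source_encoding="ascii"):
    """Return UTF-8 bytes for either a str or a bytes input."""
    if isinstance(value, bytes):
        # Raw bytes must first be decoded using the encoding they were written in.
        value = value.decode(source_encoding)
    # A str is already Unicode, so it only needs to be encoded.
    return value.encode("utf-8")


print(to_utf8_bytes("abc"))
print(to_utf8_bytes(b"abc"))
print(to_utf8_bytes(b"\x82", "cp437"))
```

Note that decoding with 'ascii' only succeeds when every byte is below 128, which is why the sketch lets the caller name the source encoding.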



------------------------------
Joe Goldthwaite
Consultant
Phoenix AZ US
------------------------------

I should have mentioned that there's a second parameter you can pass to encode and decode to tell it to just ignore any errors and strip them out.

.encode(encoding, 'ignore') or .decode(encoding, 'ignore')

If you want to know about the errors you can use the normal Python try and except error trapping.
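
As a rough illustration of the error handlers and the try/except approach mentioned above, here is a small sketch. The sample bytes are made up for the example; only standard Python calls are used:

```python
raw = b"caf\x82"  # 0x82 is outside the ASCII range

# errors='ignore' silently drops bytes that cannot be decoded.
ignored = raw.decode("ascii", "ignore")

# errors='replace' substitutes U+FFFD for each undecodable byte.
replaced = raw.decode("ascii", "replace")

# The default (strict) handler raises, which lets you find the bad byte.
bad_position = None
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    bad_position = exc.start
    print(f"byte {raw[exc.start]:#04x} at position {exc.start} is not ASCII")

print(repr(ignored), repr(replaced))
```

Both 'ignore' and 'replace' lose information, so they are only appropriate when the original bytes are not needed afterwards.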



------------------------------
Joe Goldthwaite
Consultant
Phoenix AZ US
------------------------------

Hi Joe

My problem is that the data that I'm getting in Python is not what I expect.

If I try to encode the string as UTF-8 this is what I get:

b'\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
 !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~
\x7f\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd
\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd
\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd
\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd
\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd
\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd
\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd
\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd
\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd
\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd
\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd
\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd
\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd'

As soon as I go past the ASCII range, I only get \xef\xbf\xbd.



------------------------------
Héctor Cortiguera
Quiter Servicios Informaticos SL
------------------------------

Hi Héctor,

That confused me for a bit. Then I figured out what you are trying to do. It looks like you've passed a string with all the character codes from 1 to 255. Python tried to decode them as UTF-8, but bytes 128 to 255 are not valid UTF-8 sequences on their own. Besides raising an error, the Python converter has two ways to deal with those bytes: ignore or replace. If you pick ignore, they're just stripped out. If you pick replace, each one is replaced with the Unicode replacement character, U+FFFD, which encodes in UTF-8 as "\xef\xbf\xbd". That's what you're seeing in your string: all the high-numbered characters converted to that sequence.

You can duplicate this with a short bit of Python:

ascii_string = bytes(range(256))
utf8_encoded_string = ascii_string.decode('utf-8', 'replace').encode('utf-8')
print(utf8_encoded_string)

You'll get your string above. If you change "replace" to "ignore", those characters will be stripped out.
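
To connect the two outputs seen in this thread (65533 from ord() and the three-byte sequence after encoding), here is a small sketch using only standard Python. It suggests that both are the same replacement character, and that the original byte values cannot be recovered once the substitution has happened:

```python
replacement = "\ufffd"  # Unicode replacement character

code_point = ord(replacement)            # the number printed by ord() earlier
utf8_bytes = replacement.encode("utf-8") # the sequence seen after encoding

# Two different high bytes both collapse to the same character,
# so the information about which byte it was is gone.
lossy = bytes([130, 164]).decode("utf-8", "replace")

print(code_point, utf8_bytes, lossy == replacement * 2)
```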

Since you mentioned you're dealing with "OEM encoding", I assume you mean the original IBM OEM character set? If so, you'd have to translate your original test string character by character into the Unicode symbol that matches the original OEM symbol. For example, to translate character 130, "é", you'd convert it to Unicode character U+00E9.

It would be a pain to match all of those up since they're scattered around. Lucky for us, we've got AI helpers that can do the grunt work. Here's a sample program that takes a byte string and translates it into the original IBM OEM symbols as a Unicode string:

(I'm not sure how this will look if it's emailed. The forum software scrambles it pretty badly in my mail client. If you have any issues, try accessing it directly from Rocket's forum website.)

# Define a mapping from IBM PC character set to Unicode
ibm_pc_to_unicode = {
0: '\x00', # NUL - U+0000
1: '\x01', # SOH - U+0001
2: '\x02', # STX - U+0002
3: '\x03', # ETX - U+0003
4: '\x04', # EOT - U+0004
5: '\x05', # ENQ - U+0005
6: '\x06', # ACK - U+0006
7: '\x07', # BEL - U+0007
8: '\x08', # BS - U+0008
9: '\t', # HT - U+0009
10: '\n', # LF - U+000A
11: '\x0B', # VT - U+000B
12: '\x0C', # FF - U+000C
13: '\r', # CR - U+000D
14: '\x0E', # SO - U+000E
15: '\x0F', # SI - U+000F
16: '\x10', # DLE - U+0010
17: '\x11', # DC1 - U+0011
18: '\x12', # DC2 - U+0012
19: '\x13', # DC3 - U+0013
20: '\x14', # DC4 - U+0014
21: '\x15', # NAK - U+0015
22: '\x16', # SYN - U+0016
23: '\x17', # ETB - U+0017
24: '\x18', # CAN - U+0018
25: '\x19', # EM - U+0019
26: '\x1A', # SUB - U+001A
27: '\x1B', # ESC - U+001B
28: '\x1C', # FS - U+001C
29: '\x1D', # GS - U+001D
30: '\x1E', # RS - U+001E
31: '\x1F', # US - U+001F
32: ' ', # Space - U+0020
33: '!', # ! - U+0021
34: '"', # " - U+0022
35: '#', # # - U+0023
36: '$', # $ - U+0024
37: '%', # % - U+0025
38: '&', # & - U+0026
39: "'", # ' - U+0027
40: '(', # ( - U+0028
41: ')', # ) - U+0029
42: '*', # * - U+002A
43: '+', # + - U+002B
44: ',', # , - U+002C
45: '-', # - - U+002D
46: '.', # . - U+002E
47: '/', # / - U+002F
48: '0', # 0 - U+0030
49: '1', # 1 - U+0031
50: '2', # 2 - U+0032
51: '3', # 3 - U+0033
52: '4', # 4 - U+0034
53: '5', # 5 - U+0035
54: '6', # 6 - U+0036
55: '7', # 7 - U+0037
56: '8', # 8 - U+0038
57: '9', # 9 - U+0039
58: ':', # : - U+003A
59: ';', # ; - U+003B
60: '<', # < - U+003C
61: '=', # = - U+003D
62: '>', # > - U+003E
63: '?', # ? - U+003F
64: '@', # @ - U+0040
65: 'A', # A - U+0041
66: 'B', # B - U+0042
67: 'C', # C - U+0043
68: 'D', # D - U+0044
69: 'E', # E - U+0045
70: 'F', # F - U+0046
71: 'G', # G - U+0047
72: 'H', # H - U+0048
73: 'I', # I - U+0049
74: 'J', # J - U+004A
75: 'K', # K - U+004B
76: 'L', # L - U+004C
77: 'M', # M - U+004D
78: 'N', # N - U+004E
79: 'O', # O - U+004F
80: 'P', # P - U+0050
81: 'Q', # Q - U+0051
82: 'R', # R - U+0052
83: 'S', # S - U+0053
84: 'T', # T - U+0054
85: 'U', # U - U+0055
86: 'V', # V - U+0056
87: 'W', # W - U+0057
88: 'X', # X - U+0058
89: 'Y', # Y - U+0059
90: 'Z', # Z - U+005A
91: '[', # [ - U+005B
92: '\\', # \ - U+005C
93: ']', # ] - U+005D
94: '^', # ^ - U+005E
95: '_', # _ - U+005F
96: '`', # ` - U+0060
97: 'a', # a - U+0061
98: 'b', # b - U+0062
99: 'c', # c - U+0063
100: 'd', # d - U+0064
101: 'e', # e - U+0065
102: 'f', # f - U+0066
103: 'g', # g - U+0067
104: 'h', # h - U+0068
105: 'i', # i - U+0069
106: 'j', # j - U+006A
107: 'k', # k - U+006B
108: 'l', # l - U+006C
109: 'm', # m - U+006D
110: 'n', # n - U+006E
111: 'o', # o - U+006F
112: 'p', # p - U+0070
113: 'q', # q - U+0071
114: 'r', # r - U+0072
115: 's', # s - U+0073
116: 't', # t - U+0074
117: 'u', # u - U+0075
118: 'v', # v - U+0076
119: 'w', # w - U+0077
120: 'x', # x - U+0078
121: 'y', # y - U+0079
122: 'z', # z - U+007A
123: '{', # { - U+007B
124: '|', # | - U+007C
125: '}', # } - U+007D
126: '~', # ~ - U+007E
127: '\x7F', # DEL - U+007F (Delete)
128: 'Ç', # Ç - U+00C7
129: 'ü', # ü - U+00FC
130: 'é', # é - U+00E9
131: 'â', # â - U+00E2
132: 'ä', # ä - U+00E4
133: 'à', # à - U+00E0
134: 'å', # å - U+00E5
135: 'ç', # ç - U+00E7
136: 'ê', # ê - U+00EA
137: 'ë', # ë - U+00EB
138: 'è', # è - U+00E8
139: 'ï', # ï - U+00EF
140: 'î', # î - U+00EE
141: 'ì', # ì - U+00EC
142: 'Ä', # Ä - U+00C4
143: 'Å', # Å - U+00C5
144: 'É', # É - U+00C9
145: 'æ', # æ - U+00E6
146: 'Æ', # Æ - U+00C6
147: 'ô', # ô - U+00F4
148: 'ö', # ö - U+00F6
149: 'ò', # ò - U+00F2
150: 'û', # û - U+00FB
151: 'ù', # ù - U+00F9
152: 'ÿ', # ÿ - U+00FF
153: 'Ö', # Ö - U+00D6
154: 'Ü', # Ü - U+00DC
155: '¢', # ¢ - U+00A2
156: '£', # £ - U+00A3
157: '¥', # ¥ - U+00A5
158: '₧', # ₧ - U+20A7
159: 'ƒ', # ƒ - U+0192
160: 'á', # á - U+00E1
161: 'í', # í - U+00ED
162: 'ó', # ó - U+00F3
163: 'ú', # ú - U+00FA
164: 'ñ', # ñ - U+00F1
165: 'Ñ', # Ñ - U+00D1
166: 'ª', # ª - U+00AA
167: 'º', # º - U+00BA
168: '¿', # ¿ - U+00BF
169: '⌐', # ⌐ - U+2310
170: '¬', # ¬ - U+00AC
171: '½', # ½ - U+00BD
172: '¼', # ¼ - U+00BC
173: '¡', # ¡ - U+00A1
174: '«', # « - U+00AB
175: '»', # » - U+00BB
176: '░', # ░ - U+2591
177: '▒', # ▒ - U+2592
178: '▓', # ▓ - U+2593
179: '│', # │ - U+2502
180: '┤', # ┤ - U+2524
181: '╡', # ╡ - U+2561
182: '╢', # ╢ - U+2562
183: '╖', # ╖ - U+2556
184: '╕', # ╕ - U+2555
185: '╣', # ╣ - U+2563
186: '║', # ║ - U+2551
187: '╗', # ╗ - U+2557
188: '╝', # ╝ - U+255D
189: '╜', # ╜ - U+255C
190: '╛', # ╛ - U+255B
191: '┐', # ┐ - U+2510
192: '└', # └ - U+2514
193: '┴', # ┴ - U+2534
194: '┬', # ┬ - U+252C
195: '├', # ├ - U+251C
196: '─', # ─ - U+2500
197: '┼', # ┼ - U+253C
198: '╞', # ╞ - U+255E
199: '╟', # ╟ - U+255F
200: '╚', # ╚ - U+255A
201: '╔', # ╔ - U+2554
202: '╩', # ╩ - U+2569
203: '╦', # ╦ - U+2566
204: '╠', # ╠ - U+2560
205: '═', # ═ - U+2550
206: '╬', # ╬ - U+256C
207: '╧', # ╧ - U+2567
208: '╨', # ╨ - U+2568
209: '╤', # ╤ - U+2564
210: '╥', # ╥ - U+2565
211: '╙', # ╙ - U+2559
212: '╘', # ╘ - U+2558
}

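# ascii_string is the bytes object built in the earlier snippet; iterating over it yields integers.
# Codes that are missing from the table (213 to 255 in this listing) fall back to chr(),
# which gives the Latin-1 character for that code rather than the OEM symbol.
unicode_string = ''.join(ibm_pc_to_unicode.get(char, chr(char)) for char in ascii_string)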

# Print the resulting Unicode string
print(unicode_string)
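
As an alternative to a hand-built table, standard Python ships a codec for the original IBM PC code page under the name cp437. A minimal sketch, assuming the data really is in that code page (the thread has not confirmed which OEM code page is in use):

```python
raw = bytes(range(256))

# Decode every byte value with the built-in IBM PC (OEM US) code page.
text = raw.decode("cp437")

# Spot-check a few positions against the table in the post.
print(text[130], text[164], text[225])

# Once the text is a proper str, encoding to UTF-8 is lossless.
utf8 = text.encode("utf-8")

# The mapping is reversible, so nothing is lost going back to the OEM bytes.
round_trip = text.encode("cp437")
print(round_trip == raw)
```

Other OEM code pages (for example cp850) have their own codecs, so the codec name is the only thing that would need to change.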


------------------------------
Joe Goldthwaite
Consultant
Phoenix AZ US
------------------------------

Hi Joe

My trouble is not on the Python side, but on BASIC.

I have data in OEM encoding on BASIC that I need to pass to Python, and the PyCall function is converting the OEM data to UTF-8 incorrectly. I'd like to know how to pass these characters to Python without losing any information.



------------------------------
Héctor Cortiguera
Quiter Servicios Informaticos SL
------------------------------

Hector,

Have you tried PyByteCallFunction? It sends the data as a byte string rather than as a string.



------------------------------
Mike Rajkowski
MultiValue Product Evangelist
Rocket Internal - All Brands
US
------------------------------

Hi Mike,

I don't see PyByteCallFunction documented anywhere. I downloaded the latest Universe Python pdf but there's no mention of it. Do you have newer documentation?

Hi Hector,

From what I see, the Universe python interface is trying to automatically convert the ASCII strings from Basic to Unicode. The encoding method it's using must not understand characters 128+ so it's converting them to the Unicode "unknown character" sequence mentioned above.

The documentation mentions a "config.encoding" variable that defines the type of encoding used. Whatever it is set to, it's not a good match. The documentation says you can change it, but I don't see any instructions on how, and I can't test it. It might be as simple as putting "config.encoding=xxxx" somewhere in your code. The u2py.py routine uses "config.encoding" in a number of places.

If you put "from _u2py import *" in your code, it looks like you'll get the config object and can look at it. (When I try it I get "U2 Python Package is not licensed.")

It looks like the encoding you want to use is "cp437".  On my system the encoding files are all stored in /usr/uv/python/lib/python3.4/encodings.
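
One way to check which codecs a given Python installation provides, without touching u2py at all, is the standard codecs module. A small sketch; the list of names is only an example of code pages that came up or might plausibly apply:

```python
import codecs
import encodings
import os

# Directory that holds the codec modules for this interpreter.
print(os.path.dirname(encodings.__file__))

# codecs.lookup raises LookupError if a codec is not available.
found = {}
for name in ("cp437", "cp850", "cp1252"):
    found[name] = codecs.lookup(name).name
    print(name, "->", found[name])
```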

Maybe you can play with those and figure it out.

Joe G.



------------------------------
Joe Goldthwaite
Consultant
Phoenix AZ US
------------------------------

Sorry about that, PyByteCallFunction is in UniData. UniVerse is a bit different. Another thing to check is whether you are using NLS and, if it is on, which code page you are using.



------------------------------
Mike Rajkowski
MultiValue Product Evangelist
Rocket Internal - All Brands
US
------------------------------

Hi Mike

We are not using NLS.

The encoding of the Universe data is cp1252.

My questions are:
- Is there a way to pass data in "raw" cp1252 format to Python?

- Is there a way to convert cp1252 data to UTF-8 in BASIC?

Right now we parse the cp1252 data and convert it to UTF-8 manually, but it's a time-consuming process, as we have to convert back and forth on every Python call.
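
For reference, if raw cp1252 bytes did reach Python intact (the open question in this thread), the conversion itself would be short. A sketch with made-up sample bytes, using only standard Python codecs:

```python
data = bytes([0xE9, 0xF1, 0x80])  # é, ñ and € in cp1252

# cp1252 bytes -> Unicode text -> UTF-8 bytes
text = data.decode("cp1252")
utf8 = text.encode("utf-8")

# And back again for the return trip to the database.
back = utf8.decode("utf-8").encode("cp1252")
print(text, utf8, back == data)

# Python's cp1252 codec leaves a few byte values (such as 0x81) undefined,
# so strict decoding of arbitrary binary data can still fail.
try:
    bytes([0x81]).decode("cp1252")
    undefined_decodes = True
except UnicodeDecodeError:
    undefined_decodes = False
print(undefined_decodes)
```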



------------------------------
Héctor Cortiguera
Quiter Servicios Informaticos SL
------------------------------

Joe:

I tried your approach, but when I try to access config.encoding the Python environment crashes.
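
One possible workaround that avoids the automatic conversion entirely is to send only ASCII across the call. The sketch below assumes the BASIC side can produce a hex-encoded copy of the string (two hex digits per character); that step is an assumption and is not shown or verified here. The helper name decode_hex_payload is hypothetical:

```python
def decode_hex_payload(hex_text: str, encoding: str = "cp1252") -> str:
    """Rebuild the original bytes from a hex string, then decode them."""
    raw = bytes.fromhex(hex_text)  # pure-ASCII input, so nothing is replaced in transit
    return raw.decode(encoding)


# "E9F1" stands for the two bytes 0xE9 and 0xF1.
print(decode_hex_payload("E9F1"))
print(decode_hex_payload("82A4", "cp437"))
```

The cost is a payload twice as long and an extra conversion on the BASIC side, so this only makes sense if that conversion is cheaper than the manual UTF-8 transformation.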



------------------------------
Héctor Cortiguera
Quiter Servicios Informaticos SL
------------------------------