Encoding string from Universe to Python

January 12, 2024
10 replies
4 views

Héctor Cortiguera
Participating Frequently

Hi all

I'm trying to make sense of the encoding of strings coming and going between Basic and Python.

Our Universe database stores its data using OEM encoding, so to pass data to Python we do a series of transformations inside Basic to convert them to UTF-8.

Now I want to pass these OEM strings, as they are, directly into Python and do the conversion there, but I'm having some trouble.

ALL.CHAR.STR=''
ALL.CHAR.STR=ALL.CHAR.STR:CHAR(1):CHAR(2):CHAR(3):CHAR(4):CHAR(5):CHAR(6):CHAR(7):CHAR(8):CHAR(9):CHAR(10):CHAR(11):CHAR(12):CHAR(13):CHAR(14):CHAR(15)
* rest of characters...
ALL.CHAR.STR=ALL.CHAR.STR:CHAR(241):CHAR(242):CHAR(243):CHAR(244):CHAR(245):CHAR(246):CHAR(247):CHAR(248):CHAR(249):CHAR(250):CHAR(251):CHAR(252):CHAR(253):CHAR(254):CHAR(255)

RESPUESTA=PyCallFunction('encoding_test','test_encoding', ALL.CHAR.STR)

On the Python side I'm doing this

def test_encoding(universe_str: str) -> str:
    for c in universe_str:
        print(f'{ord(c)}')
    return 'OK'

This prints the following values:

How can I get the characters in Python and convert them to UTF-8?

------------------------------
Héctor Cortiguera
Quiter Servicios Informaticos SL
------------------------------

Joe Goldthwaite
Participating Frequently
January 17, 2024

Hi Héctor,

Since you're going to Python, it might be easier and more efficient to do the conversion there. Strings in Python are already Unicode, so there might already be a conversion going on in PyCallFunction. If there is, it's a simple matter of using the Python string encode method:

utf8_string = universe_str.encode('utf-8')

If it's coming in as a byte string, you can first decode it to Unicode and then encode it as UTF-8 like this:

unicode_string = universe_str.decode('ascii')
utf8_string = unicode_string.encode('utf-8')

If you have a normal Python string that you want to convert back to ASCII to pass back to Universe:

ascii_string = normal_unicode_python_string.encode('ascii')
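
As a minimal sketch of the decode-then-encode step, assuming the data really does arrive as a byte string: the 'ascii' codec only accepts bytes 0 to 127, so 8-bit data needs an 8-bit code page named instead. The code page ('cp437') and the sample bytes below are assumptions for illustration, not something confirmed for any particular system.

```python
# Assumption: raw holds OEM (code page 437) bytes; the sample value is made up.
raw = bytes([72, 130, 99, 116, 111, 114])  # "Héctor" in cp437 (byte 130 is é)

# bytes -> str: decoding needs the code page the bytes were written in.
text = raw.decode('cp437')

# str -> bytes: encode the Unicode string as UTF-8.
utf8_bytes = text.encode('utf-8')

# The 'ascii' codec rejects any byte above 127.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(text, utf8_bytes, ascii_ok)
```

If the value arrives as a str rather than bytes, only the encode step applies, and it cannot repair characters that were already lost on the way in.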

------------------------------
Joe Goldthwaite
Consultant
Phoenix AZ US
------------------------------

Joe Goldthwaite
Participating Frequently
January 17, 2024

I should have mentioned that there's a second parameter you can pass to encode and decode to tell it to just ignore any errors and strip the offending characters out. The first parameter is still the encoding name:

.encode('utf-8', 'ignore') or .decode('ascii', 'ignore')

If you want to know about the errors you can use the normal Python try and except error trapping.
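
A short sketch of the three behaviours mentioned here ('ignore', 'replace', and trapping the error from the default 'strict' handler). The sample bytes are made up for illustration.

```python
data = bytes([65, 130, 66])  # 'A', one high byte, 'B' (made-up sample)

ignored = data.decode('ascii', 'ignore')    # high byte is dropped
replaced = data.decode('ascii', 'replace')  # high byte becomes U+FFFD

try:
    data.decode('ascii')  # the default error handler is 'strict'
    failed_at = None
except UnicodeDecodeError as exc:
    failed_at = exc.start  # index of the first offending byte

print(repr(ignored), repr(replaced), failed_at)
```

Both 'ignore' and 'replace' lose information, so they suit display or logging better than data that has to round-trip back to the database.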

------------------------------
Joe Goldthwaite
Consultant
Phoenix AZ US
------------------------------

Héctor Cortiguera
Author
Participating Frequently
January 18, 2024

Hi Joe

My problem is that the data that I'm getting in Python is not what I expect.

If I try to encode the string as UTF-8 this is what I get:

b'\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
 !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~
\x7f\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd
\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd
\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd
\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd
\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd
\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd
\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd
\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd
\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd
\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd
\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd
\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd
\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd'

As soon as I leave the ASCII range, I only get \xef\xbf\xbd
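
The repeated sequence \xef\xbf\xbd is the UTF-8 encoding of U+FFFD, the Unicode replacement character. The sketch below assumes the string handed to the Python function already contains that character; the helper name count_replacements and the sample string are hypothetical. If the count is non-zero, the original byte values were lost before any Python code ran, so no encode call on the Python side can bring them back.

```python
replacement = '\ufffd'  # U+FFFD REPLACEMENT CHARACTER
encoded = replacement.encode('utf-8')


def count_replacements(text: str) -> int:
    """Count characters that were already replaced before Python saw them."""
    return text.count(replacement)


# Stand-in for a string received from BASIC (made up for illustration).
sample = 'abc' + replacement * 3

print(encoded, count_replacements(sample))
```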

------------------------------
Héctor Cortiguera
Quiter Servicios Informaticos SL
------------------------------

Joe Goldthwaite
Participating Frequently
January 23, 2024

Hi Hector,

That confused me for a bit. Then I figured out what you are trying to do. It looks like you've passed a string with all the characters from 1 to 255. Python tried to decode them as UTF-8, but bytes 128 to 255 on their own aren't valid UTF-8 sequences. The Python converter has two options for dealing with those characters without raising an error: ignore or replace. If you pick ignore, they're just stripped out. If you pick replace, it replaces each one with the Unicode replacement character, U+FFFD, which is "\xef\xbf\xbd" when encoded as UTF-8. That's what you're seeing in your string: all the high-numbered characters were converted to that sequence.

You can duplicate this with a short bit of Python:

ascii_string = bytes(range(256))
utf8_encoded_string = ascii_string.decode('utf-8', 'replace').encode('utf-8')
print(utf8_encoded_string)

You'll get your string above. If you change "replace" to "ignore", those characters will be stripped out.

Since you mentioned you're dealing with "OEM encoding", I assume you mean you're using the original IBM OEM character set? If that's what you're doing, you'd have to translate your original test string character by character into the symbol that matches the original OEM symbol. That means if you want to translate character 130, "é", you'd have to convert it to Unicode character U+00E9.

It would be a pain to match all of those up since they're scattered around. Lucky for us we've got AI helpers that can do the grunt work. Here's a sample program that will take a byte string and translate it into the original IBM OEM symbols in UTF-8:

(I'm not sure how this would look if it's emailed. The forum software scrambles it pretty good on my mail client. If you have any issues, try accessing it directly from Rocket's forum website.)

# Define a mapping from IBM PC character set to Unicode
ibm_pc_to_unicode = {
    0: '\x00',  # NUL - U+0000
    1: '\x01',  # SOH - U+0001
    2: '\x02',  # STX - U+0002
    3: '\x03',  # ETX - U+0003
    4: '\x04',  # EOT - U+0004
    5: '\x05',  # ENQ - U+0005
    6: '\x06',  # ACK - U+0006
    7: '\x07',  # BEL - U+0007
    8: '\x08',  # BS - U+0008
    9: '\t',  # HT - U+0009
    10: '\n',  # LF - U+000A
    11: '\x0B',  # VT - U+000B
    12: '\x0C',  # FF - U+000C
    13: '\r',  # CR - U+000D
    14: '\x0E',  # SO - U+000E
    15: '\x0F',  # SI - U+000F
    16: '\x10',  # DLE - U+0010
    17: '\x11',  # DC1 - U+0011
    18: '\x12',  # DC2 - U+0012
    19: '\x13',  # DC3 - U+0013
    20: '\x14',  # DC4 - U+0014
    21: '\x15',  # NAK - U+0015
    22: '\x16',  # SYN - U+0016
    23: '\x17',  # ETB - U+0017
    24: '\x18',  # CAN - U+0018
    25: '\x19',  # EM - U+0019
    26: '\x1A',  # SUB - U+001A
    27: '\x1B',  # ESC - U+001B
    28: '\x1C',  # FS - U+001C
    29: '\x1D',  # GS - U+001D
    30: '\x1E',  # RS - U+001E
    31: '\x1F',  # US - U+001F
    32: ' ',  # Space - U+0020
    33: '!',  # ! - U+0021
    34: '"',  # " - U+0022
    35: '#',  # # - U+0023
    36: '$',  # $ - U+0024
    37: '%',  # % - U+0025
    38: '&',  # & - U+0026
    39: "'",  # ' - U+0027
    40: '(',  # ( - U+0028
    41: ')',  # ) - U+0029
    42: '*',  # * - U+002A
    43: '+',  # + - U+002B
    44: ',',  # , - U+002C
    45: '-',  # - - U+002D
    46: '.',  # . - U+002E
    47: '/',  # / - U+002F
    48: '0',  # 0 - U+0030
    49: '1',  # 1 - U+0031
    50: '2',  # 2 - U+0032
    51: '3',  # 3 - U+0033
    52: '4',  # 4 - U+0034
    53: '5',  # 5 - U+0035
    54: '6',  # 6 - U+0036
    55: '7',  # 7 - U+0037
    56: '8',  # 8 - U+0038
    57: '9',  # 9 - U+0039
    58: ':',  # : - U+003A
    59: ';',  # ; - U+003B
    60: '<',  # < - U+003C
    61: '=',  # = - U+003D
    62: '>',  # > - U+003E
    63: '?',  # ? - U+003F
    64: '@',  # @ - U+0040
    65: 'A',  # A - U+0041
    66: 'B',  # B - U+0042
    67: 'C',  # C - U+0043
    68: 'D',  # D - U+0044
    69: 'E',  # E - U+0045
    70: 'F',  # F - U+0046
    71: 'G',  # G - U+0047
    72: 'H',  # H - U+0048
    73: 'I',  # I - U+0049
    74: 'J',  # J - U+004A
    75: 'K',  # K - U+004B
    76: 'L',  # L - U+004C
    77: 'M',  # M - U+004D
    78: 'N',  # N - U+004E
    79: 'O',  # O - U+004F
    80: 'P',  # P - U+0050
    81: 'Q',  # Q - U+0051
    82: 'R',  # R - U+0052
    83: 'S',  # S - U+0053
    84: 'T',  # T - U+0054
    85: 'U',  # U - U+0055
    86: 'V',  # V - U+0056
    87: 'W',  # W - U+0057
    88: 'X',  # X - U+0058
    89: 'Y',  # Y - U+0059
    90: 'Z',  # Z - U+005A
    91: '[',  # [ - U+005B
    92: '\\',  # \ - U+005C
    93: ']',  # ] - U+005D
    94: '^',  # ^ - U+005E
    95: '_',  # _ - U+005F
    96: '`',  # ` - U+0060
    97: 'a',  # a - U+0061
    98: 'b',  # b - U+0062
    99: 'c',  # c - U+0063
    100: 'd',  # d - U+0064
    101: 'e',  # e - U+0065
    102: 'f',  # f - U+0066
    103: 'g',  # g - U+0067
    104: 'h',  # h - U+0068
    105: 'i',  # i - U+0069
    106: 'j',  # j - U+006A
    107: 'k',  # k - U+006B
    108: 'l',  # l - U+006C
    109: 'm',  # m - U+006D
    110: 'n',  # n - U+006E
    111: 'o',  # o - U+006F
    112: 'p',  # p - U+0070
    113: 'q',  # q - U+0071
    114: 'r',  # r - U+0072
    115: 's',  # s - U+0073
    116: 't',  # t - U+0074
    117: 'u',  # u - U+0075
    118: 'v',  # v - U+0076
    119: 'w',  # w - U+0077
    120: 'x',  # x - U+0078
    121: 'y',  # y - U+0079
    122: 'z',  # z - U+007A
    123: '{',  # { - U+007B
    124: '|',  # | - U+007C
    125: '}',  # } - U+007D
    126: '~',  # ~ - U+007E
    127: '\x7F',  # DEL - U+007F (Delete)
    128: 'Ç',  # Ç - U+00C7
    129: 'ü',  # ü - U+00FC
    130: 'é',  # é - U+00E9
    131: 'â',  # â - U+00E2
    132: 'ä',  # ä - U+00E4
    133: 'à',  # à - U+00E0
    134: 'å',  # å - U+00E5
    135: 'ç',  # ç - U+00E7
    136: 'ê',  # ê - U+00EA
    137: 'ë',  # ë - U+00EB
    138: 'è',  # è - U+00E8
    139: 'ï',  # ï - U+00EF
    140: 'î',  # î - U+00EE
    141: 'ì',  # ì - U+00EC
    142: 'Ä',  # Ä - U+00C4
    143: 'Å',  # Å - U+00C5
    144: 'É',  # É - U+00C9
    145: 'æ',  # æ - U+00E6
    146: 'Æ',  # Æ - U+00C6
    147: 'ô',  # ô - U+00F4
    148: 'ö',  # ö - U+00F6
    149: 'ò',  # ò - U+00F2
    150: 'û',  # û - U+00FB
    151: 'ù',  # ù - U+00F9
    152: 'ÿ',  # ÿ - U+00FF
    153: 'Ö',  # Ö - U+00D6
    154: 'Ü',  # Ü - U+00DC
    155: '¢',  # ¢ - U+00A2
    156: '£',  # £ - U+00A3
    157: '¥',  # ¥ - U+00A5
    158: '₧',  # ₧ - U+20A7
    159: 'ƒ',  # ƒ - U+0192
    160: 'á',  # á - U+00E1
    161: 'í',  # í - U+00ED
    162: 'ó',  # ó - U+00F3
    163: 'ú',  # ú - U+00FA
    164: 'ñ',  # ñ - U+00F1
    165: 'Ñ',  # Ñ - U+00D1
    166: 'ª',  # ª - U+00AA
    167: 'º',  # º - U+00BA
    168: '¿',  # ¿ - U+00BF
    169: '⌐',  # ⌐ - U+2310
    170: '¬',  # ¬ - U+00AC
    171: '½',  # ½ - U+00BD
    172: '¼',  # ¼ - U+00BC
    173: '¡',  # ¡ - U+00A1
    174: '«',  # « - U+00AB
    175: '»',  # » - U+00BB
    176: '░',  # ░ - U+2591
    177: '▒',  # ▒ - U+2592
    178: '▓',  # ▓ - U+2593
    179: '│',  # │ - U+2502
    180: '┤',  # ┤ - U+2524
    181: '╡',  # ╡ - U+2561
    182: '╢',  # ╢ - U+2562
    183: '╖',  # ╖ - U+2556
    184: '╕',  # ╕ - U+2555
    185: '╣',  # ╣ - U+2563
    186: '║',  # ║ - U+2551
    187: '╗',  # ╗ - U+2557
    188: '╝',  # ╝ - U+255D
    189: '╜',  # ╜ - U+255C
    190: '╛',  # ╛ - U+255B
    191: '┐',  # ┐ - U+2510
    192: '└',  # └ - U+2514
    193: '┴',  # ┴ - U+2534
    194: '┬',  # ┬ - U+252C
    195: '├',  # ├ - U+251C
    196: '─',  # ─ - U+2500
    197: '┼',  # ┼ - U+253C
    198: '╞',  # ╞ - U+255E
    199: '╟',  # ╟ - U+255F
    200: '╚',  # ╚ - U+255A
    201: '╔',  # ╔ - U+2554
    202: '╩',  # ╩ - U+2569
    203: '╦',  # ╦ - U+2566
    204: '╠',  # ╠ - U+2560
    205: '═',  # ═ - U+2550
    206: '╬',  # ╬ - U+256C
    207: '╧',  # ╧ - U+2567
    208: '╨',  # ╨ - U+2568
    209: '╤',  # ╤ - U+2564
    210: '╥',  # ╥ - U+2565
    211: '╙',  # ╙ - U+2559
    212: '╘',  # ╘ - U+2558
}

# ascii_string is the byte string built earlier with bytes(range(256)).
# Byte values with no entry in the table (213 to 255 as posted) fall back to chr().
unicode_string = ''.join(ibm_pc_to_unicode.get(char, chr(char)) for char in ascii_string)

# Print the resulting Unicode string
print(unicode_string)
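
As an alternative to a hand-written table, Python's standard library ships a codec for the IBM PC character set under the name 'cp437'. A minimal sketch, assuming the input really is code page 437 and really arrives as bytes (neither is confirmed in this thread):

```python
# Every possible byte value, as a stand-in for OEM data.
oem_bytes = bytes(range(256))

# Decode with the built-in code page 437 codec, then re-encode as UTF-8.
unicode_string = oem_bytes.decode('cp437')
utf8_bytes = unicode_string.encode('utf-8')

# Spot checks against entries from the table above.
print(unicode_string[128:131])        # Çüé
print(hex(ord(unicode_string[130])))  # 0xe9

# The mapping is reversible, so nothing is lost in either direction.
print(unicode_string.encode('cp437') == oem_bytes)  # True
```

Because the codec covers all 256 byte values, it also handles 213 to 255, which the table as posted does not list.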

------------------------------
Joe Goldthwaite
Consultant
Phoenix AZ US
------------------------------

Héctor Cortiguera
Author
Participating Frequently
January 25, 2024

Hi Joe

My trouble is not on the Python side, but in BASIC.

I have data in OEM encoding in BASIC that I need to pass to Python, and PyCallFunction is converting the OEM data to UTF-8 incorrectly. I'd like to know how to pass these characters to Python without losing any information.

------------------------------
Héctor Cortiguera
Quiter Servicios Informaticos SL
------------------------------

Mike Rajkowski
Rocketeer
January 25, 2024

Hector,

Have you tried PyByteCallFunction? It sends the data as a byte string rather than a string.

------------------------------
Mike Rajkowski
MultiValue Product Evangelist
Rocket Internal - All Brands
US
------------------------------

Joe Goldthwaite
Participating Frequently
January 25, 2024

Hi Mike,

I don't see PyByteCallFunction documented anywhere. I downloaded the latest Universe Python pdf but there's no mention of it. Do you have newer documentation?

Hi Hector,

From what I see, the Universe Python interface is trying to automatically convert the ASCII strings from Basic to Unicode. The encoding it's using must not understand characters 128 and above, so it's converting them to the Unicode "unknown character" sequence mentioned above.

The documentation mentions a "config.encoding" variable that defines the type of encoding used. Whatever that is, it's not a good match. The documentation says you can change it, but I don't see any instructions as to how, and I can't test it. It might be as simple as putting "config.encoding=xxxx" somewhere in your code. The u2py.py routine uses "config.encoding" in a number of places.

If you put "from _u2py import *" in your code, it looks like you'll get the config object and can look at it. (When I try it I get "U2 Python Package is not licensed.")

It looks like the encoding you want to use is "cp437". On my system the encoding files are all stored in /usr/uv/python/lib/python3.4/encodings.

Maybe you can play with those and figure it out.
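
One way to check which code pages a Python installation can resolve, without touching the u2py package at all, is the standard codecs module. This only confirms that a codec exists; whether the UniVerse interface can be told to use it is not something this sketch can show. The helper name codec_available is hypothetical, and 'cp850' is included only as another example of an OEM code page.

```python
import codecs


def codec_available(name: str) -> bool:
    """Return True if this Python installation can resolve the codec name."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


for name in ('cp437', 'cp850', 'cp1252'):
    print(name, codec_available(name))
```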

Joe G.

------------------------------
Joe Goldthwaite
Consultant
Phoenix AZ US
------------------------------

Mike Rajkowski
Rocketeer
January 25, 2024

Sorry about that; PyByteCallFunction is in UniData. UniVerse is a bit different. Another thing to check is whether you are using NLS and, if it is on, which codepage you are using.

------------------------------
Mike Rajkowski
MultiValue Product Evangelist
Rocket Internal - All Brands
US
------------------------------

Héctor Cortiguera
Author
Participating Frequently
February 2, 2024

Hi Mike

We are not using NLS.

The encoding of the Universe data is cp1252.

My questions are:
- Is there a way to pass data in "raw" cp1252 format to Python?

- Is there a way to convert cp1252 data to UTF-8 in BASIC?

Right now we parse the cp1252 data and convert it to UTF-8 manually, but it's a time-consuming process, as we have to convert back and forth on every Python call.
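
For the Python half of that conversion, here is a minimal sketch, assuming the cp1252 bytes can reach Python intact. One transport that might work is sending the value as a string of hexadecimal digits, which is plain ASCII and so should survive any automatic conversion. Producing that hex string in BASIC is not shown, the function names are hypothetical, and none of this has been tested against UniVerse.

```python
def cp1252_to_utf8(raw: bytes) -> bytes:
    """Re-encode cp1252 bytes as UTF-8.

    Python's cp1252 codec leaves five byte values undefined
    (0x81, 0x8D, 0x8F, 0x90, 0x9D); 'replace' turns those into U+FFFD
    instead of raising UnicodeDecodeError.
    """
    return raw.decode('cp1252', 'replace').encode('utf-8')


def from_hex_cp1252(hex_text: str) -> str:
    """Decode a hex-digit string (e.g. 'F1' for byte 241) as cp1252 text."""
    return bytes.fromhex(hex_text).decode('cp1252', 'replace')


def to_hex_cp1252(text: str) -> str:
    """Inverse direction: Python str -> cp1252 bytes -> uppercase hex digits."""
    return text.encode('cp1252', 'replace').hex().upper()


print(cp1252_to_utf8(bytes([241])))  # byte 241 is ñ in cp1252
print(from_hex_cp1252('4E69F16F'))
print(to_hex_cp1252('Niño'))
```

The hex form doubles the size of every value, so whether it beats the current manual conversion would need measuring.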

------------------------------
Héctor Cortiguera
Quiter Servicios Informaticos SL
------------------------------

Héctor Cortiguera
Author
Participating Frequently
February 2, 2024

Joe:

I tried your approach, but when I try to access config.encoding the Python environment crashes.

------------------------------
Héctor Cortiguera
Quiter Servicios Informaticos SL
------------------------------
