Thursday, October 29, 2015

How to check your Operating System's code page for a particular character

I have been involved in several migration projects where we had to deal with Western European characters, which have invalid representation in the database, and needs to be converted.

Like I showed in my post about usage of the chr function, you sometimes need to find the decimal and/or hexadecimal value for a particular character.

But how can you verify that your operating system actually agrees; how would your Operating System translate a character passed from a terminal, e.g. putty?

First, make sure your putty terminal has its translation-settings set to the character set of the server you are logging into: Right-click your putty terminal upper frame, and pick "Change Settings" from the menu. Go to the "translation" settings and then select the correct character set from the drop-down menu "Remote character set".

On Linux platforms, a simple way to check how a character would be translated would be to use the hexdump utility. Thanks to Gray Watson for demonstrating this!

man hexdump
The hexdump utility is a filter which displays the specified files, or the standard input, if no files are specified, in a user specified format.

Let's try to hexdump the character ø, and see what the internal hexadecimal representation is:
echo "ø" | hexdump -v -e '"x" 1/1 "%02X" " "'
xC3 xB8 x0A 

The prefix x in front of the values represents hexadecimal values, so the important part here is "C3 and B8" - in a multibyte character set this represent the Scandinavian character ø ( I must admit, I never figured out what the 0A represents. Anyone?)

Another way is to use the "od" utility:
man od

od - dump files in octal and other formats
       -x     same as -t x2, select hexadecimal 2-byte units


Using the -x flag straight off will give the hexadecimal value in 2-byte units:
echo "ø" | od -x
0000000 b8c3 000a
0000003

This time, the values are cast around, and should be read backwards for meaning. I have not found an explanation to why od does this. Anyone?

However, if you use the -t x notation instead,:
echo "ø" | od -t x1
0000000 c3 b8 0a
0000003
The values come out as a DBA expects; c3b8 corresponds to decimal value 50104 which in turn represent the Scandinavian letter ø.

( And again, I never figured out what the 0a represents. Anyone?)





No comments:

Post a Comment