Understanding Unicode

Universal character encoding

Pocket PowerBuilder and the upcoming version 10 of PowerBuilder are Unicode versions. But what does that mean? What is Unicode, and why do we need it? That's what I'll discuss this month. Be warned, though: the main body of the Unicode standard is over 1,000 pages long, counting indexes and appendices, and there is even more supplemental material - addenda, data tables, related substandards, implementation notes, and so on. I'll try to give you a brief overview, so let's start looking into the world of Unicode.

Why Unicode?
You might not notice it, but we live in a world of many languages. There are about 7,000 different languages spoken somewhere on this blue planet (yes, I mean spoken languages, not programming languages like PB, Java, or C#), each with countless dialects and regional variations. So as soon as a single application has to handle more than one of these languages on a computer, we have a problem. This is where Unicode comes in.

Let's start with a small experiment. We want to be able to represent textual information in a computer. To do this, let's make a list of all the characters we have to display and assign each a unique number (for example, A is 1, B is 2, C is 3, and so on). When choosing the numbers it's useful to make the numeric order of the character codes follow the alphabetical order of the letters so that we can sort them, and it makes sense to put all the digits in one contiguous group and the letters in another, so that we can easily check whether a character is a letter. With this list we can represent words as sequences of numbers: using this scheme, the word "beer" would be represented, or stored in a computer, as 2-5-5-18 (a small PowerScript sketch of this toy scheme follows the list below). This is fine for the English language, but there will be problems when you move to another language. For example, look at the French phrase "à bientôt" (which means "see you soon"). What should we do with the accented characters? We have two basic choices:

  1. Assign a different number to each accented version of a given letter. The problem with this approach is that when you compare two strings, you may lose sight of the fact that a and à are the same base letter, and the comparison may do the wrong thing unless there is code that knows that à is just an accented version of a.
  2. Treat the accent marks as independent characters and give them their own numbers. With this approach the letter a keeps its identity, but you then have to decide where the code for the accent goes in the sequence relative to the code for the letter a, and what tells the system that the accent belongs on top of the a and not on some other letter.
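
To make the toy scheme above concrete, here is a minimal PowerScript sketch. The "a = 1" numbering is the toy scheme from the text, not a real character set, and the MessageBox is only there to show the result:

// Toy encoding: a = 1, b = 2, ..., z = 26, so "beer" becomes 2-5-5-18
string  ls_word = "beer"
string  ls_codes = ""
integer li_pos

FOR li_pos = 1 TO Len(ls_word)
	IF li_pos > 1 THEN ls_codes += "-"
	ls_codes += String(Asc(Mid(ls_word, li_pos, 1)) - Asc("a") + 1)
NEXT

MessageBox("Toy encoding", ls_word + " = " + ls_codes)   // displays: beer = 2-5-5-18
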
Let's leave the European languages and take a closer look at Hebrew. We see immediately that in this language most of the letters can carry marks, and the marks might appear on any letter, so assigning a unique code to every letter/mark combination is no longer practical. A mix of different languages on a single line also raises new problems. Consider a sentence that is mostly English but has a few Hebrew words embedded in the middle of it. The problem is that the Hebrew letters don't run from left to right, but from right to left. That is, the first letter of the Hebrew phrase is the one farthest to the right. So we can't really store the characters in the order in which they should be displayed on the screen; we would have to store either the English or the Hebrew characters backward. If we store the characters in the order they're read or typed, we need to specify how they are to be arranged when the text is displayed or printed.

The ordering can be even more complex in other languages. In Hindi, for example, letters are joined together into so-called syllables. The syllables run from left to right across the page like any English text, but the arrangement of the marks within a syllable can be very complex and might not follow the order in which the corresponding sounds are actually spoken, or the order in which the characters are typed.

What Unicode Is
The most common character set is US-ASCII (American Standard Code for Information Interchange), which has 32 (nonprintable) control characters and 96 (mostly printable) other characters, for a total of 128. These 128 characters can be encoded in 7 bits of data, so each 8-bit byte representing one of these characters has the lower 7 bits set to the appropriate value for the given character and the 8th (high) bit set to zero. US-ASCII is therefore considered a single-byte, 7-bit character set. We've already seen that many European languages have accented characters (like the German ü, the French ç and é, the Danish Ø, and the Spanish ñ). Such languages are commonly represented by character sets whose lower half (values 0-127) is identical to US-ASCII and whose upper half (values 128-255) holds the accented characters. There are a lot of these character sets out there, depending on which language you'd like to use (just create a new ASA database and you'll see that you can choose from a whole bunch of different character sets).
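
If you want to see where a character lives in such a single-byte character set, you can ask PowerBuilder for its code value. This is just a minimal sketch; the value you get for the accented character depends on the code page your machine (and your PB version) is using - 252 assumes Windows-1252/Latin-1:

// Lower half: identical to US-ASCII
MessageBox("Code of A", String(Asc("A")))    // 65
// Upper half: only meaningful relative to a specific code page
MessageBox("Code of ü", String(Asc("ü")))    // 252 under Windows-1252 (Latin-1)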

Unicode wants to be the solution to this chaos. It allows us to represent and store all of these different languages and characters using a single encoding scheme. It uses a fixed-length character-encoding scheme that includes characters from almost all of the living languages of the world. The default encoding form is 16-bit, that is, each character is 16 bits (two bytes) wide and is usually written as U+hhhh, where hhhh is the hexadecimal code point of the character. While the resulting 65,000-plus code elements are sufficient for encoding most of the characters of the major languages of the world, the Unicode standard also provides an extension mechanism that allows the encoding of as many as a million more characters. The extension mechanism uses a pair of high and low surrogate characters to encode one extended or supplementary character. The first (or high) surrogate has a code value between U+D800 and U+DBFF, and the second (or low) surrogate has a code value between U+DC00 and U+DFFF.
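
How a supplementary character is split into such a surrogate pair is plain arithmetic: subtract 0x10000 from the code point, add the top ten bits of the result to 0xD800 for the high surrogate and the bottom ten bits to 0xDC00 for the low surrogate. A minimal sketch, using U+1D11E (the musical G clef symbol) purely as an illustration:

// U+1D11E = 119070 decimal; anything above U+FFFF needs a surrogate pair
long ll_rest, ll_high, ll_low
ll_rest = 119070 - 65536                     // subtract 0x10000 -> 53534
ll_high = 55296 + Int(ll_rest / 1024)        // 0xD800 + top 10 bits    -> 55348 = 0xD834
ll_low  = 56320 + Mod(ll_rest, 1024)         // 0xDC00 + bottom 10 bits -> 56606 = 0xDD1E
// The character is stored as the two 16-bit units 0xD834 0xDD1E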

Unicode aims to be a universal character encoding standard, providing unique, unambiguous representations for every character in virtually every writing system and language in the world. The most recent version of Unicode extends the language support and provides representations for more than 90,000 characters.

Note that Unicode is nothing new. Microsoft implemented it in its products a long time ago; it first showed up in Windows NT, and the entire NT line supports Unicode throughout the kernel. This means that under Windows 2000, for instance, operating system strings such as filenames, registry keys, and window title bar captions are all Unicode strings internally. In the world of Windows development, Unicode means UTF-16, and Windows developers use the terms wide string (meaning a string of 16-bit characters) and Unicode string interchangeably.

Will this influence development under (Pocket) PowerBuilder? Well, yes, if you are writing a program that needs to call a WinAPI function such as GetUserName(), which returns the name of the user logged into the Windows system. There are actually two slightly different versions of GetUserName() available to you: GetUserNameA() fills its buffer with an ANSI string, and GetUserNameW() fills it with a wide (Unicode) string. That's it. These will be the only changes you'll face when you start using the Unicode versions of PowerBuilder.
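
In PowerBuilder terms, the difference shows up in the external function declaration. What follows is only a sketch: the exact declaration you need depends on your PB version (an ANSI PowerBuilder should call the A version, a Unicode PowerBuilder the W version):

// Global external function declarations - ANSI and wide variants of the same API
FUNCTION boolean GetUserNameA (REF string lpBuffer, REF ulong nSize) LIBRARY "advapi32.dll"
FUNCTION boolean GetUserNameW (REF string lpBuffer, REF ulong nSize) LIBRARY "advapi32.dll"

// Calling it from a script
string ls_user
ulong  lul_size = 255

ls_user = Space(255)                  // pre-allocate the buffer the API fills in
IF GetUserNameW(ls_user, lul_size) THEN
	MessageBox("Current user", ls_user)
END IF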

Where Unicode Won't Help
It's also important to keep in mind what Unicode isn't. First, Unicode is a standard scheme for representing plain text in computers and data communication; it's not a scheme for representing rich text. But what is plain text? Plain text is only the words, sentences, and numbers. Rich text, by contrast, is plain text plus information about the text, especially information on its visual presentation (a word is bold or italic), the structure of a document (a piece of text is a section header or a footnote), or the language. Rich text may also include nontext items that travel with the text, such as pictures. It can be difficult to draw a line between what qualifies as plain text and therefore should be encoded in Unicode, and what is really rich text. The basic rule is that plain text contains all of the information necessary to carry the semantic meaning of the text - the letters, spaces, digits, punctuation, and so forth. If removing a piece of information would make the text unintelligible, that information belongs to the plain text.

Unicode can only be one part of a complete solution for software internationalization. A complete solution would mean that the application we are writing can be adapted to various international markets (localized) without modifying the executable code. We already know that the Unicode standard doesn't include everything necessary to produce this kind of software; Unicode is a solution to one particular problem. When writing such a piece of software, we are now able to represent text that consists of a mix of different languages without getting tripped up by multiple encoding standards. I agree that this issue is an important one and was not really solvable in the past, but it's not the only problem we run into when we want to distribute our application all over the world. There are more traps we have to find a solution for:

  • We will have to translate any text in the user interface into the user's language and maybe alter the screen layouts to reflect the size of the translated text. In addition, there may be a need to change icons and graphics to be meaningful (or not offensive). These are things that Unicode will not do for you; there's still some development involved to accomplish these tasks.
  • An old problem for PowerBuilder developers is the way in which numbers, dates, and times are presented to or entered by the user (remember the decimal point problem - it's a comma in most of Europe) or the order of the various parts of a date (dd.mm.yyyy is common in Europe); see the small date sketch after this list. A more substantial change within your application might be needed if you switch to Chinese, which uses a completely different system for writing numbers, or when your application is used in Israel, where a different calendar system is in use.
  • Simple things like sorting a list into alphabetical order may produce different orders for the same list depending on which language is used, as the term alphabetical order is a very language-specific concept.
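
As a small illustration of the second point, the same date value can be presented in completely different ways; the format masks below are just examples, and deciding which one to use for which locale is still up to you - Unicode doesn't make that choice:

date ld_today
ld_today = Today()

MessageBox("US style",       String(ld_today, "mm/dd/yyyy"))
MessageBox("European style", String(ld_today, "dd.mm.yyyy"))
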
Summary
This article first explained why it's not so easy to represent or store different languages on a computer. The solution to this problem is Unicode, the latest of several attempts to create a universal character encoding. Its primary approach is to enlarge the encoding space by increasing the number of bits used to encode each character: most other character encodings provide enough space for a maximum of 256 characters, while Unicode currently encodes about 90,000 different characters. For every character or group of characters, someone had to sit down and decide whether it was the same as or different from the other characters, and there are rules about what it means, how it should look in various situations, how it's arranged on a line of text with other characters, which other characters are similar but different, how various text-processing operations should treat it, and so on.

The Unicode standard is large and complex, and the last version published as a book is 1,072 pages long; you'll find more information about it at www.unicode.org. Another good book on Unicode is Unicode Demystified, by Richard Gillam. And now I wish you happy Unicoding with Pocket PowerBuilder and PowerBuilder 10.

About the Author

Berndt Hamboeck is a senior consultant for BHITCON (www.bhitcon.net). He's a CSI, SCAPC8, EASAC, SCJP2, and started his Sybase development using PB5. You can reach him at [email protected]
