char values represent Unicode characters. Unicode is a character encoding standard that supports the world's major languages; a Java char holds one 16-bit UTF-16 code unit, so characters outside the Basic Multilingual Plane take two char values. You can learn more about the Unicode standard at the Unicode Consortium Web site.
Few text editors currently support Unicode text entry. The text editor we used to write this section's code examples supports only ASCII characters, which are limited to 7 bits. To indicate Unicode characters that cannot be represented in ASCII, such as ö, we used the \uXXXX escape sequence. Each X in the escape sequence is a hexadecimal digit. The following example shows how to indicate the ö character with an escape sequence:
    String str = "\u00F6";
    char c = '\u00F6';
    Character letter = Character.valueOf('\u00F6');
You can determine the default character encoding by creating an OutputStreamWriter that uses it and asking for its canonical name:
    OutputStreamWriter out = new OutputStreamWriter(new ByteArrayOutputStream());
    System.out.println(out.getEncoding());
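As a possible alternative to creating a writer, the default encoding can also be read from java.nio.charset.Charset. The sketch below assumes Java 5 or later; the class and method names (DefaultEncodingDemo, defaultEncodingName) are hypothetical and chosen only for illustration. The printed name depends on the platform and JVM settings, and it may differ in spelling from the historical name that getEncoding() returns.

```java
import java.nio.charset.Charset;

class DefaultEncodingDemo {
    // Returns the canonical name of the charset the JVM uses
    // when no encoding is specified explicitly.
    static String defaultEncodingName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // The output varies by platform and JVM configuration.
        System.out.println(defaultEncodingName());
    }
}
```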
This section discusses the APIs you use to translate non-Unicode text into Unicode. Before using these APIs, you should verify that the character encoding you wish to convert into Unicode is supported. The list of supported character encodings is not part of the Java programming language specification. Therefore the character encodings supported by the APIs may vary with platform. To see which encodings the Java Development Kit supports, see the Supported Encodings document.
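One way to perform the check described above is Charset.isSupported. The following is a minimal sketch, not a definitive implementation: the class EncodingSupportCheck and its two helper methods are hypothetical names, and the decode helper simply shows a byte array being converted to a String once support has been confirmed.

```java
import java.nio.charset.Charset;

class EncodingSupportCheck {
    // Returns true if this JVM can convert to and from the named encoding.
    static boolean isEncodingSupported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Thrown for null or syntactically illegal charset names.
            return false;
        }
    }

    // Converts non-Unicode bytes to a String, failing early with a
    // clear message when the encoding is not available.
    static String decode(byte[] bytes, String encodingName) {
        if (!isEncodingSupported(encodingName)) {
            throw new IllegalArgumentException("Unsupported encoding: " + encodingName);
        }
        return new String(bytes, Charset.forName(encodingName));
    }

    public static void main(String[] args) {
        byte[] latin1 = { (byte) 0xF6 }; // the letter o with diaeresis in ISO-8859-1
        System.out.println(isEncodingSupported("ISO-8859-1"));
        System.out.println(isEncodingSupported("x-no-such-encoding"));
        System.out.println((int) decode(latin1, "ISO-8859-1").charAt(0)); // 246
    }
}
```

Checking first keeps the failure at the point where the encoding name enters the program, rather than deep inside a conversion call.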
The material that follows describes two techniques for converting non-Unicode text to Unicode. You can convert non-Unicode byte arrays into String objects, and vice versa. Or you can translate between streams of Unicode characters and byte streams of non-Unicode text.

Unicode Escapes
A compiler for the Java programming language ("Java compiler") first recognizes Unicode escapes in its input, translating the ASCII characters \u followed by four hexadecimal digits to the UTF-16 code unit of the indicated hexadecimal value, and passing all other characters unchanged. Representing supplementary characters requires two consecutive Unicode escapes. This translation step results in a sequence of Unicode input characters.
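To illustrate the point about supplementary characters, the sketch below writes U+1F600, a code point outside the Basic Multilingual Plane, as two consecutive escapes that form a surrogate pair. The class and constant names are hypothetical; the String and Character methods are standard library calls.

```java
class SupplementaryEscapeDemo {
    // U+1F600 needs a surrogate pair: high surrogate D83D, low surrogate DE00.
    static final String GRINNING_FACE = "\uD83D\uDE00";

    // Number of UTF-16 code units (char values) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(GRINNING_FACE));   // 2
        System.out.println(codePoints(GRINNING_FACE));  // 1
        System.out.println(Integer.toHexString(GRINNING_FACE.codePointAt(0))); // 1f600
    }
}
```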
UnicodeInputCharacter:
    UnicodeEscape
    RawInputCharacter

UnicodeEscape:
    \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit

UnicodeMarker:
    u
    UnicodeMarker u

RawInputCharacter:
    any Unicode character

HexDigit: one of
    0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F
The \, u, and hexadecimal digits here are all ASCII characters.
In addition to the processing implied by the grammar, for each raw input character that is a backslash \, input processing must consider how many other \ characters contiguously precede it, separating it from a non-\ character or the start of the input stream. If this number is even, then the \ is eligible to begin a Unicode escape; if the number is odd, then the \ is not eligible to begin a Unicode escape.
For example, the raw input "\\u2297=\u2297" results in the eleven characters " \ \ u 2 2 9 7 = ⊗ " (\u2297 is the Unicode encoding of the character ⊗).
If an eligible \ is not followed by u, then it is treated as a RawInputCharacter and remains part of the escaped Unicode stream.
If an eligible \ is followed by u, or more than one u, and the last u is not followed by four hexadecimal digits, then a compile-time error occurs.
For example, the raw input \u005cu005a results in the six characters \ u 0 0 5 a, because 005c is the Unicode value for \. It does not result in the character Z, which is Unicode character 005a, because the \ that resulted from the \u005c is not interpreted as the start of a further Unicode escape.
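The rules above can be sketched as a small translator. This is an illustrative approximation of the translation step, not the code of any actual compiler: the class UnicodeEscapeTranslator and its methods are hypothetical, and a malformed escape is reported with an IllegalArgumentException standing in for the compile-time error.

```java
class UnicodeEscapeTranslator {
    // Value of an ASCII hexadecimal digit, or -1 if c is not one.
    static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Translates eligible escapes in raw input into single UTF-16 code units.
    static String translate(String raw) {
        StringBuilder out = new StringBuilder();
        int precedingBackslashes = 0; // contiguous raw backslashes just before index i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            boolean eligible = c == '\\' && precedingBackslashes % 2 == 0;
            if (eligible && i + 1 < raw.length() && raw.charAt(i + 1) == 'u') {
                int j = i + 1;
                while (j < raw.length() && raw.charAt(j) == 'u') j++; // skip one or more u
                if (j + 4 > raw.length()) {
                    throw new IllegalArgumentException("Truncated escape at index " + i);
                }
                int value = 0;
                for (int k = j; k < j + 4; k++) {
                    int digit = hexValue(raw.charAt(k));
                    if (digit < 0) {
                        throw new IllegalArgumentException("Bad hex digit at index " + k);
                    }
                    value = value * 16 + digit;
                }
                // The result is appended as-is and is never rescanned,
                // so a produced backslash cannot start a further escape.
                out.append((char) value);
                precedingBackslashes = 0;
                i = j + 4;
            } else {
                out.append(c);
                precedingBackslashes = (c == '\\') ? precedingBackslashes + 1 : 0;
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The two examples from the text, written as Java string literals.
        System.out.println(translate("\\\\u2297=\\u2297").length()); // 9 (11 with the quotes)
        System.out.println(translate("\\u005cu005a").length());      // 6
    }
}
```

Counting only raw backslashes, and resetting the count after a completed escape, is what keeps the second example from collapsing into Z.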
The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. The transformation involves converting any Unicode escapes in the source text of the program to ASCII by adding an extra u (for example, \uxxxx becomes \uuxxxx) while simultaneously converting non-ASCII characters in the source text to Unicode escapes containing a single u each.
This transformed version is equally acceptable to a Java compiler and represents the exact same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple u's are present to a sequence of Unicode characters with one fewer u, while simultaneously converting each escape sequence with a single u to the corresponding single Unicode character.
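The round trip described above might look roughly like the following. This is a sketch under stated assumptions: AsciiSourceTransform, toAscii, and fromAscii are hypothetical names, the input to fromAscii is assumed to have been produced by toAscii, and the sketch does not handle a non-ASCII character that directly follows an odd run of backslashes (the escape generated for it would not be eligible).

```java
class AsciiSourceTransform {
    // Unicode source to ASCII: existing escapes gain one extra u,
    // non-ASCII characters become escapes with a single u.
    static String toAscii(String source) {
        StringBuilder out = new StringBuilder();
        int run = 0; // contiguous backslashes immediately before index i
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '\\') {
                out.append(c);
                boolean startsEscape = run % 2 == 0
                        && i + 1 < source.length() && source.charAt(i + 1) == 'u';
                if (startsEscape) {
                    out.append('u'); // marks an escape that was already in the source
                    run = 0;
                } else {
                    run++;
                }
            } else {
                if (c > 0x7F) {
                    out.append(String.format("\\u%04x", (int) c));
                } else {
                    out.append(c);
                }
                run = 0;
            }
        }
        return out.toString();
    }

    // ASCII form back to Unicode source: escapes with several u's lose one,
    // escapes with a single u become the character they denote.
    static String fromAscii(String ascii) {
        StringBuilder out = new StringBuilder();
        int run = 0;
        int i = 0;
        while (i < ascii.length()) {
            char c = ascii.charAt(i);
            boolean startsEscape = c == '\\' && run % 2 == 0
                    && i + 1 < ascii.length() && ascii.charAt(i + 1) == 'u';
            if (!startsEscape) {
                out.append(c);
                run = (c == '\\') ? run + 1 : 0;
                i++;
                continue;
            }
            int j = i + 1;
            while (j < ascii.length() && ascii.charAt(j) == 'u') j++;
            int uCount = j - (i + 1);
            if (j + 4 > ascii.length() || !ascii.substring(j, j + 4).matches("[0-9a-fA-F]{4}")) {
                throw new IllegalArgumentException("Malformed escape at index " + i);
            }
            String hex = ascii.substring(j, j + 4);
            if (uCount > 1) {
                out.append('\\');
                for (int k = 1; k < uCount; k++) out.append('u'); // one fewer u
                out.append(hex);
            } else {
                out.append((char) Integer.parseInt(hex, 16));
            }
            run = 0;
            i = j + 4;
        }
        return out.toString();
    }
}
```

The extra u is what lets the reverse step tell an escape the programmer wrote from one the transformation generated.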
A Java compiler should use the \uxxxx notation as an output format to display Unicode characters when a suitable font is not available.