Unicode input

Unicode input is the insertion of a specific Unicode character on a computer by a user; it is a common way to input characters not directly supported by a physical keyboard. Unicode characters can be produced either by selecting them from a display or by typing a certain sequence of keys on a physical keyboard. In addition, a character produced by one of these methods in one web page or document can be copied into another. Like ASCII, Unicode assigns a number to each character, but it encodes a far larger repertoire of characters and symbols.[1]

A Unicode input system needs to provide a large repertoire of characters, ideally all valid Unicode code points. This differs from a keyboard layout, which defines keys and key combinations only for a limited number of characters appropriate for a certain locale.

The KCharSelect character mapping tool shown displaying a subset of the Unicode Mathematical Operators

Unicode numbers

Unicode characters are distinguished by code points, which are conventionally represented by "U+" followed by four, five or six hexadecimal digits, for example U+00AE or U+1D310. Characters in the Basic Multilingual Plane (BMP), containing modern scripts – including many Chinese and Japanese characters – and many symbols, have a 4-digit code. Historic scripts, but also many modern symbols and pictographs (such as emoticons, playing cards and many CJK characters) have 5-digit codes.
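The notation described above can be illustrated with a short Python sketch. The helper names parse_code_point, plane_of and format_code_point are hypothetical and chosen only for this example; the sketch assumes the "U+" prefix is followed by four to six hexadecimal digits.

```python
import re

# "U+" followed by four to six hexadecimal digits (assumed input format).
_CODE_POINT = re.compile(r"U\+([0-9A-Fa-f]{4,6})")


def parse_code_point(notation):
    """Convert notation such as 'U+00AE' into an integer code point."""
    match = _CODE_POINT.fullmatch(notation)
    if match is None:
        raise ValueError(f"not in U+ notation: {notation!r}")
    value = int(match.group(1), 16)
    if value > 0x10FFFF:
        raise ValueError(f"beyond the Unicode range: {notation!r}")
    return value


def plane_of(code_point):
    """Return the plane number; plane 0 is the Basic Multilingual Plane."""
    return code_point >> 16


def format_code_point(code_point):
    """Format an integer as 'U+' plus at least four uppercase hex digits."""
    return f"U+{code_point:04X}"
```

With these helpers, plane_of(parse_code_point("U+00AE")) is 0 (a BMP character with a 4-digit code), while plane_of(parse_code_point("U+1D310")) is 1 (a supplementary character with a 5-digit code).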

Availability

An application can display a character only if it can access a font which contains a glyph for the character.[2] Very few fonts have full Unicode coverage; most contain only the glyphs needed to support a few writing systems. However, most modern browsers and other text-processing applications are able to display multilingual content because they perform font substitution, automatically switching to a fallback font when necessary to display characters which are not supported in the current font. The fonts used for fallback and the thoroughness of Unicode coverage vary by software and operating system; some software will search for a suitable glyph in all of the installed fonts, while other software searches only within certain fonts.

If an application does not have access to a font supporting a character, the character will usually be shown as a question mark or another generic replacement character, e.g. � or  ⃞ .

Selection from a screen

GNOME Character Map

Many systems provide a way to select Unicode characters visually. ISO/IEC 14755 refers to this as a screen-selection entry method.[3]

Microsoft Windows provides a Unicode version of the Character Map program, which has been included in consumer editions since Windows XP. It is limited to characters in the Basic Multilingual Plane (BMP). Characters are searchable by Unicode character name, and the table can be limited to a particular code block.[4]

More advanced third-party tools of the same type are also available (a notable freeware example is BabelMap, which supports all Unicode characters).[5]

In macOS, the "Emoji & Symbols" item (Command+Ctrl+Space) can be found in the Edit menu of many programs. It brings up the Characters palette, which allows the user to choose any character from a variety of views. The user can also search for a character or Unicode plane by name.[6][7]

On most Linux desktop environments, equivalent tools – such as gucharmap (GNOME) or kcharselect (KDE) – are available.[8]
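Searching by character name, as these tools offer, can be approximated with Python's standard unicodedata module. The following is a minimal sketch, not a description of how any of the tools above work; the function name search_bmp_by_name and its limit parameter are hypothetical, and the results depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata


def search_bmp_by_name(fragment, limit=20):
    """Return (code point, name) pairs in the BMP whose name contains fragment."""
    fragment = fragment.upper()  # official character names are uppercase
    results = []
    for code_point in range(0x10000):
        # Unnamed code points (controls, surrogates, unassigned) yield "".
        name = unicodedata.name(chr(code_point), "")
        if fragment in name:
            results.append((code_point, name))
            if len(results) >= limit:
                break
    return results
```

For an exact name, unicodedata.lookup("GEAR") returns the character directly, and unicodedata.name() performs the reverse mapping.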

Decimal input

In some applications on Microsoft Windows, particularly those using the RichEdit control, decimal Unicode code points (for example, 256 for U+0100) are supported with Alt codes.

The text editor Vim allows characters to be specified by two-character mnemonics (confusingly called "digraphs" by Vim developers). The installed set can be augmented by custom mnemonics defined for arbitrary code points, specified in decimal. For example, since decimal 9881 is equal to hexadecimal 2699, the command dig Gr 9881 associates the mnemonic "Gr" with U+2699 GEAR.
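Because these methods expect decimal values while code points are normally published in hexadecimal, a conversion step is usually needed. The sketch below shows that arithmetic in Python; the helper names hex_to_decimal and vim_digraph_command are hypothetical, and the generated string simply mirrors the command form quoted above.

```python
def hex_to_decimal(hex_digits):
    """Convert a hexadecimal code such as '2699' to its decimal value."""
    return int(hex_digits, 16)


def vim_digraph_command(mnemonic, char):
    """Build a digraph definition; the code point is written in decimal."""
    if len(mnemonic) != 2:
        raise ValueError("a digraph mnemonic has exactly two characters")
    return f"dig {mnemonic} {ord(char)}"
```

Here hex_to_decimal("2699") gives 9881 and hex_to_decimal("0100") gives 256, matching the two examples in this section.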

Hexadecimal input

Clause 5.1 of ISO/IEC 14755 describes a Basic method whereby a beginning sequence is followed by the hex number representation of the code point and the ending sequence. On some systems, this is limited to the BMP (characters up to U+FFFF).

In Microsoft Windows

To enable a universal input method in Windows that is independent of language settings, one can add a string (REG_SZ) value named EnableHexNumpad to the registry key HKEY_CURRENT_USER\Control Panel\Input Method and set its value data to 1. After editing the registry, users must log off and back on (Windows Vista, 7, 8.0 and 8.1) or reboot (earlier systems) before the input method starts working. Unicode characters can then be entered by holding down Alt, pressing + on the numeric keypad, typing the hexadecimal code (using the numeric keypad for the digits 0 to 9 and the letter keys for A to F), and then releasing Alt.[2] This may not work for 5-digit hexadecimal codes such as U+1F937.

UnicodeInput window

If one prefers not to edit the registry or if, as on many laptops, the numeric keypad is unavailable, the utility UnicodeInput can be downloaded.[9] When this program is invoked while typing text, a small input window appears; entering the hexadecimal value and pressing ↵ Enter then produces the desired character and closes the window.

AutoHotkey scripts support substitution of Unicode characters for keystrokes. For example, the command Send {U+2014} will insert an em dash in a text field in the active window.[10]

Some individual Windows programs or apps already support the input of Unicode. For instance, Word, WordPad and the LibreOffice programs (Writer, Calc, etc.) support the following input method: one first enters the character’s hexadecimal code (between two and six hexadecimal digits), then immediately presses Alt+X. For example, entering f1 and then pressing the combination will produce the character ñ. Unless the code is six hexadecimal digits long, it must not be preceded by any digits or the letters a–f, as they will be treated as part of the code to be converted. For example, entering af1 followed by Alt+X will produce ૱ (U+0AF1), but entering a0000f1 followed by Alt+X will produce añ.
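The Alt+X behaviour described above can be modelled with a small Python sketch. This is a simplified model inferred from the examples in the text, not the actual algorithm used by Word or LibreOffice; the function name alt_x is hypothetical, and the step that shortens a candidate code when it is not a valid code point is an assumption of this sketch.

```python
import string

HEX_DIGITS = set(string.hexdigits)
MAX_CODE_POINT = 0x10FFFF


def alt_x(text):
    """Replace the hex code at the end of text with the character it names."""
    # Count trailing hex digits, looking at no more than six characters.
    run = 0
    while run < min(6, len(text)) and text[-1 - run] in HEX_DIGITS:
        run += 1
    # Assumption: try the longest candidate first and shorten it until it
    # names a valid, non-surrogate code point (at least two digits).
    for length in range(run, 1, -1):
        value = int(text[-length:], 16)
        if value <= MAX_CODE_POINT and not 0xD800 <= value <= 0xDFFF:
            return text[:-length] + chr(value)
    return text  # nothing convertible: leave the text unchanged
```

Under this model, "af1" is consumed whole and becomes U+0AF1, whereas in "a0000f1" only the last six digits are converted, leaving the leading "a" in place.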

In macOS

In Mac OS 8.5 and later, one can choose the Unicode Hex Input keyboard layout; in OS X 10.10 Yosemite, this can be added in Keyboard → Input Sources. Holding down ⌥ Option, one types the four-digit hexadecimal Unicode code point and the equivalent character appears; one can then release the ⌥ Option key.[11] Characters outside the BMP exceed the four-digit limit of the Unicode Hex Input mechanism. They can be entered using the search box in the Character Viewer (Edit → Emoji & Symbols) or by using surrogate pairs: holding down the ⌥ Option key while entering the first surrogate, the +, and the second surrogate, then releasing the ⌥ Option key.
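The two surrogates for a character outside the BMP follow from the standard UTF-16 encoding rule. The Python sketch below computes them; the function name to_surrogate_pair is hypothetical.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF use surrogates")
    offset = code_point - 0x10000          # a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F937)
print(f"{high:04X} {low:04X}")  # prints "D83E DD37"
```

For U+1F937, the two four-digit values to type are therefore D83E and DD37.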

In X11 (Linux and Unix variants)

The possibility of hexadecimal code input on operating systems using the X Window System depends on the system and applications. Hex input is not implemented in the common X.Org Server.[12] Individual input methods and GUI toolkits can provide hex input independent of the X server.

For example, GTK+ is an ISO/IEC 14755-conformant system[citation needed]. The beginning sequence is Ctrl+⇧ Shift+U and the ending sequence is ↵ Enter or Space. Programs based on GTK+, such as GNOME applications, support Unicode input.

There are two common methods for direct input of Unicode characters:

  • Hold Ctrl+⇧ Shift and type u followed by the hex digits. Then release Ctrl+⇧ Shift.
  • Press Ctrl+⇧ Shift+u and release, then type the hex digits and press ↵ Enter (or Space, or on some systems press and release ⇧ Shift or Ctrl).

In non-GTK applications, however, there is usually no escape sequence for entering arbitrary characters. For example, Qt and KDE rely on the standard X Input Method (XIM) framework and do not implement their own solutions.[13] xterm does not support these input methods, although escape sequences can be used as an alternative. rxvt-unicode implements ISO/IEC 14755 as an optional feature, which is enabled by default.

However, regardless of the toolkit used, the Compose key subsystem can be configured so that certain keystroke combinations input a subset of Unicode.

In platform-independent applications

  • In Emacs, Ctrl+x 8 ↵ Enter or Meta+x insert-char.
  • In LibreOffice 5.1 onwards, type the hexadecimal number of a symbol and press Alt+X.
  • In Opera versions that use the Presto layout engine (up to and including version 12.xx), enter the hexadecimal number of the desired symbol or character and then press Ctrl+⇧ Shift+x (alternative shortcut Meta+⇧ Shift+x on macOS).
  • In the Vim editor, in insert mode, the user first types Ctrl+V u (for codepoints up to 4 hex digits long; use Ctrl+V ⇧ Shift+U for longer), then types in the hexadecimal number of the symbol or character desired, and it will be converted into the symbol. (On Microsoft Windows, Ctrl+Q may be required instead of Ctrl+V.[14])

Unicode in HTML

On the web

In HTML and XML, character codes to be rendered as characters are prefixed by ampersand and number sign (&#), and are followed by a semicolon (;). The code point can be either in decimal or in hexadecimal; in the latter case it is preceded by an "x". Leading zeros may be omitted. A number of characters may be represented by a named entity.

Example: In HTML/XML, the copyright sign © (U+00A9) may be coded as:

  • &#169; (decimal code point)
  • &#xA9; (hexadecimal code point)
  • &copy; (entity name)
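The three forms can be checked with Python's standard html module, whose unescape function decodes numeric and named character references. The helper numeric_references below is hypothetical and only illustrates how the decimal and hexadecimal forms are derived from a character's code point.

```python
import html


def numeric_references(char):
    """Return the decimal and hexadecimal character references for char."""
    code_point = ord(char)
    return f"&#{code_point};", f"&#x{code_point:X};"


decimal_ref, hex_ref = numeric_references("\u00a9")  # the copyright sign
print(decimal_ref, hex_ref)  # prints "&#169; &#xA9;"

# All three spellings decode to the same character.
for reference in (decimal_ref, hex_ref, "&copy;"):
    assert html.unescape(reference) == "\u00a9"
```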

In Thunderbird

The HTML option in the Thunderbird Insert menu allows the insertion of Unicode characters using either decimal or hexadecimal values with the HTML syntax. &#169;, &#xA9; and &copy; are each rendered as '©'.

See also

References