Encoder Tool and Resources

WRITE Tool and Resources

Last updated: 13-DEC-2004 22:08

I am currently working on WRITE (Web-Ready Input Text Encoder), an encoder tool that translates text between character sets, as defined by custom dictionaries, using Unicode. In the dictionary list below, the arrows indicate the direction of translation (<-> means text can be translated in both directions). The translation process is pretty slow, especially on this server, so please be patient. There are currently three versions of WRITE:

encoder.php [source] – the original version (24 Nov 2004)

encoder2.php – modified so that the CharClick box is loaded from a predefined HTML file rather than dynamically, making each page load approximately twice as fast (3 Dec 2004)

encoder3.php [source] – with an experimental new recursive version of the str_preg_replace function, which turned out to be slower than the original version (3 Dec 2004)

An important note about the encoding algorithm: it only supports single replacement. That is, once one or more characters have been replaced by a rule, the replacement characters cannot be replaced again by any other rule. This simplifies the algorithm and also rules out an infinite chain of replacements.
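To make the single-replacement idea concrete, here is a minimal sketch in Python. It is not the str_preg_replace function from WRITE (which is written in PHP); the function name, the demo rules, and the choice to try longer keys first are all assumptions made for illustration.

```python
import re

def encode_single_pass(text, rules):
    """Apply every rule in one left-to-right pass; the output is never rescanned."""
    if not rules:
        return text
    # Assumption of this sketch: longer keys are tried first, so "ae" wins over "a".
    keys = sorted(rules, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # Each match is replaced exactly once; replacement text is not matched again.
    return pattern.sub(lambda m: rules[m.group(0)], text)

# Hypothetical rules chosen so that one rule's output is another rule's input.
demo_rules = {"a": "b", "b": "c"}
demo_result = encode_single_pass("ab", demo_rules)  # "bc", not "cc"
```

With chained replacement the "b" produced by the first rule would be turned into "c" as well; scanning the input only once is what prevents that, and it is also why a pair of rules such as a -> b and b -> a cannot loop forever.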

Here are the supported dictionaries at the moment (if anyone wants to implement Japanese or some other encoding, let me know!):

ASCII-SYMBOLS <-> Common Symbols [CC] – Provides an ASCII shorthand for common Latin-1 Supplement characters (Ì, é, Å, ü, etc.) as well as some mathematical operators (∀, ∞, ±, ⊕, …) and other symbols. Supports most of the characters with defined HTML 4.0 named entities.

X-SAMPA <-> IPA [CC] – Supports the standard X-SAMPA ASCII encoding for IPA (International Phonetic Alphabet) symbols in Unicode. For more information:

Shebrew <-> Hebrew (Unicode) [CC] and Shebrew (no vowels) <-> Hebrew (Unicode) [CC] – Provides an ASCII shorthand for representing Hebrew letters and points (vowels). The "no vowels" version ignores any vowels in the input when producing output. This ONLY works for the Unicode encoding of Hebrew characters; it does not support Hebrew stored in ISO-8859-8 or a similar 8-bit encoding, where the letters occupy the byte values that Latin-1 uses for its supplemental characters. Note that Hebrew is written right to left. For more information on encoding Hebrew with Unicode:

[0590-05FF] Hebrew (Unicode)

ISO-8859 Conversion – Convert text from a particular ISO-8859 encoding to Unicode. For more on ISO-8859, see ISO-8859: Alphabet Soup. Supported ISO-8859 encodings are:
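As a rough illustration of the Hebrew and ISO-8859 entries in the dictionary list, the Python sketch below decodes ISO-8859-8 bytes into Unicode and strips the points from a pointed Unicode word. It relies on Python's built-in codecs rather than WRITE's dictionaries, and both function names are hypothetical. Note also that the "no vowels" dictionary ignores vowels in the input, whereas this sketch removes them from the Unicode output, so it is a stand-in for the idea rather than the same method.

```python
import unicodedata

def iso8859_to_unicode(raw, part):
    """Decode bytes stored in ISO-8859-<part> into a Unicode string."""
    return raw.decode(f"iso-8859-{part}")

def strip_hebrew_points(text):
    """Drop combining marks in the Hebrew block (U+0590..U+05FF), keeping letters.

    This removes vowel points and also cantillation marks, since both are
    nonspacing marks (category Mn).
    """
    return "".join(
        ch for ch in text
        if not ("\u0590" <= ch <= "\u05ff" and unicodedata.category(ch) == "Mn")
    )

# "shalom" as ISO-8859-8 bytes: shin, lamed, vav, final mem.
legacy_bytes = b"\xf9\xec\xe5\xed"
unicode_word = iso8859_to_unicode(legacy_bytes, 8)

# The same word in Unicode with points (qamats, shin dot, holam).
pointed = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"
unpointed = strip_hebrew_points(pointed)
```

Both unicode_word and unpointed come out as the four letters U+05E9 U+05DC U+05D5 U+05DD. Decoding the same bytes as ISO-8859-1 instead gives "ùìåí", which is why Hebrew text in a legacy encoding can look like a run of accented Latin letters.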

French 24 Presentation

The Unicode Standard Version 4.0

Using Unicode characters with HTML

Resources for HTML, XHTML, CSS, Javascript, PHP, and Regular Expressions

10 Million Firefox Downloads in 32 Days!

Mozilla Firefox Browser

Mozilla Firefox version 1.0 was recently released, and I highly recommend it for everyone. It does everything Internet Explorer can do, but is more secure and has many really cool features, such as tabbed browsing, extensions you can install (for example, to show a weather summary in the corner of your browser), and a nice Calendar program that you can download. It is also better than IE at displaying Unicode characters and conforming to web standards. The installation program will even import your Internet Explorer bookmarks for you, so there's no reason not to download Firefox!