Posts

Computer Science Internationalization - Unicode Encoding & Decoding

Several years ago I devised this visual and fun way to teach and practise encoding and decoding Unicode. I used this method in my International Computing class. The method needs only a pencil and an eraser. The codepoints and the UTF-8 are all written in hexadecimal (hex). The binary bits are an intermediate form used for encoding and decoding. We start with the following form, which is designed for encoding Unicode codepoints to UTF-8 and decoding UTF-8 to Unicode codepoints. Encoding: We will start with encoding Unicode codepoints to UTF-8. The first thing we can do is fill in the fixed bits, which are defined by the encoding scheme. I have entered the fixed bits in red to make them distinct from the variable bits. Now we will write one or more Unicode codepoints on the form. These are the codepoints we will encode into UTF-8. The codepoints should be written in hexadecimal. I will use the codepoints U+0444 and U+597D. So, how do we determine where the codepo...
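The pencil-and-eraser steps can also be sketched in code. The following Python is a minimal illustration, not the form from the post: the function names `utf8_encode` and `utf8_decode` are my own, and the decoder deliberately skips some checks (such as rejecting overlong sequences) that a production decoder would need. The fixed bits appear as the binary constants; the variable bits are shifted and masked out of the codepoint.

```python
def utf8_encode(codepoint: int) -> bytes:
    """Encode one Unicode codepoint as UTF-8 (hypothetical helper)."""
    if not 0 <= codepoint <= 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError(f"not an encodable codepoint: {codepoint:#x}")
    if codepoint <= 0x7F:            # 1 byte:  0xxxxxxx
        return bytes([codepoint])
    if codepoint <= 0x7FF:           # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([
            0b11000000 | (codepoint >> 6),
            0b10000000 | (codepoint & 0b111111),
        ])
    if codepoint <= 0xFFFF:          # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11100000 | (codepoint >> 12),
            0b10000000 | ((codepoint >> 6) & 0b111111),
            0b10000000 | (codepoint & 0b111111),
        ])
    return bytes([                   # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0b11110000 | (codepoint >> 18),
        0b10000000 | ((codepoint >> 12) & 0b111111),
        0b10000000 | ((codepoint >> 6) & 0b111111),
        0b10000000 | (codepoint & 0b111111),
    ])


def utf8_decode(data: bytes) -> int:
    """Decode the UTF-8 bytes of exactly one codepoint (simplified sketch)."""
    first = data[0]
    if first < 0x80:                 # 0xxxxxxx
        length, value = 1, first
    elif first >> 5 == 0b110:        # 110xxxxx
        length, value = 2, first & 0b11111
    elif first >> 4 == 0b1110:       # 1110xxxx
        length, value = 3, first & 0b1111
    elif first >> 3 == 0b11110:      # 11110xxx
        length, value = 4, first & 0b111
    else:
        raise ValueError("invalid leading byte")
    if len(data) != length:
        raise ValueError("wrong number of bytes for this leading byte")
    for byte in data[1:]:
        if byte >> 6 != 0b10:        # continuation bytes carry the fixed bits 10
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0b111111)
    return value


for cp in (0x0444, 0x597D):
    encoded = utf8_encode(cp)
    print(f"U+{cp:04X} -> {encoded.hex(' ').upper()}")
# U+0444 -> D1 84
# U+597D -> E5 A5 BD
```

The results can be checked against Python's built-in codec, for example `chr(0x597D).encode("utf-8")`.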

Computer Science Internationalization - Text Search

So, you have just written some Cool Code which will search for and find occurrences of specified text strings. You have access to Big Data text, e.g. all the text in all public webpages. You will, of course, want to test your Cool Code. Letʼs perform some seemingly very simple tests. Letʼs search for the word 'Scorpion'. Your code works just fine and finds all occurrences of the word 'Scorpion'. Now test with the following two words. Scorpion Scorpion Your Cool Code still works, because all I have done is apply some CSS styling, giving the two words different appearances. Now test your Cool Code with the following two words. 𝑆𝑐𝑜𝑟𝑝𝑖𝑜𝑛 𝐒𝐜𝐨𝐫𝐩𝐢𝐨𝐧 If you have only programmed for ASCII text then your now not-so-Cool Code will fail. These two words look different because they are not made up of the ASCII characters you are familiar with. These words use characters from the Unicode Mathematical Alphanumeric Symbols block, U+1D400-1D7FF. Shoul...
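One possible way to make a search tolerant of such characters, offered here as a sketch rather than as the approach the post goes on to describe, is to normalize both the text and the search term to Unicode Normalization Form KC before comparing. NFKC replaces compatibility characters, including the mathematical italic and bold letters, with their plain equivalents. The helper names `search_key` and `contains` are hypothetical.

```python
import unicodedata


def search_key(text: str) -> str:
    """Fold text to a comparison key: NFKC normalization plus case folding."""
    return unicodedata.normalize("NFKC", text).casefold()


def contains(haystack: str, needle: str) -> bool:
    """Return True if needle occurs in haystack after normalization."""
    return search_key(needle) in search_key(haystack)


page_text = "Beware of the 𝑆𝑐𝑜𝑟𝑝𝑖𝑜𝑛 and the 𝐒𝐜𝐨𝐫𝐩𝐢𝐨𝐧."

print("Scorpion" in page_text)          # naive search: False
print(contains(page_text, "Scorpion"))  # normalized search: True
```

Whether a search should treat these characters as equal to their ASCII counterparts is a design decision; NFKC also folds other compatibility characters, such as full-width letters and ligatures.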

Computer Science Internationalization - Hieroglyphs in Domain Names

I have been aware for a long time that domains such as .com support many human language scripts. Verisign's .com includes support for Hiragana, Gurmukhi, Han, Tibetan, Sinhala, Devanagari, Hangul and many more. But what of Verisign's .com equivalents .コム (Japanese) and .닷컴 (Korean)? Both of these support a multitude of human language scripts. The supported scripts for many, but not all, domains are listed in the IANA Repository of IDN Practices, iana.org/domains/idn-tables. Whilst browsing this repository, I discovered there are sixteen domains, all belonging to Verisign, that support Egyptian Hieroglyphs, which I think is totally cool! Verisign's .com, .コム and .닷컴 all support Egyptian Hieroglyphs. This means one can register domain names such as: 𓇋𓈖𓏏𓂝𓂋𓈖𓄿𓏏𓇋𓍯𓈖𓄿𓃭.com 𓇋𓈖𓏏𓂝𓂋𓈖𓄿𓏏𓇋𓍯𓈖𓄿𓃭.コム 𓇋𓈖𓏏𓂝𓂋𓈖𓄿𓏏𓇋𓍯𓈖𓄿𓃭.닷컴 It is possible you do not have an Egyptian Hieroglyph font on your device so here are the domain names in image format. Google provide ...
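In the DNS, a non-ASCII label such as the hieroglyph name above travels in an ASCII-compatible form: the prefix `xn--` followed by the Punycode encoding of the label. The sketch below uses Python's built-in `punycode` codec to show that conversion. It is only an illustration: the helper names `to_ace_label` and `from_ace_label` are my own, and the code performs none of the IDNA mapping or validation that a registry or browser would apply.

```python
def to_ace_label(label: str) -> str:
    """Convert one domain label to its ASCII-compatible form (no IDNA validation)."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def from_ace_label(ace: str) -> str:
    """Convert an xn-- label back to Unicode; other labels pass through unchanged."""
    if ace.lower().startswith("xn--"):
        return ace[4:].encode("ascii").decode("punycode")
    return ace


label = "𓇋𓈖𓏏𓂝𓂋𓈖𓄿𓏏𓇋𓍯𓈖𓄿𓃭"
ace = to_ace_label(label)
print(ace + ".com")                   # the ASCII form of the domain name
print(from_ace_label(ace) == label)   # the conversion is reversible
```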

Computer Science Internationalization - Adaptive URL

A URL can consist of a Domain Name and a pathname. In the examples below x.y.z represents the Domain Name, the remainder being the pathname. My experience of the internet is that the pathname is usually written in English or, more accurately, ASCII. The ASCII pathname below represents a multi-page website in the form of a journey from home to a hotel in Korea. x.y.z/home/bus/airplane/korea/taxi/hotel Websites, such as Google, adapt the language of their text content according to the browser preferred display language (BL). This browser preferred language can be set by the user. Letʼs go one step further than Google and adapt the language of the URL pathname according to the BL. Here is the ASCII pathname rewritten into Chinese, Japanese and Korean. x.y.z/家/公共汽车/飞机/韩国/出租车/饭店 x.y.z/ホーム/バス/飛行機/韓国/タクシー/ホテル x.y.z/홈/버스/비행기/한국/택시/호텔 So, how do we implement these language adaptive URL pathnames? Firstly, we need to programmatically determine the BL. One way of achieving this is to examine the A...
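Once the BL is known, choosing the pathname can be as simple as a table lookup. The sketch below assumes the BL has already been obtained as a language tag such as `zh`, `ja` or `ko-KR`; how it is obtained is outside this example. The table `PATH_SEGMENTS` and the functions `localized_path` and `wire_path` are hypothetical names, and the segments are the ones from the example pathnames above.

```python
from urllib.parse import quote

# Pathname segments per language, taken from the example journey.
PATH_SEGMENTS = {
    "en": ["home", "bus", "airplane", "korea", "taxi", "hotel"],
    "zh": ["家", "公共汽车", "飞机", "韩国", "出租车", "饭店"],
    "ja": ["ホーム", "バス", "飛行機", "韓国", "タクシー", "ホテル"],
    "ko": ["홈", "버스", "비행기", "한국", "택시", "호텔"],
}


def localized_path(browser_language: str, default: str = "en") -> str:
    """Return the pathname for the given language tag, falling back to the default."""
    primary = browser_language.split("-")[0].lower()  # "ko-KR" -> "ko"
    segments = PATH_SEGMENTS.get(primary, PATH_SEGMENTS[default])
    return "/" + "/".join(segments)


def wire_path(path: str) -> str:
    """Percent-encode the UTF-8 bytes of a pathname, as it is sent in an HTTP request."""
    return quote(path)


print(localized_path("ko-KR"))  # /홈/버스/비행기/한국/택시/호텔
print(wire_path("/홈"))          # /%ED%99%88
```

Browsers typically display the readable form in the address bar while sending the percent-encoded form to the server, so the server side needs to decode the pathname before matching it.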

Computer Science Internationalization - EAI

As I stated in schappo.blogspot.co.uk/2017/01/chinese-email-address.html both DataMail and Google Mail support Email Address Internationalization (EAI). DataMail provides a complete EAI service, which includes both support for and creation of internationalized email addresses. Google Mail provides a partial EAI service: it supports EAI but does not yet provide for the creation of internationalized email accounts with internationalized email addresses. Thus organisations using Google Mail have an advantage over organisations with an ASCII-address-only email service and have a head start in providing a complete EAI service. Given the Domain name of an organisation, the Unix host command can be used to determine the mail service provider. Here are some of the organisations using Google Mail: 苹果电脑 ~: host spotify.com spotify.com has address 194.132.198.198 spotify.com has address 194.132.197.198 spotify.com has address 194.132.198.149 spotify.com mail is handled by 10 ASPMX3....

Computer Science Internationalization - Unicode Terminal Session

Below is an OS X bash shell command line terminal session. It is a real, working terminal session using basic Unix commands. It does, though, look significantly different from a standard terminal session. If you know basic Unix commands such as ls and cd, you may be able to work out what is happening. 苹果电脑 ~: 妈 我的目录 苹果电脑 ~: 茶 我的目录 苹果电脑 我的目录: 丽 苹果电脑 我的目录: 头 文档一 文档二 文档三 苹果电脑 我的目录: 丽 文档一 文档三 文档二 苹果电脑 我的目录: 词 > 文档四 一 二 三 四 五 六 苹果电脑 我的目录: 词 文档四 一 二 三 四 五 六 苹果电脑 我的目录: 丽 文档一 文档三 文档二 文档四 苹果电脑 我的目录: ⇉ 文档四 文档五 苹果电脑 我的目录: 丽 文档一 文档三 文档二 文档五 文档四 苹果电脑 我的目录: → 文档一 文档六 苹果电脑 我的目录: 丽 文档三 文档二 文档五 文档六 文档四 苹果电脑 我的目录: So, what is happening? Firstly, I am using Unicode characters. If you search the internet you will find many examples of terminal sessions, but they will invariably use ASCII characters only. In my terminal session above I am using Unicode characters, mostly Chinese/Japanese and two arrow symbol characters. Where are the commands such as ls and cd? I have mapped a...

Chinese Email Address

The latest and hottest news is that I now have a Chinese email address ➜ 小山@电邮.在线 😄 小山 is my adopted Chinese name 电邮 means email 在线 means online I acquired my free Chinese email address from DataMail, which supports email addresses in twelve languages: العَرَبِيَّة‎‎ Arabic, বাংলা Bengali, 中文 Chinese, English, ગુજરાતી Gujarati, हिन्दी Hindi, मराठी Marathi, ਪੰਜਾਬੀ Punjabi, ру́сский Russian, தமிழ் Tamil, తెలుగు Telugu, اُردُو‎ Urdu. Additionally, DataMail has an impressive family of IDNs (Internationalized Domain Names), with each language having its own IDN. Arabic داده.امارات Bengali ডাটামেল্.ভারত Chinese 电邮.在线 English datamail.in Gujarati ડાટામેલ.ભારત Hindi डाटामेल.भारत Marathi डेटामेल.भारत Punjabi ਡਾਟਾਮੇਲ.ਭਾਰਤ Russian почта.рус Tamil இந.இந்தியா Telugu డేటామెయిల్.భారత్ Urdu ڈاٹامیل.بھارت If you would like your own DataMail email address in one of the above languages then just click one of the above links. The website directs you to download an Android or iOS App. One uses t...