Posts

Showing posts from July, 2017

Computer Science Internationalization - Unicode Encoding & Decoding

Several years ago I devised this visual and fun way to teach and practise encoding and decoding Unicode. I used this method in my International Computing class. The method uses pencil and eraser. The codepoints and the UTF-8 bytes are all written in hexadecimal (hex). The binary bits are an intermediate form used only for the purposes of encoding and decoding. We start with the following form, which is designed both for encoding Unicode codepoints to UTF-8 and for decoding UTF-8 to Unicode codepoints.

Encoding: We will start with encoding Unicode codepoints to UTF-8. The first thing we can do is fill in the fixed bits, which are defined by the encoding scheme. I have entered the fixed bits in red to make them distinct from the variable bits. Now we will write one or more Unicode codepoints on the form. These are the codepoints we will encode into UTF-8, and they should be written in hexadecimal. I will use the codepoints U+0444 and U+597D. So, how do we determine where the codepo...
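The pencil-and-eraser procedure can also be sketched in code. The following Python is a minimal illustration, not the form from the post: the function names `utf8_encode` and `utf8_decode` are hypothetical, the fixed bits are the standard UTF-8 lead-byte and continuation-byte prefixes, and the decoder is deliberately simple (it does not reject overlong encodings).

```python
def utf8_encode(codepoint: int) -> bytes:
    """Encode one Unicode codepoint to UTF-8 by placing its bits after the fixed bits."""
    if not 0 <= codepoint <= 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("not an encodable Unicode codepoint")
    if codepoint < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([codepoint])
    if codepoint < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (codepoint >> 6),
                      0b10000000 | (codepoint & 0b111111)])
    if codepoint < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (codepoint >> 12),
                      0b10000000 | ((codepoint >> 6) & 0b111111),
                      0b10000000 | (codepoint & 0b111111)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (codepoint >> 18),
                  0b10000000 | ((codepoint >> 12) & 0b111111),
                  0b10000000 | ((codepoint >> 6) & 0b111111),
                  0b10000000 | (codepoint & 0b111111)])


def utf8_decode(data: bytes) -> int:
    """Decode exactly one UTF-8 sequence back to its codepoint (simplified checks)."""
    first = data[0]
    if first < 0x80:
        length, value = 1, first
    elif first >> 5 == 0b110:
        length, value = 2, first & 0b11111
    elif first >> 4 == 0b1110:
        length, value = 3, first & 0b1111
    elif first >> 3 == 0b11110:
        length, value = 4, first & 0b111
    else:
        raise ValueError("invalid lead byte")
    if len(data) != length:
        raise ValueError("wrong number of bytes for this lead byte")
    for byte in data[1:]:
        if byte >> 6 != 0b10:   # every continuation byte starts with the fixed bits 10
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0b111111)
    return value


# The two codepoints used in the post, shown in hex.
print(utf8_encode(0x0444).hex())   # d184
print(utf8_encode(0x597D).hex())   # e5a5bd
```

The range checks play the role of choosing the right row of the form: the size of the codepoint decides how many bytes, and therefore which fixed bits, are needed.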

Computer Science Internationalization - Text Search

So, you have just written some Cool Code which will search for and find occurrences of specified text strings. You have access to Big Data text, e.g. all the text in all public webpages. You will, of course, want to test your Cool Code. Let's perform some seemingly very simple tests. Let's search for the word 'Scorpion'. Your code works just fine and finds all occurrences of the word 'Scorpion'. Now test with the following two words. Scorpion Scorpion Your Cool Code works fine, as all I have done is apply some CSS styling, giving each of the two words a different appearance. Now test your Cool Code with the following two words. 𝑆𝑐𝑜𝑟𝑝𝑖𝑜𝑛 𝐒𝐜𝐨𝐫𝐩𝐢𝐨𝐧 If you have only programmed for ASCII text then your now not so Cool Code will fail. These two words have a different appearance because they are not made up of the ASCII characters you are familiar with. These words use characters from the Unicode Mathematical Alphanumeric Symbols block, U+1D400-U+1D7FF. Shoul...
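One way to make such a search succeed is to normalize both the text and the search term before comparing. The sketch below is an assumption about a workable approach, not necessarily the one the post goes on to describe: it uses Python's standard `unicodedata.normalize` with the NFKC form, which maps the mathematical italic and bold letters to their plain counterparts. The helper names `fold` and `count_occurrences` are hypothetical.

```python
import unicodedata


def fold(text: str) -> str:
    """Map compatibility characters (e.g. math italic/bold letters) to plain ones."""
    return unicodedata.normalize("NFKC", text)


def count_occurrences(haystack: str, needle: str) -> int:
    """Count matches of needle in haystack after normalizing both sides."""
    return fold(haystack).count(fold(needle))


# Build the two styled words from their codepoints to avoid encoding problems.
italic = "".join(map(chr, (0x1D446, 0x1D450, 0x1D45C, 0x1D45F,
                           0x1D45D, 0x1D456, 0x1D45C, 0x1D45B)))
bold = "".join(map(chr, (0x1D412, 0x1D41C, 0x1D428, 0x1D42B,
                         0x1D429, 0x1D422, 0x1D428, 0x1D427)))
text = f"Scorpion {italic} {bold}"

print(text.count("Scorpion"))               # 1: the naive search misses the styled words
print(count_occurrences(text, "Scorpion"))  # 3
```

Note that normalization changes string lengths, so match positions found in the folded text do not map directly back to offsets in the original text.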