So recently, I lost a big post on GOG due to A, not saving it elsewhere, and B, GOG's forum software erasing certain characters in the extended Unicode range and everything after them. This includes decorative marking and possibly even section markings, but especially emoji. The forums would be better with dragons.

Why is this?
Post edited August 21, 2015 by Darvond
Most likely it's an inconsistency between different parts of the site... Each and every TEXT-type column in a DB can have its own character set for exactly what it's meant to hold, so older tables might not be set to UTF-8. Meanwhile the site itself has to have the XML declaration or meta tag at the beginning saying what the output encoding is, so you have the PHP, and every single text field in the DB, that has to be set to UTF-8 for it to work...

Some fields obviously don't need this; ASCII will work fine for 99% of email addresses, usernames, password fields (hashed and salted), gift codes, etc.

So where would it specifically need UTF-8? User reviews, titles, chat, forum posts, and wishlists, wishlist comments, and... ummm... that's it?

They could use UTF-8 for the entirety of descriptions and titles for games, but if they didn't before, they'd need to do a whole site-wide change of formatting all at once, which probably means making a new table, converting, or taking the site offline and changing the entries while the DB won't be touched at all so it doesn't have to be down more than an hour...
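One possible mechanism for the "erased it and everything after it" symptom, offered as an assumption since nothing in this thread confirms how GOG stores posts: a storage layer that only accepts UTF-8 sequences of up to three bytes and cuts the text at the first longer one. MySQL's legacy utf8 character set is limited to three bytes per character in this way, and emoji need four. The sketch below only simulates that behaviour in Python; simulate_three_byte_store is a hypothetical name, not a real database call.

```python
def simulate_three_byte_store(text):
    """Keep characters until the first one that needs more than 3 UTF-8 bytes.

    Hypothetical stand-in for a column that cannot hold 4-byte sequences
    and silently truncates the value at the first offending character.
    """
    kept = []
    for ch in text:
        if len(ch.encode("utf-8")) > 3:
            break  # everything from here on is lost
        kept.append(ch)
    return "".join(kept)


dragon = "\U0001F409"  # DRAGON, a 4-byte character in UTF-8
print(repr(simulate_three_byte_store("Dragons " + dragon + " would be nice")))
print(repr(simulate_three_byte_store("\u308f\u305f\u3057 \u2615")))  # 3-byte chars survive
```

Under that assumption, hiragana and the coffee cup survive because they fit in three bytes, while the dragon and all text after it disappear.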
rtcvb32: -Interesting, interesting-
Incredibly plausible, but how did you dig that up?
Darvond: Incredibly plausible, but how did you dig that up?
I used the combination of tools that pairs a Linux server with PHP and MySQL... [s]LIMP I think....[/s] LAMP... This is back when I did a lot more web work, mostly for private use and trying to make an online game similar to dogwars... which I didn't finish. The only thing I didn't get too far on was JavaScript, where I wrote code only to verify that usernames/passwords met the requirements...
Darvond: This includes decorative marking and possibly even section markings, but especially emoji.
Sounds like code sanitizing. To prevent code injection. The "title field" for example behaves quite strangely ;)
classicgogger: Sounds like code sanitizing. To prevent code injection. The "title field" for example behaves quite strangely ;)
Perhaps too zealous?
Darvond: Perhaps too zealous?
Only if the following bytes could be a quote or double-quote. Sanitization either escapes the characters or converts them to harmless versions, both for SQL injection and for HTML codes... But that shouldn't be possible: if memory serves me right, in each following byte the upper 2 bits are fixed and the lower 6 bits are open for the extension, while the first byte ends up mostly being an 'x bytes long for this code' marker... So it shouldn't happen...
Post edited August 21, 2015 by rtcvb32
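The byte layout described above can be checked directly. This is a minimal sketch assuming Python 3; utf8_bits is a made-up helper name. It prints each UTF-8 byte of a character in binary, so the length marker in the first byte and the fixed "10" prefix of every following byte are visible.

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of one character as 8-digit binary strings."""
    return [format(b, "08b") for b in ch.encode("utf-8")]


dragon = "\U0001F409"  # DRAGON
bits = utf8_bits(dragon)
print(bits)

# The lead byte starts with 11110, meaning "this sequence is 4 bytes long".
print(bits[0].startswith("11110"))
# Every following (continuation) byte starts with the fixed bits 10.
print(all(b.startswith("10") for b in bits[1:]))
# No byte of the sequence can collide with an ASCII quote (0x22 or 0x27).
print(any(b in (0x22, 0x27) for b in dragon.encode("utf-8")))
```

Because every byte in a multi-byte sequence has its top bit set, none of them can be mistaken for an ASCII quote, which is why correct sanitizing alone should not mangle such characters.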
rtcvb32: Only if the following bytes could be a quote or double-quote. Sanitization either escapes the characters or converts them to harmless versions, both for SQL injection and for HTML codes... But that shouldn't be possible: if memory serves me right, in each following byte the upper 2 bits are fixed and the lower 6 bits are open for the extension, while the first byte ends up mostly being an 'x bytes long for this code' marker... So it shouldn't happen...
Your average emoji isn't made of HTML code, nor would it be valid in any programming situation. :P Typically, they're two UTF-16 code units (four bytes in UTF-8) wide.
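How wide a character is depends on the encoding, and it is easy to measure. A small sketch, assuming Python 3; widths is a hypothetical helper, and the sample characters are just ones that appear in this thread.

```python
def widths(ch):
    """Return (UTF-8 byte count, UTF-16 code unit count) for one character."""
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-le")) // 2  # 2 bytes per code unit
    return utf8_bytes, utf16_units


for ch in ["A", "\u308f", "\u2615", "\U0001F409"]:
    print("U+%04X" % ord(ch), widths(ch))
```

The letter A takes one byte, the hiragana and the coffee cup take three bytes in UTF-8, and the dragon takes four bytes in UTF-8 or two code units in UTF-16.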
Darvond: Your average emoji isn't made of HTML code, nor would it be valid in any programming situation. :P Typically, they're two UTF-16 code units (four bytes in UTF-8) wide.
Most, if not all, Unicode characters can be expressed by their decimal or hexadecimal numeric character reference, even if most do not have an HTML entity name. Examples: http://www.w3schools.com/charsets/ref_utf_symbols.asp.
Post edited August 21, 2015 by Maighstir
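As a rough illustration of those references, here is a sketch in Python 3 that turns every non-ASCII character into a decimal or hexadecimal reference. to_numeric_refs is a hypothetical helper; html.unescape from the standard library is used only to check the round trip.

```python
import html


def to_numeric_refs(text, hexadecimal=False):
    """Replace each non-ASCII character with an HTML numeric character reference."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)  # plain ASCII passes through untouched
        elif hexadecimal:
            out.append("&#x%X;" % ord(ch))
        else:
            out.append("&#%d;" % ord(ch))
    return "".join(out)


coffee = "\u2615"
print(to_numeric_refs(coffee))                    # decimal form
print(to_numeric_refs(coffee, hexadecimal=True))  # hexadecimal form

# A browser (or html.unescape) turns the reference back into the character.
print(html.unescape(to_numeric_refs("tea or " + coffee)) == "tea or " + coffee)
```

For the decimal form alone, Python's built-in "xmlcharrefreplace" error handler gives the same result: coffee.encode("ascii", "xmlcharrefreplace").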
わたし

Can you read the Japanese text above?
dtgreene: わたし

Can you read the Japanese text above?
The hiragana characters wa, ta, and shi? Yes.
rtcvb32: Most likely it's an inconsistency between different parts of the site... Each and every TEXT-type column in a DB can have its own character set for exactly what it's meant to hold, so older tables might not be set to UTF-8. Meanwhile the site itself has to have the XML declaration or meta tag at the beginning saying what the output encoding is, so you have the PHP, and every single text field in the DB, that has to be set to UTF-8 for it to work...

Some fields obviously don't need this; ASCII will work fine for 99% of email addresses, usernames, password fields (hashed and salted), gift codes, etc.

So where would it specifically need UTF-8? User reviews, titles, chat, forum posts, and wishlists, wishlist comments, and... ummm... that's it?

They could use UTF-8 for the entirety of descriptions and titles for games, but if they didn't before, they'd need to do a whole site-wide change of formatting all at once, which probably means making a new table, converting, or taking the site offline and changing the entries while the DB won't be touched at all so it doesn't have to be down more than an hour...
There's no real reason to have database fields that are ASCII-only. Algorithms that can handle ASCII-range UTF-8 without any meaningful slowdown are extremely well-researched at this point.

If our forums have problems with UTF-8, it's just going to be clunky forum software.
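The point about ASCII-only fields rests on the fact that ASCII text has exactly the same bytes when encoded as UTF-8, so nothing is lost by declaring such a field UTF-8. A quick check, assuming Python 3; the function name and the sample values are made up for illustration.

```python
def identical_in_ascii_and_utf8(text):
    """True if the text is pure ASCII and its UTF-8 bytes equal its ASCII bytes."""
    try:
        return text.encode("ascii") == text.encode("utf-8")
    except UnicodeEncodeError:
        return False  # contains a character outside the ASCII range


# Hypothetical field values of the kind mentioned earlier in the thread.
for value in ["user@example.com", "GIFT-CODE-1234", "dragon \U0001F409"]:
    print(repr(value), identical_in_ascii_and_utf8(value))
```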
jsjrodman: If our forums have problems with UTF-8, it's just going to be clunky forum software.
Or maybe overuse of, or overreliance on, JavaScript everywhere?
dtgreene: わたし

Can you read the Japanese text above?
Those aren't exactly advanced in terms of formatting.
Maighstir: Most, if not all, Unicode characters can be expressed by their decimal or hexadecimal numeric character reference, even if most do not have an HTML entity name. Examples: http://www.w3schools.com/charsets/ref_utf_symbols.asp.
☕ That much I've understood, but have never been able to quite replicate.
Post edited August 22, 2015 by Darvond
The use of different encodings is a developer's nightmare. Every time I encounter a bug that smells like an encoding issue I despair, because I know it will most likely take ages to fix, all because someone else didn't know how to use encodings properly.