An update regarding my database encoding issues

1 Star2 Stars3 Stars4 Stars5 Stars (4 votes, average: 4.00 out of 5)
Loading...
By Oliver (AKA the Admin) on 20 comments
in Categories: Just Talking

not sure it's still a munted clusterfuck, or we're almost out of the woods already

Hello guys, thank you very much for the help and suggestions you gave regarding my database encoding issues!
I haven’t replied to anyone, but there has been lots of feedback, I’m truly grateful :)

(If you missed it, I explain everything in the call for help I posted 2 days ago)

I thought I could post an udpate on the topic. Once again, I’m humbly asking for your opinion and feedback regarding the progress so far :jap:

*

-1- Several suggestions I received were too technical and/or too generic for me. Telling that’s a broad problem requiring to check on everything from Apache to the headers… well, yeah… but I’m skilled in neither of these. Jack of all trades, master of none, I may host websites but it’s not my job and I’m self-taught with what I know.

-2- I received 2 groups of solutions that would work. They’re not 100% satisfactory because they would cure the symptoms of the problem but not address the problem itself, would you see the idea? If I apply them, I suspect that, the next time I must dump and re-import the database of Hentairules from a server to a new server, the problem may happen again.

*

Those solutions that would work are:

*

(A) A massive list of SQL commands to run, UPDATE wp_posts SET post_content = REPLACE, targeting each of the characters that was transformed. Issues: some characters pose a problem (e.g. damn you semicolon), and there’s a strong chance to forget some rarely used bug-characters in the process.

I occasionally replace myself strings like that in the database of Hentairules (unless you ask the waybackmachine, you’ll never have proof I made the same spelling mistake in a certain verb during 5 years, oh the embarrassment), so it’s a procedure I’m not stranger to.

*

(B) the Notepad++ route. I was kindly suggested by a commenter called Andrew M to open the .SQL database dump in Notepad++, click to Encode > Convert to ANSI, and then, afterwards, Encode > Encode in UTF-8.
That method makes almost everything become normal again, save rare exceptions such as the characters in the middle of kaomojis like ?\_(?)_/?

EDIT: oh, god damnit. There’s another issue. I pasted a working kaomoji and got a question mark in its middle when I save my post. On the previous server, I could post kaomojis into my posts, they would work. Screenshot:

I thought it wouldn’t work with me, as the Notepad++ would freeze on me when I asked it to convert even a naked dump of just wp_posts (“only” 170 MB for its sql file).
However, in fact, if after I click to ask for a conversion or encoding, I leave the program alone for several minutes without clicking anymore inside N++, then when I’m back to my PC, the N++ is responding again and agrees to save the file, WOOT!! YESSSSS!!! :kickass:

I’ve tested that wp_posts table, reintegrated into a test database, and indeed, the problem is gone for all the old posts.

… And yet, I’m not sure it is REALLY the final fix I must apply.

Please, tell me if I’m wrong, but my problem, in the past, came from the fact my database was in Latin1 instead of UTF8, or even better, UTF8-mb4, but I attempted to make it work in an UTF-8 environment, right?

So, even though I ask the Notepad++ to convert the file to Notepad++, after I upload the .sql file back to my server and execute it as a query to re-import it, my database engine will be none the wiser and will still think it’s latin1, right?
I confirmed it with phpmyadmin, the test database, its wp_posts was still marked as being latin1.
Thus, the same improper export-import could happen again in the future if I’m unlucky ?

Following that, should I make additional operations, to “officially” make it an UTF-8 or UTF-8-mb4 database?

Or is it now okay, really okay, no future worries, to leave it as latin-1? What would you think?

(If the info is relevant, here is the encoding of my blog tables)

*

Unless someone comes up with the magical SQL query that fixes every problem and leaves me with an UTF-8mb4 database ;)

*

So, yeah, I think at the end of the day, the question would be: do you think I can safely go with the Notepad++ convert+encode route, and feel safe for the future, or must additional operations be performed? In your opinion?

Once again, thank you if you can provide feedback or help, I’m seriously grateful to you guys :)

*

Super late edit: scroll up, control-F “kaomoji”: and there’s the problem I can’t paste kaomojis anymore into the posts without their central character becoming corrupted, I could post them on the previous server that hosted hentairules, and yet, I tripled-checked, still latin1_swedish_ci, same encoding in phpmyadmin on the two servers, old and new.
It can’t be normal, maybe it’s also a symptom, or a clue? O_o

Subscribe
Notify of
guest

20 Comments
oldest
newest most voted
Inline Feedbacks
View all comments
Flinko
Flinko
5 years ago

If you send a list of tables with fields, i can do a sql script that applies the solution on the link you posted before.
Link > https://coderwall.com/p/gjyuwg/mysql-convert-encoding-to-utf8-without-garbled-data

Votuq
Votuq
5 years ago

Hello Oliver

Have you removed the site RSS feed?
I get an error trying to access it from my Reader, and the validator gives a 500 Error.

kkai
kkai
5 years ago

The feed began feeding again. Yey!

kkai
kkai
5 years ago

For me it seemed to, for one batch of posts at least (which I checked against the site to make sure nothing was missing). I’ll keep an eye on it as well.

Votuq
Votuq
5 years ago

It was working yesterday, but now it seems to be invalid.

Votuq
Votuq
5 years ago

The RSS reader I use is quite fast to refuse a half-wrong feed, but I also tried with Thunderbird and it refuses it.
If I open it on Firefox, the last entry is Mujaki Na Kaibutsu, and coincidentally the validator points to an error in the following post (azukiko-single-mother-to-issho-ni-boku-no-mamakatsu-1)

http://www.feedvalidator.org/check.cgi?url=http%3a%2f%2fwww.hentairules.net%2ffeed%2f

FSK
FSK
5 years ago

Do a mySQL dump schema on both databases (the old one and the new version), and post it here. There’s a setting for character set when you create the table. You probably did it wrong/different in the new database.

https://www.mysqltutorial.org/mysql-character-set/

https://electrictoolbox.com/mysqldump-schema-only/

JamieWolf
JamieWolf
5 years ago

Hey Oliver,

not only the charset matters, also the collate (which is what you see in your screenshot). It is in charge of how things are sorted when you do comparison SQL Queries.

But still in both dumps your charset is still either latin1 or utf8 and not the utf8mb4 you want to use ;-)

Mathias https://mathiasbynens.be/notes/mysql-utf8mb4 has written an excelent guide on that matter, but it is very technical.

JamieWolf
JamieWolf
5 years ago

Hey Oliver,

Notepad++ in gerneal is fine, but as you noticed, doing replaces on huge files is not the task it was made for. For such use a linux cli tool called sed, with this little champ you could work on 50GB+ Datafiles ;-)

Still people are correct, when they state you got to check other stuff as well. Currently your HTML Headers are fine, they state utf8. But the php connect command also has to use utf8 on connecting. Last but not least (current WIP problem) you data has to be stored in utf8.

p.s. no magic SQL command, but some that could get quite close.

JamieWolf
JamieWolf
5 years ago
Reply to  JamieWolf

But to answer your questions:
– generally move to utf8mb4
– Notepad++ convert+encode route: fesable but would not recommend. Use tools like sed, iconv etc. which are made for the job
– first fix your data encoding
– second fix maybe old leftover stuff with sed regex replace commands
– third fix table struc charset and collate
– forth import data correctly
– fifth use mysqldump with correct parameters to backup your DB (if possible) and not PHPmyadmin

Have a good day sir and stay healthy!

JamieWolf
JamieWolf
5 years ago

Haha its late for me too. What I mean with fix the data is the big SQL Dump you have (from whatever it is to uft8 encoding, then the wierd leftover stuff). Then create the correct table stucture and import.

I also send you a very technical email :-) which bascially says make a plan, then execute and test

Needa
Needa
5 years ago

Oh my, that kaomoji is cutely lewd! You pervy Oliveey~?

FSK
FSK
5 years ago

Your RSS feed is broken now.