
Author Topic: Weird filename encoding issues on windows  (Read 9761 times)

Dakusan

Weird filename encoding issues on windows
« on: February 08, 2016, 12:52:16 am »


Somehow all of the file names in my Rammstein music directory, and some in my Daft Punk directory, had their characters with diacritics replaced with an invalid character. I pasted one of these filenames into a hex editor to see what the problem was. First, I should note that Windows encodes its filenames (and pretty much everything else) in UTF-16. Most of the rest of the world has settled on UTF-8, which is a much better encoding for many reasons. So during some file copy/conversion at some point in the directories’ lifetime, the file names went through a freakish (utf16*)(utf16->utf8) rename, or something to that effect. I noticed that all I needed to do was replace the first 2 bytes of each diacritic character with a single different byte, namely “EF 8x” becomes “Cx”; the rest of the bytes for the character were fine. So if anyone ever needs it, here is the bash script.


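# Use the C locale so that grep -P matches raw bytes rather than UTF-8 characters
export LC_ALL=C
# Only regular files are renamed; damaged bytes in a directory name would need a separate pass
find . -type f | grep -P '\xEF[\x80-\x8F]' | while IFS= read -r FROM; do
   # Replace every "EF 8x" byte pair with the single byte "Cx" (8x + 0x40)
   TO=$(printf '%s\n' "$FROM" | perl -pe 's/\xEF([\x80-\x8F])/pack("C", ord($1)+(0xC0-0x80))/ge')
   echo "Renaming '$FROM' to '$TO'"
   mv -- "$FROM" "$TO"
done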

The script may need to cover more than the x80-x8F range, but I am unsure at this point. I have only confirmed the range x82-x83.
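
One way to reason about the range question is to decode the bytes by hand. Under the standard UTF-8 rules, EF 8x yy is a three-byte sequence for a code point in U+F000 and up, while Cx yy is a two-byte sequence for the code point exactly 0xF000 lower. That would fit the damage being a shift of each character by 0xF000, though this is an inference from the arithmetic, not something confirmed against the actual files. The sketch below (range_table is a hypothetical name) tabulates the rule for every x in the script's range.

```shell
# Sketch only: tabulate what the "EF 8x" -> "Cx" rule does for each x.
# The loop bounds mirror the script's \x80-\x8F range; no files are touched.
range_table() {
   x=0
   while [ "$x" -le 15 ]; do
      # EF 8x 80 decodes to 0xF000 + x*0x40; Cx 80 decodes to x*0x40
      damaged=$(( 0xF000 + x * 0x40 ))
      repaired=$(( x * 0x40 ))
      if [ "$x" -lt 2 ]; then
         note="C0/C1 never occur in well-formed UTF-8"
      else
         note="valid two-byte lead"
      fi
      printf 'EF %02X -> %02X  U+%04X.. -> U+%04X..  %s\n' \
         $(( 0x80 + x )) $(( 0xC0 + x )) "$damaged" "$repaired" "$note"
      x=$(( x + 1 ))
   done
}

range_table
```

The rows for x82 and x83 cover U+0080 through U+00FF, which is where the German umlauts live and matches the range confirmed above. The rows for x80 and x81 would produce the lead bytes C0 and C1, which are not allowed in well-formed UTF-8, so if those byte pairs ever show up in a name they probably need different handling. Whether x84 and above occur at all would depend on the files containing characters at U+0100 or higher.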
