Spaghetti Programming

Searching PostgreSQL Dump Files for Hex Characters

While the contents of this post describe working with PostgreSQL dump files specifically, these same techniques described here can be applied to searching for characters using their hex codes in any type of plain text file.

In the Postgres world, sometimes you will have to load a database dump from a database with one character encoding into a new database with a different encoding and this annoying message appears:

ERROR: invalid byte sequence for encoding "UTF8": 0xfd;

This means that there is some character somewhere in the database dump file that is not encoded for UTF8 (or whatever encoding your database is set to). In the above example the character hex code is given: 0xfd. The offending character(s) will have to be removed or converted to UTF8 encoding. You can try to do something like this using the iconv tool:

iconv -f LATIN1 -t UTF8 name_of_dump_file.dump | psql name_of_database 2>&1 | grep ERROR

This will take the contents of the dump file and try to convert the encoding of the entire file from LATIN1 to UTF8 and then pipe it into the psql command. Usually this or some other combination of from and to encoding parameters will work, but sometimes it doesn't and you have to do some more digging.

This is where the importance of the character hex value given in the error message becomes relevant. You can search the dump file for all instances of that character with grep using its hex value like so:

grep $'\xfd' name_of_dump_file.dump

The magical part of that command (at least for me) is the parameter given to grep: $'\xfd'

That parameter can be broken down into two important parts. The first part is the $ in front of the single quoted string. This is telling the shell to evaluate what is inside the quotes. The second part is the \xfd. The \x means that you are referencing a hex code, and the fd is the actual value of the character in hex. When this is all combined like above you get back the character represented by the hex code. You can even do something like this as a test to see the character printed back to you on the terminal:

echo $'\xfd'

Once you identify the location of the characters in the dump file you can decide on some way to deal with them. You could invert grep to filter them out, use sed to replace them, or something else that I don't know about.

You could also edit the file in a text editor like vim. Caution must be exercised here though because you could have an enormous file when working with database dumps and by default vim like to make swap files. You can add set noswapfile to your .vimrc or open with with -n option to disable swap files. From within vim to search by hex code you can use /\%xfd

Here again the \%x means that you are referencing a hex code and fd is the actual hex code of the character to search for.

And here is a handy ASCII cheat sheet to lookup character hex codes.

+----+-----+----+-----+----+-----+----+-----+
| Hx | Chr | Hx | Chr | Hx | Chr | Hx | Chr |
+----+-----+----+-----+----+-----+----+-----+
| 00 | NUL | 20 | SPC | 40 |  @  | 60 |  `  |
| 01 | SOH | 21 |  !  | 41 |  A  | 61 |  a  |
| 02 | STX | 22 |  "  | 42 |  B  | 62 |  b  |
| 03 | ETX | 23 |  #  | 43 |  C  | 63 |  c  |
| 04 | EOT | 24 |  $  | 44 |  D  | 64 |  d  |
| 05 | ENQ | 25 |  %  | 45 |  E  | 65 |  e  |
| 06 | ACK | 26 |  &  | 46 |  F  | 66 |  f  |
| 07 | BEL | 27 |  '  | 47 |  G  | 67 |  g  |
| 08 | BS  | 28 |  (  | 48 |  H  | 68 |  h  |
| 09 | TAB | 29 |  )  | 49 |  I  | 69 |  i  |
| 0A | LF  | 2A |  *  | 4A |  J  | 6A |  j  |
| 0B | VT  | 2B |  +  | 4B |  K  | 6B |  k  |
| 0C | FF  | 2C |  ,  | 4C |  L  | 6C |  l  |
| 0D | CR  | 2D |  -  | 4D |  M  | 6D |  m  |
| 0E | SO  | 2E |  .  | 4E |  N  | 6E |  n  |
| 0F | SI  | 2F |  /  | 4F |  O  | 6F |  o  |
| 10 | DLE | 30 |  0  | 50 |  P  | 70 |  p  |
| 11 | DC1 | 31 |  1  | 51 |  Q  | 71 |  q  |
| 12 | DC2 | 32 |  2  | 52 |  R  | 72 |  r  |
| 13 | DC3 | 33 |  3  | 53 |  S  | 73 |  s  |
| 14 | DC4 | 34 |  4  | 54 |  T  | 74 |  t  |
| 15 | NAK | 35 |  5  | 55 |  U  | 75 |  u  |
| 16 | SYN | 36 |  6  | 56 |  V  | 76 |  v  |
| 17 | ETB | 37 |  7  | 57 |  W  | 77 |  w  |
| 18 | CAN | 38 |  8  | 58 |  X  | 78 |  x  |
| 19 | EM  | 39 |  9  | 59 |  Y  | 79 |  y  |
| 1A | SUB | 3A |  :  | 5A |  Z  | 7A |  z  |
| 1B | ESC | 3B |  ;  | 5B |  [  | 7B |  {  |
| 1C | FS  | 3C |  <  | 5C |  \  | 7C |  |  |
| 1D | GS  | 3D |  =  | 5D |  ]  | 7D |  }  |
| 1E | RS  | 3E |  >  | 5E |  ^  | 7E |  ~  |
| 1F | US  | 3F |  ?  | 5F |  _  | 7F | DEL |
+----+-----+----+-----+----+-----+----+-----+