How to move Unicode data from Excel/csv file to MySQL table?

How to move Unicode data from Excel/csv file to MySQL table?

Post by Kai Schaet » Thu, 06 Aug 2009 21:31:31


I'm hitting a problem with using LOAD DATA to import csv files with non-ASCII
languages to MySQL.
The data is from Excel sheets which I exported with the "Unicode (txt)" option.
This produces a tab-delimited file that is encoded UCS-2 Little Endian (according
to Notepad++). I convert this in Notepad++ either to ANSI or to ANSI/UTF-8
without BOM and then import to MySQL. The result is the same.
- only the first line gets imported
- there seems to be a problem with the detection of the line-end as the last
field also gets the first field of the next line attached to it (and then it
stops)
- non-ASCII characters seem to display correctly (the first line has only two,
though), so the way of encoding the characters is the right one

Example:
a_saefte;10;;nahrungsmittel;10_obst;Ste;juices;Jus;Succhi;Zumos;(Meyve) sulari
is imported as:
a_saefte 10 nahrungsmittel 10_obst Ste juices Jus Succhi Zumos (Meyve) sulari
a_zitrusfruechte

(copied from PHPMyAdmin, international characters may not get encoded correctly
in this posting, forget about that)
a_zitrusfruechte is from the next line but is included in the last field *with*
the linebreak!

Command:
mysql> LOAD DATA LOCAL INFILE '/tmp/Allergene_2.1.ansi.txt' INTO TABLE
`allergene` FIELDS TERMINATED BY ';' ENCLOSED BY '"' ESCAPED BY '\\' LINES
TERMINATED BY '\r\n';
Query OK, 1 row affected (0.00 sec)
Records: 1 Deleted: 0 Skipped: 0 Warnings: 0

The lines *are* terminated by \r\n. Field order and count match exactly. The
problem happens only when I import "correctly" encoded characters. When I import
a csv delimited file (produced by Excel) all lines get imported but the encoding
is not correct, so that characters and the parts after them are missing.

I don't see any other way to export these data from Excel. Exporting to csv
spoils all characters which are not in the current Windows codepage.

MYSQL ist 5.0.45 on CentOS 5. Kollation of the table cells with international
characters is utf8_unicode_ci.

Can this problem be easily solved or what am I doing wrong? Thanks!

Kai
--
Conactive Internet Services, Berlin, Germany
 
 
 

How to move Unicode data from Excel/csv file to MySQL table?

Post by Jonatha » Thu, 06 Aug 2009 22:17:46


If your data is ecoded like above I am not sure this is the proper command:


Since you do not enclose or escape your fields at all, perhaps you need
to adjust the ENCLOSED BY and ESCAPED BY sequences more like this?

LOAD DATA LOCAL INFILE
'/tmp/Allergene_2.1.ansi.txt'
INTO TABLE
`allergene`
FIELDS TERMINATED BY ';'
ENCLOSED BY ''
ESCAPED BY ''
LINES TERMINATED BY '\r\n';

> The lines *are* terminated by \r\n.

BTW Are you really sure that the line ending after conversion still is \r\n?

Kind regards,

Jonathan

 
 
 

How to move Unicode data from Excel/csv file to MySQL table?

Post by Peter H. C » Thu, 06 Aug 2009 22:42:16


So, table's utf8, file's utf8. Is your database connection utf8?

--
I picked up a Magic 8-Ball the other day and it said 'Outlook not so good.'
I said 'Sure, but Microsoft still ships it.'
-- Anonymous
 
 
 

How to move Unicode data from Excel/csv file to MySQL table?

Post by Kai Schaet » Fri, 07 Aug 2009 00:18:33

Jonathan schrieb am Wed, 05 Aug 2009 15:17:46 +0200:


Thanks for the tip. Unfortunately: same result.


Yes. Notepad++ shows CRLF at the end and a hex editor shows 0d 0a. It looks
like it cannot identify the \r\n sequence, right?

I've just looked at http://www.yqcomputer.com/
and found the CHARACTER SET option. Unfortunately, it doesn't help either. It
clearly changes the encoding of the characters in the fields, but it doesn't
seem to affect the identification of the \r\n line-break characters. I also
tried CHARACTER SET usc2 on the original file, but this fails completely with
a lot of warnings (not shown).

Kai
--
Conactive Internet Services, Berlin, Germany
 
 
 

How to move Unicode data from Excel/csv file to MySQL table?

Post by Kai Schaet » Fri, 07 Aug 2009 00:18:34

Peter H. Coffin schrieb am Wed, 5 Aug 2009 08:42:16 -0500:


No.
| character_set_client | latin1
| character_set_connection | latin1
| character_set_database | utf8
| character_set_filesystem | binary
| character_set_results | latin1
| character_set_server | latin1
| character_set_system | utf8
| character_sets_dir | /usr/share/mysql/charsets/
| collation_connection | latin1_swedish_ci
| collation_database | utf8_unicode_ci
| collation_server | latin1_swedish_ci

I have now temporarily changed character_set_client, _connection and -
server. No change :-( I think you cannot expect one, anyway. I think this
only affects *data* and apparently there's only that one line considered
data. It doesn't affect \r\n. \r\n should always be \r\n, right? I wonder
if I just add ...

Ok, I added a "|" before the \r\n

LOAD DATA LOCAL INFILE '/tmp/Allergene_2.1.ansi.break.txt' INTO TABLE
`allergene` FIELDS TERMINATED BY ';' ENCLOSED BY '' ESCAPED BY '' LINES
TERMINATED BY '|';
Query OK, 105 rows affected, 30 warnings (0.00 sec)
Records: 105 Deleted: 0 Skipped: 0 Warnings: 30

And it's all showing up correctly. So, \r\n is not the correct syntax if
the character data is unicode or utf-8. I tried \x0d\x0a, \u000d\u000a,
\000d\000a. Nothing fits. What#s the correct way?

Btw, how can I enable output of those warnings? (Probably field too
small.)

Kai
--
Conactive Internet Services, Berlin, Germany
 
 
 

How to move Unicode data from Excel/csv file to MySQL table?

Post by Kai Schaet » Fri, 07 Aug 2009 00:57:17

Kai Schaetzl schrieb am Wed, 05 Aug 2009 17:18:34 +0200:


It seems that character sequence gets completely ignored. If I use "|" to
terminate a line then the next record should contain the \r\n plus the real
data in the first field, shouldn't it? But it doesn't, it's just gone.
Another possibility is that it uses the "|" to terminate the line for data
input, then ignores everythign after it and starts in a new line with the
next record.

Kai
--
Conactive Internet Services, Berlin, Germany
 
 
 

How to move Unicode data from Excel/csv file to MySQL table?

Post by Kai Schaet » Fri, 07 Aug 2009 02:16:06

Kai Schaetzl schrieb am Wed, 05 Aug 2009 17:57:17 +0200:


I spoke too soon, it wasn't visible in PHPMyAdmin. It's visible in my output
and in direct selects. The \r\n gets put in the first field. So, I really need
to know how I can match \r\n in this file.

Kai
--
Conactive Internet Services, Berlin, Germany
 
 
 

How to move Unicode data from Excel/csv file to MySQL table?

Post by Kai Schaet » Fri, 07 Aug 2009 02:29:54

Kai Schaetzl schrieb am Wed, 05 Aug 2009 19:16:06 +0200:


Ok, got it. First, you have to convert to unix format, so only one
character ends the line, then you use
LOAD DATA LOCAL INFILE '/tmp/Allergene_2.1.ansi.break.txt' INTO TABLE
`allergene` FIELDS TERMINATED BY ';' ENCLOSED BY '' ESCAPED BY '' LINES
TERMINATED BY 0x0a;

(note the missing quotes around the termination character!)
Hope this helps someone else in the same situation.

Kai
--
Conactive Internet Services, Berlin, Germany
 
 
 

How to move Unicode data from Excel/csv file to MySQL table?

Post by Peter H. C » Fri, 07 Aug 2009 05:37:37


Line termination defaults to line terminated by newline characters, so
you may more reliably get what you want by omitting it entirely.

--
Don't use this code for realtime control, for weapons systems, or for
anything else that may put life or limb at hazard. It isn't man-rated,
it isn't really thing-rated, and we don't claim that it's worth a good
G*dDamn for anything at all, at all.
-- Mike Andrews, on Java compilers
 
 
 

How to move Unicode data from Excel/csv file to MySQL table?

Post by Kai Schaet » Sat, 08 Aug 2009 02:31:21

Peter H. Coffin schrieb am Wed, 5 Aug 2009 15:37:37 -0500:


Oh, yes? I was under the impression that it's not optional. Yeah, you are
right, it's optional. However, it doesn't help much in this case.

It defaults to '\n' only, so I have to remove the \r, anyway. If I then
use nothing or LINES TERMINATED BY '\n' or LINES TERMINATED BY 0x0a it all
works.
Just '\r\n' doesn't work (with \r not stripped yet).
Looks like a bug when utf8/doublebyte characters are present: you can use
only one termination character. Maybe it's already fixed in a newer
version of MySQL.

Anyway, thanks for your help and making me think. :-)


Kai
--
Conactive Internet Services, Berlin, Germany