Efficient format for huge amount of data


Post by gagenellin » Wed, 21 Jan 2004 14:51:45


I have to pass a huge amount of data to a Java program. The source
program is not written in Java but I have control over both programs
and can arrange any suitable format at both ends.

The dataset is a sequence of records, all records having the same
structure. This structure is only known at runtime, and it's built on
simple types like string, integer, double, etc.

I could use an ASCII file to transfer data, like this:

"A string", 123, 4.567, "X"
"Another string", 89, 10.0, "Y"
"Third line", -1, 0.0, "Z"
... many more lines, 100K or 1M ...

but AFAIK to parse it I have to use a BufferedReader + StringTokenizer
for each line + the various wrapper classes like Integer, Double... I
think this may be very slow for a large file.

Maybe a binary format is more efficient, but I don't know which could
be the best way, nor how to implement it.
I've considered using Java serialization (Serializable), but since the
source program is not written in Java it may be hard to replicate the
serialized stream format exactly. Where is it documented, if at all?

Any ideas are welcome.
Thanks,

Gabriel Genellina
Softlab SRL
 
 
 


Post by Marco Schm » Wed, 21 Jan 2004 15:31:25

Gabriel Genellina:

[...]


There are DataInputStream and DataOutputStream. Both have read and
write methods for Java's primitive types and for Strings. Byte order
is big-endian, valid ranges for the primitive types are defined in
the Java specs (e.g. char from 0 to 65535), and the String format
(a two-byte length followed by modified UTF-8) is described in the
API docs of readUTF/writeUTF.

So if an element would be like the data you described above, an
element class could be:

class Element {
    String s;
    int i;
    float f;
    String s2;
}

And reading and writing could work like that:

Element read(DataInputStream in) throws IOException {
    Element elem = new Element();
    elem.s = in.readUTF();
    elem.i = in.readInt();
    elem.f = in.readFloat();
    elem.s2 = in.readUTF();
    return elem;
}

void write(DataOutputStream out, Element elem) throws IOException {
    out.writeUTF(elem.s);
    out.writeInt(elem.i);
    out.writeFloat(elem.f);
    out.writeUTF(elem.s2);
}

There is no single best way of doing persistent storage. Personally
I'd work with databases whenever it's feasible. I don't like self-made
binary formats like the above very much. You can't change things
easily, at least not if you have to convert existing data from binary
format A to B. Other people will have to study your format and write
and maintain dedicated code.

However, the format is more efficient (less space and faster to parse)
than ASCII text.
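
Since the record structure in the original question is only known at runtime, the fixed Element class could be replaced by a list of field types that drives the read and write calls. A minimal sketch, assuming a current JDK; RecordCodec, FieldType and roundTrip are made-up names for illustration, not part of any library:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper: reads and writes records whose layout is given at runtime.
final class RecordCodec {
    enum FieldType { STRING, INT, DOUBLE }

    // Writes one record, field by field, in the order given by the schema.
    static void writeRecord(DataOutputStream out, FieldType[] schema, Object[] values)
            throws IOException {
        for (int i = 0; i < schema.length; i++) {
            switch (schema[i]) {
                case STRING: out.writeUTF((String) values[i]); break;
                case INT:    out.writeInt((Integer) values[i]); break;
                case DOUBLE: out.writeDouble((Double) values[i]); break;
            }
        }
    }

    // Reads one record using the same schema the writer used.
    static Object[] readRecord(DataInputStream in, FieldType[] schema) throws IOException {
        Object[] values = new Object[schema.length];
        for (int i = 0; i < schema.length; i++) {
            switch (schema[i]) {
                case STRING: values[i] = in.readUTF(); break;
                case INT:    values[i] = in.readInt(); break;
                case DOUBLE: values[i] = in.readDouble(); break;
            }
        }
        return values;
    }

    // Demo only: writes a record to memory and reads it back.
    static Object[] roundTrip(FieldType[] schema, Object[] values) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            writeRecord(out, schema, values);
            out.flush();
            DataInputStream in =
                    new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
            return readRecord(in, schema);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For real files the streams would normally be wrapped in BufferedInputStream / BufferedOutputStream, and the schema itself would have to be agreed on or written as a header; neither is shown here.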

Regards,
Marco
--
Please reply in the newsgroup, not by email!
Java programming tips: http://www.yqcomputer.com/
Other Java pages: http://www.yqcomputer.com/

 
 
 


Post by Thomas Sch » Wed, 21 Jan 2004 16:59:24


So be sure to use htons() / htonl() in the non-Java app before stuffing
the data on the stream.
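
To see what the non-Java side has to produce, it can help to dump the bytes Java itself writes. A small sketch (BigEndianDemo and encodeInt are illustrative names only) showing that DataOutputStream.writeInt emits the most significant byte first, which is the same order htonl() produces:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class BigEndianDemo {
    // Returns the four bytes DataOutputStream.writeInt produces for the value.
    static byte[] encodeInt(int value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // Prints the encoded bytes in hex, in the order they appear on the stream.
        for (byte b : encodeInt(0x01020304)) {
            System.out.printf("%02x ", b);
        }
        System.out.println();
    }
}
```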
 
 
 


Post by Andrew Hob » Wed, 21 Jan 2004 18:32:47


How large are you talking about? 1 Mbyte is not a large file. And what do
you consider too slow? Have you tried that approach? I suspect you will
find it faster than you think. Alternatively, what about writing a parser
yourself? Look at each character in turn, using the commas as
delimiters.
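
A hand-written splitter along those lines might look like the sketch below. It is only an illustration (LineSplitter is a made-up name): it walks the line one character at a time, treats commas outside double quotes as delimiters, and does not handle escaped quotes or blanks that must be preserved at the edges of a quoted string.

```java
import java.util.ArrayList;
import java.util.List;

final class LineSplitter {
    // Splits one line into fields on commas that are outside double quotes.
    // Quote characters are dropped and each field is trimmed.
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;           // toggle; the quote itself is not copied
            } else if (c == ',' && !inQuotes) {
                fields.add(current.toString().trim());
                current.setLength(0);           // start the next field
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString().trim());  // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        List<String> f = split("\"A string\", 123, 4.567, \"X\"");
        int i = Integer.parseInt(f.get(1));
        double d = Double.parseDouble(f.get(2));
        System.out.println(f.get(0) + " | " + i + " | " + d + " | " + f.get(3));
    }
}
```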

We wrote our own parser and reading a 1 MByte file off disc, parsing it into
floats and strings and then drawing the 3D structure that it represents
takes a fraction of a second. If you want to see what I mean then log onto
www.metasense.com.au and try the free trial version. Click on the Chemistry
and then the DNA folder and try out some of those molecules. The largest is
almost 1 M in size and it loads and displays on my machine in about 1/2
second. It might take longer for you depending upon the speed of your
connection.

Cheers

Andrew

MetaSense Pty Ltd - www.metasense.com.au
12 Ashover Grove
Carine W.A.
Australia 6020

61 8 9246 2026
XXXX@XXXXX.COM

*********************************************************
 
 
 


Post by Christian » Wed, 21 Jan 2004 20:58:15


<snip>

I wouldn't worry too much about speed. I've written something very similar,
and was able to parse a 600 MB text file using the method above in about a
minute. Your case may be a bit more time-consuming, but it will probably
still be fast enough.

Christian
 
 
 


Post by Thomas Wei » Wed, 21 Jan 2004 21:53:26


[...]

1M is not a huge amount of data. I eat that for breakfast - twice :-)


Try it. Slow is a relative term, but I don't think you will get in
trouble here.


A ByteBuffer might be the fastest.
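
As a rough idea of the ByteBuffer route, the sketch below packs and unpacks an int and a double with an explicit byte order. BufferDemo and its methods are illustrative names only; whether this is actually the fastest option would have to be measured.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class BufferDemo {
    // Packs an int followed by a double into 12 bytes using the given byte order.
    static byte[] pack(int i, double d, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + Double.BYTES).order(order);
        buf.putInt(i).putDouble(d);
        return buf.array();
    }

    // Reads the int stored at the start of the packed record.
    static int unpackInt(byte[] data, ByteOrder order) {
        return ByteBuffer.wrap(data).order(order).getInt(0);
    }

    // Reads the double stored right after the int.
    static double unpackDouble(byte[] data, ByteOrder order) {
        return ByteBuffer.wrap(data).order(order).getDouble(Integer.BYTES);
    }

    public static void main(String[] args) {
        // LITTLE_ENDIAN matches what an x86 program writes when it dumps raw values.
        byte[] raw = pack(89, 10.0, ByteOrder.LITTLE_ENDIAN);
        System.out.println(unpackInt(raw, ByteOrder.LITTLE_ENDIAN));
        System.out.println(unpackDouble(raw, ByteOrder.LITTLE_ENDIAN));
    }
}
```

A ByteBuffer is big-endian by default, so the order(...) call is only needed when the producer writes in a different order.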


AFAIR the low-level details are documented in the
Data[Output|Input]Stream and Object[Input|Output]Stream API
documentation. There is also the Java Object Serialization
Specification on Sun's Java web site.

/Thomas
 
 
 


Post by nos » Thu, 22 Jan 2004 00:13:27


I would put one value per line. This avoids tokenizing and
the file size doesn't change much.
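
With one value per line, reading reduces to one readLine() call per field. A minimal sketch, assuming the four-field layout from the sample data and string values that contain no line breaks; OneValuePerLine and Row are hypothetical names:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class OneValuePerLine {
    // Hypothetical record type matching the sample data in this thread.
    static final class Row {
        final String name;
        final int count;
        final double value;
        final String tag;

        Row(String name, int count, double value, String tag) {
            this.name = name;
            this.count = count;
            this.value = value;
            this.tag = tag;
        }
    }

    // Reads records until end of input; each field sits on its own line.
    static List<Row> readAll(BufferedReader in) throws IOException {
        List<Row> rows = new ArrayList<>();
        String name;
        while ((name = in.readLine()) != null) {
            int count = Integer.parseInt(in.readLine());
            double value = Double.parseDouble(in.readLine());
            String tag = in.readLine();
            rows.add(new Row(name, count, value, tag));
        }
        return rows;
    }

    // Convenience wrapper for in-memory text, used here for demonstration.
    static List<Row> parse(String text) {
        try {
            return readAll(new BufferedReader(new StringReader(text)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```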
 
 
 


Post by William Br » Thu, 22 Jan 2004 00:28:49


A StreamTokenizer would be much more flexible, and you would only need to
create one for the whole file.
Calling eolIsSignificant(true) makes end-of-line a token, which lets you
tell where each line ends.
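
A rough sketch of that approach follows (TokenizerDemo is a made-up name). It relies on StreamTokenizer's default settings, where double-quoted text comes back as a string, numbers are parsed as doubles, and a comma is reported as an ordinary character. Note that the defaults also treat '/' as a comment character and do not understand exponent notation such as 1e5.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class TokenizerDemo {
    // Parses comma-separated lines into rows of String and Double values.
    static List<List<Object>> parse(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.eolIsSignificant(true);              // report line ends as TT_EOL
        List<List<Object>> rows = new ArrayList<>();
        List<Object> row = new ArrayList<>();
        try {
            int t;
            while ((t = st.nextToken()) != StreamTokenizer.TT_EOF) {
                if (t == StreamTokenizer.TT_NUMBER) {
                    row.add(st.nval);           // every number arrives as a double
                } else if (t == '"' || t == StreamTokenizer.TT_WORD) {
                    row.add(st.sval);           // quoted string or bare word
                } else if (t == StreamTokenizer.TT_EOL) {
                    if (!row.isEmpty()) {
                        rows.add(row);
                    }
                    row = new ArrayList<>();
                }
                // any other token, such as ',', is a separator and is skipped
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (!row.isEmpty()) {
            rows.add(row);                      // last line without a trailing newline
        }
        return rows;
    }
}
```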
Bill





 
 
 


Post by sarge_chri » Thu, 22 Jan 2004 01:11:58

I doubt that speed will be an issue for you.

I've been working on some address handling software for a mate,
comma-delimited records, file size usually around 3-4 MB,
using BufferedReader and StringTokenizer for parsing - it generally
takes a minute or so to process (and it looks like the in-memory
processing I'm doing is considerably more complex than your
requirements).

Try it and see!

- sarge
 
 
 


Post by Jon A. Cru » Thu, 22 Jan 2004 05:14:51


Actually, try not to use them.

Instead use explicit byte math to get values out in an explicit order.

Since most networked applications use 'network byte order' which is
big-endian, go ahead and use that.

to give you the rough idea:

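#include <stdint.h>

/* Store u at dst[0..3], most significant byte first (network byte order). */
void write32(unsigned char *dst, uint32_t u)
{
    *dst++ = (u >> 24) & 0xff;
    *dst++ = (u >> 16) & 0xff;
    *dst++ = (u >> 8) & 0xff;
    *dst++ = (u >> 0) & 0xff;
}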
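
The Java side of the same idea could look like the sketch below. ByteMath is an illustrative name; in practice DataInputStream.readInt or a ByteBuffer does the same job, so this only shows what the byte math amounts to. The "& 0xff" is needed because Java bytes are signed.

```java
final class ByteMath {
    // Assembles a 32-bit value from four bytes stored most significant byte first.
    static int read32(byte[] src, int offset) {
        return ((src[offset] & 0xff) << 24)
             | ((src[offset + 1] & 0xff) << 16)
             | ((src[offset + 2] & 0xff) << 8)
             |  (src[offset + 3] & 0xff);
    }

    // Stores a 32-bit value as four bytes, most significant byte first.
    static void write32(byte[] dst, int offset, int value) {
        dst[offset]     = (byte) (value >>> 24);
        dst[offset + 1] = (byte) (value >>> 16);
        dst[offset + 2] = (byte) (value >>> 8);
        dst[offset + 3] = (byte) value;
    }
}
```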
 
 
 


Post by Jon A. Cru » Thu, 22 Jan 2004 05:20:17


Probably not, since an "ASCII" file would be limited to 7-bit data, and
would lose things. It's very important, especially in the Java world, to
remember that "ASCII" is *not* a synonym for "plain text".

Most of the MS Windows documentation uses "ANSI" as a term for 8-bit
text. "ASCII" is much more limited, and actually present in Java's data
conversions. You'll hit a lot of subtle errors telling Java applications
that you want "ASCII" data when it's not really what you need.



As long as you wrap IO in one of the buffered types, speed probably
won't be a problem on only 1MB.


HOWEVER... there's another gotcha. Readers use some encoding to convert
from 8-bit encodings to internal Java strings which are UTF-16. You'll
probably want to be very explicit on the encoding used. UTF-8 is
probably very good for your needs.
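
Being explicit about the encoding could look like this sketch, which assumes the producing program writes UTF-8 (Utf8Lines is a made-up name). The point is to pass the charset to InputStreamReader instead of relying on the platform default:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class Utf8Lines {
    // Wraps a byte stream in a buffered reader that decodes UTF-8 explicitly.
    static BufferedReader open(InputStream in) {
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    // Demo helper: decodes the first line of an in-memory UTF-8 byte array.
    static String firstLine(byte[] utf8Bytes) {
        try (BufferedReader reader = open(new ByteArrayInputStream(utf8Bytes))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a file, the same open(...) call would take a FileInputStream instead of the in-memory stream used here.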
 
 
 


Post by gagenellin » Thu, 22 Jan 2004 05:23:45


Sorry, I meant between 100000 and 1 million lines, not 1MB file size.
My test file (ASCII format) is about 200 MB.
Reading the ASCII file was too slow - I'll try other ways as suggested
by other people here.
 
 
 


Post by Scott Ells » Thu, 22 Jan 2004 06:13:38




Depends on just how much the "huge amount" ends up being, and how you
intend to use it.

I parse a data file containing matrix data for a simple lapack test. It
has an x, a y, and a double value for matrices up to 500 by 500. This
uses no tokenizers, just reading the line, splitting on the space, and
parsing the data. This 1.1M file is read in 1.357 seconds.

In a different project, I parse 10M XML files using a JDOM-based parser
in 5 seconds or so, though these are all string data without a
string-double conversion.

For both of these, it was important for me to have a format that a human
could read, and that a junior programmer could write a correct parser
for in a very short time, so I used a pure text format.

The nio package has memory mapped files, auto-endian converting byte
buffers, and other tools that make a binary representation easier to
handle.
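
As an illustration of the nio route, the sketch below maps a file of raw 8-byte doubles into memory and sums them. MappedDoubles, sum and writeDoubles are made-up names, and the file layout (nothing but doubles in a known byte order) is an assumption made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedDoubles {
    // Sums all 8-byte doubles in the file, decoding them with the given byte order.
    static double sum(Path file, ByteOrder order) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            buf.order(order);                    // the buffer converts on every get
            double total = 0.0;
            while (buf.remaining() >= Double.BYTES) {
                total += buf.getDouble();
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Stand-in for the non-Java producer: writes doubles to a temporary file.
    static Path writeDoubles(ByteOrder order, double... values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Double.BYTES).order(order);
        for (double v : values) {
            buf.putDouble(v);
        }
        try {
            Path file = Files.createTempFile("doubles", ".bin");
            Files.write(file, buf.array());
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A single mapping is limited to 2 GB, so larger files would have to be mapped in several windows.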

The key question is where your time is likely to be spent. If you have
a lot of data that has to come off the disk quickly, then a binary
format will minimize wire time. If that file needs to be curated,
parsed, read in by other languages, then human readability might become
*** . If you only need a small subset of the data, you might be
best served by a relational database, as those are very good at
searching gigabytes of data to extract the 50k or so you wanted.

Scott
XXXX@XXXXX.COM
Java, Cocoa, WebObjects and Database consulting for the life sciences
 
 
 


Post by Jon A. Cru » Thu, 22 Jan 2004 06:54:01


Again, "ASCII" is not correct.

Among other things, Java can use "ASCII" as the encoding during
conversion, but ASCII is 7-bit, so you will lose half of all possible
byte values (everything above 127).

Not safe.
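
The loss is easy to demonstrate. In the sketch below (AsciiLoss is an illustrative name) a string is encoded and decoded again with a given charset; with US-ASCII every character outside the 7-bit range is silently replaced by '?', while UTF-8 preserves it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class AsciiLoss {
    // Encodes the text to bytes and decodes it again with the same charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String original = "Se\u00f1or";          // contains one non-ASCII character
        System.out.println(roundTrip(original, StandardCharsets.US_ASCII));
        System.out.println(roundTrip(original, StandardCharsets.UTF_8).equals(original));
    }
}
```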
 
 
 


Post by A. Craig W » Thu, 22 Jan 2004 09:22:55


I'm not sure what you have against htons() and htonl(), seeing as they are
commonly available macros that convert data from the host-specific byte order
to network byte order, which is exactly what is needed. That's the whole POINT
of htons() and htonl(). While you could expand out the macros yourself (like
you did in your example) if you are doing any significant amount of data at
all you will end up writing your own anyways, so you might as well use the
common ones.
Now if it should happen that the non-Java app isn't written in C or C++, then
I can see where using htons() and htonl() could be a problem...

--
Craig West Ph: (416) 666-1645 | It's not a bug,
XXXX@XXXXX.COM | It's a feature...