How do you handle and distribute huge amounts of data?

Post by Ravens Fa » Thu, 02 Aug 2007 10:40:05



Well, I'm sitting in Cleveland, Ohio right now, so I am going to have to disregard your message... hahaha.

I'll be curious to see if Jamal Lewis's yardage improves for you guys this year after some definite drop-offs with us the last couple of seasons. Though now that he's not playing against you, your defense's numbers against the run should improve.
 
I'd be surprised if you could get that data to the researchers at all using Excel.  It has a limit of 256 columns and 65,536 rows.  If you had a column per channel and a row per data point, you'd be talking 315 columns and 312,000 rows for 4 minutes of data.  I guess you could always break it up into several files, being sure to leave some spare rows and columns to give them room to do some calculations on the data.
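Just to put numbers on that kind of split, here is a rough back-of-the-envelope sketch (Python, purely illustrative; the 315 channels, 312,000 rows, and Excel's 256 x 65,536 grid are the numbers above, while the amount of spare room left for their calculations is my own guess):

    import math

    channels = 315        # one column per channel
    samples = 312_000     # one row per data point (4 minutes of data)

    max_cols, max_rows = 256, 65_536      # Excel 2003 grid limits
    spare_cols, spare_rows = 16, 1_000    # headroom for their own formulas (assumed)

    col_chunks = math.ceil(channels / (max_cols - spare_cols))
    row_chunks = math.ceil(samples / (max_rows - spare_rows))
    print(col_chunks * row_chunks, "files")   # 2 x 5 = 10 separate files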
Out of curiosity, I created a spreadsheet where every cell was filled with a 1.  It took a good 30 seconds to save and was over 100 MB.  That was probably about 1/10 of the amount of data you're dealing with.  And going back to my earlier calculations, I would guess that as a text file, that much data would need about 10 bytes per value, thus getting you to about 1 GB in text files.
I would ask them what kind of software they will use to analyze these files and what format they would prefer it in.  There are really only two ways to get it to them: an ASCII text file, which could be very large but would be the most flexible to manipulate, or a binary file, which would be smaller but could cause problems if they don't interpret the file the same way you write it out.
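To attach rough numbers to that trade-off, here is a small sketch (Python with NumPy, purely illustrative; it reuses the 10-bytes-per-value text estimate from above and this thread's channel and sample counts, and the round trip at the end uses a tiny array so it runs instantly):

    import numpy as np

    channels, samples = 315, 312_000
    values = channels * samples                           # ~98 million values

    print(f"float64 binary: {values * 8 / 1e9:.2f} GB")   # ~0.79 GB
    print(f"int16 binary:   {values * 2 / 1e6:.0f} MB")   # ~197 MB
    print(f"ASCII text:     {values * 10 / 1e9:.2f} GB")  # ~0.98 GB at ~10 chars/value

    # Raw binary is easy to exchange as long as both sides agree on the
    # data type, byte order, and row/column layout:
    demo = np.arange(12, dtype=np.float64).reshape(4, 3)
    demo.tofile("demo.bin")                               # headerless doubles
    back = np.fromfile("demo.bin", dtype=np.float64).reshape(4, 3)
    assert (demo == back).all()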
I haven't used a TDM file add-in for Excel before, so I don't know how powerful it is, or whether there is a lot of overhead involved that would make it take 3 hours.  What if you create multiple TDM files?  Let's say you break it down by bunches of channels and only 20-30 seconds of data at a time, something that would keep the size of the data array within the row and column limits of Excel (I am guessing about 10 files).  Would each file go into Excel so much faster with that add-in that, even if you need to do it 10 times, it would still be far quicker than one large file?  I am wondering if the add-in is spending a lot of time figuring out how to break down the large dataset on its own.
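For what it's worth, here is what that kind of split could look like outside of the TDM add-in (a hedged sketch in Python/NumPy that writes Excel-sized CSV chunks; the 25-second windows, the 240-column groups, and the 1,300 samples/s rate implied by 312,000 rows over 4 minutes are my assumptions, and the random array just stands in for the real data):

    import numpy as np

    rate = 1_300                      # samples/s implied by 312,000 rows / 4 minutes
    channels, samples = 315, 312_000
    data = np.random.rand(samples, channels).astype(np.float32)  # stand-in data

    chan_block = 240                  # columns per file, under Excel's 256 limit
    time_block = 25 * rate            # ~25 s of data per file, under 65,536 rows

    n_files = 0
    for c0 in range(0, channels, chan_block):
        for r0 in range(0, samples, time_block):
            chunk = data[r0:r0 + time_block, c0:c0 + chan_block]
            np.savetxt(f"chunk_c{c0:03d}_t{r0 // rate:03d}s.csv",
                       chunk, delimiter=",", fmt="%.6g")
            n_files += 1
    print(n_files, "files written")   # 2 channel groups x 10 time windows = 20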
The only other question: have you tried TDMS files as opposed to TDM files?  I know they are an upgrade and are supposed to work better with streaming data.  I wonder if they have any improvements for larger datasets.
 
 
 
 
 

How do you handle and distribute huge amounts of data?

Post by Ravens Fa » Thu, 02 Aug 2007 11:10:07


Hi Ravens Fan,

To the horror of programmers everywhere, Excel 2007 now supports something like 16k columns and 1M rows.  We've actually kept that to ourselves where I work.  We already get spreadsheets with 20 tabs, 255 columns, and 50k rows.  Extending that further is... painful to contemplate.

Joe Z.



I did not realize that.  Honestly, I didn't even realize there was an Excel 2007 out.  I think the expanded worksheet space is long overdue, but you're right, it can cause a lot of headaches.  Like they say, trash will expand to fill the available space.  Our company has just about finished upgrading all but the oldest, lowest-common-denominator PCs from Win2000 to XP, and the Office package I'm running has Excel 2002.  It will probably be another year or two before they start installing Vista on the newest PCs, and the Office packages will probably be upgraded around the same time.
I realize I made an error in my earlier statement.  I said that double precision would be 4 bytes per value, but it is actually 8 bytes, so all my calculations should be twice as large; they would be accurate for single-precision numbers.  But I was just trying to give a sense of scale for the massive amount of data being generated.
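For the record, the corrected arithmetic with this thread's numbers (a quick illustrative calculation):

    channels, samples = 315, 312_000
    values = channels * samples                              # 98,280,000 values
    print(f"single precision (4 bytes/value): {values * 4 / 1e6:.0f} MB")  # ~393 MB
    print(f"double precision (8 bytes/value): {values * 8 / 1e6:.0f} MB")  # ~786 MB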

 
 
 

How do you handle and distribute huge amounts of data?

Post by Yuri3 » Thu, 02 Aug 2007 16:10:07

I have to collect and analyze very large datasets produced by LV programs every day.  A few suggestions:

1)  If you are collecting your data from DAQ boards, they are most likely 16-bit, so when using the DAQmx Read VI, read the data in as raw 16-bit integers.  If the researchers really want voltages or other calibrated floating-point values, just tell them to divide the 16-bit integer data by the appropriate value (based on the gain you set when sampling the channels, for example).  Let them deal with doubles.  (A rough sketch of suggestions 1-3 appears after this list.)

2)  Stream the data to disk as raw binary values, so that you have chunks of 315 x 2 bytes per sample (for 16-bit integers), for a total file size of 315 x 2 x (N samples) bytes (this would be a 150-200 MB file based on your numbers).  This represents the smallest possible size of the data you can create (unless you want to try zipping it afterwards, but that probably won't help much).  If the researchers can't read raw binary data into their analysis program (and most decent analysis packages and/or custom-written software should be able to), then they need to learn how to do so, or you can write some small translation program that they can use to convert the raw binary data to whatever format they desire (even the dreaded Excel file!).

3)  For each raw binary file, I'd create a small, text-based (perhaps XML-formatted) header file (same name as the data file but with a different extension) that summarizes the data file associated with it (time/date collected, relevant experimental parameters, perhaps some summary statistics, etc.).  This way they can open the header file to quickly see the important information, and then use that information to target specific parts of the data in the binary file for further processing.

4)  Excel is a very poor choice for this data.  No matter what you do, Excel will not handle this much data well at all.  If the researchers only know how to use Excel, introduce them to the wonderful world of Matlab or SigmaPlot or some other environment capable of handling large amounts of data.  If they really want to look at the data you produce, they'll have to learn to go beyond Excel at some point.  Researchers have to learn how to adapt to new software with new types of data.  As a researcher myself, I know that sometimes you have to drag them kicking and screaming.  "Because we're used to it" is not a good excuse.

5)  If they originally planned to use Excel, that means the type of analysis they expect to do is relatively simplistic.  Perhaps you can find out what calculations/graphs they would be creating in Excel with the data, and just pre-empt that by having your program automatically create that output.

Hope that helps.  Good luck!
 
 
 

How do you handle and distribute huge amounts of data?

Post by Sima » Fri, 03 Aug 2007 02:40:06

Yes, but I meant in terms of TDM(S) files. In terms of file size, would it be better to go with TDMS, a simple 2D DBL-array binary file, a text file (bad), Excel (worse), etc.?
Hey, I can take this to the DIAdem forum ;)