Downloading lots and lots and lots of files

Post by coolneo » Tue, 30 Jan 2007 23:44:02


First, what I am doing is legit... I'm NOT trying to grab someone
else's content. I work for a non-profit organization and we have
something going on with Google where they are providing digitized
versions of our material. They (Google) provided some information on
how to write a shell script to download the digitized versions using
wget.

There are about 50,000 items, ranging in size from 15 MB to 600 MB. My
script downloads them fine, but it would be much faster if I could
multi-thread(?) it. I'm running wget using the sys command on a
Windows box (I know, I know, but the whole place is Windows so I don't
have much of a choice).
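
Roughly, the serial loop is along these lines (a sketch only -- the
list-file name is a placeholder, not the exact script Google provided):

    use strict;
    use warnings;

    # Read one URL per line and fetch each item with wget, one at a time.
    open my $list, '<', 'url-list.txt' or die "Can't open url-list.txt: $!";
    while ( my $url = <$list> ) {
        chomp $url;
        next unless $url =~ /\S/;
        # -c resumes a partial download if the script is re-run
        system( 'wget', '-c', $url ) == 0
            or warn "wget failed for $url: $?";
    }
    close $list;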

Am I on the right track? Or should I be doing this differently?

Thanks!
J
 
 
 

Downloading lots and lots and lots of files

Post by Lawrence Statton » Wed, 31 Jan 2007 00:00:32

"coolneo" < XXXX@XXXXX.COM > writes:

Moving 5 Terabytes of data is going to take a long, long time no
matter how many threads you throw at the job. If you had 50,000 files
of a few kilobytes each, then you *might* see some improvement because
of the overhead of setting up and tearing down connections, but with
larger files you're most likely network bound.


Never underestimate the bandwidth of a station wagon filled with
magtape.

Or, updating for the 21st century: An SUV with a box of DVDs.

--
Lawrence Statton - XXXX@XXXXX.COM s/aba/c/g
Computer software consists of only two components: ones and
zeros, in roughly equal proportions. All that is required is to
place them into the correct order.

 
 
 

Downloading lots and lots and lots of files

Post by Purl Gurl » Wed, 31 Jan 2007 00:04:52


You indicate you have already downloaded those files.

Why do you want to download those files again?

Purl Gurl
 
 
 

Downloading lots and lots and lots of files

Post by coolneo » Wed, 31 Jan 2007 00:25:04


I managed to download about 21,000 of the 50,000 items over the course
of some time. Initially, Google was processing these items at a slow
rate, but lately they have picked it up.

Bandwidth is indeed a concern, and I understand downloading 5 TB will
take a long, long time, but I think it would be a little shorter if I
could spawn off 4 downloads at a time, or even 2, during our off
business hours and the weekend (I get . The average file size is
125 MB. We have a 200 Mb pipe, so it's not entirely unreasonable (is
it?).
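
Rough arithmetic, assuming the 200 Mb pipe is 200 megabits/s and the
remaining ~29,000 files really do average 125 MB:

    29,000 files x 125 MB  ~= 3.6 TB still to fetch
    200 megabits/s         ~= 25 MB/s at best
    3.6 TB / 25 MB/s       ~= 145,000 s, roughly 1.7 days of saturated transfer

So 2-4 parallel streams can't make the pipe any bigger, but they should
help keep it full whenever a single connection stalls or gets throttled
on Google's end.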
 
 
 

Downloading lots and lots and lots of files

Post by Abigail » Wed, 31 Jan 2007 00:26:01

coolneo ( XXXX@XXXXX.COM ) wrote on MMMMDCCCXCIX September MCMXCIII in

== First, what I am doing is legit... I'm NOT trying to grab someone
== elses content. I work for a non-profit organization and we have
== something going on with Google where they are providing digitized
== versions of our material. They (Google) provided some information on
== howto write a script (shell) to download the digitized version using
== wget.
==
== There are about 50,000 items, raning in size from 15MB-600MB. My
== script downloads them fine, but it would be much faster if i could
== multi-thread(?) it. I'm running the wget using the sys command on a
== windows box (i know, i know, but the whole place is windows so I don't
== have much of a choice).
==
== Am I on the right track? Or should I be doing this differently?


Before you do anything, first check with google whether they allow multiple
connections, and if they do, how many simultaneous connections you may start.
It won't do you much good to start 100 downloads in parallel if google
holds up 95 of them.

Of course, it's quite likely that the network is the bottleneck.
Starting up many simultaneous connections isn't going to help in
that case.

Finally, I wouldn't use threads. I'd either fork() or use a select()
loop, depending on the details of the work that needs to be done.
But then, I'm a Unix person.
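
A minimal fork() sketch of that idea, with N workers each taking every
N-th URL from a placeholder list file (note that fork() is only emulated
on Windows perls):

    use strict;
    use warnings;

    my $workers = 4;
    my @urls = do {
        open my $fh, '<', 'url-list.txt' or die "Can't open url-list.txt: $!";
        map { chomp; $_ } grep { /\S/ } <$fh>;
    };

    my @pids;
    for my $i ( 0 .. $workers - 1 ) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ( $pid == 0 ) {                     # child: every $workers-th URL
            for my $n ( grep { $_ % $workers == $i } 0 .. $#urls ) {
                system( 'wget', '-c', $urls[$n] ) == 0
                    or warn "wget failed for $urls[$n]: $?";
            }
            exit 0;
        }
        push @pids, $pid;                      # parent: remember the child
    }
    waitpid( $_, 0 ) for @pids;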


Abigail
--
A perl rose: perl -e '@}-`-,-`-%-'
 
 
 

Downloading lots and lots and lots of files

Post by Peter Scott » Wed, 31 Jan 2007 00:42:10


You could try

http://www.yqcomputer.com/~marclang/ParallelUserAgent-2.57/lib/LWP/Parallel.pm

Looks like you'll need Cygwin.
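
If the module works the way its synopsis suggests, the basic pattern
would be something like this (URLs are placeholders; for 125 MB files
you would want to look at the register() options for writing straight
to disk rather than buffering responses in memory):

    use strict;
    use warnings;
    use LWP::Parallel::UserAgent;
    use HTTP::Request;

    # Placeholder URLs, not the real Google ones.
    my @urls = map { "http://example.com/item-$_.zip" } 1 .. 8;

    my $pua = LWP::Parallel::UserAgent->new();
    $pua->max_req(4);       # at most 4 requests in flight at once per host
    $pua->redirect(1);      # follow redirects

    for my $url (@urls) {
        # register() returns an error response if the request was rejected
        if ( my $err = $pua->register( HTTP::Request->new( GET => $url ) ) ) {
            warn "could not register $url\n";
        }
    }

    my $entries = $pua->wait();             # blocks until everything finishes
    for my $key ( keys %$entries ) {
        my $res = $entries->{$key}->response;
        printf "%s => %s\n", $res->request->uri, $res->status_line;
    }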

--
Peter Scott
http://www.yqcomputer.com/
http://www.yqcomputer.com/
 
 
 

Downloading lots and lots and lots of files

Post by Ted Zlatanov » Wed, 31 Jan 2007 02:20:43


You should contact Google and request the data directly. I guarantee
you they will be happy to avoid the load on their network and
servers, since HTTP is not the best way to transfer lots of data.

Ted
 
 
 

Downloading lots and lots and lots of files

Post by xhoster » Wed, 31 Jan 2007 02:22:48


I probably wouldn't even use fork. I'd just make 3 (or 4, or 10, whatever)
different to-do lists, and start up 3 (or 4, or 10) completely independent
programs from the command line.
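
A throwaway splitter along those lines (file names are made up): deal
the master list out round-robin into N chunk files, then start one copy
of the existing download script per chunk from separate command windows.

    use strict;
    use warnings;

    my $chunks = 4;
    open my $in, '<', 'url-list.txt' or die "Can't open url-list.txt: $!";
    my @out = map {
        open my $fh, '>', "url-list-$_.txt" or die "Can't write chunk $_: $!";
        $fh;
    } 1 .. $chunks;

    my $i = 0;
    while ( my $line = <$in> ) {
        print { $out[ $i++ % $chunks ] } $line;
    }
    close $_ for $in, @out;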

Xho

--
-------------------- http://www.yqcomputer.com/
Usenet Newsgroup Service $9.95/Month 30GB
 
 
 

Downloading lots and lots and lots of files

Post by Ted Zlatanov » Wed, 31 Jan 2007 02:25:40


This depends on the error rates and the latency between the two sides
(each file may be on a different server in a different part of the
world, for all we know). Generally, 4 downloads are faster than 1,
because each TCP connection is individually throttled by its own window
and congestion control, but of course they create a bigger load on the
client and on the server.

Ted
 
 
 

Downloading lots and lots and lots of files

Post by gf » Wed, 31 Jan 2007 02:55:28


You didn't say if this is a one-time job or something that'll be ongoing.

If it's a one-time job, then I'd split that file list into however
many processes I want to run, then start that many shell jobs and just
let 'em run until it's done. It's not elegant, it's brute force, but
sometimes that's plenty good.

If you're going to be doing this regularly, then LWP::Parallel is
pretty sweet. You can have each LWP agent shift an individual URL off
the list and slowly whittle it down.

The I/O issues mentioned are going to be worse on a single box, though.
You can hit a point where the machine is network I/O bound, so you
might want to consider confiscating a couple of PCs and running a
separate job on each one, as long as you're on a switch and a fast pipe.

I'd also seriously consider a modern sneaker-net, and see about buying
some hard-drives that'll hold the entire set of data, and send them to
Google, have them fill the drives, and then return them overnight air.
That might be a lot faster, and then you could reuse the drives later.
 
 
 

Downloading lots and lots and lots of files

Post by coolneo » Wed, 31 Jan 2007 04:04:13


Ted, I didn't provide some additional information that may make
you think differently:

Google is kinda odd sometimes. It took them forever to allow multiple
download streams, and then they provide this web interface to recall
data in text format with wget. I mean, for Google, you figure they
could do better. I think they would prefer to not give us anything at
all. Once we have it, there is always the chance we'll give it away or
lose it or have it stolen (by Microsoft!).

Another thing I didn't mention is that this can grow to much more than
the 50,000 items, in which case I'd much rather just auto-download
than deal with media.
 
 
 

Downloading lots and lots and lots of files

Post by Dr.Ruud » Wed, 31 Jan 2007 04:34:23

coolneo wrote:


I assume it is gz-compressed?

--
Affijn, Ruud

"Gewoon is een tijger."
 
 
 

Downloading lots and lots and lots of files

Post by Ted Zlatanov » Wed, 31 Jan 2007 05:33:25


As a business decision it may make sense; technically it's nonsense :)

At the very least they should give you an rsync interface. It's a
single TCP stream, it's fast, and it can be resumed if the connection
should abort. HTTP is low on my list of transport mechanisms for
large files.


Sure. I was talking about your initial data load; subsequent loads
can be incremental.

I would also suggest limiting yourself to N downloads per hour, to avoid
bugs or other situations (unmounted disk, for example) where you're
repeatedly requesting all the data you already have. That's a very
ugly situation.
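
A crude sketch of such a throttle inside the download loop (the limit,
the list, and the wget options are all placeholders):

    use strict;
    use warnings;

    my @urls         = map { "http://example.com/item-$_.zip" } 1 .. 200;
    my $per_hour     = 50;          # arbitrary cap
    my $hour_started = time;
    my $count        = 0;

    for my $url (@urls) {
        if ( $count >= $per_hour ) {
            my $wait = 3600 - ( time - $hour_started );
            sleep $wait if $wait > 0;          # sit out the rest of the hour
            $hour_started = time;
            $count        = 0;
        }
        system( 'wget', '-c', $url ) == 0 or warn "wget failed for $url: $?";
        $count++;
    }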

Ted
 
 
 

Downloading lots and lots and lots of files

Post by Michele Dondi » Wed, 31 Jan 2007 07:00:03


There's no sys command, BTW...


Well, one (cheap) option that has not been mentioned is the simple-minded
parallelization you can get with piped open()s...
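
For what it's worth, a bare-bones sketch of that (placeholder URLs; the
old two-argument "command |" form is used here since it is the piped-open
variant most likely to work on a Windows perl):

    use strict;
    use warnings;

    my @urls  = map { "http://example.com/item-$_.zip" } 1 .. 12;  # placeholders
    my $batch = 4;                                                 # parallel wgets

    while ( my @chunk = splice @urls, 0, $batch ) {
        my @handles;
        for my $url (@chunk) {
            # each piped open starts a wget immediately; the pipe only carries
            # its (quiet) stdout -- the downloaded file itself lands on disk
            open my $fh, qq{wget -c -q "$url" |}
                or die "Can't start wget for $url: $!";
            push @handles, $fh;
        }
        for my $fh (@handles) {
            close $fh or warn "a wget exited with status $?";   # close() waits
        }
    }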


Michele
--
{$_=pack'B8'x25,unpack'A8'x32,$a^=sub{pop^pop}->(map substr
(($a||=join'',map--$|x$_,(unpack'w',unpack'u','G^<R<Y]*YB='
.'KYU;*EVH[.FHF2W+#"\Z*5TI/ER<Z`S(G.DZZ9OX0Z')=~/./g)x2,$_,
256),7,249);s/[^\w,]/ /g;$ \=/^J/?$/:"\r";print,redo}#JAPH,
 
 
 

Downloading lots and lots and lots of files

Post by coolneo » Wed, 31 Jan 2007 23:34:55


Thanks everyone. I'm going to give LWP::Parallel a closer look. That
looks like it will do what I want. Thanks for the advice on queuing
the downloads. That makes perfect sense.