Automated download of PDF documents from web site

Post by bstar » Mon, 24 May 2004 00:38:19

I have a project to automatically download a set of Acrobat PDF
documents from a web site (and then extract data from them).

I started in Object REXX, using the ~Navigate method of the OLE
Internet.Explorer object to load the URL. This DOES bring up the PDF
file in my browser, rendered by the Acrobat Reader plug-in. There does
not seem to be an OLE method to interact with this. The ~Document
method shows a 0-byte document (because it's not HTML). So, on to plan B.

For plan B, I wrote the socket program included below. It actually
works, but only on sites with the http: protocol (port 80). However,
the site I am targeting uses the SSL https: protocol (port 443).

Is it possible to interact with an SSL site with RxSocket, or do I need
to abandon REXX for another language?

Is there another tool that I might use? I did download REXXCURL and
haven't mastered it yet, but it complains about my intended site's
security certificate.

Other suggestions?
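
The URL-splitting idiom used in getpdf1.rex can be tried on its own. This is only a small sketch: the URL is a made-up placeholder (the real ones are not shown in this post), and the variable names simply mirror those in the listing.

```rexx
/* Hypothetical URL, used only to show what each Parse yields */
url = 'https://www.example.com/docs/report.pdf'

/* '//' ends protocol, the next '/' ends hostname, and '+0'   */
/* restarts at that '/' so document keeps its leading slash   */
Parse Var url protocol '//' hostname '/' +0 document

/* Last path segment: reverse the string, take the text       */
/* before the first '/', then reverse it back                 */
Parse Value Reverse(document) With filename '/'
filename = Reverse(filename)

Say 'protocol =' protocol
Say 'hostname =' hostname
Say 'document =' document
Say 'filename =' filename
```

With that placeholder URL the four values come out as https:, www.example.com, /docs/report.pdf and report.pdf.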

Here is getpdf1.rex:
/* REXX exec to save data from a web URL (e.g. Adobe Acrobat PDF file) */
/* from a web site to local directory.                                 */
/* Author: Bob Stark, ProTech,                                         */
Trace N

/* Hint: Move desired URL to the bottom of the list */
url = ''
/* The following Share URL does not work, I suspect redirection... */
url = ''
url = '' /* This one works */
url = '' /* Bad hostname */
url = '' /* Bad document */
url = '' /* Good document */
url = '' /* This one hangs */
/* No data is returned by the following https: request */
url = ''
url = '' /* works */
url = '' /* works */
/* The following one from rgf works fine. */
url = ''

Do program = 1 To 1
If RxFuncQuery('SockDropFuncs') then
rc = RxFuncAdd("SockLoadFuncs","rxsock","SockLoadFuncs")
rc = SockLoadFuncs()

Parse Var url protocol '//' hostname '/' +0 document

/* Set the filename of the document to be saved */
Parse Value Reverse(document) With filename '/'
If filename <> '' Then Do
  filename = Reverse(filename)
  filename = MakValid(filename,
End
Else filename = 'default.html'

protocol = Translate(protocol)
Select
  When Abbrev(protocol,'HTTP:') = 1 Then
    port = 80
  When Abbrev(protocol,'HTTPS:') = 1 Then
    port = 443
  Otherwise Do
    Say 'Protocol type ('protocol') unsupported'
    Leave program
  End
End

If SockInit() <> 0 Then Do; Say 'Sockinit failed'; Leave program; End

rc = SockGetHostByName(hostname,'host.!')
If rc <> 1 Then Do; Say 'SockGetHostByName failed'; Leave program; End

socket = SockSocket('AF_INET','SOCK_STREAM',0)
If socket < 0
Then Do;

Post by Mark Hessl » Mon, 24 May 2004 14:01:13

On Sat, 22 May 2004, Bob Stark wrote:

Obviously I'm biased, but Rexx/cURL (and cURL) is meant to do the sort of
thing you need to do. I don't think the issue is so much with Rexx/cURL
as with how some web sites interact with the web client: redirections,
cookies, and certificates are all things you will probably need to deal
with irrespective of which technology you use to communicate with the
site.
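
For illustration, a minimal sketch of such a download with Rexx/cURL might look like the following. The URL, output file name and CA bundle path are made-up placeholders. The option names (URL, OUTFILE, FOLLOWLOCATION, CAINFO) and the curlerror. variables are written as recalled from the Rexx/cURL documentation, so treat them as assumptions and check them against the version you have installed.

```rexx
/* Sketch only: fetch one document over https: with Rexx/cURL.  */
/* URL, file name and CA bundle path are hypothetical.          */
url     = 'https://www.example.com/docs/report.pdf'
outfile = 'report.pdf'
cafile  = 'cacert.pem'   /* CA certificates used to verify the site */

Call RxFuncAdd 'CurlLoadFuncs', 'rexxcurl', 'CurlLoadFuncs'
Call CurlLoadFuncs

curl = CurlInit()
If curl = '' Then Do
  Say 'CurlInit failed'
  Exit 1
End

Call CurlSetopt curl, 'URL', url
Call CurlSetopt curl, 'OUTFILE', outfile      /* save body to a file */
Call CurlSetopt curl, 'FOLLOWLOCATION', 1     /* follow redirections */
Call CurlSetopt curl, 'CAINFO', cafile        /* certificate checks  */

Call CurlPerform curl
If curlerror.intcode <> 0 Then
  Say 'Transfer failed:' curlerror.curlcode curlerror.curlerrm

Call CurlCleanup curl
Call CurlDropFuncs
```

If the complaint about the site's certificate persists, pointing CAINFO at a bundle that contains the issuing authority is usually a better first step than switching verification off.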

Persevere with Rexx/cURL ;-)

Cheers, Mark