Automated download of PDF documents from web site

Post by bstar » Mon, 24 May 2004 00:38:19


I have a project to automatically download a set of Acrobat PDF
documents from a web site (and then extract data from them).

I started in Object REXX, using the ~Navigate method of the OLE
Internet.Explorer object to load the URL. This DOES bring up the PDF
file in my browser, rendered by the Acrobat Reader plug-in. There does
not seem to be an OLE method to interact with this. The ~Document
method shows a 0 byte document (because it's not HTML). So, on to plan
B.

For plan B, I wrote the socket program included below. It actually
works, but only on sites with the http: protocol (port 80). However,
the site I am targeting uses the SSL https: protocol (port 443).

Is it possible to interact with an SSL site with RxSocket, or do I need
to abandon REXX for another language?

Is there another tool that I might use? I did download REXXCURL and
haven't mastered it yet, but it complains about my intended site's
security certificate.

Other suggestions?

Here is getpdf1.rex:
/* REXX exec to save data from a web URL (e.g. Adobe Acrobat PDF file) */
/* from a web site to local directory.                                 */
/* Author: Bob Stark, ProTech, www.protechtraining.com                 */
Trace N

/* Hint: Move desired URL to the bottom of the list */
url = 'http://www.adobe.com/products/acrobat/pdfs/acrruserguide.pdf'
/* The following Share URL does not work, I suspect redirection... */
url = 'http://ew.share.org/callpapers/attach/Long_Beach_Conference/S8314.pdf'
url = 'http://www.rexxla.org/' /* This one works */
url = 'http://www.rezzla.org/' /* Bad hostname */
url = 'http://www.rexxla.org/foo.html' /* Bad document */
url = 'http://www.rexxla.org/Standards/J18PUB.pdf' /* Good document */
url = 'https://www.rexxla.org' /* This one hangs */
/* No data is returned by the following https: request */
url = 'https://dibbs2.bsm.dla.mil/Downloads/RFQ/3/SPM40504Q0023.pdf'
url = 'http://publibfi.boulder.ibm.com/epubs/pdf/rxoq5a00.pdf' /* works */
url = 'http://www-1.ibm.com/support/search.wss?rs=22&tc=SS8PLL&dc=DB520+D800+D900+DA900+DA800&rankprofile=8&dtm' /* works */
/* The following one from rgf works fine. */
url = 'http://wi.wu-wien.ac.at/rgf/rexx/orx12/JavaBeanScriptingWithRexx_orx12.pdf'

Do program = 1 To 1
  If RxFuncQuery('SockDropFuncs') Then
    Do
      rc = RxFuncAdd("SockLoadFuncs","rxsock","SockLoadFuncs")
      rc = SockLoadFuncs()
    End

  Parse Var url protocol '//' hostname '/' +0 document

  /* Set the filename of the document to be saved */
  Parse Value Reverse(document) With filename '/'
  If filename <> '' Then
    Do
      filename = Reverse(filename)
      filename = MakValid(filename,
                          ,'.'xRange('a','z')xRange('A','Z')xRange(0,9))
    End
  Else filename = 'default.html'

  protocol = Translate(protocol)
  Select
    When Abbrev(protocol,'HTTP:') = 1 Then
      port = 80
    When Abbrev(protocol,'HTTPS:') = 1 Then
      port = 443
    Otherwise
      Say 'Protocol type ('protocol') unsupported'
      Leave program
  End

  If SockInit() <> 0 Then Do; Say 'Sockinit failed'; Leave program; End

  rc = SockGetHostByName(hostname,'host.!')
  If rc <> 1 Then Do; Say 'SockGetHostByName failed'; Leave program; End

  socket = SockSocket('AF_INET','SOCK_STREAM',0)
  If socket < 0
  Then Do;
 
 
 


Post by Mark Hessl » Mon, 24 May 2004 14:01:13


On Sat, 22 May 2004, Bob Stark wrote:


Obviously I'm biased, but Rexx/cURL (and cURL) is meant to do the sort of
thing you need to do. I don't think the issue is so much with Rexx/cURL
as with how some web sites interact with the web client: redirections,
cookies and certificates are all things you will probably need to deal
with irrespective of which technology you use to communicate with the
site.
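
To illustrate the redirection point with the raw-socket approach: a redirect has to be spotted by hand in the response header before any data is saved. The following is only a minimal sketch in plain REXX. The header text is made up for illustration (it is not output from any of the sites mentioned), and the variable names are hypothetical.

```rexx
/* Sketch: detect an HTTP redirect in a raw response header.      */
/* The header below is invented sample data, not a real reply.    */
crlf = '0D0A'x
header = 'HTTP/1.1 302 Found' || crlf ||,
         'Server: example' || crlf ||,
         'Location: https://www.example.com/docs/S8314.pdf' || crlf ||,
         crlf

/* First line is the status line: version, status code, reason    */
Parse Var header statusLine (crlf) rest
Parse Var statusLine . status .

/* Walk the remaining header lines looking for Location:          */
location = ''
Do While rest \== ''
  Parse Var rest line (crlf) rest
  If line == '' Then Leave             /* blank line ends header  */
  Parse Var line hdrName ':' hdrValue  /* split at the first ':'  */
  If Translate(Strip(hdrName)) = 'LOCATION' Then
    location = Strip(hdrValue)
End

If Left(status,1) = '3' & location \== '' Then
  Say 'Redirected ('status') to' location
Else
  Say 'Status' status', no redirect'
```

A program that sees a 3xx status would then repeat the request against the Location URL, which may well be an https: address and so run into the SSL question again.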


Persevere with Rexx/cURL ;-)
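
As a possible starting point, a Rexx/cURL download might be sketched roughly as follows. This is an unverified sketch, not a tested program: the function and option names (CurlLoadFuncs, CurlInit, CurlSetopt, CurlPerform, CurlCleanup, 'URL', 'OUTFILE', 'FOLLOWLOCATION') and the CURLERROR. variables are given as I understand the Rexx/cURL documentation and should be checked against the installed version. The URL is one of the test URLs from the original post and the output file name is hypothetical.

```rexx
/* Sketch only: names below are assumptions to verify against the */
/* Rexx/cURL documentation for your installed version.            */
url     = 'https://www.rexxla.org/'   /* test URL from the post   */
outfile = 'download.out'              /* hypothetical file name   */

Call RxFuncAdd 'CurlLoadFuncs', 'rexxcurl', 'CurlLoadFuncs'
Call CurlLoadFuncs

curl = CurlInit()
If curl = '' Then Do
  Say 'CurlInit failed'
  Exit 1
End

Call CurlSetopt curl, 'URL', url
Call CurlSetopt curl, 'FOLLOWLOCATION', 1  /* follow redirects    */
Call CurlSetopt curl, 'OUTFILE', outfile   /* save body to a file */
/* If the site's certificate is rejected, giving cURL a CA bundle */
/* it can verify against is safer than turning verification off.  */

Call CurlPerform curl
If curlerror.intcode <> 0 | curlerror.curlcode <> 0 Then
  Say 'Transfer failed:' curlerror.interrm curlerror.curlerrm

Call CurlCleanup curl
Call CurlDropFuncs
```

Because cURL handles the SSL handshake and redirects itself, the port selection and socket calls from the RxSock version are not needed in this form.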

Cheers, Mark