Checkpoints need to read old journals?

Checkpoints need to read old journals?

Post by Brad » Sat, 24 Apr 2004 13:16:04


'day. I had a situation yesterday where a runaway transaction was logging
a heap of stuff to the journal area. To conserve space I bzipped most
of the journals (but not the most recent half-dozen) and then created
zero-length files to stand in for the journals that had been compressed.

I.e. in the journal area I had files

...
j0000827.jnl (0 length)
j0000827.jnl.bz2
j0000828.jnl (0 length)
j0000828.jnl.bz2
j0000829.jnl (0 length)
j0000829.jnl.bz2
j0000830.jnl
j0000831.jnl
j0000832.jnl
j0000833.jnl
j0000834.jnl
j0000835.jnl

all with the proper modes and owned by Ingres.

Our nightly backup does an alterdb to remove the oldest checkpoint before
proceeding to take a new one. I figured the existence of the zero-length
journal files would keep Ingres happy when it deleted the journals corresponding
to those old checkpoints.

I was surprised when the new nightly checkpoint itself failed:

Thu Apr 22 22:41:57 2004 CPP: Preparing to checkpoint database: lasctdb
Thu Apr 22 22:41:57 2004 CPP: Preparing stall of database, active xact cnt: 0
Thu Apr 22 22:41:57 2004 CPP: Finished stall of database
beginning checkpoint to disk /ckpI2/ingres/ckp/default/lasctdb of 1 locations
Thu Apr 22 22:41:57 2004 CPP: Start checkpoint of location: ii_database to disk:
path = '/ckpI2/ingres/ckp/default/lasctdb'
file = 'c0276001.ckp'
executing checkpoint to disk
ending checkpoint to disk /ckpI2/ingres/ckp/default/lasctdb of 1 locations
Thu Apr 22 22:43:55 2004 E_DM1118_CPP_JNL_OPEN Error opening journal file while
checkpointing the database.
Thu Apr 22 22:43:55 2004 E_DM1118_CPP_JNL_OPEN Error opening journal file while
checkpointing the database.
Thu Apr 22 22:43:55 2004 E_DM110B_CPP_FAILED Error occurred checkpointing the
database.

... but the funny thing is that the checkpoint seemed to succeed, as it was listed
in the infodb output for the database as a standard ONLINE checkpoint #276
with date Thu Apr 22 22:41:57 2004.

The relevant errlog.log entries are:

::[II_JSP , 0072c100]: Thu Apr 22 22:43:55 2004
E_CL2403_JF_BAD_OPEN Error trying to open a journal file
read() failed with operating system error 0 (Error 0)
::[II_JSP , 0072c100]: Thu Apr 22 22:43:55 2004
E_DM1118_CPP_JNL_OPEN Error opening journal file while checkpointing the
database.
::[II_JSP , 0072c100]: Thu Apr 22 22:43:55 2004
E_DM1118_CPP_JNL_OPEN Error opening journal file while checkpointing the
database.
::[II_JSP , 0072c100]: Thu Apr 22 22:43:55 2004
E_DM110B_CPP_FAILED Error occurred checkpointing the database.

It seems to me that:

a. the checkpoint succeeded?

b. The first error listed in errlog.log is probably the 'real' error, with
the subsequent entries being of less importance. I think Ingres was
able to actually open() a (zero length!) journal file with no problems,
but then got back 0 bytes in its first read() system call, which it
didn't expect and raised as an error. The value of errno would be
0, since there was no actual system-call error.

But I'm puzzled as to why a checkpoint would bother looking at old journal
files. The zeroed journals certainly dated after the last checkpoint (prior
to the one that failed), but I still don't know why ckpdb would bother
reading them.

If anyone could tell me what the story is with che
 
 
 

Checkpoints need to read old journals?

Post by bowe » Sat, 24 Apr 2004 15:25:28

Hi Brad,
Snip...

Probably NOT. Was it marked as valid in the infodb output. 1 for yes 0
for NO.

In both cases, you'll find a full checkpoint tar file in the CKP area.
But I suspect that you have fallen foul of Dumping...

I suspect that you had a transaction start during the checkpoint after the
stall that caused 'Dumping' to occur. The System will need to write both
Dump files and to write appropriate entries in the Journal files to allow
a complete recovery.

I am a little surprised that it felt the need to look at the older
journals.

As the errant transaction was active, the checkpoint would have 'stalled'
waiting for it to complete. Even worse, having 'sort of started', the
checkpoint now blocks every new transaction until it can get access. Only
after it gets out of the stall will new work be permitted.

If the transaction is a raunaway. ie nust not required/in error/Totally
Bogus then kill the transaction. If the transaction is required, but just
cutting a lot more journals than anticipated then you have to find some
room by moving older journals aside - as you did. But leave all journals
belonging to the current checkpoint alone. If the disk fills up the
Archiver will crash - but the transaction will continue. You can even go
as far as moving the journal directory as a whole to a bigger area and
then installing a symlink to point to this new area. But we'd be getting
desperate at this stage...


The system will pause waiting for a good start point before commencing the
backup. This is the 'stall'. It wants all active transactions to end.


Martin Bowes


--
Random Titus Quote #4:
It's always funny until someone gets hurt. And then its funnier.

 
 
 

Checkpoints need to read old journals?

Post by Brad » Sat, 24 Apr 2004 16:48:54


Marked '1' as valid.


Exactly. I ensured that the last five or ten journal files at least - the
most recent journals - were untouched. Yet I got this error.


Ingres can't do an online backup until there's a moment where there are no
outstanding transactions? I.e. a transaction can't cross checkpoints?

(This is a very basic 'understanding how Ingres does backups' question; it's
been a while, sorry)

Thanks Martin!


Brad
 
 
 

Checkpoints need to read old journals?

Post by schende » Sat, 24 Apr 2004 19:59:13


[snip]



Well. Learn something new every day.

It turns out that near the end of the checkpoint process, ckpdb
goes back to "truncate" the journal and dump files. The thinking
was that the files are preallocated and this truncation process
would trim them down to the actual written size, if that were less.
So it tries to open your old journal files, and gets upset because
there's nothing in them. The ironic thing is that this "truncation"
bit is a no-op -- nothing happens other than opening and closing
all the journals!

Had the file not been there at all, no error would have occurred.

Your checkpoint did not succeed, I think.

I'd suggest that you open an issue on this, because the whole truncate-
journal thing is a bogus leftover from ancient times and should be
reviewed for probable removal.

Karl
 
 
 

Checkpoints need to read old journals?

Post by bowe » Tue, 27 Apr 2004 10:16:51

>

Kudos and Browny Point to you Karl!


It Did. Well it was marked valid. I suspect then that all the dumping had
been completed and as this final stage of the Op. was just a 'tarting up'
phase, then the checkpoint, dump[s and journals would all be usable for a
recovery.

Marty



--
Random Farscape Quote #22:
John - Us Southern Boys don't make good eating.
 
 
 

Checkpoints need to read old journals?

Post by Brad » Wed, 28 Apr 2004 13:36:29


Karl, thank you. Most informative! Do you have the source code or
something?


So it was just scanning the directory for *.jnl files; not looking
for filenames basedon the information in the .cnf file?

Thanks,


Brad
 
 
 

Checkpoints need to read old journals?

Post by schende » Wed, 28 Apr 2004 18:16:51


I joined the Ingres development team last November. Usual disclaimer:
my posts are my opinion, and not policy, and not necessarily Tech Support.

(I don't use my CA address for this forum because I'm trying to keep
it relatively spam free. The kbcomputer.com address has been spam-hopeless
for years now.)




No, the names are indeed based on the information in the config file,
but when it goes to open and "truncate" a journal, a "file not found"
error is specifically ignored.

--
Karl R. Schendel, Jr. XXXX@XXXXX.COM
Computer Associates Ingres Development