Read a file line by line with a maximum number of characters per line

Post by hugogu » Fri, 15 Oct 2004 23:14:16


Hello,

I want to read a file line by line. I first used the readLine() method,
which returns a String, but if a line contains too many characters I
end up with an OutOfMemoryError. I could use the read(buffer[], maxChars)
method, but it does not take line endings into account, so my buffer
could contain more than a single line. I would like the benefit of both:
a method that returns a String representing one line of the file (like
readLine()), but with a maximum number of characters per line (like
read(buffer[], maxChars)).

I tried reading the file character by character, searching for "\r\n"
and capping the number of read calls, but it takes too long to read
the file.

Does anyone have an idea how to proceed?

Thanks a lot.

Hugo
 
 
 

Post by Thomas Wei » Sat, 16 Oct 2004 00:03:12


That's unlikely. If readLine() was fast enough for you, char-by-char
reading should be fast enough too, because that is what readLine()
does internally to find the line ends.


I suggest you revise your code.

/Thomas

 
 
 

Post by Matt Humph » Sat, 16 Oct 2004 00:13:30


There's an implicit contradiction in what you're asking for. You want to
process the data by lines, but some lines are too big to be processed.
You're going to have to give up one of these. Before you make that decision,
however, you should ask yourself a couple of questions.

1) Is the OutOfMemory exception really being caused by an input line that is
too large? Will such lines be common or expected and must your program
defend against them? Are the lines supposed to be less than a particular
length such that a very long one constitutes an invalid input file?

2) How important is it that your data be processed by lines? Are you
scanning for something in particular? or are you just counting lines as you
go? Is each line parsed independently or scanned for data? As in part 1,
will there never be valid data after a particular length?

3) You say that reading and searching is too slow, but are you using a
BufferedReader? Also, what do you mean by "slow"? Your tests that use
readLine() simply fail with an exception, so perhaps the file is so large
that "slow" is normal.

I would guess that, realistically, you're going to have to give up the idea
of processing data by lines in order to protect your program from input
files that consist of 2.4Gb of data with no carriage returns at all.

To do this you have to change your input system so that it is not line
oriented but instead uses some other structure such as words or phrases,
etc. You say you've tried but that it takes too much time to search for the
end of line. Consider this: the readLine() method must also search for (and
stop at) the end of line, and if it can do it with reasonable performance so
can you--the answer is probably in how you buffer the data. I would recommend
you look at a design centered on reading (buffering) a large chunk and
tokenizing it according to whatever you're looking for. This tokenizer
would refill the buffer when it gets low and handle the two unpleasant cases
of a line (or whatever you're looking for) either spanning multiple blocks
or there being several within one block. It may be possible for your
tokenizer to simply read a character at a time from a BufferedReader and
for you to scan for what you're looking for.
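
A minimal sketch of the chunk-and-scan design described above, under some
assumptions: the class name BoundedLineReader, the 8192-character chunk size
and the linesOf() convenience method are all made up for illustration and are
not part of java.io. Characters beyond maxChars on a line are read and
dropped, so memory per line stays bounded however long the line is.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of java.io: chunked line reader with a length cap.
class BoundedLineReader {
    private final Reader in;
    private final int maxChars;                  // longest line we are willing to keep
    private final char[] chunk = new char[8192]; // block filled by one read() call
    private int pos = 0;                         // next unread index in chunk
    private int limit = 0;                       // number of valid chars in chunk
    private boolean skipLF = false;              // true right after a '\r'

    BoundedLineReader(Reader in, int maxChars) {
        this.in = in;
        this.maxChars = maxChars;
    }

    /** Returns the next line, truncated to maxChars, or null at end of input. */
    String readLine() throws IOException {
        StringBuilder line = new StringBuilder();
        boolean sawChars = false; // any non-terminator char seen for this line?
        while (true) {
            if (pos == limit) { // chunk exhausted: refill (a line may span chunks)
                limit = in.read(chunk, 0, chunk.length);
                pos = 0;
                if (limit <= 0) { // end of input
                    limit = 0;
                    return sawChars ? line.toString() : null;
                }
            }
            char c = chunk[pos++];
            if (skipLF) {
                skipLF = false;
                if (c == '\n') continue; // second half of a "\r\n" pair
            }
            if (c == '\n') return line.toString();
            if (c == '\r') { skipLF = true; return line.toString(); }
            sawChars = true;
            if (line.length() < maxChars) line.append(c); // overflow is discarded
        }
    }

    /** Convenience for trying it out on an in-memory string. */
    static List<String> linesOf(String text, int maxChars) {
        BoundedLineReader r = new BoundedLineReader(new StringReader(text), maxChars);
        List<String> lines = new ArrayList<>();
        try {
            for (String s = r.readLine(); s != null; s = r.readLine()) lines.add(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

For a file, the same class could be wrapped around a FileReader. Whether
silently truncating an over-long line is acceptable, rather than reporting it
as invalid input, depends on the answers to questions 1) and 2).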

Cheers,
Matt Humphrey XXXX@XXXXX.COM http://www.yqcomputer.com/
 
 
 

Post by Will Hartu » Sat, 16 Oct 2004 02:05:28


Really?

I would think since they're using a buffered reader, they'd load blocks of
data in big gulps and then scan it. That's what I would do.

Regards,

Will Hartung
( XXXX@XXXXX.COM )
 
 
 

Post by Steve Hors » Sat, 16 Oct 2004 06:30:28


<lots of good advice snipped>

I would like to add: If you are looking for line endings,
remember that BufferedReader.readLine accepts line endings
of any of the following sequences:
"\n"
"\r"
"\r\n"

I advise that you try and emulate this.
It will save you much grief one day.

Steve
 
 
 

Post by Thomas Wei » Sat, 16 Oct 2004 15:43:00


I suggest you read the source code.


They read one char after the other from the buffer, check it, and run a
small state machine to handle \r\n.
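
As a rough illustration of that kind of state machine (this is not the JDK
source; the class and method names here are invented), a single boolean
recording whether the previous character was '\r' is enough to treat "\n",
"\r" and "\r\n" each as one line ending:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class: splits text on "\n", "\r" or "\r\n".
class LineSplitter {
    static List<String> split(String text) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean afterCR = false; // the whole state: was the previous char '\r'?
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (afterCR && c == '\n') { // '\n' completing a "\r\n" pair
                afterCR = false;
                continue;
            }
            afterCR = (c == '\r');
            if (c == '\r' || c == '\n') { // a terminator ends the current line
                lines.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) lines.add(current.toString()); // unterminated last line
        return lines;
    }
}
```

LineSplitter.split("a\nb\rc\r\nd") gives the four lines a, b, c and d.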

/Thomas
 
 
 

Post by hugogu » Sat, 16 Oct 2004 16:56:41

"Matt Humphrey" <XXXX@XXXXX.COM> wrote in message news:<XXXX@XXXXX.COM>...

When my file is about 3MBytes on a single line, I get an OutOfMemory
error.
These large lines are not expected and do not correspond to normal
behaviour, but they may happen and I must protect my system against
them.


It is important to process the file by line because the user may want to
look for a particular keyword at a particular position in the file
(column, line). Yes, each line read is sent to a scanner, one after the
other.


Yes, I am using a BufferedReader. Slow means 45 minutes for a 3MBytes
file when I read it char by char without using readLine() !!



Here is the code I use to read my file char by char with a maximum
number of characters read:

private String readLineWithMaxSize(BufferedReader br) throws IOException {
    String finalLine = null;
    int readCharacter = -1;
    char[] lineChars = new char[204800];
    boolean bufferFull = false;
    if (br != null) {
        int index = 0;
        readCharacter = br.read();
        // If the read character does not correspond to a new line or to
        // an end of file, we treat it.
        while (readCharacter != -1 && readCharacter != '\r' && readCharacter != '\n') {
            // if the buffer is not full, we add the character to the array of characters
            if (!bufferFull) {
                lineChars[index] = (char) readCharacter;
                index++;
                bufferFull = index >= lineChars.length;
            }
            readCharacter = br.read();
        }
        // If the read character is \r and the next one is \n, we skip it.
        if (readCharacter == '\r') {
            br.mark(2);
            int nextReadCharacter = br.read();
            if (nextReadCharacter != '\n') {
                br.reset();
            }
        }
        // We construct a string representing the line from the buffer of
        // characters read
        if (index != 0) {
            finalLine = new String(lineChars);
        } else if (readCharacter == '\r' || readCharacter == '\n') {
            finalLine = "";
        }
    }
    return finalLine;
}
 
 
 

Post by Owen Jacob » Sat, 16 Oct 2004 18:51:04


(This is because the various platforms Java runs on haven't, historically,
agreed on a line terminator/separator.)

Just out of curiosity, and because I'm about to go to bed and therefore
don't want to start coding, how many lines is the pathological sequence:

"\r\r\n\r\r\n\n\r\r\n"?
()(??)()(??)()()(??) <-- helpful markers
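
One way to check is to let BufferedReader.readLine() do the counting. A small
sketch (the class and method names are made up for the example): each of the
seven groups marked above is consumed as one terminator, so this should report
seven lines, all of them empty.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical example class: counts lines the way BufferedReader.readLine() sees them.
class PathologicalLines {
    static int countLines(String text) {
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            int n = 0;
            while (br.readLine() != null) {
                n++;
            }
            return n;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countLines("\r\r\n\r\r\n\n\r\r\n")); // prints 7
    }
}
```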

--
Some say the Wired doesn't have political borders like the real world,
but there are far too many nonsense-spouting anarchists or idiots who
think that pranks are a revolution.
 
 
 

Post by Matt Humph » Sat, 16 Oct 2004 22:03:37


<snip>

<more snip>


I compiled your code and it ran fine for me. I wrote a program that creates
a test file with 1 short line, a line of 3.5Mb and a final short line. Your
code above on my 1.7Ghz Windows 2000 machine with Java 1.4.2_03 with no
special memory expansion -Xmx set runs in less than a second. I wrote a
similar version based on StringBuffer that returns the complete 3.4Mb string
and it works perfectly fine also. Note that your code above has a serious
problem--every string it returns will be 204800 characters long. You won't
need many of these for your program to run out of memory.

As for the speed problem, I think it will be something with the file and the
OS rather than with Java.

Cheers,
Matt Humphrey XXXX@XXXXX.COM http://www.yqcomputer.com/
 
 
 

Post by hugogu » Tue, 19 Oct 2004 18:04:32

<snip>
<more snip>
<more more snip>

Thank you for your answer.
If my code works for you, it seems I may have missed something in
the code which calls this method. I will check that. On the other
hand, I don't understand why you say that this method will always return
strings 204800 characters long. I mean, in the while loop, I check whether
the character read is an end-of-line or not, so my array of characters
is not always full. Is there something I don't understand here? If I
initialize my array at 204800 characters, does it mean the string I
construct from it will contain 204800 characters, even if the
array is not full?

Thanks a lot for your answers.

Hugo.
 
 
 

Post by Matt Humph » Tue, 19 Oct 2004 20:42:43


Arrays are fixed-length and always have the declared length.
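
To make that concrete: new String(char[]) copies the whole array, including
the unused trailing '\0' elements, while the String(char[] value, int offset,
int count) constructor copies only the part that was filled. A small sketch
(the class and method names are invented; lineChars and index mirror the
variables in the posted method):

```java
// Hypothetical demo class comparing the two String constructors.
class StringFromArrayDemo {
    /** Mimics the posted code: builds the String from the entire array. */
    static String fromWholeArray(String text) {
        char[] lineChars = new char[204800];
        text.getChars(0, text.length(), lineChars, 0);
        return new String(lineChars); // length is always 204800
    }

    /** Builds the String from only the characters actually stored. */
    static String fromUsedPrefix(String text) {
        char[] lineChars = new char[204800];
        int index = text.length(); // number of chars stored, as in the posted loop
        text.getChars(0, index, lineChars, 0);
        return new String(lineChars, 0, index);
    }

    public static void main(String[] args) {
        System.out.println(fromWholeArray("hello").length()); // prints 204800
        System.out.println(fromUsedPrefix("hello").length()); // prints 5
    }
}
```

In the posted method, that would mean building finalLine with
new String(lineChars, 0, index).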

Cheers,
Matt Humphrey XXXX@XXXXX.COM http://www.yqcomputer.com/