## least square fit

### least square fit

I have the following data file:

http://www.yqcomputer.com/ ~bartv/gnuplot/sobol.dat

Now I want to do a least square fit on the data of the first and
second column. The data in the first-column represent the x-values, the
data in the second column represent the y-values. I don't need the rest
of the columns.

Now I now that x and y are related somehow as:

y = a/(x**b)

and I want to find a and b as accurate as possible.

The file that plots my data to an .eps file is at

http://www.yqcomputer.com/ ~bartv/gnuplot/relative_error.pl

If I don't specify an initial value for b_sob, the fit fails. If I do
specify a starting value of 0.9 the fit works, but the result is not
satisfying as can be seen from the .eps file at

http://www.yqcomputer.com/ ~bartv/gnuplot/relative_error.eps

I have also tried to fit on log(\$2)=a-b*log(\$1) and then plot
exp(a)/(x**b) but this didn't give me satisfying results either.

How can I increase the quality of my fit?

Regards,
Bart

--
"Share what you know. Learn what you don't."

### least square fit

"Satisfying" and "quality" are very subjective words. You need to
describe where you feel the fit can be improved. All gnuplot can do is
minimize the weighted sum of squares.

You shouldn't be surprised that you have to specify initial values for
the fit. Curve fitting is not a fire-and-forget technique; it does
require some input from the user, namely reasonable starting values and
(preferably relative) weights to the data points. You're supplying
neither.

To get reasonable starting values for the function a/x**b, do a linear
fit to the log(\$2) versus log(\$1). In fact, I would suggest that if the
data can be linearized, they should be, and a weighted linear least
squares fit should be used rather than a NLLS fit.

I note as well that you're not telling gnuplot how to weight your data
points, even though you later plot y errorbars using sqrt(\$5). gnuplot
therefore weights each data point equally: `fit ... using 1:2` is
equivalent to `fit ... using 1:2:(1)`. This effectively means that the
left third of your data is determining the fit, and the right
two-thirds are contributing next to nothing. You should probably be
doing something like `fit ... using 1:2:(sqrt(\$5))`.

Finally, you're plotting on a logarithmic scale, which by its nature
emphasizes small values. You're not satisfied with the first fit, but
consider how small the absolute differences are between the fit and the
data points.

THeo

### least square fit

OK. I just thought the fitted line should go more "through" the points,
now it was simply lying above all my data points...

I am in a situation where i'm not quite sure if i always have reasonable
starting values, so it would be nice if the fitting would work without
needing to specify any starting value... so...

OK. During my attempts to make it right, this was something that I've
tried but apparently I must've made a mistake somewhere... After
re-trying and re-trying, i finally got this method to work and the
result is much more satisfying! This was indeed the way to go, and I
don't need to specify any starting values, it simply works :-)

Interesting remark, but I'm not completely sure if i fully understand
it. The first 2 columns are indeed my x-y data, and the fifth column is
the variance on the averages from the second column.
My x-values are x=2**(i)-1 for i=1..21. I don't quite understand it
completely why I should use 1:2:(sqrt(\$5)) for my fit. Can somebody
explain a bit more, maybe even with a (gnuplot) example?

I agree with that.

That's true, but apparently the method you described before (doing a
linear fit on the log-values) gave me even better results so i'll stick
with that one :-)

Regards,
Bart

--
"Share what you know. Learn what you don't."

### least square fit

"Accurate" has no useful meaning standing on its own. You have to at
least provide a rough handle on what it is to accurately recreate. If
you're sure that the data actually behave according to this model, for
a given a and b, then "accurate" would have a meaning. In almost all
other cases it doesn't.

That's to be expected to some amount. Non-linear least squares
fitting is an art. It even says so in the manual...

What's special about 0.9? Where did this number come from?

As Theo pointed out, the plot appears to have error bars, and in such
a case, the fit command really should be given those error figures as
input, too, so it can weight the importance of individual points
properly. Without weights, the assumed equal weight of 1 for all will
cause exactly the kind of behaviour you see, in exponential fits: the
assumed error of 1.0 is a very small error on a large y value, but a
huge uncertainty on the small numbers, so the fit will be completely
dominated by the upper-most points of the exponential.

As a matter of fact, your data don't really fit the model --- they're
visibly not on a straight line in the double-log diagram.

One of the best recipes for this kind of job is to do a linear fit in
logarithmic space first, then use the parameters found from that as
startup values for the actual fit.

It's generally not a good idea to use the logarithmized fit as the
actual result --- it'll have entirely the wrong weighting, unless you
were *very* careful about transforming not just the y values, but also
the errors.

--
Hans-Bernhard Broeker ( XXXX@XXXXX.COM )
Even if all the snow were burnt, ashes would remain.

### least square fit

Bart Vandewoestyne wrote:

No, it's not. Zoom in, and you'll see that the point with the
second-smallest x value is lying above the curve found with unweighted
data. As I and HBB have said, the points with the largest y values are
determining the fit.

That's why you use the technique I described to get reasonable values.

Unfortunately, satisfying and correct are not the same. I suspect that
you're doing an unweighted linear fit to the log(y) versus log(x). This
will not take the variances of the data into account.

[...]

Hmmm. I think I've assumed that you have more background in error
analysis and propagation that you actually do. If the 'cs' in your
email address stands for Computing Science, this may well be the case
-- this kind of stuff simply isn't taught in CS, at least here. Two
books I can recommend to get you started are "An introduction to error
analysis : the study of uncertainties in physical measurements", 2nd
ed. John R. TAYLOR, University Science Books, 1997, and
"Data reduction and error analysis for the physical sciences", 2nd ed.
Philip R. BEVINGTON, McGraw-Hill, 1992. The latter is, I think, at a
higher level, and includes a discussion of non-linear least squares
fitting. A thorough reading of `help fit` in gnuplot may also be
helpful, although it too assumes a certain level of background.

I'll try to boil things down as simply as possible. You've got the
variances on the y values of your data in column 5 of your file; the
standard deviation (or, pretty much equivalently, uncertainly) is
obviously sqrt(\$5); the error bars are simply a graphical
representation of the uncertainty on your data points. You're doing:
set logscale xy
plot 'sobol.dat' using 1:2:(sqrt(\$5)) w e
Now suppose you take the logarithms manually:
unset logscale xy
plot 'sobol.dat' using (log(\$1)):(log(\$2))
What should the error bars be? Certainly not sqrt(\$5), nor
log(sqrt(\$5)), as a plot will show -- the error bars look completely
wrong compared to the first plot. The science of error propagation
answers the question "If I have a function depending on variables each
having uncertainties, what is the uncertainty on the value of the
function?" In this case, the function is log(y), and y has an
uncertainty dy (=sqrt(\$5) in your case). As it turns out, the
uncertainty on log(y) is dy/y, so in your case you would do
plot 'sobol.dat' using (log(\$1)):(log(\$2)):(sqrt(\$5)/\$2) w e
to get correct error bars.

I think it's pretty intuitively obvious that data points with larger
error bars are less certain, and should contribute relatively less to
any fit. A plot of the _relative_ uncertainty is a good way of seeing
which data points are going to most heavily influence a fit. In
gnuplot, if you don't give an uncertainty to the `fit` command, a
default value of 1 is supplied. Let's look at this default relative
uncertainty (i.e, uncertainty divided by value).
set logs xy
plot 'sobol.dat' using 1:(1/\$2)
Notice how this data covers six orders of magnitude. If you were to fit
using the default uncertainty of 1, the data at small x would be
considered by the fitting routine to be a million times more important
than the data at large x. That's why I said that it's the left third or
so of your data that's determining the fit; the rest is basically
ignored.

Now do
replot '' using 1:(sqrt(\$5)/\$2)
Here we plot the relative uncertainties based on the known variances of
your data.

### least square fit

Thanks for this very good explanation Theo! I now see what i did
wrong... not supplying (sqrt(\$5)) as weights for my fits makes gnuplot
use 1 for all weights and indeed then this means that the data points on
the right-side of my figure have *huge* errors and are not determining
my fit. Adding this extra (sqrt(\$5)) into the 'using' of the fit tells
gnuplot that those are the errors on the data and now gnuplot can use
these values to assign weights to my data-points. In this second case,
all my data-points are quasi equal-important.

I did not know the fit-command could also be supplied these
uncertainties... I'm just taking my first steps in fitting with
gnuplot... Thanks for teaching me this!

Still, one question: you always mention

1:2:(sqrt(\$5))

But actually the error-bars in my plot were made using

1:2:(c*sqrt(\$5))

with c a constant depending on the level of confidence that I want (e.g.
a value of 1.96 for a confidence of 95%).

So actually in my fit command I should be using

1:2:(c*sqrt(\$5))

But am I right if i say that this doesn't matter because whether or not
the c is in there, the weighting will be the same since it's just a
constant multiplier and the relative weights will remain the same?
I've tested it with c=1.96 and my fit-results remained the same.

OK, this is currently what's in my gnplot script:

fit [x=startfit:*] log_a-b*x "../../data/separate_results/f_c0/a0.4000/dim0010/sobol.dat" using (log(\$1)):(log(\$2)) via log_a, b
a_sob=exp(log_a); b_sob=b
fit [x=startfit:*] a_sob/x**b_sob "../../data/separate_results/f_c0/a0.4000/dim0010/sobol.dat" using 1:2:(c*sqrt(\$5)) via a_sob, b_sob
f_sob(x) = a_sob/x**b_sob
set label '{/Symbol a}_{sob} = %3.5g', b_sob, ', C = %4.3g', a_sob at graph 0.05,0.1

I guess this about the best and correct I can do, right?

Regards and thanks for your help again!
Bart

--
"Share what you know. Learn what you don't."

### least square fit

[...]

The parameter values returned by the fitting process will be indeed
independent of c, as will their estimated uncertainties. What will
change is the weighted sum of squares of residuals (WSSR), also known
as chi-squared. The reduced WSSR (i.e., WSSR divided by number of
degrees of freedom) is a measure of the goodness of the fit. Data with
a perfectly normal distribution about the fitted curve and correctly
estimated error bars will have a reduced WSSR of 1.0.

Looks fine to me.

You're quite welcome.

THeo

### least square fit

N-es. The fit result obtained this way are not very reliable, and can be
improved upon by using them as starting values for the non-linear fit.
Note in particular that the error estimates for the parameters from a
linearised plot are next to useles.

Assume that you do an experiment at largely different x-values (I had
cases where the variation was over 5 orders of magnitude), resulting in
largely different y-values. Each of the results has an error margin
associated with it, often expressed as standard deviation. Experiments
like that are often performed under conditions of constant relative
error, say 2%. So for example, your results may vary between 1000 +/- 20
and 0.1 +/- 0.002.

Now put this into a fit procedure like you did. The program will try and
minimise the sum of squared residuals. It is obvious, that most of the
residuals are in the large y-values, and the program will improve the
fit to those values and virtually disregard the smaller ones. If you
plot the results (especially in a log/log plot in which the small values
are given more weight) this will be very obvious.

There are two possible ways out. One is to fit for minimal Chi-square
rather than minimal sum of squares, in other words to minimise the
relative rather than the absolute deviations.

The other is to weigh the residuals by the reciprocal of the standard
deviation of the y-values before summing them up.