I have the following data file:

http://www.yqcomputer.com/ ~bartv/gnuplot/sobol.dat

Now I want to do a least-squares fit on the data in the first and second columns. The data in the first column are the x-values; the data in the second column are the y-values. I don't need the rest of the columns.

Now I know that x and y are related somehow as

y = a/(x**b)

and I want to find a and b as accurately as possible.

The file that plots my data to an .eps file is at

http://www.yqcomputer.com/ ~bartv/gnuplot/relative_error.pl

If I don't specify an initial value for b_sob, the fit fails. If I do specify a starting value of 0.9, the fit works, but the result is not satisfying, as can be seen from the .eps file at

http://www.yqcomputer.com/ ~bartv/gnuplot/relative_error.eps

I have also tried fitting log($2) = a - b*log($1) and then plotting exp(a)/(x**b), but this didn't give me satisfying results either.

How can I increase the quality of my fit?

Regards,

Bart

--

"Share what you know. Learn what you don't."

"Satisfying" and "quality" are very subjective words. You need to

describe where you feel the fit can be improved. All gnuplot can do is

minimize the weighted sum of squares.

You shouldn't be surprised that you have to specify initial values for

the fit. Curve fitting is not a fire-and-forget technique; it does

require some input from the user, namely reasonable starting values and

(preferably relative) weights to the data points. You're supplying

neither.

To get reasonable starting values for the function a/x**b, do a linear fit of log($2) versus log($1). In fact, I would suggest that if the data can be linearized, they should be, and a weighted linear least-squares fit should be used rather than an NLLS fit.
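This suggestion can be sketched numerically. Below is a minimal pure-Python illustration (Python rather than gnuplot, so the arithmetic is explicit; the data are synthetic, not the sobol.dat file, and `linearize_fit` is just a made-up helper name) of recovering a and b from an unweighted linear fit in log-log space:

```python
import math

def linearize_fit(xs, ys):
    """Unweighted least-squares line through (log x, log y).

    For the model y = a / x**b we have log y = log a - b*log x,
    so the slope is -b and the intercept is log a."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    slope = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
             / sum((u - mx) ** 2 for u in lx))
    return math.exp(my - slope * mx), -slope  # (a, b)

# Synthetic, noise-free data generated from y = 3 / x**0.7
xs = [2.0 ** i - 1 for i in range(1, 11)]
ys = [3.0 / x ** 0.7 for x in xs]
a, b = linearize_fit(xs, ys)
print(a, b)  # recovers a = 3, b = 0.7 (up to rounding)
```

With real, noisy data the recovered a and b won't be exact, but they are exactly the kind of starting values the NLLS fit needs.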

I note as well that you're not telling gnuplot how to weight your data

points, even though you later plot y errorbars using sqrt($5). gnuplot

therefore weights each data point equally: `fit ... using 1:2` is

equivalent to `fit ... using 1:2:(1)`. This effectively means that the

left third of your data is determining the fit, and the right

two-thirds are contributing next to nothing. You should probably be

doing something like `fit ... using 1:2:(sqrt($5))`.
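To see how that third `using` column changes the answer, here is the chi-square minimization a weighted fit performs, written out in closed form for a straight line (standard formulas, as in Bevington; the data and sigmas below are invented for illustration):

```python
def weighted_line_fit(xs, ys, sigmas):
    """Closed-form chi-square-minimizing fit of y = p + q*x,
    weighting each point by 1/sigma**2."""
    w = [1.0 / s ** 2 for s in sigmas]
    S = sum(w)
    Sx = sum(wi * x for wi, x in zip(w, xs))
    Sy = sum(wi * y for wi, y in zip(w, ys))
    Sxx = sum(wi * x * x for wi, x in zip(w, xs))
    Sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    d = S * Sxx - Sx * Sx
    return (Sxx * Sy - Sx * Sxy) / d, (S * Sxy - Sx * Sy) / d  # (p, q)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 100.0]       # last point is far off y = 1 + 2x
equal = [0.1, 0.1, 0.1, 0.1, 0.1]      # equal weights: outlier drags the fit
honest = [0.1, 0.1, 0.1, 0.1, 1000.0]  # outlier declared very uncertain
p1, q1 = weighted_line_fit(xs, ys, equal)
p2, q2 = weighted_line_fit(xs, ys, honest)
print(q1, q2)  # equal weights give a slope near 20; honest weights give ~2
```

The point is the same as with `fit ... using 1:2:(sqrt($5))`: points with large declared uncertainties contribute almost nothing to the result.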

Finally, you're plotting on a logarithmic scale, which by its nature

emphasizes small values. You're not satisfied with the first fit, but

consider how small the absolute differences are between the fit and the

data points.

THeo

OK. I just thought the fitted line should go more "through" the points; as it was, it simply lay above all my data points...

I am in a situation where I'm not quite sure that I always have reasonable starting values, so it would be nice if the fitting worked without needing to specify any starting value... so...

OK. During my attempts to get it right, this was something that I had tried, but apparently I must've made a mistake somewhere... After retrying and retrying, I finally got this method to work, and the result is much more satisfying! This was indeed the way to go, and I don't need to specify any starting values; it simply works :-)

Interesting remark, but I'm not completely sure that I fully understand it. The first 2 columns are indeed my x-y data, and the fifth column is the variance of the averages in the second column.

My x-values are x=2**(i)-1 for i=1..21. I don't quite understand why I should use 1:2:(sqrt($5)) for my fit. Can somebody explain a bit more, maybe even with a (gnuplot) example?

I agree with that.

That's true, but apparently the method you described before (doing a linear fit on the log values) gave me even better results, so I'll stick with that one :-)

Regards,

Bart


"Accurate" has no useful meaning standing on its own. You have to at least give a rough idea of what it is the fit should accurately reproduce. If you're sure that the data actually behave according to this model for a given a and b, then "accurate" would have a meaning. In almost all other cases it doesn't.

That's to be expected to some extent. Non-linear least-squares fitting is an art. It even says so in the manual...

What's special about 0.9? Where did this number come from?

As Theo pointed out, the plot appears to have error bars, and in such

a case, the fit command really should be given those error figures as

input, too, so it can weight the importance of individual points

properly. Without weights, the assumed equal weight of 1 for all will

cause exactly the kind of behaviour you see, in exponential fits: the

assumed error of 1.0 is a very small error on a large y value, but a

huge uncertainty on the small numbers, so the fit will be completely

dominated by the upper-most points of the exponential.

As a matter of fact, your data don't really fit the model --- they're

visibly not on a straight line in the double-log diagram.

One of the best recipes for this kind of job is to do a linear fit in

logarithmic space first, then use the parameters found from that as

startup values for the actual fit.

It's generally not a good idea to use the log-transformed fit as the actual result --- it will have entirely the wrong weighting unless you were *very* careful to transform not just the y values, but also the errors.
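That warning about transforming the errors can be checked with a two-line numerical experiment (made-up numbers): to first order, the uncertainty of log y is dy/y, not dy or log(dy).

```python
import math

y, dy = 1000.0, 20.0                      # a value with 2% relative error
propagated = dy / y                       # first-order error of log(y): dy/y
direct = math.log(y + dy) - math.log(y)   # shift actually produced by +dy
print(propagated, direct)                 # 0.02 vs. ~0.0198: agree to first order
```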

--

Hans-Bernhard Broeker ( XXXX@XXXXX.COM )

Even if all the snow were burnt, ashes would remain.

Bart Vandewoestyne wrote:

No, it's not. Zoom in, and you'll see that the point with the

second-smallest x value is lying above the curve found with unweighted

data. As I and HBB have said, the points with the largest y values are

determining the fit.

That's why you use the technique I described to get reasonable values.

Unfortunately, satisfying and correct are not the same. I suspect that

you're doing an unweighted linear fit to the log(y) versus log(x). This

will not take the variances of the data into account.

[...]

Hmmm. I think I've assumed that you have more background in error analysis and propagation than you actually do. If the 'cs' in your

email address stands for Computing Science, this may well be the case

-- this kind of stuff simply isn't taught in CS, at least here. Two

books I can recommend to get you started are "An introduction to error

analysis : the study of uncertainties in physical measurements", 2nd

ed. John R. TAYLOR, University Science Books, 1997, and

"Data reduction and error analysis for the physical sciences", 2nd ed.

Philip R. BEVINGTON, McGraw-Hill, 1992. The latter is, I think, at a

higher level, and includes a discussion of non-linear least squares

fitting. A thorough reading of `help fit` in gnuplot may also be

helpful, although it too assumes a certain level of background.

I'll try to boil things down as simply as possible. You've got the

variances on the y values of your data in column 5 of your file; the

standard deviation (or, pretty much equivalently, uncertainty) is

obviously sqrt($5); the error bars are simply a graphical

representation of the uncertainty on your data points. You're doing:

set logscale xy

plot 'sobol.dat' using 1:2:(sqrt($5)) w e

Now suppose you take the logarithms manually:

unset logscale xy

plot 'sobol.dat' using (log($1)):(log($2))

What should the error bars be? Certainly not sqrt($5), nor

log(sqrt($5)), as a plot will show -- the error bars look completely

wrong compared to the first plot. The science of error propagation

answers the question "If I have a function depending on variables each

having uncertainties, what is the uncertainty on the value of the

function?" In this case, the function is log(y), and y has an

uncertainty dy (=sqrt($5) in your case). As it turns out, the

uncertainty on log(y) is dy/y, so in your case you would do

plot 'sobol.dat' using (log($1)):(log($2)):(sqrt($5)/$2) w e

to get correct error bars.
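The dy/y rule is just first-order error propagation applied to the logarithm. Spelled out:

```latex
\delta f \approx \left|\frac{df}{dy}\right|\,\delta y,
\qquad
f(y) = \ln y \;\Rightarrow\;
\delta(\ln y) = \frac{\delta y}{y},
```

which in column terms is sqrt($5)/$2.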

I think it's pretty intuitively obvious that data points with larger

error bars are less certain, and should contribute relatively less to

any fit. A plot of the _relative_ uncertainty is a good way of seeing

which data points are going to most heavily influence a fit. In

gnuplot, if you don't give an uncertainty to the `fit` command, a

default value of 1 is supplied. Let's look at this default relative

uncertainty (i.e., uncertainty divided by value).

set logs xy

plot 'sobol.dat' using 1:(1/$2)

Notice how this data covers six orders of magnitude. If you were to fit

using the default uncertainty of 1, the data at small x would be

considered by the fitting routine to be a million times more important

than the data at large x. That's why I said that it's the left third or

so of your data that's determining the fit; the rest is basically

ignored.

Now do

replot '' using 1:(sqrt($5)/$2)

Here we plot the relative uncertainties based on the known variances of

your data.


Thanks for this very good explanation, Theo! I now see what I did wrong... not supplying (sqrt($5)) as weights for my fits makes gnuplot use 1 for all weights, and then the data points on the right side of my figure have *huge* relative errors and are not determining my fit. Adding this extra (sqrt($5)) into the 'using' of the fit tells gnuplot that those are the errors on the data, and gnuplot can then use these values to assign weights to my data points. In that case, all my data points are almost equally important.

I did not know the fit-command could also be supplied these

uncertainties... I'm just taking my first steps in fitting with

gnuplot... Thanks for teaching me this!

Still, one question: you always mention

1:2:(sqrt($5))

But actually the error-bars in my plot were made using

1:2:(c*sqrt($5))

with c a constant depending on the confidence level that I want (e.g. a value of 1.96 for 95% confidence).

So actually in my fit command I should be using

1:2:(c*sqrt($5))

But am I right in saying that this doesn't matter, because whether or not the c is in there, the weighting will be the same: it's just a constant multiplier, so the relative weights remain the same?

I've tested it with c=1.96 and my fit-results remained the same.
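The claim above can be confirmed with a toy fit: scaling every sigma by the same constant c scales all weights by 1/c**2, which cancels out of the minimization. A pure-Python sketch (straight-line model and invented data, not the power-law model from the thread):

```python
def weighted_line_fit(xs, ys, sigmas):
    """Chi-square-minimizing fit of y = p + q*x with weights 1/sigma**2."""
    w = [1.0 / s ** 2 for s in sigmas]
    S = sum(w)
    Sx = sum(wi * x for wi, x in zip(w, xs))
    Sy = sum(wi * y for wi, y in zip(w, ys))
    Sxx = sum(wi * x * x for wi, x in zip(w, xs))
    Sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    d = S * Sxx - Sx * Sx
    return (Sxx * Sy - Sx * Sxy) / d, (S * Sxy - Sx * Sy) / d

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
sig = [0.1, 0.2, 0.3, 0.4]
c = 1.96                                  # e.g. a 95% confidence multiplier
p1, q1 = weighted_line_fit(xs, ys, sig)
p2, q2 = weighted_line_fit(xs, ys, [c * s for s in sig])
# p1 == p2 and q1 == q2 up to rounding: c drops out of the fitted parameters
```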

OK, this is currently what's in my gnuplot script:

fit [x=startfit:*] log_a-b*x "../../data/separate_results/f_c0/a0.4000/dim0010/sobol.dat" using (log($1)):(log($2)) via log_a, b
a_sob=exp(log_a); b_sob=b
fit [x=startfit:*] a_sob/x**b_sob "../../data/separate_results/f_c0/a0.4000/dim0010/sobol.dat" using 1:2:(c*sqrt($5)) via a_sob, b_sob
f_sob(x) = a_sob/x**b_sob
set label sprintf('{/Symbol a}_{sob} = %3.5g, C = %4.3g', b_sob, a_sob) at graph 0.05,0.1

I guess this is about the best and most correct I can do, right?

Regards and thanks for your help again!

Bart


[...]

The parameter values returned by the fitting process will indeed be independent of c, as will their estimated uncertainties. What will

change is the weighted sum of squares of residuals (WSSR), also known

as chi-squared. The reduced WSSR (i.e., WSSR divided by number of

degrees of freedom) is a measure of the goodness of the fit. Data with

a perfectly normal distribution about the fitted curve and correctly

estimated error bars will have a reduced WSSR of 1.0.
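The reduced WSSR can be made concrete with made-up numbers: if every point misses the model by exactly one standard deviation, the WSSR equals the number of points, and dividing by the degrees of freedom gives the reduced value. A small Python sketch:

```python
def reduced_wssr(ys, model, sigmas, n_params):
    """WSSR (chi-squared) divided by the number of degrees of freedom."""
    wssr = sum(((y - m) / s) ** 2 for y, m, s in zip(ys, model, sigmas))
    return wssr / (len(ys) - n_params)

model = [10.0, 8.0, 6.0, 4.0, 2.0]
sig = [0.5, 0.4, 0.3, 0.2, 0.1]
ys = [m + s for m, s in zip(model, sig)]   # each point off by exactly 1 sigma
r = reduced_wssr(ys, model, sig, n_params=2)
print(r)   # 5 unit residuals, 3 degrees of freedom -> 5/3
```

Scaling all sigmas by c divides this number by c**2, which is why the reduced WSSR (unlike the fitted parameters) does depend on getting the error bars right.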

Looks fine to me.

You're quite welcome.

THeo

Yes and no. The fit results obtained this way are not very reliable, and can be improved upon by using them as starting values for the non-linear fit. Note in particular that the error estimates for the parameters from a linearised fit are next to useless.

Assume that you do an experiment at largely different x-values (I had

cases where the variation was over 5 orders of magnitude), resulting in

largely different y-values. Each of the results has an error margin

associated with it, often expressed as standard deviation. Experiments

like that are often performed under conditions of constant relative

error, say 2%. So for example, your results may vary between 1000 +/- 20

and 0.1 +/- 0.002.

Now put this into a fit procedure like you did. The program will try to minimise the sum of squared residuals. It is obvious that most of the residual is in the large y-values, so the program will improve the fit to those values and virtually disregard the smaller ones. If you plot the results (especially in a log/log plot, in which the small values are given more weight), this will be very obvious.

There are two possible ways out. One is to fit for minimal Chi-square

rather than minimal sum of squares, in other words to minimise the

relative rather than the absolute deviations.

The other is to weigh the residuals by the reciprocal of the standard

deviation of the y-values before summing them up.
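For the constant-relative-error case, the two ways out coincide: taking sigma_i proportional to y_i turns each chi-square term into a squared *relative* deviation. A tiny illustration with invented values:

```python
ys = [1000.0, 10.0, 0.1]          # data spanning four orders of magnitude
model = [1020.0, 10.2, 0.102]     # a fit that is 2% high everywhere
sig = [0.02 * y for y in ys]      # constant 2% relative error
terms = [((y - m) / s) ** 2 for y, m, s in zip(ys, model, sig)]
# each term is ~1.0: every point pulls on the fit equally, despite the
# absolute residuals differing by four orders of magnitude
```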
