Threadsafe overhead

I just noticed that adding the threadsafe keyword to a user function adds about 20% to its execution time, even when the function is never called in a multithreaded context. Is this expected? I had assumed threadsafe was purely a compile-time check: flag an error if I try to call a non-threadsafe function or operation with multithreading, or from within another threadsafe function. But it seems to affect execution even when the function is only called from the main thread.
Could you post a small example, so that I can try it out myself?
I also thought that the "threadsafe" keyword is just a hint for the compiler.
Sure:

If you're having trouble observing this, I should concede that I only checked with one function, so it's not a comprehensive study. I did check a few combinations of calculation sizes, including small and large numbers of points in the implicit loop caused by the wave assignment statement. You may notice that the function I based this one on is a direct cut and paste from Peak Functions.ipf, written to go with the Multipeak Fitting package.

I started looking at this because I was hoping to be able to call the XOP peak functions from MultiPeak Fit with the "multithread" statement. I use them in the HITRAN procedures package, often to calculate a simulated spectrum of thousands of peaks at thousands of output wavelengths. As you'll see from the benchmark (and it's no real surprise), worrying about a few extra processors for the user-defined function is pointless if one has an equivalent XOP function to call. However, I'm assuming the XOP could similarly be sped up by multithreading the wave assignment statement if it were threadsafe, which it is not.

I would assume MultiPeak fitting would similarly gain from making the peak functions threadsafe and multithreading the fit, but only if FuncFit can be multithreaded, and I'm not sure whether or how one would do that. Alternatively, if threadsafe were just a compiler directive, then I could whine about why the XOP versions of the peak functions aren't already marked threadsafe. But if it's going to add 20% to the calculation time and not help speed up MultiPeak Fit, then I can see why the XOP peak functions are not marked as threadsafe even though they could be.

Function fLorentzianFit(w,x)
    Wave w; Variable x
   
    Variable r= w[0]
    variable npts= numpnts(w),i=1
    do
        if( i>=npts )
            break
        endif
        r += w[i]/((x-w[i+1])^2+w[i+2])
        i+=3
    while(1)
    return r
End

Threadsafe Function  TS_fLorentzianFit(w,x)
    Wave w; Variable x
   
    Variable r= w[0]
    variable npts= numpnts(w),i=1
    do
        if( i>=npts )
            break
        endif
        r += w[i]/((x-w[i+1])^2+w[i+2])
        i+=3
    while(1)
    return r
End

Function BenchMarkit(destsize, numpeaks, functime, tsfunctime, mtfunctime, xoptime)
    variable destsize, numpeaks
    variable &functime, &tsfunctime, &mtfunctime, &xoptime
    make /free /d /n = (destsize) outputwave
    make /free /d /n = (3*numpeaks + 1) coefs
    coefs[0] = 0
    coefs[1,3*numpeaks;3] = enoise(1)
    coefs[2,3*numpeaks;3] =  enoise(destsize / 2) + destsize / 2
    coefs[3,3*numpeaks;3] = exp(gnoise(1) + 1)
    variable timerref = StartMSTimer
    outputwave = fLorentzianFit(coefs,x)
    functime = StopMSTimer(timerref)* 1e-6
    timerref = StartMSTimer
    outputwave = TS_fLorentzianFit(coefs,x)
    tsfunctime = StopMSTimer(timerref)* 1e-6
    timerref = StartMSTimer
    multithread outputwave = TS_fLorentzianFit(coefs,x)
    mtfunctime = StopMSTimer(timerref)* 1e-6
    timerref = StartMSTimer
    outputwave = LorentzianFit(coefs,x)
    xoptime = StopMSTimer(timerref) * 1e-6
End

Function BenchMarkWrapper(destsize, numpeaks)
    variable destsize, numpeaks
    variable functime, tsfunctime, mtfunctime, xoptime
    BenchMarkit(destsize, numpeaks, functime, tsfunctime, mtfunctime, xoptime)
    print "basic function completed in ", functime, "s"
    print "Thread Safe function completed in ", tsfunctime, "s"
    print "Multithreaded function completed in ", mtfunctime, "s"
    print "XOP function completed in ", xoptime, "s"

End


and from the command line, the results I get (on a 2 processor netbook running Windows 7) are

•BenchMarkWrapper(100000,10)
  basic function completed in   1.21606  s
  Thread Safe function completed in   1.50094  s
  Multithreaded function completed in   0.791856  s
  XOP function completed in   0.0787004  s
•BenchMarkWrapper(10,100000)
  basic function completed in   1.13298  s
  Thread Safe function completed in   1.39786  s
  Multithreaded function completed in   0.771549  s
  XOP function completed in   0.0189796  s
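For anyone who wants to put a number on the slowdown, here is a minimal sketch that turns two of the timings above into a relative overhead. The function name RelativeOverhead is just something made up for illustration; it is not part of any package.

```igor
// Hypothetical helper: percent slowdown of the threadsafe version
// relative to the plain version, both timed in the main thread.
Function RelativeOverhead(baseTime, tsTime)
    Variable baseTime, tsTime

    Variable pct = 100 * (tsTime - baseTime) / baseTime
    printf "Threadsafe overhead: %.1f percent\r", pct
    return pct
End
```

Calling RelativeOverhead(1.21606, 1.50094) and RelativeOverhead(1.13298, 1.39786) with the numbers above gives roughly 23 percent in both cases.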
More of an aside: it's nearly always a _lot_ faster to write an all-at-once fitting function. You remove the overhead of calling the function many, many times; you just call it once. See the following (not tested for correctness).

threadsafe Function  ARJN_TS_fLorentzianFit(w, yy, xx): fitfunc
    Wave w, yy, xx

    multithread yy = w[0]

    variable nreps= (numpnts(w) - 1) / 3
    variable ii
    for(ii = 0 ; ii < nreps ; ii += 1)
        multithread yy[] += w[3 * ii + 1]/((xx[p] - w[3 * ii + 2])^2+w[3 * ii + 3])
    endfor
End

//edited for correctness
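To compare this against the numbers from the benchmark, something along these lines might work. It is only a sketch and has not been run: it assumes ARJN_TS_fLorentzianFit is compiled as posted above, and the names BenchAllAtOnce and xvals are made up here. The coefficient setup is copied from BenchMarkit, and xvals is filled with the point number because outputwave there has default scaling (x equals p).

```igor
// Hypothetical benchmark for the all-at-once function (untested sketch).
Function BenchAllAtOnce(destsize, numpeaks)
    Variable destsize, numpeaks

    Make/FREE/D/N=(destsize) outputwave, xvals
    xvals = p    // explicit x wave matching default wave scaling

    // Same random peak coefficients as in BenchMarkit
    Make/FREE/D/N=(3*numpeaks + 1) coefs
    coefs[0] = 0
    coefs[1,3*numpeaks;3] = enoise(1)
    coefs[2,3*numpeaks;3] = enoise(destsize / 2) + destsize / 2
    coefs[3,3*numpeaks;3] = exp(gnoise(1) + 1)

    // One call fills the whole output wave
    Variable timerref = StartMSTimer
    ARJN_TS_fLorentzianFit(coefs, outputwave, xvals)
    Variable elapsed = StopMSTimer(timerref) * 1e-6
    print "All-at-once function completed in ", elapsed, "s"
    return elapsed
End
```

Note that the loop here runs once per peak over the whole output wave, so the (10,100000) case will probably behave very differently from the (100000,10) case.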
Thanks for the code, ikonen. On my machine the relative slowdown is also about 20%.

•BenchMarkWrapper(100000,10)
  basic function completed in   0.280375  s
  Thread Safe function completed in   0.340006  s
  Multithreaded function completed in   0.0754768  s
  XOP function completed in   0.0195716  s
•BenchMarkWrapper(10,100000)
  basic function completed in   0.251737  s
  Thread Safe function completed in   0.298765  s
  Multithreaded function completed in   0.0569062  s
  XOP function completed in   0.00619713  s