Ethical Data Smoothing?

Hi all,
Does anybody have any information on ethical data smoothing? Obviously I can't just use a huge smoothing number and make my data all nice and shiny, but I don't know why I can't. I know it has a lot to do with your experimental setup, but are there any hard and fast rules? I'm going to go look through an error-analysis book at the library now, but I figured I'd ask here as well. (I'm making the large assumption that most people who use Igor and are on this site do some sort of experimental work that involves collecting and analyzing data.)

Thank you.
I work mostly with spectra. My hard rule is ... Numerical results shall only be obtained from raw data. Otherwise, I consider smoothing to be an optional process to use only to make the trends in data "look prettier". I will as a general rule always show raw data points under a smoothed curve when I feel that I have to show a smoothed curve.

Can you provide a bit more background on your particular application?

--
J. J. Weimer
Chemistry / Chemical & Materials Engineering, UAHuntsville
jjweimer wrote:
I work mostly with spectra. My hard rule is ... Numerical results shall only be obtained from raw data. Otherwise, I consider smoothing to be an optional process to use only to make the trends in data "look prettier". I will as a general rule always show raw data points under a smoothed curve when I feel that I have to show a smoothed curve.


Thank you for your reply. I also work with spectra, particularly Raman, along with some electrochemical measurements. Yes, I am essentially smoothing my data to make it "look" prettier, and I'm taking most of my actual analysis from the raw data; however, there is a slight deviation from that rule.

I've written a few algorithms that find the intensity of particular peaks in a spectrum and then graph that intensity vs. time. However, due to noise, it's very hard for the computer to always pick the correct intensity (and the number of spectra is so high that I could not physically do this manually; rather, I'm a programmer and refuse to do it manually ;) ). Basically, as it sits now, the data gets smoothed by a small amount, then the computer finds the max around the point you specified (so if I specified wavenumber = 500, it'd look at 501 and 499 to see if it's at the max; if it's not, it moves and checks again). That's not exactly how it works, but it's close enough.
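(For the Igor-minded, that step boils down to something like the sketch below. The wave names, the window, and the smoothing amount are just placeholders, not my actual code.)

// Rough sketch of the current approach: lightly smooth a copy, then search
// for the local maximum near a specified wavenumber. specY/specX are an
// assumed XY pair holding one spectrum; targetWN and window are placeholders.
Function FindLocalMax(specY, specX, targetWN, window)
	Wave specY, specX
	Variable targetWN, window

	Duplicate/FREE specY, smth
	Smooth 5, smth							// light binomial smoothing, on a copy only

	Variable p1 = BinarySearch(specX, targetWN - window)
	Variable p2 = BinarySearch(specX, targetWN + window)
	WaveStats/Q/R=[p1, p2] smth				// max of the smoothed copy in that window

	return specX[V_maxRowLoc]				// wavenumber of the local maximum
End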

So I guess my questions are: how much smoothing can I do on a spectrum to make it "look pretty," and how much smoothing can I do to enable my program to find the intensity of the peak? I'm not reporting these values as quantitative, but more as a guide to show trends. However, it would be great if I could justify the smoothing and definitively report these values as quantitative. If not smoothing, then at least "noise reduction."

Again, thanks for your help.
You shouldn't be doing any smoothing if this directly affects the extracted information. Noise is part of your measurement and should be reported properly, for example by including the standard deviation or error of the fit when reporting the intensity of a peak.

If the data is not directly affected, for example if you use smoothing to facilitate image processing, like automated counting of high-intensity spots in an image, or for simple visual purposes, then smoothing is OK.

One commonly accepted data manipulation prior to measurement is background subtraction, but even then one should be wary and double-check that the chosen process is properly performed and acceptable. The rolling ball method, for example, can easily distort peak amplitudes when used improperly, and is something I wouldn't use unless I was being very careful.
reepingk wrote:
...
Basically, as it sits now the data gets smoothed by a small amount, then the computer finds the max around the point you specified ...

So I guess my questions are: how much smoothing can I do on a spectrum to make it "look pretty," and how much smoothing can I do to enable my program to find the intensity of the peak? I'm not reporting these values as quantitative, but more as a guide to show trends. However, it would be great if I could justify the smoothing and definitively report these values as quantitative. If not smoothing, then at least "noise reduction."


Here are a few options that would be fully proper IMO.

* Suppose your peak changes in wavenumber (sample chemistry changes) as a function of time. Use the smoothed data to find the approximate location of the peak. Then, graph raw intensity at that wavenumber versus time. In this case, also graph the peak wavenumber versus time.

* Suppose your peak is to be at one location throughout all time. Decide the position of the peak in wavenumber. Graph raw intensity at that wavenumber as a function of time.

In the first case, absent a real physical reason that your peak position must change with time AND when peak position versus time scatters about an essentially constant average value, you are far better off using the second method. Otherwise, you are just over-playing a guess about one effect (change in peak position with time) in order to force the results in another dimension (peak intensity versus time) to look subjectively better.

To improve S/N in the plot of intensity versus time in either case, measure more samples.
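A minimal sketch of the second option in Igor, assuming (purely for illustration) that the spectra are stored as columns of one 2D wave with wavenumber as the row scaling; all names are placeholders:

// Raw intensity at one fixed wavenumber versus time. "spectra" and "targetWN"
// are placeholders; rows = wavenumber (x scaling), one column per time point.
Function RawIntensityVsTime(spectra, targetWN)
	Wave spectra
	Variable targetWN

	Variable nTimes = DimSize(spectra, 1)
	Variable row = round(x2pnt(spectra, targetWN))	// row nearest the chosen wavenumber

	Make/O/N=(nTimes) intensityVsTime
	intensityVsTime = spectra[row][p]				// raw values only, no smoothing
End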

--
J. J. Weimer
Chemistry / Chemical & Materials Engineering, UAHuntsville
I have done a lot of work with Raman data in the past. One good way of determining peak position & intensity is to locally fit a peak-function to your data. Roughly speaking, fit the top ~25% of the peak to something like a Gaussian, Lorentzian, or Voigt type curve, and determine the peak position and peak intensity from the fit parameters. If you "know" that your peak position is fixed at a known position, then you can hold that parameter fixed at that value and let the other parameters vary. If your peak is particularly asymmetric, then use an asymmetric fit function. What is the advantage of this method? It uses all the relevant data points (i.e. all those in the vicinity of the peak) to contribute to the determination of the peak location. One can also get an estimate of the uncertainty in the peak parameters as well, which is crucial in determining whether or not a trend is statistically significant.
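Something along these lines in Igor, with placeholder names and the ~25% cutoff hard-wired purely for illustration (a lor or Voigt fit would be substituted the same way):

// Local fit of the top ~25% of a peak to a Gaussian. specY/specX are an
// assumed XY pair; pkLo/pkHi is a point range that brackets the peak.
Function LocalGaussFit(specY, specX, pkLo, pkHi)
	Wave specY, specX
	Variable pkLo, pkHi

	// find the points that make up the top ~25% of the peak
	WaveStats/Q/R=[pkLo, pkHi] specY
	Variable cutoff = V_min + 0.75*(V_max - V_min)
	Variable p1 = V_maxRowLoc, p2 = V_maxRowLoc
	do
		p1 -= 1
	while (p1 > pkLo && specY[p1] > cutoff)
	do
		p2 += 1
	while (p2 < pkHi && specY[p2] > cutoff)

	// built-in Gaussian: y = K0 + K1*exp(-((x-K2)/K3)^2)
	CurveFit/Q gauss, specY[p1, p2] /X=specX /D
	Wave W_coef, W_sigma
	Printf "position = %g +/- %g, intensity = %g +/- %g\r", W_coef[2], W_sigma[2], W_coef[1], W_sigma[1]
	// to hold the position at a known value: set K2 and refit with /G/H="0010"
End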
KurtB wrote:
... One good way of determining peak position & intensity is to locally fit a peak-function to your data. ...


I like this approach. I would consider it valid ("ethical") because, when well-documented and objectively applied, it can be reproduced by a random observer to get the same results from any given data set. My concern in this regard would be to keep as many of the "subjective" fitting parameters as possible fixed across comparable data sets. The two main ones I see here are the peak shape and the amount of the peak region to fit.

Do you report such results in journal publications, or do you use this method only for your internal consumption? How well-documented is this approach in the Raman literature, and how do you report your method if you do report results from it in the literature?

KurtB wrote:
One can also get an estimate of the uncertainty in the peak parameters as well, which is crucial in determining whether or not a trend is statistically significant.


With the caveat that such uncertainties only show trends. They do not represent a "true" confidence level for a fit of the given peak shape to the given data because only a restricted portion of the peak region is used in the fit. Of course, the confidence in the trend will increase as an increasingly wider region of the peak is used in the fit. If you will, this is equivalent to a conversion of standard uncertainties in peak parameters to standard uncertainties of the means of those parameters, where the latter decreases to zero as the number of samples goes to infinity.

--
J. J. Weimer
Chemistry / Chemical & Materials Engineering, UAHuntsville
Thank you both for your input. I am going to use all of the advice here to make my data analysis more reproducible and "ethical".

I really like the Gaussian fit idea; in fact, my coworker and I had discussed that before, but we figured that we would have to take into account either a flat or a slanted baseline. If we only fit the top 1/2 or 3/4 of the curve, that shouldn't be a problem, as long as the "1/2" or "3/4" stays the same. However, as the peaks don't always look the same, that will still involve some calculations to determine WHERE that 1/2 or 3/4 of the curve starts and ends.

Our Raman setup has a real-life resolution of about 0.5 wavenumbers. (That is, when we read the silicon standard, it'll vary by up to 0.5 wavenumbers.) So I think I would be justified in taking the position of the most intense point within 0.25 wavenumbers in each direction. Also, the particular peaks I'm observing move with temperature (we work at very high temperatures), so that will have to be taken into account when comparing lower-temperature data with higher-temperature data. (Currently the other graduate students in my lab ignore these factors, but I'm a bit of a perfectionist, so I cannot do the same.)

As for smoothing the data, I really don't think I have to now. Our spectra in general have good signal, so the noise doesn't make them "ugly"; I just wanted to be able to assess the peak position and maximum intensity more accurately. The trace of the max intensity vs. time is the most problematic graph because, due to noise, it'll report unrealistic intensities. We generally chalk it up to noise and move on. By fitting the actual peaks I believe I'll solve this problem.
One push in measurement analysis now is to define the level of confidence in results within the context of what is called an "overall uncertainty budget". The approach is to define unambiguously the contributions of Type A (statistical) and Type B (non-statistical) uncertainties to a final result. A raw signal contains both to start. Manipulations of raw data contribute one or both types depending on what the manipulation does to the raw signal.

In this vein, I think an overall summary to the theme of "ethical" data analysis is ...

Whatever methods you use in measurement and analysis, you will do well to define, determine, and report to the best of your ability all of the factors that contribute to the overall uncertainty budget in the result as well as the relative magnitudes of each of the factors.

With this in mind, as I think back now on your opening question and the responses, I would say that, compared to many other methods to improve peak detection and analysis, smoothing raw data is so fraught with inconsistencies as to how it contributes to the overall uncertainty budget in results that it should just be avoided completely. By comparison, when you fit a peak to a defined region using non-linear regression, you get well-defined parametric uncertainties on the fit parameters as well as directly measurable comparative uncertainties between the values from the smooth peak shape and the raw data. Your job now is just to put such uncertainty values into a systematic report using the syntax appropriate to an overall uncertainty budget analysis.
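(In Igor terms, the fit already hands you those parametric pieces directly; for example, something like this at the command line, with placeholder wave names:)

// after a fit, W_sigma holds the one-sigma parametric uncertainties
CurveFit/Q gauss, specY /X=specX /D
Print "peak intensity:", W_coef[1], "+/-", W_sigma[1]
Print "peak position :", W_coef[2], "+/-", W_sigma[2]
Print "chi-square    :", V_chisq, "with", V_npnts - 4, "degrees of freedom"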

For more information on this, I find a good starting point is the NIST web pages on uncertainty analysis.

http://physics.nist.gov/cuu/Uncertainty/index.html

--
J. J. Weimer
Chemistry / Chemical & Materials Engineering, UAHuntsville
jjweimer wrote:

With this in mind, as I think back now on your opening question and the responses, I would say that, compared to many other methods to improve peak detection and analysis, smoothing raw data is so fraught with inconsistencies as to how it contributes to the overall uncertainty budget in results that it should just be avoided completely. By comparison, when you fit a peak to a defined region using non-linear regression, you get well-defined parametric uncertainties on the fit parameters as well as directly measurable comparative uncertainties between the values from the smooth peak shape and the raw data. Your job now is just to put such uncertainty values into a systematic report using the syntax appropriate to an overall uncertainty budget analysis.



Again, thank you. This is the train of thought this conversation eventually brought me to. I've run the same analysis with Gaussian fits, and it gives very similar data (almost indistinguishable at a cursory glance); however, I can now define exactly how I got the peaks instead of saying "We used Igor's smoothing function (who knows exactly what that does)." No offense to the smoothing function, and I'm sure I could look it up, but I'd rather just use a well-understood Gaussian fit. The way I have it work now is that the data gets smoothed to find a "guess" for where the peak should be, then I center the Gaussian fit on that value and it goes to work from there.
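(Roughly, the relevant bit now looks like the sketch below. The window and the width guess are placeholders, and, if I'm reading the help right, the /G flag is what tells CurveFit to use the values in K0-K3 as its initial guesses instead of the automatic ones.)

// smooth a copy only to get a starting guess, then fit the raw data
// (specY is a placeholder for one spectrum with wavenumber x scaling)
Duplicate/O specY, smth
Smooth 5, smth
WaveStats/Q/R=(480, 520) smth			// hypothetical window around the expected peak
K0 = 0									// baseline guess
K1 = V_max								// amplitude guess from the smoothed copy
K2 = V_maxloc							// center guess from the smoothed copy
K3 = 5									// width guess in cm-1, set from experience
CurveFit/Q/G gauss, specY(480, 520) /D	// the fit itself only ever sees the raw data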
I think this thread has a good summary of a sensible, practical approach.

jjweimer wrote:
Do you report such results in journal publications, or do you use this method only for your internal consumption? How well-documented is this approach in the Raman literature, and how do you report your method if you do report results from it in the literature?


Alas no - I have not published very much (I left academia many years ago), and nothing with this method.

jjweimer wrote:
With the caveat that such uncertainties only show trends. They do not represent a "true" confidence level for a fit of the given peak shape to the given data because only a restricted portion of the peak region is used in the fit. Of course, the confidence in the trend will increase as an increasingly wider region of the peak is used in the fit. If you will, this is equivalent to a conversion of standard uncertainties in peak parameters to standard uncertainties of the means of those parameters, where the latter decreases to zero as the number of samples goes to infinity.


Agreed. The danger of widening the portion of the peak that is used in the fit is that the fit function may not truly describe the actual shape of the peak, leading to systematic errors in the parameters (e.g. if the peak data is asymmetric - say it has a tail to the right - and you fit a Gaussian, then the parameter describing peak position will be offset to the right).

One final comment: when my peak position is somewhat variable, I have found an iterative approach quite successful. Here I fit a Gaussian to the region where I believe the peak to be, masking the data to encompass a suitable region (symmetric about the peak position). I then do a second fit, using the result of the first fit to centre my fit region on the 'new' peak position. One can then iterate with more fits as appropriate. This is fine-tuning, but it has led to an improvement in the analysis of some test data (fluorescence spectra I worked on a while back). In the case of the Raman bands, where the variation in peak position is small, I suspect that this approach would not offer much improvement.
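A sketch of that iteration in Igor might look something like this (assuming a spectrum wave with wavenumber x scaling; the names and the window half-width are placeholders):

// Sketch of the iterative recentring: fit, recentre the window on the fitted
// peak position, and fit again. "specY" is assumed to have wavenumber x scaling.
Function IterativeGaussFit(specY, guessWN, halfWidth, nIter)
	Wave specY
	Variable guessWN, halfWidth, nIter

	Variable i, xLo, xHi
	Variable centre = guessWN
	for (i = 0; i < nIter; i += 1)
		xLo = centre - halfWidth
		xHi = centre + halfWidth
		CurveFit/Q gauss, specY(xLo, xHi) /D
		Wave W_coef
		centre = W_coef[2]				// recentre on the fitted peak position
	endfor
	return centre
End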

Quote:

Again, thank you. This is the train of thought this conversation eventually brought me to. I've run the same analysis with Gaussian fits, and it gives very similar data (almost indistinguishable at a cursory glance); however, I can now define exactly how I got the peaks instead of saying "We used Igor's smoothing function (who knows exactly what that does)." No offense to the smoothing function, and I'm sure I could look it up, but I'd rather just use a well-understood Gaussian fit. The way I have it work now is that the data gets smoothed to find a "guess" for where the peak should be, then I center the Gaussian fit on that value and it goes to work from there.


I'm a little late to this discussion, but I think it's worth jumping in here to stick up for poor, maligned "smoothing." Smoothing is a mathematical algorithm that can be described quite simply if one wishes to publish smoothed data, and it's no better to write "we used Igor's built-in fit routine" if you're going to be idealistic about describing algorithms to a level someone could repeat with another software package. (Technically there are many possible smoothing algorithms, but that's true for fitting as well, and Igor does a very good job of describing what each type of smoothing does.) Other than the oddball non-linear variants like median smoothing, smoothing is just a form of digital low-pass filtering. It's no more "cheating" than inserting an analog low-pass filter in front of a detector that is picking up high-frequency noise.

I think smoothing gets a bad rap because it SEEMS like it gets you something for nothing by making noisy data appear less noisy, and that trips your natural skepticism. But it's not free data improvement... if you're trying to detect very weak signals, then depending on the detection algorithm you apply after smoothing (which could be a programmed threshold check or "look with your eyes and decide if there's a peak there"), smoothing can legitimately increase detection probability, but it can also increase false positives. Another common misconception is to just look at the height of a peak compared to the baseline noise and call that a signal-to-noise ratio. That's the part where you'll notice smoothing seems to increase signal-to-noise, but it's only because ignoring the width of a peak is a poor way to calculate signal-to-noise. There is a significant difference between the probability that a single data point N deviations above the baseline is just noise and the probability that a cluster of points, all of which (or a large fraction of which) are N deviations above baseline, is just noise.

The other downside to smoothing is that if you did have a real signal that was very narrow, heavy smoothing will not improve it. The linear smoothing variants (box, binomial, and Savitzky-Golay) all preserve area, so if you widen the peak significantly by smoothing aggressively, you'll end up reducing the height. You can figure out how much it will widen the peak, and if that's less than or comparable to your instrument resolution, it's harmless. The same goes for any low-pass filtering, whether digital or analog.
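For what it's worth, it's easy to convince yourself of the area point in Igor (the wave name is a placeholder and the smoothing amounts are arbitrary):

// Quick check of the area-preservation point with Igor's linear smoothers.
// "specY" stands in for a spectrum with wavenumber x scaling.
Duplicate/O specY, specBox, specBinom, specSG
Smooth/B 25, specBox				// sliding-average (box) smoothing
Smooth 25, specBinom				// binomial (Gaussian-like) smoothing
Smooth/S=2 25, specSG				// Savitzky-Golay, 2nd-order polynomial
Print area(specY), area(specBox), area(specBinom), area(specSG)	// areas agree (apart from end effects); peak heights will not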

As far as publishing your data, if you're not making any claims about detection sensitivity (or claiming that your highly sensitive instrument is part of the novel approach making the paper worth publishing), I wouldn't lose any sleep over using smoothed spectra. As noted above, if you overdo it, eventually you'll see that your baseline noise looks like signal and/or your real peaks are overly distorted, and it's not improving the appearance any more. If you are trying to claim that you've made some experimental improvement to the sensitivity, then yes, it would be highly dishonest to publish smoothed data and not explain it, because it would give a false impression of what your system can really do. But I would emphasize that if your goal is to improve sensitivity and you're retaining high-frequency noise that cannot possibly represent real data, that's poor experimental design.

Having said all that, I'm not necessarily against the "fit to a peak" approach, but from what I gather, there is a baseline you should probably try to remove, using the rest of the spectrum and a baseline-removal tool, before you run the fit. If you're just letting the Gaussian fit float both peak height and baseline, and you're not fitting out to the small tails, those two parameters are going to be highly correlated and extremely sensitive to noise. Your fit will have a lot of trouble distinguishing between a baseline of 50 with a height of 50 vs. a baseline of 0 with a height of 100 if you only fit the top 1/2 or so of the peak, and I gather it's the height you consider intensity, not the absolute y position. That's kind of a separate point from the smoothing vs. fitting discussion, because I think you'd want to do baseline removal no matter how you determine peak intensity.
Thank you for your reply. Our baseline removal is actually taken care of by the instrument program itself. (Although migrating that to Igor would save a good bit of time, as it's simply subtracting one spectrum from the next.)

In that program we do use the smoothing function to determine the baseline; otherwise the noise would make the baseline extremely erratic.
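(If I do move it into Igor, I imagine it would be something as simple as the sketch below; the names and the smoothing amount are placeholders, and, as noted earlier in the thread for rolling-ball backgrounds, a crude smoothed baseline can distort peak amplitudes, so it needs care.)

// estimate the baseline from a heavily smoothed copy, then subtract it
Duplicate/O specY, baseline
Smooth/B 101, baseline			// a heavy box smooth follows the slow background but washes out sharp peaks
Duplicate/O specY, specCorrected
specCorrected -= baseline		// or, spectrum-minus-spectrum: specCorrected = specY - referenceSpec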

My labmates told me something similar when it came to smoothing. As long as I don't overdo it I would probably be fine.

Either way, smoothing or not, my program has trouble with small peaks. This has a lot to do with the actual optics and CCD. I don't exactly know how to describe it, but when the noise "width" (the width of the individual noise spikes) gets to within a factor of 4 or 5 of the width of the peak, and the peak has a very low intensity, my Gaussian fit of the top 50% of the peak likes to fit ONE spike of noise. So it'll have a short tail, a Gaussian peak, then a long tail off the other side, even though the fit's width is only 50% of the peak's width. And honestly, I can't expect a good fit here anyway, because at that point there are only ~5-10 data points covered by the fit. I think I'm just going to report a threshold below which the data will be less accurate. Something like "A less accurate method for finding peak intensity was used for intensities less than X because of noise." Basically, my less accurate method is to expand the range of the fit to include the baseline.

Even using smoothing (which was how the previous program was designed), the small peaks still reported erroneous data and many times my other labmates had to go back in and pick the point out manually.

I'll attach a picture. Keep in mind, this is a very blown-up section of my graph focusing on a very small peak. Zoomed out, the data looks much... better. And generally our peaks are much higher, but we study kinetics, so sometimes peaks disappear, and I have to account for that.

You can see in the graph the raw data (red/spiky, with mustard yellow offset above). You can see the initial fits (tiny red traces inside the peak). There are multiple fits because there are 5 spectra on the graph, and I didn't bother to offset the fits (these are very much initial graphs; from here we decide which spectra to use and make publishable spectra. These are more to show me an overview of the experiment). You can see how they like to fit just one POINT of noise instead of the whole top of the peak. You can also see my "less accurate" solution in green. Sure, it's pretty accurate now, but if I have a sloping baseline or an asymmetric peak, it'll be further off. It's obvious that there is a peak there (we know there is; we can observe it growing); it's just that getting the program to see that is a bit harder.

Being a programmer really makes you appreciate the evolution of the human body (given the amount of programming needed to do very simple tasks), yet it also shows you its many limitations (slow, doesn't like repetitive work, etc.). Then again, if we had a program that worked at "human speed," it could probably go through enough iterations to make the same predictions as us.
I think you're asking the fit procedure to do too much with too little information... in fact, if all your data is sampled at that same spacing, it would appear you're fitting 4 parameters (offset, center, width, and height) with 4 data points. I guess Igor doesn't complain until you have more parameters than data points, but that's not generally a good idea. I think KurtB's suggestion to only fit a portion of the peak was more intended to pull out a good peak center and the position of the highest point from a peak that may not be well separated from its neighbors, or that is sitting on a steeply sloping "baseline" caused by the tail of the Rayleigh scattering or fluorescence. Your baseline subtraction seems to be working well (if I understand correctly, the red data is pretty much sitting on a zero baseline that only reads 200 because of the trace offset) and there are no interfering features around, so you're doing yourself no favors fitting over such a small region. You also probably shouldn't let the width vary, unless it really does change across your different experiments. Holding the width to a wider value (presumably determined from a fit to a stronger signal, or a known instrument resolution if you think the "true" feature should be narrower) will prevent the fit from homing in on a single noise spike.

I think you can also see in those other bad fits what I was getting at with the baseline: By not including any data away from the peak and letting the fit adjust the vertical offset and vertical height, some of your fit results are small peaks offset upwards and others are large peaks offset downwards. Better to widen the fit region, or if the baseline is sloped, automate a procedure to find the best single offset value (average of the outermost points e.g.?) and hold the value fixed when you fit the peak.
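Concretely, in Igor that might look something like the snippet below (the ranges and guesses are invented; for the built-in gauss, the coefficients are K0 = offset, K1 = height, K2 = center, K3 = width, and a "1" in the /H string holds that coefficient fixed):

// fix the offset from the wings of the data, then hold it (and the width) during the fit
Variable leftBase  = mean(specY, 470, 480)			// specY assumed to have wavenumber x scaling
Variable rightBase = mean(specY, 520, 530)
K0 = (leftBase + rightBase)/2						// offset, held fixed
K1 = 100											// height guess
K2 = 500											// center guess
K3 = 5												// width from a strong spectrum or the instrument resolution, held fixed
CurveFit/Q/G/H="1001" gauss, specY(485, 515) /D		// fit only K1 and K2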
ikonen wrote:
I think you're asking the fit procedure to do too much with too little information... in fact, if all your data is sampled at that same spacing, it would appear you're fitting 4 parameters (offset, center, width, and height) with 4 data points. I guess Igor doesn't complain until you have more parameters than data points, but that's not generally a good idea. ...


Yes, I agree with you. Our baseline is nice in this spectrum at this point, but that is definitely not always the case. With multiple peaks to look at, and multiple environmental conditions producing more blackbody radiation (and therefore making our baseline terrible), I need my program to be able to deal with all of it. Now you see why I'm writing this in the first place. However, the important data that we care about generally has good signal, so these functions will never see the light of day during processing of those spectra.

And unfortunately, the width of the peak will change across experiments, depending on a lot of factors. Not to mention the other peaks I look at are pretty broad. However, the "average width" is a parameter that is input before the program starts, so it's easy to change.
reepingk wrote:
And unfortunately, the width of the peak will change across experiments, depending on a lot of factors. Not to mention the other peaks I look at are pretty broad. However, the "average width" is a parameter that is input before the program starts, so it's easy to change.


I suggest, as alluded to by ikonen, that you put some Constraints on the fit parameters, especially the width and centre. You should have a good idea of what realistic limits are for these from your experience of looking at the data, and this will reduce the tendency for the fit function to home in on a spike or other unwanted feature in the data.
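For example, with made-up limits, a constraint wave in Igor would look something like this (K2 is the centre and K3 the width for the built-in gauss function; the data range is a placeholder):

// text wave of constraint expressions, passed to the fit with /C
Make/O/T/N=4 peakConstraints
peakConstraints[0] = "K2 > 495"
peakConstraints[1] = "K2 < 505"
peakConstraints[2] = "K3 > 2"
peakConstraints[3] = "K3 < 15"
CurveFit/Q gauss, specY(485, 515) /D /C=peakConstraints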

As an aside, you said in an earlier comment that you have a resolution of about 0.5 cm-1. My quick eyeball of your plots suggests you are sampling the spectrum every 2.5 cm-1. It may be worth checking your instrument settings to see what is going on here.
KurtB wrote:
reepingk wrote:
And unfortunately, the width of the peak will change across experiments, depending on a lot of factors. Not to mention the other peaks I look at are pretty broad. However, the "average width" is a parameter that is input before the program starts, so it's easy to change.


I suggest, as alluded to by ikonen, that you put some Constraints on the fit parameters, especially the width and centre. You should have a good idea of what realistic limits are for these from your experience of looking at the data, and this will reduce the tendency for the fit function to home in on a spike or other unwanted feature in the data.

As an aside, you said in an earlier comment that you have a resolution of about 0.5 cm-1. My quick eyeball of your plots suggests you are sampling the spectrum every 2.5 cm-1. It may be worth checking your instrument settings to see what is going on here.


You guys are right. I'm going to work on putting constraints on both the peak position and the width, and I'm also going to use the previous time point's fit as the initial parameters, since the fit doesn't change much between time points. (Remember, this is for a kinetic trace, watching a peak grow in or disappear.)
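(Something like the sketch below is what I have in mind; the 2D layout and the names are placeholders for however the spectra actually end up being stored.)

// Fit the same peak at every time point, carrying each fit's coefficients
// forward as the next time point's initial guesses. Assumes a 2D wave
// "spectra": rows = wavenumber (x scaling), one column per time point.
Function FitKineticTrace(spectra, x1, x2)
	Wave spectra
	Variable x1, x2								// fit range in wavenumber

	Variable nRows = DimSize(spectra, 0)
	Variable nTimes = DimSize(spectra, 1)
	Make/O/N=(nTimes) peakHeight, peakPos
	Make/O/D/N=(nRows) oneSpec
	SetScale/P x, DimOffset(spectra, 0), DimDelta(spectra, 0), oneSpec

	// first-time guesses: offset, height, center, width
	K0 = 0
	K1 = 1
	K2 = (x1 + x2)/2
	K3 = 5

	Variable i
	for (i = 0; i < nTimes; i += 1)
		oneSpec = spectra[p][i]
		CurveFit/Q/G gauss, oneSpec(x1, x2)		// /G: start from the guesses in K0..K3
		Wave W_coef
		peakHeight[i] = W_coef[1]
		peakPos[i]    = W_coef[2]
		// CurveFit leaves its results in K0..K3, so the next pass starts from this fit
	endfor
End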

I came up with the 0.5 cm-1 number from observing the peak position of the internal Si standard; however, that number was the peak position as reported by the program itself. I believe you are correct; I misspoke.