Robust Fit: External Data
You can use DataView robust fit facilities to smooth or fit a polynomial equation to data from sources external to the program.
The algorithms used for robust fitting (i.e. reducing the influence of outliers) are described here.
- Select the Analyse: External data: Robust fit menu command to open the Robust Fit dialog. Unlike most analysis routines, this is available even if no data file is loaded into the main program.
Data source
The external data source should be plain-text numbers arranged in two tab- or comma-separated equal-length columns, with the left column containing the X values and the right column containing the Y values. There can be one or more text header rows, but these are ignored. The X values of the data do not have to be evenly spaced, and if they are not in order, they are automatically sorted by the program.
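The expected layout can be illustrated with a short parsing sketch. This is not DataView's own loader: the function name `parse_xy` is hypothetical, and treating every non-numeric row as a header (rather than only the leading rows) is an assumption made for simplicity.

```python
import re

def parse_xy(text):
    """Parse two tab- or comma-separated numeric columns into sorted X and Y lists."""
    pairs = []
    for line in text.splitlines():
        fields = [f.strip() for f in re.split(r"[\t,]", line.strip())]
        if len(fields) != 2:
            continue  # blank or malformed row
        try:
            pairs.append((float(fields[0]), float(fields[1])))
        except ValueError:
            continue  # non-numeric row, assumed to be a text header
    pairs.sort(key=lambda p: p[0])  # ascending X order, as the program does on loading
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    return xs, ys

xs, ys = parse_xy("time,value\n3,30\n1,10\n2\t20\n")
```

Here the header row is skipped and the three data rows come back ordered by X, even though they were supplied out of order and with mixed separators.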
There are 3 ways to load data into the program:
- Copy the data onto the clipboard outside of DataView, and then click the Paste button in the Robust Fit dialog.
- Click the Load button in the dialog and select a text file (.txt) containing the data.
- Drag-and-drop a text file containing the data from File Explorer onto the dialog.
Robust Polynomial Fit
- Click polyfit shuffled discontig.txt.
- When you click the link, most browsers will open the file and display the numbers. You can then select them (usually control-a) and copy them to the clipboard (control-c). Then click the browser Back button to return to this page.
- Alternatively, you can right-click the link and download a local copy of the file, then open and copy its contents (or load or drag-and-drop as described above).
- Assuming you have placed the file contents onto the clipboard, click Paste in the Robust Fit dialog.
When you load the data, a notification message tells you that they were automatically sorted in ascending X order, and then the X-Y values are displayed as a scattergraph.
The data were originally generated from a second degree (i.e. second order) polynomial equation, but then several Y values were replaced by outlier numbers (mainly 0s). The X values are not evenly spaced (there are gaps in the sequence), and, for demonstration purposes, the order of the values was randomly shuffled (hence the message on loading).
The red line drawn through the scattergraph shows the robust fit of a polynomial function to the data. It happens that the default values of the parameters are pretty much optimal for this analysis, so the red line is a close (by eye) fit to the data excluding the obvious outliers. The coefficients of the polynomial equation are displayed, and the fitted equation is thus:
y = -0.300074 x^2 + 49.9903 x + 27.2745
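If you copy the coefficients out of the dialog, the fitted curve can be evaluated at any X value elsewhere. The following is a minimal sketch using Horner's rule. The helper name `polyval` and the highest-degree-first ordering are choices made here to match the way the equation is written above, not something defined by DataView.

```python
def polyval(coeffs, x):
    """Evaluate a polynomial by Horner's rule; coeffs are ordered highest degree first."""
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result

# Coefficients of the fitted equation quoted above (x^2 term, x term, constant).
coeffs = [-0.300074, 49.9903, 27.2745]
y_at_10 = polyval(coeffs, 10.0)  # fitted Y value at X = 10
```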
It may be instructive to change parameters to explore how this affects the fit.
- Set the Degree to 3.
- This fits a robust 3rd order polynomial equation to the data. However, the coefficient of the 3rd order term is very small (2.89088e-06), indicating that it has little weight in the equation. This is not surprising since the original data were generated using a 2nd order equation. The other coefficients change slightly, but the red curve remains essentially the same.
- Set the Degree to 1.
- This fits a robust linear equation to the data.
- Set the Degree to 0.
- The line is drawn at the robust average of the data (Fit coefficient = ~282).
- Set the Robust iter to 0.
- This removes the robust qualification for the fit. Note that the red line drops in the display and is now drawn at the standard numerical mean Y value (Fit coefficient = ~-113).
- Return the Degree to 2, but leave the Robust iter at 0.
- You have now fitted a standard 2nd order polynomial to the data, but without any robust qualification. Note how the red line is "pulled" towards the outliers. The coefficients are the same as those that would be obtained using the generalized Curve fit routine available in DataView.
- Return the Robust iter to 3, and then set the Outlier thresh to 100.
- The Outlier threshold value determines how extreme a data value's departure from expectation (i.e. the residual) has to be for the value to be considered an outlier. The threshold is expressed as a multiplier of the MAD (median absolute deviation from the median), and a value of 6 is normally considered a reasonable definition of an outlier. By setting a value of 100 we are saying that only enormous deviations from expectation will be considered outliers, and there are few if any such deviations in this data set. Thus the fit becomes similar to a standard, non-robust, polynomial fit.
- Return the Outlier thresh to 6 to get back to where you started.
Hopefully, you now have a reasonable idea of how the parameter choice affects the robust fit procedure.
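To make the roles of Degree, Robust iter and Outlier thresh more concrete, here is a rough sketch of one common way to build a robust polynomial fit: iteratively reweighted least squares with bisquare weights. It is an illustration under stated assumptions, not the program's actual code. The function names are hypothetical, and the choice of bisquare weighting and of measuring residuals relative to their median are assumptions. Coefficients are returned lowest degree first.

```python
from statistics import median

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def weighted_polyfit(xs, ys, ws, degree):
    """Weighted least-squares polynomial fit via the normal equations."""
    n = degree + 1
    ata = [[sum(w * x ** (i + j) for x, w in zip(xs, ws)) for j in range(n)]
           for i in range(n)]
    aty = [sum(w * y * x ** i for x, y, w in zip(xs, ys, ws)) for i in range(n)]
    return solve(ata, aty)

def bisquare(u):
    """Bisquare weight: 1 at u = 0, falling smoothly to 0 for |u| >= 1."""
    return (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0

def robust_polyfit(xs, ys, degree=2, robust_iter=3, outlier_thresh=6.0):
    ws = [1.0] * len(xs)
    coeffs = weighted_polyfit(xs, ys, ws, degree)  # standard fit as the starting point
    for _ in range(robust_iter):
        resid = [y - sum(c * x ** i for i, c in enumerate(coeffs))
                 for x, y in zip(xs, ys)]
        med = median(resid)
        mad = median([abs(r - med) for r in resid])
        if mad == 0:
            break  # residuals have no spread, so nothing to down-weight
        scale = outlier_thresh * mad  # residuals beyond this get zero weight
        ws = [bisquare((r - med) / scale) for r in resid]
        if sum(1 for w in ws if w > 0) <= degree:
            break  # too few points left to determine the polynomial
        coeffs = weighted_polyfit(xs, ys, ws, degree)
    return coeffs
```

In this sketch, setting `robust_iter=0` skips the reweighting loop and returns the standard least-squares fit, while a very large `outlier_thresh` leaves every weight close to 1, so the result approaches the standard fit. Both behaviours mirror what the steps above show in the dialog.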
Robust Smoothing
Details of the algorithm for robust smoothing are given here. The essential concept is to apply a moving average filter, with modifications that reduce the influence of outliers.
- Click robust smooth.txt.
- When you click the link, most browsers will open the file and display the numbers. You can then select them (usually control-a) and copy them to the clipboard (control-c). Then click the browser Back button to return to this page.
- Alternatively, you can right-click the link and download a local copy of the file, then open and copy its contents (or load or drag-and-drop as described above).
- Assuming you have placed the file contents onto the clipboard, click Paste in the Robust Fit dialog.
These data show the instantaneous frequency of spikes generated by the caudal photoreceptor in a crayfish, where each data point represents the reciprocal of the time interval between a spike and the preceding spike (Y value), plotted against the time of occurrence of the spike (X value). However, the spikes were identified from extracellular recordings using template recognition, and the allowed error was purposefully set high to allow many false positives, and these generate outliers above the obvious trend. There are also a few false negatives generating outliers below the trend, mainly due to spike collision in the recording. Robust smoothing attempts to draw a line through the trend, reducing the influence of the outliers.
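The way false positives and false negatives turn into outliers follows directly from the definition of instantaneous frequency. A minimal sketch, assuming spike times are in seconds (so frequencies are in Hz) and using a hypothetical function name:

```python
def instantaneous_frequency(spike_times):
    """Return (time, frequency) pairs, one per spike after the first.

    The frequency is the reciprocal of the interval to the preceding spike,
    and the time is that of the later spike in each pair.
    """
    return [(t1, 1.0 / (t1 - t0))
            for t0, t1 in zip(spike_times, spike_times[1:])
            if t1 > t0]

regular = instantaneous_frequency([0.0, 0.1, 0.2, 0.3])         # steady train
extra = instantaneous_frequency([0.0, 0.1, 0.2, 0.25, 0.3])     # one false positive
missed = instantaneous_frequency([0.0, 0.1, 0.3])               # one false negative
```

With spikes 0.1 s apart the steady train sits at about 10 Hz. A false positive halfway between two spikes produces two points at about 20 Hz (outliers above the trend), while a missed spike doubles one interval and produces a point at about 5 Hz (an outlier below the trend).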
- Uncheck the line box towards the bottom left of the dialog.
- This removes the connecting lines joining the dots in the scattergraph, which makes the main trend visually more obvious.
- Uncheck the AutoScl Y box, and set the upper Y axis scale to 150 and the lower scale to 0 to zoom in on the main trend.
- Select Robust smoothing as the Fit type option.
You should now see a very jagged line drawn approximately through the main trendline of the data. It is clearly of little value as it stands, so parameters need adjusting.
- Set the Half-window to 18.
- There is an immediate very substantial improvement in the smoothed fit, as would be expected even for a standard moving average filter.
- Set the Degree to 3.
- Robust smoothing is based on the LOWESS technique, which uses a local polynomial fit, and a 3rd order polynomial follows sharper curves in the data than does a 2nd order fit.
- Reduce the Outlier thresh to 3.
- This means that more of the data points are identified as outliers in the initial smoothing iteration.
- Set the Symbol size (towards the bottom-left of the dialog) to 2.
- This simply makes the smoothed red line easier to see against the background data points.
- Increase the Robust iter to 6.
At this point the smoothed line seems like a good fit to the main trend in the data. It has a similar shape to the "ground truth" instantaneous frequency plot of spikes recorded intracellularly from the photoreceptor, where there is no ambiguity in spike recognition. (If you want to see this, load file cpr intra extra into the main program and select the Event analyse: 2-D scatter graph menu option.)
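The interplay of Half-window, Robust iter and Outlier thresh can be sketched with a deliberately simplified smoother. The code below is a robust moving average, i.e. the degree-0 special case of a local fit, and is not the program's LOWESS implementation, which fits a local polynomial of the chosen degree. The function names, the interpretation of the half-window as a number of points on each side, the bisquare weighting and the median fallback are all assumptions made for illustration.

```python
from statistics import median

def bisquare(u):
    """Bisquare weight: 1 at u = 0, falling smoothly to 0 for |u| >= 1."""
    return (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0

def robust_moving_average(ys, half_window=18, robust_iter=3, outlier_thresh=6.0):
    """Moving average in which suspected outliers are progressively down-weighted."""
    n = len(ys)
    weights = [1.0] * n  # robustness weights, all equal on the first pass
    smoothed = list(ys)
    for iteration in range(robust_iter + 1):
        new = []
        for i in range(n):
            lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
            wsum = sum(weights[lo:hi])
            if wsum > 0:
                total = sum(w * y for w, y in zip(weights[lo:hi], ys[lo:hi]))
                new.append(total / wsum)
            else:
                # Every point in the window was rejected: fall back to the median.
                new.append(median(ys[lo:hi]))
        smoothed = new
        if iteration == robust_iter:
            break
        resid = [y - s for y, s in zip(ys, smoothed)]
        med = median(resid)
        mad = median([abs(r - med) for r in resid])
        if mad == 0:
            break  # residuals have no spread, so nothing to down-weight
        scale = outlier_thresh * mad
        weights = [bisquare((r - med) / scale) for r in resid]
    return smoothed
```

With `robust_iter=0` this reduces to a standard moving average, so an isolated outlier drags the smoothed value towards it across the whole window. Each robust iteration recomputes the weights from the residuals of the previous pass, and a lower `outlier_thresh` rejects more points, which is the behaviour explored in the steps above.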
Heuristic parameter adjustment
In the examples above, parameter adjustment was basically heuristic, guided by prior knowledge of what the fitted profile should look like. So far as I am aware, there is no generalized algorithm for determining the "best" parameters, since from the data alone there is no way to distinguish between outliers that should be ignored and extreme-but-genuine values that should not. However, this methodology does have the key advantage of reproducibility: if the same parameters are used for the analysis of different data sets, then any differences in the output are an objective reflection of differences in those data sets.
Obtaining Output Values
The Copy button has a drop-down option Copy text (also Save text) that provides the output in text format.
For the Polynomial fit option, the output starts with a list of the coefficients, and this is followed by the X-Y values of the raw data themselves. You can also just select and copy the coefficient values from the display in the dialog itself.
For the Robust smoothing option, the output consists of two X-Y columns: the first has the raw data and the second has the smoothed data (the X values are the same in each set).