A method for smoothing data that allows a trade-off between the "smoothness" of the fit and the "accuracy" of the fit is proposed. The method uses a sparse linear system solver to find an optimum between smoothing the data and honoring the original data. The method can be used to smooth data that is multiple valued; smooth loops can be obtained from erratic original data. Discontinuities can be allowed by providing locations where the fitted values are not forced to be smooth. The idea is developed from Tikhonov regularization. The method can be applied to data on a regular grid or to irregularly spaced data. Code implementing some of these techniques is available in the program xsdsmooth for 2d data and in the program grids2grid for 3d gridded surface data.
Suppose that we have measurements that should be functionally related but for which we do not have a guess for an "a priori" relationship. Assume also that the measurements are inexact. If we make a series of measurements and the data does appear to be functionally related, then we might like to refine our estimate of the relationship by making more measurements and/or by "smoothing" the original data. Instead of guessing at a functional form to be the basis for smoothing the data, Tikhonov suggested selecting a new set of points that are "close" to the original points but that minimize an additional regularization constraint. The regularization constraint can penalize "non-physical" or undesired behavior. To make the idea more concrete, assume that we have two-dimensional data. Extension to multidimensional data is simple: promote the independent variable to a vector.
Let the measured values be the pairs $(x_i, z_i)$ for $i = 1, \ldots, N$. Let us approximate the original data by a new set of data. To simplify the example, assume that the approximation agrees exactly in the "x" coordinate and only approximately in the "z" coordinate. Extensions of the idea allowing errors in both coordinates use techniques in the standard statistical literature. Let the approximation, for $i = 1, \ldots, N$, be $(x_i, y_i)$. Many "physically" reasonable constraints could be imposed. "Smoothness" is one such constraint, but the idea of smoothness is not very concrete, and we will introduce several measures of smoothness in this note.
A "badness" function will be built and optimized by adjusting the variables to achieve minimum "badness". Optimization of a "badness" function that is quadratic in the unknowns will produce a linear system of equations.
Suppose that the original data is sampled at a set of irregularly spaced locations, and the data is desired on a regular grid that best matches the original measurements. Since the original data is given at locations different from the desired grid, the idea of agreement between the original data and the "interpolated" or "extrapolated" data must be generalized. A reasonable generalization is to require that the values interpolated from the regularly spaced data back to the original sample locations agree with the original measurements. This leads to a more complex "veracity" component of the least squares problem. Instead of modifying ONLY the diagonal of the system of equations, the veracity component will add non-zero coefficients for the variables within the interpolation window of each original measurement. Assuming that we use linear interpolation, in the 2d (curve) case each original datum will couple 2 grid values. Assuming that we use bilinear interpolation, in the 3d (surface) case each original datum will couple 4 grid values.
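As a minimal illustration of this coupling, the sketch below (assuming a 1d regular grid and linear interpolation; the function and parameter names are invented for the example and are not part of xsdsmooth or grids2grid) computes the two grid indices and weights tied to one irregularly located sample:

import numpy as np

def hat_weights(x_sample, grid_x0, grid_dx, n_grid):
    """Return the two bracketing grid indices and linear-interpolation weights."""
    t = (x_sample - grid_x0) / grid_dx              # fractional grid coordinate
    j = int(np.clip(np.floor(t), 0, n_grid - 2))    # left bracketing node
    frac = t - j                                    # 0 at node j, 1 at node j + 1
    return (j, j + 1), (1.0 - frac, frac)

# A sample at x = 2.3 on a grid with x0 = 0 and dx = 1 couples nodes 2 and 3
# with weights of about 0.7 and 0.3; these weights enter the veracity term.
print(hat_weights(2.3, grid_x0=0.0, grid_dx=1.0, n_grid=10))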
We could allow interpolation and extrapolation by generalizing the problem to finding a set of pairs of the form $(\hat{x}_j, y_j)$, where the $\hat{x}_j$ may be at the same or at different locations from the original $x_i$, and the $y_j$ may be forced to have the same values as the original $z_i$ or allowed to differ from those values. We can provide these generalizations by carefully controlling which constraints are included in the quality function and which $y_j$ are allowed to vary during the optimization of the quality ("badness") function. Discontinuities will be introduced where smoothness is not required, and the original data will be honored at locations where the $y_j$ are not allowed to change from the original values.
Let us discuss a family of smoothing ideas based on minimization of weighted differences in the smoothed values. Let us minimize the following function of the original data and the smoothed values:
$$S = \sum_{i=1}^{N-1} \left( y_{i+1} - y_i \right)^{2} \;+\; v \sum_{i=1}^{N} \left( y_i - z_i \right)^{2} . \qquad (1)$$
Making the second term small forces the smoothed data to be "close" to the original values. Making the first term small decreases the variation in the smoothed data. By adjusting the value of the weighting parameter, "v", the fitted values can be made smoother or made to honor the original data as closely as desired. For large "v" the fitted values will be close to the original data. For small "v" the fitted values will tend toward the average value of the original data. The weighting parameter, "v", might be called the "veracity" since it controls how well the fitted values match the original data.
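The following is a minimal numerical sketch of this trade-off, assuming uniform x-spacing; the function name and the use of scipy's sparse solver are illustrative assumptions, not the implementation used by xsdsmooth or grids2grid:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def smooth_first_difference(z, v):
    """Minimize sum((y[i+1]-y[i])**2) + v*sum((y[i]-z[i])**2)."""
    n = len(z)
    B = sp.diags([-np.ones(n - 1), np.ones(n - 1)], [0, 1], shape=(n - 1, n))
    D = B.T @ B                                  # tridiagonal difference matrix
    A = (D + v * sp.identity(n)).tocsc()
    return spsolve(A, v * np.asarray(z, dtype=float))

rng = np.random.default_rng(0)
z = np.sin(np.linspace(0.0, np.pi, 50)) + 0.2 * rng.standard_normal(50)
y_close = smooth_first_difference(z, v=10.0)     # large veracity: honors the data
y_smooth = smooth_first_difference(z, v=0.1)     # small veracity: much smoother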
The first term could be considerably more elaborate. For example, summing the squared second differences produces a quality function that can reduce the second variation in the fitted values:
$$\sum_{i=2}^{N-1} \left( y_{i+1} - 2\,y_i + y_{i-1} \right)^{2} \;+\; v \sum_{i=1}^{N} \left( y_i - z_i \right)^{2} .$$
We might choose to decrease variations in the local estimates for the first derivative of the dependent variable by replacing the first term with:
$$\sum_{i=2}^{N-1} \left( \frac{y_{i+1} - y_i}{x_{i+1} - x_i} \;-\; \frac{y_i - y_{i-1}}{x_i - x_{i-1}} \right)^{2} .$$
Minimization of the third differences in the smoothed data will tend to allow constant curvature in the fitted values which might be useful in fitting multiple valued relations such as circles.
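Assuming unit spacing in $x$, one concrete form of such a third-difference penalty (an illustration of the idea rather than necessarily the exact expression used by the programs) is:
$$\sum_{i=2}^{N-2} \left( y_{i+2} - 3\,y_{i+1} + 3\,y_{i} - y_{i-1} \right)^{2} .$$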
In order to present the basic idea, consider minimization of the first quality function, S. The first term of this quality function may be considered to be the square of the slope of the fitted values weighted by the square of the length of the x-intervals. Since S is quadratic in the smoothed data values, partial derivatives of S with respect to the smoothed data values will be linear in the smoothed data values. Since S increases for very large or very small smoothed data values, we expect a minimum value of S to occur where the gradient of S with respect to the smoothed data values is zero. The "best" fit, in this sense, can be found by solving a linear algebra problem obtained by equating the gradient of the quality function with respect to the fitted values to zero:
$$\frac{\partial S}{\partial y_i} = 0, \qquad i = 1, \ldots, N .$$
The linear equations are:
$$(1 + v)\,y_1 - y_2 = v\,z_1 ,$$
$$-\,y_{i-1} + (2 + v)\,y_i - y_{i+1} = v\,z_i , \qquad 1 < i < N ,$$
$$-\,y_{N-1} + (1 + v)\,y_N = v\,z_N .$$
This is a tridiagonal system of linear equations for the unknowns, $y_1, \ldots, y_N$. It is almost Toeplitz: if the (1,1) and (N,N) coefficients of the difference part of the system were 2 instead of 1, then the system would be Toeplitz. Sparse matrix methods can be used to solve this linear system, or Toeplitz methods can be used to get an approximate solution. More complex quality functions lead to less sparse (but still banded) linear systems. For example, minimization of the second differences leads to a system with 5 diagonals whose interior equations are:
$$y_{i-2} - 4\,y_{i-1} + (6 + v)\,y_i - 4\,y_{i+1} + y_{i+2} = v\,z_i , \qquad 2 < i < N - 1 .$$
This 5 diagonal smoother is similar to 2 applications of the simpler, 3 diagonal smoother and drives the smoothed data toward constant slope. The 3 diagonal smoother drives the smoothed data toward a constant value.
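Under the same assumptions as the earlier sketch, the 5 diagonal smoother can be built from the second-difference operator; the interior rows of the resulting matrix carry the 1, -4, 6, -4, 1 pattern shown above (again an illustrative sketch, not the production code):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def smooth_second_difference(z, v):
    """Minimize sum((y[i+1]-2*y[i]+y[i-1])**2) + v*sum((y[i]-z[i])**2)."""
    n = len(z)
    ones = np.ones(n - 2)
    B2 = sp.diags([ones, -2.0 * ones, ones], [0, 1, 2], shape=(n - 2, n))
    A = (B2.T @ B2 + v * sp.identity(n)).tocsc()   # pentadiagonal system
    return spsolve(A, v * np.asarray(z, dtype=float))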
Minimization of the sum of the squares of changes in the slope of the curve leads to a similar 5 diagonal system of equations. In this case the weighting terms depend inversely on the increments in the independent variable. (Thus, if an increment is zero, there is a singularity that must be avoided in the numerical computation. It would also be desirable to cast the minimization problem into the form of minimizing the sum of the squares of the angles (or sines of the angles) between adjacent segments of the curve, since this would be independent of the orientation of the coordinate axes. However, this leads to a non-linear optimization problem that might be harder to solve.) We attempt to minimize the following sum of squares:
$$\sum_{i=2}^{N-1} \left( \frac{y_{i+1} - y_i}{x_{i+1} - x_i} \;-\; \frac{y_i - y_{i-1}}{x_i - x_{i-1}} \right)^{2} \;+\; v \sum_{i=1}^{N} \left( y_i - z_i \right)^{2} .$$
Organizing the sum into common factors of the unknowns gives, for the interior points,
$$\frac{y_{i-2}}{d_{i-2}\,d_{i-1}}
-\left[\frac{1}{d_{i-2}\,d_{i-1}}+\frac{2}{d_{i-1}^{2}}+\frac{1}{d_{i-1}\,d_{i}}\right] y_{i-1}
+\left[\frac{1}{d_{i-1}^{2}}+\left(\frac{1}{d_{i-1}}+\frac{1}{d_{i}}\right)^{2}+\frac{1}{d_{i}^{2}}+v\right] y_{i}
-\left[\frac{1}{d_{i-1}\,d_{i}}+\frac{2}{d_{i}^{2}}+\frac{1}{d_{i}\,d_{i+1}}\right] y_{i+1}
+\frac{y_{i+2}}{d_{i}\,d_{i+1}}
= v\,z_{i},$$
where $d_i = x_{i+1} - x_i$.
And the remaining equations for points near the boundary can be obtained from these by symmetry. Singularities where $x_{i+1} = x_i$ can be removed by deleting the replicated values of $x$ and replacing the value of the dependent variable at that $x$ by the average of the values at the replicated $x$ value.
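A sketch of this slope-change smoother for irregularly spaced x is given below; the helper name and the use of scipy are assumptions for the example, and repeated x values (zero increments) are presumed to have been removed first, as described above:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def smooth_slope_changes(x, z, v):
    """Penalize changes in local slope: the weights depend on 1/dx."""
    x = np.asarray(x, dtype=float)
    z = np.asarray(z, dtype=float)
    n = len(z)
    dx = np.diff(x)                              # assumes strictly increasing x
    rows, cols, vals = [], [], []
    for i in range(1, n - 1):                    # one residual per interior point
        rows += [i - 1, i - 1, i - 1]
        cols += [i - 1, i, i + 1]
        vals += [1.0 / dx[i - 1],
                 -1.0 / dx[i - 1] - 1.0 / dx[i],
                 1.0 / dx[i]]
    C = sp.csr_matrix((vals, (rows, cols)), shape=(n - 2, n))
    A = (C.T @ C + v * sp.identity(n)).tocsc()   # 5 diagonal system
    return spsolve(A, v * z)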
If $\mathbf{y}$ is the vector of smoothed data values $(y_1, y_2, \ldots, y_N)^{T}$, $\mathbf{z}$ is the corresponding vector of original data values, $D$ is the symmetric difference matrix associated with the smoothing term, and $I$ is the identity matrix ($I_{ij} = 1$ if $i = j$ and $I_{ij} = 0$ if $i \neq j$), then $\mathbf{y}$ must satisfy
$$\left( D + v\,I \right) \mathbf{y} = v\,\mathbf{z} .$$
The binomial coefficients play an important role because they give the coefficients of the finite differences: the first difference has coefficients $1, -1$; the second difference has coefficients $1, -2, 1$; the third difference has coefficients $1, -3, 3, -1$; and, in general, the $k$-th difference has the sign-alternating binomial coefficients $\binom{k}{0}, -\binom{k}{1}, \binom{k}{2}, \ldots, (-1)^{k}\binom{k}{k}$.
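A small illustration (a hypothetical helper, not part of the programs) that generates these stencils by repeated convolution with the first-difference stencil:

import numpy as np

def difference_stencil(k):
    """Coefficients of the k-th finite difference, e.g. k=2 -> [1, -2, 1]."""
    stencil = np.array([1.0])
    for _ in range(k):
        stencil = np.convolve(stencil, [1.0, -1.0])
    return stencil

print(difference_stencil(1))   # [ 1. -1.]
print(difference_stencil(2))   # [ 1. -2.  1.]
print(difference_stencil(3))   # [ 1. -3.  3. -1.]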
The above ideas are implemented in a program called grids2grid. The program, grids2grid, takes a regular gridded surface as input and creates a smoothed, interpolated, and/or extrapolated regular gridded surface as output. Masks, flags, and/or conventions can be used to specify where the data is honored, smoothed, extrapolated, and/or discontinuous. Valid values of the input grid are specified by either a mask or a special value of the grid. If a mask is used, then non-zero values of the mask are assumed to specify valid grid locations. Alternatively, a special "mask" value can be used to specify the invalid grid locations by coding a command line argument of the form mask_value=99989. In this case, grid locations with value equal to the mask value will be assumed to hold invalid data.
The locations of grid points where output values are desired are specified by a second mask. If no second mask is specified, then values will be output for the entire grid.
The original data can be honored at locations specified by a third mask. The default is to honor all of the original data. If all of the original data is to be honored then you may code the command line argument honor_original_data=yes and omit the mask3= argument.
The veracity may be set with the command line argument veracity=value. If no veracity is specified, the default value of 1 is used.
grids2grid [ ? | help=] [-h | -help ]
[in1= dds_horizon_and_mask_dict] [format= usp | cube ...]
[out= dds_horizon_and_mask_dict] [out_format= usp | ...]
[veracity= Pos_Real_Number (default= 1.)]
[honor_original_data= yes | no (default=yes)]
[size.x= size_of_fastest_data_direction]
[size.y= size_of_next_data_direction]
[delta.x= increment_associated_with_fastest_dir]
[delta.y= increment_associated_with_next_dir]
[x0= coordinate_origin_of_fastest_dir]
[y0= coordinate_origin_of_second_dir]
Smooth, interpolate, and extrapolate dds gridded surfaces
using the dds i/o system. grids2grid accepts
the following optional command line arguments:
help= causes this help message to be printed
in1= dds_grid_dict IDs the input grid file data dictionary
in2= dds_grid_dict IDs the input file that specifies where
in the grid output data is desired. Id est, grid
points that have in2_mask values greater than zero will
be computed for output. If in2 is not specified, then
the output is assumed to be requested on the entire grid.
in3= dds_honor_dictionary IDs the mask that specifies where
the original data should be honored. Where the data
specified by in3 is non-zero, the original data will
be honored and propagated to the output.
If in3 is not specified, then the honor mask can be
created by specifying the honor_original_data flag as yes or
no. If honor_original_data is not set or it is set to
"yes" then the honor mask is set to honor the original
data. If honor_original_data is set to "no" then the
output data values are smoothed at all output locations.
honor_original_data= yes | no (default = yes)
out= dds_grid_dict identifies the output grid data dictionary
format= input grid format (usp, cube, segy, ...)
out_format= output grid format (usp, cube, segy, ...)
veracity= A weighting factor allowing for adjusting the
relative importance of honoring the original data versus
forcing the data to be SMOOTH. Large values of veracity
cause the smoothed data to closely honor the original
values. Small veracity, allows the smoothed data to
be smoother, and not honor the original data as closely.
axis= x y z identifies the names of the coordinate axes
size.x= size_of_fastest_data_dir sets the number of values
in the fastest data direction.
size.y= size_of_second_data_dir sets the number of values
in the second data direction.
Grids2grid was used to extrapolate the top and bottom of the salt body for the Ultra_13 survey. The Landmark picks for the top-of-salt surface had missing picks on some lines, and the edge was sawtoothed because of picking inconsistencies. Notice the square corners and vertical dropouts along the edge of the surface. The square corners and vertical dropouts are probably artifacts associated with the picking of the surface and not characteristic of the shape of the top-of-salt surface. Therefore, a perimeter of the top of salt was picked using xsd.new. This xsd segment was used to create a mask to indicate how to extrapolate and trim the existing picks to form a surface with a smoother edge and fewer spurious artifacts. This perimeter is plotted on the following picture. Notice that the picked perimeter rounds the square corners and ignores the vertical dropouts.

This perimeter is subject to interpretation. The right-angle corners could be rounded more or less than in the above example, and the interpreter who picked the surface could certainly make suggestions for improvement. For example, the left edge of the salt surface should almost certainly be extended to the left edge of the model, as is shown in the following image of the surface, and the grids2grid program can be used to extend the data to the left edge of the model.

The program xsd2grid was then used to create a mask grid specifying the interior of the xsd curve as the location of the salt surface by giving it a seed point interior to the curve. The point (1016, 296) was specified to the xsd2grid program with the command line syntax "seeds_x=1016 seeds_y=296". More complex surfaces consisting of more than one piece, or specified by more than one closed xsd curve, would require more seed points. The mask created in this step is used to specify where output data is required for the "improved" top and bottom of salt surfaces using program grids2grid. The original base-of-salt is shown in the following image. The final extrapolated, trimmed, and interpolated surfaces are shown in the next two images.

Notice that both the top and base of salt were constructed using the same output mask. That is, both surfaces are defined at exactly the same grid nodes. This top and bottom surface are reasonable candidates for the top and bottom limits of the salt body and can be inserted into usp headers using the program "grids2headers" so that the "replace" program can be used to fill the volume between the base and top of salt with salt velocity. A script capturing the process of creating a modified velocity model from the Landmark picks of the top and bottom surfaces is:
# First we make a binary surface from the landmark picks for the
# top of salt. We call the binary surface top.salt.grid
landmark2surf list_of_horizon_files= horiz1 \
# Second, we make a binary surface for the base of salt from the
# Landmark picks and call it base.salt.grid
landmark2surf list_of_horizon_files= horiz2 \
# We plot the top and base with xsd.new and pick
# a single perimeter that looks consistent with both top and bottom.
# We picked the single perimeter to smooth out the corners in the
# original surfaces. We called the xsd pick file s_perim
# We used xsd.new to locate a point (or points) inside of the
# salt body surface (seeds_x= x_value, and seeds_y=yvalue).
# Then we use xsd2grid to create a grid mask from the xsd segment.
# We call the mask "s_perim_mask".
xsd2grid xsd_segments=s_perim \
# Extrapolate, trim, and smooth the data to fit the new mask.
# The new surfaces are called "top.salt.extrap.grid" and
/home/zdjv02/geodev/IRIX64/6/mbs/prod/IP25/grids2grid \
out_data=top.salt.extrap.binary
#/home/zdjv02/geodev/SunOS/4/mbs/prod/sun4m/grids2grid \
/home/zdjv02/geodev/IRIX64/6/mbs/prod/IP25/grids2grid \
# Insert the new surface values into usp velocity model headers
/home/zdjv02/geodev/IRIX64/6/mbs/prod/IP25/grids2headers \
z_max=30020 header_end=TVPair.TVPT01.TVPT02 \
header_start=TVPair.TVPT01.TVPT01 verbose=yes
# Replace the velocity values between the top and bottom
# surfaces with the salt velocity of 14500.
-O/data11/zdjv02/Ultra_13/sept24/topNbot.rep2 \
Suppose that we use linear (or bilinear) interpolation to measure the difference between the original data and the data as interpolated onto a regular grid. Then we must modify the second term (the veracity term) of equation 1. Notice that in the veracity term, the index, $i$, is a function of the location, $x_k$, of each original measurement. The structure of this formula can be made more apparent by defining a new variable that we can call the "weight", $w_{kj}$: the interpolation weight that grid value $y_j$ receives at the original sample location $x_k$. In terms of this weight, the veracity term becomes:
$$v \sum_{k} \left( \sum_{j} w_{kj}\, y_j - z_k \right)^{2} .$$
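A minimal sketch of this generalized veracity term follows, assuming a 1d regular grid, linear interpolation, and the first-difference smoothing term; the names and the use of scipy are illustrative assumptions rather than the grids2grid implementation:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def smooth_onto_grid(x_samples, z_samples, grid_x0, grid_dx, n_grid, v):
    """Solve (D + v*W'W) y = v*W'z, where W holds the interpolation weights."""
    t = (np.asarray(x_samples, dtype=float) - grid_x0) / grid_dx
    j = np.clip(np.floor(t).astype(int), 0, n_grid - 2)    # left bracketing nodes
    frac = t - j
    m = len(t)
    rows = np.repeat(np.arange(m), 2)                       # two weights per sample
    cols = np.column_stack([j, j + 1]).ravel()
    vals = np.column_stack([1.0 - frac, frac]).ravel()
    W = sp.csr_matrix((vals, (rows, cols)), shape=(m, n_grid))

    B = sp.diags([-np.ones(n_grid - 1), np.ones(n_grid - 1)], [0, 1],
                 shape=(n_grid - 1, n_grid))
    D = B.T @ B                                             # first-difference penalty
    A = (D + v * (W.T @ W)).tocsc()
    return spsolve(A, v * (W.T @ np.asarray(z_samples, dtype=float)))

With bilinear interpolation on a 2d grid, each row of W would instead hold four nonzero weights, coupling the four surrounding grid nodes as described earlier.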