As often happens with important concepts, the idea of penalization applied to functional fitting can be traced to more origins than one. The inspiration from the probabilistic side arose in the context of “graduation,” a “borrowing strength” enterprise in the actuarial context. The best exposition of the Whittaker (or, as actuarial science prefers to have it, Whittaker–Henderson) graduation is, as a matter of fact, Chapter XI of Whittaker and Robinson (1924), entitled “Graduation, or the Smoothing of Data.” It also provides the best account of the earlier references of Whittaker (1922–1923, (1924); those of Henderson (1924, (1925, (1928, (1938) are to be found elsewhere.
As often happens with important concepts, the idea of penalization applied to functional fitting can be traced to more origins than one. The inspiration from the probabilistic side arose in the context of “graduation,” a “borrowing strength” enterprise in the actuarial context. The best exposition of the Whittaker (or, as actuarial science prefers to have it, Whittaker–Henderson) graduation is, as a matter of fact, Chapter XI of Whittaker and Robinson (1924), entitled “Graduation, or the Smoothing of Data.” It also provides the best account of the earlier references of Whittaker (1922–1923, (1924); those of Henderson (1924, (1925, (1928, (1938) are to be found elsewhere.
A process of “smoothing” is to be performed over the original values, typically those of a mortality table, burdened by observational or other errors. The objective, as Whittaker and Robinson point out, is not so much to get a smooth curve, as to “get the most probable deaths.” After surveying several existing methodologies (among them one that can be regarded as a prototype for local polynomial fitting) they adopt the conventional normal model for the distribution of errors, together with the normal a priori distribution for what they call a measure of roughness, S. The subsequent use of what they consider a “fundamental theorem in the theory of Inductive Probability” (with no personal attribution) yields as the “most probable hypothesis” that “which makes
a maximum,” with F signifying fidelity (“of the graduated to the ungraduated values, respectively”).The subsequent century of further development changed not that much.
Nowadays,
is used rather than ; the word “fidelity” is still around – although, as the mathematical expression quantifies “infidelity,” more specific terms like “lack-of-fit,” or even better and becoming widespread, loss, L(f), associated with fit f, are taking its place. Given that various other objectives may be pursued rather than smoothness, it is preferred to speak in a neutral fashion about penalty, P(f). Finally, a custom developed to write the terms in the reverse order: in the modern formulation, we seek the solution ofThe result of (3.1) is sometimes called the maximum a posteriori (MAP) estimate. The prior/posterior, random-effect interpretation of (3.1) has received regular attention in the literature, in the work of Good (1971), Good and Gaskins (1971), Kimeldorf and Wahba (1970), Watson (1984), Cressie (1989, (1990), Wahba (1990a), Kent and Mardia (1994), and others.
A separate line of development – probability-free, as not aligning itself with what earlier could be called, according to Stigler (1986), “combination of observations” or “reconciling inconsistent observational equations,” and is now often referred to as “statistics” – recognizes as its origin the observation of Hadamard (1902) that not all problems are really well posed. As a consequence, solving them may be not easy; to this end, Tikhonov (1943, (1963) proposed a technique, or rather a family of techniques, for which he coined the catchy word regularization. His most successful proposal, the one referred to as the Tikhonov regularization, coincides in its form with (3.1).
Some used to prefer to speak about the Tikhonov–Phillips regularization instead; but then it may be of interest that Phillips (1962) originally started from the formulation
Hence, it may be appropriate to reserve (3.1) for Tikhonov alone and give the specific name Phillips regularization to the case where the problem is formulated in the loss-constrained form (3.2) – as was done in the Russian literature, which also spoke about Ivanov regularization when the tasks were specified in a penalty-constrained form
Ideologically, this alternative corresponds to the fixed-effect viewpoint: if the estimated regression function does not vary with every realization, and allows for repeated sampling (possibly with varying number of observations and different sizes of error), then it is crucial to determine the right amount of “roughness” (or whatever P aims to control) of the “true” f: once we know it, we keep it fixed and optimize merely the goodness of fit. Gu (1998) advocates (3.3) and its tuning parameter
as better suited for indexing models across replications. It may be of interest that Tibshirani (1996) also starts his exposition with the penalty-constrained formulation, perhaps motivated by the connection to the nonnegative garrote of Breiman (1995).In the view of the Lagrange multiplier theory, it is clear that Tikhonov, Phillips, and Ivanov are closely related, the essential assumption being the convexity of both L and P[see][]Vas70. Tibshirani (2015) studies a numerical method for the general form of (3.3) related to the “forward stagewise” algorithm of Efron et al. (2004). One reason for the popularity of the Tikhonov formulation (3.2) over the other two – as witnessed in the further development of the Whittaker graduation by Schoenberg (1964) and Reinsch (1967, (1971) – is the computational simplicity it offers in the quadratic case. This may also explain why Rudin et al. (1992) could easily embrace the Phillips formulation (3.2) (with equality) instead: in their case, there is no noteworthy computational advantage to be gained from (3.1). The case of quantile regression is similar – the reason why we recall here the alternative formulations.
Unlike the development of unpenalized regression fitting (the technology minimizing the loss function L(f) alone) the initial work in the penalized domain did not see that much variety and rather adhered to the ideal of “wonderful simplicity” in the sense of Gauss: all formulations were quadratic. Both Whittaker and Tikhonov, as well as Phillips and Schoenberg, took L(f) to be the sum of squared errors. For instance, the formulation of Whittaker determines the graduated values
from the original ones via solvingIt may be of some little conceptual advantage to take the “average error” instead, the sum of squares divided by the number of observations: the prototypic objective function of Tikhonov regularization (3.1) is then
Here
is the result of an evaluation functional, corresponding to , and applied to f: a putative fitted value of f. In the graduation setting, the fits form a vector, f, of the same length as that of the observations ; thus, . When the are fitted by a function, f, of covariates , then .The typical form of the penalty used in (3.5) developed into
, with J a convex function, typically a norm or its square, and D a linear operator. The motivation for the particular form of the penalty is usually rather intuitive; it is fairly clear that pushing the total size of the differences/derivatives down makes fits smoother, but which particular order of derivatives has to be used may be not that apparent. As a rule, the only specific exact motivation setting apart various penalties is the limiting behavior with respect to the tuning parameter. When in (3.5) (or in the Ivanov formulation), the sum of approaches 0 and observations are in the limit fitted by themselves. This is even more apparent in the Phillips formulation, where for we obtain an interpolation problemwhose solution, as will be recalled below, is the key to the shape of the solutions of regularization formulations considered above.
The important limit behavior we speak of is, however, that on the opposite side: if
, then the fit in the limit satisfies . Consequently, if J is a norm (or just a strictly increasing function with ), it follows that the fits lie on a graph of a function satisfying . If D is a difference/derivative operator of order k, then this function is constant, linear (affine), or quadratic, respectively, for . Conversely, the penalty does not influence the additive part of the fitted function for which it vanishes.The prevailing choice in the literature, forcing the limit fits to be linear, happens to be
, as in . A notable exception is Silverman (1982), who in the context of density estimation proposed to penalize the logarithm of the probability density and then suggested the use of the third derivative: the limiting fit is then the exponential of a quadratic function, the density of a normal distribution. Interestingly, the third-order differences were Whittaker’s original choice (3.4) as well.Figure 3.1 Maximum recorded running speed as a function of the logarithm of body weight, fitted for various quantile indices τ $ \tau $ as cubic smoothing splines with the appropriate sum-of-check-functions losses. Tuning parameters were selected to match fidelities with those in Figure 3.3.
The setting of graduation enjoyed relative simplicity due to the sequential, equidistant character of observations. On the other hand, in general nonparametric regression we may face complications related to possibly different distances between covariates, and their possible repetition. While differences are still possible (divided by the magnitudes of those distances), a revolutionary leap forward was accomplished by Schoenberg (1964), who recognized that the Whittaker graduation can be formulated in a very general functional setting, and pointed out its connection to splines, the effective interpolation vehicle in numerical analysis.
The new setting means that differences are replaced by derivatives: Df is now a differential operator applied to a function f, and the norm J typically involves some integral over the domain of the definition of f. This brings some technical complications, which we will not detail here: the integrals need to converge, and in general may depend on the domain of integration. The classical theory requires the penalty to have the
character: it is a strictly increasing function of a norm in a Hilbert space, a norm that can be expressed in terms of the inner product in a suitable functional space. A canonical instance of such a penalty, related to the inner product in the Sobolev space of twice-differentiable functions with square-integrable second derivative, isIt this particular fortuitous case, when x is a real variable, the choice of the domain of integration is not essential: any
that covers all covariate points, and in particular , gives the same result.Once the penalty has the
character, the theory retains its full strength when and the minimization involves a more general loss functionThe search for the minimizer of (3.8), an infinite-dimensional problem, can be then reduced to a finite optimization in a finite-dimensional space: the so-called representer theorem of Kimeldorf and Wahba (1970) – we refer to the most general version given by Schölkopf and Smola (2002) – asserts that the solution is a linear combination of a finite number of basis functions. Those functions generally depend on the problem (in particular, on the covariates) and are related to an important notion of positive definite functions of two variables, the so-called reproducing (Mercer) kernel. As it is somewhat more straightforward to obtain the penalty from the reproducing kernel than conversely, a considerable part of the literature concerning regularization in the
context concentrates on kernels rather than on penalties; the minimization of the loss is no longer “penalized” but “kernelized.” See also Wahba (1990b), Green and Silverman (1994), and Ramsay and Silverman (2005).We stress here that the thrust of the representer theorem lies in its explicit characterization, not merely in finite dimensionality. In fact, pretty much every solution to a problem of the form (3.8), with an arbitrary penalty, is characterized by the fitted values
, the values of the optimal function f at the points for ; for if is any function such that , then the loss is the same for and , and thus by the optimality of , the values of the penalty for both functions must be the same too. Therefore, the solution of (3.8) is characterized by the T fitted values, and consequently the space of such solution is some set parametrized by those (finite number of) values, although possibly in a nonlinear way. The only problem here may arise if the function interpolating fitted values is not unique; but even then sometimes a convenient version can be picked up in a unique way, as in the case of the quantile smoothing splines considered below. This small exercise also shows that the shape, the contour structure of the solution (whether it piecewise linear, cubic spline, or something similar) depends, for the pointwise defined loss functions, essentially only on the penalty.All this was certainly not an issue in the original finite-dimensional formulations of Whittaker and others; thus, many feel that it should have remained at that. The advantages of the functional setting may not be that compelling when the only task is to process the data set at hand, but we believe that they become visible when more elaborate sampling strategies (e.g. adding new observations) are involved. A more thorough investigation of such situations may still remain to be done.
Figure 3.2 The same data as in Figure 3.1, now fitted by quintic smoothing splines with the sum-of-check-functions losses. Tuning parameters are again selected to yield the same fidelities in all examples.
Quantile regression loss functions formed as sums of pointwise evaluations of check functions satisfy the conditions required by the representer theorem, which then applies to the formulations where they replace the sum-of-squares loss accompanied by a penalty of the
type,Here
is an interval covering all covariate points and denotes the kth derivative (possibly in the sense of absolute continuity). The solution of the functional, infinite-dimensional optimization task (3.9) can be reduced to a finite-dimensional one of finding the coefficients in an expansion , with known , possibly depending on . In such a case, we can also write J(Dg) as , where the matrix G depends only on the basis functions . The computation of a “quantile regression with smoothing splines” – the terminology is a bit confusing here, especially in view of “quantile smoothing splines” to come below; we thus adopt the term of Bosch et al. (1995) – via (3.9) is then reduced to a quadratic programming problem.Quantile regression proposals in this direction, for the most part with
, were put forward by Bloomfield and Steiger (1983) and others, including Nychka et al. (1995), Bosch et al. (1995), and Oh et al. (2012); the latter paid some attention to the development of dedicated computational algorithms in the iterated reweighted least-squares fashion, to eschew the full deployment of quadratic programming, a task which was at the time not that routine as it is nowadays, due to the rapid evolution of convex optimization software as documented by Koenker and Mizera (2014). More complex and general “kernelized” versions, with potentially a multidimensional scope, were proposed by Takeuchi et al. (2006) and Li et al. (2007).Figure 3.1 shows the fit of various quantiles (the indices
indicated in the legend) of the maximum recorded running speed of mammals as a function of the logarithm of their body weight, as cubic ( ) smoothing splines; by way of homage to Whittaker, we show also the analogous fits for in Figure 3.2. The fact that we take the logarithm of the body weight in these examples, but not that of the running speed, is not meant as a dispute with Example 7.2 of Koenker (2005), featured also in Koenker et al. (1994), where the logarithm of speed is taken as well; our intention was rather to achieve some variety. The tuning parameters for the penalized fits were initially selected by eye, roughly the same for all ; later they were slightly modified to match the fidelities in Figure 3.3, where the properties of quantile smoothing splines do not allow for that much flexibility in setting arbitrary fidelity values as in the other cases.Figure 3.3 The maximum speed of mammals as a function of the logarithm of weight, now fitted as quantile smoothing splines, linear total-variation splines with the sum-of-check-functions loss. Tuning parameters selected to yield fidelities that can be matched in the other examples.
The difficulties anticipated in their time by Koenker et al. (1994) in the practical use of quadratic programming for solving (3.9) with the penalty (3.7) motivated a step sideways: holding no prejudices against “kinky” statistics of piecewise linear loss functions, and at the same time aware of the potential of linear programming, Koenker et al. proposed to replace the square in the definition of the penalty (3.7) by the absolute value, thus instantiating formulation (3.9) as
Here
is again an interval covering all covariate points. By the Vitali theorem, the penalty is equal to the total variation of (not of f: note two derivatives) on . The second derivative may be thus merely in the sense of absolute continuity and the problem can be equivalently written asThe original infinite-dimensional formulation can be again reduced to a finite-dimensional one, admitting a representer theorem in the spirit discussed above. Koenker et al. (1994) showed that optimal solutions of (3.9) can be found among piecewise linear f, with the breaks occurring only at covariate points
. The fact that the objective function of (3.11) is convex, but not strictly convex, permits in some cases also optimal solutions that are not piecewise linear; nevertheless, an equivalent piecewise linear solution, a solution with the same value of the objective function, can always be found. (In fact, this aspect is also present in the proof of a general representer theorem for penalties.) The resulting finite-dimensional problem is a linear programming task, for which there are, especially nowadays, very efficient algorithms. An example of the result can be seen in Figure 3.3.As often happens with important ideas, the use of the
norm in penalization has a complex history. In the context of multiple regression (a response vector y fitted by a linear combination ) penalized by a simple norm penalty (a penalty without any linear operator inside it), the use of the norm penalty,as a replacement for the
penalty of ridge regression of Hoerl and Kennard (1970),can be traced back to the thesis of Logan (1965) – see also Donoho and Logan (1992) – and the work of geophysicists Claerbout and Muir (1973), Taylor et al. (1979), and Santosa and Symes (1986). Chen and Donoho (1995) and Chen et al. (1998) pointed out the connection to the total-variation denoising of Rudin et al. (1992), which penalizes the total variation of f rather than that of
; the setting of image processing is akin rather to that of graduation – the data appear in every pixel, there is no need for interpolation. In graduation itself, the use of absolute-value loss (median regression as the special case of (3.11) with ) and the norm/total variation in graduation was proposed by Schuette (1978). In statistics, the regularization became extremely popular under the acronym lasso, “least absolute selection and shrinkage operator,” introduced by Tibshirani (1996).Quantile smoothing splines enjoy many favorable properties. They are conceptually transparent and enjoy a straightforward linear programming implementation (in any of Tikhonov, Phillips, or Ivanov formulations). They offer relatively easy mitigation of the potential complication of crossing, the anomaly that sometimes happens (especially on the boundary of or outside the domain of covariates, for instance in our Figure 3.3 for
and ) when the quantile fit for lies for certain covariate values below the fit for . The straightforward strategy of rearrangement (Chernozhukov et al., 2010), amounting basically to the exchange of fits in parts where they violate the desired monotonicity, does not leave any aesthetically unpleasing kinks as for smooth curves, owing to the fact that both maximum and minimum of any piecewise linear functions is again piecewise linear.Quantile smoothing splines can be also easily adapted to potential shape constraints imposed on fits, like monotonicity or convexity. These shape constraints can actually be viewed themselves as a special case of penalized fits, with penalties assuming values 0 if the fits uphold them, or
if they do not. Such penalties can be designed to conform to the scheme J(Df): for instance, J can an made the indicator (in the sense of convex analysis) of, say, a set , with if u belongs to E, and if it does not; combined with the operator D of the first or second derivative, such penalties respectively enforce (nondecreasing) monotonicity or convexity.The only somewhat unfortunate property may be their name: the adjective “smoothing” may be considered misleading by those expecting in such a context smooth corners rather than kinks. However, this kind of thing may be a marketing problem for emerging brands, but not for well-established ones: everybody now knows that the three musketeers were actually four. Finally, those preferring round corners and wavy curves may opt not only for the
penalties mentioned above, but also for penalties with higher-order derivatives, exposed in the next section – or just smooth the corners, if their principal objective is the same as the desideratum of Whittaker and Robinson (1924): to achieve “not that much smoothness but the most probable deaths.”Figure 3.4 The same data as in all examples, now fitted by quadratic total-variation smoothing splines combined with the sum-of-check-functions loss. Tuning parameters selected again to match fidelities with those in all examples.
Total variation in spline smoothing was taken further in the important paper of Mammen and van de Geer (1997). They, having no special interest in quantile regression, transferred the setting to the realm of squared-error loss, but carried the idea further by introducing the whole class of estimates obtained through solving
Recall that by the Vitali theorem, the penalty can be equivalently written as the integral, over
, of , as the required absolute continuity of is in fact necessary for the feasibility of (3.14). Combining this with the return to quantile loss functions, we obtainThe latter is a generalization of (3.11) involving possibly other k than 2.
Given that due to the connection to the interpolation problem (3.6) the representer theorem depends more essentially on the penalty once the loss amounts to a finite number of pointwise evaluations, it was not surprising that Mammen and van de Geer (1997) reaffirmed the representer theorem of Koenker et al. for
, and added that it holds true also for , again in the sense that the knots of the minimizing piecewise constant “spline” can be chosen at the covariate points. They pointed out, however, that in general this is not true for : their “weak” representer theorem established that in such a case it is known only that the solution is the spline of order , but the locations of knots are unknown and may not coincide with any of the covariate points .This may constitute, for
, a possible difficulty for computation, particularly for squared-error loss. However, for quantile regression, the use of a relatively fine grid of “potential knots” (the original proposal of Mammen and van de Geer) renders the problem computationally quite feasible with modern methods for linear programming. In the least-squares case, Tibshirani (2014) proposed to act as if the knots were in the covariate points, using the opportunity to rename the whole technology – which differs from that of Mammen and van de Geer (1997) only for and higher, with being probably the most likely to be ever used – after the “graduation proposal” (this time in the context of time-series sequences) of Kim et al. (2009), blending the least-squares loss of Whittaker, as given by Whittaker and Robinson (1924), with the absolute-value difference penalty of Schuette (1978).The most important achievement of Mammen and van de Geer (1997) came on the theoretic front: they showed that in the smoothness class of the functions whose total variation of the
th derivative is (uniformly) bounded by a constant, the total-variation splines, with appropriately selected tuning parameter, achieve the minimax rate of convergence – the rate that Donoho and Johnstone (1998) showed cannot be achieved by the estimators that are linear in the observations. These are, in particular, the estimators obtained via the penalties, but, ironically, only when coupled with the least-squares loss functions; those coupled with the quantile regression sum-of-check-functions functions are technically not linear.Translated into the real world, the smoothness class condition essentially means the number of derivatives we believe the functions underlying the data possess. If we cannot rule out jumps, we have to opt for
; if we believe that the functions we seek are continuous, but kinks are allowed, we choose ; finally, if we believe that the fitted functions should be continuous and differentiable, then should do. The latter, with the penalty that may be viewed as the absolute-value transmutation of that of Whittaker, may be, in fact, the best choice for practical smoothing (in the original sense of the word), as suggested by the example in Figure 3.4. The larger values of k may be prone to numerical difficulties as well as yielding fits suffering too much from the inflexibility of the higher-order polynomials involved in their construction – as possibly indicated by Figure 3.5 involving .Whittaker and Robinson (1924) identify several advantages of their method of graduation, among them its superiority to the precursor of local polynomial fitting (also discussed in their book) in handling boundary data. Interestingly, they also include among its positive virtues the dependence on the tuning parameter
, because “a satisfactory method of graduation ought to possess such elasticity, because the degree to which we are justified in sacrificing fidelity in order to obtain smoothness varies greatly from one problem to another;” to determine , they advise to “try two or three different values and see which gives the most satisfactory result” (an approach which was pretty much used in the examples we present here).In the meantime, the outlook may have changed: the onus is now on the availability of “automatic” methods for the selection of tuning parameters. Theoretically oriented researchers may be inclined to feel that more work remains to be done in this respect, while practically oriented ones usually report no problems when already well-known tuning procedures are used.
The latter include various forms of cross-validation, as suggested by Hastie et al. (2008), who favor splitting the data set into relatively few groups. This k-fold cross-validation may suffer from some volatility caused by (typically) random allocation of observations to k groups, but is usually computationally feasible (especially for
, but often also for the commonly used ), and thus more often than not returns operational results.The classical, leave-one-out, n-fold (n being the size of the data set) cross-validation used for
penalties is a bit out of favor here, as its crude computational form can result in a prohibitive amount of computation time, and its refined form is not available, due to the nonlinear character of the problems with nonquadratic objectives; this applies also to its step-sibling, generalized cross-validation. It is possible, however, to employ these strategies on various linearizations of the problem, as proposed by Oh et al. (2012), Yuan (2006), and others.Figure 3.5 Once again the same data, this time fitted by cubic total-variation smoothing splines combined with the sum-of-check-functions loss, with the same strategy for the selection of tuning parameters as in all other examples.
The more preferred approach for the
penalties seems to be via the notion of the degrees of freedom, the projected number of free parameters in the fitted model. As pointed out already by Koenker et al. (1994), and verified by several other sources later, for the penalty the degrees of freedom are equal to the number of exact fits, the number of fitted points with residual equal to 0. The degrees of freedom, , calculated in this way can then be plugged into the objective function of the Akaike information criterion,or (apparently the more favored choice) of the Schwarz information criterion,
The
background inherent in prescriptions involving minimizing total-variation penalties yields in the one-dimensional case results consistent with the usual heuristics of sparse regression: most of the fitted parameters come out as zeros. In one-dimensional spline fitting involving total variation, “zero” means “no knot;” penalizing splines with total variation is therefore more akin to the knot deletion strategies of Smith (1982), championed by Kooperberg and Stone (1991), among others, in the context of log-spline density estimation, than to the classical smoothing splines with penalties. With smoothing splines starting with interpolatory splines and ending up with the linear fit, it is tempting to think that the fits in between pass through all possible intermediate complexity patterns; the truth is rather that the complexity of the fit remains the same, but the parameters are more shrunk toward zero.The success of
penalties in one-dimensional nonparametric regression led to the study of possible extensions in situations with multiple covariates. From the practical point of view, the most accessible path is to act one covariate at a time: either via the additive models briefly discussed in the next section, or coordinatewise, via tensor-product fits based on the univariate technologies. Thus, He et al. (1998) proposed “tensor-product quantile smoothing splines,” piecewise bilinear functions defined on a rectangular grid generated by the covariate points. The bivariate proposal for (when the penalty is the total variation of the function itself) also put forward in Mammen and van de Geer (1997) in the context of least-squares estimation was in a similar spirit.The driving force of these pursuits was the belief that the promise of a behavior similar to the one-dimensional case lies in the appropriate generalization of the mathematical notion of total variation to multidimensional functions and their gradients. An overview of some of this fascinating mathematics was given by Koenker and Mizera (2004), whose aim was also to go beyond the coordinatewise approach, to achieve coordinate-independent, rotationally invariant fits.
All multivariate (and, as we have seen above, some univariate) proposals pretty much had to forgo the hopes for a representer theorem. In such situations, the alternative was to act as if such characterizations existed: to restrict the fits to finite-dimensional domains that were deemed reasonable, and study penalized fitting on those domains. In fact, similar strategies, motivated mostly by real or imaginary algorithmic problems, were also proposed in the world of
penalties – but there they rather lacked the essential appeal due to the existence of principled reproducing-kernel methodology in that context.Koenker and Mizera (2004), too, sought the fits among triograms, piecewise linear functions on a triangulation (Lagrange–Courant triangles in the finite-element jargon); the Delaunay triangulation was preferred due to its optimal properties and algorithmic accessibility. For this particular case, they were able to isolate a fortuitous property which amounts to the following. A general notion of the total variation of the gradient is defined in à la Vitali through the formula
where the gradient and Hessian are interpreted in the sense of Schwartz distributions. To ensure that the penalty defined via (3.16) is invariant with respect to rotations, (i.e. orthogonally invariant), it is natural to require the matrix norm involved in (3.16) to be orthogonally invariant, invariant with respect to orthogonal similarity. In such a case, every total variation (3.16) with these properties reduces (up to a multiplicative constant dependent on the norm, but independent of the function and triangulation) to the same triogram penalty of Koenker and Mizera (2004): the sum, over all edges, of the length of the edge multiplied by the norm of the difference of gradients in the two pieces of linearity adjacent to the edge. As the latter difference is orthogonal to the edge, the finite-dimensional problem of finding penalized triograms,
restricted to triograms becomes ones again a linear programming task, with the penalty in (3.17) replaced by the triogram penalty.
An example of the matrix norm used in definition (3.16) could be simply the
Euclidean sum-of-squares, Frobenius–Hilbert–Schmidt norm; together with the quantile objective function, it would create a formulationwhere subscripts denote partial derivatives. This can be seen as the
pendant for thin-plate splines, the smoothing spline methodology in the realm of penalization, formulated throughHowever, despite the fact that the structure of the penalty in (3.18) resembles that of the group lasso of Yuan and Lin (2006), which in its typical applications achieves “group sparsity” (either all of the group is zero, or then no sparsity is pushed for within), this outcome seems not to be the case for the solutions of (3.18) or (3.17). A phase-transition phenomenon of Donoho et al. (2013) comes to mind here; also, the situation parallels that for the finite version of (3.19) yielded by the representer theorem, in which the matrix involved does not enjoy sparse banded structure as it did in the one-dimensional case. A possible explanation from physics (plate mechanics) of what is happening in
and penalizations defined by formulations like (3.18) and (3.19) was attempted by Balek et al. (2013); see also Koenker et al. (2002).Summing up, apart from coordinatewise approaches, if there is desire for rotationally independent penalized quantile regression in two (or more) dimensions, the user can either resort to
penalization or “kernelization” in the vein of the already mentioned proposals of Takeuchi et al. (2006) and Li et al. (2007), or can use penalized triograms relying on linear programming. The question is always whether full nonparametric regression in many covariates is really what is needed; some dimension reduction may rather be desirable. One of the options may be then additive fits, briefly reviewed in the final section of this chapter.We do not discuss penalizing by means of ridge (3.13) or lasso (3.12) penalties in linear quantile regression, with possibly many covariates, as this is covered elsewhere in this book. For the same reason, we do not give a comprehensive account of nonparametric regression methods in quantile regression, only of certain ones involving penalization.
As for what remains and was reviewed above, in the future we are bound, in the spirit of the apocryphal quote of Kolmogorov by Vapnik about mathematics operating inside the thin layer between the trivial and the intractable [see][for the exact wording]Vap06, to continue seeing incremental ramifications and modifications for the methods operating with one covariate, and, on the other hand, it is not at all clear how much progress will be encountered in the difficult realm of multiple covariates.
One area, however, that can still use a substantial advance even in the one-dimensional case, is confidence bands and, more generally, inference for the penalized regression formulations considered here. This is still a pressing issue even for methods using
penalties and least-squares loss; in the context of quantile regression, and penalties in particular, the recent explosion in the post-model selection inference literature, which we do not attempt to survey here, may perhaps eventually create a body of research work which the area discussed here can benefit from, rather than be taken advantage of.Some inspiring work in this direction by Wahba (1983) and Nychka (1988) was carried further in the framework of additive models by Wood (2006); the latter in turn became the departure point for Koenker (2011). Recall that in classical additive models the functions of covariates are decomposed into additive contributions of separate variables
and the latter are fitted by nonparametric methods aimed at the functions of one covariate. Not all the
need be present in the resulting fit; also, some of the can be fitted parametrically.An extension of this is the ANOVA fits of Gu (2002), which also add functions of two variables (and then possibly three, and so on). This are the approach adopted by Koenker (2011): the fitted functions are expanded into a sum of univariate and bivariate functions,
and out of those present in the model, the univariate ones are fitted by quantile smoothing splines, and the bivariate ones by penalized triograms. Koenker (2011) then gives the methodology for calculating uniform confidence “bandaids” for the additive components of the fits.
Another possible direction of expansion could be new kinds of additive models arising in connection with the strategy adopted by Maciak and Mizera (2016) for fitting “splines with changepoints.” As indicated at the end of the section discussing total-variation splines, there is no “universal Swiss army knife,” no universal penalty for all levels of smoothness. As a consequence, if we, say, admit jumps in our fits, we may have to face the piecewise constant character of the rest of them, even if we believe that between the jumps our fits can be general smooth functions. In such a situation, a representer theorem is hardly feasible, as more thorough mathematical analysis typically shows that the functions that one penalty allows (making it finite) make another penalty infinite.
Therefore, the fit has to be sought as a linear combination of basis functions that include, say, both classical cubic spline basis functions together with the Heaviside step functions, with steps located at covariate points, and their first and second antiderivatives, to allow possible changes in level or derivative(s). There is a certain analogy to the proposal of Mumford and Shah (1989) in image processing here: the image is segmented into subregions with the help of a total-variation penalty, and in the subregions smoothing is performed via the classical smoothing
penalty.The extension of this approach suggested by Maciak and Mizera (2016) might take this further and allow for additive fits in the same variable:
, for instance, may be further decomposed into a sum of several functions of , each with different smoothness properties, which can subsequently be fitted via and penalties of different orders. These penalties may then act either separately, or they may be coupled in an penalty in the spirit of the group lasso of Yuan and Lin (2006), expressing the belief that changes in level and in derivative are bound to occur simultaneously rather than separately.The proliferation of modal verbs in the last paragraph indicates that we have come to the last word in the title of this last section, and thus to the very end of this chapter as well.