988

As often happens with important concepts, the idea of penalization applied to functional fitting can be traced to more origins than one. The inspiration from the probabilistic side arose in the context of “graduation,” a “borrowing strength” enterprise in the actuarial context. The best exposition of the Whittaker (or, as actuarial science prefers to have it, Whittaker–Henderson) graduation is, as a matter of fact, Chapter XI of Whittaker and Robinson (1924), entitled “Graduation, or the Smoothing of Data.” It also provides the best account of the earlier references of Whittaker (1922–1923, (1924); those of Henderson (1924, (1925, (1928, (1938) are to be found elsewhere.

As often happens with important concepts, the idea of penalization applied to functional fitting can be traced to more origins than one. The inspiration from the probabilistic side arose in the context of “graduation,” a “borrowing strength” enterprise in the actuarial context. The best exposition of the Whittaker (or, as actuarial science prefers to have it, Whittaker–Henderson) graduation is, as a matter of fact, Chapter XI of Whittaker and Robinson (1924), entitled “Graduation, or the Smoothing of Data.” It also provides the best account of the earlier references of Whittaker (1922–1923, (1924); those of Henderson (1924, (1925, (1928, (1938) are to be found elsewhere.

A process of “smoothing” is to be performed over the original values, typically those of a mortality table, burdened by observational or other errors. The objective, as Whittaker and Robinson point out, is not so much to get a smooth curve, as to “get the most probable deaths.” After surveying several existing methodologies (among them one that can be regarded as a prototype for local polynomial fitting) they adopt the conventional normal model for the distribution of errors, together with the normal *a priori* distribution for what they call a *measure of roughness*, *S*. The subsequent use of what they consider a “fundamental theorem in the theory of Inductive Probability” (with no personal attribution) yields as the “most probable hypothesis” that “which makes

The subsequent century of further development changed not that much.

Nowadays,

is used rather than ; the word “fidelity” is still around – although, as the mathematical expression quantifies “infidelity,” more specific terms like “lack-of-fit,” or even better and becoming widespread,
3.1

The result of (3.1) is sometimes called the maximum a posteriori (MAP) estimate. The prior/posterior, random-effect interpretation of (3.1) has received regular attention in the literature, in the work of Good (1971), Good and Gaskins (1971), Kimeldorf and Wahba (1970), Watson (1984), Cressie (1989, (1990), Wahba (1990a), Kent and Mardia (1994), and others.

A separate line of development – probability-free, as not aligning itself with what earlier could be called, according to Stigler (1986), “combination of observations” or “reconciling inconsistent observational equations,” and is now often referred to as “statistics” – recognizes as its origin the observation of Hadamard (1902) that not all problems are really well posed. As a consequence, solving them may be not easy; to this end, Tikhonov (1943, (1963) proposed a technique, or rather a family of techniques, for which he coined the catchy word *regularization*. His most successful proposal, the one referred to as the *Tikhonov regularization*, coincides in its form with (3.1).

Some used to prefer to speak about the Tikhonov–Phillips regularization instead; but then it may be of interest that Phillips (1962) originally started from the formulation

3.2

(although actually in a slightly different form, with the equality
instead of the inequality
). This phrasing is particularly apt if we (somehow) know the noise level
; otherwise, of course, the Lagrange multiplier theory makes it quite apparent that the Tikhonov formulation (3.1) readily derives from (3.2). The choice between equality and inequality is not that substantial, as in typical cases the optimum makes the inequality constraint active, hence satisfying the equality.
Hence, it may be appropriate to reserve (3.1) for Tikhonov alone and give the specific name *Phillips regularization* to the case where the problem is formulated in the loss-constrained form (3.2) – as was done in the Russian literature, which also spoke about *Ivanov regularization* when the tasks were specified in a penalty-constrained form

3.3

Ideologically, this alternative corresponds to the fixed-effect viewpoint: if the estimated regression function does not vary with every realization, and allows for repeated sampling (possibly with varying number of observations and different sizes of error), then it is crucial to determine the right amount of “roughness” (or whatever *P* aims to control) of the “true”
*f*: once we know it, we keep it fixed and optimize merely the goodness of fit. Gu (1998) advocates (3.3) and its tuning parameter

In the view of the Lagrange multiplier theory, it is clear that Tikhonov, Phillips, and Ivanov are closely related, the essential assumption being the convexity of both *L* and *P*[see][]Vas70. Tibshirani (2015) studies a numerical method for the general form of (3.3) related to the “forward stagewise” algorithm of Efron et al. (2004). One reason for the popularity of the Tikhonov formulation (3.2) over the other two – as witnessed in the further development of the Whittaker graduation by Schoenberg (1964) and Reinsch (1967, (1971) – is the computational simplicity it offers in the quadratic case. This may also explain why Rudin et al. (1992) could easily embrace the Phillips formulation (3.2) (with equality) instead: in their case, there is no noteworthy computational advantage to be gained from (3.1). The case of quantile regression is similar – the reason why we recall here the alternative formulations.

Unlike the development of unpenalized regression fitting (the technology minimizing the loss function *L*(*f*) alone) the initial work in the penalized domain did not see that much variety and rather adhered to the ideal of “wonderful simplicity” in the sense of Gauss: all formulations were quadratic. Both Whittaker and Tikhonov, as well as Phillips and Schoenberg, took *L*(*f*) to be the sum of squared errors. For instance, the formulation of Whittaker determines the graduated values

3.4

It may be of some little conceptual advantage to take the “average error” instead, the sum of squares divided by the number of observations: the prototypic objective function of Tikhonov regularization (3.1) is then

3.5

Here

is the result of anThe typical form of the penalty used in (3.5) developed into

, with
3.6

whose solution, as will be recalled below, is the key to the shape of the solutions of regularization formulations considered above.

The important limit behavior we speak of is, however, that on the opposite side: if

, then the fit in the limit satisfies . Consequently, ifThe prevailing choice in the literature, forcing the limit fits to be linear, happens to be

, as in . A notable exception is Silverman (1982), who in the context of density estimation proposed to penalize theFigure 3.1 Maximum recorded running speed as a function of the logarithm of body weight, fitted for various quantile indices τ $ \tau $ as cubic smoothing splines with the appropriate sum-of-check-functions losses. Tuning parameters were selected to match fidelities with those in Figure 3.3.

The setting of graduation enjoyed relative simplicity due to the sequential, equidistant character of observations. On the other hand, in general nonparametric regression we may face complications related to possibly different distances between covariates, and their possible repetition. While differences are still possible (divided by the magnitudes of those distances), a revolutionary leap forward was accomplished by Schoenberg (1964), who recognized that the Whittaker graduation can be formulated in a very general functional setting, and pointed out its connection to splines, the effective interpolation vehicle in numerical analysis.

The new setting means that differences are replaced by derivatives: *Df* is now a differential operator applied to a function *f*, and the norm *J* typically involves some integral over the domain of the definition of *f*. This brings some technical complications, which we will not detail here: the integrals need to converge, and in general may depend on the domain of integration. The classical theory requires the penalty to have *the
*

3.7

It this particular fortuitous case, when *x* is a real variable, the choice of the domain of integration is not essential: any

Once the penalty has the

character, the theory retains its full strength when and the minimization involves a more general loss function
3.8

The search for the minimizer of (3.8), an infinite-dimensional problem, can be then reduced to a finite optimization in a finite-dimensional space: the so-called *representer theorem* of Kimeldorf and Wahba (1970) – we refer to the most general version given by Schölkopf and Smola (2002) – asserts that the solution is a linear combination of a finite number of basis functions. Those functions generally depend on the problem (in particular, on the covariates) and are related to an important notion of positive definite functions of two variables, the so-called *reproducing (Mercer) kernel*. As it is somewhat more straightforward to obtain the penalty from the reproducing kernel than conversely, a considerable part of the literature concerning regularization in the

We stress here that the thrust of the representer theorem lies in its explicit characterization, not merely in finite dimensionality. In fact, pretty much every solution to a problem of the form (3.8), with an arbitrary penalty, is characterized by the fitted values

, the values of the optimal functionAll this was certainly not an issue in the original finite-dimensional formulations of Whittaker and others; thus, many feel that it should have remained at that. The advantages of the functional setting may not be that compelling when the only task is to process the data set at hand, but we believe that they become visible when more elaborate sampling strategies (e.g. adding new observations) are involved. A more thorough investigation of such situations may still remain to be done.

Figure 3.2 The same data as in Figure 3.1, now fitted by quintic smoothing splines with the sum-of-check-functions losses. Tuning parameters are again selected to yield the same fidelities in all examples.

Quantile regression loss functions formed as sums of pointwise evaluations of check functions satisfy the conditions required by the representer theorem, which then applies to the formulations where they replace the sum-of-squares loss accompanied by a penalty of the

type,
3.9

Here

is an interval covering all covariate points and denotes theQuantile regression proposals in this direction, for the most part with

, were put forward by Bloomfield and Steiger (1983) and others, including Nychka et al. (1995), Bosch et al. (1995), and Oh et al. (2012); the latter paid some attention to the development of dedicated computational algorithms in the iterated reweighted least-squares fashion, to eschew the full deployment of quadratic programming, a task which was at the time not that routine as it is nowadays, due to the rapid evolution of convex optimization software as documented by Koenker and Mizera (2014). More complex and general “kernelized” versions, with potentially a multidimensional scope, were proposed by Takeuchi et al. (2006) and Li et al. (2007).Figure 3.1 shows the fit of various quantiles (the indices

indicated in the legend) of the maximum recorded running speed of mammals as a function of the logarithm of their body weight, as cubic ( ) smoothing splines; by way of homage to Whittaker, we show also the analogous fits for in Figure 3.2. The fact that we take the logarithm of the body weight in these examples, but not that of the running speed, is not meant as a dispute with Example 7.2 of Koenker (2005), featured also in Koenker et al. (1994), where the logarithm of speed is taken as well; our intention was rather to achieve some variety. The tuning parameters for the penalized fits were initially selected by eye, roughly the same for all ; later they were slightly modified to match the fidelities in Figure 3.3, where the properties of quantile smoothing splines do not allow for that much flexibility in setting arbitrary fidelity values as in the other cases.Figure 3.3 The maximum speed of mammals as a function of the logarithm of weight, now fitted as quantile smoothing splines, linear total-variation splines with the sum-of-check-functions loss. Tuning parameters selected to yield fidelities that can be matched in the other examples.

The difficulties anticipated in their time by Koenker et al. (1994) in the practical use of quadratic programming for solving (3.9) with the penalty (3.7) motivated a step sideways: holding no prejudices against “kinky” statistics of piecewise linear loss functions, and at the same time aware of the potential of linear programming, Koenker et al. proposed to replace the square in the definition of the penalty (3.7) by the absolute value, thus instantiating formulation (3.9) as

3.10

Here

is again an interval covering all covariate points. By the Vitali theorem, the penalty is equal to the
3.11

The original infinite-dimensional formulation can be again reduced to a finite-dimensional one, admitting a representer theorem in the spirit discussed above. Koenker et al. (1994) showed that optimal solutions of (3.9) can be found among piecewise linear *f*, with the breaks occurring only at covariate points

As often happens with important ideas, the use of the

norm in penalization has a complex history. In the context of multiple regression (a response vector
3.12

as a replacement for the

penalty of ridge regression of Hoerl and Kennard (1970),
3.13

can be traced back to the thesis of Logan (1965) – see also Donoho and Logan (1992) – and the work of geophysicists Claerbout and Muir (1973), Taylor et al. (1979), and Santosa and Symes (1986). Chen and Donoho (1995) and Chen et al. (1998) pointed out the connection to the total-variation denoising of Rudin et al. (1992), which penalizes the total variation of *f* rather than that of

Quantile smoothing splines enjoy many favorable properties. They are conceptually transparent and enjoy a straightforward linear programming implementation (in any of Tikhonov, Phillips, or Ivanov formulations). They offer relatively easy mitigation of the potential complication of crossing, the anomaly that sometimes happens (especially on the boundary of or outside the domain of covariates, for instance in our Figure 3.3 for

and ) when the quantile fit for lies for certain covariate values below the fit for . The straightforward strategy of rearrangement (Chernozhukov et al., 2010), amounting basically to the exchange of fits in parts where they violate the desired monotonicity, does not leave any aesthetically unpleasing kinks as for smooth curves, owing to the fact that both maximum and minimum of any piecewise linear functions is again piecewise linear.Quantile smoothing splines can be also easily adapted to potential *shape constraints* imposed on fits, like monotonicity or convexity. These shape constraints can actually be viewed themselves as a special case of penalized fits, with penalties assuming values 0 if the fits uphold them, or

The only somewhat unfortunate property may be their name: the adjective “smoothing” may be considered misleading by those expecting in such a context smooth corners rather than kinks. However, this kind of thing may be a marketing problem for emerging brands, but not for well-established ones: everybody now knows that the three musketeers were actually four. Finally, those preferring round corners and wavy curves may opt not only for the

penalties mentioned above, but also for penalties with higher-order derivatives, exposed in the next section – or just smooth the corners, if their principal objective is the same as the desideratum of Whittaker and Robinson (1924): to achieve “not that much smoothness but the most probable deaths.”Figure 3.4 The same data as in all examples, now fitted by quadratic total-variation smoothing splines combined with the sum-of-check-functions loss. Tuning parameters selected again to match fidelities with those in all examples.

Total variation in spline smoothing was taken further in the important paper of Mammen and van de Geer (1997). They, having no special interest in quantile regression, transferred the setting to the realm of squared-error loss, but carried the idea further by introducing the whole class of estimates obtained through solving

3.14

Recall that by the Vitali theorem, the penalty can be equivalently written as the integral, over

, of , as the required absolute continuity of is in fact necessary for the feasibility of (3.14). Combining this with the return to quantile loss functions, we obtain
3.15

The latter is a generalization of (3.11) involving possibly other *k* than 2.

Given that due to the connection to the interpolation problem (3.6) the representer theorem depends more essentially on the penalty once the loss amounts to a finite number of pointwise evaluations, it was not surprising that Mammen and van de Geer (1997) reaffirmed the representer theorem of Koenker et al. for

, and added that it holds true also for , again in the sense that the knots of the minimizing piecewise constant “spline”This may constitute, for

, a possible difficulty for computation, particularly for squared-error loss. However, for quantile regression, the use of a relatively fine grid of “potential knots” (the original proposal of Mammen and van de Geer) renders the problem computationally quite feasible with modern methods for linear programming. In the least-squares case, Tibshirani (2014) proposed to act as if the knots were in the covariate points, using the opportunity to rename the whole technology – which differs from that of Mammen and van de Geer (1997) only for and higher, with being probably the most likely to be ever used – after the “graduation proposal” (this time in the context of time-series sequences) of Kim et al. (2009), blending the least-squares loss of Whittaker, as given by Whittaker and Robinson (1924), with the absolute-value difference penalty of Schuette (1978).The most important achievement of Mammen and van de Geer (1997) came on the theoretic front: they showed that in the smoothness class of the functions whose total variation of the

th derivative is (uniformly) bounded by a constant, the total-variation splines, with appropriately selected tuning parameter, achieve the minimax rate of convergence – the rate that Donoho and Johnstone (1998) showed cannot be achieved by the estimators that are linear in the observations. These are, in particular, the estimators obtained via the penalties, but, ironically, only when coupled with the least-squares loss functions; those coupled with the quantile regression sum-of-check-functions functions are technically not linear.Translated into the real world, the smoothness class condition essentially means the number of derivatives we believe the functions underlying the data possess. If we cannot rule out jumps, we have to opt for

; if we believe that the functions we seek are continuous, but kinks are allowed, we choose ; finally, if we believe that the fitted functions should be continuous and differentiable, then should do. The latter, with the penalty that may be viewed as the absolute-value transmutation of that of Whittaker, may be, in fact, the best choice for practical smoothing (in the original sense of the word), as suggested by the example in Figure 3.4. The larger values ofWhittaker and Robinson (1924) identify several advantages of their method of graduation, among them its superiority to the precursor of local polynomial fitting (also discussed in their book) in handling boundary data. Interestingly, they also include among its positive virtues the dependence on the tuning parameter

, because “a satisfactory method of graduation ought to possess such elasticity, because the degree to which we are justified in sacrificing fidelity in order to obtain smoothness varies greatly from one problem to another;” to determine , they advise to “try two or three different values and see which gives the most satisfactory result” (an approach which was pretty much used in the examples we present here).In the meantime, the outlook may have changed: the onus is now on the availability of “automatic” methods for the selection of tuning parameters. Theoretically oriented researchers may be inclined to feel that more work remains to be done in this respect, while practically oriented ones usually report no problems when already well-known tuning procedures are used.

The latter include various forms of cross-validation, as suggested by Hastie et al. (2008), who favor splitting the data set into relatively few groups. This *k-fold cross-validation* may suffer from some volatility caused by (typically) random allocation of observations to *k* groups, but is usually computationally feasible (especially for

The classical, leave-one-out, *n*-fold (*n* being the size of the data set) cross-validation used for

Figure 3.5 Once again the same data, this time fitted by cubic total-variation smoothing splines combined with the sum-of-check-functions loss, with the same strategy for the selection of tuning parameters as in all other examples.

The more preferred approach for the

penalties seems to be via the notion of theor (apparently the more favored choice) of the Schwarz information criterion,

The

background inherent in prescriptions involving minimizing total-variation penalties yields in the one-dimensional case results consistent with the usual heuristics of sparse regression: most of the fitted parameters come out as zeros. In one-dimensional spline fitting involving total variation, “zero” means “no knot;” penalizing splines with total variation is therefore more akin to the knot deletion strategies of Smith (1982), championed by Kooperberg and Stone (1991), among others, in the context of log-spline density estimation, than to the classical smoothing splines with penalties. With smoothing splines starting with interpolatory splines and ending up with the linear fit, it is tempting to think that the fits in between pass through all possible intermediate complexity patterns; the truth is rather that the complexity of the fit remains the same, but the parameters are more shrunk toward zero.The success of

penalties in one-dimensional nonparametric regression led to the study of possible extensions in situations with multiple covariates. From the practical point of view, the most accessible path is to act one covariate at a time: either via the additive models briefly discussed in the next section, or coordinatewise, via tensor-product fits based on the univariate technologies. Thus, He et al. (1998) proposed “tensor-product quantile smoothing splines,” piecewise bilinear functions defined on a rectangular grid generated by the covariate points. The bivariate proposal for (when the penalty is the total variation of the function itself) also put forward in Mammen and van de Geer (1997) in the context of least-squares estimation was in a similar spirit.The driving force of these pursuits was the belief that the promise of a behavior similar to the one-dimensional case lies in the appropriate generalization of the mathematical notion of total variation to multidimensional functions and their gradients. An overview of some of this fascinating mathematics was given by Koenker and Mizera (2004), whose aim was also to go beyond the coordinatewise approach, to achieve coordinate-independent, rotationally invariant fits.

All multivariate (and, as we have seen above, some univariate) proposals pretty much had to forgo the hopes for a representer theorem. In such situations, the alternative was to act as if such characterizations existed: to restrict the fits to finite-dimensional domains that were deemed reasonable, and study penalized fitting on those domains. In fact, similar strategies, motivated mostly by real or imaginary algorithmic problems, were also proposed in the world of

penalties – but there they rather lacked the essential appeal due to the existence of principled reproducing-kernel methodology in that context.Koenker and Mizera (2004), too, sought the fits among *triograms*, piecewise linear functions on a triangulation (Lagrange–Courant triangles in the finite-element jargon); the Delaunay triangulation was preferred due to its optimal properties and algorithmic accessibility. For this particular case, they were able to isolate a fortuitous property which amounts to the following. A general notion of the total variation of the gradient is defined in *à la* Vitali through the formula

3.16

where the gradient and Hessian are interpreted in the sense of Schwartz distributions. To ensure that the penalty defined via (3.16) is invariant with respect to rotations, (i.e. orthogonally invariant), it is natural to require the matrix norm involved in (3.16) to be orthogonally invariant, invariant with respect to orthogonal similarity. In such a case, every total variation (3.16) with these properties reduces (up to a multiplicative constant dependent on the norm, but independent of the function and triangulation) to the same *triogram penalty* of Koenker and Mizera (2004): the sum, over all edges, of the length of the edge multiplied by the norm of the difference of gradients in the two pieces of linearity adjacent to the edge. As the latter difference is orthogonal to the edge, the finite-dimensional problem of finding *penalized triograms*,

3.17

restricted to triograms becomes ones again a linear programming task, with the penalty in (3.17) replaced by the triogram penalty.

An example of the matrix norm used in definition (3.16) could be simply the

Euclidean sum-of-squares, Frobenius–Hilbert–Schmidt norm; together with the quantile objective function, it would create a formulation
3.18

where subscripts denote partial derivatives. This can be seen as the

pendant for
3.19

However, despite the fact that the structure of the penalty in (3.18) resembles that of the group lasso of Yuan and Lin (2006), which in its typical applications achieves “group sparsity” (either all of the group is zero, or then no sparsity is pushed for within), this outcome seems not to be the case for the solutions of (3.18) or (3.17). A phase-transition phenomenon of Donoho et al. (2013) comes to mind here; also, the situation parallels that for the finite version of (3.19) yielded by the representer theorem, in which the matrix involved does not enjoy sparse banded structure as it did in the one-dimensional case. A possible explanation from physics (plate mechanics) of what is happening in

and penalizations defined by formulations like (3.18) and (3.19) was attempted by Balek et al. (2013); see also Koenker et al. (2002).Summing up, apart from coordinatewise approaches, if there is desire for rotationally independent penalized quantile regression in two (or more) dimensions, the user can either resort to

penalization or “kernelization” in the vein of the already mentioned proposals of Takeuchi et al. (2006) and Li et al. (2007), or can use penalized triograms relying on linear programming. The question is always whether full nonparametric regression in many covariates is really what is needed; some dimension reduction may rather be desirable. One of the options may be then additive fits, briefly reviewed in the final section of this chapter.We do not discuss penalizing by means of ridge (3.13) or lasso (3.12) penalties in linear quantile regression, with possibly many covariates, as this is covered elsewhere in this book. For the same reason, we do not give a comprehensive account of nonparametric regression methods in quantile regression, only of certain ones involving penalization.

As for what remains and was reviewed above, in the future we are bound, in the spirit of the apocryphal quote of Kolmogorov by Vapnik about mathematics operating inside the thin layer between the trivial and the intractable [see][for the exact wording]Vap06, to continue seeing incremental ramifications and modifications for the methods operating with one covariate, and, on the other hand, it is not at all clear how much progress will be encountered in the difficult realm of multiple covariates.

One area, however, that can still use a substantial advance even in the one-dimensional case, is confidence bands and, more generally, inference for the penalized regression formulations considered here. This is still a pressing issue even for methods using

penalties and least-squares loss; in the context of quantile regression, and penalties in particular, the recent explosion in the post-model selection inference literature, which we do not attempt to survey here, may perhaps eventually create a body of research work which the area discussed here can benefit from, rather than be taken advantage of.Some inspiring work in this direction by Wahba (1983) and Nychka (1988) was carried further in the framework of additive models by Wood (2006); the latter in turn became the departure point for Koenker (2011). Recall that in classical additive models the functions of covariates are decomposed into additive contributions of separate variables

and the latter are fitted by nonparametric methods aimed at the functions of one covariate. Not all the

need be present in the resulting fit; also, some of the can be fitted parametrically.An extension of this is the ANOVA fits of Gu (2002), which also add functions of two variables (and then possibly three, and so on). This are the approach adopted by Koenker (2011): the fitted functions are expanded into a sum of univariate and bivariate functions,

and out of those present in the model, the univariate ones are fitted by quantile smoothing splines, and the bivariate ones by penalized triograms. Koenker (2011) then gives the methodology for calculating uniform confidence “bandaids” for the additive components of the fits.

Another possible direction of expansion could be new kinds of additive models arising in connection with the strategy adopted by Maciak and Mizera (2016) for fitting “splines with changepoints.” As indicated at the end of the section discussing total-variation splines, there is no “universal Swiss army knife,” no universal penalty for all levels of smoothness. As a consequence, if we, say, admit jumps in our fits, we may have to face the piecewise constant character of the rest of them, even if we believe that between the jumps our fits can be general smooth functions. In such a situation, a representer theorem is hardly feasible, as more thorough mathematical analysis typically shows that the functions that one penalty allows (making it finite) make another penalty infinite.

Therefore, the fit has to be sought as a linear combination of basis functions that include, say, both classical cubic spline basis functions together with the Heaviside step functions, with steps located at covariate points, and their first and second antiderivatives, to allow possible changes in level or derivative(s). There is a certain analogy to the proposal of Mumford and Shah (1989) in image processing here: the image is segmented into subregions with the help of a total-variation penalty, and in the subregions smoothing is performed via the classical smoothing

penalty.The extension of this approach suggested by Maciak and Mizera (2016) might take this further and allow for additive fits *in the same variable*:

The proliferation of modal verbs in the last paragraph indicates that we have come to the last word in the title of this last section, and thus to the very end of this chapter as well.

Balek, V.
, and
I.
Mizera
. 2013. “Mechanical models in nonparametric regression.” In From Probability to Statistics and Back: High-Dimensional Models and Processes, A Festschrift in Honor of Jon A, edited by
M.
Banerjee
,
F.
Bunea
,
J.
Huang
,
V.
Koltchinskii
and
M. H.
Maathuis
, 5–19., Wellner, volume 9 of IMS Collections
Beachwood, OH: Institute of Mathematical Statistics.

Bloomfield, P.
, and
W. S.
Steiger
. 1983. Least Absolute Deviations. Theory: Applications and Algorithms. Birkhäuser, Boston.

Bosch, R. J.
,
Y.
Ye
, and
G. G.
Woodworth
. 1995. “A convergent algorithm for quantile regression with smoothing splines.” Computational Statistics and Data Analysis
19: 613–630.

Breiman, L.
1995. “Better subset selection using nonnegative garrote.” Technometrics
37: 373–384.

Chen, S. S.
, and
D. L.
Donoho
. 1995. “Examples of basis pursuit. In SPIE’s 1995 International Symposium on Optical Science, Engineering, and Instrumentation, pages 564--574. International Society for Optics and Photonics.

Chen, S. S.
,
D. L.
Donoho
, and
M. A.
Saunders
. 1998. “Atomic decomposition by basis pursuit.” SIAM Journal on Computing
20: 33–61.

Chernozhukov, V.
,
I.
Fernández-Val
, and
A.
Galichon
. 2010. “Quantile and probability curves without crossing.” Econometrica
78 (3): 1093–1125.

Claerbout, J. F.
, and
F.
Muir
. 1973. “Robust modeling with erratic data.” Geophysics
38: 826–844.

Cressie, N.
1989. “Geostatistics.” American Statistician
43: 197–202.

Cressie, N.
1990. “Reply.” American Statistician
44: 256–258.

Donoho, D. L.
, and
I.
Johnstone
. 1998. “Minimax estimation via wavelet shrinkage.” Annals of Statistics
26 (8): 879–921.

Donoho, D. L.
, and
B. F.
Logan
. 1992. “Signal recovery and the large sieve.” SIAM Journal on Applied Mathematics
52 (2): 577–591.

Donoho, D. L.
,
I.
Johnstone
, and
A.
Montanari
. 2013. “Accurate prediction of phase transitions in compressed sensing via a connection to minimax denoising.” IEEE Tractions on Information Theory
59 (6): 3396–3433.

Efron, B.
,
T.
Hastie
,
I.
Johnstone
, and
R.
Tibshirani
. 2004. “Least angle regression.” Annals of Statistics
32 (2): 407–499.

Good, I. J.
1971. “A nonparametric roughness penalty for probability densities.” Nature
229: 29–30.

Good, I. J.
, and
R. A.
Gaskins
. 1971. “Nonparametric roughness penalties for probability densities.” Biometrika
58: 255–277.

Green, P. J.
, and
B. W.
Silverman
. 1994. Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. London: Chapman & Hall.

Gu, C.
1998. “Model indexing and smoothing parameter selection in nonparametric function estimation (with discussion).” Statistica Sinica
8: 607–646.

Gu, C.
2002. Smoothing spline ANOVA models. New York: Springer-Verlag.

Hadamard, J.
1902. Sur les problèmes aux dérivées partielles et leur signification physique, 49–52. Bulletin: Princeton University.

Hastie, T.
,
R.
Tibshirani
, and
J.
Friedman
. 2008. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer.

He, X.
,
P.
Ng
, and
S.
Portnoy
. 1998. “Bivariate quantile smoothing splines.” Journal of the Royal Statistical Society B
60: 537–550.

Henderson, R.
1924. “A new method of graduation.” Tractions of the Actuarial Society of America
25: 29–40.

Henderson, R.
1925. “Further remarks on graduation.” Tractions of the Actuarial Society of America
26: 52–57.

Henderson, R.
1928. Some points in the general theory of graduation. In Proceedings of the International Mathematical Congress held in Toronto, August 11-16, 1924, Vol II., pages 815-820. University of Toronto Press, Toronto.

Henderson, R.
1938. Mathematical Theory of Graduation. New York: Actuarial Society of America.

Hoerl, A. E.
, and
R. W.
Kennard
. 1970. “Ridge regression: Biased estimation for nonorthogonal problems.” Technometrics
12: 55–67.

Kent, J. T.
, and
K. V.
Mardia
. 1994. “The link between kriging and thin-plate splines.” Probability, 326–339., Statistics and Optimization
New York: Wiley.

Kim, S.-J.
,
K.
Koh
,
S.
Boyd
, and
D.
Gorinevsky
. 2009. “ℓ1 trend filtering.” SIAM Review
51 (2): 339–360.

Kimeldorf, G. S.
, and
G.
Wahba
. 1970. “A correspondence between Bayesian estimation on stochastic processes and smoothing by splines.” Annals of Mathematical Statistics
41: 495–502.

Koenker, R.
2005. Quantile Regression. Cambridge: Cambridge University Press.

Koenker, R.
2011. “Additive models for quantile regression: Model selection and confidence bandaids.” Brazilian Journal of Probability and Statistics
25 (3): 239–262.

Koenker, R.
, and
I.
Mizera
. 2002. “Elastic and plastic splines: Some experimental comparisons.” In Statistical Data Analysis Based on the L1-norm and Related Methods, edited by
Y.
Dodge
, 405–414. Basel: Birkhäuser.

Koenker, R.
, and
I.
Mizera
. 2004. “Penalized triograms: total variation regularization for bivariate smoothing.” Journal of the Royal Statistical Society B
66: 1681–1736.

Koenker, R.
, and
I.
Mizera
. 2014. “Convex optimization in R.” Journal of Statistical Software
60 (5): 1–23.

Koenker, R.
,
P.
Ng
, and
S.
Portnoy
. 1994. “Quantile smoothing splines.” Biometrika
81: 673–680.

Kooperberg, C.
, and
C. J.
Stone
. 1991. “A study of logspline density estimation.” Computational Statistics and Data Analysis
12: 327–347.

Li, Y.
,
Y.
Liu
, and
J.
Zhu
. 2007. “Quantile regression in reproducing kernel Hilbert spaces.” Journal of the American Statistical Association
102: 255–268.

Logan, B. F.
1965. Properties of high-pass signals. PhD thesis, Columbia University, New York.

Maciak, M.
, and
Mizera, I.
2016. Splines with changepoints: Additive models for functional data, Preprint..

Mammen, E.
, and
S.
van de Geer
. 1997. “Locally adaptive regression splines.” Annals of Statistics
25: 387–413.

Mumford, D.
, and
J.
Shah
. 1989. “Optimal approximation of piecewise smooth functions and associated variational problems.” Communications on Pure and Applied Mathematics
42 (5): 577–685.

Nychka, D.
1988. “Bayesian confidence intervals for smoothing splines.” Journal of the American Statistical Association
83 (404): 1134–1143.

Nychka, D.
,
G.
Gray
,
P.
Haaland
,
D.
Martin
, and
M.
O’Connell
. 1995. “A nonparametric regression approach to syringe grading for quality improvement.” Journal of the American Statistical Association
90 (432): 1171–1178.

Oh, H.-S.
,
T. C. M.
Lee
, and
D. W.
Nychka
. 2012. “Fast nonparametric quantile regression with arbitrary smoothing methods.” Journal of Computational and Graphical Statistics
20: 510–526.

Phillips, D. L.
1962. “A technique for the numerical solution of certain integral equations of the first kind.” Journal of the ACM
9: 84–97.

Ramsay, J. O.
, and
B. W.
Silverman
. 2005. Functional data analysis. 2nd ed. New York: Springer.

Reinsch, C. H.
1967. “Smoothing by spline functions.” Numerische Mathematik
10: 177–183.

Reinsch, C. H.
1971. “Smoothing by spline functions II.” Numerische Mathematik
16: 451–454.

Rudin, L. I.
,
S.
Osher
, and
E.
Fatemi
. 1992. “Nonlinear total variation based noise removal algorithms.” Physica D
60: 259–268.

Santosa, F.
, and
W.
Symes
. 1986. “Linear inversion of band-limited reflection seismograms.” SIAM Journal on Computing
7 (4): 1307–1330.

Schoenberg, I. J.
1964. “Spline functions and the problem of graduation.” Proceedings of the National Academy of Sciences of the USA
52: 947–950.

Schölkopf, B.
, and
A. J.
Smola
. 2002. Learning with Kernels. Cambridge, MA: MIT Press.

Schuette, D. R.
1978. “A linear programming approach to graduation.” Transactions of the Society of Actuaries
30: 407–445.

Silverman, B. W.
1982. “On the estimation of a probability density function by the maximum penalized likelihood method.” Annals of Statistics
10: 795–810.

Smith, P. L.
1982. Curve Fitting and Modeling with Splines Using Statistical Variable Selection Techniques. NASA, Langley Research Center, Hampla, VA, NASA Report 166034..

Stigler, S. M.
1986. The History of Statistics: The Measurement of Uncertainty Before 1900. Cambridge, MA: Harvard University Press.

Takeuchi, I.
,
Q. V.
Le
,
T. D.
Sears
, and
A. J.
Smola
. 2006. “Nonparametric quantile estimation.” Journal of Machine Learning Research
7: 1231–1264.

Taylor, H.
,
S.
Banks
, and
J.
McCoy
. 1979. “Deconvolution with the ℓ1-norm.” Geophysics
44 (1): 39–52.

Tibshirani, R.
1996. “Regression shrinkage and selection via the lasso.” Journal of the Royal Statistical Society B
58: 267–288.

Tibshirani, R. J.
2014. “Adaptive piecewise polynomial estimation via trend filtering.” Annals of Statistics
42 (1): 285–323.

Tibshirani, R. J.
2015. “A general framework for fast stagewise algorithms.” Journal of Machine Learning Research
16: 2543–2588.

Tikhonov, A. N.
1943. “On the stability of inverse problems.” Doklady Akademii Nauk SSSR
39: 195–198.

Tikhonov, A. N.
1963. “On the solution of incorrectly posed problems and the regularization method.” Doklady Akademii Nauk SSSR
151: 501–504.

Vapnik, V.
2006. Empirical inference science: Afterword of 2006. 2nd ed. New York: In Estimating of Dependences Based on Empirical Data. Springer.

Vasin, V. V.
1970. “Relationship of several variational methods for the approximate solution of ill-posed problems (in Russian).” Mathematical Notes
7: 161–165.

Wahba, G.
1983. “Bayesian “confidence intervals" for the cross-validated smoothing spline.” Journal of the Royal Statistical Society B
45 (1): 133–150.

Wahba, G.
1990a. Comment on Cressie. American Statistician, 44: 255--256.

Wahba, G.
1990b. Spline Models for Observational Data. SIAM, Philadelphia.

Watson, G. S.
1984. “Smoothing and interpolation by kriging and with splines.” Journal of the International Association for Mathematical Geology
16: 601–615.

Whittaker, E. T.
1922--1923. On a new method of graduation. Proceedings of the Edinburgh Mathematical Society, 41: 63--75.

Whittaker, E. T.
1924. “On the theory of graduation.” Proceedings of the Royal Society, Edinburgh
44: 77–83.

Whittaker, E. T.
, and
G.
Robinson
. 1924. The Calculus of Observations. London: Blackie.

Wood, S. N.
2006. Generalized Additive Models: An Introduction with R. Boca Raton, FL: Chapman and Hall/CRC.

Yuan, M.
2006. “GACV for quantile smoothing splines.” Computational Statistics and Data Analysis
50 (3): 813–829.

Yuan, M.
, and
Y.
Lin
. 2006. “Model selection and estimation in regression with grouped variables.” Journal of the Royal Statistical Society B
68: 49–67.

Zou, H.
, and
M.
Yuan
. 2008. “Regularized simultaneous model selection in multiple quantiles regression.” Computational Statistics and Data Analysis
52: 5296–5304.