Linear hypothesis

A statistical hypothesis according to which the mean $a$ of an $n$-dimensional normal law $N_n(a, \sigma^2 I)$ (where $I$ is the unit matrix), lying in a linear subspace $\Pi^s \subset \mathbf{R}^n$ of dimension $s < n$, belongs to a linear subspace $\Pi^r \subset \Pi^s$ of dimension $r < s$.

Many problems of mathematical statistics can be reduced to the problem of testing a linear hypothesis, which is often stated in the following so-called canonical form. Let $X = (X_1, \dots, X_n)$ be a normally distributed vector with independent components, with $\mathsf{E} X_i = a_i$ for $i = 1, \dots, s$, $\mathsf{E} X_i = 0$ for $i = s+1, \dots, n$, and $\mathsf{D} X_i = \sigma^2$ for $i = 1, \dots, n$, where the quantities $a_1, \dots, a_s$ are unknown. Then the hypothesis $H_0$, according to which

$$ a_1 = \dots = a_r = 0, \quad r < s < n, $$

is the canonical linear hypothesis.

Example. Let $Y_1, \dots, Y_n$ and $Z_1, \dots, Z_m$ be $n + m$ independent random variables with normal distributions $N_1(a, \sigma^2)$ and $N_1(b, \sigma^2)$, respectively, where the parameters $a$, $b$, $\sigma^2$ are unknown. Then the hypothesis $H_0$: $a = b = 0$ is a linear hypothesis, while a hypothesis $a = a_0$, $b = b_0$ with $a_0 \neq b_0$ is not linear.


However, the hypothesis $a = a_0$, $b = b_0$ with $a_0 \neq b_0$ does correspond to a linear hypothesis about the means of the transformed quantities $Y_i' = Y_i - a_0$, $Z_i' = Z_i - b_0$.

  • This page was last edited on 5 June 2020, at 22:17.


5.2 - Writing Hypotheses

The first step in conducting a hypothesis test is to write the hypothesis statements that are going to be tested. For each test you will have a null hypothesis (\(H_0\)) and an alternative hypothesis (\(H_a\)).

When writing hypotheses, there are three things we need to know: (1) the parameter being tested, (2) the direction of the test (non-directional, right-tailed, or left-tailed), and (3) the hypothesized value of the parameter.

  • At this point we can write hypotheses for a single mean (\(\mu\)), paired means (\(\mu_d\)), a single proportion (\(p\)), the difference between two independent means (\(\mu_1-\mu_2\)), the difference between two proportions (\(p_1-p_2\)), a simple linear regression slope (\(\beta\)), and a correlation (\(\rho\)).
  • The research question will give us the information necessary to determine if the test is two-tailed (e.g., "different from," "not equal to"), right-tailed (e.g., "greater than," "more than"), or left-tailed (e.g., "less than," "fewer than").
  • The research question will also give us the hypothesized parameter value. This is the number that goes in the hypothesis statements (i.e., \(\mu_0\) and \(p_0\)). For the difference between two groups, regression, and correlation, this value is typically 0.

Hypotheses are always written in terms of population parameters (e.g., \(p\) and \(\mu\)).  The tables below display all of the possible hypotheses for the parameters that we have learned thus far. Note that the null hypothesis always includes the equality (i.e., =).

One Group Mean
  • Two-tailed, non-directional. Research question: Is the population mean different from \(\mu_0\)? \(H_0: \mu = \mu_0\); \(H_a: \mu \neq \mu_0\).
  • Right-tailed, directional. Research question: Is the population mean greater than \(\mu_0\)? \(H_0: \mu = \mu_0\); \(H_a: \mu > \mu_0\).
  • Left-tailed, directional. Research question: Is the population mean less than \(\mu_0\)? \(H_0: \mu = \mu_0\); \(H_a: \mu < \mu_0\).

Paired Means
  • Two-tailed, non-directional. Research question: Is there a difference in the population? \(H_0: \mu_d = 0\); \(H_a: \mu_d \neq 0\).
  • Right-tailed, directional. Research question: Is there a mean increase in the population? \(H_0: \mu_d = 0\); \(H_a: \mu_d > 0\).
  • Left-tailed, directional. Research question: Is there a mean decrease in the population? \(H_0: \mu_d = 0\); \(H_a: \mu_d < 0\).

One Group Proportion
  • Two-tailed, non-directional. Research question: Is the population proportion different from \(p_0\)? \(H_0: p = p_0\); \(H_a: p \neq p_0\).
  • Right-tailed, directional. Research question: Is the population proportion greater than \(p_0\)? \(H_0: p = p_0\); \(H_a: p > p_0\).
  • Left-tailed, directional. Research question: Is the population proportion less than \(p_0\)? \(H_0: p = p_0\); \(H_a: p < p_0\).

Difference between Two Independent Means
  • Two-tailed, non-directional. Research question: Are the population means different? \(H_0: \mu_1 = \mu_2\); \(H_a: \mu_1 \neq \mu_2\).
  • Right-tailed, directional. Research question: Is the population mean in group 1 greater than the population mean in group 2? \(H_0: \mu_1 = \mu_2\); \(H_a: \mu_1 > \mu_2\).
  • Left-tailed, directional. Research question: Is the population mean in group 1 less than the population mean in group 2? \(H_0: \mu_1 = \mu_2\); \(H_a: \mu_1 < \mu_2\).

Difference between Two Proportions
  • Two-tailed, non-directional. Research question: Are the population proportions different? \(H_0: p_1 = p_2\); \(H_a: p_1 \neq p_2\).
  • Right-tailed, directional. Research question: Is the population proportion in group 1 greater than the population proportion in group 2? \(H_0: p_1 = p_2\); \(H_a: p_1 > p_2\).
  • Left-tailed, directional. Research question: Is the population proportion in group 1 less than the population proportion in group 2? \(H_0: p_1 = p_2\); \(H_a: p_1 < p_2\).

Simple Linear Regression: Slope
  • Two-tailed, non-directional. Research question: Is the slope in the population different from 0? \(H_0: \beta = 0\); \(H_a: \beta \neq 0\).
  • Right-tailed, directional. Research question: Is the slope in the population positive? \(H_0: \beta = 0\); \(H_a: \beta > 0\).
  • Left-tailed, directional. Research question: Is the slope in the population negative? \(H_0: \beta = 0\); \(H_a: \beta < 0\).

Correlation (Pearson's \(r\))
  • Two-tailed, non-directional. Research question: Is the correlation in the population different from 0? \(H_0: \rho = 0\); \(H_a: \rho \neq 0\).
  • Right-tailed, directional. Research question: Is the correlation in the population positive? \(H_0: \rho = 0\); \(H_a: \rho > 0\).
  • Left-tailed, directional. Research question: Is the correlation in the population negative? \(H_0: \rho = 0\); \(H_a: \rho < 0\).
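For the one-group-mean case, the three hypothesis pairs above map directly onto the `alternative` argument of SciPy's one-sample t-test. The sketch below assumes SciPy is installed; the sample data and \(\mu_0 = 50\) are invented for illustration:

```python
from scipy import stats

# Hypothetical sample: does the population mean differ from mu_0 = 50?
sample = [51.2, 49.8, 52.4, 50.9, 48.7, 53.1, 51.5, 50.2]
mu_0 = 50

# Two-tailed: H0: mu = mu_0 vs Ha: mu != mu_0
t_two, p_two = stats.ttest_1samp(sample, mu_0, alternative="two-sided")

# Right-tailed: H0: mu = mu_0 vs Ha: mu > mu_0
t_right, p_right = stats.ttest_1samp(sample, mu_0, alternative="greater")

# Left-tailed: H0: mu = mu_0 vs Ha: mu < mu_0
t_left, p_left = stats.ttest_1samp(sample, mu_0, alternative="less")

print(p_two, p_right, p_left)
```

Because the t distribution is symmetric, the two one-tailed p-values sum to 1, and the two-tailed p-value is twice the smaller one-tailed p-value.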

Definition: Linear Hypothesis


A linear hypothesis is a hypothesis which concerns linear functions of the parameters in the context of regression analysis and analysis of variance.

Arbitrary Example

An example of a linear hypothesis: in an experiment with two treatment effects $\tau_1$ and $\tau_2$, consider the hypothesis $H_0$ that the two effects are equal, that is, that $\tau_1 - \tau_2 = 0$.

Thus $H_0$ is a linear hypothesis about $\tau_1 - \tau_2$.


Linguistic Note

The word hypothesis is pronounced hy-PO-the-sis, with the stress on the second syllable.

Its plural is hypotheses, pronounced hy-PO-the-seez.

The word hypothesis comes from the Greek for supposition, literally to put under, that is, sub-position.

The idea is that one puts an idea under scrutiny.

The verb hypothesize (British English: hypothesise) means to make a hypothesis, that is, to suppose.

The adjective hypothetical means having the nature of a hypothesis.

A hypothetical question is a question about a situation that is supposed (or pretended) to be imaginary. One might, for example, announce that a question about to be posed is hypothetical in order to distance oneself from the possibility of actually being its subject.

  • 1998: David Nelson: The Penguin Dictionary of Mathematics (2nd ed.): linear hypothesis
  • 2008: David Nelson: The Penguin Dictionary of Mathematics (4th ed.): linear hypothesis


The Linear Hypothesis


George A. F. Seber

Part of the book series: Springer Series in Statistics (SSS)


In this chapter we consider a number of linear hypotheses before giving a general definition. Our first example is found in regression analysis.


Author information

Department of Statistics, The University of Auckland, Auckland, New Zealand

George A. F. Seber


Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Seber, G.A.F. (2015). The Linear Hypothesis. In: The Linear Model and Hypothesis. Springer Series in Statistics. Springer, Cham. https://doi.org/10.1007/978-3-319-21930-1_2


Print ISBN: 978-3-319-21929-5

Online ISBN: 978-3-319-21930-1


Linear Hypothesis

In subject area: Engineering


Modern Spacecraft GNC

Stefano Silvestrini, ... Andrea Colagrossi, in Modern Spacecraft Guidance, Navigation, and Control, 2023

Feature engineering and polynomial regression

The linear hypothesis in Eq. (15.1) can easily be extended to capture more complex, nonlinear problems through the addition of nonlinear features or feature combinations. For example, for a two-dimensional (2D) input $x = (x_1, x_2)$, we can define new features $x_3 = x_1^2$ or $x_4 = x_1 x_2$ and add these to our hypothesis. The manual process of creating new features is called feature engineering and ideally involves previous domain knowledge about the problem at hand. A common, systematic way of increasing model complexity is polynomial features, which use polynomial combinations of the features with degree less than or equal to the specified degree. For example, for a 2D input $x = (x_1, x_2)$, the degree-two polynomial features are $1, x_1, x_2, x_1 x_2, x_1^2, x_2^2$, resulting in the hypothesis $h_w(x) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1 x_2 + w_4 x_1^2 + w_5 x_2^2$.
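This feature map can be sketched in a few lines of NumPy; the weight values below are invented purely to show how the extended hypothesis is evaluated:

```python
import numpy as np

def poly_features_2d(x1, x2):
    """Degree-2 polynomial features of a 2D input (x1, x2):
    [1, x1, x2, x1*x2, x1**2, x2**2], as listed in the text."""
    return np.array([1.0, x1, x2, x1 * x2, x1**2, x2**2])

# The extended linear hypothesis is then a dot product h_w(x) = w . phi(x)
phi = poly_features_2d(2.0, 3.0)
w = np.array([0.5, 1.0, -1.0, 0.2, 0.1, 0.05])  # illustrative weights only
h = w @ phi

print(phi)  # [1. 2. 3. 6. 4. 9.]
print(h)
```

The model stays linear in the weights, which is why the machinery of linear regression still applies after the feature expansion.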

Machine learning in heat transfer

C. Balaji, ... Sateesh Gedupudi, in Heat Transfer Engineering, 2021

10.3.2 Linear regression

As mentioned above, neural networks can be seen as sophisticated regression techniques. Before seeing how neural networks work, we will describe the simplest possible regression technique—linear regression—and see how it can be seen as a learning algorithm under certain circumstances.

Linear regression is, in essence, a regression algorithm with a linear model as the hypothesis function. While we have included this model in the current chapter as a stepping stone for neural networks, it is also independently useful in heat transfer. For instance, there are many situations in conduction where we encounter a linear model. Recall that for one-dimensional, steady-state conduction across a slab with constant properties and no heat generation, the temperature profile is linear in the axial direction. Other linear relations also occur in a “hidden” form within heat transfer. For instance, correlations involving the Nusselt number are power laws which will look linear when logarithms are taken.

As mentioned above, for a single input variable $x$ and output $y$, the linear hypothesis model is $\hat{y} = w_0 + w_1 x$.

Let us say that for a single-variable problem we collect $m$ data points $(x_i, y_i)$, $i = 1, 2, \dots, m$, and have to find the "best" line that passes through them. Based on the discussion in the previous section, this line would have to minimize the mean squared error or cost function given by

$$J(w_0, w_1) = \frac{1}{2m} \sum_{i=1}^{m} (w_0 + w_1 x_i - y_i)^2. \tag{10.6}$$

At this point, our optimal parameters for the hypothesis are those that minimize the above sum. These optimal parameters can be calculated in one of two ways:

Direct optimization —This involves solving for the parameters directly by imposing the following conditions at the minimum point:

$$\frac{\partial J}{\partial w_0} = 0, \qquad \frac{\partial J}{\partial w_1} = 0. \tag{10.7}$$

Applying condition (10.7) to the cost function (10.6) results in the following analytical expressions for the optimal parameters (the proof is left as an exercise for the reader; see Problem 10.1):

$$w_1 = \frac{m \sum x_i y_i - \sum x_i \sum y_i}{m \sum x_i^2 - \left( \sum x_i \right)^2}, \qquad w_0 = \frac{\sum y_i - w_1 \sum x_i}{m}, \tag{10.8}$$

where all summations $\sum$ should be read as $\sum_{i=1}^{m}$.

Learning or iterative optimization —The above method of direct optimization has two important disadvantages. First, it works only for simple, linear or near-linear hypothesis functions. Second, every time we add a new data point, we would have to recalculate $w_0$, $w_1$ from scratch using Eq. (10.8); there is no way to reuse the previously calculated parameter values and improve them. The attentive reader will recognize this as precisely the learning problem!

The solution to this learning problem was given before in Step 4 of the learning process. This solution is to initialize the parameters and iteratively keep refining them as new data comes in. We can use gradient descent or any of the multiple optimization algorithms. Within this text, we will be assuming that some effective algorithm (direct or iterative) has been used to obtain the optimized parameters and will be skipping the details. The interested reader is advised to look at specialized machine learning texts such as Goodfellow et al. (2016) or optimization texts such as Balaji (2019) for further details on optimization algorithms.

Example 10.1:

A long mild steel slab of thickness 10 cm, with thermal conductivity $k = 44.5\ \mathrm{W/m\,K}$, can be considered insulated on the top and bottom sides. There is no heat generation in the slab, and the thermophysical properties can be assumed constant. The steady-state temperatures recorded by thermocouples at five selected locations inside the slab are given below. Determine the steady-state heat flux in the slab using appropriate assumptions. Use the direct method for finding the optimal parameters. The slab has an area of $1\ \mathrm{m}^2$ normal to the direction of heat transfer (Table 10.1, Fig. 10.3).

Table 10.1. Steady-state temperature distribution recorded by thermocouples for Example 10.1.

S. no.    x (cm) from left end    T (°C)
1         1                       34.51
2         3                       34.16
3         5                       32.50
4         7                       32.24
5         9                       30.41


Figure 10.3 . Schematic diagram for Example 10.1 .

At steady state, the flux in this problem is constant across the slab; in particular, $q = -k\,\frac{dT}{dx}$. The question therefore boils down to finding $\frac{dT}{dx}$.

The chief difficulty here is that we have only discrete values of the temperature. Further, as we will see below, the measured temperatures do not fall exactly on the straight line expected theoretically for conduction in a slab. This is because of "experimental errors", including noise in the thermocouple measurements. Therefore, we resort to finding the "best" line that fits this data. This now falls within the framework of the learning process listed above.

Let us now follow the template given in the learning process in order to solve this problem.

Step 1: Formulation —The input and output variables for this problem are obvious. We choose x as the input and y = T as the output variable.

Step 2: Hypothesis —Before imposing a hypothesis function, we plot the given data points to see if we can observe a trend. Fig. 10.4 shows how our "output", the temperature, varies with the "input" x, the location. We can immediately observe that the temperature variation has a decreasing trend that is roughly linear. Of course, we also know this from the physics of the problem. A good hypothesis function, therefore, would be the linear function $\hat{y} = w_0 + w_1 x$.


Figure 10.4 . Location versus temperature in example 10.1 .

Step 3: Data collection —The data is already collected in this problem and given the size of the data (just 5 points), it does not make sense to split it further into training and testing sets.

Step 4: Optimal parameters —The hypothesis function is linear, and hence the parameters can be computed using the expressions given in Eq. (10.8). Calculating these (see Exercise 10.2) results in $w_0 = 35.29\,^{\circ}\mathrm{C}$ and $w_1 = -0.506\,^{\circ}\mathrm{C/cm}$.

Since $\hat{y} = w_0 + w_1 x$, the proposed model is $\hat{y} = 35.29 - 0.506\,x$, with $x$ in cm and $\hat{y}$ in $^{\circ}$C.

A plot of the above best fit along with the original thermocouple data is shown in Fig. 10.5 . One can notice that even though the model fit does not pass through any of the data points, it still fits the data quite well.


Figure 10.5 . Location versus temperature in example 10.1 .

Now, $\frac{dT}{dx} = w_1 = -0.506\,^{\circ}\mathrm{C/cm} = -50.6\,^{\circ}\mathrm{C/m}$. Therefore, the heat flux is given by $q = -k\,\frac{dT}{dx} = -44.5 \times (-50.6) \approx 2252\ \mathrm{W/m^2}$.
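The numbers in Example 10.1 can be checked with a short NumPy sketch implementing Eq. (10.8) directly on the Table 10.1 data:

```python
import numpy as np

# Thermocouple data from Table 10.1
x = np.array([1.0, 3.0, 5.0, 7.0, 9.0])            # location, cm
y = np.array([34.51, 34.16, 32.50, 32.24, 30.41])  # temperature, deg C
m = len(x)

# Direct optimization, Eq. (10.8)
w1 = (m * np.sum(x * y) - np.sum(x) * np.sum(y)) / (m * np.sum(x**2) - np.sum(x)**2)
w0 = (np.sum(y) - w1 * np.sum(x)) / m

# Heat flux q = -k dT/dx with k = 44.5 W/(m K); convert deg C/cm to deg C/m
k = 44.5
q = -k * (w1 * 100.0)

print(w0, w1, q)  # w1 = -0.506 deg C/cm, q ≈ 2252 W/m^2
```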

Model representation

Linear regression is the simplest regression algorithm and a classic example for the concepts and notation involved in supervised learning. Given a training data set of $m$ samples, $D = \left( (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)}) \right)$, where each input vector $x \in \mathbb{R}^N$ comes with its corresponding output $y$, and a family of linear hypotheses $h_w(x)\colon X \to Y$, $h_w(x) = w_0 + w_1 x_1 + \dots + w_N x_N$, where each particular hypothesis is defined by a parameter vector $w = (w_0, w_1, \dots, w_N)$, the task of linear regression is to find the hypothesis which best describes the dataset $D$.

Linear regression

Cost function for regression

This corresponds to finding the parameter values $w$ for which the error between the given ground truth (GT) values $y$ and the model predictions $\hat{y} = h_w(x)$ is minimized. The error for a given training example $(x^{(i)}, y^{(i)})$ is given by the loss function, which for linear regression is the squared error: $L(\hat{y}^{(i)}, y^{(i)}) = \frac{1}{2} (\hat{y}^{(i)} - y^{(i)})^2$.

The performance on the whole training set is measured by the cost function, also called the error function or objective function, which for linear regression is the average error over the whole data set, the mean squared error (MSE):

$$J(w) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right)^2.$$

By using the squared error, it is guaranteed that the result is always positive regardless of the sign of the predicted and target values. While the choice of the cost function depends on the problem at hand, MSE loss works well for most (not only linear) regression problems.

The objective is to find the hypothesis, i.e., the model parameters $w$, which minimizes $J(w)$: $w^{*} = \arg\min_{w} J(w)$.

Parameter learning: gradient descent

Although the optimization problem for linear regression with MSE can be solved analytically using the normal equation, here we will introduce and focus on gradient descent (GD), a more general numerical approach which forms the basis for parameter learning in many other ML models, including ANNs and DL problems. Nevertheless, there exist several alternatives for numerical optimization methods, such as:

  • Newton and Gauss–Newton methods.
  • Levenberg–Marquardt algorithm.
  • Conjugate gradient.

A detailed description of each method goes beyond the scope of this book, but it is important to note that the methods share a common basis and, above all, have similar implementations. Indeed, the Levenberg–Marquardt curve-fitting algorithm can be regarded as a combination of the GD and Gauss–Newton methods: it behaves like GD when the weights are far from their optimal values and acts more like Gauss–Newton when the weights are close to their optimal values.

Going back to GD, to illustrate the idea behind it, we will use the simplified case of N = 1 and w 0 = 0 , for which the hypothesis reduces to h w ( x ) = w 1 x 1 and can be represented by a straight line through the origin, with w 1 controlling the slope of the line. In this case, the cost function is a quadratic polynomial in w 1 , shown in Fig. 15.6 .



Figure 15.6 . Linear regression and gradient descent.

GD uses the gradient $\nabla J(w) = \left( \frac{\partial J}{\partial w_0}, \frac{\partial J}{\partial w_1}, \dots, \frac{\partial J}{\partial w_N} \right)^T$ of the cost function with respect to the model parameters in order to systematically update the parameter values and gradually approach a minimum of the cost function. This is illustrated in Fig. 15.6 for the simplified hypothesis: starting from an initial value of the parameter $w_1$, GD determines an increment $\Delta w_1$ for which the error decreases, i.e., $J(w_1 + \Delta w_1) < J(w_1)$. This corresponds to moving down the slope of $J(w_1)$. Since the direction of the steepest positive slope is given by the partial derivative $\frac{\partial J}{\partial w_1}$, the update is done in the opposite direction, $w_1' = w_1 - \alpha \frac{\partial}{\partial w_1} J(w)$, where $\alpha$ is called the learning rate, a hyperparameter which determines the step width and has to be chosen manually. This update is repeated until a minimum of the cost function is reached. For convex functions like the MSE cost function used in linear regression, GD reaches the global minimum. For other, nonconvex cost functions with local minima this is, however, not guaranteed (Fig. 15.7).


Figure 15.7 . Gradient descent, local and global minima.

In the general $N$-dimensional case, the update for $w_0'$ and $w_k'$ ($k \geq 1$) is given by:

$$w_0' = w_0 - \alpha\, \frac{1}{m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right), \qquad w_k' = w_k - \alpha\, \frac{1}{m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right) x_k^{(i)}, \tag{15.2}$$

which, using the gradient, can be written as $w' = w - \alpha \nabla J(w)$.
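These batch updates translate almost line for line into NumPy. A minimal sketch for the single-variable hypothesis $h_w(x) = w_0 + w_1 x$, on synthetic noise-free data where the true line is known:

```python
import numpy as np

def gradient_descent_linreg(x, y, alpha=0.05, iters=20000):
    """Batch gradient descent for h_w(x) = w0 + w1*x with MSE cost
    J(w) = 1/(2m) * sum((h - y)^2)."""
    m = len(x)
    w0, w1 = 0.0, 0.0
    for _ in range(iters):
        h = w0 + w1 * x
        # Gradient components: (1/m) sum(h - y) and (1/m) sum((h - y) * x)
        grad0 = np.sum(h - y) / m
        grad1 = np.sum((h - y) * x) / m
        w0 -= alpha * grad0
        w1 -= alpha * grad1
    return w0, w1

# Noise-free data from y = 1 + 2x: GD should recover w0 ≈ 1, w1 ≈ 2
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x
w0, w1 = gradient_descent_linreg(x, y)
print(w0, w1)
```

Because the MSE cost is convex, GD converges to the same solution the normal equation would give, provided the learning rate is small enough.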

Generalization: under- and overfitting, training, test, and validation set

During training, optimization is performed with the objective to reduce the error on the training set . It is, however, not guaranteed that a model which works well on the training data will perform equally well on new data. Instead, a test set of previously unseen data is used to assess the final performance and generalization of the trained model. In order to make valid statements on the generalization capabilities, train and test sets have to meet certain properties: they have to be independent from each other, meaning that they cannot share data points; and it should be safe to assume that both data sets are drawn from the same probability distribution.

By comparing the error on the training set with the test (also called generalization) error, it can be determined if the model generalizes well, or if it is under- or overfitting the data ( Fig. 15.8 ):


Figure 15.8 . Relationship between model capacity and error.

Underfitting . Both the training and test errors are high, as the model does not account for relevant information present in the training set. The model is said to have high bias/low variance because the assumptions the model is based on are too rigid and prevent it from capturing the variance in the data.

Overfitting . The gap between test and training error is large. The model learns properties and patterns which are very specific to the training data, leading to a small training error. It, however, fails to generalize to the unseen test data, yielding a large test error. This corresponds to low bias/high variance .

The propensity to over- or underfit is captured by the model capacity , which can be loosely defined as a model's ability to approximate complex problems and fit a variety of functions. In the polynomial regression example above, the model capacity increases with the degree of the added polynomial features: the higher the degree, the more variation can be captured by the model. In general, in order for a model to perform well, its capacity has to match the complexity of the specific task: a low-capacity model is unable to solve a complex problem ( Fig. 15.8 , top left), while on the other hand, a high-capacity model may fit the specific data too closely ( Fig. 15.8 , top right). This property of the model is called the bias–variance trade-off ( Fig. 15.8 , bottom) [ 6 ].

In addition to the trainable parameters or weights, a model is further defined by its hyperparameters : parameters which are not inferred during model training but are concerned with the learning algorithm's behavior (e.g., the learning rate α for GD) or the model selection, like the type or structure of the model (e.g., the degree in polynomial regression). Just as the training set has to be separate from the test set, a distinct validation set has to be used to assess the performance of the model during the tuning of the hyperparameters in order to guarantee that the model generalizes well to unseen data.

Uniaxial vibration fatigue

Janko Slavič, ... Miha Boltežar, in Vibration Fatigue by Spectral Methods, 2021

4.2.1 Damage accumulation

The hypothesis of linear damage accumulation, devised by Palmgren [54] and Miner [55], is an established tool for adding up the contributions of the damaging cycles in stress-life models. Each cycle is ascribed a damage portion $D_i$, inversely proportional to the number of cycles-to-failure $N(s_a)$ (assumed at constant cyclic loading), given the stress amplitude $s_a$ of the respective damaging cycle:

$$D_i = \frac{n_i}{N(s_a)}, \qquad D = \sum_i D_i,$$

where D is the total damage, N ( s a ) is determined from the S-N curve, and n i is the total number of cycles at amplitude s a .

The total damage at failure is often defined as D = 1 . It depends largely on loading type, material properties, and other circumstances. There is compelling research by Eulitz and Kotte [56] showing that D can take a value within a very large interval spanning multiple orders of magnitude from 0.01 to 10, where D < 1 for 90% of cases.

The Palmgren–Miner hypothesis, underpinning the stress-life approach, is based on some assumptions that should be considered thoughtfully in the course of a vibration fatigue analysis. Load sequence and interaction events that have a measurable effect on time-to-failure [57] are overlooked by the linear hypothesis. The effect of the rate of damage accumulation, which depends on the loading frequency components, is also neglected. Nevertheless, the hypothesis of linear damage accumulation remains a key part of the fatigue analysis procedure.
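A minimal sketch of Palmgren–Miner accumulation; the Basquin-type S-N curve and its constants below are invented for illustration and are not material data:

```python
def miner_damage(cycle_blocks, sn_curve):
    """Palmgren-Miner linear damage accumulation:
    D = sum_i n_i / N(s_a_i), with failure conventionally at D = 1.
    cycle_blocks: list of (stress_amplitude, n_cycles) pairs.
    sn_curve: function mapping s_a -> cycles-to-failure N(s_a)."""
    return sum(n / sn_curve(s_a) for s_a, n in cycle_blocks)

# Illustrative Basquin-type S-N curve N(s_a) = C * s_a**(-b);
# the constants are made up for the example
sn = lambda s_a: 1e12 * s_a**-3

blocks = [(100.0, 1e5), (200.0, 1e4)]  # (MPa, applied cycles)
D = miner_damage(blocks, sn)
print(D)  # 0.1 + 0.08 = 0.18
```

Linearity here means the damage portions simply add, regardless of the order in which the cycle blocks occur, which is exactly the sequence-effect limitation noted above.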

Supervised learning: regression and classification

Logistic regression for binary classification.

In classification problems, the target variable or class label y which is to be predicted has a discrete value used to distinguish between two (binary classification) or more (multiclass classification) classes. Classical examples for classification problems include sentiment analysis, anomaly detection (e.g., fraud detection), or image classification (e.g., distinguish between cats and dogs, classification of handwritten digits). Binary classification can also be used for fault detection, where given some sensor output x , the objective is to determine if a spacecraft is working correctly ( y = 0 ) or deviates from its nominal performance ( y = 1 ).

Despite the slightly misleading name, logistic regression is not used for regression problems but is one of the most popular and basic ML algorithms for classification. The name merely originates from its similarities to linear regression and from its use of the logistic function. Instead of directly calculating the binary class (0, 1) for each input, the idea behind logistic regression is to (1) predict the conditional probability $P(y \mid x, w)$ of label $y$ being 1 given the input features $x$ and the model parameters $w$, and (2) apply a threshold to the probability value in order to predict the discrete class label, i.e., $\hat{y} = 1$ if $P(y = 1 \mid x, w) \geq 0.5$, and $\hat{y} = 0$ otherwise.

In order to obtain probability values between 0 and 1, logistic regression takes the linear hypothesis seen for linear regression, $z = w_0 + w_1 x_1 + \dots + w_n x_n$, and applies the sigmoid function (also called the logistic function, hence the name logistic regression). The sigmoid function (Fig. 15.12) is defined as $\sigma(z) = \frac{1}{1 + e^{-z}}$ and maps any real value into the range 0–1. The model hypothesis for logistic regression is thus given by: $h_w(x) = \sigma(w_0 + w_1 x_1 + \dots + w_n x_n)$.

By defining an additional constant feature $x_0 = 1$, logistic regression can be presented in a vectorized formulation. Given the feature vector $x = [x_0\ x_1\ \dots\ x_n]^T$ and the parameter or weight vector $w = [w_0\ w_1\ \dots\ w_n]^T$, the input to the logistic function, $z = w_0 + w_1 x_1 + \dots + w_n x_n$, can be vectorized to $z = w^T x$, and the model equation for logistic regression can be written as: $h_w(x) = \sigma(w^T x)$.

In order to illustrate the idea behind logistic regression, note that $\sigma(z) \geq 0.5$ for $z \geq 0$ and $\sigma(z) < 0.5$ for $z < 0$, and thus: $\hat{y} = 1$ exactly when $w^T x \geq 0$.

In a 2D ( N = 2 ) binary classification example, this means that any example with features x 1 and x 2 which satisfy the equation w 0 + w 1 x 1 + w 2 x 2 ≥ 0 will result in a hypothesis prediction y ˆ = 1 . The straight line defined by w 0 + w 1 x 1 + w 2 x 2 = 0 is called the decision boundary as it separates the two regions for which h w ( x ) predicts either class 0 or 1. In higher dimensional problems, the decision boundary is formed by a hyperplane.

More complex models with nonlinear decision boundaries can be generated by adding polynomial features, analogous to the approach seen earlier for polynomial regression. For example, by adding quadratic terms to Eq. (15.5), $z = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2$, the decision boundary can now take on more complex shapes. For illustration, the parameter values $w_0 = -1$, $w_1 = w_2 = 0$, $w_3 = w_4 = 1$ result in a decision boundary defined by the circle equation $x_1^2 + x_2^2 = 1$, shown in Fig. 15.9. Even more complex decision boundaries are possible by adding higher-degree polynomial features.


Figure 15.9 . Classification with linear and nonlinear decision boundary.
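The circular decision boundary described above is easy to verify numerically; a minimal sketch using the parameter values quoted in the text ($w_0 = -1$, $w_1 = w_2 = 0$, $w_3 = w_4 = 1$):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x1, x2):
    """Logistic hypothesis with quadratic features and the parameters
    from the text, so the decision boundary is the circle x1^2 + x2^2 = 1."""
    z = -1.0 + 0.0 * x1 + 0.0 * x2 + 1.0 * x1**2 + 1.0 * x2**2
    return 1 if sigmoid(z) >= 0.5 else 0

print(predict(0.0, 0.0))  # inside the circle  -> class 0
print(predict(2.0, 0.0))  # outside the circle -> class 1
```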

The decision boundary defined above is not a property of the data set but of the model hypothesis, and hence of the parameters, which have to be learned from the training data. As seen for linear regression, this is done by solving an optimization problem involving an adequate loss function.

Cost function for binary classification

Because the nonlinearity of the logistic regression hypothesis h w ( x ) makes the MSE cost function used for linear regression nonconvex, the binary cross-entropy loss function is used instead, given by:
L(h_w(x^(i)), y^(i)) = −log(h_w(x^(i))) if y^(i) = 1; −log(1 − h_w(x^(i))) if y^(i) = 0 (15.7)
In the case of true label y ( i ) = 1 , L goes to infinity for h w → 0 , strongly penalizing an incorrect prediction, whereas the loss disappears ( L = 0 ) for h w = 1 . Analogously, penalization is inverted for y ( i ) = 0 , as shown in Fig. 15.10 .


Figure 15.10 . Binary cross-entropy loss.

As for linear regression, the complete cost function is defined as the sum over all training samples. For that purpose, function Eq. (15.7) can be simplified into a single equation:
L(h_w(x^(i)), y^(i)) = −y^(i) log(h_w(x^(i))) − (1 − y^(i)) log(1 − h_w(x^(i)))
By multiplying the two logarithmic terms by y ( i ) and ( 1 − y ( i ) ) , respectively, only the one corresponding to the specific true class label y ( i ) will add to the total cost. The binary cross-entropy loss is one of the standard loss functions used for classification problems.
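A possible vectorized implementation of the binary cross-entropy loss; the clipping constant is an assumption added here to avoid log(0), not part of the definition in the text:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # clip predictions away from 0 and 1 so log() stays finite
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    y_true = np.asarray(y_true, dtype=float)
    # single-equation form: only the term matching the true label contributes
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return losses.mean()   # average over the training samples
```

As described in the text, a confident wrong prediction (e.g., y_pred near 0 when y_true = 1) is penalized very strongly, while a perfect prediction yields zero loss.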

In contrast to the MSE loss function for linear regression, there is no closed form to determine the optimal parameters using Eq. (15.8) , and numerical methods like GD have to be used. The GD update equations for logistic regression are the same as Eq. (15.2) for linear regression, where h w now represents the hypothesis for logistic regression. Logistic regression models tend to overfit the data, particularly in high-dimensional settings. For this reason, regularization methods are often used to prevent the model from fitting too closely to the training data.
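A minimal batch GD training loop for logistic regression might look as follows; the function name, learning rate, and epoch count are illustrative choices, not values from the text, and no regularization is included:

```python
import numpy as np

def fit_logistic_gd(X, y, eta=0.5, epochs=2000):
    """Batch gradient descent for logistic regression.
    X: (m, n+1) design matrix with a leading column of ones; y: (m,) labels."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predictions h_w(x) for all samples
        grad = X.T @ (p - y) / m             # gradient of the cross-entropy cost
        w -= eta * grad                      # update step with learning rate eta
    return w
```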

Artificial neural networks for multiclass classification

ANNs are the state-of-the-art for many ML problems. They are able to learn complex nonlinear hypotheses, for example, in nonlinear classification problems. One of the strengths of ANNs lies in their capability to automatically learn abstract feature representations of the input data, removing the need for manual feature engineering.

As seen in the previous sections, manual feature generation like the addition of polynomial features is a common and effective way to increase model complexity. However, besides increasing the tendency to overfit, this method is not always feasible in practice, especially for problems that already come with a large number of input features, since the total number of features may grow drastically depending on the order of the added polynomials. Just considering the addition of second-degree polynomial terms, the number of features increases roughly quadratically. Depending on the number of initial features and the data to process, this can become computationally infeasible.

As we will see in more detail when discussing network topologies, this is especially limiting for CV problems like image classification, where the input consists of whole images, or more precisely, the input features are given by the intensity values of each image pixel. For an image of size 100   ×   100, including only second-order polynomial features already results in millions of features, and it is not guaranteed that the increase in complexity will be sufficient. As we will see, ANNs can deal with a large input feature space by successively learning more and more abstract features automatically.

Universal approximation theorem

The foundational theorem establishing the capability of ANNs to approximate arbitrary functions is the so-called universal approximation theorem. The classical form reads:

Let φ : R → R be a nonconstant, bounded, and continuous function (called the activation function). Let I m denote the m-dimensional unit hypercube [ 0,1 ] m . The space of real-valued continuous functions on I m is denoted by C ( I m ) . Then, given any ε > 0 and any function f ∈ C ( I m ) , there exists an integer N , real constants v i , b i ∈ R , and real vectors w i ∈ R m , ∀ i = 1 , … , N , such that we may define:
F(x) = Σ_{i=1}^{N} v_i φ(w_i^T x + b_i)
as an approximate realization of the function f , that is:
|F(x) − f(x)| < ε for all x ∈ I_m
Roughly speaking, the theorem states that there is always a neural network architecture (number of layers, weights, and biases) to approximate a given function to a desired accuracy.

The name artificial neural network derives from the fact that it consists of networks of interconnected nodes, analogous to biological neurons in the brain. The neurons in an ANN are organized in layers. In the classic ANN model called MLP, neurons belonging to the same layer receive inputs from all the neurons of the previous layer and send their output to each neuron of the following layer ( Fig. 15.11 ). The MLP is a feedforward ANN—the flow from layer to layer always goes in the same direction—and it is fully connected , since every neuron is connected to every neuron in the next layer.


Figure 15.11 . Multilayer perceptron/Fully Connected Neural Network with three layers. The b 0 ( 2 ) represents the bias of the second layer (often referred to as the first activation neuron).

The first layer of an ANN, which consists of the input features, is called the input layer , and accordingly, the last layer, which returns the final results, is called the output layer . All the intermediate layers are hidden layers : in contrast to the input and output layers, where (in the case of supervised learning) we can compare the value of each node with its true value given in the training set, the true values for hidden layers are unknown. When counting the number of layers, the input layer is commonly not taken into account, and therefore an ANN with one input, one hidden, and one output layer is referred to as a two-layer network. An ANN with one or more hidden layers is called a DNN. Analogously, a network without hidden layers is called a shallow neural network .

Fig. 15.11 shows a two-layer MLP for binary classification as an example, with the input layer representing k ( 1 ) input features, one hidden layer with k ( 2 ) units, and k ( 3 ) units in the output layer for the classification output. The parameters characterizing an MLP are the following:

L - The number of layers or depth of the network.

k ( l ) - The number of units in layer l .

g ( z ) - The nonlinear activation function applied to each unit (e.g., logistic function σ ( z ) ).

w i ( l ) - The k ( l − 1 ) -dimensional parameter or weight vector of unit i in layer l with elements w i j ( l ) .

b i ( l ) - The bias parameter of unit i in layer l .

z i ( l ) - The input to unit i in layer l ; the inputs of all units of layer l form the k ( l ) -dimensional vector z ( l ) .

a i ( l ) - Activation, i.e., scalar output of unit i in layer l . All outputs of layer l are summarized in the k ( l ) -dimensional activation vector a ( l ) . To facilitate a generalized notation, we define a ( 0 ) = x and a ( L ) = y .

The input z to each unit is calculated from the vector of activations of all units of the previous layer,
z_i^(l) = (w_i^(l))^T a^(l−1) + b_i^(l)
The activations of unit i in layer l are then calculated applying the activation function:
a_i^(l) = g(z_i^(l))
Compared with Eq. (15.5) , the logistic function σ has here been replaced by the more general nonlinear activation function g ( z ) . If the logistic function is used as activation function, each unit acts like a logistic regressor whose inputs are the activations of all units of the previous layer. Hence, logistic regression can be viewed as a shallow, one-layer neural network. In practice, for ANNs, the logistic function σ is often replaced by different nonlinear activation functions ( Fig. 15.12 ). Some of the most common functions are:


Figure 15.12 . Common activation functions.

The sigmoid or logistic function ( Fig. 15.12 , left) maps values to the range ( 0,1 ) and is the default activation function when the output is interpreted as a probability, e.g., the probability of corresponding to a given class in a classification task.

The hyperbolic tangent ( Fig. 15.12 , center) has a shape similar to the sigmoid function, but is zero-centered, returning values in the range between −1 and 1. It is also used in binary classification tasks.

The rectified linear unit ( ReLU ) activation function ( Fig. 15.12 , right) has become an extremely popular activation function for hidden units in DNNs and especially for CNNs, due to a series of advantages: compared to the sigmoid function, computation is more efficient, it provides better gradient propagation, and it mitigates the problem of vanishing gradients, which will be discussed later.

The softmax activation function can be seen as a generalization of the logistic function to multiple dimensions and is defined as:
softmax(z)_i = e^(z_i) / Σ_{j=1}^{K} e^(z_j)
It is commonly used in the output layer of a multiclass classification neural network in order to normalize the output to a probability distribution over the predicted output classes since it guarantees that the sum over all class probabilities adds up to 1.
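A common numerically stable sketch of the softmax; the max-subtraction trick is an implementation detail (it leaves the result unchanged), not part of the definition above:

```python
import numpy as np

def softmax(z):
    # subtracting the maximum prevents overflow in exp();
    # the result is mathematically identical
    e = np.exp(z - np.max(z))
    return e / e.sum()
```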

The notation of the neural network equations can be further simplified by defining the weight matrix W ( l ) containing all the weights which define the connections between layers l and l − 1 . Thus, column i of matrix W ( l ) corresponds to the parameter vector w i ( l ) which contains the weights connecting all units in layer l − 1 with unit i in layer l . The equations summarizing all operations in layer l can then be written as:

W ( l ) has dimensions k ( l ) × ( k ( l − 1 ) + 1 ) , where the “+1” corresponds to the bias unit which is added to the dimension of layer l − 1 .

Cost function for multiclass classification

As mentioned before, the choice of the loss function depends on the specific ML problem. In the example of linear regression, the mean square error was used; for binary classification using logistic regression, binary cross entropy was introduced. For the ANN, we will use multiclass classification with the categorical cross entropy as an example.

Where in binary classification the model prediction was restricted to two classes, labeled 0 or 1, in multiclass classification, the output labels correspond to K classes, with K > 2 . The neural network represents these classes by K output units, resulting in a K -dimensional output vector y ˆ . Using the softmax activation function in the output layer, the sum over all vector elements is 1, and thus each element y k can be interpreted as the probability of belonging to class k .

The categorical cross-entropy cost function for this case is calculated from the GT labels y and the predicted values y ˆ and is defined as:
J = −Σ_t Σ_{k=1}^{K} y_k^(t) log(ŷ_k^(t))
where the subscript k indicates the k -th vector component and t indicates the training sample. In other words, the total categorical cross-entropy cost is obtained by summing the cross-entropy cost over all K output units.
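A possible implementation of this cost for one-hot labels; the function name and the clipping constant are assumptions added to keep the sketch self-contained:

```python
import numpy as np

def categorical_cross_entropy(Y_true, Y_pred, eps=1e-12):
    """Y_true: (m, K) one-hot ground-truth labels; Y_pred: (m, K) predicted
    class probabilities (e.g., softmax outputs). Returns the total cost."""
    Y_pred = np.clip(Y_pred, eps, 1.0)        # avoid log(0)
    # only the component matching the true class contributes per sample
    return -np.sum(Y_true * np.log(Y_pred))
```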

ANN parameter learning: backpropagation and gradient descent

Training of an ANN follows the same steps as seen earlier for linear and logistic regression: firstly, it requires the definition of a cost function which matches the problem in question and defines the optimization objective, followed by the optimization procedure itself, e.g., using GD. The main difference regarding ANNs is the increased complexity of the computation steps involved. The objective is to find the weights W which minimize the cost function J ( W ) , which can be achieved by iterating the following steps:

For a given training sample ( x ( t ) , y ( t ) ) , perform the model computations to obtain the output y ˆ . For a feedforward ANN, this step is called forward propagation since calculations are performed layer by layer and information is propagated forward through the network.

Calculate the scalar loss or error J ( y ˆ , y ) .

Use GD to update the model weights. To this end, we need to calculate the changes in the total error connected to changes in each single weight of the network, given by the partial derivatives ∂J/∂w_ij^(l) , where w_ij^(l) denotes the weight between neuron i of the previous layer l − 1 and neuron j of the current layer l .

The last step poses the biggest challenge when it comes to neural networks. The deep structure of an ANN means that the output or activation of each hidden unit affects many units in the downstream layers, and therefore contributes to the total cost through many separate paths. In order to combine all these effects, an efficient and systematic method for the computation of the partial derivatives is required. The development of the backward propagation (or short, backprop) algorithm as an efficient method to calculate these loss derivatives was one of the major reasons for the increasing interest in ANNs in the 1980s [ 7 ]. Backward propagation calculates the loss derivatives for all hidden units with respect to their activation a , from which it is easy to get the loss derivatives of the weights going into the hidden unit.

In the forward pass, calculations follow this schematic sequence:

In backpropagation, this order is reversed: the change of the error function with respect to the unit output y of the last layer is calculated first, then with respect to the unit input of the last layer, from where the changes of the error function with respect to the weights are obtained. From there, the process is repeated sequentially for the previous layers (where the unit outputs are the activations), until reaching the first hidden layer:

In this way, backpropagation traces the separate contributions to the error starting at the back, i.e., at the output layer, passing through the network layer by layer.

Backpropagation is based on the repeated application of the chain rule of calculus, which states that for a real number x and two functions g and f which map from a real number to a real number and are linked by y = g ( x ) and z = f ( g ( x ) ) = f ( y ) , the derivative of z with respect to x can be obtained by dz/dx = (dz/dy)(dy/dx) . As a consequence, backpropagation requires both the loss function and the activation function to be differentiable.

The following example illustrates the calculations for any two units chosen from subsequent layers: the output layer l and the hidden layer l − 1 , indicated by the superscripts. The subscripts j and i run over all units of layers l and l − 1 , respectively.

The calculations involved are:

Apply the chain rule to calculate the derivative of the loss with respect to the total input received by unit j : ∂J/∂z_j^(l) = (da_j^(l)/dz_j^(l)) ∂J/∂a_j^(l) = a_j^(l)(1 − a_j^(l)) ∂J/∂a_j^(l) , where we used that the derivative of the logistic unit a = σ(z) is da/dz = a(1 − a) . ∂J/∂a_j^(l) is the derivative of the loss with respect to the activation of unit j in layer l .

To get ∂J/∂w_ij^(l) , we observe that z_j^(l) is a linear function of the weights w_ij^(l) and the outputs of the previous layer, a_i^(l−1) . Again, we apply the chain rule: ∂J/∂w_ij^(l) = (∂z_j^(l)/∂w_ij^(l)) ∂J/∂z_j^(l) = a_i^(l−1) ∂J/∂z_j^(l) , where ∂J/∂z_j^(l) has already been calculated in step 1.

This can be repeated for all hidden layers. In order to backpropagate from layer l to layer l − 1 , we substitute l by l − 1 in step 1 and observe that we can determine the change of the error with the changes in the output of unit i in layer l − 1 by summing over all outgoing connections of unit i :
∂J/∂a_i^(l−1) = Σ_j (∂z_j^(l)/∂a_i^(l−1)) (∂J/∂z_j^(l)) = Σ_j w_ij^(l) ∂J/∂z_j^(l)
Here, ∂z_j^(l)/∂a_i^(l−1) characterizes the change of the total input to unit j with the changes in the output of unit i , which is simply the weight connecting units i and j . The second term, ∂J/∂z_j^(l) , is known from step 1. The weight update for GD is then given by:
w_ij^(l) ← w_ij^(l) − η ∂J/∂w_ij^(l)
where η is the learning rate, a user-defined coefficient, as already noted.
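The forward-backward steps above can be sketched for a tiny two-layer MLP with sigmoid units. For brevity this sketch assumes a squared-error loss and omits bias terms, so it illustrates the mechanics of the chain rule rather than the exact cross-entropy cost used in the chapter:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W1, W2, eta):
    # forward pass
    a1 = sigmoid(W1 @ x)                 # hidden activations a^(1)
    a2 = sigmoid(W2 @ a1)                # output activations a^(2) = y_hat
    # backward pass, assuming J = 0.5 * ||a2 - y||^2
    d2 = (a2 - y) * a2 * (1.0 - a2)      # dJ/dz at the output layer
    d1 = (W2.T @ d2) * a1 * (1.0 - a1)   # dJ/dz at the hidden layer (chain rule)
    # gradient descent updates: dJ/dW is delta times the previous activations
    W2 -= eta * np.outer(d2, a1)
    W1 -= eta * np.outer(d1, x)
    return W1, W2
```

The line computing d1 is exactly the summation over outgoing connections described in the text, written as a matrix-vector product.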

Vanishing or exploding gradients pose a problem when training a DNN with gradient-based learning methods and backpropagation [ 8 , 9 ]. Since weights are updated proportionally to the partial derivative of the error function with respect to the current weight, the learning process can come to a halt if the gradient becomes very small (vanishing gradient) or become unstable for very large gradients (exploding gradient). In each layer, the weight updates are obtained from the gradients calculated from all later layers using the chain rule, effectively involving the multiplication of the gradients of those later layers. Thus, if gradients are smaller than 1, the resulting weight update gets smaller with each layer, leading to an exponential decay of the weight updates and preventing any learning progress in the earlier layers. The opposite effect of an exponential increase of the weight update can occur for gradients greater than 1.
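A back-of-the-envelope illustration of the vanishing-gradient effect for sigmoid activations, whose derivative a(1 − a) never exceeds 0.25:

```python
# Backpropagating through many sigmoid layers multiplies one derivative
# factor per layer; even in the best case each factor is at most 0.25,
# so the gradient shrinks exponentially with depth.
g = 1.0
for layer in range(20):
    g *= 0.25            # maximum possible sigmoid derivative per layer
# after 20 layers the remaining gradient factor is below 1e-12
```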

Various solutions such as gradient clipping (i.e., restricting maximum gradient values to a fixed threshold), weight regularization, and alternative or modified activation functions have been proposed and are used to prevent vanishing and exploding gradients [ 9 ]. In residual networks, skip connections pass gradient information from a deeper layer directly to a nonadjacent previous layer, thus helping maintain signal propagation even in deeper networks [ 10 ]. For RNNs, which are especially affected by the problem as we will see later in the chapter, special architectures like the long short-term memory (LSTM) have been developed to prevent vanishing gradients [ 11 ].

GD is used to train neural networks in an iterative manner, with weight updates performed through repeated forward-backpropagation passes of the available training samples. Depending on the specific application, computational capacity, and data availability, different training strategies are typically applied which differ in the number of samples in each update step.

In batch learning or batch GD , the weight update is only executed after all the input-target data have been presented to the network. One complete presentation of the training dataset is typically called an epoch, so a single weight update is performed after each epoch. In batch learning, the system is not capable of continuous learning while operating, and the training dataset consists of all the available data. Given typical dataset sizes, this generally takes a lot of time and computational effort. For this reason, batch learning is generally performed on-ground. A system trained with batch learning first learns offline and is then deployed and runs without updating itself: it just applies what it has learnt.

Mini-batch learning is a variation of full batch learning, where the update is performed on subsets of the complete dataset, and thus each epoch consists of a number of forward-backpropagation passes. Mini-batch GD is widely used when the number of samples in the training dataset is very large and it becomes infeasible to perform the computations needed for the parameter update, which involve the calculation and storage of the cost function and its gradients, on the whole set at once. Stochastic gradient descent (SGD) is a special case of mini-batch learning with a batch size of one sample.
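One common way to generate the shuffled mini-batches for an epoch can be sketched as follows (the helper name is illustrative):

```python
import numpy as np

def minibatches(m, batch_size, rng):
    """Yield index arrays covering a shuffled dataset of m samples;
    one full pass over all yielded batches corresponds to one epoch."""
    idx = rng.permutation(m)             # shuffle sample indices
    for start in range(0, m, batch_size):
        yield idx[start:start + batch_size]
```

With batch_size = 1 this reduces to SGD; with batch_size = m it reduces to full batch GD.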

In incremental learning , often referred to as online learning , the system is trained continuously as new data instances become available. These could be clustered in mini-batches or come as standalone samples using SGD. Online learning systems are tuned to set how fast they should adapt to incoming data: typically, this parameter is again referred to as the learning rate. A high learning rate means that the system reacts immediately to new data, adapting itself quickly; however, it will also tend to forget and replace the old data. On the other hand, a low learning rate makes the system stiffer, meaning that it will learn more slowly. The system will also be less sensitive to noise present in the new data or to mini-batches containing nonrepresentative data points, such as outliers. In addition, Lyapunov-based methods are very suitable for incremental learning due to their inherent stepwise trajectory evaluation of the stability of the learning rule.

The role of sensitivity analysis in the building performance analysis: A critical review

Zhihong Pang , ... Fuxin Niu , in Energy and Buildings , 2020

5.2.3 Linear analysis methods

The common methods used for linear models include the PEAR, Standard or Standardized Regression Coefficient (SRC), and Partial Correlation Coefficient (PCC).

The PEAR index is also known as the product moment correlation coefficient (PMCC). Its definition is shown in Eq. (1) , where X, Y, and N denote the input, output, and sample size, respectively. The value of the PEAR index varies between −1 and 1 and represents the linear correlation between the input and the output [9] . Such a method is only suitable for linear models.
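The PEAR index of Eq. (1) is the ordinary Pearson correlation coefficient, which can be computed as in this sketch:

```python
import numpy as np

def pearson(x, y):
    """Pearson (PEAR) correlation coefficient between input x and output y:
    the covariance of x and y divided by the product of their standard deviations."""
    x = np.asarray(x, float) - np.mean(x)
    y = np.asarray(y, float) - np.mean(y)
    return (x @ y) / np.sqrt((x @ x) * (y @ y))
```

A perfectly linear relation gives +1 or −1, which is why the index is only meaningful for (near-)linear models.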

The SRC uses a linear regression model as presented in Eq. (2) to measure the linear relationship between the input and the output. The terms a, b, x, and ɛ represent the intercept, slope, input, and residual due to the linear approximation respectively. The reliability of such a method is strongly dependent on the result of the linear regression fitting, i.e., the R squared value [123] .

When the input parameters demonstrate non-negligible interactions, the PCC is recommended to remove the effect of input coupling. The sensitivity indices obtained by the PCC eliminate the variations caused by the other inputs and only represent the correlation between the single input and the output. Menberg et al. [86] found that when the sample size is limited, technically non-influential parameters may also be identified as important by this method.

A linear hypothesis test is usually recommended to examine the feasibility of PEAR, SRC, and PCC [159] . For instance, an R squared value of 0.7 is often used as the threshold to judge the reliability of the SRC method [16] . If the hypothesis is contradicted, the rank transformations of the previous methods (i.e., PEAR, SRC, and PCC) are recommended as replacements [187] .

The rank transformations of the PEAR, SRC, and PCC are the SPEA, Standardized Rank Regression Coefficient (SRRC), and Partial Rank Correlation Coefficient (PRCC), respectively [ 43 , 159 , 188 ]. The rank transformation methods work by replacing the raw values of the inputs with their ranks [71] . Rank transformations were widely used in the past decades and are believed to reduce the effects of extreme values. However, the capabilities of the rank methods have been questioned by some researchers because they can show poor convergence and unstable outcomes [ 43 , 123 ]. Also, the rank transformation modifies the original model, so the resulting sensitivities describe the transformed model rather than the original one [ 16 , 165 ].

It should be noted that there is a dispute over the classification of the PEAR and SPEA methods. Mokhtari and Frey [148] and Saltelli and Bolado [189] classified the PEAR and SPEA as correlation methods, while Nguyen and Reiter [43] argued that the PEAR and SPEA should be classified as regression-based methods as well. As a whole, these six regression-based SA methods are not suitable for general nonlinear and nonmonotonic cases [182] .

Statistical Learning Theory as a Framework for the Philosophy of Induction

Gilbert Harman , Sanjeev Kulkarni , in Philosophy of Statistics , 2011

We now want to say something more about Popper's [1972; 2002] discussion of scientific method. Popper argues that there is no justification for any sort of inductive reasoning, but he does think there are justified scientific methods.

In particular, he argues that a version of structural risk minimization best captures actual scientific method (although of course he does not use the term “structural risk minimization”). In his view, scientists accept a certain ordering of classes of hypotheses, an ordering based on the number of parameters needing to be specified to be able to pick out a particular member of the class. So, for example, for real value estimation on the basis of one feature, linear hypotheses of the form y = ax + b have two parameters, a and b , quadratic hypotheses of the form y = ax 2 + bx + c have three parameters, a, b, and c , and so forth. So, linear hypotheses are ordered before quadratic hypotheses, and so forth.

Popper takes this ordering to be based on “falsifiability” in the sense at least three data points are needed to “falsify” a claim that the relevant function is linear, at least four are needed to “falsify” the claim that the relevant function is quadratic, and so forth.

In Popper's somewhat misleading terminology, data “falsify” a hypothesis by being inconsistent with it, so that the hypothesis has positive empirical error on the data. He recognizes, however, that actual data do not show that a hypothesis is false, because the data themselves might be noisy and so not strictly speaking correct.

Popper takes the ordering of classes of hypotheses in terms of parameters to be an ordering in terms of “simplicity” in one important sense of that term. So, he takes it that scientists balance data-coverage against simplicity, where simplicity is measured by “falsifiability” [ Popper, 2002 , section 43].

We can distinguish several claims here.

Hypothesis choice requires an ordering of nested classes of hypotheses.

This ordering represents the degree of “falsifiability” of a given class of hypotheses.

Classes are ordered in accordance with the number of parameters whose values need to be specified in order to pick out specific hypotheses.

The ordering ranks simpler hypotheses before more complex hypotheses.

Claim (1) is also part of structural risk minimization. Claim (2) is similar to the appeal to VC dimension in structural risk minimization, except that Popper's degree of falsifiability does not coincide with VC dimension, as noted above. As we will see in a moment, claim (3) is inadequate and, interpreted as Popper interprets it, is incompatible with (2) and with structural risk minimization. Claim (4) is at best terminological and may just be wrong.

Claim (3) is inadequate because there can be many ways to specify the same class of hypotheses, using different numbers of parameters. For example, linear hypotheses in the plane might be represented as instances of abx + cd , with four parameters instead of two. Alternatively, notice that it is possible to code a pair of real numbers a, b as a single real number c , so that a and b can be recovered from c . That is, there are functions f , f 1 , f 2 such that f ( a, b ) = c , where f 1 ( c ) = a and f 2 ( c ) = b . Given such a coding, we can represent linear hypotheses as f 1 ( c ) x + f 2 ( c ) using only the one parameter c . In fact, for any class of hypotheses that can be represented using P parameters, there is another way to represent the same class of hypotheses using only one parameter.

Perhaps Popper means claim (3) to apply to some ordinary or preferred way of representing classes in terms of parameters, so that the representations using the above coding functions do not count. But even if we use ordinary representations, claim (3) conflicts with claim (2) and with structural risk minimization.

To see this, consider the class of sine curves y = a + sin( bx ) that might be used to separate points in a one-dimensional feature space, represented by the points on a line between 0 and 1. Almost any n distinct points in this line segment are shattered by curves from that class. So this class of sine curves has infinite “falsifiability” in Popper's sense (and infinite VC dimension) even though only two parameters have to be specified to determine a particular member of the set, using the sort of representation Popper envisioned. Popper himself did not realize this and explicitly treated the class of sine curves as relatively simple in the relevant respect [1934, Section 44].

The fact that this class of sine curves has infinite VC dimension (as well as infinite falsifiability in Popper's sense) is some evidence that the relevant ordering of hypotheses for scientific hypothesis acceptance is not a simplicity ordering, at least if sine curves count as “simple”.

Earth climate identification vs. anthropic global warming attribution

Philippe de Larminat , in Annual Reviews in Control , 2016

5.2 D&A and fingerprinting

One of the main tools used in D&A is called ‘optimal fingerprinting’ , a concept introduced in the 1990s ( Hasselman, 1993 ). The principle is as follows (see, for example, Hegerl and Zwiers 2011 ).

Regarding global temperature, fingerprints (or patterns) are defined as the changes in the simulated temperature in response to observed variations of each external forcing or driver, considered independently. The simulation models used are either large digital general circulation models (GCMs) or simple energy balance models. Unlike the identification approach, these models are fixed a priori, and D&A is not intended to revisit them. Making the linear hypothesis explicit:
y = Σ_i X_i + v (12)
where y is the observed global temperature, X i the fingerprints associated with each indicator of forcing (e.g., human, solar and volcanic activities), and v results from internal variability or from any unlisted causes. Introducing possible model errors, Eq. (12) becomes:
y = X a + v
where X   =   [ X 1  X 2   X 3 ], and where a is a vector of scaling factors, each nominally equal to 1. An estimate of a may be obtained by linear regression (e.g., BLUE: Best Linear Unbiased Estimate):
â = (X^T C^(−1) X)^(−1) X^T C^(−1) y
where the matrix C is the covariance of the internal variability signal, to whose determination we will return. The variance of the estimate a ^ is given by the expression ( X T C   −   1 X )   −   1 , from which one can deduce the confidence intervals associated with the estimates a ^ . Depending on whether or not the estimated intervals include the values 1 or 0, changes in y will be detected (or not), and will be attributed (or not) to the corresponding forcing factor.
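The BLUE estimate can be sketched as follows; solving the normal equations with `solve` instead of explicitly inverting X^T C^(−1) X is a standard numerical choice, and the function name is illustrative:

```python
import numpy as np

def blue_estimate(X, y, C):
    """Best Linear Unbiased Estimate of the scaling factors a in y = X a + v,
    where C is the covariance matrix of the internal variability v:
    a_hat = (X^T C^-1 X)^-1 X^T C^-1 y."""
    Ci = np.linalg.inv(C)
    A = X.T @ Ci @ X                     # weighted normal-equation matrix
    return np.linalg.solve(A, X.T @ Ci @ y)
```

The inverse of A is also the covariance of the estimate, from which the confidence intervals discussed in the text are derived.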

DIRECT TEST OF BOUSSINESQ'S HYPOTHESIS AND THE K-TRANSPORT EQUATION USING EXPERIMENTAL, DNS AND LES DATA

F.G. Schmitt , Ch. Hirsch , in Engineering Turbulence Modelling and Experiments 5 , 2002

Results for Boussinesq's hypothesis

We present here the main results of the analysis of Boussinesq's hypothesis for the different test cases. Consequences and discussions are provided in the next section. Figure 3 shows a map of ρ RS for the experimental double annular jet data. Validity of Boussinesq's hypothesis corresponds to ρ RS = 1, represented by white regions. Regions of validity are of relatively limited extension. It is well-known (see e.g. Nisizima and Yoshizawa, 1986 ; Speziale, 1987 ) that since the normal stresses for simple shear flows are not equal (corresponding to an anisotropy in normal stresses), Boussinesq's hypothesis does not hold (see Schmitt and Hirsch, 2001 for more details). The plot of ρ RS (y′) is then a quantification of this non-validity, which is very useful to go beyond qualitative analysis and to indicate which regions are further from the linear constitutive equation. This is shown in Figure 4 . Since close to the wall the viscous stress becomes dominant, the total stress is represented here, together with the Reynolds stress. It is seen that the viscous stress influences the total stress up to y τ = 20, but the alignment is poor for y τ > 2. Close to the wall, the turbulent stress is not aligned with the strain, and Boussinesq's hypothesis fails. For the total stress, Boussinesq's hypothesis is badly followed for 3 < y τ < 70, showing that for these distances to the wall (corresponding roughly to the buffer layer) the linear term is of little relative weight. There is a minimum at around y τ ≈ 10.


Figure 3. A map of ρ_RS for the experimental double annular jet data.


Figure 4. Plot of ρ_RS(y) for DNS databases: (1) total stress (turbulent + viscous stress) and (2) turbulent stress only.

Since the annular pipe flow presents some interesting non-symmetry effects (between the inner and outer cylinders), for this database ρ_RS is shown in physical instead of wall units in Figure 5. For this flow the ratio has an interesting shape. Boussinesq's hypothesis is never valid (except very close to the walls for the total stress), and it is worst at the position r_0 = 0.87, an asymmetric centre position that is closer to the inner cylinder than to the outer one. As shown in Figure 6, this position corresponds to the vanishing of the mean velocity gradient tensor. The shear stress also vanishes at this position, whereas the normal stresses are not isotropic, so that R does not vanish. The non-zero turbulent values at this position thus correspond to the non-validity of Boussinesq's hypothesis, and also of any polynomial non-linear constitutive equation such as those used in non-linear K-ε models. This example shows that such non-linear constitutive equations cannot represent a full and coherent answer to the closure problem.
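The point about polynomial models can be illustrated directly: any constitutive equation built as a polynomial in the mean strain and rotation tensors predicts a vanishing deviatoric stress wherever the mean velocity gradient vanishes. The quadratic form and coefficients below are a generic hypothetical sketch, not a specific published model.

```python
import numpy as np

# Sketch: a generic quadratic (non-linear) eddy-viscosity model for the
# deviatoric Reynolds stress, built from strain S and rotation W.
# Coefficients nu_t, c1..c3 are illustrative assumptions.

def deviatoric(t):
    return t - np.trace(t) / 3.0 * np.eye(3)

def quadratic_model(S, W, nu_t=0.1, c1=0.1, c2=0.1, c3=0.1):
    return (-2.0 * nu_t * S
            + c1 * deviatoric(S @ S)
            + c2 * deviatoric(S @ W - W @ S)
            + c3 * deviatoric(W @ W))

# At the asymmetric centre position the mean velocity gradient vanishes:
S = np.zeros((3, 3))
W = np.zeros((3, 3))
predicted = quadratic_model(S, W)
print(np.allclose(predicted, 0.0))   # True: the model predicts isotropy
```

Since the DNS shows anisotropic normal stresses (a non-zero deviatoric R) precisely where S and W vanish, no polynomial model in S and W, whatever its coefficients, can reproduce the observed stress there.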

Figure 5. Plot of ρ_RS(r) for the AP DNS database. The continuous line is the total stress and the dotted line the turbulent stress alone.

Figure 6. Plot of the shear stress and mean velocity for the AP DNS database.

Figure 7 represents the same ratio ρ_RS for the LES database. Light regions correspond to the validity of Boussinesq's hypothesis (ratio = 1). An isoline marks the value ρ_RS = cos(π/4) ≈ 0.71, corresponding to a very roughly valid linear hypothesis. The region inside this isoline is not of wide extent. Since the tensors are very small in the inflow region, the alignment ratio is not plotted there, to avoid unnecessary scatter. For this database the results are to be interpreted with caution, since there is a relatively high numerical uncertainty, visible in the small-scale fluctuations in some regions of Figures 7 and 10.


Figure 7. A map of ρ_RS(r) for the LES database. The inflow region is not plotted. The isoline corresponds to an “angle” of π/4.


Figure 10. A map of the alignment ratio of the total flux and the gradient of the kinetic energy. The inflow is not represented. In the wake flow the white region surrounded by the isoline corresponds to a ratio of cos(π/6) ≈ 0.87.

We saw that Boussinesq's hypothesis is not valid for simple shear flows and complex turbulent flows. An interesting question is whether it is recovered as a limit for nearly-homogeneous flows. The far wake of the experimental wake database has been tested for this purpose. This is shown in Fig. 8: the mean velocity field is almost homogeneous, but the ratio ρ_RS is very far from 1, and decreases for sections further downstream. This shows that Boussinesq's hypothesis is not a “first-order approximation”, and thus that polynomial expansions (corresponding to the non-linear constitutive equations of non-linear K-ε models), sometimes justified through Taylor expansions around a “first-order” linear relation, are not justified.


Figure 8. A map of mean velocities and ρ_RS(y) for the experimental wake flow database. The mean velocity is almost uniform (in the centre its value goes from 0.90 to 0.95), whereas ρ_RS is far from 1 and decreases with increasing distance downstream.

