Glossary of Major Statistical Terms and Concepts

 

Accuracy - The extent to which results of a calculation or the readings of an instrument approach the true values of the calculated or measured quantities. 

Alternative Hypothesis - A statement of what a statistical hypothesis test is set up to establish.  It is the hypothesis that is accepted if the null hypothesis is rejected. 

Analysis of Variance (ANOVA) - Analysis of variance (ANOVA) is a statistical test which compares the means of two or more sample groups to determine if one or more of the group means are significantly different from the others.
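
As a rough illustration of the comparison ANOVA makes, the following Python sketch computes the one-way F statistic, the ratio of between-group to within-group variability. The function name and the three small groups are hypothetical, chosen only for illustration. Turning the statistic into a significance decision would also require the F distribution, which is not shown here.

```python
def one_way_f_statistic(groups):
    """Return the one-way ANOVA F statistic for a list of groups of numbers."""
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)

    # Between-group sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2 for group in groups
    )
    # Within-group sum of squares: spread of values around their own group mean.
    ss_within = sum(
        (x - sum(group) / len(group)) ** 2 for group in groups for x in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical data: the third group has a visibly larger mean.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_f_statistic(groups)
```

A large F value suggests that at least one group mean differs from the others; an F value near zero suggests the group means are about the same.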

                                                

Bias - is a term which refers to how far the average statistic lies from the parameter it is estimating, that is, the systematic error which arises when estimating a quantity. Errors from chance will cancel each other out in the long run; those from bias will not.

Binomial Distribution - Binomial distributions model (some) discrete random variables.  Typically, a binomial random variable is the number of successes in a series of trials, for example, the number of 'heads' occurring when a coin is tossed 50 times.
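
The coin-tossing example can be sketched in Python with the standard binomial probability formula. The function name binomial_pmf is a hypothetical helper, and the coin is assumed to be fair (success probability 0.5).

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials,
    each with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability of exactly 25 heads in 50 tosses of a fair coin.
p_25_heads = binomial_pmf(25, 50, 0.5)

# The probabilities of all possible head counts (0 to 50) add up to 1.
total = sum(binomial_pmf(k, 50, 0.5) for k in range(51))
```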

Categorical Data - A set of data is said to be categorical if the values or observations belonging to it can be sorted according to category. Each value is chosen from a set of non-overlapping categories. For example, shoes in a cupboard can be sorted according to color; the characteristic 'color' can have non-overlapping categories 'black', 'brown', 'red' and 'other'. People have the characteristic of 'gender' with categories 'male' and 'female'.

Census - An enumeration of the total population of interest. Since no sample is selected from the population, there is no sampling error. However, nonsampling errors are still possible in a census.

Central Limit Theorem – Refers to the observation that as sample size increases, the distribution of the means of samples drawn from a population of any distribution (with finite variance) will approach a normal distribution.
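
A small simulation can make this concrete. The sketch below assumes an exponential population (strongly skewed, with mean 1 and standard deviation 1) and a sample size of 50; these choices, the seed, and the variable names are illustrative assumptions only.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable


def sample_mean(n: int) -> float:
    """Mean of n draws from an exponential population (mean 1, sd 1)."""
    return statistics.fmean([random.expovariate(1.0) for _ in range(n)])


sample_size = 50
means = [sample_mean(sample_size) for _ in range(2000)]

# The sample means cluster around the population mean (1), and their
# spread is close to the population sd divided by sqrt(sample_size).
center = statistics.fmean(means)
spread = statistics.stdev(means)
```

Plotting a histogram of means would show a roughly bell-shaped curve even though the individual draws are far from normal.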

 

Chi Square Distribution - A skewed distribution whose shape depends on the number of degrees of freedom. As the number of degrees of freedom increases, the distribution becomes more symmetrical.

 

Chi-Square Test - A non-parametric statistical test that compares research data with the expected results from a hypothesis.
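
As a minimal sketch of the comparison involved, the chi-square statistic sums the squared differences between observed and expected counts, each divided by the expected count. The die-rolling counts below are hypothetical, and the function name is an illustrative helper.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts from 120 rolls of a die; a fair die would be
# expected to show each face 20 times.
observed = [18, 22, 20, 25, 15, 20]
expected = [20] * 6

stat = chi_square_statistic(observed, expected)
```

The statistic is then compared with a chi-square distribution having the appropriate degrees of freedom (here, 6 categories minus 1) to judge whether the discrepancy is larger than chance would explain.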

Closed-Ended Questions - Closed-ended questions provide respondents with a pre-determined list of possible answers.

Codebook - Generically, any information on the structure, contents, and layout of a data file. Typically, a codebook includes: column locations and widths for each variable; definitions of different record types; response codes for each variable; codes used to indicate non-response and missing data; exact questions and skip patterns used in a survey; and other indications of the content of each variable. Many codebooks also include frequencies of response. Codebooks vary widely in quality and amount of information included. They may be machine-readable or paper copy or microfiche.

 

Conditional probability - The probability of a particular event occurring, given that another event occurred.
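
For equally likely outcomes, a conditional probability can be computed as P(A and B) divided by P(B). The sketch below assumes two fair six-sided dice; the events and helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event, given as a true/false function of an outcome."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def conditional(event_a, event_b):
    """P(A given B) = P(A and B) / P(B)."""
    return prob(lambda o: event_a(o) and event_b(o)) / prob(event_b)


def sum_is_8(o):
    return o[0] + o[1] == 8


def first_is_even(o):
    return o[0] % 2 == 0


p_unconditional = prob(sum_is_8)
p_given_even = conditional(sum_is_8, first_is_even)
```

Here the probability of a total of 8 changes once it is known that the first die is even, which is exactly what a conditional probability expresses.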

 

Confidence Intervals - An interval within which, with a specified level of confidence (typically 90%, 95%, or 99%), the true value of the quantity being estimated is expected to lie.
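
One common way to build such an interval for a mean is the sample mean plus or minus a multiple of the standard error. The sketch below assumes a normal approximation (for small samples a t-based multiplier is usually preferred); the data and function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(data, confidence=0.95):
    """Approximate confidence interval for the population mean,
    using the normal approximation."""
    n = len(data)
    # Multiplier from the standard normal curve (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * stdev(data) / sqrt(n)
    centre = mean(data)
    return centre - margin, centre + margin


# Hypothetical measurements.
data = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]
low_95, high_95 = mean_confidence_interval(data, 0.95)
low_99, high_99 = mean_confidence_interval(data, 0.99)
```

A higher confidence level produces a wider interval from the same data.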

Confounded – Two variables are confounded when their effects on a response variable cannot be distinguished from each other.  The confounded variables can either be explanatory variables or lurking variables.

 

Contingency table - A table used to classify sample observations according to two or more identifiable characteristics.

Continuous Data - A set of data is said to be continuous if the values / observations belonging to it may take on any value within a finite or infinite interval. You can count, order and measure continuous data. For example, height; weight; temperature; the amount of sugar in an orange; the time required to run a mile.

Continuous Random Variable - A continuous random variable is one which can take any value within an interval, and so has an uncountably infinite number of possible values. Continuous random variables are usually measurements. Examples include height, weight, the amount of sugar in an orange, the time required to run a mile.

Correlation - A synonym for association or the relationship between variables. 

 

Correlation Coefficient – is a numerical expression of both the strength and direction of a straight-line (linear) relationship between two variables.  Such correlation coefficients range between -1.00 (perfect negative correlation) and +1.00 (perfect positive correlation).
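
The most widely used coefficient of this kind is Pearson's r. A minimal sketch of its calculation follows; the function name and the small data sets are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])  # points on a rising line
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # points on a falling line
```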

 

Data - a collection of facts from which conclusions may be drawn

 

Degrees of Freedom – 1). The degrees of freedom of a statistical group are the number of values in the group which are free to vary. This number is usually one less than the sample size, the number of items in the group. It can also be described as the number of independent comparisons that can be made in a set of data. 2). The maximum number of quantities whose values are free to vary before the remainder of the quantities are determined.

Discrete Random Variable - A discrete random variable is one which may take on only a countable number of distinct values such as 0, 1, 2, 3, 4, ... Discrete random variables are usually (but not necessarily) counts. If a random variable can take only a finite number of distinct values, then it must be discrete. Examples of discrete random variables include the number of children in a family, the Friday night attendance at a cinema, the number of patients in a doctor's surgery, the number of defective light bulbs in a box of ten.

Errors and residuals in statistics - A statistical error is the amount by which an observation differs from its expected value; the latter being based on the whole population from which the statistical unit was chosen randomly. The expected value, being for instance the mean of the entire population, is typically unobservable. If the mean height in a population of 21-year-old men is 1.75 meters, and one randomly chosen man is 1.80 meters tall, then the "error" is 0.05 meters; if the randomly chosen man is 1.70 meters tall, then the "error" is −0.05 meters. The nomenclature arose from random measurement errors in astronomy. It is as if the measurement of the man's height were an attempt to measure the population mean, so that any difference between the man's height and the mean would be a measurement error.

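A residual (or fitting error), on the other hand, is an observable estimate of the unobservable statistical error. The simplest case involves a random sample of n men whose heights are measured. The sample mean is used as an estimate of the population mean. Then we have:

·         The difference between the height of each man in the sample and the unobservable population mean is a statistical error.

·         The difference between the height of each man in the sample and the observable sample mean is a residual.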

Note that the sum of the residuals within a random sample is necessarily zero, and thus the residuals are necessarily not independent. The sum of the statistical errors within a random sample need not be zero; the statistical errors are independent random variables if the individuals are chosen from the population independently.

In sum:

·         Residuals are observable; statistical errors are not.

·         Statistical errors are often independent of each other; residuals are not (at least in the simple situation described above, and in most others).
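
The distinction can be sketched numerically. The heights below are hypothetical, and the population mean of 1.75 meters is treated as known purely for illustration; in practice it is unobservable.

```python
population_mean = 1.75  # assumed known here only for illustration

# Hypothetical heights (in meters) of a random sample of five men.
heights = [1.80, 1.70, 1.78, 1.69, 1.83]
sample_mean = sum(heights) / len(heights)

# Statistical errors: deviations from the (normally unobservable) population mean.
errors = [h - population_mean for h in heights]

# Residuals: deviations from the observable sample mean.
residuals = [h - sample_mean for h in heights]

sum_of_errors = sum(errors)        # need not be zero
sum_of_residuals = sum(residuals)  # zero, up to floating-point rounding
```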

Estimate - is an indication of the value of an unknown quantity based on observed data.

Experiment - is any process or study which results in the collection of data, the outcome of which is unknown. In statistics, the term is usually restricted to situations in which the researcher has control over some of the conditions under which the experiment takes place.

Explanatory variable – is a variable that we think explains or causes changes in the response variable.

Factor - a factor is a major independent variable.

Frequency Table - A frequency table is a way of summarizing a set of data. It is a record of how often each value (or set of values) of the variable in question occurs. It may be enhanced by the addition of percentages that fall into each category.
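
A frequency table with percentages can be sketched in a few lines of Python. The shoe colors below are hypothetical data, and frequency_table is an illustrative helper name.

```python
from collections import Counter


def frequency_table(values):
    """Map each distinct value to a (count, percentage) pair."""
    counts = Counter(values)
    total = len(values)
    return {value: (count, 100 * count / total) for value, count in counts.items()}


# Hypothetical observations of shoe color.
colors = ["black", "brown", "black", "red", "black", "other", "brown", "black"]
table = frequency_table(colors)

for value, (count, percent) in sorted(table.items()):
    print(f"{value:<6} {count:>3} {percent:>6.1f}%")
```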

Goodness of fit - The goodness of fit of a statistical model describes how well it fits a set of observations. Measures of goodness of fit typically summarize the discrepancy between observed values and the values expected under the model in question.

Hypothesis Testing - The formal process by which a decision is made to reject or not reject the null hypothesis.

Interval Scale - An interval scale is a scale of measurement where the distance between any two adjacent units of measurement (or 'intervals') is the same but the zero point is arbitrary. Scores on an interval scale can be added and subtracted but can not be meaningfully multiplied or divided. For example, the time interval between the starts of years 1981 and 1982 is the same as that between 1983 and 1984, namely 365 days. The zero point, year 1 AD, is arbitrary; time did not begin then. Other examples of interval scales include the heights of tides, and the measurement of longitude.

Kurtosis - characterizes the relative peakedness or flatness of a distribution compared with the normal distribution. Positive kurtosis indicates a relatively peaked distribution. Negative kurtosis indicates a relatively flat distribution.

 

Lurking Variable – is a variable that has an important effect on the relationship among the variables in a study but is not one of the explanatory variables studied.

 

Margin of Error - The uncertainty of a measured or estimated quantity. It is a statistic that expresses, usually as a plus-or-minus amount or percentage, how far a reported value may lie from the true value.  The more precise the instrument or technique used for the measurement, the smaller the margin of error.

 

Mean - The arithmetic average of a set of data in which the values of all observations are added together and divided by the number of observations

 

Measurement – is the process by which numbers are assigned to variables of interest. 

 

Measurement error - The extent to which there are discrepancies between survey results and the true value of what the survey researcher is attempting to measure. There are several possible sources of error here. Respondents may report inaccurate information because they do not have the required information, due to carelessness, or because they do not understand the question asked. Alternately, respondents may provide accurate information, but errors are introduced in the data processing stage due to keypunching, coding, or programming errors. Since it is often not possible to determine the "true value" of what one is trying to measure, precise estimates of measurement error are usually not possible. However, techniques exist for obtaining some information about the likely extent of measurement error. For example, information reported by individuals may be compared with appropriate institutional records on the individual.

 

Measure of reliability – Statistical inference is usually accompanied by a measure of reliability, that is, a statement of how good the inference is.  Because inferences are based on only a portion of the population, there is always a level of uncertainty in our inferences. Generally, the smaller the sample size the less certain we are about the inference.  An inference is incomplete without a measure of its reliability.

 

Median - the midpoint value obtained by ranking all values from highest to lowest and choosing the value in the middle. The median divides a population into two equal halves.
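
A short sketch using Python's standard statistics module shows the median for odd and even numbers of values, alongside the mean for comparison. The scores are hypothetical.

```python
from statistics import mean, median

# Hypothetical scores; the median function sorts the values internally.
scores = [35, 20, 31, 22, 25]

middle_odd = median(scores)             # middle value of 20, 22, 25, 31, 35
middle_even = median([20, 22, 25, 31])  # average of the two middle values
average = mean(scores)

# One extreme value moves the mean a great deal but the median only slightly.
with_extreme = scores + [200]
median_with_extreme = median(with_extreme)
mean_with_extreme = mean(with_extreme)
```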

 

Mutually exclusive - The occurrence of one event means that none of the other events can occur at the same time.

Nominal Data - A set of data is said to be nominal if the values / observations belonging to it can be assigned a code in the form of a number where the numbers are simply labels. You can count but not order or measure nominal data. For example, in a data set males could be coded as 0, females as 1; marital status of an individual could be coded as Y if married, N if single.

Nonresponse – is the failure to obtain data from an individual selected for a sample. 

Nonsampling Error - A general term applying to all sources of error, with the exception of sampling error.  Nonsampling errors occur from nonresponse, coding errors, computer processing errors, errors in the sampling frame, reporting errors, and other errors.

 

Normality - Many kinds of data, variables, etc. share a common characteristic: when their distribution is plotted, they seem to form a bell-shaped curve. This bell-shaped curve is called the normal curve.  The normal distribution arises repeatedly in biology.  The heights and weights of human and animal populations seem to follow this normal distribution. We observe objects in nature and find that certain qualities of natural objects seem to fit the normal curve. Within a species of plant, some will grow tall while others grow short and some will grow to heights in between the tall and short ones. The same is true of animals. Many of the measurable traits of living objects seem to follow this normal curve.

Normal Distribution - Normal distributions model (some) continuous random variables. Strictly, a Normal random variable should be capable of assuming any value on the real line, though this requirement is often waived in practice. For example, height at a given age for a given gender in a given racial group is adequately described by a Normal random variable even though heights must be positive.

Null Hypothesis - The prediction that an observed difference is due to chance alone and not due to a systematic cause; this hypothesis is tested by statistical analysis, and is either rejected or not rejected.  For a T-test, the null hypothesis is that there is no difference between the two population means; for ANOVA, the null hypothesis is that there is no difference between the three or more population means; and for a correlation analysis, the null hypothesis is that there is no relationship between the variables being studied.

Operationalizing – Defining a concept so that it can be measured.

Ordinal Data - A set of data is said to be ordinal if the values / observations belonging to it can be ranked (put in order) or have a rating scale attached.  You can count and order, but not measure, ordinal data.

Outcome - A particular result of an experiment.

Out-of-scope - Sampling units that are not part of the population of interest. For example, in the National Survey of Recent College Graduates, only individuals who received a bachelor's or master's degree within a specified time frame are of interest. If an educational institution provided the name of an individual who failed to graduate, the individual would be considered out-of-scope for the survey. Information on this individual would not be included in the final estimates from the survey.

 

Outliers – Values or scores that lie far from the rest of the data and do not fit its overall pattern.

 

Oversampling - Deliberately sampling a portion of the population at a higher rate than the remainder of the population.

 

P-value - The probability, calculated on the assumption that the null hypothesis is true, of obtaining a result at least as extreme as the one actually observed. The lower the P-value, the less likely it is that the observed result arose by chance alone.

 

Parametric Statistics - A group of statistical procedures that assume the data come from a population with a particular distribution, most often the normal distribution.

                                                                                                                    

Precision - The closeness of repeated measurements to the same value.

 

Probability - A value between zero and one, inclusive, describing the relative possibility (chance or likelihood) an event will occur.

 

Parameter - a quantity (such as the mean or variance) that characterizes a statistical population and that can be estimated by calculations from sample data

 

Poll - an inquiry into public opinion conducted by interviewing a random sample of people

 

Population - The group with a particular set of characteristics to which researchers attempt to generalize their findings from a smaller sample. These are the objects of generalizations for inferential statistics.

Poisson Distribution - Poisson distributions model (some) discrete random variables. Typically, a Poisson random variable is a count of the number of events that occur in a certain time interval or spatial area. For example, the number of cars passing a fixed point in a 5 minute interval; the number of calls received by a switchboard during a given period of time.
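
The probability of observing a given count can be sketched with the standard Poisson formula. The average rate of 3 cars per 5 minute interval is a hypothetical figure, and poisson_pmf is an illustrative helper name.

```python
from math import exp, factorial


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k events when the average number per interval is rate."""
    return rate**k * exp(-rate) / factorial(k)


# Hypothetical average: 3 cars pass the fixed point per 5 minute interval.
rate = 3.0
p_no_cars = poisson_pmf(0, rate)
p_three_cars = poisson_pmf(3, rate)

# Probabilities over all counts add up to 1 (counts above 50 are negligible here).
total = sum(poisson_pmf(k, rate) for k in range(51))
```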

Power Analysis - A procedure that is used to determine the sample size needed to keep the probability of a Type II error acceptably low.

Precision - is a measure of how close an estimator is expected to be to the true value of a parameter.

Randomness - a basic statistical concept and property implying an absence of a plan, purpose or pattern, or of any tendency to favor one outcome rather than another

Random Sampling - is a sampling technique where we select a group of subjects (a sample) for study from a larger group (a population). Each individual is chosen entirely by chance and each member of the population has a known, but possibly non-equal, chance of being included in the sample.

Random Variable - The outcome of an experiment need not be a number, for example, the outcome when a coin is tossed can be 'heads' or 'tails'. However, we often want to represent outcomes as numbers. A random variable is a function that associates a unique numerical value with every outcome of an experiment. The value of the random variable will vary from trial to trial as the experiment is repeated.

Range - The difference between the maximum and the minimum in a set of data (e.g., for a set of scores measured from 20 to 35, the range is 15).

 

Regression Analysis - A method for determining the association between a response variable and one or more explanatory variables.  Regression analysis is used to estimate or predict the relative influence of more than one variable on something (e.g., the effect of age, gender, and educational level on the prevalence of a disease).

 

Reliability - The measure of consistency for an assessment instrument. The instrument should yield similar results over time with similar populations in similar circumstances.

 

Research - means investigation or experimentation aimed at the discovery of new theories or laws and the discovery and interpretation of facts or revision of accepted theories or laws in the light of new facts.

 

Research Design - A systematic plan of what data to gather, from whom, how and when to collect the data, and how to analyze the data obtained.

 

Residuals - the difference between data observed and values expected.

 

Respondent - The individual or organization providing the information requested in the survey. The type of respondent influences what type of information can be obtained, e.g., individuals completing a degree may provide different information about the degree than a representative of the academic institution granting the degree would provide.

 

Response rate - Indicates the percentage of sample members who provided information in response to being surveyed. Care in interpreting response rates is necessary, because there is not one single uniformly accepted measure of response rate. One common measure, used extensively in demographic surveys, is the percentage of in-scope sample members who responded to the survey. In surveys that focus on estimating expenditures, the response rate is often calculated as the percentage of the total expenditures represented by responding sample members. This measure is often referred to as a weighted response rate (though weighting may also be used to adjust for different probabilities of sample selection).

 

Response variable – is a variable that measures an outcome or result of a study.

 

Sample - is a group of units selected from a larger group (the population). By studying the sample it is hoped to draw valid conclusions about the larger group.

 

Sample Design - The sampling procedure used to produce any type of sample.

Sampling Distribution - describes probabilities associated with a statistic when a random sample is drawn from a population.

Sampling error - The estimated discrepancy between a statistic and a parameter.  The difference in results for different samples of the same size is called sampling error. All things being equal, by increasing the sample size, from say 25 to 125 students, the sampling error will be reduced (but not eliminated) and the study findings can be assumed to be more reliable.

 

Scope of survey - The population to which the researcher plans to generalize his or her results. The scope of the survey may be limited by both theoretical and practical considerations. For example, while it may be of theoretical interest to obtain information on the characteristics of institutionalized individuals, practical difficulties often lead researchers to declare such individuals out-of-scope for a survey. Out-of-scope cases may be eliminated at the time of sample frame construction or during data collection or data processing.

 

Skewness - describes an asymmetrical frequency distribution in which the values are concentrated on one side of the central tendency and trail out on the other side. If the trail is to the right or positive end of the scale, the distribution is said to be positively skewed; if the trail is to the left or negative end, it is said to be negatively skewed.

 

Spurious - when the covariation observed between two variables is not due to the variables influencing each other, but because both are being influenced by some third variable

 

Standard Deviation - Standard deviation is a statistical measure of spread or variability.  For a sample, the squared deviations from the mean are summed and divided by the number of scores minus one; the standard deviation is the square root of the result.

Standard Error – is an estimate of the expected error in the sample estimate of a population mean. It is the sample estimate of the population standard deviation (the sample standard deviation, s) divided by the square root of the sample size, n (assuming statistical independence of the values in the sample): SE = s / sqrt(n).
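
The standard deviation and standard error calculations can be sketched together. The scores are hypothetical, and the function names are illustrative helpers rather than part of any particular library.

```python
from math import sqrt


def sample_std_dev(data):
    """Square root of (sum of squared deviations from the mean) / (n - 1)."""
    n = len(data)
    m = sum(data) / n
    return sqrt(sum((x - m) ** 2 for x in data) / (n - 1))


def standard_error(data):
    """Sample standard deviation divided by the square root of the sample size."""
    return sample_std_dev(data) / sqrt(len(data))


# Hypothetical scores: mean 5, sum of squared deviations 32.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
sd = sample_std_dev(scores)
se = standard_error(scores)
```

The standard error is always smaller than the standard deviation for samples of more than one value, and it shrinks as the sample size grows.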

Statistic – is a quantity that is calculated from a sample of data. It is used to give information about unknown values in the corresponding population. For example, the average of the data in a sample is used to give information about the overall average in the population from which that sample was drawn.

Statistics - Statistics refers to more than “numerical descriptions.”  There are two broad areas of statistics: (1) describing and summarizing large masses of data, referred to as descriptive statistics, and (2) drawing conclusions (making estimates, decisions, predictions, etc.) about some set of data based on sampling, referred to as inferential statistics. 

Statistical Inference - is an estimate, prediction, or generalization about a population based upon information contained in a sample. 

Statistical Population – is an entire set of data about which we wish to make conclusions.  It is a set of units (usually people, objects, transactions, or events).

 

Statistical Sample – is a portion or a subset of a statistical population.

 

Statistical test - Type of statistical procedure that is applied to data to determine whether the results are statistically significant (that is, the outcome is not likely to have resulted by chance alone.)

 

Stratification - A sampling technique in which sampling is done separately for separate parts of the population. Stratification is often used to ensure that one has an adequate number of sampling units with relatively rare characteristics (e.g., stratification may be done on race/ethnic status if one wishes to make comparisons among racial/ethnic groups).

Stratified Sampling - There may often be factors which divide up the population into sub-populations (groups / strata) and we may expect the measurement of interest to vary among the different sub-populations. This has to be accounted for when we select a sample from the population in order that we obtain a sample that is representative of the population. This is achieved by stratified sampling.  A stratified sample is obtained by taking samples from each stratum or sub-group of a population.
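
A minimal sketch of stratified sampling follows, drawing the same fraction at random from each stratum. The strata, the 10% sampling fraction, and the function name are hypothetical; real designs often use different fractions per stratum.

```python
import random


def stratified_sample(strata, fraction, seed=0):
    """Draw a simple random sample of the given fraction from each stratum.

    strata maps a stratum name to the list of units in that stratum.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = {}
    for name, units in strata.items():
        size = max(1, round(len(units) * fraction))  # at least one unit per stratum
        sample[name] = rng.sample(units, size)
    return sample


# Hypothetical population of 1000 student ID numbers in two strata.
strata = {
    "undergraduate": list(range(0, 800)),
    "graduate": list(range(800, 1000)),
}
sample = stratified_sample(strata, fraction=0.1)
```

Because every stratum is sampled separately, each sub-group is guaranteed to appear in the sample in proportion to its size in the population.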

Statistically Significant – an observed effect of a size that would rarely occur by chance is called statistically significant.

Subsample - A sample selected from a sample frame that is itself a sample of a larger population. Often the original sample is used to identify individuals or organizations of interest or is used to sort units into groups to be sampled at different rates.

 

Surveys - Research using questionnaires or interviews to poll or obtain information.  Survey instruments are often administered by mail, handouts, personal and telephone interviews, the Internet, etc.

Target Population - is the entire group a researcher is interested in; the group about which the researcher wishes to draw conclusions.

Treatment – is any specific experimental condition applied to the subjects. 

 

T-test - A parametric statistical test of the difference between the means of two samples or of the same sample at two different points in time.

 

Type I Error – When one rejects the null hypothesis and it is true.

 

Type II Error – When one fails to reject the null hypothesis when it is false.

 

Undercoverage – occurs when some groups in the population are left out of the process of choosing the sample.

 

Validity - The degree to which an instrument actually measures what it is supposed to measure.

 

Variable – is a characteristic or property of an individual population unit.  The name variable is derived from the fact that any particular characteristic may “vary” among the units in the population.  In social science research, for each unit of analysis, each item of data (e.g., age of person, income of family, consumer price index) is called a variable.

 

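Variance - A measure of dispersion which is the mean of the squares of deviations of the observations from the population mean.  It is the square of the standard deviation. When the variance is estimated from a sample, the sum of the squared deviations from the sample mean is divided by the number of observations minus one.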


Sources:

 

http://www.esomar.org/index.php/glossary-c.html

 

http://www.cas.lancs.ac.uk/glossary_v1.1/Alphabet.html

 

http://www.epa.gov/evaluate/glossary/a-esd.htm