Glossary of Major Statistical Terms and Concepts
Accuracy - The extent to which results of a calculation or the
readings of an instrument approach the true values of the calculated or
measured quantities.
Alternative Hypothesis - A
statement of what a statistical hypothesis test is set up to establish. It is the hypothesis that is accepted if the
null hypothesis is rejected.
Analysis of Variance (ANOVA) - Analysis of variance (ANOVA) is a statistical test which
compares the means of two or more sample groups to determine if one or
more of the groups are significantly different from the others.
Bias - is a term which refers to how far the average
statistic lies from the parameter it is estimating, that is, the error which
arises when estimating a quantity. Errors from chance will cancel each other
out in the long run; those from bias will not.
Binomial Distribution - Binomial
distributions model (some) discrete
random variables. Typically,
a binomial random variable is the number of successes in a series of trials,
for example, the number of 'heads' occurring when a coin is tossed 50 times.
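As a rough illustration of the coin example, the probability of exactly k heads in n tosses can be computed from the binomial formula. This is only a sketch: the function name binomial_pmf and the fair-coin probability of 0.5 are assumptions made for the example.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability of exactly 25 heads in 50 tosses of a fair coin (assumed p = 0.5).
prob_25_heads = binomial_pmf(25, 50, 0.5)

# The probabilities of all possible head counts (0 to 50) add up to 1.
total = sum(binomial_pmf(k, 50, 0.5) for k in range(51))
```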
Categorical Data - A set
of data is said to be categorical if the values or observations belonging to it
can be sorted according to category. Each value is chosen from a set of
non-overlapping categories. For example, shoes in a cupboard can be sorted according
to color; the characteristic 'color' can have non-overlapping categories
'black', 'brown', 'red' and 'other'. People have the characteristic of 'gender'
with categories 'male' and 'female'.
Census - An enumeration of the total population
of interest. Since no sample is selected from the population, there is
no sampling error. However, nonsampling errors are still possible in a census.
Central Limit Theorem - Refers to the observation
that, as sample size increases, the distribution of the means of samples drawn from a population of
almost any distribution (any with finite variance) approaches a normal distribution.
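The effect can be seen in a small simulation. The sketch below assumes a strongly skewed (exponential) population with mean 1.0, a fixed random seed, and 2000 repeated samples; all of these choices are illustrative only.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable


def sample_means(sample_size, n_samples=2000):
    """Means of repeated samples drawn from a skewed (exponential) population."""
    return [
        statistics.fmean(random.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


def fraction_within_one_sd(values):
    """Share of values within one standard deviation of their mean
    (about 68% for a normal distribution)."""
    center = statistics.fmean(values)
    spread = statistics.stdev(values)
    return sum(abs(v - center) <= spread for v in values) / len(values)


means_small = sample_means(2)    # samples of size 2
means_large = sample_means(50)   # samples of size 50

# The sample means cluster around the population mean, and their spread
# shrinks roughly like 1 / sqrt(sample_size).
spread_small = statistics.stdev(means_small)
spread_large = statistics.stdev(means_large)
```

With the larger sample size, the means are more tightly grouped and fraction_within_one_sd(means_large) should come out close to the normal-distribution value.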
Chi Square Distribution - A skewed distribution whose shape depends on the
number of degrees of freedom. As the number of degrees of freedom increases,
the distribution becomes more symmetrical.
Chi-Square Test - A non-parametric statistical test that compares
research data with the expected results from a hypothesis.
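A minimal sketch of the test statistic follows. The die-roll counts are made up, the function name chi_square_statistic is an assumption, and the final step of comparing the statistic with a chi-square distribution (here with 5 degrees of freedom) is left out.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Made-up counts from 120 rolls of a die. The hypothesis is a fair die,
# so each face is expected 20 times.
observed = [18, 22, 16, 25, 19, 20]
expected = [20] * 6

statistic = chi_square_statistic(observed, expected)
degrees_of_freedom = len(observed) - 1
```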
Closed-Ended Questions - Closed-ended questions provide respondents with a
pre-determined list of possible answers.
Codebook - Generically, any
information on the structure, contents, and layout of a data file. Typically, a
codebook includes: column locations and widths for each variable;
definitions of different record types; response
codes for each variable; codes used to indicate non-response and
missing data; exact questions and skip patterns used in a survey; and other
indications of the content of each variable. Many codebooks also include frequencies
of response. Codebooks vary widely in quality and amount of information
included. They may be machine-readable or paper copy or microfiche.
Conditional probability -
The probability of a particular event occurring, given that another event
occurred.
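For example, with two fair dice (an assumed example, not taken from the sources), the probability that the sum is 8 given that the first die is even can be found by counting outcomes, using P(A given B) = P(A and B) / P(B).

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))


def probability(event):
    """Fraction of outcomes for which the event function returns True."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def sum_is_8(o):
    return o[0] + o[1] == 8


def first_is_even(o):
    return o[0] % 2 == 0


# P(A given B) = P(A and B) / P(B)
p_b = probability(first_is_even)
p_a_and_b = probability(lambda o: sum_is_8(o) and first_is_even(o))
p_a_given_b = p_a_and_b / p_b
```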
Confidence Intervals - An interval within which, with
a specified level of confidence (typically 90%, 95%, or 99%), an investigator expects the true value of
the quantity being estimated to lie.
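A rough sketch of how such an interval might be computed for a mean is given below. It assumes the normal approximation, which is reasonable for larger samples (for a sample as small as this one, a t-based interval would usually be preferred). The data and the function name mean_confidence_interval are made up for illustration.

```python
import math
import statistics


def mean_confidence_interval(data, confidence=0.95):
    """Approximate confidence interval for the population mean,
    using the normal approximation."""
    n = len(data)
    mean = statistics.fmean(data)
    standard_error = statistics.stdev(data) / math.sqrt(n)
    # z is the standard normal quantile for the chosen confidence level.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * standard_error, mean + z * standard_error


heights_cm = [172, 168, 181, 175, 169, 178, 174, 171, 180, 176]  # made-up data
low, high = mean_confidence_interval(heights_cm)
```

Raising the confidence level (say, to 0.99) widens the interval; it does not make the estimate itself any more accurate.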
Confounded - Two variables are confounded when their effects on a response variable
cannot be distinguished from each other.
The confounded variables can either be explanatory variables or lurking
variables.
Contingency table - A
table used to classify sample observations according to two or more
identifiable characteristics.
Continuous Data - A set
of data is said to be continuous if the values / observations belonging to it
may take on any value within a finite or infinite interval. You can count,
order and measure continuous data. For example, height; weight; temperature;
the amount of sugar in an orange; the time required to run a mile.
Continuous Random Variable - A
continuous random variable is one which can take any value within an interval, and so has an
uncountably infinite number of possible values. Continuous random variables are usually measurements. Examples include
height, weight, the amount of sugar in an orange, the time required to run a mile.
Correlation - A synonym for association or the
relationship between variables.
Correlation Coefficient - A
numerical expression of both the strength and direction of a straight-line
correlation. Such correlation
coefficients range between -1.00 (perfect negative correlation) and
+1.00 (perfect positive correlation).
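One common coefficient of this kind is Pearson's r. The sketch below computes it directly from its definition; the function name pearson_r and the small data sets are assumptions made for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])   # points on a rising line
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])   # points on a falling line
```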
Data - A collection of facts from which conclusions may be drawn.
Degrees of Freedom - The degrees of freedom of a statistical group are
the number of values in the group which are free to vary. This number is
usually one less than the sample size, the number of items in the group. Two equivalent descriptions are:
1). The number of independent comparisons that can be made in a set of data.
2). The maximum number of quantities whose values are free to vary before
the remainder of the quantities are determined.
Discrete Random Variable - A discrete random variable is one which may take on
only a countable number of distinct values such as 0, 1, 2, 3, 4, ... Discrete
random variables are usually (but not necessarily) counts. If a random variable
can take only a finite number of distinct values, then it must be discrete.
Examples of discrete random variables include the number of children in a
family, the Friday night attendance at a cinema, the number of patients in a
doctor's surgery, the number of defective light bulbs in a box of ten.
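Errors and Residuals - A statistical error is the amount by which an observation differs from
its unobservable expected value, such as the population mean.
A residual (or fitting error), on the other hand, is an observable
estimate of the unobservable statistical error. The simplest case
involves a random sample of n men whose heights are measured. The sample
mean is used as an estimate of the population
mean. Then we have:
statistical error = a man's height minus the (unobservable) population mean
residual = a man's height minus the (observable) sample mean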
Note that the sum of the residuals within a random sample is necessarily
zero, and thus the residuals are necessarily not independent.
The sum of the statistical errors within a random sample need not be zero; the
statistical errors are independent random variables if the individuals are chosen from the population
independently.
In sum:
· Residuals are observable; statistical errors are not.
· Statistical errors are often independent of each other; residuals are not (at least in the
simple situation described above, and in most others).
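The distinction can be checked numerically. In the sketch below the population mean is known only because the heights are simulated; the mean of 175, the spread of 7, and the seed are all assumptions made for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

population_mean = 175.0  # known here only because the data are simulated
heights = [random.gauss(population_mean, 7.0) for _ in range(10)]
sample_mean = statistics.fmean(heights)

# Statistical errors use the population mean (unobservable in practice).
errors = [h - population_mean for h in heights]
# Residuals use the sample mean (observable).
residuals = [h - sample_mean for h in heights]

sum_of_residuals = sum(residuals)  # zero, up to floating-point rounding
sum_of_errors = sum(errors)        # generally not zero
```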
Estimate - is an
indication of the value of an unknown quantity based on observed data.
Experiment - is any
process or study which results in the collection of data, the outcome of which is
unknown. In statistics, the term is usually restricted to situations in which
the researcher has control over some of the conditions under which the
experiment takes place.
Explanatory Variable - A variable that we
think explains or causes changes in the response variable.
Factor - a factor
is a major independent variable.
Frequency Table - A frequency table
is a way of summarizing a set of data. It is a record of how often each value
(or set of values) of the variable in question occurs. It may be enhanced by
the addition of percentages that fall into each category.
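A frequency table with percentages might be built as in the following sketch. The list of shoe colors is made-up data, and the layout of the printed table is just one possibility.

```python
from collections import Counter

shoe_colors = ["black", "brown", "black", "red", "black", "brown", "other"]  # made-up data

counts = Counter(shoe_colors)
total = sum(counts.values())

# Map each value to (count, percentage of all observations).
frequency_table = {
    color: (count, 100 * count / total)
    for color, count in counts.most_common()
}

for color, (count, percent) in frequency_table.items():
    print(f"{color:<6} {count:>3} {percent:6.1f}%")
```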
Hypothesis Testing - The formal
process by which a decision is made concerning the rejection or acceptance of
the null hypothesis.
Interval Scale - An
interval scale is a scale of measurement where the distance between any two
adjacent units of measurement (or 'intervals') is the same but the zero point
is arbitrary. Scores on an interval scale can be added and subtracted but can
not be meaningfully multiplied or divided. For example, the time interval
between the starts of years 1981 and 1982 is the same as that between 1983 and
1984, namely 365 days. The zero point, year 1 AD, is arbitrary; time did not begin
then. Other examples of interval scales include the heights of tides, and the
measurement of longitude.
Kurtosis - characterizes the relative
peakedness or flatness of a distribution compared with the normal distribution.
Positive kurtosis indicates a relatively peaked distribution. Negative kurtosis
indicates a relatively flat distribution.
Lurking Variable - A variable that has an
important effect on the relationship among the variables in a study but is not
one of the explanatory variables studied.
Margin of Error - The uncertainty of a measured
quantity. It is a statistic that states, usually as a percentage, how far a calculated value
may lie from the true value. The more
precise the instrument or technique used for the measurement, the smaller the
margin of error.
Mean - The arithmetic average of a set of data in which the
values of all observations are added together and divided by the number of
observations.
Measurement - The process by which numbers are assigned to variables of interest.
Measurement error - The extent to which there are
discrepancies between survey results and the true value of what the survey
researcher is attempting to measure. There are several possible sources of
error here. Respondents may report inaccurate information
because they do not have the required information, because of carelessness, or
because they do not understand the question asked. Alternatively, respondents may
provide accurate information, but errors are introduced in the data processing
stage due to keypunching, coding, or programming errors. Since it is often not
possible to determine the "true value" of what one is trying to
measure, precise estimates of measurement error are usually not possible.
However, techniques exist for obtaining some information about the likely
extent of measurement error. For example, information reported by individuals
may be compared with appropriate institutional records on the individual.
Measure of Reliability - A statistical inference is usually accompanied by a
measure of reliability, that is, a statement of how good the inference is. Because inferences are based on only a
portion of the population, there is always a level of uncertainty in our
inferences. Generally, the smaller the sample size, the less certain we are
about the inference. An inference is
incomplete without a measure of its reliability.
Median - the midpoint value obtained by ranking all values from
highest to lowest and choosing the value in the middle. The median divides a
population into two equal halves.
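The difference between the median and the mean shows up most clearly when the data contain an extreme value. The figures below are made up for illustration.

```python
import statistics

incomes = [21, 24, 25, 27, 30, 32, 250]  # made-up values with one extreme observation

mean_income = statistics.mean(incomes)      # sum of the values divided by their number
median_income = statistics.median(incomes)  # middle value after sorting

# The single large value pulls the mean upward but leaves the median unchanged.
```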
Mutually exclusive - The
occurrence of one event means that none of the other events can occur at the
same time.
Nominal Data - A set
of data is said to be nominal if the values / observations belonging to it can
be assigned a code in the form of a number where the numbers are simply labels.
You can count but not order or measure nominal data. For example, in a data set
males could be coded as 0, females as 1; marital status of an individual could
be coded as Y if married, N if single.
Nonresponse - The failure to obtain data from an individual
selected for a sample.
Nonsampling Error - A general term applying to all sources of error, with
the exception of sampling error.
Nonsampling errors occur from nonresponse, coding errors, computer
processing errors, errors in the sampling frame, reporting errors, and other
errors.
Normality - Many kinds of data, variables, etc. share a common
characteristic. When distributed they all seem to form a bell-shaped curve.
This bell-shaped curve is called the normal curve. The normal distribution arises repeatedly in
biology. The heights and weights of
human and animal populations seem to follow this normal distribution. We
observe objects in nature and find that certain qualities of natural objects
seem to fit the normal curve. Within
a species of plant, some will grow tall while others grow short and some will
grow to heights in between the tall and short ones. The same is true of
animals. Many of the measurable traits of living objects seem to follow this
normal curve.
Normal Distribution - Normal
distributions model (some) continuous
random variables.
Null Hypothesis - The prediction that an observed difference is due
to chance alone and not due to a systematic cause; this hypothesis is tested by
statistical analysis, and is either accepted or rejected. For a T-test, the null hypothesis is that
there is no difference between the two population means; for ANOVA, the null
hypothesis is that there is no difference among the three or more population
means; and for a correlation analysis, the null hypothesis is that there is no
relationship between the variables being studied.
Operationalizing - Defining a concept so that it can be measured.
Ordinal Data - A set
of data is said to be ordinal if the values / observations belonging to it can
be ranked (put in order) or have a rating scale attached. You can count and order, but not measure,
ordinal data.
Outcome -
A particular result of an experiment.
Out-of-scope - Sampling units that are not part of the population
of interest. For example, in the National
Survey of Recent College Graduates, only individuals who received a
bachelor's or master's degree within a specified time frame are of interest. If
an educational institution provided the name of an individual who failed to
graduate, the individual would be considered out-of-scope
for the survey. Information on this individual would not be included in the
final estimates from the survey.
Outliers - Values or scores that lie unusually far from the rest of
the data.
Oversampling - Deliberately sampling a portion
of the population at a higher rate than the remainder of the population.
P-value - A statistical term for the probability, calculated assuming the null hypothesis is true,
of obtaining a result at least as extreme as the one observed. The lower the P-value, the harder
it is to attribute the result to chance alone.
Parametric Statistics - A group of statistical procedures that researchers use to
test data assumed to come from a particular distribution, most often the normal distribution.
Precision -
The closeness of repeated measurements to the same value.
Probability - A value
between zero and one, inclusive, describing the relative possibility (chance or
likelihood) an event will occur.
Parameter - A quantity (such as the mean or variance) that
characterizes a statistical population and that can be estimated by
calculations from sample data.
Poll - An inquiry into public opinion conducted by interviewing
a random sample of people.
Population - The group with a particular set of characteristics to
which researchers attempt to generalize their findings from a smaller sample.
These are the objects of generalizations for inferential statistics.
Poisson Distribution - Poisson
distributions model (some) discrete
random variables. Typically, a Poisson random variable is a count of
the number of events that occur in a certain time interval or spatial area. For
example, the number of cars passing a fixed point in a 5 minute interval; the
number of calls received by a switchboard during a given period of time.
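As a sketch, the probability of seeing exactly k events can be computed from the Poisson formula once an average rate is assumed. The rate of 3 cars per 5-minute interval and the function name poisson_pmf are made up for illustration.

```python
from math import exp, factorial


def poisson_pmf(k, rate):
    """Probability of exactly k events when the average count per interval is rate."""
    return rate**k * exp(-rate) / factorial(k)


average_cars = 3.0  # assumed average number of cars per 5-minute interval

prob_no_cars = poisson_pmf(0, average_cars)
prob_five_cars = poisson_pmf(5, average_cars)

# Probability of at most 5 cars in one interval.
prob_at_most_five = sum(poisson_pmf(k, average_cars) for k in range(6))
```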
Power Analysis - A procedure that is used to determine the sample size needed to
keep the probability of a Type II error acceptably low.
Precision - is a
measure of how close an estimator is expected to be to the true value of a
parameter.
Randomness - a basic statistical concept and property implying an
absence of a plan, purpose or pattern, or of any tendency to favor one outcome
rather than another.
Random Sampling - is a
sampling technique where we select a group of subjects (a sample) for study
from a larger group (a population). Each individual is chosen entirely by
chance and each member of the population has a known, but possibly non-equal,
chance of being included in the sample.
Random Variable - The
outcome of an experiment need not be a number, for example, the outcome when a
coin is tossed can be 'heads' or 'tails'. However, we often want to represent
outcomes as numbers. A random variable is a function that associates a unique
numerical value with every outcome of an experiment. The value of the random
variable will vary from trial to trial as the experiment is repeated.
Range - The difference between the maximum and the minimum in a
set of data (e.g., for a set of scores
ranging from 20 to 35, the range is 15).
Regression Analysis - A method for determining the
association between a response variable and one or more explanatory variables. Regression analysis is used to estimate or
predict the relative influence of more than one variable on something (e.g.,
the effect of age, gender, and educational level on the prevalence of a
disease).
Reliability
- The measure of consistency for an assessment instrument. The instrument
should yield similar results over time with similar populations in similar
circumstances.
Research - means investigation or
experimentation aimed at the discovery of new theories or laws and the
discovery and interpretation of facts or revision of accepted theories or laws
in the light of new facts.
Research Design - A systematic plan of what data to gather, from whom, how and when to
collect the data, and how to analyze the data obtained.
Residuals - the difference between data
observed and values expected.
Respondent - The individual or organization
providing the information requested in the survey. The type of respondent
influences what type of information can be obtained, e.g., individuals
completing a degree may provide different information about the degree than a
representative of the academic institution granting the degree would provide.
Response Rate - Indicates the percentage of sample
members who provided information in response to being surveyed. Care in
interpreting response rates is necessary, because there is not one single
uniformly accepted measure of response rate. One common measure, used
extensively in demographic surveys, is the percentage of in-scope
sample members who responded to the survey. In surveys that focus on estimating
expenditures, the response rate is often calculated as the percentage of the
total expenditures represented by responding sample members. This measure is
often referred to as a weighted response rate (though weighting may also be
used to adjust for different probabilities of sample selection).
Response Variable - A variable that
measures an outcome or result of a study.
Sample - is a group of units selected from
a larger group (the population). By studying the sample it is hoped to draw
valid conclusions about the larger group.
Sample Design - The sampling procedure used to produce any
type of sample.
Sampling Distribution - describes probabilities associated with a statistic
when a random sample is drawn from a population.
Sampling error - The estimated discrepancy between a statistic and a
parameter. The difference in results for
different samples of the same size is called sampling error. All things being
equal, by increasing the sample size, from say 25 to 125 students, the sampling
error will be reduced (but not eliminated) and the study findings can be
assumed to be more reliable.
Scope of Survey - The population
to which the researcher plans to generalize his or her results. The scope of
the survey may be limited by both theoretical and practical considerations. For
example, while it may be of theoretical interest to obtain information on the
characteristics of institutionalized individuals, practical difficulties often
lead researchers to declare such individuals out-of-scope
for a survey. Out-of-scope cases may be eliminated at the time
of sample frame construction or during data collection or data processing.
Skewness - is an asymmetrical frequency
distribution in which the values are concentrated on one side of the central
tendency and trail out on the other side. If the trail is to the right or
positive end of the scale, the distribution is said to be positively skewed.
Spurious - when the covariation observed between two variables
is not due to the variables influencing each other, but because both are being
influenced by some third variable.
Standard Deviation - Standard deviation is a statistical
measure of spread or variability. It is the square root of the sum of the
squared deviations from the mean divided by the number of scores minus one.
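The definition can be followed step by step, as in this sketch. The scores are made up, and the function name sample_standard_deviation is an assumption; the result is compared with the standard library's statistics.stdev, which uses the same "number of scores minus one" divisor.

```python
import math
import statistics


def sample_standard_deviation(scores):
    """Square root of (sum of squared deviations from the mean / (n - 1))."""
    n = len(scores)
    mean = sum(scores) / n
    sum_of_squares = sum((x - mean) ** 2 for x in scores)
    return math.sqrt(sum_of_squares / (n - 1))


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up scores with mean 5

sd_by_hand = sample_standard_deviation(scores)
sd_library = statistics.stdev(scores)
```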
Standard Error - An estimate of the expected error in the sample
estimate of a population
mean. It is the sample estimate of the population standard deviation
(sample standard
deviation) divided by the square root of the sample size (assuming
statistical independence of the values in the sample):
standard error = s / sqrt(n), where s is the sample standard deviation and n is the sample size.
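In code, the standard error of the mean might be sketched as follows, assuming independent observations. The function name standard_error and the data are made up for illustration.

```python
import math
import statistics


def standard_error(sample):
    """Sample standard deviation divided by the square root of the sample size."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data
se = standard_error(scores)

# Repeating the same values four times over gives a larger sample
# and therefore a smaller standard error.
se_larger_sample = standard_error(scores * 4)
```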
Statistic - A quantity that is calculated from a sample of data. It is used to give
information about unknown values in the corresponding population. For example,
the average of the data in a sample is used to give information about the
overall average in the population from which that sample was drawn.
Statistics - Statistics
refers to more than numerical descriptions.
There are two broad areas of statistics: (1) describing and summarizing
large masses of data, referred to as descriptive statistics, and (2) drawing
conclusions (making estimates, decisions, predictions, etc.) about some set
of data based on sampling, referred to as inferential statistics.
Statistical Inference - is an
estimate, prediction, or generalization about a population based upon
information contained in a sample.
Statistical Population - An entire set of data about which we wish to
draw conclusions. It is a set of units
(usually people, objects, transactions, or events).
Statistical Sample - A portion or a subset of a statistical population.
Statistical test - Type of statistical
procedure that is applied to data to determine whether the results are
statistically significant (that is, the outcome is not likely to have resulted
by chance alone.)
Stratification - A sampling technique in which sampling is done
separately for separate parts of the population.
Stratification is often used to ensure that one has an adequate number of
sampling units with relatively rare characteristics (e.g., stratification may
be done on race/ethnic status if one wishes to make comparisons among
racial/ethnic groups).
Stratified Sampling - There
may often be factors which divide up the population into sub-populations
(groups / strata) and we may expect the measurement of interest to vary among
the different sub-populations. This has to be accounted for when we select a
sample from the population in order that we obtain a sample that is
representative of the population. This is achieved by stratified sampling. A stratified sample is obtained by taking
samples from each stratum or sub-group of a population.
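One simple way to draw such a sample is to sample the same fraction from every stratum (proportional allocation), as sketched below. The student population, the 10% sampling fraction, and the function name stratified_sample are assumptions made for illustration; other allocation schemes are possible.

```python
import random


def stratified_sample(population, stratum_of, fraction, seed=0):
    """Draw the same fraction of units, at random, from every stratum."""
    rng = random.Random(seed)
    # Group the units by stratum.
    strata = {}
    for unit in population:
        strata.setdefault(stratum_of(unit), []).append(unit)
    # Sample within each stratum separately (at least one unit per stratum).
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample


# Made-up population: 80 undergraduate and 20 graduate students.
students = [("undergraduate", i) for i in range(80)] + [("graduate", i) for i in range(20)]
sample = stratified_sample(students, stratum_of=lambda s: s[0], fraction=0.1)
```

Because each stratum is sampled separately, the sample keeps the 80/20 split of the population, which a simple random sample of 10 would not guarantee.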
Statistically Significant - An observed effect of
a size that would rarely occur by chance is called statistically significant.
Subsample - A sample
selected from a sample frame that is itself a sample of a larger
population.
Often the original sample is used to identify individuals or organizations of
interest or is used to sort units into groups to be sampled at different rates.
Surveys - Research using
questionnaires or interviews to poll or obtain information. Survey instruments are often administered by
mail, handouts, personal and telephone interviews, and the Internet.
Target Population - is the entire group a researcher is interested in;
the group about which the researcher wishes to draw conclusions.
Treatment - Any specific experimental condition applied to the subjects.
T-test - A
parametric statistical test of the difference between the means of two samples
or of the same sample at two different points in time.
Type I Error - Rejecting the null hypothesis when it is true.
Type II Error - Failing to reject the null hypothesis when it is false.
Undercoverage - Occurs when some groups in the population are left out of the process of
choosing the sample.
Validity -
The degree to which an instrument actually measures what it is supposed to
measure.
Variable - A characteristic or property of an individual population unit. The name variable is derived from the fact
that any particular characteristic may vary among the units in the
population. In social science research,
for each unit of analysis, each item of data (e.g., age
of person, income of family, consumer price index) is called a variable.
Variance -
A measure of dispersion which is the mean of the squares of deviations of the
observations from the population mean.
It is the square of the standard deviation.
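The sketch below follows this definition for a complete population, dividing by the number of observations. When only a sample is available, the sum of squares is usually divided by the number of observations minus one instead. The data and the function name population_variance are made up for illustration.

```python
import statistics


def population_variance(values):
    """Mean of the squared deviations of the values from their mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data, treated as a whole population

variance = population_variance(values)
standard_deviation = statistics.pstdev(values)  # population standard deviation
```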
Sources:
http://www.esomar.org/index.php/glossary-c.html
http://www.cas.lancs.ac.uk/glossary_v1.1/Alphabet.html
http://www.epa.gov/evaluate/glossary/a-esd.htm