Work in Progress
Causality: Usually stated as: If X happens, then Y will follow as a result. For example, if we reduce the speed limit to 55 mph, then we can reduce the number of highway fatalities.
Or stated as a conclusion: X caused Y.
Causal statements link the independent variable (which we think of causing or producing a change in the dependent variable) to the dependent variable.
But it is difficult to prove causality. Correlation does not prove causality. A common mistake is concluding that X caused Y when it really did not. Just because there is an increase in the number of storks and an increase in the birth rate does not mean that the storks caused the birth rate to increase.
Necessary elements: time-order, co-variation, elimination of rival explanations, and a logical theory.
Census: Data collected from all members of the population of interest.
Chi-Square test:
A statistical test to estimate the probability of getting the results solely because of random chance. Chi-square is particularly designed for testing the significance of relationships between categorical variables (nominal and ordinal data, and interval data which have been collapsed into categories) using cross-tabulation analysis.
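As a rough illustration of how the chi-square statistic is built from a cross tabulation, here is a minimal sketch in plain Python. The counts are made up for the example, and the function name `chi_square` is hypothetical rather than taken from any statistical package. The 3.841 cutoff is the standard critical value for a 2x2 table (1 degree of freedom) at the .05 level.

```python
def chi_square(observed):
    """Chi-square statistic for a table of observed counts (a list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Count we would expect in this cell if the two variables were unrelated
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected
    return statistic

# Made-up counts: rows = flunked out (yes, no); columns = high SAT, low SAT
observed = [[30, 60],
            [270, 140]]

statistic = chi_square(observed)
CRITICAL_VALUE_05_DF1 = 3.841  # .05 level, 1 degree of freedom
significant = statistic > CRITICAL_VALUE_05_DF1
print(round(statistic, 2), significant)
```

A statistic larger than the critical value suggests the pattern in the table is unlikely to be due to chance alone.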
Coefficient of Determination:
In linear regression analysis, this tells you the extent to which your independent variable explains the variance in your dependent variable (r²). When it is close to 0, it means there is little or no linear relationship; when it is close to 1, it means there is a strong relationship because the independent variable explains almost all of the variation. An r² of .5 is respectable.
Note: in multiple regression, it's called the multiple coefficient of determination. It's shown as R².
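The idea of "variance explained" can be sketched in a few lines of Python. This is only an illustration with made-up data; the function name `r_squared` is an assumption for the example. It fits a straight line and then compares the variation left over around the line with the total variation in the dependent variable.

```python
def r_squared(x, y):
    """Share of the variation in y explained by a straight-line fit on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variation left unexplained by the line vs. total variation in y
    unexplained = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    total = sum((b - mean_y) ** 2 for b in y)
    return 1 - unexplained / total

x = [1, 2, 3, 4, 5]           # made-up independent variable
y = [2, 4, 5, 4, 5]           # made-up dependent variable
print(round(r_squared(x, y), 2))
```

With these numbers the line explains about 60% of the variation in y.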
Confidence interval:
A confidence interval is the interval that is very likely to contain the true population mean based on your results from a sample drawn from that population.
Confidence level:
The estimated probability that a population parameter lies within a given confidence interval. Used as an expression of the accuracy of sample statistics to infer to the population. You want to be 95% confident that the true population value resides within a specified range of values.
Confidence levels and confidence intervals go together. One way to present it is as follows: In 19 out of 20 cases (95% of the time) the results based on this sample will differ by no more than 5 percentage points in either direction from what would have been obtained by seeking out all the people in our population of interest.
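To see how the confidence level, the interval width, and the sample size fit together, here is a minimal sketch for a survey percentage. It assumes a simple random sample and uses the usual normal approximation for a proportion; the function name `margin_of_error` and the sample sizes are chosen only for illustration.

```python
import math
from statistics import NormalDist

def margin_of_error(sample_size, confidence_level=0.95, proportion=0.5):
    """Half-width of the confidence interval for a sample proportion."""
    # z value that leaves (1 - confidence_level) split between the two tails
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

moe_384 = margin_of_error(384)     # roughly +/- 5 percentage points
moe_1000 = margin_of_error(1000)   # a larger sample gives a narrower interval
print(round(moe_384 * 100, 1), round(moe_1000 * 100, 1))
```

Using a proportion of 0.5 is the most cautious choice, since it gives the widest interval for a given sample size.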
Controls:
If we want to know whether an observed relationship between two variables can be accounted for by a third variable, we can "hold constant" or "control for" that third variable. If the relationship persists, then the third variable does not account for it. If the initial relationship disappears, then the relationship is said to be spurious. If specific relationships emerge, then the third variable helps to specify the nature of the relationship.
For example: you may want to look at the GPA of college athletes as compared to non-athletes. But then you think there might be differences by gender. So you ask the computer to look at the GPA of college athletes as compared to non-athletes, controlling for gender. You then analyze the GPA for male athletes and male non-athletes and the GPA for female athletes and female non-athletes.
Cross-tabulations (or contingency tables):
Method for analyzing nominal, ordinal, or interval data which have been converted into categorical variables. The table is interpreted by looking at the percentage distribution of the dependent variable for each category of the independent variable. For example: Of the 300 respondents with a high SAT score, 10% flunked out of college in their freshman year. In contrast, of the 200 respondents with a low SAT score, 30% flunked out of college in their freshman year. One could conclude that there is a 20 percentage point difference in "flunking out" based on SAT scores. (In this case, SAT scores are the independent variable and "flunking out" is the dependent variable.)
Contingency tables are usually (but not always) set up with the dependent variable on the side (rows) and the independent variable across the top (columns). The percentages are then calculated for each category of your independent variable.
Flunking Out    High SAT    Low SAT
Yes             10%         30%
No              90%         70%
                (n=300)     (n=200)
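The percentages in a table like this come from dividing each cell count by the total for its column (its category of the independent variable). Here is a minimal sketch in Python; the raw counts are worked back from the percentages and sample sizes in the example, and the variable names are only illustrative.

```python
# Counts implied by the example: 10% of 300 and 30% of 200 flunked out
counts = {
    "High SAT": {"Yes": 30, "No": 270},
    "Low SAT": {"Yes": 60, "No": 140},
}

# Percentage distribution of the dependent variable within each column
percentages = {}
for sat_group, outcomes in counts.items():
    column_total = sum(outcomes.values())
    percentages[sat_group] = {
        outcome: 100 * count / column_total for outcome, count in outcomes.items()
    }

difference = percentages["Low SAT"]["Yes"] - percentages["High SAT"]["Yes"]
print(percentages)
print(difference)  # difference in percentage points
```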
Dependent Variable:
This is the variable that you are trying to understand or explain. In terms of the particular theory used, the dependent variable is thought to be the result of (or related to) some other factor (the independent variable). It 'depends' upon another factor.
For example: highway fatalities is the dependent variable, since it is believed that the number or rate of fatalities depends upon (or is explained by) the speed limit.
Descriptive Statistics:
Techniques to describe and summarize quantitative data for an entire population. The most commonly used descriptive statistics are: frequencies, percentages, rates, means, medians, ranges, and standard deviations.
Dummy Variable:
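A variable with only two categories, such as gender: male or female. (Also called a dichotomous variable.)
Durbin-Watson: A test of whether adjacent observations are related (autocorrelation), usually applied to the residuals of a regression. It ranges from 0 to 4; a value near 2 indicates no autocorrelation. Rule of thumb: 1.5-2.5 is OK.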
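The Durbin-Watson statistic compares each residual with the one before it. A minimal sketch, assuming you already have the regression residuals in time order (the residual lists below are made up to show the extremes, and the function name is illustrative):

```python
def durbin_watson(residuals):
    """Sum of squared changes between adjacent residuals / sum of squared residuals."""
    changes = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    total = sum(e ** 2 for e in residuals)
    return changes / total

alternating = [1, -1, 1, -1]   # neighbors move in opposite directions
identical = [1, 1, 1, 1]       # neighbors are the same
print(durbin_watson(alternating), durbin_watson(identical))
```

Values well above 2 point to negative autocorrelation (as in the alternating list), and values well below 2 point to positive autocorrelation.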
Frequency Distribution:
A description of the number of times the various attributes or values are observed, e.g., 10 people were under 20 years of age, while 30 people were 20 years of age or older.
Generalization:
The ability to make a statement about a larger population based on the results of a sample from that same population. Note: you cannot generalize to a population which is different from the one you drew your sample from.
For example: if your sample is from men over 21, then you can't talk about men 21 or younger and you can't talk about women at all.
Hypothesis:
Your expectations about the nature of the relationship between two or more variables. It is a statement of something that ought to be observed if your theory about the variables is correct.
For example: Latchkey kids are more likely to use drugs and become juvenile delinquents than those who are not latchkey kids.
Hypothesis test:
The empirical determination of whether the expectations that are hypothesized are found in the real world. To test an hypothesis, you need to do the following:
1. Formulate your hypothesis. Be as specific as possible about what you expect to happen or observe.
2. Then, state it as a null hypothesis (usually that there is no difference in the dependent variable as a result of your independent variable).
3. Collect data relevant to your hypothesis.
4. Evaluate your data in light of your hypothesis. Did you get the outcome you expected?
5. Decide whether to "reject" or "fail to reject" your null hypothesis. Look at both the logic as well as the chances of the null hypothesis being true.
6. Revise your hypothesis as necessary and test again.
Independent Variable:
A variable which is believed to explain changes in the dependent variable. It is considered "independent" because of your particular theory; there is nothing permanent about this label.
Inferential Statistics:
A body of statistical computations which enable the researcher to make inferences from sample results to some larger population, i.e. to generalize from a sample to a population. These statistics enable you to estimate how an entire population would have responded if they had all been surveyed.
When dealing with inferential statistics, you are always concerned about sampling errors. You need to compute confidence levels (95% is the norm), confidence intervals (usually +/- 5%), and conduct statistical significance tests.
Interquartile: Divides the values into the top 25%, the bottom 25%, and the remaining middle 50%. The interquartile range is the spread of that middle 50%.
Kurtosis: Large values indicate the distribution has "heavy tails." That is, the data contain some values that are very distant from the mean. For a normal curve, kurtosis will equal 0 (when it is reported as excess kurtosis, as most statistical packages do). Positive ==> tails are heavier; Negative ==> tails are lighter.
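A small sketch may make the sign of kurtosis more concrete. It uses the simple moment-based formula for excess kurtosis (fourth moment divided by the squared variance, minus 3); statistical packages often apply small-sample adjustments, so their output can differ slightly. The data sets and function name are made up for illustration.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: 0 for a normal curve."""
    n = len(values)
    mean = sum(values) / n
    second_moment = sum((v - mean) ** 2 for v in values) / n
    fourth_moment = sum((v - mean) ** 4 for v in values) / n
    return fourth_moment / second_moment ** 2 - 3

heavy_tails = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]  # a few values far from the mean
light_tails = [-1, 1, -1, 1]                     # no values far from the mean
print(excess_kurtosis(heavy_tails), excess_kurtosis(light_tails))
```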
Least Squares:
In regression analysis, you want to fit a line to the data in a way which minimizes the sum of the squared vertical distances between the line (which represents the predicted values if there is a relationship between the two variables) and the actual data. (Also known as ordinary least squares.)
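Here is a minimal sketch of fitting a least squares line to two variables, using the standard textbook formulas for the slope and intercept. The data and function names are made up for illustration. The last lines check that the fitted line leaves smaller squared errors than another plausible-looking line.

```python
def fit_line(x, y):
    """Slope and intercept of the ordinary least squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
        (a - mean_x) ** 2 for a in x
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

def squared_error(x, y, slope, intercept):
    """Sum of squared vertical distances between the line and the data."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

x = [1, 2, 3, 4, 5]   # made-up data
y = [2, 4, 5, 4, 5]

slope, intercept = fit_line(x, y)
best_error = squared_error(x, y, slope, intercept)
other_error = squared_error(x, y, slope=1.0, intercept=1.0)  # an arbitrary rival line
print(round(slope, 2), round(intercept, 2), best_error < other_error)
```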
Levels of precision:
This refers to the type of data; i.e. whether it is:
a. interval/ratio (real numbers such as age, income);
b. ordinal (assigned numbers reflecting a scale, such as "high", "medium" and "low", or an extent scale); or,
c. nominal (categorical with no ranking, such as race or religion).
Measures of Central Tendency:
One number which summarizes and describes the average for a set of data. There are three measures:
Mean: the arithmetic average of a set of data.
Median: the middle point in a set of data.
Mode: the most common value in a set of data.
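The three measures can give quite different answers for the same data, especially when one value is extreme. A minimal sketch using Python's standard `statistics` module, with made-up incomes (in thousands of dollars):

```python
import statistics

incomes = [20, 25, 25, 30, 100]  # made-up data; one value is much larger than the rest

mean_income = statistics.mean(incomes)      # pulled upward by the extreme value
median_income = statistics.median(incomes)  # middle value after sorting
mode_income = statistics.mode(incomes)      # most common value
print(mean_income, median_income, mode_income)
```

Here the mean is well above the median because of the single large income, which is why the median is often preferred for skewed data.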
Measures of Dispersion:
A way to summarize the amount of variation in a set of data which describes how much the data clusters around the mean:
Variance: the average squared difference from the mean.
Standard deviation: most common measure. The standard deviation shows how the data vary from the mean. In a normal distribution, one standard deviation on either side of the mean covers about 68% of the data; 2 standard deviations cover about 95% of the data.
Range: difference between the highest and lowest values.
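A short sketch of the three measures, using Python's standard `statistics` module and made-up data. It uses the population versions (`pvariance`, `pstdev`), which divide by the number of values; for sample data, `variance` and `stdev` divide by one less than that.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data with a mean of 5

variance = statistics.pvariance(values)        # average squared difference from the mean
standard_deviation = statistics.pstdev(values) # square root of the variance
value_range = max(values) - min(values)        # highest minus lowest
print(variance, standard_deviation, value_range)
```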
Multicollinearity:
A condition when independent variables in a regression equation are highly correlated with each other.
Null Hypothesis:
This is a single hypothesis which negates all possible true values. The null hypothesis is stated as an assumption that there is no difference (or no relationship) between the variables you are examining. It is the opposite of what you would like to conclude.
For example: You want to conclude there is a relationship between age and voting behavior (that older folks are more likely to vote than younger folks). Your null hypothesis states that there is no difference between people of different ages and their voting behavior. If the probability from your statistical test (called a significance test) turns out to be less than .05, then you can reject the null hypothesis. If it is greater than .05, then your results "fail to reject the null hypothesis".
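The decision rule can be sketched with made-up numbers for the voting example. The glossary does not say which significance test to use; this sketch assumes a two-proportion z-test with a normal approximation, and the counts (140 of 200 older people voting, 110 of 200 younger people voting) and function name are purely hypothetical.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """z statistic and two-sided probability for a difference in proportions."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Under the null hypothesis both groups share one underlying proportion
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p_value = two_proportion_z_test(140, 200, 110, 200)
if p_value < 0.05:
    decision = "reject the null hypothesis"
else:
    decision = "fail to reject the null hypothesis"
print(round(z, 2), decision)
```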
Operationalization: defining a concept in measurable terms so it can be studied.
Ordinal Measures:
A ranking order of categories going from most to least of a variable; it assumes an underlying continuum. For example, extent scales are ordinal measures.
Parameter:
A measure used to summarize the characteristics of the universe (population) from which your sample was drawn. Many statistical tests that estimate parameters assume a normal distribution (bell curve).
Partial Correlation:
A correlation between 2 variables, when one or more other variables are controlled for.
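For a single control variable Z, the partial correlation can be computed from the three ordinary correlations using the standard first-order formula. A minimal sketch, with made-up correlation values and an illustrative function name:

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between X and Y with Z controlled for (first-order partial)."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Made-up case 1: the X-Y correlation is fully accounted for by Z
vanishes = partial_correlation(r_xy=0.42, r_xz=0.6, r_yz=0.7)

# Made-up case 2: Z is only weakly related to X and Y, so little changes
persists = partial_correlation(r_xy=0.50, r_xz=0.2, r_yz=0.1)
print(round(vanishes, 3), round(persists, 3))
```

In the first case the relationship disappears once Z is controlled for; in the second it stays close to the original value.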
Percentage Distribution:
A description of the proportional distribution of all observations for each category of the variable. For example: of the 300 respondents, 55% are women and 45% are men. Generally, use percents whenever possible, since they are more analytic; show the percents along with the total number of respondents (upon which the percentages are based), so there is a context for interpreting the percentages.
Precision in measures:
Keeping the units of measurement relatively fine. For example, reporting a person's income in dollars is more precise than rounding it off to the nearest thousand dollars. However, too much precision in measures can be a nuisance. For example, if you are reporting survey results based on a sample, rounding to the nearest whole number is precise enough. It really depends upon your research question; it dictates how precise you really need to be.
Population: the total set of items or people you are interested in studying.
Probability: How likely it is that certain events will occur; or an estimate of how likely it is that the results are due to chance.
Product-moment correlation coefficient, a.k.a. "r":
Measures how closely the data points cluster around the regression line. This coefficient compares a set of data with an ideal of a perfect relationship, and assigns a score whose absolute value ranges from 0 (no linear relationship at all) to 1 (a perfect relationship); the sign shows whether the relationship is positive or negative. The more closely the data approximate the ideal, the closer the absolute value of the score is to 1. It is not true, however, that r = .6 is twice as strong as r = .3. This is reminiscent of the difference between ordinal and interval measurement: we know that the higher the absolute value of r, the stronger the relationship, but we don't know how much stronger one relationship is than another. Note: generally, you want to use r² (the coefficient of determination).
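A minimal sketch of computing r from paired data, using the standard product-moment formula. The data and function name are made up for illustration. The last lines show one way to see why r = .6 is not "twice as strong" as r = .3: squaring each value gives the share of variance explained, which differs by a factor of four rather than two.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation coefficient for paired values."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]                    # made-up data
rising = pearson_r(x, [2, 4, 5, 4, 5])   # y tends to rise with x
falling = pearson_r(x, [5, 4, 5, 4, 2])  # y tends to fall as x rises
print(round(rising, 3), round(falling, 3))

# Variance explained for r = .6 versus r = .3
explained_06 = 0.6 ** 2
explained_03 = 0.3 ** 2
print(round(explained_06, 2), round(explained_03, 2))
```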
Quantitative Research:
Focus is on numerical measures of things and on mathematical analysis. Useful in summarizing data into useful information. Limited in understanding why things happen, or when what you wish to study doesn't appear in quantifiable terms.
Qualitative Research:
Focus is on non-numerical understanding of the world: observations, verbal statements about things. Useful for understanding context and perspective, and complex phenomena that are non-numerical.
Random Sample:
A sample in which every member of the population has an equal chance of being selected. Also called a probability sample. This is the fundamental requirement for using inferential statistics. Simple random and stratified random sampling are the most common.
Range: presents the highest and lowest values in a set of data.
Ratio: An interval measure for which, in addition to being able to measure the difference between different values of the measure, it is possible to assign a true zero point to the measure. One difference between interval and ratio measurement is that, whereas it is possible to add and subtract interval measures, it is possible to add, subtract, multiply, and divide ratio measures. It is possible to say that a given score is twice as great as another score, given ratio measurements.
Regression Coefficient:
Used in predicting the value of the dependent variable, for any given value of the independent variable. It is the slope of the regression line, which says that for any one unit change in the independent variable there is so much change in the dependent variable.
For example, if the average age of the housing stock is a predictor of fires, then as the average age of the housing stock increases by one year, the number of fires you can expect will increase by 1.5. The regression coefficient (or slope) = 1.5.
Reliability of a measure:
A measure is reliable to the extent that it gives the same result again and again if the measurement is repeated. Analogy of measuring with a yardstick made of wood vs. a yardstick made of elastic: the wooden yardstick will tend to give reliable measures every time, while the elastic tape measure will give different results.
Sample: a subset of the population. Can be randomly or nonrandomly selected.
Sampling Distribution:
A sampling distribution shows what proportion of the time each particular result could be expected to occur, if the sampling technique you have used were repeated a very large number of times and the set of assumptions (including the null hypothesis) were true. That is, it gives the probability of getting any particular result, if you took repeated samples. It enables you to estimate the probability of getting the observed result; it lets you estimate the risk of rejecting a null hypothesis when it is really true and should not be rejected.
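One way to get a feel for a sampling distribution is to simulate it. The sketch below builds a made-up population, draws many random samples of the same size, and looks at how the sample means spread out. All of the numbers (population size, mean of 100, standard deviation of 15, sample size of 25, 2,000 repetitions) are arbitrary choices for the illustration.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Made-up population of 10,000 scores
population = [random.gauss(100, 15) for _ in range(10_000)]

def sample_means(population, sample_size, repetitions):
    """Means of repeated simple random samples drawn from the population."""
    return [
        statistics.mean(random.sample(population, sample_size))
        for _ in range(repetitions)
    ]

means = sample_means(population, sample_size=25, repetitions=2_000)

# Spread of the sample means vs. the spread theory predicts (sd / sqrt(n))
observed_spread = statistics.stdev(means)
predicted_spread = statistics.pstdev(population) / math.sqrt(25)
print(round(observed_spread, 2), round(predicted_spread, 2))
```

The two printed numbers should be close to each other, which is the idea behind using the standard error to judge how far a single sample result is likely to be from the population value.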
Sampling Error:
Samples rarely, if ever, exactly mirror the population. Therefore, you need to estimate how much the sample statistics differ from the population values; the standard error is an estimate of the degree of error to be expected. The larger your sample, the smaller the standard error.
Sampling Frame:
The rules and procedures which specify how a sample is to be selected, including a statement about the characteristics used for selection.
Scattergram:
A graphic which plots all of the observations for how each case scored on both the independent and dependent variable.
Significance test:
It tells us how likely it is that we could have gotten our results by chance alone. This is only used with sample data. Note: significance does not say anything about whether the results are significant in terms of importance or meaningfulness; statistical significance just refers to whether the results are due to chance. For example, there may be a real 3% difference between attitudes of older people and younger people with respect to savings which is statistically significant. But, so what?
Meaningful significance, rather than statistical significance, is a different issue. Meaningful significance is highly subjective and depends upon the question your research is addressing. 3% may be very meaningful or it may not be.
Skewness:
A measure of how symmetric a set of data is, and so of how well it conforms to a normal distribution. The closer to 0, the less the skewness. A positive number indicates that the mean will be higher than the median. A negative number means that the mean will be lower than the median. The more the skewness, the better the argument to use the median as the measure of central tendency.
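A minimal sketch of the link between skewness and the gap between the mean and the median. It uses the simple moment-based skewness formula (statistical packages often apply small-sample adjustments, so their numbers can differ slightly), with made-up incomes in thousands of dollars and an illustrative function name.

```python
import statistics

def skewness(values):
    """Moment-based skewness: 0 for a perfectly symmetric distribution."""
    n = len(values)
    mean = sum(values) / n
    second_moment = sum((v - mean) ** 2 for v in values) / n
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / second_moment ** 1.5

incomes = [20, 25, 25, 30, 100]  # made-up data with one very high value

skew = skewness(incomes)
mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)
print(round(skew, 2), mean_income, median_income)
```

The skewness is positive and the mean sits above the median, so the median gives a better picture of the typical income here.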
Standard deviation:
Obtained by taking the square root of the variance (the average squared deviation from the mean). The smaller the number, the less variation or the less the dispersion. When the data follow a bell curve (normal distribution), about 68% of all the values are within one standard deviation of the mean, and about 95% of all the values are within two standard deviations of the mean. If the mean is 133 and the standard deviation is 14.4, we know that about 68% of all the values are between 133 +/- 14.4, or between 118.6 and 147.4. We also know that about 95% of the values are between 133 +/- 28.8 (2 standard deviations), or between 104.2 and 161.8.
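The ranges and percentages for a bell curve can be checked with Python's standard `statistics.NormalDist`. This sketch uses the mean of 133 and standard deviation of 14.4 from the example and assumes the values are normally distributed; the helper name `share_within` is only illustrative.

```python
from statistics import NormalDist

mean = 133.0
standard_deviation = 14.4
curve = NormalDist(mean, standard_deviation)

def share_within(number_of_sds):
    """Range and share of values within the given number of standard deviations."""
    low = mean - number_of_sds * standard_deviation
    high = mean + number_of_sds * standard_deviation
    return low, high, curve.cdf(high) - curve.cdf(low)

low_1, high_1, share_1 = share_within(1)
low_2, high_2, share_2 = share_within(2)
print(round(low_1, 1), round(high_1, 1), round(share_1, 3))
print(round(low_2, 1), round(high_2, 1), round(share_2, 3))
```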
Standard Error of the Mean:
This is the standard deviation of the sampling distribution of the sample mean. It is an estimate of the amount of error in a sample estimate of a population mean. This allows us to state the level of confidence that the estimated population mean lies within a specified range (i.e. the confidence interval).
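A minimal sketch of going from a sample's standard deviation to a standard error and then to a 95% confidence interval for the mean. The sample mean, standard deviation, and sample size are made-up values, and the sketch uses the normal (z) approximation, which is reasonable for larger samples; small samples would normally call for a t value instead.

```python
import math
from statistics import NormalDist

sample_mean = 133.0   # made-up sample result
sample_sd = 14.4      # made-up sample standard deviation
sample_size = 36      # hypothetical number of cases

# Standard error of the mean: standard deviation divided by the square root of n
standard_error = sample_sd / math.sqrt(sample_size)

# 95% confidence interval using the normal approximation (z is about 1.96)
z = NormalDist().inv_cdf(0.975)
lower = sample_mean - z * standard_error
upper = sample_mean + z * standard_error
print(round(standard_error, 2), round(lower, 1), round(upper, 1))
```

Quadrupling the sample size would cut the standard error in half, which narrows the interval accordingly.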
Statistics:
Statistics originally grew out of the need to keep records for the state. The name "statistics," in fact, derives from the Latin statisticus, meaning "of state affairs."
Statistics includes two main activities: statistical inference and statistical measurement (including the measurement of relationships). In the most narrow view, statistics refer to sample measures and characteristics of the sample.
Spurious Correlation:
When an association between X and Y disappears when a control is applied, the original association between X and Y is said to be spurious.
Suppressor Variable:
A variable that hides (suppresses) a relationship between X and Y, so that the relationship is not apparent at first. When the suppressor variable is controlled for, the relationship between X and Y becomes apparent.
Type I error:
A conclusion that there is a true difference between two populations when in fact there is not. Or put another way, a type I error occurs when you reject a null hypothesis that is actually true.
Type II error:
A conclusion that two populations are not different from each other when in fact they are. Or, put another way, a type II error occurs when you fail to reject a null hypothesis when in reality there is indeed a difference.
Unit of Analysis:
The object whose characteristics are being measured. For example, if you are measuring individual characteristics in explaining student drop-outs, then your unit of analysis is the individual student. If you are measuring school characteristics in explaining drop-out rates, then your unit of analysis is the school.
Validity of a measure:
A measure is valid if it tends to measure the concept it is meant to measure: Are you measuring what you think you are measuring?
Face validity: used when all else fails: do these measures, on their face, make sense to the reader?
Variables:
Attributes or characteristics whose variation is of interest to us. The same variable may be an independent variable in one theory and a dependent variable in another. The designation of "dependent" or "independent" depends upon our theory or assumptions, not on some inherent quality of the attribute itself.
Variance:
A measure of how widely the observed values of a variable vary among themselves. Calculated as the average squared deviation of values from their mean. If they do not vary at all, the variance will be zero. The more the values vary among themselves, the further each will be from the mean of them all, and the greater the sum of the squared deviations from the mean will be. Thus, the more they vary among themselves, the higher their variance will be.
Weight:
Provides a way to more accurately estimate a population mean or proportion based on data from a stratified sample. It is computed by dividing the number of items in each stratum of the population by the sample size for that stratum.
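A minimal sketch of how weights change an estimate from a stratified sample. It assumes the weight for a stratum is the number of items in that stratum of the population divided by the number sampled from it; the strata names, counts, and means are all made up for illustration.

```python
# Made-up strata: population size, number sampled, and the sample mean in each
strata = {
    "urban": {"population": 8000, "sampled": 100, "sample_mean": 50.0},
    "rural": {"population": 2000, "sampled": 100, "sample_mean": 70.0},
}

# Weight = how many population members each sampled case stands for
weights = {
    name: info["population"] / info["sampled"] for name, info in strata.items()
}

# Weighted mean: each stratum counts in proportion to its share of the population
weighted_total = sum(
    weights[name] * info["sampled"] * info["sample_mean"]
    for name, info in strata.items()
)
weighted_mean = weighted_total / sum(
    weights[name] * info["sampled"] for name, info in strata.items()
)

# Unweighted mean: treats every sampled case the same, ignoring stratum sizes
unweighted_mean = sum(
    info["sampled"] * info["sample_mean"] for info in strata.values()
) / sum(info["sampled"] for info in strata.values())

print(weights, weighted_mean, unweighted_mean)
```

Because the rural stratum is over-represented in this made-up sample, the unweighted mean is pulled toward the rural figure, and the weights correct for that.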