Data Analysis: Units 1 & 2

Syllabus

This course begins with basic concepts and terminology that are fundamental to data analysis and inference. It introduces the student to the collection and presentation of data, and discusses how data can be summarized and analysed to draw statistical inferences. Students will be introduced to important available data sources and will be trained in the use of statistical software to analyse data.

The main objectives of this course are:
1. To understand the process of data collection and presentation.
2. To learn how to obtain secondary data from various sources.
3. To learn techniques for processing and analysing data in order to draw statistical inferences.
Learning Outcomes

After completing this course you will be able to:

1. Understand data, its types of measurement, and its collection.
2. Know the procedure for determining sample size and sampling distribution.
3. Outline the design of a questionnaire and the procedure of pretesting.
4. Perform coding, editing, classification, tabulation, and graphical representation of data.
5. Analyse univariate and bivariate frequency distributions.
6. Understand the theoretical foundations of different tests and their role in hypothesis testing.
7. Construct composite index numbers.
Summary

Course Layout

Unit-1: Introduction and Overview

| Module No. | Topic |
|---|---|
| 1 | Data and Types of Measurement |
| 2 | Primary and Secondary Data: Merits and Demerits |
| 3 | Methods of Collecting Primary Data |
| 4 | Population and Sample: Merits and Demerits |
| 5 | Sampling Methods: Random and Non-Random |
| 6 | Sampling Size and Distribution |
| 7 | Sampling and Non-Sampling Errors |
| 8 | Designing a Questionnaire: Editing and Pretesting |
| 9 | Types of Interview Techniques |
| 10 | Methods of Collecting Secondary Data |

Unit-2

| Module No. | Topic |
|---|---|
| 11 | Data Processing: Editing and Coding |
| 12 | Data Processing: Classification and Tabulation |
| 13 | Cross Tabulation and its Significance |
| 14 | Practical Problems |
| 15 | Graphical Representation of Data: Line Graph, Bar Diagram and Pie Chart |
| 16 | Graphical Representation of Data: Histograms and Ogives |
| 17 | Practical Problems of Graphical Representation |

Unit-3

| Module No. | Topic |
|---|---|
| 18 | Univariate Frequency Distribution: Measures of Central Tendency |
| 19 | Univariate Frequency Distribution: Measures of Dispersion |
| 20 | Univariate Frequency Distribution: Skewness, Moments and Kurtosis |
| 21 | Numerical Problems: Univariate Frequency Distribution |
| 22 | Bivariate Frequency Distribution: Correlation, Various Methods |
| 23 | Bivariate Frequency Distribution: Regression Analysis |
| 24 | Numerical Problems: Bivariate Frequency Distribution |
| 25 | Estimation of Population Parameters |
| 26 | Methods of Estimation |
| 27 | Unbiased Estimator of Population Mean |
| 28 | Unbiased Estimator of Population Variance |

Unit-4

| Module No. | Topic |
|---|---|
| 29 | Basic Concepts of Inference |
| 30 | Testing of Hypothesis: Types, Uses |
| 31 | Testing of Hypothesis: t-test |
| 32 | Testing of Hypothesis: F-test |
| 33 | Testing of Hypothesis: Z-test |
| 34 | Testing of Hypothesis: Chi-Square Test |
| 35 | ANOVA: One-Way and Interpretation |
| 36 | ANOVA: Two-Way and Interpretation |
| 37 | Numerical Problems |
| 38 | Basics of Index Numbers |
| 39 | Price and Quantity Indices and their Properties |
| 40 | Numerical Problems of Index Numbers |
Course Content

1. Introduction

Data is the foundation of research, statistics, and decision-making. The way we measure and classify data determines the types of analyses we can perform. Understanding the different measurement scales (nominal, ordinal, interval, ratio) is crucial for choosing the right statistical methods.
1. Nominal Scale (Categorical Data)

- Definition: Used for labeling variables without any quantitative value or order.
- Key Features:
  - Categories are distinct but not ranked.
  - No mathematical operations (addition, subtraction) are meaningful.
- Examples:
  - Gender (Male, Female, Non-binary)
  - Marital Status (Single, Married, Divorced)
  - Colors (Red, Blue, Green)
- Permissible Statistics:
  - Frequency counts
  - Mode (most frequent category)
  - Chi-square tests (for relationships between categories)
2. Ordinal Scale (Ordered Categories)

- Definition: Data with a meaningful order, but the differences between values are not consistent.
- Key Features:
  - Can be ranked, but intervals are not measurable.
  - No true zero point.
- Examples:
  - Education Level (High School < Bachelor's < Master's < PhD)
  - Likert Scale (Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree)
  - Economic Status (Low, Medium, High)
- Permissible Statistics:
  - Median (better than the mean for central tendency)
  - Percentiles, rank-order correlation (Spearman's rho)
  - Non-parametric tests (Mann-Whitney U, Kruskal-Wallis)
3. Interval Scale (Numeric with Equal Intervals, No True Zero)

- Definition: Numeric data where differences between values are meaningful, but there is no true zero (zero does not mean "absence").
- Key Features:
  - Equal intervals between values.
  - Addition and subtraction are valid, but not multiplication/division.
- Examples:
  - Temperature in Celsius or Fahrenheit (0°C does not mean "no temperature")
  - IQ Scores (an IQ of 0 does not mean "no intelligence")
  - Calendar Years (2020, 2021, 2022)
- Permissible Statistics:
  - Mean, standard deviation
  - Correlation, t-tests, ANOVA, regression
4. Ratio Scale (Numeric with True Zero)

- Definition: The most informative scale: equal intervals plus a true zero (zero means "none").
- Key Features:
  - All arithmetic operations (+, −, ×, ÷) are valid.
  - Allows meaningful ratios (e.g., "twice as much").
- Examples:
  - Height (0 cm means no height)
  - Weight (0 kg means no weight)
  - Age (0 years means birth)
  - Income ($0 means no income)
- Permissible Statistics:
  - All statistical methods (mean, median, mode, SD, CV)
  - Geometric mean, harmonic mean
  - All parametric tests (t-tests, ANOVA, regression)
Comparison Table of Measurement Scales

| Feature | Nominal | Ordinal | Interval | Ratio |
|---|---|---|---|---|
| Categories | Yes | Yes | No | No |
| Order | No | Yes | Yes | Yes |
| Equal Intervals | No | No | Yes | Yes |
| True Zero | No | No | No | Yes |
| Example | Colors | Survey Ratings | Temperature (°C) | Weight (kg) |
Why Does This Matter?

- Statistical Tests: Different scales require different analyses (e.g., chi-square for nominal, t-tests for interval/ratio).
- Data Interpretation: Misclassifying data can lead to incorrect conclusions (e.g., calculating a mean for ordinal data is misleading).
- Research Design: Helps in choosing the right survey questions and measurement tools.
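To make the distinction concrete, here is a minimal sketch in Python (pandas assumed; the column names and values are hypothetical) that applies only scale-appropriate summaries to each column:

```python
import pandas as pd

# Hypothetical survey responses; column names are illustrative only.
df = pd.DataFrame({
    "color": ["Red", "Blue", "Red", "Green", "Blue", "Red"],   # nominal
    "satisfaction": [1, 3, 4, 2, 5, 4],                        # ordinal (1-5 Likert)
    "weight_kg": [61.2, 70.5, 55.0, 82.3, 66.1, 74.8],         # ratio
})

# Nominal: only frequency counts and the mode are meaningful.
print(df["color"].value_counts())
print("Mode:", df["color"].mode()[0])

# Ordinal: the median (and percentiles) respect rank order.
print("Median satisfaction:", df["satisfaction"].median())

# Ratio: all arithmetic summaries are valid.
print("Mean weight:", df["weight_kg"].mean())
print("SD of weight:", df["weight_kg"].std())
```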
2. Primary and Secondary Data: Merits and Demerits

Data collection is a crucial part of research, and collected data can be classified into primary data and secondary data. Each has its own advantages and disadvantages, depending on the research objectives, resources, and time constraints.
1. Primary Data

Definition: Data collected firsthand by the researcher for a specific purpose.

Merits of Primary Data:
1. Original & Specific – Tailored to meet the exact needs of the research.
2. Reliable & Accurate – Collected directly, reducing the chance of errors or bias.
3. Up-to-date – Reflects current trends and information.
4. Control Over Methodology – The researcher decides the sampling, tools, and techniques.
5. No Copyright Issues – Owned by the researcher, so no legal restrictions.

Demerits of Primary Data:
1. Time-consuming – Requires planning, execution, and analysis.
2. Costly – Involves expenses for surveys, interviews, or experiments.
3. Labor-intensive – Needs trained personnel for data collection.
4. Limited Scope – May not cover a wide population due to resource constraints.
5. Respondent Bias – Participants may provide inaccurate or biased responses.
2. Secondary Data

Definition: Data collected by someone else for a different purpose but reused for research.

Merits of Secondary Data:
1. Time-saving – Already available, so no fresh collection is needed.
2. Cost-effective – Cheaper than primary data collection.
3. Large Sample Size – Often covers broader demographics (e.g., government reports).
4. Historical Comparison – Useful for longitudinal studies.
5. Convenience – Easily accessible through books, journals, databases, etc.

Demerits of Secondary Data:
1. Outdated – May not reflect current scenarios.
2. Less Control – The researcher cannot verify accuracy or methodology.
3. Relevance Issues – Might not fully align with the research objectives.
4. Potential Bias – The original collector's bias may persist.
5. Copyright Restrictions – Some data may require permissions or payments.
3. Methods of Collecting Primary Data

Primary data is collected firsthand by the researcher for a specific study. The choice of method depends on the research objectives, budget, time, and type of data needed.
1. Surveys & Questionnaires

Description: Structured forms with predefined questions distributed to respondents.

Types:
- Online Surveys (Google Forms, SurveyMonkey)
- Paper-based Questionnaires
- Telephone Surveys
- Face-to-face Surveys

✔ Pros:
- Cost-effective for large samples
- Easy to analyze (quantitative data)
- Can reach a wide audience

✖ Cons:
- Risk of low response rates
- Possible biased or inaccurate responses
- Limited depth in answers
2. Interviews

Description: Direct interaction in which the researcher asks questions verbally.

Types:
- Structured Interviews (fixed questions)
- Unstructured Interviews (open-ended, conversational)
- Semi-structured Interviews (mix of fixed and flexible questions)

✔ Pros:
- Detailed and in-depth responses
- Clarifications possible
- Useful for sensitive topics

✖ Cons:
- Time-consuming
- Expensive (travel, recording, transcription)
- Interviewer bias may influence responses
3. Observations

Description: Systematic watching and recording of behavior or events.

Types:
- Participant Observation (researcher engages in the activity)
- Non-participant Observation (researcher remains detached)
- Structured Observation (predefined checklist)
- Unstructured Observation (flexible, notes taken freely)

✔ Pros:
- Real-time, authentic data
- No reliance on self-reported data
- Useful for behavioral studies

✖ Cons:
- Observer bias possible
- Time-intensive
- Ethical concerns (privacy issues)
4. Experiments

Description: Controlled studies in which variables are manipulated to observe effects.

Types:
- Lab Experiments (controlled environment)
- Field Experiments (real-world setting)

✔ Pros:
- High level of control over variables
- Establishes cause-effect relationships
- Replicable

✖ Cons:
- Artificial settings may not reflect reality
- Expensive and complex
- Ethical concerns (e.g., medical trials)
5. Focus Group Discussions (FGDs)

Description: Group interviews with 6–12 participants discussing a topic.

✔ Pros:
- Rich qualitative data
- Group dynamics generate new ideas
- Quick way to gather multiple perspectives

✖ Cons:
- Dominant participants may influence others
- Difficult to organize
- Moderator bias possible
6. Case Studies

Description: In-depth analysis of a single individual, group, or event.

✔ Pros:
- Detailed insights
- Useful for rare or unique cases
- Combines multiple data sources

✖ Cons:
- Not generalizable
- Time-consuming
- Subjectivity in interpretation
7. Diaries & Self-Reports

Description: Participants record their activities, thoughts, or experiences over time.

✔ Pros:
- Provides longitudinal data
- Reduces recall bias
- Personal perspective

✖ Cons:
- Requires participant commitment
- May be incomplete or inaccurate
- Difficult to verify
Choosing the Right Method

| Factor | Best Method |
|---|---|
| Quick data | Surveys, secondary data |
| Deep insights | Interviews, focus groups, case studies |
| Behavioral data | Observations, experiments |
| Large samples | Surveys, online questionnaires |
4. Population and Sample: Merits and Demerits

In research, the population refers to the entire group under study, while a sample is a smaller, manageable subset selected to represent the population. Both have advantages and disadvantages based on research goals, resources, and accuracy requirements.
1. Population (Census Study)

Definition: Data collected from every member of the target group.

✔ Merits of Studying the Entire Population:
1. High Accuracy – No sampling errors, since all units are included.
2. Complete Data – Provides insights into every subgroup.
3. No Sampling Bias – Eliminates selection bias.
4. Reliable for Policy Making – Used in national censuses for government decisions.
5. Generalizable Results – Findings apply directly to the whole population.

✖ Demerits of Studying the Entire Population:
1. Expensive – High data collection costs, especially for large populations.
2. Time-Consuming – Takes longer to gather and process data.
3. Impractical for Large Groups – Difficult if the population is very large or widely dispersed.
4. Resource-Intensive – Requires more manpower and logistics.
5. Possible Non-Response Errors – Some individuals may refuse to participate.
2. Sample (Sampling Study)

Definition: A representative subset of the population used for analysis.

✔ Merits of Sampling:
1. Cost-Effective – Cheaper than studying the entire population.
2. Time-Saving – Faster data collection and analysis.
3. Practical for Large Populations – Useful when the population is too big to study fully.
4. Easier Data Management – Smaller datasets are simpler to analyze.
5. Flexibility – Different sampling techniques (random, stratified, cluster) can be applied.

✖ Demerits of Sampling:
1. Sampling Error – Results may differ from the true population values.
2. Risk of Bias – Poor sampling methods (e.g., convenience sampling) can skew results.
3. May Miss Subgroups – Some segments may be underrepresented.
4. Requires Expertise – Proper sampling techniques are needed for accuracy.
5. Less Detailed – Does not capture every individual's data.
Comparison Table: Population vs. Sample

| Factor | Population (Census) | Sample |
|---|---|---|
| Accuracy | ✅ High | ❌ Possible sampling error |
| Cost | ❌ Expensive | ✅ Affordable |
| Time | ❌ Lengthy | ✅ Faster |
| Feasibility | ❌ Difficult for large groups | ✅ Practical |
| Bias Risk | ✅ Minimal | ❌ Depends on method |
When to Use Population vs. Sample?

Use a population (census):
- When absolute precision is needed (e.g., a national census).
- When the population is small and accessible.
- When budget and time are not constraints.

Use a sample:
- When quick, cost-effective results are needed.
- When the population is too large or scattered.
- When destructive testing is involved (e.g., product quality checks).
5. Sampling Methods: Random (Probability) and Non-Random (Non-Probability)

Sampling is the process of selecting a subset of individuals from a population to represent the whole group. Sampling methods are broadly classified into random (probability) sampling and non-random (non-probability) sampling, each with its own strengths and weaknesses.
1. Random (Probability) Sampling

Definition: Every member of the population has a known, non-zero chance of being selected.

✔ Merits of Random Sampling:
1. High Representativeness – Minimizes selection bias, since inclusion is determined by chance.
2. Statistical Generalizability – Results can be projected to the entire population.
3. Reduced Researcher Bias – No human interference in selection.
4. Reliable for Inferential Statistics – Supports hypothesis testing.
5. Fair & Transparent – Uses randomization techniques (e.g., random number tables).

✖ Demerits of Random Sampling:
1. Requires a Complete Population List – Difficult if the population is undefined.
2. Time-Consuming & Costly – Needs proper randomization procedures.
3. May Still Have Sampling Errors – Randomness doesn't guarantee perfect representation.
4. Impractical for Large, Dispersed Populations – Hard to access all members.
Common Random Sampling Techniques (illustrated in the sketch after this list):
- Simple Random Sampling – Equal chance for all (e.g., a lottery draw).
- Stratified Sampling – Divides the population into subgroups, then randomly samples each.
- Systematic Sampling – Selects every nth individual from a list.
- Cluster Sampling – Randomly selects groups/clusters instead of individuals.
- Multistage Sampling – Combines multiple random methods.
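The sketch below illustrates three of these designs in Python (pandas assumed; the population DataFrame and its region column are hypothetical):

```python
import pandas as pd

# Hypothetical population of 1,000 people with a region subgroup.
population = pd.DataFrame({
    "id": range(1000),
    "region": ["North", "South", "East", "West"] * 250,
})

# Simple random sampling: every member has an equal chance.
simple = population.sample(n=100, random_state=42)

# Stratified sampling: draw 10% from each region so subgroups
# are represented proportionally.
stratified = (
    population.groupby("region", group_keys=False)
    .apply(lambda g: g.sample(frac=0.10, random_state=42))
)

# Systematic sampling: every 10th individual from the list.
systematic = population.iloc[::10]

print(len(simple), len(stratified), len(systematic))  # 100 100 100
```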
2. Non-Random (Non-Probability) Sampling

Definition: Selection is not random; some members have zero chance of being included.

✔ Merits of Non-Random Sampling:
1. Quick & Cost-Effective – No need for a full population list.
2. Useful for Exploratory Research – Helps in early-stage studies.
3. Convenient – Easy to implement (e.g., surveying available people).
4. Good for Qualitative Studies – Focuses on specific groups (e.g., case studies).
5. Flexible – Adaptable to research needs.

✖ Demerits of Non-Random Sampling:
1. High Risk of Bias – Results may not represent the population.
2. Low Generalizability – Findings cannot be statistically inferred.
3. Researcher Influence – Personal judgment affects selection.
4. Unreliable for Quantitative Research – Not suitable for statistical analysis.
Common Non-Random Sampling Techniques:
- Convenience Sampling – Selects easily accessible subjects (e.g., mall surveys).
- Purposive (Judgmental) Sampling – The researcher handpicks subjects based on criteria.
- Snowball Sampling – Existing participants recruit others (e.g., hidden populations).
- Quota Sampling – Selects a fixed number from each subgroup, but not randomly.
Comparison Table: Random vs. Non-Random Sampling

| Factor | Random Sampling | Non-Random Sampling |
|---|---|---|
| Selection Basis | Random, unbiased | Subjective, convenient |
| Representativeness | ✅ High | ❌ Low (biased) |
| Generalizability | ✅ Strong | ❌ Weak |
| Cost & Time | ❌ High | ✅ Low |
| Best For | Quantitative research | Qualitative research |
When to Use Which Method?

Use random sampling if:
- You need statistically valid, generalizable results.
- The population is well-defined (e.g., voter surveys).
- Hypothesis testing is required.

Use non-random sampling if:
- The population is hard to access (e.g., homeless individuals).
- You need quick, preliminary insights (pilot studies).
- The study is qualitative (e.g., interviews, case studies).
6. Sampling Size and Distribution

Sampling size (how many units are selected) and distribution (how samples are spread across subgroups) are crucial for ensuring accurate, reliable results. Below is a detailed breakdown.
1. Sampling Size

Definition: The number of observations or individuals included in the sample.

✔ Factors Influencing Sample Size:
1. Population Size – Larger populations may not need proportionally larger samples (diminishing returns).
2. Margin of Error (Confidence Interval) – A smaller margin requires a larger sample (e.g., ±3% vs. ±5%).
3. Confidence Level (e.g., 95%, 99%) – Higher confidence requires a larger sample.
4. Variability (Standard Deviation) – More diverse populations require larger samples.
5. Budget & Time Constraints – Practical limitations may restrict sample size.
Common Sample Size Formulas:

- For Continuous Data (Mean): n = (Z² × σ²) / E²
  - Z = Z-score (e.g., 1.96 for 95% confidence)
  - σ = Population standard deviation (estimate)
  - E = Margin of error

- For Proportions (Categorical Data): n = (Z² × p × (1 − p)) / E²
  - p = Estimated proportion (use 0.5 for maximum variability)
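Worked example: at 95% confidence (Z = 1.96) with p = 0.5 and E = 0.05, n = (1.96² × 0.5 × 0.5) / 0.05² ≈ 384.16, rounded up to 385. A minimal Python sketch of both formulas (the function names are illustrative):

```python
import math

def sample_size_mean(z, sigma, e):
    """n = (Z^2 * sigma^2) / E^2 for estimating a population mean."""
    return math.ceil((z**2 * sigma**2) / e**2)

def sample_size_proportion(z, p, e):
    """n = (Z^2 * p * (1 - p)) / E^2 for estimating a proportion."""
    return math.ceil((z**2 * p * (1 - p)) / e**2)

# 95% confidence (Z = 1.96), ±5% margin, maximum variability p = 0.5:
print(sample_size_proportion(1.96, 0.5, 0.05))  # -> 385

# Mean estimate with an assumed sigma of 15 and a margin of 2 units:
print(sample_size_mean(1.96, 15, 2))            # -> 217
```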
✔ Merits of a Proper Sample Size:
✅ Improves accuracy (reduces sampling error).
✅ Enhances reliability (results are more trustworthy).
✅ Balances cost & precision (avoids oversampling).

✖ Risks of an Incorrect Sample Size:
❌ Too small → high error, low power (Type II errors).
❌ Too large → wasted resources (diminishing returns).
2. Sampling Distribution

Definition: The way sample units are selected across subgroups (e.g., age, gender, region).

Types of Sampling Distributions:
1. Normal (Bell Curve) Distribution
   - Most common in random sampling.
   - Used in parametric tests (e.g., t-tests, ANOVA).
2. Skewed Distribution
   - Non-symmetric (e.g., income data).
   - Requires non-parametric tests (e.g., Mann-Whitney U).
3. Stratified Distribution
   - Ensures proportional representation of subgroups.
4. Cluster Distribution
   - Samples come from naturally occurring groups (e.g., schools, villages).

✔ Importance of Proper Distribution:
✅ Reduces bias (ensures all groups are fairly represented).
✅ Improves generalizability (findings apply to the whole population).
✅ Enhances statistical power (better hypothesis testing).

✖ Issues with Poor Distribution:
❌ Underrepresentation (some groups are missed).
❌ Overrepresentation (some groups dominate results).
❌ Increased sampling error (biased estimates).
Key Considerations for Sampling

| Aspect | Best Practices |
|---|---|
| Sample Size | Use power analysis or standard formulas (e.g., Cochran's). |
| Distribution | Ensure proportional stratification if subgroups exist. |
| Randomness | Prefer probability sampling for generalizability. |
| Bias Control | Avoid convenience sampling in quantitative studies. |
When to Adjust Sample Size/Distribution?

Increase the sample size if:
- There is high variability in the data.
- A small effect size is expected.
- A high confidence level is needed.

Adjust the distribution if:
- Subgroups have different characteristics.
- Certain segments are hard to reach (use oversampling).
7. Sampling and Non-Sampling Errors

1. Sampling Errors

Definition: Errors occurring due to the natural variation between the sample and the population.
Causes:
- Random selection differences
- Small sample size
- Inadequate sampling technique

Types:
- Random Sampling Error: Natural fluctuation in results due to chance
- Selection Bias: Systematic exclusion of certain population segments
- Sampling Frame Error: Inaccuracies in the population list used for sampling

Characteristics:
✅ Inherent in the sampling process
✅ Decreases with larger sample sizes
✅ Can be estimated statistically

Example: If a political poll samples 1,000 voters but misses rural areas, the results may not reflect the true voting population.
2. Non-Sampling Errors

Definition: Errors unrelated to sampling that occur during data collection or processing.

Causes:
- Poor questionnaire design
- Interviewer bias
- Data entry mistakes
- Respondent errors (lying, misunderstanding)

Types:
- Measurement Error: Flaws in data collection instruments
- Response Bias: Systematic pattern of incorrect answers
- Processing Error: Mistakes in data coding/analysis
- Non-response Error: Missing data from unwilling participants

Characteristics:
❌ Can occur in both samples and censuses
❌ Not reduced by increasing sample size
❌ Often more serious than sampling errors

Example: If survey questions are confusing, even perfect sampling will yield inaccurate results.
Comparison Table

| Feature | Sampling Errors | Non-Sampling Errors |
|---|---|---|
| Origin | Sampling process | Data collection/processing |
| Relation to Sample Size | Decreases with larger samples | Unaffected by sample size |
| Predictability | Can be calculated (margin of error) | Difficult to quantify |
| Control Methods | Better sampling techniques | Improved survey design, training |
| Example | Rural voters omitted from a poll | Leading questions bias responses |
Minimizing Both Types of Errors

For Sampling Errors:
- Use appropriate sampling techniques (stratified over convenience)
- Increase sample size when possible
- Ensure complete sampling frames

For Non-Sampling Errors:
- Pretest questionnaires
- Train interviewers thoroughly
- Use clear, unbiased questions
- Implement data validation checks
8. Designing a Questionnaire: Editing and Pretesting

A well-designed questionnaire is crucial for collecting reliable data. Two key steps in questionnaire development are editing (refining questions) and pretesting (trial testing). Below is a structured guide.
1. Editing the Questionnaire

Purpose: Improve clarity, relevance, and structure before data collection.

Key Editing Steps:

| Aspect | Checklist |
|---|---|
| Clarity | Avoid jargon/complex words |
| Bias | Remove leading/suggestive questions |
| Flow | Start with easy, non-sensitive questions |
| Length | Keep it concise (5–10 min for surveys) |
| Format | Use consistent scales (e.g., all 1–5 ratings) |
| Sensitivity | Review ethical concerns (e.g., personal/triggering topics) |

Common Pitfalls to Fix:
❌ Double-barreled questions ("Do you like the product's price and quality?")
❌ Ambiguous terms ("Do you often exercise?" → define "often")
❌ Overlapping response options ("Age: 20–30, 30–40")
2. Pretesting the Questionnaire

Purpose: Identify problems before full-scale deployment.

Pretesting Methods:

| Method | How It Works | Best For |
|---|---|---|
| Cognitive Interviews | Participants verbalize their thought process while answering | Testing question comprehension |
| Expert Review | Researchers/statisticians critique the design | Identifying technical flaws |
| Pilot Survey | Small-scale trial (5–10% of the target sample) | Checking timing and flow |
| Behavior Coding | Observing interviewer-respondent interactions | Spotting confusing questions |

What to Evaluate in Pretesting?
1. Participant Feedback:
   - Were any questions confusing?
   - How long did it take to complete?
2. Data Quality Checks:
   - Are responses varied (or all "neutral")?
   - Were any questions skipped?
3. Logistical Issues:
   - Technical glitches (for online surveys)
   - Interviewer difficulties (for in-person surveys)
Example Pretest Revision

Before Pretest: "How satisfied are you with our services?"
- Very Satisfied
- Satisfied
- Neutral
- Dissatisfied

After Pretest (Problems Found):
- "Services" was too vague → specify (e.g., customer support, delivery speed).
- Added "Very Dissatisfied" for balance.

Revised Version: "How satisfied are you with our [specific service]?"
- Very Satisfied
- Satisfied
- Neutral
- Dissatisfied
- Very Dissatisfied
9. Types of Interview Techniques

Interviews are a key qualitative (and sometimes quantitative) data collection method. The choice of technique depends on research goals, structure needs, and participant dynamics. Below are the main types.
1. Structured Interviews

Description: Follows a strict script with predetermined questions in a fixed order.

Characteristics:
✔ All respondents answer the same questions
✔ Closed-ended (e.g., yes/no, multiple-choice)
✔ Suitable for quantitative analysis

When to Use:
- Large-scale surveys (e.g., census, market research)
- Standardized comparisons (e.g., job candidate assessments)

Pros:
✅ High reliability (consistent data)
✅ Easy to administer and analyze
✅ Reduces interviewer bias

Cons:
❌ Inflexible (no follow-up questions)
❌ May miss nuanced responses
2. Unstructured Interviews

Description: Open-ended, conversational format with no fixed questions.

Characteristics:
✔ Free-flowing discussion
✔ Participant-led (the researcher follows the respondent's lead)
✔ Qualitative focus

When to Use:
- Exploratory research (e.g., anthropology, case studies)
- Sensitive topics (e.g., mental health, trauma narratives)

Pros:
✅ Rich, detailed data
✅ Adaptable to the participant's perspective
✅ Uncovers unexpected insights

Cons:
❌ Hard to analyze (no standardization)
❌ Time-consuming
❌ Risk of interviewer bias
3. Semi-Structured Interviews

Description: Hybrid approach with core questions plus flexibility for follow-ups.

Characteristics:
✔ Predefined key questions
✔ Allows probing (e.g., "Can you elaborate?")
✔ Balances structure and depth

When to Use:
- Most qualitative studies (e.g., social sciences, UX research)
- When comparing themes across participants

Pros:
✅ Structured yet flexible
✅ Easier to analyze than unstructured interviews
✅ Captures both breadth and depth

Cons:
❌ Moderator skill affects outcomes
❌ May drift off-topic without control
4. Focus Group Interviews

Description: Group discussion (6–12 people) moderated by a researcher.

Characteristics:
✔ Interactive (participants debate, agree/disagree)
✔ Explores group norms and social dynamics

When to Use:
- Market research (e.g., product feedback)
- Studying cultural/social behaviors

Pros:
✅ Diverse perspectives in one session
✅ Observes real-time reactions
✅ Cost-effective (multiple respondents at once)

Cons:
❌ Dominant participants may skew data
❌ Confidentiality challenges
5. Telephone/Online Interviews

Description: Conducted remotely via phone, video call (e.g., Zoom), or chat.

Characteristics:
✔ Structured or semi-structured
✔ Logistically convenient

When to Use:
- Geographically dispersed participants
- Time-sensitive studies (e.g., election polls)

Pros:
✅ Wider reach
✅ Lower cost (no travel)

Cons:
❌ Non-verbal cues may be missed
❌ Tech issues (connectivity, distractions)
Comparison Table

| Type | Structure Level | Data Type | Best For |
|---|---|---|---|
| Structured | High | Quantitative | Surveys, standardized assessments |
| Unstructured | Low | Qualitative | Exploratory research |
| Semi-Structured | Moderate | Qualitative | Thematic analysis |
| Focus Groups | Variable | Qualitative | Group dynamics, opinions |
| Remote Interviews | Variable | Both | Geographically dispersed samples |
How to Choose?
1. Need standardization? → Structured
2. Exploring new ideas? → Unstructured/Semi-structured
3. Studying group behavior? → Focus groups
4. Budget/time constraints? → Remote interviews
10. Methods of Collecting Secondary Data

Secondary data refers to information collected by others for purposes other than the current research. It is cost-effective and time-saving but requires careful evaluation for relevance and reliability. Below are the main methods of collecting secondary data.
1. Published Sources

A. Books & Journals
- Academic Books: Scholarly publications with in-depth analysis.
- Research Journals: Peer-reviewed articles (e.g., JSTOR, PubMed).
- Industry Reports: Market analyses (e.g., IBISWorld, Statista).
✔ Pros: Credible, well-researched.
✖ Cons: May be outdated; access restrictions.

B. Government Publications
- Census Data: Population and economic statistics (e.g., U.S. Census Bureau).
- Economic Surveys: Labor and trade reports (e.g., World Bank, IMF).
- Legal Documents: Court rulings, policy papers.
✔ Pros: Highly reliable, large-scale data.
✖ Cons: Bureaucratic delays; may lack granularity.

C. Media Sources
- Newspapers/Magazines: Historical and current event analysis.
- TV/Radio Archives: Broadcast reports.
✔ Pros: Timely, real-world context.
✖ Cons: Potential bias; less rigorous.
2. Unpublished Sources

A. Internal Organizational Records
- Company reports, sales data, customer feedback.
✔ Pros: Specific to research needs.
✖ Cons: Proprietary; may require permissions.

B. Theses & Dissertations
- University research repositories (e.g., ProQuest).
✔ Pros: Detailed, niche topics.
✖ Cons: Variable quality; unpublished work.

C. NGO & Institutional Reports
- WHO, UNICEF, Amnesty International publications.
✔ Pros: Expert-compiled, issue-specific.
✖ Cons: Possible advocacy bias.
3. Digital & Online Sources

A. Public Databases
- Google Dataset Search, Kaggle, Data.gov
✔ Pros: Free, vast datasets.
✖ Cons: Varying accuracy; requires cleaning.

B. Social Media & Web Scraping
- Twitter trends, Reddit discussions, blog analyses.
✔ Pros: Real-time public sentiment.
✖ Cons: Ethical/privacy concerns; noisy data.

C. Commercial Data Providers
- Nielsen, Bloomberg, Euromonitor.
✔ Pros: High-quality, industry-specific.
✖ Cons: Expensive; licensing restrictions.
4. Historical & Archival Data
- Libraries/Archives: Old manuscripts, letters, photos.
- Museum Collections: Cultural artifact records.
✔ Pros: Unique longitudinal insights.
✖ Cons: Fragmented; hard to digitize.
Evaluation Criteria for Secondary Data

Before use, assess:
1. Relevance: Does it address your research question?
2. Accuracy: Is the source credible (e.g., peer-reviewed)?
3. Timeliness: Is the data up to date?
4. Methodology: How was it originally collected?
5. Bias: Is there any political/corporate influence?
Advantages of Secondary Data
✅ Cost-effective (no primary collection needed).
✅ Time-saving (immediate access).
✅ Large-scale data (e.g., national censuses).
✅ Historical comparisons (long-term trends).

Disadvantages of Secondary Data
❌ May not fit research needs (lack of customization).
❌ Quality varies (unverified sources).
❌ Outdated information (e.g., old surveys).
❌ Access restrictions (paywalls, permissions).
When to Use Secondary Data?
- Literature reviews (academic research).
- Market trend analysis (business strategy).
- Policy evaluation (government/NGOs).
- Preliminary research (before primary data collection).
11. Data Processing: Editing and Coding

Data processing transforms raw collected data into a structured, analyzable format. Two critical steps are editing (cleaning data) and coding (categorizing responses). Below is a detailed breakdown.
1. Data Editing

Purpose: Identify and correct errors, inconsistencies, and missing values in raw data.

Types of Errors to Detect & Fix:

| Error Type | Example | Solution |
|---|---|---|
| Incomplete Data | Blank survey responses | Contact the respondent, or exclude if critical |
| Inconsistent Data | Age = "25" but birth year = "1990" | Cross-check with other responses |
| Outliers | Income = "$1,000,000" (unrealistic) | Verify, or remove if erroneous |
| Format Errors | Date written as "12-10-22" (MM-DD or DD-MM?) | Standardize the format |
Editing Techniques:
- Field Editing: Quick on-site checks during data collection.
- Central Editing: Thorough review post-collection using software (Excel, SPSS).
- Logical Checks: Ensure responses align (e.g., "Pregnant: Yes" → Gender = Female).

✔ Pros of Editing:
✅ Improves data accuracy.
✅ Reduces bias from errors.

✖ Challenges:
❌ Time-consuming for large datasets.
❌ Involves subjective decisions (e.g., handling outliers).
2. Data Coding

Purpose: Convert qualitative responses (text) or open-ended answers into quantitative categories for analysis.

Coding Methods:

A. Pre-Coding (Structured Data)
- Used for closed-ended questions (e.g., multiple-choice).
- Assign numbers in advance:
  - "Yes" = 1, "No" = 2
  - Likert scale: "Strongly Agree" = 5 → "Strongly Disagree" = 1

B. Post-Coding (Unstructured Data)
- Applied to open-ended responses (e.g., interviews).
- Steps:
  1. Read responses to identify themes.
  2. Create a codebook (e.g., "Cost" = 1, "Quality" = 2).
  3. Assign codes manually or with software (NVivo, Atlas.ti).

Coding Best Practices:
- Mutually Exclusive: No overlap between categories.
- Exhaustive: All responses fit a category (plus an "Other" option).
- Reliability: Multiple coders should agree (test with inter-coder reliability).
Example: Coding Open-Ended Feedback

Response: "The product is expensive but works well."
- Code 1: Price (expensive)
- Code 2: Quality (works well)

✔ Pros of Coding:
✅ Enables statistical analysis.
✅ Simplifies complex qualitative data.

✖ Challenges:
❌ Coder bias in theme identification.
❌ Time-intensive for large text datasets.
Software Tools for Editing & Coding

| Task | Tools |
|---|---|
| Editing | Excel, SPSS, R (dplyr), Python (pandas) |
| Coding | NVivo, Atlas.ti, MAXQDA, Dedoose |
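As a minimal sketch of these two steps in Python (pandas assumed; the dataset and column names are hypothetical), the example below flags logically inconsistent rows and implausible values, then pre-codes a Likert item:

```python
import pandas as pd

# Hypothetical raw survey data.
raw = pd.DataFrame({
    "age": [25, 34, 250, 41],                 # 250 is an implausible entry
    "pregnant": ["Yes", "No", "No", "Yes"],
    "gender": ["Female", "Male", "Male", "Male"],
    "satisfaction": ["Strongly Agree", "Neutral", "Disagree", "Agree"],
})

# Editing: flag logically inconsistent rows ("Pregnant: Yes" but not Female)
# and implausible ages, for manual review rather than silent deletion.
inconsistent = raw[(raw["pregnant"] == "Yes") & (raw["gender"] != "Female")]
outliers = raw[(raw["age"] < 0) | (raw["age"] > 120)]
print(inconsistent)
print(outliers)

# Pre-coding: map the Likert scale to numbers (Strongly Agree = 5, ...).
likert = {"Strongly Disagree": 1, "Disagree": 2, "Neutral": 3,
          "Agree": 4, "Strongly Agree": 5}
raw["satisfaction_code"] = raw["satisfaction"].map(likert)
print(raw[["satisfaction", "satisfaction_code"]])
```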
Key Takeaways
1. Editing ensures clean, error-free data by fixing inconsistencies.
2. Coding transforms text/narrative data into quantifiable categories.
3. Automate where possible (e.g., Excel filters for editing; AI-assisted coding for text).
12. Data Processing: Classification and Tabulation

After editing and coding, the next steps in data processing are classification (grouping data into categories) and tabulation (organizing data into tables). These steps make data analysis more efficient and meaningful.
1. Classification of Data

Definition: Sorting data into homogeneous groups based on shared characteristics.

Types of Classification

A. Qualitative Classification
Groups data based on non-numerical attributes:
- Gender: Male, Female, Other
- Occupation: Engineer, Teacher, Doctor

B. Quantitative Classification
Groups data based on numerical values:
- Age Groups: 0–18, 19–35, 36–50, 51+
- Income Brackets: < $30K, $30K–$60K, > $60K

C. Temporal Classification
Arranges data by time periods:
- Sales Data: 2020, 2021, 2022
- Monthly Rainfall: Jan, Feb, Mar

D. Spatial Classification
Groups data by geographical location:
- Country-wise GDP: USA, China, India
- Regional Survey Responses: North, South, East, West
Rules for Effective Classification
✔ Mutually Exclusive – No overlap between categories.
✔ Exhaustive – All possible categories are covered.
✔ Purpose-Driven – Aligns with the research objectives.

✔ Pros:
✅ Simplifies complex datasets.
✅ Facilitates comparison across groups.

✖ Challenges:
❌ Loss of detail if categories are too broad.
❌ Subjectivity in defining groups.
2. Tabulation of Data

Definition: Presenting classified data in structured tables for analysis.

Types of Tabulation

A. Simple (One-Way) Tabulation
- Summarizes data based on one variable.
- Example:

| Age Group | No. of Respondents |
|---|---|
| 18–25 | 50 |
| 26–35 | 75 |

B. Cross (Two-Way) Tabulation
- Shows the relationship between two variables.
- Example:

| Gender | Likes Product (Yes) | Likes Product (No) |
|---|---|---|
| Male | 40 | 30 |
| Female | 55 | 25 |

C. Complex (Multi-Way) Tabulation
- Analyzes three or more variables (e.g., age × gender × income).
- Often used in advanced statistical software (SPSS, R).
Components of a Well-Structured Table
1. Title – Clearly describes the table's content.
2. Stub (Row Headings) – Categories being compared.
3. Caption (Column Headings) – Variables measured.
4. Body – Actual data values.
5. Footnotes – Explanations (if needed).

✔ Pros of Tabulation:
✅ Makes trends and patterns visible.
✅ Easy to interpret (compared to raw data).

✖ Challenges:
❌ Over-simplification if too condensed.
❌ Misleading if percentages/ratios are miscalculated.
Comparison: Classification vs. Tabulation

| Feature | Classification | Tabulation |
|---|---|---|
| Purpose | Groups data into categories | Organizes data into tables |
| Output | Categories (e.g., age groups) | Structured tables |
| Complexity | Can be qualitative or quantitative | Usually numerical |
| Usage | Pre-tabulation step | Final presentation step |
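A minimal sketch of quantitative classification followed by one-way tabulation in Python (pandas assumed; the ages are hypothetical):

```python
import pandas as pd

# Hypothetical respondent ages.
ages = pd.Series([19, 23, 31, 27, 45, 52, 22, 36, 29, 61])

# Quantitative classification: group ages into brackets with pd.cut.
groups = pd.cut(ages, bins=[0, 18, 35, 50, 120],
                labels=["0–18", "19–35", "36–50", "51+"])

# Simple (one-way) tabulation: counts per age group.
print(groups.value_counts().sort_index())
```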
Key Takeaways
1. Classification groups data logically before analysis.
2. Tabulation presents data clearly for interpretation.
3. Cross-tabulation helps identify relationships between variables.
13. Cross-Tabulation and Its Significance in Data Analysis

Cross-tabulation (or crosstab) is a statistical method used to analyze the relationship between two or more categorical variables by organizing data into a contingency table. It helps researchers identify patterns, trends, and associations in datasets.
1. What is Cross-Tabulation?
- A table that displays the joint frequency distribution of variables.
- Shows how one variable's categories relate to another's.
- Commonly used in survey research, marketing, and the social sciences.

Example of a Cross-Tabulation Table

| Gender | Likes Tea (Yes) | Likes Tea (No) | Total |
|---|---|---|---|
| Male | 30 | 20 | 50 |
| Female | 45 | 15 | 60 |
| Total | 75 | 35 | 110 |

- Variables: Gender × Tea Preference
- Insight: More females (45) than males (30) prefer tea.
2. Significance of Cross-Tabulation

A. Identifies Relationships Between Variables
- Helps detect associations (e.g., does gender affect product preference?).
- Example: "Do men prefer coffee more than women?"

B. Simplifies Complex Data
- Breaks down large datasets into meaningful subgroups.
- Example: "How does age group (18–25 vs. 26–35) affect smartphone brand choice?"

C. Supports Hypothesis Testing
- Used in chi-square tests to check whether variables are independent.
- Example: "Is there a significant link between education level and voting preference?"

D. Enhances Decision-Making
- Businesses use crosstabs for market segmentation, customer behavior analysis, and A/B testing.
- Example: "Which age group responds best to discount offers?"

E. Visualizes Trends Clearly
- Can be represented as stacked bar charts, heatmaps, or pivot tables.
3. How to Perform Cross-Tabulation?

Step 1: Select Variables
- Choose two categorical variables (e.g., Gender × Product Rating).

Step 2: Create a Contingency Table

| Variable A \ Variable B | Category 1 | Category 2 | Total |
|---|---|---|---|
| Group 1 | Count | Count | Total |
| Group 2 | Count | Count | Total |
| Total | Total | Total | Grand Total |

Step 3: Calculate Percentages (Optional)
- Row %: Percentages across each row.
- Column %: Percentages down each column.

Step 4: Analyze & Interpret
- Look for patterns, outliers, or significant differences.
- Use chi-square tests for statistical validation.

Step 5: Visualize (Optional)
- Stacked Bar Chart: Shows proportions.
- Heatmap: Highlights high/low frequencies.
4. Applications of Cross-Tabulation

| Field | Use Case |
|---|---|
| Market Research | Customer preference analysis (e.g., Age × Brand Loyalty) |
| Healthcare | Disease prevalence by gender/region |
| Politics | Voting behavior by age/income |
| Education | Student performance by gender/socioeconomic status |
5. Advantages & Limitations

✔ Advantages
✅ Easy to understand (even for non-statisticians).
✅ Reveals hidden patterns in categorical data.
✅ Works with small and large datasets.

✖ Limitations
❌ Only works for categorical variables (not continuous data).
❌ Cannot prove causation (only association).
❌ Requires a sufficient sample size for reliable results.
6. Tools for Cross-Tabulation
- Excel (PivotTables)
- SPSS (Custom Tables)
- R (table() or xtabs() functions)
- Python (pandas crosstab())
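As a minimal sketch reproducing the tea-preference example above in Python (pandas and SciPy assumed):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical raw responses matching the table above:
# 50 males (30 Yes, 20 No) and 60 females (45 Yes, 15 No).
df = pd.DataFrame({
    "gender": ["Male"] * 50 + ["Female"] * 60,
    "likes_tea": (["Yes"] * 30 + ["No"] * 20
                  + ["Yes"] * 45 + ["No"] * 15),
})

# Contingency table with row/column totals (margins).
print(pd.crosstab(df["gender"], df["likes_tea"], margins=True))

# Chi-square test of independence on the table without margins.
chi2, p, dof, expected = chi2_contingency(
    pd.crosstab(df["gender"], df["likes_tea"]))
print(f"chi2 = {chi2:.3f}, p-value = {p:.3f}")
```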
Key Takeaways
1. Cross-tabulation helps compare categorical variables.
2. It is widely used in business, healthcare, and the social sciences.
3. Always pair it with statistical tests (chi-square) for validation.

14. Practical Problems
15. Graphical Representation of Data: Line Graph, Bar Diagram and Pie Chart

Graphs transform complex data into visual insights. The three most common types are line graphs, bar diagrams, and pie charts, each serving a unique purpose. Here is a detailed comparison with examples.
1. Line Graph

Purpose: Show trends over time (continuous data).

When to Use?
✔ Tracking changes (e.g., stock prices, temperature)
✔ Comparing multiple trends (e.g., sales of Product A vs. B)

Example: Monthly Sales (2023)

| Month | Sales ($) |
|---|---|
| Jan | 5,000 |
| Feb | 7,000 |
| ... | ... |

Graph: 📈 A rising line from Jan (5K) to Dec (15K) shows sales growth.

Pros & Cons
✅ Pros:
- Highlights trends clearly.
- Works for large time ranges.

❌ Cons:
- Poor for categorical comparisons.
2. Bar Diagram

Purpose: Compare discrete categories.

Types:
- Vertical Bar Chart: Default for most comparisons.
- Horizontal Bar Chart: For long category names.
- Stacked Bar Chart: Shows part-to-whole relationships.

When to Use?
✔ Comparing groups (e.g., sales by region)
✔ Ranking items (e.g., top 5 products)

Example: Product Sales (Q1 2023)

| Product | Sales ($) |
|---|---|
| A | 20,000 |
| B | 15,000 |
| C | 10,000 |

Graph: 📊 Three vertical bars: A (tallest), B (medium), C (shortest).

Pros & Cons
✅ Pros:
- Easy to read, even for non-experts.
- Flexible (works for most comparisons).

❌ Cons:
- Becomes cluttered with too many categories.
3. Pie Chart

Purpose: Show proportions of a whole.

When to Use?
✔ Displaying market share.
✔ Budget allocation breakdowns.

Example: Market Share (%)

| Company | Share |
|---|---|
| A | 45% |
| B | 30% |
| C | 25% |

Graph: 🥧 A circle divided into 3 slices: A (largest), B, C.

Pros & Cons
✅ Pros:
- Intuitive for part-to-whole relationships.

❌ Cons:
- Hard to compare similar-sized slices.
- Unsuitable for trends/time data.
Comparison Table

| Feature | Line Graph | Bar Diagram | Pie Chart |
|---|---|---|---|
| Best For | Trends over time | Category comparisons | Proportions |
| Data Type | Continuous | Discrete | Percentages |
| Clarity | High (for trends) | High | Low (if many slices) |
| Limitations | No categories | No trends | No trends/categories |
Common Mistakes to Avoid
1. Line Graphs: Using them for non-time data (e.g., comparing cities).
2. Bar Charts: Overloading with more than 10 categories.
3. Pie Charts: Including more than 5 slices, or tiny percentages (<2%).
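A minimal sketch of all three chart types in Python (matplotlib assumed; Jan and Feb come from the example above, while the remaining values are made up for illustration):

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Line graph: trend over time (monthly sales).
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [5000, 7000, 6500, 9000]   # Mar and Apr are hypothetical
axes[0].plot(months, sales, marker="o")
axes[0].set_title("Monthly Sales (line graph)")
axes[0].set_ylabel("Sales ($)")

# Bar diagram: comparison of discrete categories.
axes[1].bar(["A", "B", "C"], [20000, 15000, 10000])
axes[1].set_title("Product Sales (bar diagram)")
axes[1].set_ylabel("Sales ($)")

# Pie chart: proportions of a whole (market share).
axes[2].pie([45, 30, 25], labels=["A", "B", "C"], autopct="%1.0f%%")
axes[2].set_title("Market Share (pie chart)")

plt.tight_layout()
plt.show()
```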
16. Graphical Representation of Data: Histograms and Ogives
A. Histogram

Purpose: Display the frequency distribution of continuous data using bars.

Key Features:
✔ Bars are adjacent (no gaps, unlike bar charts).
✔ The X-axis represents class intervals (e.g., 0–10, 10–20).
✔ The Y-axis shows frequency (count or percentage).

When to Use?
✔ Analyzing exam scores, income ranges, or age groups.
✔ Identifying the data distribution (normal, skewed, bimodal).

Example:

| Marks Range | No. of Students |
|---|---|
| 0–20 | 5 |
| 20–40 | 12 |
| 40–60 | 25 |

Graph: 📊 Bars rise from the left (0–20) to a peak at the highest class (40–60).

Pros & Cons:
✅ Pros:
- Shows data spread and skewness clearly.
- Works for large datasets.

❌ Cons:
- Requires equal bin sizes for accuracy.
B. Ogive (Cumulative Frequency Curve)

Purpose: Plot cumulative frequencies to analyze the data distribution.

Types:
1. Less Than Ogive: Shows cumulative frequencies up to each class boundary.
2. More Than Ogive: Shows cumulative frequencies above each class boundary.

When to Use?
✔ Finding medians, quartiles, or percentiles.
✔ Comparing distributions (e.g., test scores across years).

Example (Less Than Ogive):

| Marks ≤ | Cumulative Students |
|---|---|
| 20 | 5 |
| 40 | 17 (5 + 12) |
| 60 | 42 (17 + 25) |

Graph: 📈 A rising curve starting at (20, 5) and ending at (60, 42).

Pros & Cons:
✅ Pros:
- Helps estimate percentiles easily.
- Smooths data trends.

❌ Cons:
- Less intuitive for beginners.
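A minimal sketch of both plots in Python (NumPy and matplotlib assumed), using the marks data from the examples above:

```python
import numpy as np
import matplotlib.pyplot as plt

# Marks data from the example: class boundaries and frequencies.
bins = [0, 20, 40, 60]        # class intervals 0-20, 20-40, 40-60
frequencies = [5, 12, 25]     # students per class

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: adjacent bars over the class intervals. The grouped data
# is expanded into one value per student at each class midpoint.
marks = np.repeat([10, 30, 50], frequencies)
ax1.hist(marks, bins=bins, edgecolor="black")
ax1.set_title("Histogram of Marks")
ax1.set_xlabel("Marks")
ax1.set_ylabel("No. of students")

# Less-than ogive: cumulative frequency at each upper class boundary.
cumulative = np.cumsum(frequencies)   # [5, 17, 42]
ax2.plot(bins[1:], cumulative, marker="o")
ax2.set_title("Less Than Ogive")
ax2.set_xlabel("Marks (less than)")
ax2.set_ylabel("Cumulative no. of students")

plt.tight_layout()
plt.show()
```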
17. Practical Problems of Graphical Representation

1. Misleading Scales
Problem: A truncated Y-axis exaggerates small differences (e.g., starting at 50 instead of 0).
Solution: ✔ Always start numerical axes at zero unless there is a clear justification.

2. Overcrowded Graphs
Problem: Too many bars/lines make trends unreadable.
Solution: ✔ Use grouped bar charts or small multiples.

3. Incorrect Graph Choice
Problem: Using pie charts for time-series data.
Solution: ✔ Match the graph type to the data (see the comparison table in Section 15).

4. Ignoring Data Distribution
Problem: Histograms with uneven bin sizes distort patterns.
Solution: ✔ Use equal bin widths; label axes clearly.

5. Lack of Context
Problem: Graphs without titles/units confuse viewers.
Solution: ✔ Add titles, axis labels, and legends.