Welcome to this Statistics-1 Week 1 Summary post. It's not an explanatory article, so please keep that in mind. Once you are done with your lessons, though, this post should come in very handy. Although the best notes are the ones you make yourself, if you are short of time, you can use this. Or you can use this as a template to prepare your own summary, which would be the best thing. So go on, and if you find any errors, do let me know in the comments below.
Week 1
Statistics 1
Summary
- All of us are either creating data or contributing towards collection of data or are data ourselves.
- 50 years back, statistics were just about numbers and categorical data.
- Data is defined as figures collected, analyzed, and summarized for presentation and interpretation. It could be numerical, or it could be textual.
- Raw data by itself does not tell us anything. We must extract information from that data.
- The primary reason to collect data is to learn about the characteristics or attributes of groups, whether they are people, places, events, or anything else.
- Data can either be collected onsite or it can be extracted from certain websites which have already published data. e.g., data.gov.in
- Knowing how to collect data that is not already available, in a structured and scientific way, is also an important skill.
- For information in a database to be useful, it must be structured and organised, and we must know the context of the numbers and text that it holds. Such a collection of structured and organised data is called a data set.
- A variable is intuitively something that varies. And formally, it is defined as a characteristic or attribute that varies across all units.
- A case or observation is defined as a unit from which data is collected.
- Example of case would be students.
- Examples of variables would be name, marks, date of birth etc. of the student.
- Non-availability of data (a missing value) is different from a data point taking the value zero.
- Columns represent variables. Rows represent cases/observations. Mnemonic to remember: CVRO.
- The same type of value should be recorded for each variable. Also, the same kind of units should be used across all observations.
- Old Definition of Statistics- Summarizing Data
- Recent definition- Drawing inference from data
- Current Definition– Statistics is the art of learning from data. It is concerned with the collection of data, their subsequent description, and their analysis, which often leads to the drawing of conclusions. (Sheldon Ross).
- Descriptive Statistics- The part of statistics concerned with description and summarization of data using certain parameters is called descriptive statistics. Its purpose is to examine and explore information for its own intrinsic interest only. A descriptive study may be performed either on a sample or on a population. E.g. Analyzing performance of students in an exam that was conducted.
- Inferential Statistics – The part of statistics concerned with the drawing of conclusions about a population from sample data is called inferential statistics. The information is obtained from a sample of the population, and the purpose of the study is to use that information to make predictions about the population. E.g. Predicting students’ performance in the next exam based on exams already conducted.
- The ability to draw conclusions from data requires accounting for the possibility of chance, i.e., some knowledge of probability.
- The total collection of all the elements that we are interested in is called a population (N).
- A sample (n) is a subset of the population (N) that is being studied.
- A sample must be representative of all the distinct properties of the population since the purpose of sample is to get information about the population.
- A sample should be randomly selected so that the conclusions are unbiased.
- Summary statistics can be for a population or a sample.
- Data is primarily of two types: categorical (qualitative) and numerical (quantitative).
- Numerical data is further of two types: discrete and continuous.
- A discrete variable involves a count of something, whereas a continuous variable involves a measurement of something.
- Marks obtained in an exam are discrete, whereas the weight of students in a class is continuous.
- Categorical data identifies group membership.
- Numerical data has measurement units and describes numerical properties of cases.
- All data in a numerical variable in a data table must have the same unit.
- A measurement unit is a scale that defines the meaning of numerical data.
- Not all numbers are numerical data. For example, mobile numbers and jersey numbers are qualitative.
- A time plot is a graph of a time series, with time on the X axis and the measured quantity on the Y axis, in chronological order.
- Time series data is data recorded over time. Here the data varies with respect to time for a particular entity in space.
- Cross sectional data is data observed at the same time. Here values in the data set vary with respect to space/location, with time being constant.
- There are four scales of measurement: nominal, ordinal, interval, and ratio. They were developed by psychologist Stanley Smith Stevens in 1946.
- When the data for a variable consists of labels or names used to identify a characteristic of an observation, the scale of measurement is considered a nominal scale. The labels might sometimes be numerically coded. There is no ordering in the variable. Nominal data is the foundation of quantitative research and is among the most used measurement scales.
- When the order or rank of nominal data is meaningful, it becomes an ordinal scale of measurement. Example: cold, warm, and hot for temperature.
- When ordinal data has numeric values and the interval between values is expressed in terms of a fixed unit of measure, it becomes an interval scale of measurement. The value of 0 is arbitrary in an interval scale, meaning there is no absolute zero, and hence ratios have no meaning. Values can only be added or subtracted, so the difference between two values can be found. Example: 20 degrees Celsius or 40 degrees Fahrenheit for temperature.
- 40 degrees Celsius is not twice as hot as 20 degrees Celsius, but 40 Kelvin is twice as hot as 20 Kelvin.
- When interval data has a meaningful ratio, it becomes a ratio scale. Example: temperature in Kelvin, height, weight, age, marks etc. Values can be added, subtracted, multiplied, or divided. True zero exists. True or absolute zero means the measured property is entirely absent when the variable takes the value 0.
- Data which has a definite order and can be arranged accordingly is called structured data, such as height, weight, age, income of employees etc.
- Variables with an interval scale of measurement can be converted into variables with a ratio scale of measurement by subtraction. For example, ratings given by users on a scale of one to five have an interval scale of measurement, but the difference between two ratings ranges from zero to four and therefore has a ratio scale of measurement.
- For the best experience, if you are reading this on a phone, put it in landscape mode so the comparison table below renders properly.
Nominal | Ordinal | Interval | Ratio |
Categorical | Categorical | Numerical | Numerical |
1st Level of Measurement | 2nd Level of Measurement | 3rd level of Measurement | 4th Level of Measurement |
Order doesn’t matter | Order Matters | Order Matters | Order Matters |
No Rank | Rank exists | - | - |
Only Mode is defined | Mode and Median are defined | Mean, Mode, and Median are defined | Mean, Mode, and Median are defined |
- | - | No absolute zero, but has a fixed unit of measure and can take negative values | Absolute/True Zero exists and has a fixed unit of measure, but can’t take negative values |
No Operations allowed. Only Comparison is possible. | No Operations allowed. Only Comparison is possible. | Can Add or Subtract only, since differences between values are well defined; multiplication and division are not meaningful because there is no true zero. | Can Add/Subtract/Multiply and Divide |
Can be assigned a label | Can be assigned a label | Can be assigned a label | Can be assigned a label |
Can be assigned a numerical code and the code can be random | Can be assigned a numerical code and the code can be random | Can be assigned a numerical code and the code cannot be random | Can be assigned a numerical code and the code cannot be random |
Difference between two values is not defined | Difference between two values is not defined, and if defined is meaningless | Difference between two values is defined, meaningful and equidistant | Difference between two values is defined, meaningful and equidistant |
Ratio, Coefficient of Variation and Geometric Mean are not defined | Ratio, Coefficient of Variation and Geometric Mean are not defined | Ratio, Coefficient of Variation and Geometric Mean are not defined | Ratio, Coefficient of Variation and Geometric Mean are defined |
Frequency Distribution can be calculated | Frequency Distribution can be calculated | Frequency Distribution can be calculated | Frequency Distribution can be calculated |
Standard Deviation not defined | Standard Deviation not defined | Standard Deviation defined | Standard Deviation defined |
No Origin of Scale, no idea where the scale starts or ends | No Origin of Scale, no idea where the scale starts or ends | - | - |
E.g. Name, Blood Group, Phone brands, Temperature (Comfortable/Not Comfortable), Jersey Number, Mobile number, Pin code, Gender | E.g. Ranking, Ratings by users (good, average, bad), Temperature (Hot/Very Hot), Year, Military Titles (Brigadier/Colonel/Major), Headache (Severe/Mild/None) etc. | E.g. Celsius and Fahrenheit Scale, Time, Ratings given by users (1 to 5 stars), GPA (grade point average) | E.g. Kelvin Scale, Duration, Difference between ratings given by users (0 to 4), Angle Measured in Degrees, Amount of money |
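If you would like to see the four scales in action, here is a minimal Python sketch. The sample values (blood groups, headache levels, temperatures, heights) and all the variable names are made up for illustration and are not from the course material. It only uses Python's built-in `statistics` module and shows which summaries make sense at each level, plus why 40 degrees Celsius is not "twice" 20 degrees Celsius.

```python
from statistics import mean, mode

# Hypothetical sample values, one short list per scale of measurement.
blood_groups = ["A", "O", "B", "O", "AB"]                # nominal
headache = ["None", "Mild", "Severe", "Mild", "Mild"]    # ordinal
temps_celsius = [20.0, 40.0, 30.0]                       # interval
heights_cm = [150.0, 160.0, 170.0]                       # ratio

# Nominal: only the mode (most frequent label) is meaningful.
nominal_mode = mode(blood_groups)

# Ordinal: order matters, so the median can be found by ranking the labels.
# (This simple middle-element pick assumes an odd number of observations.)
severity_order = ["None", "Mild", "Severe"]
ranks = sorted(severity_order.index(level) for level in headache)
ordinal_median = severity_order[ranks[len(ranks) // 2]]

# Interval: differences and means are meaningful, ratios are not.
celsius_difference = temps_celsius[1] - temps_celsius[0]
celsius_mean = mean(temps_celsius)


def celsius_to_kelvin(celsius):
    """Convert Celsius (interval scale) to Kelvin (ratio scale)."""
    return celsius + 273.15


# Dividing Celsius values gives 2.0, but that number has no physical meaning.
naive_ratio = temps_celsius[1] / temps_celsius[0]
# On the Kelvin scale, which has a true zero, the ratio is only about 1.07.
kelvin_ratio = celsius_to_kelvin(40.0) / celsius_to_kelvin(20.0)

# Ratio: a true zero exists, so ratios of heights are meaningful.
height_ratio = heights_cm[2] / heights_cm[0]

print("Mode of blood groups:", nominal_mode)
print("Median headache level:", ordinal_median)
print("Celsius difference and mean:", celsius_difference, celsius_mean)
print("Naive Celsius ratio vs Kelvin ratio:", naive_ratio, round(kelvin_ratio, 3))
print("Height ratio:", round(height_ratio, 3))
```

The ordinal median here works on ranks rather than on the labels themselves, which is one simple way to respect the ordering without pretending the labels are numbers.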
Hope you find this Statistics-1 Week 1 Summary for Data Science helpful. As a polite reminder, this is just that- a summary. You still must put in your time and effort going through class lectures and practicing questions before you can ace your exams for Statistics-1 Week 1.
See ya!
Peace!
✌🏻