Friday, February 09, 2007

Data Types

There are two data types: attribute and continuous. Attribute data describes countable quality characteristics, for example, the number of defects, the number of defectives, or the number of NCs (non-conformances). Continuous data, on the other hand, describes measurable quality characteristics, for example, the length of a spark plug, its weight, or the temperature at which it reaches maximum efficiency.

If a software project only records whether each milestone was met or not met, it is collecting attribute data. This does not tell us by how much we exceeded or fell short of expectations.

Another example that shows the difference between attribute and continuous data: in a factory that makes drinking glasses, two teams check that each glass is of the stipulated length. The first team uses Vernier calipers to measure the length. If the glass is of the stipulated length, it passes the quality check; otherwise it fails. Because the length of the glass is MEASURED, this is continuous data. The second team uses the go/no-go gauge technique. Here, each glass is passed through two separate raised platforms, one after another. The first platform lets through glasses of the stipulated length or shorter, while the second lets through only glasses that are shorter than stipulated. A glass that passes through both platforms is shorter than desired. A glass that passes through neither platform is longer than desired. A glass that passes the first platform but not the second is acceptable. This way of gauging relies on ATTRIBUTE data, because the team checks only a Yes/No condition for each glass.

Attribute data does not need costly instruments. In our example, the second method is far cheaper than using Vernier calipers, but we lose a lot of detail.
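The two-platform check can be simulated in a few lines of Python. This is only a sketch: the limit values and the function name `go_no_go` are hypothetical, and a real gauge never sees the numeric length, which is used here only to drive the simulation.

```python
# Hypothetical limits in millimetres; the post does not give actual values.
MAX_LENGTH_MM = 100.5  # first platform: passes anything up to the longest acceptable glass
MIN_LENGTH_MM = 99.5   # second platform: passes only glasses shorter than the shortest acceptable one


def go_no_go(length_mm: float) -> str:
    """Classify a glass the way the two-platform gauge would."""
    passes_first = length_mm <= MAX_LENGTH_MM
    passes_second = length_mm < MIN_LENGTH_MM
    if not passes_first:
        return "too long"   # passes neither platform
    if passes_second:
        return "too short"  # passes both platforms
    return "ok"             # passes the first platform only


for length in (99.0, 100.0, 101.0):
    print(length, go_no_go(length))
```

The result for each glass is one of three labels, with no record of how far the glass was from the limits. That missing distance is the detail lost.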

Note: there is a difference between defects and defectives. “Defects” is the total number of defects in all the pieces inspected. “Defectives” is the count of items which have defects. For example, in a lot of 100 water glasses, one inspected glass had two defects: its length was improper and it had cracks. Another inspected glass had one defect: its shape was malformed.

So, out of 100, there are two defectives (glass 1 and glass 2) but three defects (length and cracks on glass 1, malformed shape on glass 2). “Defects” is therefore always a better representation of abnormalities and deviations than “Defectives”.
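The two counts from the glass example can be reproduced as follows. This is a minimal sketch; the dictionary layout and variable names are illustrative choices, not something the post prescribes.

```python
# Inspection results for the lot of 100 glasses described above.
# Only the two glasses with problems are listed; the other 98 have no defects.
defects_by_glass = {
    "glass 1": ["improper length", "cracks"],
    "glass 2": ["malformed shape"],
}
lot_size = 100

# A defective is an item with at least one defect.
defectives = sum(1 for found in defects_by_glass.values() if found)
# Defects are counted individually across all items.
defects = sum(len(found) for found in defects_by_glass.values())

print(f"Defectives: {defectives} of {lot_size}")
print(f"Defects:    {defects}")
```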

Note that continuous data can be converted to attribute data, but not vice versa. So, it is always better to collect continuous data whenever the characteristic can be measured.
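A short sketch of why the conversion only works one way. The measurements and specification limits below are invented for illustration. Two quite different batches collapse to the same pass/fail record, so the original lengths cannot be recovered from it.

```python
# Hypothetical measurements (mm) and specification limits; not from the post.
LOWER_MM, UPPER_MM = 99.5, 100.5

batch_a = [99.6, 100.0, 100.4, 101.2]
batch_b = [100.0, 100.0, 100.0, 103.0]


def to_attribute(lengths):
    """Reduce each measured length to a pass/fail flag."""
    return [LOWER_MM <= x <= UPPER_MM for x in lengths]


attr_a = to_attribute(batch_a)
attr_b = to_attribute(batch_b)

print(attr_a)            # pass/fail flags for batch A
print(attr_a == attr_b)  # different measurements, identical attribute data
```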

Quartile Deviations – Sample & Analysis

This table represents the marks scored in the math exam by students in sections XII-A, XII-B, and XII-C. Q0, Q1, Q2, Q3, and Q4 are the quartile boundaries. In lay terms, they divide the range of marks into four sections.


Q0-Q1 is the first quartile
Q1-Q2 is the second quartile
Q2-Q3 is the third quartile
Q3-Q4 is the fourth quartile


For XII-A, there is not much variation between Q3 and Q4. This means, there is little variation among the top performers of the class. Q3 and Q2 show some variation.

Mean, Median, and Mode

All three are measures of central tendency: each gives a central score for a set of scores. The mean is heavily influenced by extreme values and hence is not suitable for measuring process performance.

[The mean is also called the average; the median is the middle value in a sorted set of data; the mode is the value that occurs most often.]

An illustration representing the fallibility of Mean and merit of Median is given below:

These are the marks obtained by students in mathematics in a particular class.

Marks
95
45
34
67
78
99
87
89
67
56
45
65
65
67
87
84
96

Here, the mean is about 72.12, and the median is 67. If the mathematics teacher is asked to raise the MEAN MARKS to 80, the task is comparatively easy. Since the mean rises whenever any individual score rises, the teacher might pick the brightest students of the class (students who have already scored quite high) and improve their performance. For example, a student at 87% can be easily trained to perform at 100%. The weak students can be neglected, as training them and expecting a performance good enough to boost the overall mean is a time-consuming task with no promise of success.

If, on the other hand, the teacher had been asked to raise the median to 80, it would not have been easy. For the median to increase, the middle of the class has to improve: at least half of the class has to score at or above the set target.
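The effect can be checked with Python's standard `statistics` module on the marks listed above. The "coach only the top scorers" rule, with its cut-off of 84, is a hypothetical intervention chosen for illustration.

```python
from statistics import mean, median

marks = [95, 45, 34, 67, 78, 99, 87, 89, 67, 56, 45, 65, 65, 67, 87, 84, 96]

# Hypothetical intervention: coach only students already at 84 or above up to 100.
boosted = [100 if m >= 84 else m for m in marks]

print(f"before: mean={mean(marks):.2f}, median={median(marks)}")
print(f"after:  mean={mean(boosted):.2f}, median={median(boosted)}")
```

The mean goes up although no student below 84 improved, while the median stays at 67 because the middle of the class did not move.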

In a customer satisfaction index, for example, it is better to focus on the median than on the mean.

So, the median is a better representation of a set of data than the mean.

Formula for median in Excel: =MEDIAN(A2:A12)

Formula for quartiles:

Q1 = QUARTILE(A2:A12, 1)
Q2 = QUARTILE(A2:A12, 2)
Q3 = QUARTILE(A2:A12, 3)
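For readers without Excel, a rough Python equivalent is sketched below using the marks listed earlier. It assumes that `statistics.quantiles` with `method="inclusive"` interpolates the same way as Excel's QUARTILE. That should hold, but it is worth checking against your own spreadsheet.

```python
from statistics import median, quantiles

marks = [95, 45, 34, 67, 78, 99, 87, 89, 67, 56, 45, 65, 65, 67, 87, 84, 96]

# method="inclusive" treats the data as the whole population, min and max included.
q1, q2, q3 = quantiles(marks, n=4, method="inclusive")
q0, q4 = min(marks), max(marks)

print("Q0..Q4:", q0, q1, q2, q3, q4)
print("median:", median(marks))  # same value as Q2
```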

Standard deviation gives a measure of dispersion (the extent to which values vary from the mean). It is therefore a better measure of process performance than the mean, median, or mode.
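A small sketch of why dispersion matters. The two sets of cycle times below are invented so that they share the same mean but differ widely in spread.

```python
from statistics import mean, pstdev

# Hypothetical cycle times (days) for two processes; not from the post.
process_a = [9, 10, 10, 10, 11]
process_b = [2, 6, 10, 14, 18]

for name, data in (("A", process_a), ("B", process_b)):
    # pstdev is the population standard deviation.
    print(f"process {name}: mean={mean(data)}, std dev={pstdev(data):.2f}")
```

Judged by the mean alone, the two processes look identical. Only the standard deviation shows that process B is far less predictable.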

2. Measure

This is the second phase in the DMAIC phases of Six Sigma. A measurement system is created, which helps in quantifying the Ys and identifying potential Xs for the Six Sigma initiative. The measurement system is established to ensure that the data collected for the Six Sigma project is accurate.

In the Define phase, the phase prior to Measure, the potential projects (problems or opportunities, i.e. the Ys) are identified. Approximations of the size of the Six Sigma project are taken to draft a schedule. In the Measure phase, the actual indicators and the quantum of work are identified. This gives the correct estimate of the volume of work on hand, which helps in accurate estimations.

The data collected for the Six Sigma green belt project should have the following characteristics:
  • Accurate: the observed value should be equal to the actual value, no matter how many times the task is performed.
  • Repeatable: when one person performs the task twice, the results should be the same.
  • Reproducible: when two persons measure the same item, the results should be identical. An example of reproducibility is software estimates. No matter who prepares them, the estimates should be in close proximity to one another (i.e. they should not vary much).
  • Stable: the results should be stable over a period of time.
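The first three characteristics can be illustrated with a few repeated readings. This is a simplified sketch with hypothetical data, not a formal gauge study. Stability would need the same readings repeated over time and is not covered here.

```python
from statistics import mean

REFERENCE_MM = 100.0  # hypothetical known length of a master part

# Hypothetical repeated readings of the same part by two operators.
readings = {
    "operator A": [100.1, 100.2, 100.1],
    "operator B": [99.8, 100.4, 100.0],
}

all_values = [x for trials in readings.values() for x in trials]

# Accuracy: how far the average reading is from the true value.
bias = mean(all_values) - REFERENCE_MM
# Repeatability: how much one operator's own readings vary.
spread = {op: max(t) - min(t) for op, t in readings.items()}
# Reproducibility: how much the operators disagree with each other.
operator_means = [mean(t) for t in readings.values()]
operator_gap = max(operator_means) - min(operator_means)

print(f"bias: {bias:+.3f} mm")
for op, r in spread.items():
    print(f"{op} range across trials: {r:.2f} mm")
print(f"gap between operator means: {operator_gap:.3f} mm")
```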
The roadmap for "Measuring" is as shown in the diagram above.

Note that the first three steps could have been done in the Define phase itself. In the Define phase, an approximation of Y's volume is taken, while in the Measure phase, the actual volume of Y is calculated. If only an approximate idea of the size of Y is known from the Define phase, the first three steps are required in the Measure phase. If the size is already known, they can be skipped.

To summarize, we carry out the following under the Measure phase:

1. To select the appropriate Y, we use the following:

a. Sigma Level (Performance of Y)
b. RTY (Rolled Throughput Yield)
c. Cp (inherent process capability) and Cpk (resultant process capability)

2. To identify and prioritize the Xs, we use the following:

a. Process mapping
b. Fish bone diagrams
c. Pareto analysis
d. FDM (Function Deployment Method)

At the end of step 2, we have the list of prioritized Xs.
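Three of the metrics named in step 1 can be computed with their commonly used textbook formulas, sketched below. All input numbers are hypothetical. Sigma level is left out because it depends on the conversion table or convention a given organization adopts.

```python
from math import prod

# Hypothetical first-pass yields of three consecutive process steps.
step_yields = [0.98, 0.95, 0.99]
rty = prod(step_yields)  # Rolled Throughput Yield: product of the step yields

# Hypothetical process mean, standard deviation, and specification limits.
mu, sigma = 100.2, 0.1
lsl, usl = 99.5, 100.5

cp = (usl - lsl) / (6 * sigma)               # capability if the process were centred
cpk = min(usl - mu, mu - lsl) / (3 * sigma)  # capability given the actual mean

print(f"RTY = {rty:.4f}")
print(f"Cp  = {cp:.2f}")
print(f"Cpk = {cpk:.2f}")
```

Cpk can never exceed Cp. The gap between them shows how much capability is lost because the process mean is off-centre.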

Y = f(X)

An alternate way to look at Six Sigma

Y = f(X)

Here we shall talk about what this function is all about and how it leads us to DMAIC.

Y is a function of X. Its value depends on the value assigned to X. Y, thus is dependent, while X is not. Y is called KPOV (Key Process Output Variable). X is called KPIV (Key Process Input Variable).

Y is the output. X is the input. To get results, we should focus on inputs (Xs), not on outputs (Ys). For example, commonly, companies focus on sales target, but not variables / processes that affect the sales target. When variables / processes that control the sales target are identified, and fine tuned, the sales target is automatically brought under control.

In terms of software defects, if all causes of bugs (all Xs) are identified and addressed, then there is no need to test the final product! The final testing can be skipped. Though this is an idealistic statement, this is what Six Sigma tries to achieve: reduce the causes of errors so that final inspection can be skipped.

Inspections, manual ones in particular, are never error free. So, no matter how many review cycles a piece of code undergoes, the possibility of overlooking an error still remains. Therefore, inspections don't really help.

Dell, for example, packages its computer components such that there is no chance of fitting parts wrongly: incompatible system elements simply do not fit. This is error proofing. (Dell call center handlers are therefore confident about letting customers open the system and repair it following instructions given online.)

In the software industry, this means modular programming, which yields good benefits. Modules are pretested, self-contained entities that just need to be integrated, followed by a final system integration test.

For example, suppose that to improve process performance, Eureka Forbes has several Ys to choose from: sales, number of products sold per month, etc. Of these, let's consider sales.

Y = Sales
For this Y, the possible Xs are:
X1 = Product Quality
X2 = Product Features
X3 = Price
X4 = Advertisement Effectiveness
X5 = Sales Force Effectiveness

Of these Xs, let's pick X5 (Sales Force Effectiveness) and consider it as Y. Now, for this Y, the possible Xs are:

X1 = Training Effectiveness
X2 = Recruitment and Selection Effectiveness
X3 = Attrition

Next, let's take X1 (Training Effectiveness) as Y. For this, the possible Xs are:

X1 = Trainer Competence
X2 = Duration of Training
X3 = Training Content

Thus, Y = f (X) helps us in drilling down from output to input to help us select green belt projects. Green belt projects usually have fixed time frame. They have to be chosen such that they are completed well within the time frame. Y = f(X) helps in choosing the Xs, and the corresponding Ys that are dependent on those Xs.
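The drill-down above can be written as a small tree and walked programmatically. This is a sketch only: the nested-dict layout and the helper name `paths_to_leaves` are illustrative choices, not part of any Six Sigma tool.

```python
# The drill-down from the Eureka Forbes example, written as a nested dict.
# An empty dict marks an X that the post does not break down further.
drill_down = {
    "Sales": {
        "Product Quality": {},
        "Product Features": {},
        "Price": {},
        "Advertisement Effectiveness": {},
        "Sales Force Effectiveness": {
            "Training Effectiveness": {
                "Trainer Competence": {},
                "Duration of Training": {},
                "Training Content": {},
            },
            "Recruitment and Selection Effectiveness": {},
            "Attrition": {},
        },
    }
}


def paths_to_leaves(tree, prefix=()):
    """Yield every chain Y -> X -> ... down to the lowest-level Xs."""
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            yield from paths_to_leaves(children, path)
        else:
            yield path


for path in paths_to_leaves(drill_down):
    print(" -> ".join(path))
```

Each printed chain is a candidate scope: the deeper the chain, the narrower the project, and so the easier it is to fit into a fixed time frame.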

The challenges faced while drilling down for Xs are:

1. Identification of Ys (which Ys to choose)

2. a. Measurability of Y
      - Current Y
      - Target Y
   b. Identification of Xs

3. Identification of vital Xs among the identified Xs: focusing on all Xs may not be fruitful. There could be a few vital Xs whose fine-tuning would give results.

4. Improve vital Xs and verify their impact on Y

5. Sustaining the improvements

The above five points are nothing but D-M-A-I-C.

1 = D
2 = M
3 = A
4 = I
5 = C
