In order to explore our data set we therefore need to apply a different body of mathematics, one appropriate to our purpose. Recent developments in relational database technology, data-mining methods, and knowledge elicitation (expert systems) come to our rescue. The following treatment is based on the ID3 induction algorithm. Because these techniques may not be familiar to the reader, and because of their importance to our argument, we give a short explanation rather than simply quoting the results.
For the purpose of discussion, consider a very small but typical portion of our database, based on ten cases. (Note: these cases are given purely to illustrate the new methods of analysis and are not intended to imply or categorize any stereotypes.)
Case   Purchasing decision   Country   Function         Gender
1      universalist          US        senior manager   male
2      universalist          UK        junior manager   male
3      particularist         UK        senior manager   female
4      universalist          US        senior manager   female
5      particularist         VEN       senior manager   female
6      particularist         VEN       senior manager   male
7      particularist         UK        senior manager   male
8      particularist         VEN       junior manager   male
9      universalist          UK        junior manager   female
10     universalist          US        junior manager   male
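For readers who wish to follow the calculations below mechanically, the ten cases can be encoded directly, for example as Python tuples (the field layout and names are our own encoding, not part of the original method):

```python
# The ten illustrative cases:
# (case number, purchasing decision, country, manager function, gender)
CASES = [
    (1,  "universalist",  "US",  "senior", "male"),
    (2,  "universalist",  "UK",  "junior", "male"),
    (3,  "particularist", "UK",  "senior", "female"),
    (4,  "universalist",  "US",  "senior", "female"),
    (5,  "particularist", "VEN", "senior", "female"),
    (6,  "particularist", "VEN", "senior", "male"),
    (7,  "particularist", "UK",  "senior", "male"),
    (8,  "particularist", "VEN", "junior", "male"),
    (9,  "universalist",  "UK",  "junior", "female"),
    (10, "universalist",  "US",  "junior", "male"),
]
```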
In the domain of data mining, the various items are called "attributes" rather than factors; this distinguishes them from the factors or variables of parametric factor-analysis methods. For simplification at this stage, the first attribute, the purchasing decision, has been given only two values: whether a respondent is likely to adopt a "universalist" or a "particularist" purchasing decision. This is called the goal attribute.
We shall see later how we can use data mining where the goal attribute is not restricted in this way to two extreme values. Indeed, any of the attributes can be multistate.
The basic principle is to find the relative importance of the various attributes in determining the goal attribute. If we normalize (arrange) the data to the so-called third normal form in separate tables (as we would for representation in a relational database), we obtain:
1: Cases Sorted by Country

Case   Purchasing decision   Country
5      particularist         VEN
6      particularist         VEN
8      particularist         VEN
2      universalist          UK
3      particularist         UK
7      particularist         UK
9      universalist          UK
1      universalist          US
4      universalist          US
10     universalist          US
2: Cases Sorted by Manager Function

Case   Purchasing decision   Function
3      particularist         senior
1      universalist          senior
5      particularist         senior
6      particularist         senior
7      particularist         senior
4      universalist          senior
2      universalist          junior
8      particularist         junior
9      universalist          junior
10     universalist          junior
3: Cases Sorted by Gender

Case   Purchasing decision   Gender
1      universalist          male
2      universalist          male
6      particularist         male
7      particularist         male
8      particularist         male
10     universalist          male
3      particularist         female
4      universalist          female
5      particularist         female
9      universalist          female
When we look at the attribute gender in table 3, we see that we cannot determine the goal attribute from gender alone: knowing that a respondent is male or female tells us nothing about whether the purchasing decision will be universalist or particularist.
Similarly, for either a junior or senior manager function, the goal attribute cannot be uniquely determined from table 2. When we look at the attribute country in table 1, however, we find that in all cases where country = US the goal is universalist, and in all cases where country = VEN it is particularist. If we know "country," we can therefore correctly classify six of the ten examples in our data set. In data-mining terminology, the attribute "country" is said to have the highest information content.
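This count of six can be checked mechanically: for each attribute we count the cases that fall into attribute-value groups containing only one goal class. The sketch below uses our own encoding of the ten cases and a hypothetical helper name:

```python
from collections import defaultdict

# (case, decision, country, function, gender) -- our own encoding of the table
CASES = [
    (1,  "universalist",  "US",  "senior", "male"),
    (2,  "universalist",  "UK",  "junior", "male"),
    (3,  "particularist", "UK",  "senior", "female"),
    (4,  "universalist",  "US",  "senior", "female"),
    (5,  "particularist", "VEN", "senior", "female"),
    (6,  "particularist", "VEN", "senior", "male"),
    (7,  "particularist", "UK",  "senior", "male"),
    (8,  "particularist", "VEN", "junior", "male"),
    (9,  "universalist",  "UK",  "junior", "female"),
    (10, "universalist",  "US",  "junior", "male"),
]

def correctly_classified(attr_index):
    """Count cases whose attribute value occurs with only one goal class."""
    classes = defaultdict(set)   # attribute value -> goal classes seen
    counts = defaultdict(int)    # attribute value -> number of cases
    for case in CASES:
        classes[case[attr_index]].add(case[1])
        counts[case[attr_index]] += 1
    return sum(n for v, n in counts.items() if len(classes[v]) == 1)

print(correctly_classified(2))  # country: 6 (the US and VEN groups are pure)
print(correctly_classified(3))  # function: 0 (both groups are mixed)
print(correctly_classified(4))  # gender: 0 (both groups are mixed)
```

Only the US and VEN groups are "pure" (a single goal class), giving the six correctly classified cases noted above.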
For the full database, we can compute the entropy of classification for each attribute. This gives us a measure of the uncertainty of classification of our goal by that attribute: the higher the entropy, the more uncertainty remains about the goal once the attribute's value is known. What we really want to know is how much information the value(s) of any particular attribute give us about the goal.
If HC(attribute = v) is the entropy of classification of the goal for the cases in which the attribute takes the value v, it is given by:

HC(attribute = v) = -Σ f(c | attribute = v) × log f(c | attribute = v)

where the sum runs over the goal classes c (here particularist and universalist), f(c | attribute = v) is the fraction of those cases that belong to class c, and logarithms are taken to base 2.
Thus, the entropy of classification when the management function is "senior manager" is:

HC(function is senior)
= -f(particularist | function is senior) × log f(particularist | function is senior)
  - f(universalist | function is senior) × log f(universalist | function is senior)
= -4/6 log(4/6) - 2/6 log(2/6)
= 0.918
Similarly,

HC(function is junior)
= -f(particularist | function is junior) × log f(particularist | function is junior)
  - f(universalist | function is junior) × log f(universalist | function is junior)
= -1/4 log(1/4) - 3/4 log(3/4)
= 0.811
Hence, for the overall value of HC(function), we simply weight these by the proportion of the ten cases in each group (six senior, four junior):

HC(manager function) = 6/10 × 0.918 + 4/10 × 0.811 = 0.8752
Repeating this procedure for the other attributes we obtain:
HC(gender) = 1.0
HC(country) = 0.4
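These figures can be reproduced with a few lines of code. The sketch below (our own encoding of the cases; `hc` is a hypothetical helper name) computes the size-weighted class entropy for any attribute; note that the unrounded value for manager function is 0.8755, while the text's 0.8752 comes from rounding the two intermediate entropies first:

```python
import math
from collections import Counter

# (case, decision, country, function, gender) -- our own encoding of the table
CASES = [
    (1,  "universalist",  "US",  "senior", "male"),
    (2,  "universalist",  "UK",  "junior", "male"),
    (3,  "particularist", "UK",  "senior", "female"),
    (4,  "universalist",  "US",  "senior", "female"),
    (5,  "particularist", "VEN", "senior", "female"),
    (6,  "particularist", "VEN", "senior", "male"),
    (7,  "particularist", "UK",  "senior", "male"),
    (8,  "particularist", "VEN", "junior", "male"),
    (9,  "universalist",  "UK",  "junior", "female"),
    (10, "universalist",  "US",  "junior", "male"),
]

def hc(attr_index):
    """Entropy of classification HC(attribute): the goal-class entropy within
    each attribute-value group (log base 2), weighted by the group's share."""
    total = len(CASES)
    result = 0.0
    for value in {c[attr_index] for c in CASES}:
        group = [c for c in CASES if c[attr_index] == value]
        freqs = Counter(c[1] for c in group)  # goal-class counts in the group
        h = -sum(n / len(group) * math.log2(n / len(group))
                 for n in freqs.values())
        result += len(group) / total * h
    return result

print(round(hc(3), 4))  # manager function: 0.8755
print(round(hc(4), 4))  # gender: 1.0
print(round(hc(2), 4))  # country: 0.4
```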
Since HC(gender) = 1.0, i.e. maximum uncertainty, this tells us that the attribute "gender" contains no information about the goal. This is consistent with table 3, which shows that half of the males and half of the females fall into each goal class.
Because HC(country) has the lowest entropy of classification, it corresponds to the least uncertainty. In other words, "country" has the highest information content, and is therefore the major contributor in explaining the consumer's cultural orientation on this dimension. Manager function makes a smaller contribution.
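This comparison is exactly ID3's split rule: at each step, choose the attribute with the lowest entropy of classification (equivalently, the highest information gain). A minimal sketch, again assuming our own encoding of the cases:

```python
import math
from collections import Counter

# (case, decision, country, function, gender) -- our own encoding of the table
CASES = [
    (1,  "universalist",  "US",  "senior", "male"),
    (2,  "universalist",  "UK",  "junior", "male"),
    (3,  "particularist", "UK",  "senior", "female"),
    (4,  "universalist",  "US",  "senior", "female"),
    (5,  "particularist", "VEN", "senior", "female"),
    (6,  "particularist", "VEN", "senior", "male"),
    (7,  "particularist", "UK",  "senior", "male"),
    (8,  "particularist", "VEN", "junior", "male"),
    (9,  "universalist",  "UK",  "junior", "female"),
    (10, "universalist",  "US",  "junior", "male"),
]
ATTRS = {"country": 2, "function": 3, "gender": 4}

def hc(cases, idx):
    """Size-weighted goal-class entropy (log base 2) per attribute-value group."""
    h = 0.0
    for v in {c[idx] for c in cases}:
        grp = [c for c in cases if c[idx] == v]
        h += len(grp) / len(cases) * -sum(
            n / len(grp) * math.log2(n / len(grp))
            for n in Counter(c[1] for c in grp).values())
    return h

# ID3 splits on the attribute with the lowest entropy of classification.
best = min(ATTRS, key=lambda a: hc(CASES, ATTRS[a]))
print(best)  # country
```

ID3 would then recurse on the impure subsets (here, only the UK group) using the remaining attributes.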
Implementing the Induction Algorithm