In order to explore our data set we therefore need to apply a different body of mathematics which is appropriate for our cause. Recent developments in relational database technology, database mining methods, and knowledge elicitation (Expert Systems) came to our rescue. The following treatment is based on the ID3 induction algorithm. Because these new techniques may not be familiar to the reader and because of their importance in our debate, we will give a short explanation rather than simply quote the results.

For the purpose of discussion, consider a very small but typical portion of our database based on ten cases. (Note: these are for the illustration of these new methods of analysis and these cases are not intended to imply or categorize any stereotypes through these examples.)

Case  Purchasing decision  Country  Function        Gender
1     universalist         US       senior manager  male
2     universalist         UK       junior manager  male
3     particularist        UK       senior manager  female
4     universalist         US       senior manager  female
5     particularist        VEN      senior manager  female
6     particularist        VEN      senior manager  male
7     particularist        UK       senior manager  male
8     particularist        VEN      junior manager  male
9     universalist         UK       junior manager  female
10    universalist         US       junior manager  male

In the domain of data mining, the various items are called "attributes" rather than factors; this distinguishes them from the factors or variables of parametric methods such as factor analysis. For simplicity at this stage, the first attribute, the dimension score (shown as "purchasing decision" in the table), has been given only two values: whether a respondent is likely to adopt a "universalist" or a "particularist" purchasing decision. This is called the goal attribute.

We shall see later how we can use data mining where the goal attribute is not restricted in this way to two extreme values. Indeed, any of the attributes can be multistate.

The basic principle is to find the relative importance of the various attributes in determining the goal attribute. If we normalize (arrange) the data to the so-called third normal form in separate tables (as we would for representation in a relational database), we obtain:

1: Cases Sorted by Country

Case  Purchasing decision  Country
5     particularist        VEN
6     particularist        VEN
8     particularist        VEN
2     universalist         UK
3     particularist        UK
7     particularist        UK
9     universalist         UK
1     universalist         US
4     universalist         US
10    universalist         US

2: Cases Sorted by Manager Function

Case  Purchasing decision  Function
3     particularist        senior
1     universalist         senior
5     particularist        senior
6     particularist        senior
7     particularist        senior
4     universalist         senior
2     universalist         junior
8     particularist        junior
9     universalist         junior
10    universalist         junior

3: Cases Sorted by Gender

Case  Purchasing decision  Gender
1     universalist         male
2     universalist         male
6     particularist        male
7     particularist        male
8     particularist        male
10    universalist         male
3     particularist        female
4     universalist         female
5     particularist        female
9     universalist         female

When we look at the attribute gender in table 3, we see that gender alone cannot determine the goal attribute: both males and females include universalist and particularist purchasing decisions.

Similarly, for either a junior or a senior manager function, the goal attribute cannot be uniquely determined from table 2. When we look at the attribute country in table 1, however, we find that in all cases where country = US the goal is universalist, and in all cases where country = VEN the goal is particularist. Knowing "country" therefore lets us correctly classify six of the ten examples in our data set. In data mining terminology, the attribute "country" is said to have the highest information content.
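The grouping argument above can be sketched in a few lines of Python (a minimal illustration; the tuple encoding of the ten cases and the helper name `uniquely_classified` are mine, not from the text):

```python
# Minimal sketch: group the ten cases by one attribute and count how many
# cases fall into "pure" groups, i.e. groups containing only one goal value.
from collections import defaultdict

# (goal, country, function, gender) for the ten illustrative cases.
CASES = [
    ("universalist", "US", "senior", "male"),
    ("universalist", "UK", "junior", "male"),
    ("particularist", "UK", "senior", "female"),
    ("universalist", "US", "senior", "female"),
    ("particularist", "VEN", "senior", "female"),
    ("particularist", "VEN", "senior", "male"),
    ("particularist", "UK", "senior", "male"),
    ("particularist", "VEN", "junior", "male"),
    ("universalist", "UK", "junior", "female"),
    ("universalist", "US", "junior", "male"),
]

def uniquely_classified(attr_index):
    """Number of cases whose attribute value determines the goal outright."""
    goals_seen = defaultdict(set)   # attribute value -> goal values observed
    group_size = defaultdict(int)   # attribute value -> number of cases
    for case in CASES:
        goals_seen[case[attr_index]].add(case[0])
        group_size[case[attr_index]] += 1
    return sum(n for v, n in group_size.items() if len(goals_seen[v]) == 1)

for name, idx in [("country", 1), ("function", 2), ("gender", 3)]:
    print(name, uniquely_classified(idx))
```

Only "country" produces pure groups (US and VEN, three cases each); "function" and "gender" classify no case uniquely.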

For the full database, we can compute the entropy for each attribute. This gives us a measure of the uncertainty with which each attribute classifies our goal: the higher the entropy, the greater the uncertainty that remains once we know that attribute's value. However, what we really want to know is how much information there is when we know the value(s) of any particular attribute.

If HC(attribute value) is the entropy of classification of the goal classes c for a given attribute value, then this is given by:

HC(attribute value) = -Σc f(c | attribute value) × log2 f(c | attribute value)

where f(c | attribute value) is the fraction of the cases with that attribute value which belong to goal class c, and the sum runs over the goal classes.

Thus, the entropy of classification when the manager function is "senior" is:

HC(function is senior)
= -f(particularist | function is senior) × log2 f(particularist | function is senior)
  - f(universalist | function is senior) × log2 f(universalist | function is senior)
= -4/6 log2(4/6) - 2/6 log2(2/6)
= 0.918

Similarly,

HC(function is junior)
= -f(particularist | function is junior) × log2 f(particularist | function is junior)
  - f(universalist | function is junior) × log2 f(universalist | function is junior)
= -1/4 log2(1/4) - 3/4 log2(3/4)
= 0.811

Hence, for the overall value of HC(manager function), we weight these branch entropies by their share of the ten cases:

HC(manager function) = 6/10 × 0.918 + 4/10 × 0.811 = 0.8752
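The calculation above can be checked directly (a sketch; the `entropy` helper is mine, and `math.log2` supplies the base-2 logarithm the derivation uses):

```python
# Entropy of classification within one branch: H = -sum f * log2(f),
# where f runs over the goal-class frequencies in that branch.
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

h_senior = entropy([4, 2])  # senior managers: 4 particularist, 2 universalist
h_junior = entropy([1, 3])  # junior managers: 1 particularist, 3 universalist

# Weight each branch entropy by its share of the ten cases.
h_function = 6 / 10 * h_senior + 4 / 10 * h_junior
print(round(h_senior, 3), round(h_junior, 3), round(h_function, 3))
```

Carrying the unrounded branch entropies gives 0.8755; the figure 0.8752 comes from weighting the already-rounded values 0.918 and 0.811.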

Repeating this procedure for the other attributes we obtain:

HC(gender) = 1.0

HC(country) = 0.4

Since HC(gender) = 1.0, i.e. maximum uncertainty, the attribute "gender" contains no information about the goal. This is consistent with table 3, which shows that half the males and half the females fall into each goal class.

Because HC(country) has the lowest entropy of classification, it corresponds to the least uncertainty. In other words, "country" has the highest information content, and thus "country" is the major contributor in explaining a consumer's cultural orientation on this dimension; manager function makes a smaller contribution.
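Putting the pieces together, ID3's choice of splitting attribute amounts to picking the attribute with the lowest weighted entropy of classification. A self-contained sketch (the data encoding and helper names are mine, not from the text):

```python
# Choose the ID3 split attribute: compute the weighted entropy of
# classification HC for each attribute and take the minimum, i.e. the
# attribute with the highest information content.
from collections import Counter, defaultdict
from math import log2

# (goal, country, function, gender) for the ten illustrative cases.
CASES = [
    ("universalist", "US", "senior", "male"),
    ("universalist", "UK", "junior", "male"),
    ("particularist", "UK", "senior", "female"),
    ("universalist", "US", "senior", "female"),
    ("particularist", "VEN", "senior", "female"),
    ("particularist", "VEN", "senior", "male"),
    ("particularist", "UK", "senior", "male"),
    ("particularist", "VEN", "junior", "male"),
    ("universalist", "UK", "junior", "female"),
    ("universalist", "US", "junior", "male"),
]

def branch_entropy(goals):
    """H of the goal distribution within one attribute-value branch."""
    total = len(goals)
    return -sum(n / total * log2(n / total) for n in Counter(goals).values())

def weighted_entropy(attr_index):
    """HC(attribute): branch entropies weighted by branch size."""
    branches = defaultdict(list)
    for case in CASES:
        branches[case[attr_index]].append(case[0])
    return sum(len(g) / len(CASES) * branch_entropy(g)
               for g in branches.values())

scores = {name: weighted_entropy(i)
          for name, i in [("country", 1), ("function", 2), ("gender", 3)]}
print(min(scores, key=scores.get))  # the attribute ID3 places at the root
```

On these ten cases the scores reproduce the values above (0.4 for country, about 0.875 for function, 1.0 for gender), so "country" becomes the root of the induced tree.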

Implementing the Induction Algorithm