(Last edited 20140310 at 15:19)
This post follows the previous one and covers the statistical use of Transactional Analysis to obtain concrete data from the samples examined (subjects, customers, patients).
To make the illustration clearer, we will proceed step by step.
Suppose you have a program, software X, that analyzes 400 patients. We know a number of things about them, but not all of it will be passed to software X. An example, taken from the previous post:
Id. | NP | CP | NC | CC(PP) | ACS | ACR | A | Birth | Sex | Script | Categ. |
#25 | 30 | 70 | 50 | 30 | 10 | 10 | 80 | 1950 | F (1) | 14 | |
#26 | 35 | 65 | … | … | … | … | … | … | … | … | |
… | … | … | … | … | … | … | … | … | … | … | |
#400 | 49 | 51 | … | … | … | … | … | … | … | … | |
We decide to give the software, for example, the PAC, the age, the critical Game (the one that causes the problems), the life position (suitably coded), the script (suitably coded), the father's profession, and anything else you like. All of these variables are converted to quantitative values, with suitable precautions. Suppose also that we know each subject's birthplace and that this information is not given to software X.
We ask software X to split the 400 patients into categories, at most 20 of them (the usual choice is the square root of the number of subjects, here the square root of 400). The computer starts working (more on this in the algorithm overview below).
It splits the 400 subjects into at most 20 categories: suppose we get fifteen. For each of the 400 subjects, the software writes the assigned category in the last column (Categ.).
Id. | NP | CP | NC | CC(PP) | ACS | ACR | A | Birth | Sex | Script | Categ. |
#25 | 30 | 70 | 50 | 30 | 10 | 10 | 80 | 1950 | F (1) | 14 | 15 |
#26 | 35 | 65 | … | … | … | … | … | … | … | … | 3 |
… | … | … | … | … | … | … | … | … | … | … | |
#400 | 49 | 51 | … | … | … | … | … | … | … | … | 2 |
Now suppose we add, for each of the 400 subjects, the Zip code:
Id. | Categ. | Zip code |
#25 | 15 | 31046 |
#26 | 3 | 70415 |
… | … | … |
#400 | 2 | 42425 |
At this point you can look for correlations. For example, if we found, with statistical significance, that the 8th category is populated by people born in Tuscany, we would have discovered that the standardized PAC for Tuscany has a particular profile. In essence, since software X is blind to certain variables, any correlation found with those variables can be given a high degree of trust.
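One common way to check such a correlation is a chi-square test of independence between the assigned category and the withheld variable. The sketch below is only an illustration: the data are invented, the function name `chi_square` is hypothetical, and it assumes the zip codes have already been mapped to a region label.

```python
from collections import Counter

def chi_square(categories, regions):
    """Pearson chi-square statistic for a category x region contingency table."""
    n = len(categories)
    cell = Counter(zip(categories, regions))   # observed count per (category, region)
    row = Counter(categories)                  # totals per category
    col = Counter(regions)                     # totals per region
    stat = 0.0
    for c in row:
        for r in col:
            expected = row[c] * col[r] / n     # count expected under independence
            observed = cell.get((c, r), 0)
            stat += (observed - expected) ** 2 / expected
    dof = (len(row) - 1) * (len(col) - 1)      # degrees of freedom
    return stat, dof

# Invented data: eight subjects, two categories, two regions.
categories = [8, 8, 8, 8, 3, 3, 3, 3]
regions = ["Tuscany"] * 4 + ["Other"] * 4

stat, dof = chi_square(categories, regions)
print(stat, dof)  # 8.0 1
```

With one degree of freedom the 5% critical value is 3.841, so a statistic of 8.0 would count as significant. A real sample of 400 subjects and fifteen categories would give a much larger table, and sparse cells would need some care.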
You can also build a mean, median, or modal profile for the Tuscany region and compute the corresponding standard deviations (StDev). It is then easy to assign recognition labels to new subjects or, of course, to rerun software X on a new list of subjects (or on the same list, expanded).
Since no names or other identifying details appear, several professionals can clearly merge their archives and pool the data to obtain better analyses.
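As a rough illustration of the profile idea, the sketch below builds a mean profile with standard deviations per region and labels a new subject by the closest profile. The rows, the three columns, and the helper names (`region_profile`, `z_distance`, `label`) are all invented for the example; it also assumes every standard deviation is non-zero.

```python
from statistics import mean, stdev

# Invented rows: (region, [NP, CP, NC]).
subjects = [
    ("Tuscany", [30, 70, 50]),
    ("Tuscany", [34, 66, 54]),
    ("Tuscany", [32, 68, 52]),
    ("Other", [60, 40, 20]),
    ("Other", [64, 36, 24]),
    ("Other", [62, 38, 22]),
]

def region_profile(rows, region):
    """Return (means, standard deviations) per column for one region."""
    values = [v for r, v in rows if r == region]
    columns = list(zip(*values))
    return [mean(c) for c in columns], [stdev(c) for c in columns]

def z_distance(vector, profile):
    """Distance from a profile, each column scaled by its standard deviation."""
    means, sds = profile
    return sum(((x - m) / s) ** 2 for x, m, s in zip(vector, means, sds)) ** 0.5

def label(vector, profiles):
    """Name of the profile closest to the given subject."""
    return min(profiles, key=lambda name: z_distance(vector, profiles[name]))

profiles = {name: region_profile(subjects, name) for name in ("Tuscany", "Other")}
print(profiles["Tuscany"])             # means 32, 68, 52 with StDev 2.0 each
print(label([33, 67, 51], profiles))   # Tuscany
```

Replacing `mean` with `statistics.median` or `statistics.mode` would give the median or modal profile instead.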
Algorithm overview.
The algorithm has been in use for twenty years and has proved reliable. The limits are 65536 rows and 256 columns: these are in fact the limits of the Microsoft Excel sheet (in the older .xls format) that serves as the container. The main algorithm used is Teuvo Kohonen's self-organizing map (SOM), run as a non-parametrized genetic process.
Each subject is treated as a point in a space with one dimension per variable. Suppose there are three dimensions and three subjects: the first subject (2, 2, 2) is much closer to the third subject (2, 4, 6) than to the second subject (14, 25, 8). In that case, if I asked for no more than two categories, I would certainly get the first and third subjects in one category and the second subject in another. The two categories will be numbered 1 and 2, but we do not know (and it does not matter at all) which will be category 1 and which category 2.
In this case the answer is trivial and needs no software; however…
Suppose you now have 400 subjects (or 6000), each with 32 values for 32 discrete variables, and that you have asked for a subdivision into no more than 20 categories.
This is a space of 32 dimensions (one per variable), to be divided into at most twenty categories: no problem for the computer, but beyond what we humans can picture.
Running on a reasonably powerful machine, software X delivers an acceptable solution after about an hour if you want a very accurate result. The time can range from half an hour to three hours, depending on the variables.
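The distances in this small example can be checked directly. The sketch below assumes ordinary Euclidean distance (the text does not say which metric software X uses) and reads the second subject's coordinates as (14, 25, 8).

```python
import math

first = (2, 2, 2)
second = (14, 25, 8)   # assumed reading of the second subject's coordinates
third = (2, 4, 6)

d_first_third = math.dist(first, third)     # sqrt(0 + 4 + 16)
d_first_second = math.dist(first, second)   # sqrt(144 + 529 + 36)
d_third_second = math.dist(third, second)   # sqrt(144 + 441 + 4)

print(round(d_first_third, 2))    # 4.47
print(round(d_first_second, 2))   # 26.63
print(round(d_third_second, 2))   # 24.27
```

The first and third subjects are about 4.5 units apart, while the second is about 25 units from either of them, which is why two categories separate them so cleanly.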
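To give an idea of what a self-organizing map does, here is a minimal one-dimensional sketch in plain Python. It is not the code of software X: the function names, the learning rate, the neighbourhood schedule, and the number of epochs are all assumptions chosen for illustration. The number of units plays the role of the maximum number of categories, and some units may end up with no subjects at all.

```python
import math
import random

def train_som(data, n_units, epochs=50, seed=0):
    """Train a tiny one-dimensional self-organizing map; return the unit vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Start each unit on a randomly chosen subject.
    units = [list(rng.choice(data)) for _ in range(n_units)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            rate = 0.5 * (1.0 - frac)                        # learning rate decays
            radius = max(n_units / 2 * (1.0 - frac), 0.5)    # neighbourhood shrinks
            # Best matching unit: the unit closest to this subject.
            bmu = min(range(n_units), key=lambda i: math.dist(units[i], x))
            for i in range(n_units):
                # Units near the winner on the map are pulled along with it.
                influence = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    units[i][d] += rate * influence * (x[d] - units[i][d])
            step += 1
    return units

def categorize(data, units):
    """Assign each subject the 1-based number of its closest unit."""
    return [
        min(range(len(units)), key=lambda i: math.dist(units[i], x)) + 1
        for x in data
    ]

# The three-subject example, with at most two categories requested.
data = [(2, 2, 2), (14, 25, 8), (2, 4, 6)]
units = train_som(data, n_units=2)
categories = categorize(data, units)
print(categories)
```

On this toy input the first and third subjects should share a category while the second stands alone; which of the two gets number 1 depends on the random start, as noted above.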
In the real world, the software is almost never able to finish the job completely. There remain, for example, subjects #35 and #41, which disturb each other: when #35 is placed in category 15, the category vectors shift, and #41 then becomes perfectly equidistant (in vector terms) from category 27 and category 15.
The computer then tries to move #35 to category 27, but the new vectors reproduce the problem. This usually happens at the end of the genetic process, when some subjects could equally well be placed in more than one category. If we truncate the optimization, our last two subjects end up in two categories arbitrarily; they remain equidistant from the subjects already assigned, and since they lie exactly on the boundary, these residual subjects can be attributed to any of the neighboring categories.
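The boundary situation can be illustrated with a small check that flags subjects whose two nearest categories are equally far away. This is a simplified sketch: the coordinates are invented, `boundary_subjects` is a hypothetical helper, and the category centres are held fixed, whereas in the process described above they shift each time a subject is moved.

```python
import math

def boundary_subjects(subjects, centroids, tolerance=1e-6):
    """Map each boundary subject id to its two (almost) equally near categories."""
    flagged = {}
    for sid, vector in subjects.items():
        # Rank categories from nearest to farthest for this subject.
        ranked = sorted(centroids, key=lambda c: math.dist(vector, centroids[c]))
        best, runner_up = ranked[0], ranked[1]
        gap = (math.dist(vector, centroids[runner_up])
               - math.dist(vector, centroids[best]))
        if gap <= tolerance:
            flagged[sid] = (best, runner_up)
    return flagged

# Invented two-dimensional example with two category centres.
centroids = {15: (0.0, 0.0), 27: (10.0, 0.0)}
subjects = {35: (5.0, 3.0), 41: (5.0, -2.0), 7: (1.0, 1.0)}

flagged = boundary_subjects(subjects, centroids)
print(flagged)  # {35: (15, 27), 41: (15, 27)}
```

Subjects flagged this way can be assigned to either neighboring category without changing the quality of the partition, which is why truncating the optimization at that point is acceptable.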