How the Co-occurrence Table Explorer Works
Tutorial based on an email from Thomas Muhr (lead developer of ATLAS.ti) to the
users' mailing list on 19 October 2009
The Co-occurrence Table Explorer (CTE) is a very useful tool, but it has some
peculiarities that need to be explained.
The CTE displays for each pair of codes the count of their co-occurrences in all
current documents.
If a primary document family filter is active, the table covers only that part
of the data.
To create a primary document family, use Documents | Edit families | Open family
browser.
Once the family exists, you can activate its filter by double-clicking it (it is
then shown in bold), or via Documents | Filter | Families and selecting the
family.
Each cell, which represents a code pair, also displays a normalized coefficient
along with the count. It should vary between 0 (the codes do not co-occur) and 1
(the codes co-occur wherever they are used). This co-occurrence index (C-index,
see Garcia, 2006) takes the occurrence count of each code into account:
c := n12/(n1 + n2 - n12), where n12 is the co-occurrence frequency of two codes
c1 and c2, and n1 and n2 are their occurrence frequencies.
The coefficient is displayed unless you have disabled this option.
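As a minimal sketch of the coefficient, the C-index can be written as a small Python function. The function name c_index and the handling of a zero denominator are assumptions made for illustration; they are not taken from ATLAS.ti.

```python
def c_index(n12: int, n1: int, n2: int) -> float:
    """C-index of two codes: n12 / (n1 + n2 - n12).

    n12 is the co-occurrence frequency of the two codes,
    n1 and n2 are their individual occurrence frequencies.
    """
    denominator = n1 + n2 - n12
    if denominator == 0:
        # Assumption: neither code is used at all, so report "no co-occurrence".
        return 0.0
    return n12 / denominator


# Codes that never co-occur give 0; codes that always co-occur give 1.
print(c_index(0, 3, 4))
print(c_index(4, 4, 4))
```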
You may notice the following:
1. Mismatch. The number of quotations in the cell drop-down list does not always
match the cell's frequency count, which can be larger.
2. Out of range. The C-index exceeds the 0..1 range it is supposed to stay within.
3. Funny circles. Cells can have additional visual cues, e.g., a red, yellow or
orange circle.
1. Mismatch
-------------
The co-occurrence frequency does not count single quotations; it counts
co-occurrence "events". If a single quotation is coded by two codes, this
counts as a single co-occurrence. The complications arise when we take
overlapping quotations into account. When two quotations overlap and each of
them is coded by one of the codes, this also counts as a single
co-occurrence. However, in the cell drop-down list you will find both
quotations. In fact, there is currently no way to discriminate between a
single quotation's "strong" co-occurrence and the "weak" case of two quotations
in close proximity. The drop-down list displays an ordered list of all
quotations for all co-occurrence events for the pair of codes. We may need to
improve this by displaying single quotations and pairs of quotations as groups.
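The difference between the event count and the quotation list can be sketched as follows. This is an illustration only: the tuple layout (name, start, end, codes), the function names, and the rule "one event per pair of identical or overlapping quotations" are assumptions read off the examples in this tutorial, not ATLAS.ti's actual implementation.

```python
from itertools import product

# Hypothetical model: a quotation is (name, start, end, codes).


def overlaps(q1, q2):
    """True if the two text spans share at least one position."""
    return q1[1] < q2[2] and q2[1] < q1[2]


def cooccurrence(quotations, a, b):
    """Return (event count, sorted names of all quotations involved)."""
    with_a = [q for q in quotations if a in q[3]]
    with_b = [q for q in quotations if b in q[3]]
    # One event per pair (quotation coded a, quotation coded b) that is
    # either the same quotation or two overlapping quotations.
    events = [
        (qa[0], qb[0])
        for qa, qb in product(with_a, with_b)
        if qa is qb or overlaps(qa, qb)
    ]
    listed = sorted({name for pair in events for name in pair})
    return len(events), listed


# "Weak" case: two overlapping quotations, each coded with one of the codes.
weak = [("q1", 0, 10, {"a"}), ("q2", 5, 15, {"b"})]
print(cooccurrence(weak, "a", "b"))  # (1, ['q1', 'q2'])
```

Under these assumptions the weak case yields one event but two listed quotations, which is the kind of mismatch described above.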
2. Out of range
----------------
The C-index (structurally resembling the Tanimoto and Jaccard coefficients, which
are similarity measures) assumes separate, non-overlapping text entities. Only
then can we expect a correct range of values.
However, ATLAS.ti's quotations may overlap to any degree. Overlaps would cause
no problem only if there were no "coding redundancy" (the kind you can
eliminate using the Coding Analyzer). Let's look at a few scenarios.
Case 1: two differently coded quotations overlap. We assume there are no other
quotations. Let P1 be a textual document, q1 and q2 be quotations and a, b be
codes. q1 is coded with a, q2 is coded with b.

Using c := n_ab/(n_a + n_b - n_ab) (variables renamed to match our code
notation) we get:
n_ab = 1                  one co-occurrence of a and b
n_a = 1, n_b = 1          a and b each code exactly one quotation
c = 1/(1 + 1 - 1) = 1     Wow, maximum co-occurrence!
Case 2: q1 is coded with both codes a and b, the overlapping quotation q2 is
coded with b.

n_ab = 2. q1 alone counts as one co-occurrence event and the overlapping q1*q2
as another.
n_a = 1, n_b = 2
c = 2/(1 + 2 - 2) = 2!! Bad! This value is twice the allowed maximum.
Conclusion: the C-index is not appropriate to correctly represent co-occurrence
in overlapping texts. We either need to find a formula that does or we need to
"normalize" our quotations, that is, to eliminate overlapping before calculating
an index.

After eliminating the overlap between q1 and q2 we get three quotations: q1'
coded with a and b, q1*2 coded with a and b, and q2' coded with b:

n_ab = 2, n_a = 2, n_b = 3
c = 2/(2 + 3 - 2) = 2/3 = 0.67, which looks rather nice. It is in the allowed
range and it correctly takes into account that of the three possible
co-occurrence events only two apply.
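The normalization step can be sketched in Python as follows. This is one possible reading of "eliminating the overlap", under stated assumptions: quotations are modeled as (start, end, codes) character spans, every quotation boundary becomes a cut point, and each resulting segment inherits the codes of all quotations that cover it. The function names are hypothetical and this is not ATLAS.ti's procedure.

```python
def split_into_segments(quotations):
    """Cut overlapping (start, end, codes) spans into non-overlapping segments.

    Each segment carries the union of the codes of all quotations covering it.
    """
    cuts = sorted({point for start, end, _ in quotations for point in (start, end)})
    segments = []
    for lo, hi in zip(cuts, cuts[1:]):
        codes = set()
        for start, end, quotation_codes in quotations:
            if start <= lo and hi <= end:
                codes |= set(quotation_codes)
        if codes:  # skip gaps that no quotation covers
            segments.append((lo, hi, frozenset(codes)))
    return segments


def normalized_c_index(quotations, a, b):
    """C-index computed on the non-overlapping segments."""
    segments = split_into_segments(quotations)
    n_a = sum(a in codes for _, _, codes in segments)
    n_b = sum(b in codes for _, _, codes in segments)
    n_ab = sum(a in codes and b in codes for _, _, codes in segments)
    denominator = n_a + n_b - n_ab
    return n_ab / denominator if denominator else 0.0


# Case 2: q1 coded with a and b, overlapping q2 coded with b.
case_2 = [(0, 10, {"a", "b"}), (5, 15, {"b"})]
print(len(split_into_segments(case_2)))     # 3 segments
print(normalized_c_index(case_2, "a", "b"))  # 2/3
```

With this segmentation the index cannot exceed 1, because a segment counted in n_ab is always also counted in n_a and n_b.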
3. Circles
-----------
Circles with different colors are painted into a cell's upper right corner when
certain conditions apply.


The red circle: When the c-index exceeds 1.
The yellow circle: an inherent issue with the C-index and similar
measures is that it is distorted by code frequencies that differ too much. In
such cases the coefficient tends to be much smaller than the semantic
significance of the co-occurrence would suggest. For instance, if you had coded
100 quotations with code "depression" and 10 with "mother" and you had 5
co-occurrences:
n_dep = 100, n_mother = 10, n_dep-mother = 5
c = 5/(100 + 10 - 5) = 5/105 = 0.048
A C-index of only 0.048 is easy to overlook, although code "mother" co-occurs
with code "depression" in 50% of its applications. Seen from code
"depression", only 5% of its applications co-occur with code "mother".
If the ratio between the two code frequencies exceeds a certain threshold
(currently 5, but it will be user-definable), the yellow light goes on in the
cell. So whenever a cell shows the yellow marker, it should invite you to look
into the co-occurrences of this cell despite a low C-index.
Note: When the mouse rests over a cell with a yellow mark, a pop-up displays the
ratio of the two codes.

The orange circle is simply a mixture of the two conditions above.
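The three markers can be summarized in a short Python sketch. The function name cell_marker, the treatment of unused codes, and the strict "greater than" comparisons are assumptions for illustration; only the conditions themselves (C-index above 1, frequency ratio above a threshold of 5) come from the description above.

```python
def cell_marker(n12, n1, n2, ratio_threshold=5.0):
    """Return (c_index, frequency_ratio, marker) for one table cell.

    marker is "red", "yellow", "orange" or None.
    """
    denominator = n1 + n2 - n12
    # Assumption: a non-positive denominator is reported as c = 0.
    c = n12 / denominator if denominator > 0 else 0.0

    low, high = sorted((n1, n2))
    # Assumption: no ratio (and no yellow marker) if one code is unused.
    ratio = high / low if low > 0 else None

    out_of_range = c > 1
    unbalanced = ratio is not None and ratio > ratio_threshold

    if out_of_range and unbalanced:
        marker = "orange"
    elif out_of_range:
        marker = "red"
    elif unbalanced:
        marker = "yellow"
    else:
        marker = None
    return c, ratio, marker


# "depression" (100 quotations) vs. "mother" (10 quotations), 5 co-occurrences:
# low C-index, but a frequency ratio of 10 triggers the yellow marker.
print(cell_marker(5, 100, 10))
```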
Conclusions for our users and for us: Despite the above described deficiencies
of the chosen normalization method (C-index) for overlapping data entities and
its distortion by unequal coding frequencies, the main purpose of the
co-occurrence explorer is still met: its navigational capabilities and
explorative approach. The co-occurrence count and the c-index in combination
with additional colored hints are still helpful.
For precise quantitative hypothesis testing purposes some issues need to be
improved, e.g. the partitioning of quotations into non-overlapping segments.
Navigation can be improved by grouping co-occurrence events. In any case,
co-occurrence measures need to be clearly understood, not only because of the
mechanical problems above but also because of the semantic issues involved in
their meaningful interpretation (e.g., mixed application of codes at different
levels, such as broader terms and subterms). Furthermore, you need to be aware
of the artifacts imposed by a table approach, such as being reduced to pairwise
comparisons. Higher-order co-occurrences, which would take more than two codes
into account, need more elaborate methods (clustering).
Garcia (2004) http://www.miislita.com/semantics/c-index-1.html
Monday, 22 February 2010