ID: 481
I have 350 examples that were annotated with binary labels by two different annotators. How do I compute inter-annotator agreement, including confidence interval?
Aligned LLM's Response:
To
compute
inter
-
annot
ator
agreement
,
you
can
use
a
variety
of
metrics
such
as
Fle
iss
'
k
appa
,
Scott
'
s
pi
,
or
Cohen
'
s
k
appa
.
These
metrics
measure
the
degree
of
agreement
between
two
or
more
an
not
ators
on
a
categor
ical
task
such
as
label
ing
examples
.
\n
\n
To
compute
inter
-
annot
ator
agreement
with
a
confidence
interval
,
you
can
use
a
statistical
method
such
as
the
Mc
N
em
ar
test
or
the
Co
ch
ran
'
s
Q
test
.
These
tests
can
provide
a
p
-
value
and
a
confidence
interval
for
the
observed
agreement
between
the
an
not
ators
.
\n
\n
Here
'
s
an
example
of
how
you
can
compute
inter
-
annot
ator
agreement
with
a
confidence
interval
using
Fle
iss
'
k
appa
:
\n
\n
1
.
First
,
you
need
to
calculate
the
observed
agreement
between
the
two
an
not
ators
.
You
can
do
this
by
counting
the
number
of
times
the
an
not
ators
agreed
and
disag
reed
on
each
example
,
and
divid
ing
by
the
total
number
of
examples
.
\n
2
.
Next
,
you
need
to
calculate
the
expected
agreement
between
the
two
an
not
ators
if
they
were
randomly
label
ing
the
examples
.
You
can
do
this
by
multip
lying
the
marginal
prob
abilities
of
each
label
for
each
an
not
ator
,
and
sum
ming
over
all
possible
pairs
of
labels
.
\n
3
.
Finally
,
you
can
calculate
Fle
iss
'
k
appa
by
sub
tract
ing
the
expected
agreement
from
the
observed
agreement
,
divid
ing
by
1
-
the
expected
agreement
,
and
multip
lying
by
4
.
Fle
iss
'
k
appa
ranges
from
-
1
to
1
,
where
-
1
indicates
no
agreement
,
0
indicates
agreement
no
better
than
chance
,
and
1
indicates
perfect
agreement
.
\n
4
.
To
compute
a
confidence
interval
for
Fle
iss
'
k
appa
,
you
can
use
the
Mc
N
em
ar
test
.
The
Mc
N
em
ar
test
provides
a
p
-
value
and
a
confidence
interval
for
the
observed
agreement
between
the
two
an
not
ators
.
You
can
use
this
p
-
value
and
confidence
interval
to
determine
whether
the
observed
agreement
is
statist
ically
significant
.
\n
\n
Here
'
s
some
sample
Python
code
that
demonstr
ates
how
to
compute
inter
-
annot
ator
agreement
with
a
confidence
interval
using
Fle
iss
'
k
appa
and
the
Mc
N
em
ar
test
:
\n
``
`
python
\n
import
numpy
as
np
\n
from
sc
ip
y
.
stats
import
m
cn
em
ar
\n
\n
#
Example
data
\n
labels
=
np
.
array
([
0
,
1
,
0
,
1
,
0
,
1
,
1
,
0
,
1
,
0
,
1
,
0
,
1
,
0
,
1
,
1
,
0
,
1
,
0
,
1
,
1
,
0
,
1
,
0
,
1
,
0
,
1
,
0
,
1
,
1
,
0
,
1
,
0
,
1
,
1
,
0
,
1
,
0
,
1
,
1
,
0
,
1
,
0
,
1
,
0
,
1
,
1
,
0
,
1
,
0
,
1
,
1
,
0
,
1
,
0
,
1
,
1
,
0
,
1
,
0
,
1
,
0
,
1
,
1
,
0
,
1
,
0
,
1
,
1
,
0
,
1
,
0
,
1
,
0
,
1
,
1
,
0
,
1
,
0
,
1
,
1
,
0
,
1
,
0
,
1
,
1
,
0
,
1
,
0
,
1
,
0
,
1
,
1
,
0
,
1
,
0
,
1
,
0
,
1
,
1
,
0
,
1
,
0
,
1
,
0
,
1
,
1
,
0
,
1
,
0
,
1
,
0
,
1
,
1
,
0
,
1
,
0
,
1
,
0
,
1
,
1
,
0
,
1
,
0
,
1
,
0
,
1
,
1
,
0
,
1
,
0
,
1
,
0
,
1
,
1
,
0
,
1
,
0
,
1
,
0
,
1
,
1
,
0
,
1
,
0
,
1
,
0
,
1
,
1
,
0
,
1
,
0
,
1
,
0
,
1
,
1
,
0
,
1
,
0
,
1
,
0
,
1
,
1
,
0
,
1
,
0
,
1
,
0
,
1
,
1
,
0
,
1
,
0
,
1
,
0
,
1
,
1
,
0
,
1
,
0
,
1
,
0
,
Legend: Shifted positions |
Marginal positions |
Unshifted positions
Token Distributions