ID: 481 
 
  
  
     
 I have 350 examples that were annotated with binary labels by two different annotators. How do I compute inter-annotator agreement, including confidence interval? 
 
 
 
  
  
 
   Aligned LLM's Response: 
  
To compute inter-annotator agreement, you can use a chance-corrected metric such as Fleiss' kappa, Scott's pi, or Cohen's kappa. These metrics measure the degree of agreement between two or more annotators on a categorical task such as labeling examples. Cohen's kappa and Scott's pi are defined for exactly two annotators, while Fleiss' kappa generalizes to more than two, so Cohen's kappa is the natural choice for your setting of two annotators and binary labels.
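For a quick point estimate, scikit-learn already implements Cohen's kappa. A minimal sketch, assuming your annotations are stored in two arrays a1 and a2 (the arrays below are synthetic stand-ins for your 350 labels):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Synthetic stand-in data: two annotators, 350 binary labels each.
rng = np.random.default_rng(0)
a1 = rng.integers(0, 2, size=350)
a2 = np.where(rng.random(350) < 0.8, a1, 1 - a1)  # annotator 2 agrees ~80% of the time

print(f"Cohen's kappa: {cohen_kappa_score(a1, a2):.3f}")
```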
To attach a confidence interval to the agreement estimate, you can use the large-sample (asymptotic) standard error of kappa, or a nonparametric bootstrap over the examples. Note that McNemar's test and Cochran's Q test, which sometimes come up in this context, check whether the annotators' marginal label frequencies differ; they do not provide a confidence interval for the agreement itself.
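For the asymptotic approach, statsmodels provides a cohens_kappa function that, given the 2x2 contingency table of the two annotators' labels, returns kappa along with a standard error and confidence bounds. A sketch reusing the a1 and a2 arrays from above (the result attribute names follow my reading of the statsmodels API, so double-check them against your installed version):

```python
import numpy as np
from statsmodels.stats.inter_rater import cohens_kappa

# 2x2 contingency table: rows index annotator 1's label, columns annotator 2's.
table = np.zeros((2, 2))
for i, j in zip(a1, a2):
    table[i, j] += 1

res = cohens_kappa(table)
print(f"kappa = {res.kappa:.3f}, 95% CI = [{res.kappa_low:.3f}, {res.kappa_upp:.3f}]")
```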
Here's how you can compute Cohen's kappa with a confidence interval by hand:
1. First, calculate the observed agreement between the two annotators. Count the number of examples on which the two annotators assigned the same label, and divide by the total number of examples (350 in your case).
2. Next, calculate the agreement you would expect if the two annotators labeled independently at random according to their own label frequencies. For each label, multiply the two annotators' marginal probabilities of assigning that label, then sum these products over the labels (here, 0 and 1).
3. Finally, calculate Cohen's kappa by subtracting the expected agreement from the observed agreement and dividing by 1 minus the expected agreement: kappa = (p_o - p_e) / (1 - p_e). Kappa is at most 1, where 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate agreement worse than chance.
4. To compute a confidence interval for kappa, either use its asymptotic standard error (a 95% interval is kappa ± 1.96 × SE), or bootstrap: resample the 350 examples with replacement many times, recompute kappa on each resample, and take the 2.5th and 97.5th percentiles of the resulting values. A worked numeric example follows this list.
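As a concrete (purely illustrative) calculation: suppose the annotators agree on 300 of the 350 examples, so p_o = 300/350 ≈ 0.857, and each annotator assigns label 1 about 60% of the time, so p_e ≈ 0.6 × 0.6 + 0.4 × 0.4 = 0.52. Then kappa ≈ (0.857 - 0.52) / (1 - 0.52) ≈ 0.70.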
Here's some sample Python code that demonstrates how to compute Cohen's kappa with a bootstrap confidence interval:
```python
import numpy as np

# Example data: synthetic stand-ins; replace with your two annotators'
# binary labels (length 350).
rng = np.random.default_rng(0)
a1 = rng.integers(0, 2, size=350)
a2 = np.where(rng.random(350) < 0.8, a1, 1 - a1)

def cohen_kappa(x, y):
    """Cohen's kappa for two equal-length label arrays."""
    p_o = np.mean(x == y)  # observed agreement
    labels = np.unique(np.concatenate([x, y]))
    p_e = sum(np.mean(x == k) * np.mean(y == k) for k in labels)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Bootstrap: resample examples with replacement and recompute kappa.
n = len(a1)
boot = np.array([
    cohen_kappa(a1[idx], a2[idx])
    for idx in (rng.integers(0, n, size=n) for _ in range(2000))
])
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"kappa = {cohen_kappa(a1, a2):.3f}")
print(f"95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")
```
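For interpretation, the commonly cited (though admittedly arbitrary) Landis and Koch benchmarks treat kappa between 0.61 and 0.80 as substantial agreement and above 0.80 as almost perfect. With 350 examples the bootstrap interval is usually fairly tight, and it's good practice to report both the point estimate and the interval.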
   