I am a newbie to NLP.
Suppose there is a tagging task with 5 features: A, B, C, D, E.
Assume the tagger using feature A alone achieves 60.11% accuracy, which
serves as the baseline.
To determine the effectiveness of the remaining features relative to
the baseline, I performed the following experiments:
A + B = 60.23%
A + C = 59.34%
A + D = 61.28%
A + E = 60.03%
It seems that B and D improve on the baseline while C and E do not. How
can I statistically claim that B and D significantly improve the baseline?
I know that I could do a null-hypothesis test, which usually requires
an underlying distribution. But in this case, how can I determine it?
(By exhaustively performing many experiments?) Could anyone kindly tell
me a systematic procedure to set up such a test for this task?
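For concreteness, one test I have seen mentioned for comparing taggers is the paired bootstrap over per-token correctness. Below is a minimal sketch of that idea; the 0/1 correctness vectors are hypothetical toy data, not my real results:

```python
import numpy as np

def paired_bootstrap(base_correct, new_correct, n_resamples=10_000, seed=0):
    """Paired bootstrap: resample test tokens with replacement and count
    how often the new system's accuracy gain over the baseline vanishes."""
    rng = np.random.default_rng(seed)
    base = np.asarray(base_correct)
    new = np.asarray(new_correct)
    n = len(base)
    losses = 0
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)   # one bootstrap resample of tokens
        if new[idx].mean() - base[idx].mean() <= 0:
            losses += 1
    return losses / n_resamples            # approximate one-sided p-value

# toy per-token correctness vectors (1 = tagged correctly), same test set
base = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1] * 50   # 70% accuracy
new  = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1] * 50   # 90% accuracy
p = paired_bootstrap(base, new)
```

Is something along these lines the right way to frame it, or is a different test (e.g. McNemar's) more standard here?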
Moreover, can I claim that I am doing "feature selection" if B and D
significantly improve the baseline (though they probably do not)?
Since there may be many "hidden" interactions between features, what is
the typical way to select the best combination of features? Could anyone
recommend some references or surveys on that? I would really appreciate it!
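To make my question concrete, here is a minimal sketch of greedy forward selection, one simple strategy I have read about; `evaluate` is a hypothetical stand-in for training and scoring the tagger on a feature subset, and the accuracy numbers are just my results above treated as additive gains for illustration:

```python
def forward_selection(features, evaluate):
    """Greedy forward selection: repeatedly add the single feature that
    most improves the score, stopping when no remaining feature helps."""
    selected = []
    best_score = float("-inf")
    while True:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        # score every one-feature extension of the current subset
        scored = [(evaluate(selected + [f]), f) for f in candidates]
        score, feat = max(scored)
        if score <= best_score:
            break                      # no remaining feature improves the score
        selected.append(feat)
        best_score = score
    return selected, best_score

# toy evaluation: pretend accuracy is additive in the per-feature gains,
# and that nothing works without the base feature A
gains = {"A": 60.11, "B": 0.12, "C": -0.77, "D": 1.17, "E": -0.08}
evaluate = lambda feats: sum(gains[f] for f in feats) if "A" in feats else 0.0

selected, best = forward_selection(["A", "B", "C", "D", "E"], evaluate)
# selected == ['A', 'D', 'B']: A first, then the features that still help
```

Is this kind of greedy search a reasonable way to handle feature interactions, or do people use something more principled in practice?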