Project author: bo1929

Project description:
Active learning for NER using transfer learning.
Language: Python
Repository: git://github.com/bo1929/active-learning-named-entity-recognition.git
Created: 2021-01-19T15:20:10Z
Project page: https://github.com/bo1929/active-learning-named-entity-recognition

License: MIT License


Focusing on potential named entities during active label acquisition

Implementation of the method (anelfop) described in Şapcı, A., Kemik, H., Yeniterzi, R., & Tastan, O. (2023). Focusing on potential named entities during active label acquisition. Natural Language Engineering, 1-23, together with the scripts used for the experiments discussed in the paper.

About the method

Named entity recognition (NER) aims to identify mentions of named entities in an unstructured text and classify them into predefined named entity classes.
While deep learning-based pre-trained language models help to achieve good predictive performances in NER, many domain-specific NER applications still call for a substantial amount of labeled data.
Active learning (AL), a general framework for the label acquisition problem, has been used for NER tasks to minimize the annotation cost without sacrificing model performance.
However, the heavily imbalanced class distribution of tokens introduces challenges in designing effective AL querying methods for NER.
We propose several AL sentence query evaluation functions that pay more attention to potential positive tokens and evaluate these proposed functions with both sentence-based and token-based cost evaluation strategies.
We also propose a better data-driven normalization approach to penalize sentences that are too long or too short.
Our experiments on three datasets from different domains reveal that the proposed approach reduces the number of annotated tokens while achieving prediction performance better than or comparable to that of conventional methods.
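
To make the idea of paying more attention to potential positive tokens concrete, here is a minimal sketch. It is not the scoring function from the paper or from al_methods.py. It only contrasts a conventional total-entropy sentence score with a variant that ignores tokens confidently predicted as outside any entity. The function names, the 0.5 threshold, and the toy probabilities are all assumptions made for illustration.

```python
import math
from typing import List, Sequence

TokenProbs = Sequence[float]  # class probabilities for one token; index 0 is "O"


def token_entropy(probs: TokenProbs) -> float:
    """Shannon entropy (in nats) of one token's predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def total_entropy(sentence: List[TokenProbs]) -> float:
    """Conventional score: sum the entropy of every token in the sentence."""
    return sum(token_entropy(p) for p in sentence)


def positive_focused_entropy(
    sentence: List[TokenProbs], outside_index: int = 0, threshold: float = 0.5
) -> float:
    """Sum entropy only over tokens that may belong to an entity.

    A token counts as a potential entity token when its probability of the
    outside class is below `threshold` (a hypothetical cutoff for this sketch).
    """
    return sum(token_entropy(p) for p in sentence if p[outside_index] < threshold)


# Toy data: a long sentence of fairly confident "O" tokens, and a short
# sentence containing one token that might be an entity.
long_sentence = [[0.90, 0.05, 0.05]] * 5
short_sentence = [[0.95, 0.03, 0.02], [0.30, 0.40, 0.30]]

for name, sent in (("long", long_sentence), ("short", short_sentence)):
    print(name, round(total_entropy(sent), 3), round(positive_focused_entropy(sent), 3))
```

With this toy data, the total-entropy score ranks the long sentence higher only because it has more tokens, while the positive-focused score prefers the short sentence that contains a possible entity.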

Reproducing the results

If you would like to reproduce the experiments, see expt_scripts.
Do not forget to edit the config files.

First, create a virtual environment based on Python 3.7.

Then install the required packages with pip install -r requirements.txt.

For active learning

You can use python anelfop/al_experiment.py, and the following configuration file:

    seed: 0 # random seed
    increment_cons: exp1 # exponentially increasing labeled training set size
    initial_size: 16 # initial size of the training set
    stopping_criteria: full # do not stop active learning until the whole dataset is labeled
    generator: False
    method: dpAP # see al_methods.py to choose the active learning method
    main_directory: /path/to/results
    data_directory: /path/to/dataset
    data_set:
      name: CONLL2003
      pos: True
    pretrained_model: bert-base-cased
    embedding_type: cl4l
    init_reduction:
      type: pca
      pca:
        dimension: 256
    CRF:
      algorithm: lbfgs
      c1: 0.1
      c2: 0.1
      max_iterations: 100
      allow_all_states: True
      allow_all_transitions: True
    outlier_method:
      method_name: GLOSH
      outlier_quantile: 0.995
      parameters: {}
    hdbscan_al:
      min_c_size: 1000
      min_samp: 70
      c_eps: 0.19
    umap_al:
      neig: 40
      min_dist: 0.0
      n_comp: 2

Simply write this to a file, e.g., al-cfg.yaml, and then run python anelfop/al_experiment.py --config-path /path/to/al-cfg.yaml.
Many standard active learning methods, as well as the proposed methods for querying sentences from the unlabeled corpus, are implemented in al_methods.py. The method abbreviations are listed in that source file; use them as the value of method in the configuration.
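
The interplay of initial_size, increment_cons: exp1, and stopping_criteria: full can be pictured with a small sketch. It assumes that the labeled set doubles each round, which is a guess: the configuration comment only says the size increases exponentially, and the actual growth rate used by the repository may differ. The function name and the factor parameter are hypothetical.

```python
from typing import List


def labeled_set_sizes(initial_size: int, pool_size: int, factor: int = 2) -> List[int]:
    """Sizes of the labeled training set after each active learning round.

    The set starts at `initial_size` and is multiplied by `factor` every
    round until the whole pool is labeled (mirroring `stopping_criteria: full`).
    `factor` is a hypothetical growth rate chosen for this sketch.
    """
    if initial_size <= 0 or pool_size <= 0:
        raise ValueError("initial_size and pool_size must be positive")
    if factor < 2:
        raise ValueError("factor must be at least 2 for exponential growth")

    sizes = []
    size = initial_size
    while size < pool_size:
        sizes.append(size)
        size *= factor
    sizes.append(pool_size)  # last round: every sentence in the pool is labeled
    return sizes


print(labeled_set_sizes(initial_size=16, pool_size=1000))
# → [16, 32, 64, 128, 256, 512, 1000]
```

Under this assumption, a pool of 1000 sentences is exhausted in seven rounds, so most of the comparison between query methods happens in the early rounds, where the labeled set is still small.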

For passive learning

You can use python anelfop/pl_experiment.py, and the following example configuration file:

    seed: 0
    generator: True
    method: dpAP
    main_directory: /path/to/results
    data_directory: /path/to/dataset
    data_set:
      name: CONLL2003
      pos: True
    pretrained_model: bert-base-cased
    embedding_type: cl4l
    init_reduction:
      type: pca
      pca:
        dimension: 256
    CRF:
      algorithm: lbfgs
      c1: 0.1
      c2: 0.1
      max_iterations: 100
      allow_all_states: True
      allow_all_transitions: True

Again, write this to a file, e.g., pl-cfg.yaml, and then run python anelfop/pl_experiment.py --config-path /path/to/pl-cfg.yaml.
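
Before launching a long run, it can help to check that an edited configuration still contains the keys shown in the sample. The sketch below works on a plain nested dictionary, such as a YAML parser would produce, and reports missing entries. It is not part of the repository. The helper name, the nesting implied by the dotted paths, and the choice to treat these keys as mandatory are all assumptions.

```python
from typing import Any, Dict, List

# Dotted key paths copied from the sample passive-learning configuration.
# Treating all of them as mandatory is an assumption of this sketch.
EXPECTED_KEYS = [
    "seed",
    "generator",
    "method",
    "main_directory",
    "data_directory",
    "data_set.name",
    "data_set.pos",
    "pretrained_model",
    "embedding_type",
    "init_reduction.type",
    "CRF.algorithm",
]


def missing_keys(config: Dict[str, Any], expected: List[str]) -> List[str]:
    """Return the dotted key paths that are absent from a nested config dict."""
    missing = []
    for path in expected:
        node: Any = config
        for part in path.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(path)
                break
            node = node[part]
    return missing


example_config = {
    "seed": 0,
    "generator": True,
    "method": "dpAP",
    "main_directory": "/path/to/results",
    "data_directory": "/path/to/dataset",
    "data_set": {"name": "CONLL2003", "pos": True},
    "pretrained_model": "bert-base-cased",
    "embedding_type": "cl4l",
    "init_reduction": {"type": "pca", "pca": {"dimension": 256}},
    # The "CRF" section is left out on purpose to show what the check reports.
}

print(missing_keys(example_config, EXPECTED_KEYS))
# → ['CRF.algorithm']
```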