A recipe for weak sequence labelling using Snorkel for clinical applications
Sequence labelling
Sequence labelling is the process of automatically labelling a sequence of words (tokens) X with a sequence of integer labels Y. Each integer label usually denotes the class that the corresponding word in X belongs to.
x₁, x₂, …, xₙ ↦ y₁, y₂, …, yₙ; where yᵢ ∈ {0, 1, 2, …, k} and k + 1 is the number of label classes
Supervised machine learning techniques require a large number of labelled text sequences to train sequence labelling models. However, hand-labelling any large text dataset is expensive, especially in the clinical domain, because
- The manual labeller/annotator needs to be a domain expert.
- Multiple people are required to annotate the text because of the low reliability of a single annotator.
- Multiple people need to be paid for the task. ($$$$$)
Weak labelling
Weak labelling is the process of automatically labelling a text sequence with a (noisy) integer label sequence using programmatic rules rather than human annotators. A weak labeller is a labelling function, LF (λ), that labels a sequence X with a sequence of weak labels Ỹ.
x₁, x₂, …, xₙ ↦ ỹ₁, ỹ₂, …, ỹₙ; where ỹᵢ ∈ {0, 1, 2, …, k}
To label X with entity classes Ỹ, an LF needs two things.
- Labelling source LFₛ: For example, if you intend to label the names of drugs in a text dataset, you need a dictionary of all the possible drug names.
- Labelling heuristic LFₕ: Once you have a labelling source, for example, the abovementioned dictionary of drug names, you need a labelling logic. This logic or heuristic maps the terms from the drug-name dictionary onto the text sequence. It could be as simple as a string match (a minimal sketch follows below).
Weak labelling eliminates the need for this manual labour by automating the process.
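To make this concrete, here is a minimal sketch of a token-level labelling function built from a drug-name dictionary (labelling source) and a case-insensitive string match (labelling heuristic). The dictionary contents, the function name, and the label convention (+1 drug, 0 non-drug, -1 abstain, discussed later) are assumptions for illustration only.

```python
# Toy labelling source; in practice this would come from a drug ontology.
DRUG_NAMES = {"metformin", "aspirin", "atorvastatin"}

DRUG, NOT_DRUG, ABSTAIN = 1, 0, -1  # label convention used later in this post


def lf_drug_dictionary(tokens):
    """Labelling heuristic: +1 if a token matches the dictionary, else abstain."""
    return [DRUG if tok.lower() in DRUG_NAMES else ABSTAIN for tok in tokens]


print(lf_drug_dictionary(["Patients", "received", "metformin", "daily"]))
# -> [-1, -1, 1, -1]
```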
Problems with weak labelling and Snorkel
Though weak labelling is easy to implement, a single weak labeller is rarely enough: in practice, one requires several weak labellers λ₁, λ₂, …, λₘ. The accuracy of each λᵢ depends on its labelling source and heuristic. How should one aggregate these m weak labellers to obtain a single consensus label for the sequence X?
Snorkel provides the LabelModel, a generative model that estimates a consensus (or “true”) label for each data point from the outputs of the labelling functions, accounting for their estimated accuracies.
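Below is a minimal sketch of this aggregation with the snorkel library, assuming each token of the corpus is treated as a separate data point and the token-level LF outputs have already been stacked into a label matrix with one row per token and one column per LF. The toy matrix and variable names are illustrative, not taken from a real pipeline.

```python
import numpy as np
from snorkel.labeling.model import LabelModel

# Label matrix: one row per token, one column per labelling function;
# entries are 1 (positive), 0 (negative) or -1 (abstain).
L_train = np.array([
    [ 1, -1,  1],
    [-1,  0,  0],
    [ 1,  1, -1],
    [-1, -1, -1],   # every LF abstained on this token
])

label_model = LabelModel(cardinality=2, verbose=True)  # two classes: 0 and 1
label_model.fit(L_train=L_train, n_epochs=500, seed=123)

# Consensus labels; tokens the model abstains on come back as -1.
preds = label_model.predict(L=L_train, tie_break_policy="abstain")
print(preds)
```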
Labelling functions
This section describes my experience in developing labelling functions for PICO extraction. If you have access to a small validation set, use it to build the labelling functions. Depending on the end goal, develop LFs that favour either recall or precision. To develop LFs…
- What could be the possible choices for the labelling sources? Use the currently available ontologies, terminologies, hand-crafted dictionaries, regular expressions (RegEx), heuristics, and distant supervision sources (CTO, EudraCT, miRBase, etc.).
- The quickest source of medical and clinical ontologies is the UMLS Metathesaurus.
- Additional biomedical ontologies can be queried using NCBO BioPortal.
- To capture generic patterns of the labels, use heuristics and RegEx. For example, to identify the intervention in the clinical-trial phrase “cognitive behavioural therapy intervention for improving quality of life”, the MeSH ontology could be used to identify the chunk “cognitive behavioural therapy” (https://www.ncbi.nlm.nih.gov/mesh/?term=cognitive+behavioral+therapy), while the flanking term “intervention” after CBT could be marked using a heuristic that captures the term “intervention” whenever it is preceded by the tokens {‘cognitive’, ‘behaviour’, ‘therapy’} in a specific POS-tag (NN, JJ) pattern (a sketch of such a heuristic appears after this list).
- I have seen posts on Stack Overflow asking what labels a labelling function should emit on the text tokens. Answer: a labelling function should output +1 for positive-class tokens, 0 for negative-class tokens, and abstain (-1) on tokens where the decision is uncertain. Thanks to Jason Fries for clarifying this.
- If you cannot think of any source for the negative class, start by tagging stopwords as negative token labels. In fact, you can merge stopwords from multiple sources like NLTK, Gensim, scikit-learn, and spaCy (see the stopword-merging sketch after this list). Check this Medium post.
- If your entity or span recognition task has a hand-labelled validation set available, use it to check the empirical accuracy of your labelling functions using LF summaries (see the LF-summary sketch after this list).
- Remove all those labelling functions that do not label a single token as +1. In my experiments, removing them increased recall for the positive class and, in turn, the F1 score. This can be different for your application.
- Do not remove an LF if it has low coverage but high empirical accuracy in the LF summary. Check this video from Paroma Varma.
- Always inspect the LF summaries to improve the final label model.
- Remember that the label model only provides labels for a subset of tokens. We had a case where the label model abstained on half the dataset. In that case, we reported performance only on the tokens the label model actually labelled and trained the downstream transformer only on the non-abstained tokens (a small filtering sketch appears after this list).
- Make sure you do not have a large number of abstains; this can result in the LabelModel emitting many abstains at prediction time. Design additional LFs to increase the coverage of the class.
- Apart from direct and complete string matching and regular expression matching, you can also use frequent n-grams (bigrams, trigrams, etc.) as a labelling heuristic. Identify the most frequent n-grams from your ontologies and use them as label templates. This allows for fuzzy matching.
- Another approach is fuzzy string matching, whereby an ontology term is only partially matched onto your text. Two options are 1) partial string matching using the difflib library and 2) matching the lemma of an ontology term to the text (a fuzzy-matching sketch appears after this list).
- Do not forget that abbreviations are rampant in clinical text and should be resolved using a heuristic.
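As mentioned in the RegEx/heuristics point above, here is a minimal sketch of a heuristic labelling function that marks the word “intervention” as part of an intervention span when it directly follows a dictionary-matched phrase such as “cognitive behavioural therapy”. The tiny dictionary is an illustrative stand-in for MeSH terms, and the sketch deliberately skips the POS-tag check for brevity.

```python
import re

# Toy stand-in for intervention terms pulled from MeSH.
MESH_INTERVENTION_TERMS = {"cognitive behavioural therapy", "cognitive behavioral therapy"}

INTERVENTION, ABSTAIN = 1, -1


def lf_intervention_heuristic(sentence):
    """Label tokens of '<MeSH term> intervention' as INTERVENTION; abstain elsewhere."""
    tokens = sentence.split()
    labels = [ABSTAIN] * len(tokens)
    lowered = sentence.lower()
    for term in MESH_INTERVENTION_TERMS:
        for m in re.finditer(re.escape(term) + r"\s+intervention\b", lowered):
            # Mark every token whose character span falls inside the match.
            char_pos = 0
            for i, tok in enumerate(tokens):
                tok_start = lowered.find(tok.lower(), char_pos)
                tok_end = tok_start + len(tok)
                char_pos = tok_end
                if tok_start >= m.start() and tok_end <= m.end():
                    labels[i] = INTERVENTION
    return labels


print(lf_intervention_heuristic(
    "Cognitive behavioural therapy intervention for improving quality of life"))
# -> [1, 1, 1, 1, -1, -1, -1, -1, -1]
```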
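A minimal sketch of the stopword merging mentioned above; it assumes the NLTK stopword corpus has already been downloaded with nltk.download('stopwords').

```python
# Merge English stopword lists from several libraries into one negative-class
# labelling source (assumes nltk.download('stopwords') has been run).
from nltk.corpus import stopwords
from gensim.parsing.preprocessing import STOPWORDS as GENSIM_STOPWORDS
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as SKLEARN_STOPWORDS
from spacy.lang.en.stop_words import STOP_WORDS as SPACY_STOPWORDS

MERGED_STOPWORDS = (
    set(stopwords.words("english"))
    | set(GENSIM_STOPWORDS)
    | set(SKLEARN_STOPWORDS)
    | set(SPACY_STOPWORDS)
)

NEGATIVE, ABSTAIN = 0, -1


def lf_stopwords_negative(tokens):
    """Label stopword tokens as the negative class; abstain on everything else."""
    return [NEGATIVE if tok.lower() in MERGED_STOPWORDS else ABSTAIN for tok in tokens]


print(len(MERGED_STOPWORDS))  # a few hundred merged stopwords
```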
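A minimal sketch of producing an LF summary with snorkel's LFAnalysis on a validation (dev) label matrix; the matrix and gold labels below are toy values for illustration.

```python
import numpy as np
from snorkel.labeling import LFAnalysis

# L_dev: label matrix over validation tokens (rows) and LFs (columns);
# Y_dev: gold labels for the same validation tokens.
L_dev = np.array([
    [ 1, -1,  0],
    [ 1,  1, -1],
    [-1,  0,  0],
    [ 1, -1, -1],
])
Y_dev = np.array([1, 1, 0, 0])

# Coverage, overlaps, conflicts and empirical accuracy per labelling function.
print(LFAnalysis(L=L_dev).lf_summary(Y=Y_dev))
```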
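A small sketch of the abstain filtering described above, i.e. keeping only the tokens the label model actually labelled before training and evaluating the downstream model; the token list and predictions are toy values.

```python
import numpy as np

# Toy consensus predictions from the LabelModel; -1 marks abstained tokens.
all_tokens = ["Patients", "received", "metformin", "daily"]
preds = np.array([-1, 0, 1, -1])

keep = preds != -1  # drop tokens the label model abstained on
tokens_kept = [tok for tok, k in zip(all_tokens, keep) if k]
labels_kept = preds[keep]

# Train the downstream transformer and report performance only on these tokens.
print(tokens_kept, labels_kept)  # -> ['received', 'metformin'] [0 1]
```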
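A minimal sketch of the two fuzzy-matching ideas above: extracting frequent bigrams from ontology terms to use as label templates, and partial string matching with difflib. The toy term list is an assumption.

```python
from collections import Counter
from difflib import SequenceMatcher, get_close_matches

# Toy ontology terms; in practice these would come from UMLS, MeSH, etc.
ONTOLOGY_TERMS = [
    "cognitive behavioural therapy",
    "behavioural therapy session",
    "group behavioural therapy",
]

# 1) Frequent n-grams from ontology terms as label templates.
bigrams = Counter()
for term in ONTOLOGY_TERMS:
    words = term.split()
    bigrams.update(zip(words, words[1:]))
print(bigrams.most_common(1))  # -> [(('behavioural', 'therapy'), 3)]

# 2) Partial (fuzzy) string matching with difflib.
print(get_close_matches("behavioral therapy", ONTOLOGY_TERMS, n=1, cutoff=0.6))
print(SequenceMatcher(None, "behavioural", "behavioral").ratio())  # ~0.95
```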
Bibliography
I. Fries, J. A., Steinberg, E., Khattar, S., Fleming, S. L., Posada, J., Callahan, A., & Shah, N. H. (2021). Ontology-driven weak supervision for clinical entity classification in electronic health records. Nature communications, 12(1), 2017.
II. Dhrangadhariya, A., & Müller, H. (2023). Not so weak PICO: leveraging weak supervision for participants, interventions, and outcomes recognition for systematic review automation. JAMIA open, 6(1), ooac107.
III. Nye, B., et al. (2018). A corpus with multi-level annotations of patients, interventions and outcomes to support language processing for medical literature. Proceedings of the Conference of the Association for Computational Linguistics. NIH Public Access.
IV. Abaho, M., et al. (2019). Correcting crowdsourced annotations to improve detection of outcome types in evidence-based medicine. CEUR Workshop Proceedings, 2429.
Do read my paper, Not so weak PICO: leveraging weak supervision for participants, interventions, and outcomes recognition for systematic review automation, for details, and see my presentation on weakly supervised PICO information extraction.
#snorkel #ner #naturallanguageprocessing #weaklabeling #spanrecognition #womenintech #womenincomputerscience #womeninscience #femtech #distantlabeling #distantsupervision #clinical #biomedical #namedentityrecognition #PICO #entityextraction #informationextraction