
Building a Biomedical Entity Linker with LLMs

by Anand Subramanian

How can an LLM be effectively utilized for biomedical entity linking?

Photo by Alina Grubnyak on Unsplash

Biomedical text is a catch-all term that broadly encompasses documents such as research articles, clinical trial reports, and patient records, serving as rich repositories of information about various biological, medical, and scientific concepts. Research papers in the biomedical field present novel breakthroughs in areas like drug discovery, drug side effects, and new disease treatments. Clinical trial reports offer in-depth details on the safety, efficacy, and side effects of new medications or treatments. Meanwhile, patient records contain comprehensive medical histories, diagnoses, treatment plans, and outcomes recorded by physicians and healthcare professionals.

Mining these texts allows practitioners to extract valuable insights, which can be beneficial for various downstream tasks. You could mine text to identify adverse drug reactions, build automated medical coding algorithms, or construct information retrieval or question-answering systems that can help extract information from vast research corpora. However, one issue affecting biomedical document processing is the often unstructured nature of the text. For example, researchers might use different terms to refer to the same concept. What one researcher calls a “heart attack” might be referred to as a “myocardial infarction” by another. Similarly, in drug-related documentation, technical and common names may be used interchangeably. For example, “Acetaminophen” is the technical name of a drug, while “Paracetamol” is its more common counterpart. The prevalence of abbreviations adds another layer of complexity; “Nitric Oxide” might be referred to as “NO” in another context. These variations make it difficult for a layman or a text-processing algorithm to determine whether different terms refer to the same concept. Thus, Entity Linking becomes essential in this scenario.

  1. What is Entity Linking?
  2. Where do LLMs come in here?
  3. Experimental Setup
  4. Processing the Dataset
  5. Zero-Shot Entity Linking using the LLM
  6. LLM with Retrieval Augmented Generation for Entity Linking
  7. Zero-Shot Entity Extraction with the LLM and an External KB Linker
  8. Fine-tuned Entity Extraction with the LLM and an External KB Linker
  9. Benchmarking Scispacy
  10. Takeaways
  11. Limitations
  12. References

When text is unstructured, accurately identifying and standardizing medical concepts becomes crucial. To achieve this, medical terminology systems such as the Unified Medical Language System (UMLS) [1], the Systematized Nomenclature of Medicine–Clinical Terms (SNOMED-CT) [2], and Medical Subject Headings (MeSH) [3] play an essential role. These systems provide a comprehensive and standardized set of medical concepts, each uniquely identified by an alphanumeric code.

Entity linking involves recognizing and extracting entities within the text and mapping them to standardized concepts in a large terminology. In this context, a Knowledge Base (KB) refers to an extensive database containing standardized information and concepts related to the terminology, such as medical terms, diseases, and drugs. Typically, a KB is expert-curated and designed, containing detailed information about each concept, including variations of the terms that could be used to refer to it, and how it is related to other concepts.

An overview of the Entity Recognition and Linking Pipeline. The entities are first parsed from the text, and then each entity is linked to a Knowledge Base to obtain their corresponding identifiers. The knowledge base considered in this example is the MeSH terminology. The example text is taken from the BioCreative V CDR Corpus [4,5,6,7,8] (Image by Author)

Entity recognition entails extracting words or phrases that are significant in the context of our task. In this context, it usually refers to the extraction of biomedical terms such as drugs, diseases, and so on. Typically, lookup-based methods or machine learning/deep learning-based systems are used for entity recognition. Linking the entities to a KB usually involves a retriever system that indexes the KB. This system takes each extracted entity from the previous step and retrieves likely identifiers from the KB. The retriever here is also an abstraction, which may be sparse (BM-25), dense (embedding-based), or even a generative system (such as a Large Language Model (LLM)) that has encoded the KB in its parameters.
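To make this two-stage structure concrete, here is a toy sketch of an extract-then-link pipeline. The miniature KB and the exact-match linker are purely illustrative and far simpler than the retrievers discussed later (the IDs shown happen to be real MeSH descriptors, but nothing else here is from the article's pipeline):

# A toy KB mapping concept IDs to known aliases (illustrative only)
TOY_KB = {
    "D009203": ["myocardial infarction", "heart attack"],
    "D000082": ["acetaminophen", "paracetamol"],
}

# Invert the KB into an alias -> concept ID lookup table
ALIAS_TO_ID = {alias: cid for cid, aliases in TOY_KB.items() for alias in aliases}

def link_entities(entities):
    """Link each extracted entity to a concept ID via exact alias match."""
    return [(e, ALIAS_TO_ID.get(e.lower(), "UNKNOWN")) for e in entities]

# The entities would normally come from an entity recognition model
print(link_entities(["Heart attack", "Paracetamol"]))
# [('Heart attack', 'D009203'), ('Paracetamol', 'D000082')]

Real linkers replace the exact-match lookup with fuzzier retrieval, since surface forms in text rarely match KB aliases verbatim.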

I’ve been curious for a while about the best ways to integrate LLMs into biomedical and clinical text-processing pipelines. Given that Entity Linking is an important part of such pipelines, I decided to explore how LLMs can best be utilized for this task. Specifically, I investigated the following setups:

  1. Zero-Shot Entity Linking with an LLM: Leveraging an LLM to directly identify all entities and concept IDs from input biomedical texts without any fine-tuning
  2. LLM with Retrieval Augmented Generation (RAG): Utilizing the LLM within a RAG framework by injecting information about relevant concept IDs into the prompt to identify the appropriate concept IDs.
  3. Zero-Shot Entity Extraction with an LLM and an External KB Linker: Utilizing the LLM for zero-shot entity extraction from biomedical texts, with an external linker/retriever for mapping the entities to concept IDs.
  4. Fine-tuned Entity Extraction with an External KB Linker: Fine-tuning the LLM first on the entity extraction task, and using it as an entity extractor with an external linker/retriever for mapping the entities to concept IDs.
  5. Comparison with an existing pipeline: How do these methods fare compared to Scispacy, a commonly used library for biomedical text processing?

All code and resources related to this article are made available at this GitHub repository, under the entity_linking folder. Feel free to pull the repository and run the notebooks directly to reproduce these experiments. Please let me know if you have any feedback or observations, or if you find any errors!

To conduct these experiments, we use the Mistral-7B Instruct model [9] as our Large Language Model (LLM). As the medical terminology to link entities against, we use the MeSH terminology. To quote the National Library of Medicine website:

“The Medical Subject Headings (MeSH) thesaurus is a controlled and hierarchically-organized vocabulary produced by the National Library of Medicine. It is used for indexing, cataloging, and searching of biomedical and health-related information.”

We use the BioCreative-V-CDR-Corpus [4,5,6,7,8] for evaluation. This dataset contains annotations of disease and chemical entities, along with their corresponding MeSH IDs. For evaluation purposes, we randomly sample 100 data points from the test set. We use a version of the MeSH KB provided by Scispacy [10,11], which contains information about the MeSH identifiers, such as definitions and entities corresponding to each ID.
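Each record in this KB file is a JSON object keyed by a MeSH concept ID. As a rough illustration of its shape (the definition below is paraphrased and the exact field contents may differ), an entry looks like:

{
    "concept_id": "D009203",
    "canonical_name": "Myocardial Infarction",
    "aliases": "Heart Attack,Myocardial Infarct",
    "definition": "Necrosis of the myocardium caused by an obstruction of the blood supply to the heart."
}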

For performance evaluation, we calculate two metrics. The first relates to the entity extraction performance. The original dataset contains all mentions of entities in the text, annotated at the substring level. A strict evaluation would check whether the algorithm has output all occurrences of all entities. However, we simplify this process for easier evaluation; we lower-case and de-duplicate the entities in the ground truth. We then calculate the Precision, Recall and F1 score for each instance and compute the macro-average for each metric.

Suppose you have a set of actual entities, ground_truth, and a set of entities predicted by a model, pred, for each input text. The true positives TP can be determined by identifying the common elements between pred and ground_truth, essentially by calculating the intersection of these two sets.

For each input, we can then calculate:

precision = len(TP) / len(pred),

recall = len(TP) / len(ground_truth), and

f1 = 2 * precision * recall / (precision + recall)

and finally calculate the macro-average for each metric by summing them up and dividing by the number of datapoints in our test set.

For evaluating the overall entity linking performance, we again calculate the same metrics. In this case, for each input datapoint, we have a set of tuples, where each tuple is an (entity, mesh_id) pair. The metrics are otherwise calculated the same way.
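As a quick sanity check of these definitions, consider a toy example (the entities and IDs below are made up purely for illustration):

# Toy ground truth and predictions as (entity, mesh_id) pairs
ground_truth = {("naloxone", "D009270"), ("hypertension", "D006973")}
pred = {("naloxone", "D009270"), ("hypertension", "D000001")}

tp = pred & ground_truth                             # one correct pair
precision = len(tp) / len(pred)                      # 1 / 2 = 0.5
recall = len(tp) / len(ground_truth)                 # 1 / 2 = 0.5
f1 = 2 * precision * recall / (precision + recall)   # 0.5

Here the second prediction extracts the right entity but links it to the wrong ID, so it counts as a miss at the linking level even though it was a hit at the extraction level.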

Right, let’s kick things off by first defining some helper functions for processing our dataset.

def parse_dataset(file_path):
    """
    Parse the BioCreative Dataset.

    Args:
    - file_path (str): Path to the file containing the documents.

    Returns:
    - list of dict: A list where each element is a dictionary representing a document.
    """
    documents = []
    current_doc = None

    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            line = line.strip()
            if not line:
                continue
            if "|t|" in line:
                if current_doc:
                    documents.append(current_doc)
                id_, title = line.split("|t|", 1)
                current_doc = {'id': id_, 'title': title, 'abstract': '', 'annotations': []}
            elif "|a|" in line:
                _, abstract = line.split("|a|", 1)
                current_doc['abstract'] = abstract
            else:
                parts = line.split("\t")
                if parts[1] == "CID":
                    continue
                annotation = {
                    'text': parts[3],
                    'type': parts[4],
                    'identifier': parts[5]
                }
                current_doc['annotations'].append(annotation)

    if current_doc:
        documents.append(current_doc)

    return documents

def deduplicate_annotations(documents):
    """
    Filter documents to ensure annotation consistency.

    Args:
    - documents (list of dict): The list of documents to be checked.
    """
    for doc in documents:
        doc["annotations"] = remove_duplicates(doc["annotations"])

def remove_duplicates(dict_list):
    """
    Remove duplicate dictionaries from a list of dictionaries.

    Args:
    - dict_list (list of dict): A list of dictionaries from which duplicates are to be removed.

    Returns:
    - list of dict: A list of dictionaries after removing duplicates.
    """
    unique_dicts = []
    seen = set()

    for d in dict_list:
        dict_tuple = tuple(sorted(d.items()))
        if dict_tuple not in seen:
            seen.add(dict_tuple)
            unique_dicts.append(d)

    return unique_dicts

We first parse the dataset from the text files provided in the original dataset. The original dataset consists of the title, abstract, and all entities annotated with their entity type (Disease or Chemical), their substring indices indicating their exact location in the text, along with their MeSH IDs. While processing our dataset, we make a few simplifications. We disregard the substring indices and the entity type. Moreover, we de-duplicate annotations that share the same entity name and MeSH ID. At this stage, we only de-duplicate in a case-sensitive manner, meaning if the same entity appears in both lower and upper case across the document, we retain both instances in our processing so far.
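With these helpers in place, loading and sampling the evaluation split might look like the following (the file name and random seed here are assumptions for illustration; point them at your local copy of the corpus):

import random

# Hypothetical file name; adjust to your local copy of the test split
test_documents = parse_dataset("CDR_TestSet.PubTator.txt")
deduplicate_annotations(test_documents)

# Randomly sample 100 data points for evaluation, as described earlier
random.seed(42)
test_set_subsample = random.sample(test_documents, 100)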

First, we aim to determine whether the LLM already possesses an understanding of MeSH terminology from its pre-training, and whether it can function as a zero-shot entity linker. By zero-shot, we mean the LLM’s capability to directly link entities to their MeSH IDs from biomedical text based on its intrinsic knowledge, without relying on an external KB linker. This hypothesis is not entirely unrealistic, considering the availability of information about MeSH online, which makes it possible that the model encountered MeSH-related information during its pre-training phase. However, even if the LLM was trained on such information, it is unlikely that this alone would enable the model to perform zero-shot entity linking effectively, given the complexity of biomedical terminology and the precision required for accurate entity linking.

To evaluate this, we provide the input text to the LLM and directly prompt it to predict the entities and corresponding MeSH IDs. Additionally, we create a few-shot prompt by sampling three data points from the training dataset. It is important to clarify the distinction in the use of “zero-shot” and “few-shot” here: “zero-shot” refers to the LLM as a whole performing entity linking without prior specific training on this task, while “few-shot” refers to the prompting strategy employed in this context.

LLM as a Zero-Shot Entity Linker (Image by Author)
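The prompt templates and parsing helpers (SYSTEM_PROMPT, build_few_shot_prompt, parse_answer) are defined in the repository. As a rough sketch of what the prompt construction could look like, assuming chat-style messages and accounting for Mistral's requirement that user and assistant turns alternate, it is not the repository's exact implementation:

def build_few_shot_prompt(system_prompt, item, few_shot_examples):
    """A minimal sketch: interleave worked examples as user/assistant turns
    before the actual input. The real template lives in the repository."""
    messages = []
    for i, example in enumerate(few_shot_examples):
        content = example["title"] + " " + example["abstract"]
        if i == 0:
            # Mistral has no system role, so fold the instructions into the first user turn
            content = system_prompt + "\n" + content
        messages.append({"role": "user", "content": content})
        answer = "\n".join(f'{a["text"]}\t{a["identifier"]}' for a in example["annotations"])
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": item["title"] + " " + item["abstract"]})
    return messages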

To calculate our metrics, we define functions for evaluating the performance:

def calculate_entity_metrics(gt, pred):
    """
    Calculate precision, recall, and F1-score for entity recognition.

    Args:
    - gt (list of dict): A list of dictionaries representing the ground truth entities.
      Each dictionary should have a key "text" with the entity text.
    - pred (list of dict): A list of dictionaries representing the predicted entities.
      Similar to `gt`, each dictionary should have a key "text".

    Returns:
    tuple: A tuple containing precision, recall, and F1-score (in that order).
    """
    ground_truth_set = set([x["text"].lower() for x in gt])
    predicted_set = set([x["text"].lower() for x in pred])

    # True positives are predicted items that are in the ground truth
    true_positives = len(predicted_set.intersection(ground_truth_set))

    # Precision calculation
    if len(predicted_set) == 0:
        precision = 0
    else:
        precision = true_positives / len(predicted_set)

    # Recall calculation
    if len(ground_truth_set) == 0:
        recall = 0
    else:
        recall = true_positives / len(ground_truth_set)

    # F1-score calculation
    if precision + recall == 0:
        f1_score = 0
    else:
        f1_score = 2 * (precision * recall) / (precision + recall)

    return precision, recall, f1_score

def calculate_mesh_metrics(gt, pred):
    """
    Calculate precision, recall, and F1-score for matching MeSH (Medical Subject Headings) codes.

    Args:
    - gt (list of dict): Ground truth data
    - pred (list of dict): Predicted data

    Returns:
    tuple: A tuple containing precision, recall, and F1-score (in that order).
    """
    ground_truth = []

    for item in gt:
        mesh_codes = item["identifier"]
        if mesh_codes == "-1":
            mesh_codes = "None"
        mesh_codes_split = mesh_codes.split("|")
        for elem in mesh_codes_split:
            combined_elem = {"entity": item["text"].lower(), "identifier": elem}
            if combined_elem not in ground_truth:
                ground_truth.append(combined_elem)

    predicted = []
    for item in pred:
        mesh_codes = item["identifier"]
        mesh_codes_split = mesh_codes.strip().split("|")
        for elem in mesh_codes_split:
            combined_elem = {"entity": item["text"].lower(), "identifier": elem}
            if combined_elem not in predicted:
                predicted.append(combined_elem)

    # True positives are predicted items that are in the ground truth
    true_positives = len([x for x in predicted if x in ground_truth])

    # Precision calculation
    if len(predicted) == 0:
        precision = 0
    else:
        precision = true_positives / len(predicted)

    # Recall calculation
    if len(ground_truth) == 0:
        recall = 0
    else:
        recall = true_positives / len(ground_truth)

    # F1-score calculation
    if precision + recall == 0:
        f1_score = 0
    else:
        f1_score = 2 * (precision * recall) / (precision + recall)

    return precision, recall, f1_score

Let’s now run the model and get our predictions:

import torch
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", torch_dtype=torch.bfloat16).cuda()
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
model.eval()

# SYSTEM_PROMPT, few_shot_example and parse_answer are defined in the repository
mistral_few_shot_answers = []
for item in tqdm(test_set_subsample):
    few_shot_prompt_messages = build_few_shot_prompt(SYSTEM_PROMPT, item, few_shot_example)
    input_ids = tokenizer.apply_chat_template(few_shot_prompt_messages, tokenize=True, return_tensors="pt").cuda()
    outputs = model.generate(input_ids=input_ids, max_new_tokens=200, do_sample=False)
    # Decode only the newly generated tokens, skipping the prompt:
    # https://github.com/huggingface/transformers/issues/17117#issuecomment-1124497554
    gen_text = tokenizer.batch_decode(outputs.detach().cpu().numpy()[:, input_ids.shape[1]:], skip_special_tokens=True)[0]
    mistral_few_shot_answers.append(parse_answer(gen_text.strip()))
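The predictions can then be scored with the functions defined above; for instance, macro-averaging the F1 scores over the test subsample (the same pattern is reused for the RAG experiments later):

# Macro-averaged scores for the zero-shot run
entity_scores = [calculate_entity_metrics(gt["annotations"], pred)
                 for gt, pred in zip(test_set_subsample, mistral_few_shot_answers)]
macro_f1_entity = sum(x[2] for x in entity_scores) / len(entity_scores)

mesh_scores = [calculate_mesh_metrics(gt["annotations"], pred)
               for gt, pred in zip(test_set_subsample, mistral_few_shot_answers)]
macro_f1_mesh = sum(x[2] for x in mesh_scores) / len(mesh_scores)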

At the entity extraction level, the LLM performs quite well, considering it has not been explicitly fine-tuned for this task. However, its performance as a zero-shot linker is quite poor, with an overall score of less than 1%. This outcome is intuitive, though: the output space of MeSH labels is vast, and accurately mapping entities to a specific MeSH ID is a hard task.

Zero-Shot Entity Extraction and Entity Linking Scores

Retrieval Augmented Generation (RAG) [12] refers to a framework that combines LLMs with an external KB equipped with a querying function, such as a retriever/linker. For each incoming query, the system first retrieves knowledge relevant to the query from the KB using the querying function. It then combines the retrieved knowledge and the query, and provides this combined prompt to the LLM to perform the task. This approach is based on the understanding that LLMs may not have all the necessary knowledge or information to answer an incoming query effectively. Thus, knowledge is injected into the model by querying an external knowledge source.

Using a RAG framework can offer several advantages:

  1. An existing LLM can be utilized for a new domain or task without the need for domain-specific fine-tuning, as the relevant information can be queried and provided to the model through a prompt.
  2. LLMs can sometimes provide incorrect answers (hallucinate) when responding to queries. Employing RAG with LLMs can significantly reduce such hallucinations, as the answers provided by the LLM are more likely to be grounded in facts thanks to the knowledge supplied to it.

Considering that the LLM lacks specific knowledge of MeSH terminology, we investigate whether a RAG setup can improve performance. In this approach, for each input paragraph, we use a BM-25 retriever to query the KB. For each MeSH ID, we have access to a general description of the ID and the entity names associated with it. After retrieval, we inject this information into the model through the prompt for entity linking.

To investigate how the number of retrieved IDs provided as context affects the entity linking process, we run this setup providing the top 10, 30, and 50 documents to the model, and quantify its performance on entity extraction and MeSH concept identification.

LLM with RAG as an Entity Linker (Image by Author)
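As before, the exact RAG prompt template is in the repository. A minimal sketch of how the retrieved MeSH entries could be serialized into the prompt (the field names follow the Scispacy KB format used below; this is not the repository's exact implementation):

def build_rag_prompt(system_prompt, item, relevant_contexts):
    """A minimal sketch: list the retrieved MeSH candidates in the prompt
    alongside the input text. The real template lives in the repository."""
    context_lines = []
    for ctx in relevant_contexts:
        context_lines.append(
            f'{ctx["concept_id"]}: {ctx.get("canonical_name", "")} - {ctx.get("definition", "")}'
        )
    content = (
        system_prompt
        + "\n\nCandidate MeSH IDs:\n" + "\n".join(context_lines)
        + "\n\nText:\n" + item["title"] + " " + item["abstract"]
    )
    return [{"role": "user", "content": content}]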

Let’s first define our BM-25 retriever:

from typing import List
from rank_bm25 import BM25Okapi
from nltk.tokenize import word_tokenize
from tqdm import tqdm

class BM25Retriever:
    """
    A class for retrieving documents using the BM25 algorithm.

    Attributes:
    index (List[List[str]]): A list of [doc_id, doc_text] pairs.
    tokenized_docs (List[List[str]]): Tokenized version of the documents in `index`.
    bm25 (BM25Okapi): An instance of the BM25Okapi model from the rank_bm25 package.
    """

    def __init__(self, docs_with_ids: List[List[str]]):
        """
        Initializes the BM25Retriever with a list of documents.

        Args:
        docs_with_ids (List[List[str]]): A list of [doc_id, doc_text] pairs.
        """
        self.index = docs_with_ids
        self.tokenized_docs = self._tokenize_docs([x[1] for x in self.index])
        self.bm25 = BM25Okapi(self.tokenized_docs)

    def _tokenize_docs(self, docs: List[str]) -> List[List[str]]:
        """
        Tokenizes the documents using NLTK's word_tokenize.

        Args:
        docs (List[str]): A list of documents to be tokenized.

        Returns:
        List[List[str]]: A list of tokenized documents.
        """
        return [word_tokenize(doc.lower()) for doc in docs]

    def query(self, query: str, top_n: int = 10) -> List[str]:
        """
        Queries the BM25 model and retrieves the IDs of the top N documents.

        Args:
        query (str): The query string.
        top_n (int): The number of top documents to retrieve.

        Returns:
        List[str]: The IDs of the top N documents, ranked by BM25 score.
        """
        tokenized_query = word_tokenize(query.lower())
        scores = self.bm25.get_scores(tokenized_query)
        doc_scores_with_ids = [(doc_id, scores[i]) for i, (doc_id, _) in enumerate(self.index)]
        top_doc_ids_and_scores = sorted(doc_scores_with_ids, key=lambda x: x[1], reverse=True)[:top_n]
        return [x[0] for x in top_doc_ids_and_scores]

We now process our KB file and create a BM-25 retriever instance that indexes it. While indexing the KB, we index each ID using a concatenation of its description, aliases, and canonical name.

def process_index(index):
    """
    Processes the initial document index to combine aliases, canonical names, and definitions into a single text index.

    Args:
    - index (Dict): The MeSH knowledge base.

    Returns:
    List[List[str]]: A list of [concept_id, combined_text] pairs.
    """
    processed_index = []
    for key, value in tqdm(index.items()):
        assert type(value["aliases"]) != list
        aliases_text = " ".join(value["aliases"].split(","))
        text_index = (aliases_text + " " + value.get("canonical_name", "")).strip()
        if "definition" in value:
            text_index += " " + value["definition"]
        processed_index.append([value["concept_id"], text_index])
    return processed_index

# read_jsonl_file and process_mesh_kb are helper functions from the repository
mesh_data = read_jsonl_file("mesh_2020.jsonl")
process_mesh_kb(mesh_data)
mesh_data_kb = {x["concept_id"]: x for x in mesh_data}
mesh_data_dict = process_index({x["concept_id"]: x for x in mesh_data})
retriever = BM25Retriever(mesh_data_dict)

mistral_rag_answers = {10: [], 30: [], 50: []}

# SYSTEM_RAG_PROMPT is defined in the repository
for k in [10, 30, 50]:
    for item in tqdm(test_set_subsample):
        relevant_mesh_ids = retriever.query(item["title"] + " " + item["abstract"], top_n=k)
        relevant_contexts = [mesh_data_kb[x] for x in relevant_mesh_ids]
        rag_prompt = build_rag_prompt(SYSTEM_RAG_PROMPT, item, relevant_contexts)
        input_ids = tokenizer.apply_chat_template(rag_prompt, tokenize=True, return_tensors="pt").cuda()
        outputs = model.generate(input_ids=input_ids, max_new_tokens=200, do_sample=False)
        gen_text = tokenizer.batch_decode(outputs.detach().cpu().numpy()[:, input_ids.shape[1]:], skip_special_tokens=True)[0]
        mistral_rag_answers[k].append(parse_answer(gen_text.strip()))

entity_scores_at_k = {}
mesh_scores_at_k = {}

for key, value in mistral_rag_answers.items():
    entity_scores = [calculate_entity_metrics(gt["annotations"], pred) for gt, pred in zip(test_set_subsample, value)]
    macro_precision_entity = sum([x[0] for x in entity_scores]) / len(entity_scores)
    macro_recall_entity = sum([x[1] for x in entity_scores]) / len(entity_scores)
    macro_f1_entity = sum([x[2] for x in entity_scores]) / len(entity_scores)
    entity_scores_at_k[key] = {"macro-precision": macro_precision_entity, "macro-recall": macro_recall_entity, "macro-f1": macro_f1_entity}

    mesh_scores = [calculate_mesh_metrics(gt["annotations"], pred) for gt, pred in zip(test_set_subsample, value)]
    macro_precision_mesh = sum([x[0] for x in mesh_scores]) / len(mesh_scores)
    macro_recall_mesh = sum([x[1] for x in mesh_scores]) / len(mesh_scores)
    macro_f1_mesh = sum([x[2] for x in mesh_scores]) / len(mesh_scores)
    mesh_scores_at_k[key] = {"macro-precision": macro_precision_mesh, "macro-recall": macro_recall_mesh, "macro-f1": macro_f1_mesh}

In general, the RAG setup improves the overall MeSH identification performance compared to the original zero-shot setup. But what is the impact of the number of documents provided as knowledge to the model? We plot the scores as a function of the number of retrieved IDs provided to the model as context.
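The plot can be produced directly from the dictionaries computed above; a minimal matplotlib sketch (the styling is an assumption, not the article's exact figure code):

import matplotlib.pyplot as plt

ks = [10, 30, 50]
# Macro-F1 for entity extraction and MeSH identification as a function of k
plt.plot(ks, [entity_scores_at_k[k]["macro-f1"] for k in ks], marker="o", label="Entity extraction")
plt.plot(ks, [mesh_scores_at_k[k]["macro-f1"] for k in ks], marker="o", label="MeSH identification")
plt.xlabel("Number of retrieved MeSH IDs provided as context")
plt.ylabel("Macro-F1")
plt.legend()
plt.show()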


