NegBio documentation¶
NegBio is a high-performance NLP tool for negation and uncertainty detection in radiology reports.
Beloved Features¶
- Patterns on both universal dependency graph and regular expressions
- Creating user patterns
- Transparency
- Multiprocessing
NegBio officially supports Python>=3.6.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. The package should install successfully on Linux (and possibly macOS).
Installation of NegBio¶
This part of the documentation covers the installation of NegBio. The first step to using any software package is getting it properly installed.
Prerequisites¶
- python >=3.6
- Linux
- Java
Note: since v1.0, MetaMap is not required. You can use the vocabularies (e.g., patterns/cxr14_phrases_v2.yml) instead.
Installation of MetaMap¶
Follow these steps only if you want to use MetaMap to extract findings.
Download the full version of MetaMap and extract it into a directory called public_mm.
Install MetaMap locally. Installation instructions can be found at https://metamap.nlm.nih.gov/Installation.shtml.
cd public_mm
./bin/install.sh
Start the servers.
./bin/skrmedpostctl start
./bin/wsdserverctl start
Getting the source code¶
NegBio is actively developed on GitHub, where the code is always available.
You can clone the public repository
$ git clone https://github.com/ncbi-nlp/NegBio.git
$ cd NegBio
Once you have a copy of the source, you can prepare a virtual environment
$ conda create --name negbio python=3.6
$ source activate negbio
$ pip install --upgrade pip setuptools
or
$ virtualenv --python=/usr/bin/python3.6 negbio_env
$ source negbio_env/bin/activate
Finally, you can install the required packages:
$ pip install -r requirements3.txt
Quickstart¶
Eager to get started? This page gives a good introduction to getting started with NegBio.
First, make sure that NegBio is installed.
Preparing the dataset¶
The inputs of NegBio should be in the BioC format.
Briefly, a BioC-format file is an XML document that follows the BioC data exchange format and the BioC data classes. Each file contains a group of documents. Each document should have a unique id and one or more passages. Each passage should have (1) a non-overlapping offset that specifies the location of the passage with respect to the whole document, and (2) the original text of the passage.
The text can contain special characters such as newlines.
<?xml version='1.0' encoding='utf-8' standalone='yes'?>
<collection>
<source>ChestXray-NIHCC</source>
<date>2017-05-31</date>
<key></key>
<document>
<id>0001</id>
<passage>
<offset>0</offset>
<text>findings:
chest: four images:
right picc with tip within the upper svc.
probable enlargement of the main pulmonary artery.
mild cardiomegaly.
no evidence of focal infiltrate, effusion or pneumothorax.
dictating </text>
</passage>
</document>
<document>
<id>0002</id>
<passage>
<offset>0</offset>
<text>findings: pa and lat cxr at 7:34 p.m.. heart and mediastinum are
stable. lungs are unchanged. air- filled cystic changes. no
pneumothorax. osseous structures unchanged scoliosis
impression: stable chest.
dictating </text>
</passage>
</document>
</collection>
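The structure above can be checked programmatically. The following is a minimal sketch using only Python's standard library, not NegBio's own reader: the helper names `build_collection` and `read_passages` are hypothetical, the report text is shortened, and measuring passage length in characters is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET


def build_collection(reports, source="ChestXray-NIHCC", date="2017-05-31"):
    """Build a BioC-style collection from {document id: report text}."""
    collection = ET.Element("collection")
    ET.SubElement(collection, "source").text = source
    ET.SubElement(collection, "date").text = date
    ET.SubElement(collection, "key").text = ""
    for doc_id, text in reports.items():
        document = ET.SubElement(collection, "document")
        ET.SubElement(document, "id").text = doc_id
        passage = ET.SubElement(document, "passage")
        # This sketch stores each report as a single passage at offset 0.
        ET.SubElement(passage, "offset").text = "0"
        ET.SubElement(passage, "text").text = text
    return collection


def read_passages(collection):
    """Return {document id: [(offset, text), ...]}, checking ids and offsets."""
    result = {}
    for document in collection.iter("document"):
        doc_id = document.findtext("id")
        if doc_id in result:
            raise ValueError(f"duplicate document id: {doc_id}")
        passages = sorted(
            (int(p.findtext("offset")), p.findtext("text") or "")
            for p in document.iter("passage")
        )
        # Passages must not overlap: each one has to end before the next starts.
        for (start, text), (next_start, _) in zip(passages, passages[1:]):
            if start + len(text) > next_start:
                raise ValueError(f"overlapping passages in document {doc_id}")
        result[doc_id] = passages
    return result


reports = {
    "0001": "findings:\nmild cardiomegaly.",
    "0002": "impression: stable chest.",
}
xml_text = ET.tostring(build_collection(reports), encoding="unicode")
passages = read_passages(ET.fromstring(xml_text))
print(sorted(passages))
```

Round-tripping through `ET.tostring` and `ET.fromstring` shows that newlines inside the passage text survive serialization.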
Running NegBio¶
$ export OUTPUT_DIR=examples-local
$ export OUTPUT_LABELS=examples-local/labels.csv
$ export INPUT_FILES="examples/1.xml examples/2.xml"
$ bash examples/run_negbio_examples.sh
You can also put all reports in one folder and set INPUT_FILES=examples/*.xml
After the script finishes, you can find the labels at examples-local/labels.csv. It contains three rows, one for each of the three documents. Each row has multiple findings, such as Atelectasis and Cardiomegaly. The definitions of the findings can be found at patterns/cxr14_phrases_v2.yml. In the label file, 1 means a positive finding, 0 means a negative finding, and -1 means an uncertain finding.
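The label encoding can be decoded with a few lines of Python. This is only a sketch: the sample rows and the `id` column name are made up for illustration, and treating an empty cell as "finding not mentioned" is an assumption, so check the header of your own labels.csv before reusing it.

```python
import csv
import io

# Hypothetical excerpt of a label file; the real column layout may differ.
SAMPLE = """id,Atelectasis,Cardiomegaly,Pneumothorax
0001,,1,0
0002,-1,,0
"""

# Encoding described in the text: 1 positive, 0 negative, -1 uncertain.
MEANING = {"1": "positive", "0": "negative", "-1": "uncertain"}


def describe_labels(csv_text):
    """Map each report id to {finding: 'positive' | 'negative' | 'uncertain'}."""
    described = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        report_id = row.pop("id")
        # Empty cells are skipped (assumed to mean the finding was not mentioned).
        described[report_id] = {
            finding: MEANING[value.strip()]
            for finding, value in row.items()
            if value and value.strip() in MEANING
        }
    return described


for report_id, findings in describe_labels(SAMPLE).items():
    print(report_id, findings)
```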
Besides the final label file, 6 folders contain the intermediate files of each step. For example, the ssplit folder contains the sentences, and the parse folder contains the parse tree of each sentence. The content and format of these files should be self-explanatory.
Ready for more? Check out the Advanced Usage section.
Advanced Usage¶
This document covers some of NegBio's more advanced features.
Running the pipeline step-by-step¶
The step-by-step pipeline generates all intermediate documents, so you can easily rerun a single step if it produces errors. The steps are:
- text2bioc combines text into a BioC XML file.
- normalize removes noisy text such as [**Patterns**].
- section_split splits the report into sections.
- ssplit splits text into sentences.
- Named entity recognition
  - dner_mm detects UMLS concepts using MetaMap.
  - dner_regex detects concepts using the vocabularies such as patterns/cxr14_phrases_v2.yml.
- parse parses each sentence using the Bllip parser.
- ptb2ud converts the parse tree to universal dependencies using the Stanford converter.
- neg2 detects negative and uncertain findings.
- cleanup removes intermediate information.
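Because each step consumes the output of the previous one, the steps above can be laid out as a chain of directories. The sketch below only plans that chain and runs nothing. Naming each output folder after its step, and using dner_regex rather than dner_mm, are assumptions of this sketch; `chain_steps` is a hypothetical helper.

```python
from pathlib import PurePosixPath

# Step names in the order listed above, with one of the two NER steps chosen.
STEPS = [
    "text2bioc", "normalize", "section_split", "ssplit",
    "dner_regex", "parse", "ptb2ud", "neg2", "cleanup",
]


def chain_steps(steps, input_dir, work_dir):
    """Pair each step with the directory it reads from and the one it writes to."""
    current = PurePosixPath(input_dir)
    plan = []
    for step in steps:
        output = PurePosixPath(work_dir) / step
        plan.append((step, str(current), str(output)))
        current = output  # the next step reads what this step wrote
    return plan


for step, source, target in chain_steps(STEPS, "reports", "out"):
    print(f"{step}: {source} -> {target}")
```

To rerun one step, reuse its row from the plan: the input directory of that step is still on disk from the previous run.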
General arguments¶
The general command is
python negbio/negbio_pipeline.py <command> [options] --output=/path/to/output/dir /path/to/inputs
The <command> must be one of the steps above. The --output option specifies the output directory. The inputs can be one or multiple files.
Other options include:
- --suffix: Append an additional SUFFIX to file names.
- --verbose: Print more information about progress.
- --workers: Number of threads.
- --files_per_worker: Number of input files per worker.
- --overwrite: Overwrite the output file.
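When the pipeline is driven from a script, it can help to assemble the command as an argument list. The following sketch builds such a list from the general command and options described above without executing it. `build_command` is a hypothetical helper, and the `--workers=<n>` and `--suffix=<s>` value syntax is an assumption; run the tool's help output to confirm the exact form.

```python
import shlex


def build_command(step, inputs, output, workers=None, suffix=None,
                  verbose=False, overwrite=False):
    """Assemble the argument list for one pipeline step (nothing is executed)."""
    argv = ["python", "negbio/negbio_pipeline.py", step]
    if suffix is not None:
        argv.append(f"--suffix={suffix}")    # assumed value syntax
    if workers is not None:
        argv.append(f"--workers={workers}")  # assumed value syntax
    if verbose:
        argv.append("--verbose")
    if overwrite:
        argv.append("--overwrite")
    argv.append(f"--output={output}")
    argv.extend(inputs)
    return argv


command = build_command("ssplit", ["a.xml", "b.xml"], "out/ssplit",
                        workers=4, overwrite=True)
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would execute it. Keeping the arguments as a list avoids shell quoting problems with file names that contain spaces.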
Convert text files to BioC format¶
You can skip this step if the reports are already in the BioC format. If you have lots of reports, it is recommended to put them into several BioC files, for example, 100 reports per BioC file.
export BIOC_DIR=/path/to/bioc
export TEXT_DIR=/path/to/text
python negbio/negbio_pipeline.py text2bioc --output=$BIOC_DIR/test.xml $TEXT_DIR/*.txt
Another commonly used command is:
find $TEXT_DIR -type f | python negbio/negbio_pipeline.py text2bioc --output=$BIOC_DIR
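The advice to put roughly 100 reports into each BioC file amounts to batching the list of input files. A minimal sketch follows; the file names are hypothetical, and each batch would then be passed to a separate text2bioc call with its own output file.

```python
def batches(items, size=100):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical report file names, used only for illustration.
report_files = [f"report_{i:04d}.txt" for i in range(250)]
for index, batch in enumerate(batches(report_files)):
    print(f"batch_{index:03d}.xml <- {len(batch)} reports")
```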
Normalize reports¶
This step removes noisy text such as [**Patterns**] in the MIMIC-III reports.
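To illustrate the kind of cleaning involved, the sketch below strips bracketed placeholders of the form [** ... **] with a regular expression. It is an assumption-laden stand-in rather than the normalize step's actual implementation, and `strip_placeholders` is a hypothetical name.

```python
import re

# Matches placeholders of the form [** ... **] on a single line.
PLACEHOLDER = re.compile(r"\[\*\*.*?\*\*\]")


def strip_placeholders(text):
    """Remove every [** ... **] placeholder from the text."""
    return PLACEHOLDER.sub("", text)


cleaned = strip_placeholders("seen by dr. [**Last Name**] on [**2101-3-4**].")
print(cleaned)
```

Removing text shifts character positions, so a cleaner that needs to keep offsets stable could replace each placeholder with spaces of the same length instead.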
Split each report into sections¶
This step splits the report into sections.
The default section titles are in patterns/section_titles.txt.
You can specify customized section titles using the option --pattern=<file>.
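As an illustration of title-based splitting, the sketch below cuts a report at lines that begin with a known section title followed by a colon. The two titles are hypothetical examples rather than the contents of patterns/section_titles.txt, and the matching rule is an assumption of this sketch, not NegBio's implementation.

```python
import re

# Hypothetical titles; the real defaults are in patterns/section_titles.txt.
TITLES = ["findings", "impression"]


def split_sections(text, titles=TITLES):
    """Return (title, start offset, section text) tuples in document order."""
    alternatives = "|".join(re.escape(title) for title in titles)
    pattern = re.compile(
        r"^[ \t]*(" + alternatives + r")[ \t]*:",
        re.IGNORECASE | re.MULTILINE,
    )
    matches = list(pattern.finditer(text))
    sections = []
    for index, match in enumerate(matches):
        start = match.start(1)
        # A section runs until the next title, or to the end of the report.
        if index + 1 < len(matches):
            end = matches[index + 1].start(1)
        else:
            end = len(text)
        sections.append((match.group(1).lower(), start, text[start:end]))
    return sections


report = "findings: lungs are unchanged.\nimpression: stable chest."
for title, start, body in split_sections(report):
    print(title, start, repr(body))
```

Text that appears before the first recognized title is dropped by this sketch.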
Split each report into sentences¶
This step splits the report into sentences using the NLTK splitter (nltk.tokenize.sent_tokenize).
Named entity recognition¶
This step recognizes named entities (e.g., findings, diseases, devices) from the reports. In general, MetaMap is more comprehensive, while the vocabulary is more accurate on 14 types of findings. MetaMap is also slower and more prone to failure than the vocabulary.
Using MetaMap¶
The first version of NegBio uses MetaMap to detect UMLS concepts. Please make sure that both skrmedpostctl and wsdserverctl are started.
MetaMap attempts to extract all UMLS concepts.
Many of them are irrelevant to radiology.
Therefore, it is better to specify the UMLS concepts of interest via --cuis=<file>.
$ export METAMAP_BIN=METAMAP_HOME/bin/metamap16
$ negbio_pipeline dner_mm --metamap=$METAMAP_BIN --output=$OUTPUT_DIR $INPUT_DIR/*.xml
Using vocabularies¶
NegBio also integrates CheXpert’s vocabulary-based method to recognize the presence of 14 observations.
All vocabularies can be found in the patterns folder.
Each file in the folder represents one type of named entity with various text expressions. You can specify customized patterns via --phrases_file=<file>.
Parse the sentence¶
This step parses each sentence using the Bllip parser.
Convert the parse tree to UD¶
This step converts the parse tree to universal dependencies using the Stanford converter.
Detect negative and uncertain findings¶
This step detects negative and uncertain findings using patterns.
By default, the program uses the negation and uncertainty patterns in the patterns folder.
However, you can specify customized patterns using options such as --neg-patterns=<file>.
Patterns on the dependency graph¶
The pattern is a semgrex-type pattern for matching nodes in the dependency graph.
Currently, we only support the < and > operations.
A detailed grammar specification (using PLY, Python Lex-Yacc) can be found in ngrex/parser.py.
Since v2.0, NegBio integrates the CheXpert algorithms. NegBio utilizes a 3-phase pipeline consisting of pre-negation uncertainty, negation, and post-negation uncertainty (Irvin et al., 2019). Each phase consists of rules which are matched against the mention; if a match is found, then the mention is classified accordingly (as uncertain in the first or third phase, and as negative in the second phase). If a mention is not matched in any of the phases, it is classified as positive.
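The control flow of the three phases can be sketched as follows. The rules here are toy regular expressions applied to a whole sentence, chosen only to make the ordering visible. They are not the CheXpert or NegBio rule sets, which are matched against the mention, and `classify` is a hypothetical name.

```python
import re

# Toy stand-ins for the three rule sets, for illustration only.
PRE_NEGATION_UNCERTAINTY = [re.compile(r"\bnot excluded\b")]
NEGATION = [re.compile(r"\b(?:no|not|without)\b")]
POST_NEGATION_UNCERTAINTY = [re.compile(r"\b(?:probable|possible)\b")]


def classify(sentence):
    """Apply the three phases in order; the first phase that matches decides."""
    text = sentence.lower()
    if any(rule.search(text) for rule in PRE_NEGATION_UNCERTAINTY):
        return "uncertain"
    if any(rule.search(text) for rule in NEGATION):
        return "negative"
    if any(rule.search(text) for rule in POST_NEGATION_UNCERTAINTY):
        return "uncertain"
    return "positive"  # no rule matched in any phase


for sentence in [
    "pneumothorax is not excluded",
    "no evidence of focal infiltrate",
    "probable enlargement of the main pulmonary artery",
    "mild cardiomegaly",
]:
    print(f"{classify(sentence):9s} {sentence}")
```

The first sentence shows why the order matters: it contains "not", so it would be labeled negative if the negation phase ran before the pre-negation uncertainty phase.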
You can specify customized patterns via --neg-patterns=<file>, --pre-uncertainty-patterns=<file>, and --post-uncertainty-patterns=<file>. Each file is a YAML-format file that consists of a list of patterns. Each pattern must have an id field and a pattern field. This allows NegBio to associate each pattern with the detected negation/uncertainty, to maximize transparency. Examples can be found in the patterns folder.
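A custom pattern file can be sanity-checked before it is passed to the pipeline. The sketch below validates entries that have already been loaded from YAML into Python dictionaries (the loading itself is left out). The placeholder pattern strings are not real patterns, and requiring ids to be unique is an assumption of this sketch; `index_patterns` is a hypothetical helper.

```python
def index_patterns(entries):
    """Check that every entry has id and pattern fields; return {id: pattern}."""
    indexed = {}
    for position, entry in enumerate(entries):
        missing = {"id", "pattern"} - set(entry)
        if missing:
            raise ValueError(f"entry {position} is missing {sorted(missing)}")
        if entry["id"] in indexed:
            # Unique ids keep each detection traceable to a single pattern.
            raise ValueError(f"duplicate pattern id: {entry['id']}")
        indexed[entry["id"]] = entry["pattern"]
    return indexed


# Placeholder entries; the pattern strings stand in for real pattern text.
entries = [
    {"id": "example_1", "pattern": "PATTERN_TEXT_1"},
    {"id": "example_2", "pattern": "PATTERN_TEXT_2"},
]
print(index_patterns(entries))
```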
Regular expression patterns¶
NegBio also allows regular expressions to be used to match simple cases. This can also speed up detection, because pattern matching on the dependency graph is relatively slow. NegBio first uses regular expressions to match the text. If no match is found, semgrex is then used.
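The "cheap check first, expensive check only if needed" strategy can be sketched as below. The function `detect`, the rule ids, and `slow_graph_match` (a stand-in for dependency-graph matching) are all hypothetical and do not reflect NegBio's internal API.

```python
import re


def detect(text, regex_patterns, graph_match):
    """Try the fast regex patterns first; fall back to the slower matcher."""
    for pattern_id, regex in regex_patterns:
        if regex.search(text):
            return pattern_id, "regex"
    pattern_id = graph_match(text)  # reached only when no regex matched
    if pattern_id is not None:
        return pattern_id, "graph"
    return None


calls = []  # records every use of the slow matcher


def slow_graph_match(text):
    """Hypothetical stand-in for semgrex matching on the dependency graph."""
    calls.append(text)
    return "graph_rule_1" if "free of" in text else None


REGEX_PATTERNS = [("regex_rule_1", re.compile(r"\bno (?:evidence of )?\w+"))]
```

Returning the id of the pattern that fired, together with the stage that matched, keeps the result traceable to a specific rule.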
You can specify customized patterns via --neg-regex-patterns=<file> and --uncertainty-regex-patterns=<file>. Each file is a YAML-format file that consists of a list of patterns. Each pattern must have an id field and a pattern field. Examples can be found in the patterns folder.
Clean up intermediate information¶
This step removes intermediate information (sentence annotations) from the BioC files.
NegBio Developer Guide¶
Create this documentation¶
$ pip install Sphinx sphinx_rtd_theme recommonmark
$ cd docs
$ make html
Testing the code¶
$ python -m pytest tests
License¶
PUBLIC DOMAIN NOTICE
National Center for Biotechnology Information
This software/database is a “United States Government Work” under the terms of the United States Copyright Act. It was written as part of the author’s official duties as a United States Government employee and thus cannot be copyrighted. This software/database is freely available to the public for use. The National Library of Medicine and the U.S. Government have not placed any restriction on its use or reproduction.
Although all reasonable efforts have been taken to ensure the accuracy and reliability of the software and data, the NLM and the U.S. Government do not and cannot warrant the performance or results that may be obtained by using this software or data. The NLM and the U.S. Government disclaim all warranties, express or implied, including warranties of performance, merchantability or fitness for any particular purpose.
Please cite the author in any work or product based on these materials:
Peng Y, Wang X, Lu L, Bagheri M, Summers RM, Lu Z. NegBio: a high-performance tool for negation and uncertainty detection in radiology reports. AMIA 2018 Informatics Summit. 2018, 188-196.
Wang X, Peng Y, Lu L, Bagheri M, Lu Z, Summers R. ChestX-ray8: Hospital-scale Chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017, 2097-2106.
Contributing¶
When contributing to this repository, please first discuss the change you wish to make via issue, email, or any other method with the owners of this repository before making a change.
This project adheres to the Contributor Covenant Code of Conduct.
Maintainers¶
NegBio is maintained with :heart: by:
- @yfpeng
See also the list of contributors who participated in this project.
Acknowledgments¶
This work was supported by the Intramural Research Programs of the National Institutes of Health, National Library of Medicine and Clinical Center.
We are grateful to the authors of NegEx, MetaMap, Stanford CoreNLP, Bllip parser, and CheXpert labeler for making their software tools publicly available.
We thank Dr. Alexis Allot for the helpful discussion.
Disclaimer¶
This tool shows the results of research conducted in the Computational Biology Branch, NCBI. The information produced on this website is not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. Individuals should not change their health behavior solely on the basis of information produced on this website. NIH does not independently verify the validity or utility of the information produced by this tool. If you have questions about the information produced on this website, please see a health care professional. More information about NCBI’s disclaimer policy is available.
Reference¶
- Peng Y, Wang X, Lu L, Bagheri M, Summers RM, Lu Z. NegBio: a high-performance tool for negation and uncertainty detection in radiology reports. AMIA 2018 Informatics Summit. 2018, 188-196.
- Wang X, Peng Y, Lu L, Bagheri M, Lu Z, Summers R. ChestX-ray8: Hospital-scale Chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017, 2097-2106.