Advanced Usage¶
This document covers some of NegBio's more advanced features.
Running the pipeline step-by-step¶
The step-by-step pipeline generates all intermediate documents, so you can easily rerun a single step if it fails. The steps are:
- text2bioc combines text into a BioC XML file.
- normalize removes noisy text such as [**Patterns**].
- section_split splits the report into sections.
- ssplit splits text into sentences.
- Named entity recognition
  - dner_mm detects UMLS concepts using MetaMap.
  - dner_regex detects concepts using vocabularies such as patterns/cxr14_phrases_v2.yml.
- parse parses sentences using the Bllip parser.
- ptb2ud converts the parse tree to universal dependencies using the Stanford converter.
- neg2 detects negative and uncertain findings.
- cleanup removes intermediate information.
General arguments¶
The general command is
python negbio/negbio_pipeline.py <command> [options] --output=/path/to/output/dir /path/to/inputs
The <command> must be one of the steps above. The --output option specifies the output directory. The input can be one or more files.
Other options include
- --suffix: Append an additional SUFFIX to file names.
- --verbose: Print more information about progress.
- --workers: Number of threads.
- --files_per_worker: Number of input files per worker.
- --overwrite: Overwrite the output file.
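The steps and the general command form can be combined into a small driver. The following is a minimal sketch, not part of NegBio: the helper name build_commands, the one-directory-per-step layout under work_dir, and the choice of dner_regex over dner_mm are assumptions made for illustration. It only builds the argument lists and does not run anything.

```python
from pathlib import Path

# Step names as listed above; dner_regex is chosen here instead of dner_mm.
STEPS = ["normalize", "section_split", "ssplit", "dner_regex",
         "parse", "ptb2ud", "neg2", "cleanup"]


def build_commands(bioc_dir, work_dir, steps=STEPS):
    """Return one argv list per step; each step reads the previous step's output.

    Assumption: every step writes into its own sub-directory of work_dir,
    and the next step takes that directory's *.xml files as input.
    """
    commands = []
    input_dir = Path(bioc_dir)
    for step in steps:
        output_dir = Path(work_dir) / step
        commands.append([
            "python", "negbio/negbio_pipeline.py", step,
            "--output=%s" % output_dir,
            str(input_dir / "*.xml"),  # glob, to be expanded by a shell
        ])
        input_dir = output_dir
    return commands


if __name__ == "__main__":
    for argv in build_commands("/path/to/bioc", "/path/to/work"):
        print(" ".join(argv))
```

Keeping each step's output in a separate directory is what makes it possible to rerun a single step without repeating the earlier ones.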
Convert text files to BioC format¶
You can skip this step if the reports are already in the BioC format. If you have lots of reports, it is recommended to put them into several BioC files, for example, 100 reports per BioC file.
export BIOC_DIR=/path/to/bioc
export TEXT_DIR=/path/to/text
python negbio/negbio_pipeline.py text2bioc --output=$BIOC_DIR/test.xml $TEXT_DIR/*.txt
Another commonly used command is:
find $TEXT_DIR -type f | python negbio/negbio_pipeline.py text2bioc --output=$BIOC_DIR
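The recommendation to put roughly 100 reports into each BioC file can be sketched as follows. This is an illustrative helper, not part of NegBio: the function names and the reports_0000.xml naming scheme are assumptions, and the code only builds text2bioc argument lists in the form shown above.

```python
def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


def text2bioc_commands(text_files, bioc_dir, reports_per_file=100):
    """Build one text2bioc argv list per batch of report files."""
    commands = []
    for n, batch in enumerate(chunk(sorted(text_files), reports_per_file)):
        # Hypothetical naming scheme for the per-batch BioC files.
        output = "%s/reports_%04d.xml" % (bioc_dir, n)
        commands.append(["python", "negbio/negbio_pipeline.py", "text2bioc",
                         "--output=%s" % output] + batch)
    return commands
```

Each resulting list can be passed to subprocess.run once the paths point at real files.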
Normalize reports¶
This step removes the noisy text such as [**Patterns**] in the MIMIC-III reports.
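To illustrate the kind of text this step targets, here is a minimal sketch that masks bracketed placeholders with a regular expression. It is not NegBio's implementation: the function name is hypothetical, and replacing each placeholder with spaces of the same length (rather than deleting it) is a design choice made here so that character offsets stay unchanged.

```python
import re

# Matches de-identification placeholders such as [**Patterns**] (non-greedy).
DEID = re.compile(r"\[\*\*.*?\*\*\]")


def mask_placeholders(text):
    """Replace each placeholder with spaces of the same length."""
    return DEID.sub(lambda m: " " * len(m.group(0)), text)
```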
Split each report into sections¶
This step splits the report into sections.
The default section titles are in patterns/section_titles.txt.
You can specify customized section titles using the option --pattern=<file>.
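The idea of splitting on section titles can be sketched as below. This is a simplified illustration, not NegBio's code: the function name is hypothetical, the titles are passed as a plain list, and a title is assumed to start a line and end with a colon. The actual format of the section title file is not shown here.

```python
import re


def split_sections(text, titles):
    """Return (title, body) pairs; text before the first title gets title None."""
    alternation = "|".join(re.escape(t) for t in titles)
    heading = re.compile(r"^[ \t]*(%s)[ \t]*:" % alternation,
                         re.IGNORECASE | re.MULTILINE)
    matches = list(heading.finditer(text))
    if not matches:
        return [(None, text.strip())]
    sections = []
    preamble = text[:matches[0].start()].strip()
    if preamble:
        sections.append((None, preamble))
    for i, m in enumerate(matches):
        # A section runs until the next heading or the end of the report.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((m.group(1), text[m.end():end].strip()))
    return sections
```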
Split each report into sentences¶
This step splits the report into sentences using the NLTK splitter (nltk.tokenize.sent_tokenize).
Named entity recognition¶
This step recognizes named entities (e.g., findings, diseases, devices) in the reports. In general, MetaMap is more comprehensive, while the vocabulary-based approach is more accurate on 14 types of findings. MetaMap is also slower and more fragile than the vocabulary-based approach.
Using MetaMap¶
The first version of NegBio uses MetaMap to detect UMLS concepts. Please make sure that both skrmedpostctl and wsdserverctl are started.
MetaMap attempts to extract all UMLS concepts.
Many of them are irrelevant to radiology.
Therefore, it is better to specify the UMLS concepts of interest via --cuis=<file>.
$ export METAMAP_BIN=METAMAP_HOME/bin/metamap16
$ negbio_pipeline dner_mm --metamap=$METAMAP_BIN --output=$OUTPUT_DIR $INPUT_DIR/*.xml
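The effect of restricting MetaMap output to concepts of interest can be sketched as a simple allow-list filter. This is an illustration only: the function names are hypothetical, and the one-CUI-per-line layout with # comments is an assumption, since the format of the --cuis file is not described here.

```python
def load_cuis(lines):
    """Parse lines into a set of CUIs, skipping blank lines and # comments."""
    cuis = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            cuis.add(line.split()[0])
    return cuis


def keep_concepts(concepts, cuis):
    """Keep only the (cui, text) pairs whose CUI is in the allow-list."""
    return [c for c in concepts if c[0] in cuis]


if __name__ == "__main__":
    # CUIs below are for illustration; check them against your UMLS release.
    allowed = load_cuis(["# findings of interest", "C0032285", "C0032326", ""])
    detected = [("C0032285", "pneumonia"), ("C0030705", "patient")]
    print(keep_concepts(detected, allowed))
```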
Using vocabularies¶
NegBio also integrates CheXpert's method, which uses vocabularies to recognize the presence of 14 observations.
All vocabularies can be found in the patterns folder.
Each file in the folder represents one type of named entity with various text expressions. You can specify customized patterns via --phrases_file=<file>.
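Vocabulary-based recognition boils down to looking up known phrases in the report text. The sketch below shows the idea with a hand-written dictionary. It is not NegBio's or CheXpert's implementation: the function name and the observation-to-phrases mapping are assumptions, and the layout of the real vocabulary files may differ.

```python
import re


def find_mentions(text, vocabulary):
    """Return (observation, start, end) for every phrase occurrence, sorted by offset."""
    mentions = []
    for observation, phrases in vocabulary.items():
        for phrase in phrases:
            # Whole-word, case-insensitive match of the literal phrase.
            pattern = re.compile(r"\b%s\b" % re.escape(phrase), re.IGNORECASE)
            for m in pattern.finditer(text):
                mentions.append((observation, m.start(), m.end()))
    return sorted(mentions, key=lambda item: (item[1], item[2]))


if __name__ == "__main__":
    vocabulary = {
        "Pneumothorax": ["pneumothorax"],
        "Pleural Effusion": ["pleural effusion", "effusion"],
    }
    print(find_mentions("No pneumothorax or pleural effusion.", vocabulary))
```

Note that overlapping phrases ("pleural effusion" and "effusion") both produce a mention in this sketch; a real system would need a policy for merging them.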
Parse the sentence¶
This step parses each sentence using the Bllip parser.
Convert the parse tree to UD¶
This step converts the parse tree to universal dependencies using the Stanford converter.
Detect negative and uncertain findings¶
This step detects negative and uncertain findings using patterns.
By default, the program uses the negation and uncertainty patterns in the patterns folder.
However, you can specify customized patterns with options such as --neg-patterns=<file>.
Patterns on the dependency graph¶
Each pattern is a semgrex-type pattern for matching nodes in the dependency graph.
Currently, we only support < and > operations.
A detailed grammar specification (using PLY, Python Lex-Yacc) can be found in ngrex/parser.py.
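To make the two operations concrete, the toy example below spells out their meaning on a hand-written dependency graph, assuming the usual Stanford semgrex convention: A <rel B means A is the dependent of B, and A >rel B means A is the governor of B. The edge list and helper names are hypothetical, and this is not the ngrex implementation.

```python
# Each edge is (governor, relation, dependent) for "No pleural effusion is seen."
EDGES = [
    ("effusion", "neg", "no"),
    ("effusion", "amod", "pleural"),
    ("seen", "nsubjpass", "effusion"),
]


def is_dependent_of(node, relation, other, edges):
    """Semgrex-style `node <relation other`: node is the dependent, other the governor."""
    return (other, relation, node) in edges


def is_governor_of(node, relation, other, edges):
    """Semgrex-style `node >relation other`: node is the governor, other the dependent."""
    return (node, relation, other) in edges
```

With these edges, "effusion" governs "no" through the neg relation, which is the kind of configuration a negation pattern looks for.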
Since v2.0, NegBio integrates the CheXpert algorithms. NegBio utilizes a 3-phase pipeline consisting of pre-negation uncertainty, negation, and post-negation uncertainty (Irvin et al., 2019). Each phase consists of rules which are matched against the mention; if a match is found, then the mention is classified accordingly (as uncertain in the first or third phase, and as negative in the second phase). If a mention is not matched in any of the phases, it is classified as positive.
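The order of the three phases can be sketched as a simple cascade. This is a minimal illustration under stated assumptions: plain regular expressions stand in for the dependency-graph rules, and the function name, rule ids, and rules themselves are invented for the example.

```python
import re


def classify(sentence, pre_uncertainty, negation, post_uncertainty):
    """Return (label, rule_id) for a mention's sentence using the three-phase order."""
    phases = [
        ("uncertain", pre_uncertainty),   # phase 1: pre-negation uncertainty
        ("negative", negation),           # phase 2: negation
        ("uncertain", post_uncertainty),  # phase 3: post-negation uncertainty
    ]
    for label, rules in phases:
        for rule_id, pattern in rules:
            if re.search(pattern, sentence, re.IGNORECASE):
                return label, rule_id
    # No rule matched in any phase: the mention is positive.
    return "positive", None


if __name__ == "__main__":
    pre = [("pre1", r"\bnot excluded\b")]
    neg = [("neg1", r"\b(no|not)\b")]
    post = [("post1", r"\bmay\b")]
    for s in ["Pneumonia is not excluded.", "No pneumothorax.",
              "May represent atelectasis.", "Large pleural effusion."]:
        print(s, classify(s, pre, neg, post))
```

The first sentence shows why uncertainty is checked before negation: "not excluded" would otherwise be caught by the negation rule.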
You can specify customized patterns via --neg-patterns=<file>, --pre-uncertainty-patterns=<file>, and --post-uncertainty-patterns=<file>. Each file is a YAML file that consists of a list of patterns. Each pattern must have an id field and a pattern field. This allows NegBio to associate each detected negation or uncertainty with the pattern that triggered it, which maximizes transparency. Examples can be found in the patterns folder.
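Before passing a custom file to NegBio, it can help to check that every entry has the two required fields. The sketch below assumes the file has already been loaded by a YAML parser into a list of dictionaries; the function name is hypothetical and the check is not part of NegBio.

```python
def validate_patterns(entries):
    """Return a list of error messages for entries lacking `id` or `pattern`."""
    errors = []
    for index, entry in enumerate(entries):
        for field in ("id", "pattern"):
            # A valid entry is a mapping with a non-empty value for each field.
            if not isinstance(entry, dict) or not entry.get(field):
                errors.append("entry %d: missing '%s'" % (index, field))
    return errors
```

An empty result means every entry carries both fields; the pattern text itself is not parsed or checked here.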
Regular expression patterns¶
NegBio also allows regular expressions to be used for simple cases. This can speed up detection, because pattern matching on the dependency graph is relatively slow. NegBio first tries to match the text with regular expressions; if no match is found, it falls back to semgrex patterns.
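The regex-first, graph-second order can be sketched as a fallback. This is an illustration, not NegBio's code: the function name is hypothetical, and graph_matcher stands for any callable that performs the slower dependency-graph matching and returns a pattern id or None.

```python
import re


def detect(sentence, regex_patterns, graph_matcher):
    """Try regex patterns first; call graph_matcher only if none of them match."""
    for pattern_id, pattern in regex_patterns:
        if re.search(pattern, sentence, re.IGNORECASE):
            return pattern_id, "regex"
    pattern_id = graph_matcher(sentence)  # slower dependency-graph matching
    if pattern_id is not None:
        return pattern_id, "graph"
    return None, None
```

Because the function returns as soon as a regular expression matches, the expensive matcher is skipped for the simple cases.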
You can specify customized patterns via --neg-regex-patterns=<file> and --uncertainty-regex-patterns=<file>. Each file is a YAML file that consists of a list of patterns. Each pattern must have an id field and a pattern field. Examples can be found in the patterns folder.
Clean up intermediate information¶
This step removes intermediate information (sentence annotations) from the BioC files.