# Resources
Saber is ready to go out of the box when using the web service or a pre-trained model. However, if you plan on training your own models, you will need to provide a dataset (or datasets!) and, ideally, pre-trained word embeddings.
## Pre-trained models
Pre-trained model names can be passed to `Saber.load()` (see Quick Start: Pre-trained Models). Appending `-large` to the model name (e.g. `"PRGE-large"`) will download a much larger model, which should perform slightly better than the base model.
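For example, a minimal sketch following the Quick Start usage:

```python
from saber.saber import Saber

saber = Saber()
saber.load('PRGE')  # base model for genes and gene products
# saber.load('PRGE-large')  # larger model, slightly better performance
```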
| Identifier | Semantic Group | Identified entity types | Namespace |
|---|---|---|---|
| CHED | Chemicals | Abbreviations and Acronyms, Molecular Formulas, Chemical database identifiers, IUPAC names, Trivial (common names of chemicals and trademark names), Family (chemical families with a defined structure) and Multiple (non-continuous mentions of chemicals in text) | PubChem Compounds |
| DISO | Disorders | Acquired Abnormality, Anatomical Abnormality, Cell or Molecular Dysfunction, Congenital Abnormality, Disease or Syndrome, Mental or Behavioral Dysfunction, Neoplastic Process, Pathologic Function, Sign or Symptom | Disease Ontology |
| LIVB | Organisms | Species, Taxa | NCBI Taxonomy |
| PRGE | Genes and Gene Products | Genes, Gene Products | STRING |
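Once a model is loaded, it can be used to annotate text. A short sketch, assuming the `annotate()` method from Saber's Quick Start (the example sentence is purely illustrative):

```python
from saber.saber import Saber

saber = Saber()
saber.load('DISO')  # any identifier from the table above
print(saber.annotate("Selegiline-induced postural hypotension."))
```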
## Datasets
Currently, Saber requires corpora to be in a CoNLL format with a BIO or IOBES tag scheme, e.g.:
```
Selegiline	B-CHED
-	O
induced	O
postural	B-DISO
hypotension	I-DISO
...
```
Corpora in this format are collected here for convenience.
> **Info:** Many of the corpora in the BIO and IOBES tag formats were originally collected by Crichton et al., 2017, here.
In this format, the first column contains each token of an input sentence and the last column contains the token's tag; columns are separated by tabs, and sentences are separated by a blank line.
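As a concrete illustration of the format (this reader is a minimal sketch, not part of Saber's API), each sentence can be collected as a list of `(token, tag)` pairs:

```python
# Minimal sketch: read a CoNLL-formatted corpus into sentences of
# (token, tag) pairs, assuming tab-separated columns with the token
# in the first column and the tag in the last.
def read_conll(path):
    sentences, current = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:  # a blank line ends the current sentence
                if current:
                    sentences.append(current)
                    current = []
                continue
            columns = line.split('\t')
            current.append((columns[0], columns[-1]))
    if current:  # flush the last sentence if the file lacks a trailing blank line
        sentences.append(current)
    return sentences
```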
Of course, not all corpora are distributed in the CoNLL format:
- Corpora in the Standoff format can be converted to CoNLL format using this tool.
- Corpora in PubTator format can be converted to Standoff first using this tool.
Saber infers the "training strategy" from the structure of the dataset folder:

- To use k-fold cross-validation, simply provide a `train.*` file in your dataset folder, e.g.:

  ```
  .
  ├── NCBI_Disease
  │   └── train.tsv
  ```

- To use a train/valid/test strategy, provide `train.*` and `test.*` files in your dataset folder. Optionally, you can provide a `valid.*` file; if not provided, a random 10% of examples from `train.*` are used as the validation set. E.g.:

  ```
  .
  ├── NCBI_Disease
  │   ├── test.tsv
  │   └── train.tsv
  ```
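For instance, a short sketch of selecting a strategy by pointing `Saber.load_dataset()` (used again in the embeddings example below) at one of these folders:

```python
from saber.saber import Saber

saber = Saber()
# With only train.tsv present, Saber uses k-fold cross-validation;
# with both train.tsv and test.tsv, it uses the train/valid/test strategy.
saber.load_dataset('path/to/NCBI_Disease')
```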
## Word embeddings
When training new models, you can (and should) provide your own pre-trained word embeddings with the `pretrained_embeddings` argument (either at the command line or in the configuration file). Saber expects all word embeddings to be in the `word2vec` file format. Pyysalo et al. 2013 provide word embeddings that work quite well in the biomedical domain, which can be downloaded here. Alternatively, from the command line, call:
```
# Replace this with a location you want to save the embeddings to
$ mkdir path/to/word_embeddings
# Note: this file is over 4GB
$ wget http://evexdb.org/pmresources/vec-space-models/wikipedia-pubmed-and-PMC-w2v.bin -O path/to/word_embeddings/wikipedia-pubmed-and-PMC-w2v.bin
```
To use these word embeddings with Saber, provide their path in the `pretrained_embeddings` argument (either in the config file or at the command line). Alternatively, pass their path to `Saber.load_embeddings()`. For example:
```python
from saber.saber import Saber

saber = Saber()
saber.load_dataset('path/to/dataset')

# load the embeddings here
saber.load_embeddings('path/to/word_embeddings')

saber.build()
saber.train()
```
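Before training, it can be worth sanity-checking that the embeddings file loads as expected. A small sketch using gensim (an assumption here; gensim is not required by the steps above):

```python
from gensim.models import KeyedVectors

# Load the binary word2vec file and report the vector dimensionality.
embeddings = KeyedVectors.load_word2vec_format(
    'path/to/word_embeddings/wikipedia-pubmed-and-PMC-w2v.bin', binary=True
)
print(embeddings.vector_size)
```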
### GloVe
To use GloVe embeddings, just convert them to the `word2vec` format first:

```
(saber) $ python
>>> from gensim.scripts.glove2word2vec import glove2word2vec
>>> glove_input_file = 'glove.txt'
>>> word2vec_output_file = 'word2vec.txt'
>>> glove2word2vec(glove_input_file, word2vec_output_file)
```
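Note that `glove2word2vec` is deprecated in gensim 4.0 and later. A sketch of the newer route, assuming a plain-text GloVe file: load it directly with `no_header=True` and re-save it in the `word2vec` format:

```python
from gensim.models import KeyedVectors

# GloVe text files are word2vec text files without the leading
# "<vocab_size> <dimensions>" header line, so gensim can read them
# directly when told not to expect a header (gensim >= 4.0).
vectors = KeyedVectors.load_word2vec_format('glove.txt', binary=False, no_header=True)
vectors.save_word2vec_format('word2vec.txt', binary=False)
```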