Saber is ready to go out of the box when using the web-service or a pre-trained model. However, if you plan on training your own models, you will need to provide a dataset (or datasets!) and, ideally, pre-trained word embeddings.
Pre-trained model names can be passed to Saber.load() (see Quick Start: Pre-trained Models). Appending "-large" to the model name (e.g. "PRGE-large") will download a much larger model, which should perform slightly better than the base model.
|Identifier||Semantic Group||Identified entity types||Namespace|
||Chemicals||Abbreviations and Acronyms, Molecular Formulas, Chemical database identifiers, IUPAC names, Trivial (common names of chemicals and trademark names), Family (chemical families with a defined structure) and Multiple (non-continuous mentions of chemicals in text)||PubChem Compounds|
||Disorders||Acquired Abnormality, Anatomical Abnormality, Cell or Molecular Dysfunction, Congenital Abnormality, Disease or Syndrome, Mental or Behavioral Dysfunction, Neoplastic Process, Pathologic Function, Sign or Symptom||Disease Ontology|
||Organisms||Species, Taxa||NCBI Taxonomy|
||Genes and Gene Products||Genes, Gene Products||STRING|
Currently, Saber requires corpora to be in a CoNLL format with a BIO or IOBES tag scheme, e.g.:
    Selegiline	B-CHED
    -	O
    induced	O
    postural	B-DISO
    hypotension	I-DISO
    ...
Corpora in this format are collected here for convenience.
In this format, the first column contains each token of an input sentence, the last column contains the token's tag, all columns are separated by tabs, and sentences are separated by a blank line.
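To make the layout concrete, here is a minimal sketch of a reader for this format. It is not Saber's own loader: `read_conll` is a hypothetical helper, and it assumes that sentences are separated by an empty line and that only the first and last columns matter. The second sentence in the example is made up for illustration.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group tab-separated CoNLL lines into sentences of (token, tag) pairs.

    Hypothetical helper for illustration only, not part of Saber.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # An empty line closes the current sentence (assumed separator).
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split("\t")
        # First column is the token, last column is its tag.
        current.append((columns[0], columns[-1]))
    if current:
        # Flush the final sentence if the input does not end with an empty line.
        sentences.append(current)
    return sentences


example = (
    "Selegiline\tB-CHED\n-\tO\ninduced\tO\npostural\tB-DISO\nhypotension\tI-DISO\n"
    "\n"
    "Made-up\tO\nsentence\tO\n"
)
sentences = read_conll(example.splitlines())
```

Because only the first and last columns are read, files that carry extra columns (for example, part-of-speech tags) parse the same way.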
Of course, not all corpora are distributed in the CoNLL format:
- Corpora in the Standoff format can be converted to CoNLL format using this tool.
- Corpora in PubTator format can be converted to Standoff first using this tool.
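Switching between the two supported tag schemes is also mechanical. The sketch below converts one sentence's BIO tags to IOBES; `bio_to_iobes` is a hypothetical helper written for illustration, not a function provided by Saber, and it assumes the input is well-formed BIO (every I- tag follows a B- or I- tag with the same label).

```python
from typing import List


def bio_to_iobes(tags: List[str]) -> List[str]:
    """Convert one sentence's BIO tags to IOBES (illustrative, not Saber's code)."""
    iobes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            iobes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is an I- tag with the same label.
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            # A B- tag with no continuation is a single-token entity.
            iobes.append(f"B-{label}" if continues else f"S-{label}")
        else:
            # An I- tag with no continuation ends the entity.
            iobes.append(f"I-{label}" if continues else f"E-{label}")
    return iobes


print(bio_to_iobes(["B-CHED", "O", "O", "B-DISO", "I-DISO"]))
# prints ['S-CHED', 'O', 'O', 'B-DISO', 'E-DISO']
```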
Saber infers the "training strategy" based on the structure of the dataset folder:
- To use k-fold cross-validation, simply provide a train.* file in your dataset folder.
    .
    ├── NCBI_Disease
    │   └── train.tsv
- To use a train/valid/test strategy, provide train.* and test.* files in your dataset folder. Optionally, you can provide a valid.* file. If not provided, a random 10% of examples from train.* are used as the validation set.
    .
    ├── NCBI_Disease
    │   ├── test.tsv
    │   └── train.tsv
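The rules above can be sketched as a small function. This is an illustration of the described behaviour, not Saber's actual implementation: `infer_training_strategy` and its return strings are hypothetical, and treating a folder without a test.* file as cross-validation (even if a valid.* file is present) is an assumption.

```python
import tempfile
from pathlib import Path


def infer_training_strategy(dataset_dir: str) -> str:
    """Guess the training strategy from the files in a dataset folder.

    Hypothetical helper mirroring the rules described above, not Saber's code.
    """
    folder = Path(dataset_dir)
    has_train = any(folder.glob("train.*"))
    has_test = any(folder.glob("test.*"))
    has_valid = any(folder.glob("valid.*"))
    if not has_train:
        raise FileNotFoundError(f"no train.* file found in {folder}")
    if not has_test:
        # Only a train.* file: fall back to k-fold cross-validation.
        return "k-fold cross-validation"
    if has_valid:
        return "train/valid/test"
    # No valid.* file: 10% of train.* would be held out for validation.
    return "train/test with 10% of train held out for validation"


with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "NCBI_Disease"
    dataset.mkdir()
    (dataset / "train.tsv").touch()
    strategy_train_only = infer_training_strategy(str(dataset))
    (dataset / "test.tsv").touch()
    strategy_with_test = infer_training_strategy(str(dataset))
```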
When training new models, you can (and should) provide your own pre-trained word embeddings with the pretrained_embeddings argument (either at the command line or in the configuration file). Saber expects all word embeddings to be in the word2vec file format. Pyysalo et al. 2013 provide word embeddings that work quite well in the biomedical domain, which can be downloaded here. Alternatively, download them from the command line:
    # Replace this with a location you want to save the embeddings to
    $ mkdir -p path/to/word_embeddings
    # Note: this file is over 4GB
    # -P saves the file inside the directory (-O expects a file name, not a directory)
    $ wget http://evexdb.org/pmresources/vec-space-models/wikipedia-pubmed-and-PMC-w2v.bin -P path/to/word_embeddings
To use these word embeddings with Saber, provide their path in the pretrained_embeddings argument (either in the config file or at the command line). Alternatively, pass their path to Saber.load_embeddings(). For example:
    from saber.saber import Saber

    saber = Saber()
    saber.load_dataset('path/to/dataset')
    # load the embeddings here
    saber.load_embeddings('path/to/word_embeddings')
    saber.build()
    saber.train()
If your embeddings are in GloVe format, you can convert them to the word2vec format with gensim:

    (saber) $ python
    >>> from gensim.scripts.glove2word2vec import glove2word2vec
    >>> glove_input_file = 'glove.txt'
    >>> word2vec_output_file = 'word2vec.txt'
    >>> glove2word2vec(glove_input_file, word2vec_output_file)
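Whichever way you obtained the file, one cheap sanity check before training is to read its first line: a word2vec file (text or binary) starts with a header giving the vocabulary size and the vector dimension. The sketch below is a standalone check, not part of Saber or gensim; `read_word2vec_header` is a hypothetical helper, and the tiny file it reads is made-up demo data.

```python
import os
import tempfile
from typing import Tuple


def read_word2vec_header(path: str) -> Tuple[int, int]:
    """Return (vocabulary size, vector dimension) from a word2vec file's first line.

    Hypothetical helper for illustration only.
    """
    with open(path, "rb") as handle:
        # The header is plain text even when the vectors themselves are binary.
        header = handle.readline().decode("utf-8").split()
    if len(header) != 2:
        raise ValueError(f"{path} does not start with a word2vec header")
    vocab_size, dimension = (int(field) for field in header)
    return vocab_size, dimension


# Demo on a tiny, made-up text-format file with 2 words and 3 dimensions.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "word2vec.txt")
    with open(demo_path, "w", encoding="utf-8") as demo:
        demo.write("2 3\nfoo 0.1 0.2 0.3\nbar 0.4 0.5 0.6\n")
    demo_header = read_word2vec_header(demo_path)
```

A file that was left in GloVe format has no such header, so its first line contains a word followed by many floats and the check raises a ValueError.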