bacalhau.tei_document.TEIDocument

class bacalhau.tei_document.TEIDocument(filepath, tokenizer, stopwords, xpath, ns_map={'xml': 'http://www.w3.org/XML/1998/namespace', 'tei': 'http://www.tei-c.org/ns/1.0'})

Bases: bacalhau.document.Document

Implementation of the abstract bacalhau.document.Document class to work with TEI files.

Creates a new TEIDocument for the given file path.

Parameters:
  • filepath (str) – path to the file.
  • tokenizer (nltk.tokenize.api.TokenizerI) – tokenizer used to split the texts in the corpus into tokens.
  • stopwords (list) – words to be removed from the texts.
  • xpath (str) – XPath expression selecting the elements of the TEI file from which the bacalhau.text.Text objects are built.
  • ns_map (dict) – namespaces used in the TEIDocument.
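
The roles of the tokenizer and stopwords parameters can be illustrated with a minimal sketch. It does not reproduce bacalhau's internals: SimpleTokenizer and remove_stopwords are hypothetical names, the tokenizer is a regex stand-in that only mimics the tokenize() method of nltk.tokenize.api.TokenizerI, and the case-insensitive comparison is an assumption.

```python
import re


class SimpleTokenizer:
    """Hypothetical stand-in exposing the tokenize() method of TokenizerI."""

    def tokenize(self, text):
        # Split on runs of word characters; a real NLTK tokenizer may differ.
        return re.findall(r"\w+", text)


def remove_stopwords(text, tokenizer, stopwords):
    """Tokenize text and drop every token found in stopwords.

    The comparison is case-insensitive, which is an assumption made
    for this sketch and not documented behaviour of bacalhau.
    """
    stop = {word.lower() for word in stopwords}
    return [tok for tok in tokenizer.tokenize(text) if tok.lower() not in stop]


tokens = remove_stopwords("The cod and the sea", SimpleTokenizer(), ["the", "and"])
print(tokens)
```

Any object with a compatible tokenize() method could be swapped in for the stand-in here.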
NS_MAP = {'xml': 'http://www.w3.org/XML/1998/namespace', 'tei': 'http://www.tei-c.org/ns/1.0'}
TEI = '{http://www.tei-c.org/ns/1.0}'
TEI_NAMESPACE = 'http://www.tei-c.org/ns/1.0'
XML = '{http://www.w3.org/XML/1998/namespace}'
XML_NAMESPACE = 'http://www.w3.org/XML/1998/namespace'
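
The constants above come in two forms: NS_MAP maps prefixes to namespace URIs for use in XPath expressions, while TEI and XML are the same URIs in Clark notation ({uri}), the form in which parsed element and attribute names appear. The sketch below shows how the two forms relate, using the standard library's xml.etree.ElementTree on a made-up TEI fragment. Whether bacalhau itself uses this parser is an assumption; the constants are copied from the listing above, and SAMPLE and the XPath are illustrative only.

```python
import xml.etree.ElementTree as ET

# Values copied from the class attributes documented above.
NS_MAP = {
    'xml': 'http://www.w3.org/XML/1998/namespace',
    'tei': 'http://www.tei-c.org/ns/1.0',
}
TEI = '{http://www.tei-c.org/ns/1.0}'
XML = '{http://www.w3.org/XML/1998/namespace}'

# Made-up TEI fragment, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div xml:id="d1"><p>First text.</p></div>
    <div xml:id="d2"><p>Second text.</p></div>
  </body></text>
</TEI>"""

root = ET.fromstring(SAMPLE)

# Prefixed XPath: the 'tei' prefix is resolved through NS_MAP.
divs = root.findall('.//tei:div', NS_MAP)

# Parsed names carry the namespace in Clark notation.
tags = {div.tag for div in divs}
ids = [div.get(XML + 'id') for div in divs]

print(tags)
print(ids)
```

An expression such as './/tei:div' is the kind of value the xpath parameter would take, with ns_map supplying the prefix bindings.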
get_term_data()

Returns term data for each bacalhau.text.Text within this document.

Return type: dict
get_texts()

Returns a list of bacalhau.text.Text objects within this document.

Returns: bacalhau.text.Text objects within this document.
Return type: list