bacalhau.tei_document.TEIDocument

class bacalhau.tei_document.TEIDocument(filepath, tokenizer, stopwords, xpath, ns_map={'xml': 'http://www.w3.org/XML/1998/namespace', 'tei': 'http://www.tei-c.org/ns/1.0'})

Bases: bacalhau.document.Document

Implementation of the abstract bacalhau.document.Document class to work with TEI files.

Creates a new TEIDocument for the given file path.

Parameters:
	filepath (`str`) – path to the file.
	tokenizer (`nltk.tokenize.api.TokenizerI`) – tokenizer used to tokenize the files in the corpus.
	stopwords (`list`) – words to be removed from the texts.
	xpath (`str`) – XPath expression that selects the `bacalhau.text.Text` elements in the TEI files.
	ns_map (`dict`) – namespaces used in the `TEIDocument`.
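The roles of `xpath`, `ns_map` and `stopwords` can be illustrated without bacalhau itself. The sketch below is not the library's implementation: it uses the standard library's `xml.etree.ElementTree`, a whitespace split in place of an NLTK tokenizer, and a hypothetical helper `select_terms` with an invented TEI sample, purely to show how a prefixed XPath is resolved through the namespace map and how stopwords are filtered out.

```python
import xml.etree.ElementTree as ET

# Invented sample document; real TEI files are far richer.
TEI_SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div xml:id="d1"><p>The cod is salted and dried.</p></div>
      <div xml:id="d2"><p>The river meets the sea.</p></div>
    </body>
  </text>
</TEI>"""

# Same prefixes as the default ns_map in the signature above.
ns_map = {
    "xml": "http://www.w3.org/XML/1998/namespace",
    "tei": "http://www.tei-c.org/ns/1.0",
}
xpath = ".//tei:body/tei:div"      # assumed example expression
stopwords = ["the", "is", "and"]   # assumed example stopword list


def select_terms(xml_source, xpath, ns_map, stopwords):
    """Hypothetical helper: one list of filtered terms per matched element."""
    root = ET.fromstring(xml_source)
    results = []
    for element in root.findall(xpath, ns_map):
        text = "".join(element.itertext())
        # Naive tokenization: split on whitespace, strip punctuation, lowercase.
        tokens = [t.strip(".,;:").lower() for t in text.split()]
        results.append([t for t in tokens if t and t not in stopwords])
    return results


terms = select_terms(TEI_SAMPLE, xpath, ns_map, stopwords)
print(terms)
```

With the real class, the same four values would be passed to the constructor, with a `TokenizerI` instance taking the place of the whitespace split.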

NS_MAP = {'xml': 'http://www.w3.org/XML/1998/namespace', 'tei': 'http://www.tei-c.org/ns/1.0'}

TEI = '{http://www.tei-c.org/ns/1.0}'

TEI_NAMESPACE = 'http://www.tei-c.org/ns/1.0'

XML = '{http://www.w3.org/XML/1998/namespace}'

XML_NAMESPACE = 'http://www.w3.org/XML/1998/namespace'
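The `TEI` and `XML` constants are the namespace URIs wrapped in braces, which is the form ElementTree-style APIs use for qualified tag and attribute names. A minimal sketch of how such prefixes are typically used, assuming the standard library's `xml.etree.ElementTree` and an invented one-element document (the constants are re-declared locally so the example runs without bacalhau):

```python
import xml.etree.ElementTree as ET

# Local copies of the values listed above.
TEI_NAMESPACE = "http://www.tei-c.org/ns/1.0"
XML_NAMESPACE = "http://www.w3.org/XML/1998/namespace"
TEI = "{%s}" % TEI_NAMESPACE
XML = "{%s}" % XML_NAMESPACE

# Invented sample: a TEI root with one child carrying an xml:id attribute.
root = ET.fromstring(
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text xml:id="t1"/></TEI>'
)

# Qualified names are the brace-wrapped namespace followed by the local name.
text_el = root.find(TEI + "text")
text_id = text_el.get(XML + "id")
print(root.tag, text_id)
```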

get_term_data()

Returns term data for each bacalhau.text.Text within this document.

Return type:	`dict`

get_texts()

Returns a list of bacalhau.text.Text objects within this document.

Returns:	`bacalhau.text.Text` objects within this document.
Return type:	`list`