bacalhau.tei_document.TEIDocument

class bacalhau.tei_document.TEIDocument(filepath, tokenizer, stopwords, xpath, ns_map={'xml': 'http://www.w3.org/XML/1998/namespace', 'tei': 'http://www.tei-c.org/ns/1.0'})

Bases: bacalhau.document.Document
Implementation of the abstract bacalhau.document.Document class to work with TEI files.

Creates a new TEIDocument for the given file path.

Parameters:
- filepath (str) – path to the file.
- tokenizer (nltk.tokenize.api.TokenizerI) – tokenizer used to tokenize the files in the corpus.
- stopwords (list) – words to be removed from the texts.
- xpath (str) – XPath expression that selects the bacalhau.text.Text elements in the TEI files.
- ns_map (dict) – namespaces used in the TEIDocument.
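The interplay between xpath and ns_map can be sketched with the standard library alone. This is a minimal illustration, not the way TEIDocument is implemented: the sample TEI string, the XPath expression and the helper select_text_elements are all assumptions made for the example, and xml.etree.ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Same prefix-to-URI mapping as the default ns_map shown in the signature.
NS_MAP = {
    "xml": "http://www.w3.org/XML/1998/namespace",
    "tei": "http://www.tei-c.org/ns/1.0",
}

# Hypothetical TEI fragment, invented for this example.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div xml:id="c1"><p>First chapter text.</p></div>
      <div xml:id="c2"><p>Second chapter text.</p></div>
    </body>
  </text>
</TEI>"""


def select_text_elements(xml_string, xpath, ns_map):
    """Return the elements matched by a prefixed XPath (hypothetical helper)."""
    root = ET.fromstring(xml_string)
    # The prefixes in the XPath are resolved through ns_map.
    return root.findall(xpath, ns_map)


divs = select_text_elements(SAMPLE_TEI, ".//tei:body/tei:div", NS_MAP)
print(len(divs))
```

Each matched element would correspond to one text of the document; without the tei prefix and the mapping, the same expression matches nothing, because every TEI element lives in the TEI namespace.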
NS_MAP = {'xml': 'http://www.w3.org/XML/1998/namespace', 'tei': 'http://www.tei-c.org/ns/1.0'}

TEI = '{http://www.tei-c.org/ns/1.0}'

TEI_NAMESPACE = 'http://www.tei-c.org/ns/1.0'

XML = '{http://www.w3.org/XML/1998/namespace}'

XML_NAMESPACE = 'http://www.w3.org/XML/1998/namespace'
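The TEI and XML constants are namespace URIs wrapped in braces, the form ElementTree-style APIs use for qualified names. The sketch below shows how such constants are typically combined with a local name. It redefines the constants locally instead of importing them, and the sample element is invented for the example, so it says nothing about how TEIDocument uses them internally.

```python
import xml.etree.ElementTree as ET

# Local copies of the class constants listed above.
TEI_NAMESPACE = "http://www.tei-c.org/ns/1.0"
XML_NAMESPACE = "http://www.w3.org/XML/1998/namespace"
TEI = "{%s}" % TEI_NAMESPACE
XML = "{%s}" % XML_NAMESPACE

# Hypothetical TEI element, invented for this example.
element = ET.fromstring(
    '<div xmlns="http://www.tei-c.org/ns/1.0" xml:id="c1"><p>Some text.</p></div>'
)

qualified_tag = TEI + "div"           # '{http://www.tei-c.org/ns/1.0}div'
text_id = element.get(XML + "id")     # value of the xml:id attribute
paragraphs = element.findall(TEI + "p")

print(element.tag == qualified_tag, text_id, len(paragraphs))
```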
_abc_cache = <_weakrefset.WeakSet object>

_abc_negative_cache = <_weakrefset.WeakSet object>

_abc_negative_cache_version = 29

_abc_registry = <_weakrefset.WeakSet object>
get_term_data()
    Returns term data for each bacalhau.text.Text within this document.
    Return type: dict

get_texts()
    Returns a list of bacalhau.text.Text objects within this document.
    Return type: list
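The layout of the dictionary returned by get_term_data is not described here. As a rough illustration of the idea only, the sketch below builds per-text term counts after stopword removal. Everything in it is an assumption: the function term_counts, the regular-expression tokenizer standing in for an nltk tokenizer, the sample texts, and the shape of the result.

```python
import re
from collections import Counter


def term_counts(texts, stopwords):
    """Map each text id to the counts of its non-stopword terms.

    Hypothetical stand-in, not the bacalhau implementation.
    """
    stop = {word.lower() for word in stopwords}
    data = {}
    for text_id, content in texts.items():
        # Simple lowercase word tokenizer, assumed for this example.
        tokens = re.findall(r"[a-z]+", content.lower())
        data[text_id] = Counter(t for t in tokens if t not in stop)
    return data


# Hypothetical texts keyed by an identifier such as xml:id.
texts = {"c1": "The cod and the sea.", "c2": "Salted cod, dried cod."}
result = term_counts(texts, ["the", "and"])
print(result["c2"]["cod"])
```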