bacalhau.tei_document.TEIDocument

class bacalhau.tei_document.TEIDocument(filepath, tokenizer, stopwords, xpath, ns_map={'xml': 'http://www.w3.org/XML/1998/namespace', 'tei': 'http://www.tei-c.org/ns/1.0'})

Bases: bacalhau.document.Document

Implementation of the abstract bacalhau.document.Document class to work with TEI files.

Creates a new TEIDocument for the given file path.

Parameters:
- filepath (str) – path to the file.
- tokenizer (nltk.tokenize.api.TokenizerI) – tokenizer used to tokenize the files in the corpus.
- stopwords (list) – words to be removed from the texts.
- xpath (str) – XPath where to get the bacalhau.text.Text from the TEI files.
- ns_map (dict) – namespaces used in the TEIDocument.
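
As a rough illustration of how the xpath, ns_map and stopwords parameters relate to each other, the sketch below selects elements from a TEI fragment and filters their words. It uses only the standard library and is not bacalhau's implementation: select_terms and SAMPLE_TEI are hypothetical names, and str.split() stands in for a real nltk tokenizer.

```python
import xml.etree.ElementTree as ET

# Same prefixes as the default ns_map in the signature above.
NS_MAP = {
    'xml': 'http://www.w3.org/XML/1998/namespace',
    'tei': 'http://www.tei-c.org/ns/1.0',
}

# Hypothetical TEI fragment, used only for illustration.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div xml:id="d1"><p>The cod is a fish of the sea.</p></div>
      <div xml:id="d2"><p>Salt preserves the cod.</p></div>
    </body>
  </text>
</TEI>"""


def select_terms(xml_source, xpath, ns_map, stopwords):
    """Hypothetical helper (not part of bacalhau).

    Returns one list of lower-cased, stopword-free tokens for each
    element matched by the namespaced XPath expression.
    """
    root = ET.fromstring(xml_source)
    stop = {word.lower() for word in stopwords}
    result = []
    for element in root.findall(xpath, ns_map):
        text = ''.join(element.itertext())
        # str.split() stands in for a real nltk tokenizer here.
        tokens = [t.strip('.,;:!?').lower() for t in text.split()]
        result.append([t for t in tokens if t and t not in stop])
    return result


terms = select_terms(SAMPLE_TEI, './/tei:div', NS_MAP, ['the', 'is', 'a', 'of'])
print(terms)
```

The 'tei' prefix in the XPath expression only resolves because the namespace map binds it to the TEI namespace URI. This is presumably why the constructor accepts ns_map alongside xpath.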

NS_MAP = {'xml': 'http://www.w3.org/XML/1998/namespace', 'tei': 'http://www.tei-c.org/ns/1.0'}

TEI = '{http://www.tei-c.org/ns/1.0}'

TEI_NAMESPACE = 'http://www.tei-c.org/ns/1.0'

XML = '{http://www.w3.org/XML/1998/namespace}'

XML_NAMESPACE = 'http://www.w3.org/XML/1998/namespace'

_abc_cache = <_weakrefset.WeakSet object>

_abc_negative_cache = <_weakrefset.WeakSet object>

_abc_negative_cache_version = 29

_abc_registry = <_weakrefset.WeakSet object>
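
The TEI and XML class attributes are the corresponding namespace URIs wrapped in braces, which is the form ElementTree-style APIs use for qualified tag and attribute names. The following sketch redefines the constants locally so that it runs without bacalhau installed, and shows how such prefixes can be used. The sample markup is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Values copied from the class attributes listed above.
TEI_NAMESPACE = 'http://www.tei-c.org/ns/1.0'
XML_NAMESPACE = 'http://www.w3.org/XML/1998/namespace'
TEI = '{%s}' % TEI_NAMESPACE
XML = '{%s}' % XML_NAMESPACE

# Hypothetical minimal TEI document.
root = ET.fromstring(
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    '<text xml:id="t1"><body><p>Bacalhau</p></body></text>'
    '</TEI>'
)

# Qualified tag names are the braced namespace plus the local name.
text_element = root.find(TEI + 'text')

# xml:id is stored under the braced XML namespace.
text_id = text_element.get(XML + 'id')

# Collect the text of every TEI paragraph in the document.
paragraphs = [p.text for p in root.iter(TEI + 'p')]

print(root.tag, text_id, paragraphs)
```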

get_term_data()

Returns term data for each bacalhau.text.Text within this document.

Return type: dict

get_texts()

Returns a list of bacalhau.text.Text objects within this document.

Returns: bacalhau.text.Text objects within this document.
Return type: list