bacalhau.document.Document
-
class bacalhau.document.Document(filepath, tokenizer, stopwords)
Bases: object
Abstract class to read from/write to files. Different implementations should extend this class and override the abstract methods.
Creates a new Document for the given file path.
Parameters:
- filepath (str) – path to the file.
- tokenizer (nltk.tokenize.api.TokenizerI) – tokenizer used to tokenize the files in the corpus.
- stopwords (list) – words to be removed from the texts.
-
_abc_cache = <_weakrefset.WeakSet object>
-
_abc_negative_cache = <_weakrefset.WeakSet object>
-
_abc_negative_cache_version = 29
-
_abc_registry = <_weakrefset.WeakSet object>
-
get_term_data()
Returns term data for each bacalhau.text.Text within this document.
Returns: dict
-
get_text_count()
Returns the number of bacalhau.text.Text objects for this Document.
Returns: number of bacalhau.text.Text objects.
Return type: int
-
get_texts()
Returns a list of bacalhau.text.Text objects within this document.
Returns: list of bacalhau.text.Text objects.
Return type: list
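The extend-and-override pattern described for this class can be sketched roughly as follows. This is a self-contained illustration, not bacalhau's implementation: DocumentSketch, ParagraphDocument and WhitespaceTokenizer are hypothetical names, the choice of get_texts() as the abstract method is an assumption (the page does not say which methods are abstract), plain (id, content) tuples stand in for bacalhau.text.Text objects, and the layout of the term-data dict is assumed because it is not documented here.

```python
from abc import ABC, abstractmethod


class DocumentSketch(ABC):
    """Hypothetical stand-in for bacalhau.document.Document (not the real class)."""

    def __init__(self, filepath, tokenizer, stopwords):
        self.filepath = filepath
        self.tokenizer = tokenizer          # any object with a tokenize(str) method
        self.stopwords = set(stopwords)

    @abstractmethod
    def get_texts(self):
        """Return a list of texts; here (text_id, content) tuples (assumption)."""

    def get_text_count(self):
        # Number of texts found in this document.
        return len(self.get_texts())

    def get_term_data(self):
        # Assumed shape: {text_id: {term: count}}.
        data = {}
        for text_id, content in self.get_texts():
            counts = {}
            for token in self.tokenizer.tokenize(content):
                term = token.lower()
                if term not in self.stopwords:
                    counts[term] = counts.get(term, 0) + 1
            data[text_id] = counts
        return data


class WhitespaceTokenizer:
    """Minimal tokenizer exposing tokenize(), the method nltk's TokenizerI defines."""

    def tokenize(self, s):
        return s.split()


class ParagraphDocument(DocumentSketch):
    """Hypothetical subclass: each blank-line-separated paragraph is one text."""

    def __init__(self, filepath, tokenizer, stopwords, raw):
        super().__init__(filepath, tokenizer, stopwords)
        self._raw = raw  # content passed in directly so the sketch reads no file

    def get_texts(self):
        paragraphs = [p.strip() for p in self._raw.split("\n\n") if p.strip()]
        return [("%s#%d" % (self.filepath, i), p) for i, p in enumerate(paragraphs)]


doc = ParagraphDocument("example.txt", WhitespaceTokenizer(), ["the", "a"],
                        "the cod swims\n\na cod a net")
print(doc.get_text_count())
print(doc.get_term_data())
```

Because the base class is abstract, instantiating DocumentSketch directly raises TypeError; only subclasses that override get_texts() can be created. A real subclass would presumably read from filepath and accept an NLTK tokenizer instead of the whitespace stand-in.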