bacalhau.document.Document
- class bacalhau.document.Document(filepath, tokenizer, stopwords)
Bases: object
Abstract class to read from/write to files. Different implementations should extend this class and override the abstract methods.
Creates a new Document for the given file path.
Parameters:
  - filepath (str) – path to the file.
  - tokenizer (nltk.tokenize.api.TokenizerI) – tokenizer used to tokenize the files in the corpus.
  - stopwords (list) – words to be removed from the texts.
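The subclassing pattern described above can be sketched as follows. This is a minimal illustration, not the bacalhau source: `DocumentSketch`, `ParagraphDocument`, and `WhitespaceTokenizer` are hypothetical stand-ins. Which methods are actually abstract, and how texts are represented, are assumptions made for the example (here each text is simply a list of tokens).

```python
from abc import ABC, abstractmethod


class WhitespaceTokenizer:
    """Hypothetical tokenizer exposing tokenize(), as nltk's TokenizerI does."""

    def tokenize(self, s):
        return s.split()


class DocumentSketch(ABC):
    """Stand-in mirroring the documented constructor; NOT the real bacalhau class."""

    def __init__(self, filepath, tokenizer, stopwords):
        self.filepath = filepath
        self.tokenizer = tokenizer
        self.stopwords = stopwords

    @abstractmethod
    def get_texts(self):
        """Return a list of text objects found in the file (assumed abstract)."""

    def get_text_count(self):
        # Derived from get_texts(), so subclasses only override one method.
        return len(self.get_texts())


class ParagraphDocument(DocumentSketch):
    """Hypothetical implementation: every blank-line-separated paragraph is a text."""

    def get_texts(self):
        with open(self.filepath, encoding="utf-8") as handle:
            content = handle.read()
        texts = []
        for paragraph in content.split("\n\n"):
            # Tokenize, then drop stopwords; a text here is just a token list.
            tokens = [t for t in self.tokenizer.tokenize(paragraph.lower())
                      if t not in self.stopwords]
            if tokens:
                texts.append(tokens)
        return texts
```

Instantiating `DocumentSketch` directly raises `TypeError` because `get_texts` is not overridden, which is the usual way an abstract base class enforces the "override the abstract methods" contract.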
- _abc_cache = <_weakrefset.WeakSet object>
- _abc_negative_cache = <_weakrefset.WeakSet object>
- _abc_negative_cache_version = 29
- _abc_registry = <_weakrefset.WeakSet object>
- get_term_data()
  Returns term data for each bacalhau.text.Text within this document.
  Returns: dict
- get_text_count()
  Returns the number of bacalhau.text.Text objects for this Document.
  Returns: number of bacalhau.text.Text objects.
  Return type: int
- get_texts()
  Returns a list of bacalhau.text.Text objects within this document.
  Returns: list of bacalhau.text.Text objects.
  Return type: list
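As a rough illustration of how the three accessors relate, the sketch below consumes any object that offers them. `summarize` and `FakeDocument` are hypothetical names introduced for this example. The layout of the dict returned by `get_term_data()` is not specified above, so the code only relies on it being a dict, and the stub's term-to-count mapping is purely illustrative.

```python
def summarize(document):
    """Collect basic figures from any object offering the three documented methods."""
    texts = document.get_texts()
    count = document.get_text_count()
    term_data = document.get_term_data()
    # Assumption: the count reported matches the length of the returned list.
    if count != len(texts):
        raise ValueError("get_text_count() disagrees with get_texts()")
    return {"text_count": count, "term_data_entries": len(term_data)}


class FakeDocument:
    """Hypothetical stub standing in for a concrete Document subclass."""

    def get_texts(self):
        return ["first text", "second text"]

    def get_text_count(self):
        return 2

    def get_term_data(self):
        # Illustrative shape only (term -> count); the real layout may differ.
        return {"first": 1, "second": 1, "text": 2}


print(summarize(FakeDocument()))
```

Writing the helper against the method names rather than a concrete class keeps it usable with any implementation that extends the abstract class.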