bacalhau.document.Document

class bacalhau.document.Document(filepath, tokenizer, stopwords)[source]

Bases: object

Abstract class for reading from and writing to files. Concrete implementations should extend this class and override its abstract methods.

Creates a new Document for the given file path.

Parameters:
  • filepath (str) – path to the file.
  • tokenizer (nltk.tokenize.api.TokenizerI) – tokenizer used to tokenize the files in the corpus.
  • stopwords (list) – words to be removed from the texts.
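Because the class is abstract, a concrete subclass must implement the methods documented below. The following is a minimal sketch under stated assumptions: the PlainTextDocument name is hypothetical, the abstract methods are taken to be exactly the three listed on this page, and the way a bacalhau.text.Text is constructed (and its get_term_data() method) is illustrative rather than guaranteed by this API.

    from bacalhau.document import Document
    from bacalhau.text import Text


    class PlainTextDocument(Document):
        """Hypothetical Document that treats the whole file as one text.

        The class name, the Text constructor call and the assumption that
        Text exposes get_term_data() are illustrative only.
        """

        def __init__(self, filepath, tokenizer, stopwords):
            super(PlainTextDocument, self).__init__(filepath, tokenizer, stopwords)
            # Keep local references rather than assuming how the base
            # class stores its constructor arguments.
            self._path = filepath
            self._tok = tokenizer
            self._stop = stopwords

        def get_texts(self):
            with open(self._path) as reader:
                content = reader.read()
            # Assumption: a Text can be built from raw content plus the
            # tokenizer and stopwords given to the Document.
            return [Text(content, self._tok, self._stop)]

        def get_text_count(self):
            return len(self.get_texts())

        def get_term_data(self):
            # Assumption: each Text provides its own term data as a dict
            # that can be merged into one dict for the whole document.
            term_data = {}
            for text in self.get_texts():
                term_data.update(text.get_term_data())
            return term_data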
get_term_data()[source]

Returns term data for each bacalhau.text.Text within this document.

Returns:term data for each text.
Return type:dict
get_text_count()[source]

Returns the number of bacalhau.text.Text objects for this Document.

Returns:number of bacalhau.text.Text objects.
Return type:int
get_texts()[source]

Returns a list of bacalhau.text.Text objects within this document.

Returns:list of bacalhau.text.Text objects.
Return type:list