bacalhau.document.Document

class bacalhau.document.Document(filepath, tokenizer, stopwords)[source]

Bases: object

Abstract class for reading from and writing to files. Concrete implementations should extend this class and override its abstract methods.

Creates a new Document for the given file path.

Parameters:
  • filepath (str) – path to the file.
  • tokenizer (nltk.tokenize.api.TokenizerI) – tokenizer used to tokenize the files in the corpus.
  • stopwords (list) – words to be removed from the texts.
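Because the class is abstract, a concrete subclass must implement the methods documented below. The following is a minimal sketch under stated assumptions: the PlainTextDocument name is hypothetical, the abstract methods are taken to be exactly the three listed on this page, and the way a bacalhau.text.Text is constructed (and its get_term_data() method) is illustrative rather than guaranteed by this API.

    from bacalhau.document import Document
    from bacalhau.text import Text


    class PlainTextDocument(Document):
        """Hypothetical Document that treats the whole file as one text.

        The class name, the Text constructor call and the assumption that
        Text exposes get_term_data() are illustrative only.
        """

        def __init__(self, filepath, tokenizer, stopwords):
            super(PlainTextDocument, self).__init__(filepath, tokenizer, stopwords)
            # Keep local references rather than assuming how the base
            # class stores its constructor arguments.
            self._path = filepath
            self._tok = tokenizer
            self._stop = stopwords

        def get_texts(self):
            with open(self._path) as reader:
                content = reader.read()
            # Assumption: a Text can be built from raw content plus the
            # tokenizer and stopwords given to the Document.
            return [Text(content, self._tok, self._stop)]

        def get_text_count(self):
            return len(self.get_texts())

        def get_term_data(self):
            # Assumption: each Text provides its own term data as a dict
            # that can be merged into one dict for the whole document.
            term_data = {}
            for text in self.get_texts():
                term_data.update(text.get_term_data())
            return term_data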
get_term_data()[source]

Returns term data for each bacalhau.text.Text within this document.

Returns:term data for each text.
Return type:dict
get_text_count()[source]

Returns the number of bacalhau.text.Text objects for this Document.

Returns:number of bacalhau.text.Text objects.
Return type:int
get_texts()[source]

Returns a list of bacalhau.text.Text objects within this document.

Returns:list of bacalhau.text.Text objects.
Return type:list