bacalhau.corpus.Corpus

class bacalhau.corpus.Corpus(corpus_path, document_class, tokenizer=WordPunctTokenizer(), stopwords=stopwords.words('english'), **document_kwargs)

Bases: object

A manager class to generate topic hierarchies from files.

Creates a new Corpus for the given path, using the given bacalhau.document.Document class to process the files.

Parameters:
  • corpus_path (str) – path to the files.
  • document_class (bacalhau.document.Document) – document class used to process the corpus files.
  • tokenizer (nltk.tokenize.api.TokenizerI) – tokenizer used to tokenize the files in the corpus, defaults to nltk.tokenize.regexp.WordPunctTokenizer.
  • stopwords (list) – words to be removed from the texts, defaults to nltk.corpus.stopwords.words('english').
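
The two defaults above can be illustrated without installing NLTK. The sketch below uses the regular expression that NLTK documents for WordPunctTokenizer; the small STOPWORDS set is a hypothetical stand-in for nltk.corpus.stopwords.words('english'), and the filtering rule in candidate_terms (lower-cased alphabetic tokens only) is an assumption, not necessarily what Corpus does internally.

```python
import re

# Same pattern NLTK documents for WordPunctTokenizer: runs of word
# characters, or runs of non-space punctuation.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

# Hypothetical stand-in for nltk.corpus.stopwords.words('english').
STOPWORDS = {"the", "a", "of", "and", "is"}


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return WORD_PUNCT.findall(text)


def candidate_terms(text, stopwords=STOPWORDS):
    """Keep alphabetic tokens that are not stopwords (assumed filtering rule)."""
    return [t.lower() for t in tokenize(text)
            if t.isalpha() and t.lower() not in stopwords]
```

For example, tokenize("Good muffins cost $3.88") splits the price into "$", "3", "." and "88", which is why a filtering step after tokenization is useful.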

_add_tf_idf(term_data)

Returns term_data with a TF.IDF value added to each term/text combination.

Parameters:term_data (dict) – dict with term/text combination.
Return type:dict
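
A minimal sketch of adding a TF.IDF value to each term/text combination follows. The layout of term_data ({term: {text_id: {"count": int}}}), the "tf_idf" key, and the explicit text_count argument are assumptions made for illustration, and the weighting (raw count times log of inverse document frequency) is one common variant rather than the formula confirmed for this class.

```python
import math


def add_tf_idf(term_data, text_count):
    """Add a 'tf_idf' value to every term/text entry.

    term_data is assumed to look like {term: {text_id: {"count": int}}}.
    """
    for term, texts in term_data.items():
        # Inverse document frequency: terms found in fewer texts score higher.
        idf = math.log(text_count / len(texts))
        for text_id, entry in texts.items():
            entry["tf_idf"] = entry["count"] * idf
    return term_data
```

Here text_count plays the role of the value returned by _get_text_count(); a term that occurs in every text gets a TF.IDF of 0 under this weighting.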

_get_documents()

Creates a bacalhau.document.Document object for each of the files in the corpus and returns them in a list. The method takes no arguments; the files are read from the corpus_path given when the Corpus was created.

Returns:documents in this corpus.
Return type:list

_get_hypernym(word)

Returns a list of the hypernyms for the given word.

Parameters:word (str) – the word to get the hypernyms for.
Return type:list

_get_term_data()

Returns term data for all of the bacalhau.document.Document objects in this corpus.

Return type:dict

_get_text_count()

Returns the number of bacalhau.text.Text objects in this corpus.

Return type:float

annotate_topic_tree(tree)

Annotates each node of the given bacalhau.topic_tree.TopicTree with the bacalhau.text.Text objects it relates to and the associated counts.

Parameters:tree (bacalhau.topic_tree.TopicTree) – topic tree of terms
Return type:bacalhau.topic_tree.TopicTree

generate_topic_tree(n_terms)

Generates a bacalhau.topic_tree.TopicTree for the corpus, using at most n_terms terms from each bacalhau.text.Text. The tree is built in three steps: the top terms are extracted, hypernyms are looked up for each term, and the bacalhau.topic_tree.TopicTree is created from those hypernyms.

Parameters:n_terms (int) – maximum number of terms to be used from each Text.
Returns:the generated topic tree.
Return type:bacalhau.topic_tree.TopicTree
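
The three steps described for generate_topic_tree can be sketched as a chain of the public methods documented on this page. The helper build_topic_tree and the StubCorpus class are hypothetical; the stub only exists so the sketch runs without the library, and its return values are invented placeholders, not the real data structures.

```python
def build_topic_tree(corpus, n_terms):
    """Chain the three documented steps on a Corpus-like object."""
    top_terms = corpus.get_top_terms(n_terms)    # 1. top terms per text
    hypernyms = corpus.get_hypernyms(top_terms)  # 2. hypernyms per term
    return corpus.get_topic_tree(hypernyms)      # 3. tree from hypernyms


class StubCorpus:
    """Hypothetical stand-in exposing the three documented method names."""

    def get_top_terms(self, n_terms):
        return {"text1": ["cod", "salt"][:n_terms]}

    def get_hypernyms(self, top_terms):
        return {text: {term: ["food", term] for term in terms}
                for text, terms in top_terms.items()}

    def get_topic_tree(self, hypernyms):
        # A sorted list of node names stands in for a real TopicTree.
        return sorted({node for terms in hypernyms.values()
                       for path in terms.values() for node in path})
```

Whether generate_topic_tree also calls annotate_topic_tree on the result is not stated on this page, so the sketch leaves that step out.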

get_hypernyms(top_terms)

Returns a dictionary with the hypernyms for the given terms.

Parameters:top_terms (dict) – dict with term/text information.
Returns:{text: {term: hypernym}}.
Return type:dict

get_top_terms(n_terms)

Returns a dictionary with the n_terms highest-ranked terms for each bacalhau.text.Text, taken from the term data dictionary.

Parameters:n_terms (int) – maximum number of terms to be used from each text.
Return type:dict
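
Selecting the highest terms per text can be sketched with heapq.nlargest from the standard library. The input layout ({text_id: {term: score}}) and the idea that terms are ranked by a numeric score such as TF.IDF are assumptions; the page does not spell out the structure of the term data dictionary.

```python
import heapq


def top_terms(scores, n_terms):
    """Return the n_terms best-scoring terms for each text.

    scores is assumed to look like {text_id: {term: score}}.
    """
    return {
        text_id: dict(heapq.nlargest(n_terms, terms.items(),
                                     key=lambda item: item[1]))
        for text_id, terms in scores.items()
    }
```

A text with fewer than n_terms terms simply keeps all of them, which matches the "maximum number of terms" wording of the parameter.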

get_topic_tree(hypernyms)

Generates and returns a bacalhau.topic_tree.TopicTree for the given hypernyms.

Parameters:hypernyms (dict) – dictionary of hypernyms.
Return type:bacalhau.topic_tree.TopicTree
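
To show how a tree can be derived from hypernyms, the sketch below merges hypernym paths into a nested dict. It assumes each hypernym value is a list ordered from the most general concept down to the term itself ({text_id: {term: [root, ..., term]}}) and uses plain dictionaries instead of bacalhau.topic_tree.TopicTree, whose interface is not described on this page.

```python
def build_tree(hypernyms):
    """Merge hypernym paths into one nested dict, most general node first.

    hypernyms is assumed to look like {text_id: {term: [root, ..., term]}}.
    """
    tree = {}
    for terms in hypernyms.values():
        for path in terms.values():
            node = tree
            for name in path:
                # Reuse an existing child so shared ancestors are merged.
                node = node.setdefault(name, {})
    return tree
```

Terms from different texts that share ancestors end up under the same branch, which is what makes the result a topic hierarchy for the whole corpus rather than one tree per text.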