Bag of Factors#
Bag of Factors allows you to analyze a corpus from its factors.
Free software: MIT
Documentation: https://balouf.github.io/bof/.
Features#
Feature Extraction#
The feature_extraction module mimics the module https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text
with a focus on character-based extraction.
The main differences are:
it is slightly faster;
the features can be incrementally updated;
it is possible to fit only a random sample of factors to reduce space and computation time.
The main entry point for this module is the CountVectorizer class, which mimics its scikit-learn counterpart (also named CountVectorizer).
It is in fact very similar to sklearn’s CountVectorizer using the char or char_wb analyzer option from that module.
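To make the idea of counting character factors with incremental updates concrete, here is a minimal pure-Python sketch. It is not the package's implementation: the names FactorCounter, factors, and n_range are hypothetical and chosen only for illustration, and rows are stored as plain counters rather than sparse matrices.

```python
from collections import Counter


def factors(text, n_range=3):
    """Yield every factor (substring) of `text` with length 1..n_range."""
    for start in range(len(text)):
        for length in range(1, n_range + 1):
            if start + length > len(text):
                break
            yield text[start:start + length]


class FactorCounter:
    """Toy incremental bag-of-factors counter (hypothetical, not the bof class)."""

    def __init__(self, n_range=3):
        self.n_range = n_range
        self.vocabulary = {}  # factor -> column index; grows as documents arrive
        self.rows = []        # one Counter {column index: count} per document

    def fit_transform(self, corpus):
        """Add documents on top of the existing vocabulary; return their rows."""
        new_rows = []
        for doc in corpus:
            row = Counter()
            for factor in factors(doc, self.n_range):
                # Unknown factors get the next free column index.
                col = self.vocabulary.setdefault(factor, len(self.vocabulary))
                row[col] += 1
            new_rows.append(row)
        self.rows.extend(new_rows)
        return new_rows


counter = FactorCounter(n_range=2)
first = counter.fit_transform(["riri", "fifi"])
size_before = len(counter.vocabulary)
# A second call extends the vocabulary instead of rebuilding it.
second = counter.fit_transform(["loulou"])
size_after = len(counter.vocabulary)
```

Calling fit_transform a second time keeps the column indices of already known factors, which is the property that makes incremental updates possible in this sketch.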
Fuzz#
The fuzz module mimics fuzzywuzzy-like packages such as:
fuzzywuzzy (seatgeek/fuzzywuzzy)
rapidfuzz (maxbachmann/rapidfuzz)
The main difference is that the Levenshtein distance is replaced by the Joint Complexity distance. The API is also slightly changed to enable new features:
The list of possible choices can be pre-trained (fit) to accelerate the computation when a stream of queries is sent against the same list of choices.
Instead of a single query, a list of queries can be used. Computations will be parallelized.
The main fuzz entry point is the Process class.
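The fit-once, query-many pattern described above can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the package's API: ToyProcess and its fit and extract methods are hypothetical names, the score is a plain Jaccard overlap of factor sets used as a simplified stand-in for the Joint Complexity distance, and queries are processed sequentially rather than in parallel.

```python
def factor_set(text, n_range=3):
    """Return the set of distinct factors (substrings) of length 1..n_range."""
    return {
        text[start:start + length]
        for start in range(len(text))
        for length in range(1, n_range + 1)
        if start + length <= len(text)
    }


class ToyProcess:
    """Hypothetical stand-in for a matcher that pre-trains on its choices."""

    def __init__(self, n_range=3):
        self.n_range = n_range
        self.choices = []
        self.choice_factors = []

    def fit(self, choices):
        # Precompute factor sets once so that later queries skip this work.
        self.choices = list(choices)
        self.choice_factors = [factor_set(c, self.n_range) for c in self.choices]
        return self

    def extract(self, queries, limit=2):
        """Return, for each query, its `limit` best (choice, score) pairs."""
        results = []
        for query in queries:
            query_factors = factor_set(query, self.n_range)
            scored = []
            for choice, choice_factors in zip(self.choices, self.choice_factors):
                union = len(query_factors | choice_factors)
                # Jaccard overlap: shared factors over all factors (assumed score).
                score = len(query_factors & choice_factors) / union if union else 0.0
                scored.append((choice, score))
            scored.sort(key=lambda pair: pair[1], reverse=True)
            results.append(scored[:limit])
        return results


matcher = ToyProcess(n_range=3).fit(["riri", "fifi", "loulou"])
matches = matcher.extract(["rifi", "lulu"], limit=1)
```

Here matches holds one ranked list per query, so a batch of queries reuses the factor sets computed during fit.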
Getting Started#
Look at examples from the reference section.
Credits#
This package was created with Cookiecutter and the package_helper_2 project template.