Reference
Fuzz
The fuzz module mimics fuzzy-matching packages such as:
- fuzzywuzzy (https://github.com/seatgeek/fuzzywuzzy)
- rapidfuzz (https://github.com/maxbachmann/rapidfuzz)

The main difference is that the Levenshtein distance is replaced by the Joint Complexity distance. The API is also slightly changed to enable new features:
- The list of possible choices can be pre-trained (fit()) to accelerate the computation when a stream of queries is sent against the same list of choices.
- Instead of one single query, a list of queries can be used; computations will then be parallelized.
The main entry point of the fuzz module is the Process class.
- class bof.fuzz.Process(n_range=5, preprocessor=None, length_impact=0.5, allow_updates=True, filename=None, path='.')[source]
The Process class computes the closest choices from a list of queries, based on joint complexity.
- Parameters
  - n_range (int or None, optional) – Maximum factor size. If None, all factors will be extracted.
  - preprocessor (callable, optional) – Preprocessing function to apply to texts before adding them to the factor tree.
  - length_impact (float) – Importance of the length difference between two texts when computing the scores.
  - allow_updates (bool) – When transforming queries, are new factors kept in the CountVectorizer?
  - filename (str, optional) – If set, load from the corresponding file.
  - path (str or Path, optional) – If set, specify the directory where the file is located.
- vectorizer
The vectorizer used to compute factors.
- choices_matrix
The factor matrix of the choices.
- dedupe(contains_dup, threshold=60.0)[source]
Inspired by fuzzywuzzy’s dedupe function, removes (near) duplicates from a list. Currently barely optimized (and probably buggy).
Examples
>>> contains_dupes = ['Frodo Baggin', 'Frodo Baggins', 'F. Baggins', 'Samwise G.', 'Gandalf', 'Bilbo Baggins']
>>> p = Process()
>>> p.dedupe(contains_dupes)
['Frodo Baggins', 'F. Baggins', 'Samwise G.', 'Gandalf', 'Bilbo Baggins']
F. Baggins is kept because the length difference impacts the results. Let us ignore the length.
>>> p.length_impact = 0.0
>>> p.dedupe(contains_dupes)
['Frodo Baggins', 'Samwise G.', 'Gandalf', 'Bilbo Baggins']
- extract(queries, choices=None, limit=5, score_cutoff=40.0)[source]
Find the best matches amongst a list of choices.
- Parameters
  - queries (str or list of str) – Text (or list of texts) to match amongst the choices.
  - choices (list of str, optional) – Possible choices. If None, the previously used (or fitted) choices will be used.
  - limit (int or None) – Maximal number of results to return. If None, all choices will be returned (sorted).
  - score_cutoff (float, optional) – Minimal score that a result must achieve to be considered a match.
- Returns
If queries is a single text, the list of tuples containing the best choices and their scores. If queries is a list of texts, the list of list of tuples containing the best choices and their scores.
Examples
>>> p = Process()
>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> p.extract("new york jets", choices, limit=2)
[('New York Jets', 100.0), ('New York Giants', 46.835443037974684)]
>>> p.extract("new york jets", choices, limit=None)
[('New York Jets', 100.0), ('New York Giants', 46.835443037974684), ('Atlanta Falcons', 0.0), ('Dallas Cowboys', 0.0)]
>>> p.extract(["new york", "atlanta"], choices, limit=2, score_cutoff=0.0)
[[('New York Jets', 56.60377358490566), ('New York Giants', 47.61904761904762)], [('Atlanta Falcons', 37.28813559322034), ('New York Giants', 7.594936708860759)]]
- extractOne(queries, choices=None, score_cutoff=40.0)[source]
Find the best match amongst a list of choices.
- Returns
If queries is a single text, a tuple containing the best choice and its score. If queries is a list of texts, the list of tuples containing the best choices and their scores.
Examples
>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> p = Process()
>>> p.extractOne("Cowboys", choices)
('Dallas Cowboys', 42.857142857142854)
>>> p.extractOne(["Cowboys", "falcon's"], choices)
[('Dallas Cowboys', 42.857142857142854), None]
>>> p.extractOne(["Cowboys", "falcon's"], choices, score_cutoff=30)
[('Dallas Cowboys', 42.857142857142854), ('Atlanta Falcons', 30.88235294117647)]
- fit(choices)[source]
Compute the factors of a list of choices.
Examples
>>> p = Process()
>>> p.fit(["riri", "fifi", "rififi"])
The choices:
>>> p.choices
['riri', 'fifi', 'rififi']
Number of unique factors for each choice:
>>> p.choices_factors
array([ 7,  7, 14], dtype=int32)
The matrix that associates factors to choices:
>>> p.choices_matrix.toarray()
array([[2, 0, 1],
       [2, 0, 1],
       [1, 0, 0],
       [1, 0, 0],
       [2, 2, 3],
       [1, 0, 0],
       [1, 0, 0],
       [0, 2, 2],
       [0, 2, 2],
       [0, 1, 1],
       [0, 1, 1],
       [0, 1, 2],
       [0, 1, 2],
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1]], dtype=uint32)
The corresponding factors:
>>> p.vectorizer.features
['r', 'ri', 'rir', 'riri', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'fifi', 'if', 'ifi', 'rif', 'rifi', 'rifif', 'ifif', 'ififi']
- reset()[source]
Clear choices from the object.
Examples
>>> p = Process()
>>> p.fit(["riri", "fifi", "rififi"])
>>> p.choices
['riri', 'fifi', 'rififi']
>>> p.choices_factors
array([ 7,  7, 14], dtype=int32)
>>> p.reset()
>>> p.choices is None
True
>>> p.choices_factors is None
True
- transform(queries, threshold=0.0)[source]
Compute the joint complexities of queries against choices.
Examples
>>> p = Process()
>>> p.fit(["riri", "fifi", "rififi"])
Notice the number of factors:
>>> p.vectorizer.m
18
>>> p.transform(["rir", "fido", "rafifi", "screugneuhneu"])
array([[71.42857143,  9.09090909, 18.75      ],
       [ 6.25      , 21.42857143, 14.28571429],
       [ 9.09090909, 41.17647059, 34.7826087 ],
       [ 1.92307692,  0.        ,  1.69491525]])
The factors have been augmented with the ones from the queries:
>>> p.vectorizer.m 79
This could become a memory issue if you keep entering very different queries. To keep the factors clean after a transform, set allow_updates to False.
>>> p.allow_updates = False
>>> p.fit(["riri", "fifi", "rififi"])
>>> p.transform(["rir", "fido", "rafifi", "screugneuhneu"])
array([[71.42857143,  9.09090909, 18.75      ],
       [ 6.25      , 21.42857143, 14.28571429],
       [ 9.09090909, 41.17647059, 34.7826087 ],
       [ 1.92307692,  0.        ,  1.69491525]])
>>> p.vectorizer.m
18
>>> p.vectorizer.features
['r', 'ri', 'rir', 'riri', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'fifi', 'if', 'ifi', 'rif', 'rifi', 'rifif', 'ifif', 'ififi']
- bof.fuzz.get_best_choice(choices, scores, score_cutoff)[source]
Given a list of choices with scores, extract the best choice.
- Returns
Tuple containing the choice and its score if the latter is above the cutoff, None otherwise.
- Return type
tuple or None
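The documented contract can be sketched in plain Python. This is an illustrative re-implementation, not the library code (the actual helper operates on NumPy score arrays, and behavior at exact ties or at the cutoff boundary may differ):

```python
def best_choice(choices, scores, score_cutoff):
    # Index of the highest score.
    best = max(range(len(scores)), key=lambda i: scores[i])
    # Return the (choice, score) pair only if it reaches the cutoff.
    if scores[best] >= score_cutoff:
        return choices[best], scores[best]
    return None
```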
- bof.fuzz.get_best_choices(choices, scores, limit)[source]
Given a list of choices with scores, extract the best choices.
- Returns
List of tuples containing the choices and their scores.
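Similarly, a plain-Python sketch of the documented behavior (illustrative only; the real function works on NumPy arrays, and ordering of equal scores may differ):

```python
def best_choices(choices, scores, limit):
    # All (choice, score) pairs, best scores first.
    pairs = sorted(zip(choices, scores), key=lambda p: p[1], reverse=True)
    # Keep only the `limit` best ones if a limit is given.
    return pairs if limit is None else pairs[:limit]
```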
- bof.fuzz.jit_common_factors(queries_length, xind, xptr, choices_length, yind, yptr, m)[source]
Jitted function to compute the common factors between a corpus of queries and a corpus of choices.
- Parameters
queries_length (
int
) – Number of documents in the corpus of queriesxind (
ndarray
) – Indices of the query factor matrixxptr (
ndarray
) – pointers of the query factor matrixchoices_length (
int
) – Number of documents in the corpus of choicesyind (
ndarray
) – Indices of the transposed choices factor matrixyptr (
ndarray
) – Pointers of the transposed choices factor matrixm (
int
) – Size of the factor space for choices
- Returns
A queries_length x choices_length matrix that contains the number of common (unique) factors between queries and choices.
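An unjitted, pure-Python sketch of the computation (assumed semantics, inferred from the parameter descriptions: xind/xptr are the CSR indices and pointers of the query factor matrix, yind/yptr those of the transposed choices matrix):

```python
def common_factors(queries_length, xind, xptr, choices_length, yind, yptr, m):
    # m is the size of the factor space; the jitted version uses it for allocation.
    out = [[0] * choices_length for _ in range(queries_length)]
    for q in range(queries_length):
        # Factors present in query q (CSR row q).
        for f in xind[xptr[q]:xptr[q + 1]]:
            # Choices containing factor f (row f of the transposed choices matrix).
            for c in yind[yptr[f]:yptr[f + 1]]:
                out[q][c] += 1
    return out
```

For instance, with query 0 holding factors {0, 1}, query 1 holding {1, 2}, choice 0 holding {0, 1, 2} and choice 1 holding {2}, the result counts the pairwise intersections.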
- bof.fuzz.jit_jc(queries_factors, choices_factors, common_factors, length_impact, threshold=0.0)[source]
Jitted function to compute a joint complexity between a corpus of queries and a corpus of choices.
- Parameters
queries_factors (
ndarray
) – Vector of the number of unique factors for each query.choices_factors (
ndarray
) – Vector of the number of unique factors for each choice.common_factors (
ndarray
) – Matrix of the number of common unique factors between queries and choices.length_impact (
float
) – Importance of the length difference between two texts when computing the scores.threshold (
float
) – Don’t compute JC is common factors is less than threshold X (# query factors)
- Returns
The Joint Complexity matrix.
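The scores in the Process examples above are consistent with a Jaccard index on factor sets, i.e. 100·|F(x) ∩ F(y)| / |F(x) ∪ F(y)|. Here is a pure-Python sketch of that interpretation (illustrative only; it ignores length_impact and threshold, which can alter the scores):

```python
def factor_set(text, n_range=5):
    # All distinct substrings (factors) of size at most n_range.
    out = set()
    for i in range(len(text)):
        for j in range(i + 1, min(len(text), i + n_range) + 1):
            out.add(text[i:j])
    return out

def joint_complexity(x, y, n_range=5):
    # Jaccard index of the two factor sets, as a percentage.
    fx, fy = factor_set(x, n_range), factor_set(y, n_range)
    return 100 * len(fx & fy) / len(fx | fy)
```

For instance, joint_complexity("rir", "riri") gives 500/7 ≈ 71.43, matching the first entry of the transform example above.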
- bof.fuzz.jit_square_factors(xind, xptr, yind, yptr, n, length_impact)[source]
Jitted function to compute the joint complexity between texts of a corpus.
- Parameters
  - xind (ndarray) – Indices of the factor matrix.
  - xptr (ndarray) – Pointers of the factor matrix.
  - yind (ndarray) – Indices of the transposed factor matrix.
  - yptr (ndarray) – Pointers of the transposed factor matrix.
  - n (int) – Corpus size.
  - length_impact (float) – Importance of the length difference between two texts when computing the scores.
- Returns
An n x n matrix that contains the joint complexity scores of the corpus.
Feature Extraction
The feature_extraction module mimics the sklearn.feature_extraction.text module (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text) with a focus on character-based extraction.
The main differences are:
- it is slightly faster;
- the features can be incrementally updated;
- it is possible to fit only a random sample of factors to reduce space and computation time.
The main entry point for this module is the CountVectorizer class, which mimics its scikit-learn counterpart (also named CountVectorizer). It is in fact very similar to sklearn’s CountVectorizer with the char or char_wb analyzer option.
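To make “factors” concrete: a factor is simply a substring, counted with multiplicity. A minimal stdlib sketch of the extraction (illustrative only; the actual class relies on a factor tree and sparse matrices):

```python
from collections import Counter

def count_factors(text, n_range=3):
    # Count every substring (factor) of length at most n_range.
    counts = Counter()
    for i in range(len(text)):
        for j in range(i + 1, min(len(text), i + n_range) + 1):
            counts[text[i:j]] += 1
    return counts
```

For "riri" this yields r:2, ri:2, rir:1, i:2, ir:1, iri:1, matching the first row of the fit_transform examples below.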
- class bof.feature_extraction.CountVectorizer(n_range=5, preprocessor=None, filename=None, path='.')[source]
Counts the factors of a list of documents.
- Parameters
  - preprocessor (callable, optional) – Preprocessing function to apply to texts before adding them to the factor tree.
  - n_range (int or None, optional) – Maximum factor size. If None, all factors will be extracted.
  - filename (str, optional) – If set, load from the corresponding file.
  - path (str or Path, optional) – If set, specify the directory where the file is located.
Examples
Build a vectorizer limiting factor size to 3:
>>> vectorizer = CountVectorizer(n_range=3)
Build the factor matrix of a corpus of texts.
>>> corpus = ["riri", "fifi", "rififi"]
>>> vectorizer.fit_transform(corpus=corpus).toarray()
array([[2, 2, 1, 2, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 2, 0, 0, 2, 2, 1, 1, 1, 0],
       [1, 1, 0, 3, 0, 0, 2, 2, 1, 2, 2, 1]], dtype=uint32)
List the factors in the corpus:
>>> vectorizer.features
['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi', 'rif']
- property features
Get the list of features (internally, features are stored as a dict that associates factors to indexes).
- fit(corpus, reset=True)[source]
Build the features. Does not build the factor matrix.
- Return type
None
Examples
We compute the factors of a corpus.
>>> vectorizer = CountVectorizer(n_range=3)
>>> vectorizer.fit(["riri", "fifi", "rififi"])
The fit method does not return anything, but the factors have been populated:
>>> vectorizer.features
['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi', 'rif']
We fit another corpus.
>>> vectorizer.fit(["riri", "fifi"])
The factors have been implicitly reset (rif is gone in this toy example):
>>> vectorizer.features
['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi']
We keep pre-existing factors by setting reset to False:
>>> vectorizer.fit(["rififi"], reset=False)
The list of features has been updated (with rif):
>>> vectorizer.features
['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi', 'rif']
- fit_transform(corpus, reset=True)[source]
Build the features and return the factor matrix.
- Returns
A sparse matrix that indicates for each document of the corpus its factors and their multiplicity.
Examples
Build a FactorTree from a corpus of three documents:
>>> vectorizer = CountVectorizer(n_range=3)
>>> vectorizer.fit_transform(["riri", "fifi", "rififi"]).toarray()
array([[2, 2, 1, 2, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 2, 0, 0, 2, 2, 1, 1, 1, 0],
       [1, 1, 0, 3, 0, 0, 2, 2, 1, 2, 2, 1]], dtype=uint32)
List of factors (of size at most 3):
>>> vectorizer.features
['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi', 'rif']
Build a FactorTree from a corpus of two documents.
>>> vectorizer.fit_transform(["fifi", "rififi"]).toarray()
array([[2, 2, 1, 2, 1, 1, 0, 0, 0],
       [2, 2, 1, 3, 2, 2, 1, 1, 1]], dtype=uint32)
Notice the implicit reset, as only factors from “fifi” and “rififi” are present:
>>> vectorizer.features
['f', 'fi', 'fif', 'i', 'if', 'ifi', 'r', 'ri', 'rif']
>>> vectorizer.m
9
With reset set to False, we can add another list without discarding pre-existing factors.
>>> vectorizer.fit_transform(["riri"], reset=False).toarray()
array([[0, 0, 0, 2, 0, 0, 2, 2, 0, 1, 1, 1]], dtype=uint32)
Notice the presence of empty columns, which correspond to pre-existing factors that do not exist in “riri”.
The size and list of factors:
>>> vectorizer.m
12
>>> vectorizer.features
['f', 'fi', 'fif', 'i', 'if', 'ifi', 'r', 'ri', 'rif', 'rir', 'ir', 'iri']
Setting n_range to None will compute all factors.
>>> vectorizer.n_range = None
>>> vectorizer.fit_transform(["riri", "fifi", "rififi"]).toarray()
array([[2, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 2, 0, 0, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
       [1, 1, 0, 0, 3, 0, 0, 2, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1]], dtype=uint32)
- sampling_fit(corpus, reset=True, sampling_rate=0.5, seed=42)[source]
Build a partial factor tree where only a random subset of factors is selected. Note that there is no sampling_fit_transform method, as combining the two steps would introduce inconsistencies in the factor description: you have to do a sampling_fit followed by a transform.
- Return type
None
Examples
We fit a corpus the normal way to see the complete list of factors (of size at most 5, the default).
>>> vectorizer = CountVectorizer()
>>> vectorizer.fit(["riri", "fifi", "rififi"])
>>> vectorizer.features
['r', 'ri', 'rir', 'riri', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'fifi', 'if', 'ifi', 'rif', 'rifi', 'rifif', 'ifif', 'ififi']
Now we use a sampling fit instead. Only a subset of the factors are selected.
>>> vectorizer.sampling_fit(["riri", "fifi", "rififi"])
>>> vectorizer.features
['i', 'ir', 'iri', 'f', 'fi', 'fif', 'fifi', 'if', 'ifi', 'r', 'ri', 'rif', 'rifi', 'rifif']
We apply a sampling fit to another corpus (the default seed makes the example reproducible).
>>> vectorizer.sampling_fit(["riri", "fifi"])
The factors have been implicitly reset.
>>> vectorizer.features
['i', 'ir', 'iri', 'f', 'fi', 'fif', 'fifi', 'if', 'ifi']
We add another corpus to the fit by setting reset to False:
>>> vectorizer.sampling_fit(["rififi"], reset=False)
The list of features has been updated:
>>> vectorizer.features
['i', 'ir', 'iri', 'f', 'fi', 'fif', 'fifi', 'if', 'ifi', 'ifif', 'ififi']
- transform(corpus)[source]
Build the factor matrix from the factors already computed. New factors are discarded.
- Returns
The factor count of the input corpus. NB: the factor count of the pre-existing corpus is not returned, but it is internally preserved.
Examples
To start, we fit a corpus:
>>> vectorizer = CountVectorizer(n_range=3)
>>> vectorizer.fit_transform(["riri", "fifi", "rififi"]).toarray()
array([[2, 2, 1, 2, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 2, 0, 0, 2, 2, 1, 1, 1, 0],
       [1, 1, 0, 3, 0, 0, 2, 2, 1, 2, 2, 1]], dtype=uint32)
The factors are:
>>> vectorizer.features
['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi', 'rif']
We now apply a transform.
>>> vectorizer.transform(["fir", "rfi"]).toarray()
array([[1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0],
       [1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0]], dtype=uint32)
The features have not been updated. For example, the only factors reported for “rfi” are “r”, “i”, “f”, and “fi”. Factors that were not fit (e.g. rf) are discarded.
- bof.feature_extraction.build_end(n_range=None)[source]
Return a function that, given a starting position s and a text length l, tells where to stop scanning the text from s. This avoids testing the value of n_range at every step of factor extraction.
- Parameters
  - n_range (int or None) – Maximal factor size. If 0 or None, all factors are considered.
- Return type
callable
Examples
>>> end = build_end()
>>> end(7, 15)
15
>>> end(13, 15)
15
>>> end = build_end(5)
>>> end(7, 15)
12
>>> end(13, 15)
15
- bof.feature_extraction.number_of_factors(length, n_range=None)[source]
Return the number of factors (with multiplicity) of size at most n_range that exist in a text of length length. This allows pre-allocating working memory.
- Returns
The number of factors (with multiplicity).
Examples
>>> l = len("riri")
>>> number_of_factors(l)
10
>>> number_of_factors(l, n_range=2)
7
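The count has a simple closed form: there are length − k + 1 factors of size k, so summing over k = 1..kmax with kmax = min(n_range, length) gives the values above. A sketch (hypothetical helper name, not part of the API):

```python
def number_of_factors_closed_form(length, n_range=None):
    # Sum of (length - k + 1) for k = 1..kmax.
    kmax = length if n_range is None else min(n_range, length)
    return kmax * length - kmax * (kmax - 1) // 2
```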
Common
The common module contains miscellaneous classes and functions.
- class bof.common.MixInIO[source]
Provide basic save/load capacities to other classes.
- save(filename: str, path='.', erase=False, compress=False)[source]
Save instance to file.
Examples
>>> import tempfile
>>> from pathlib import Path
>>> from bof.feature_extraction import CountVectorizer
>>> vect1 = CountVectorizer(n_range=3)
>>> vect1.fit(["riri", "fifi", "rififi"])
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     vect1.save(filename='myfile', compress=True, path=tmpdirname)
...     vect2 = CountVectorizer(filename='myfile', path=Path(tmpdirname))
>>> vect2.features
['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi', 'rif']
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     vect1.save(filename='myfile', compress=True, path=tmpdirname)
...     dir_content = [f.name for f in Path(tmpdirname).glob('*')]
...     vect2 = CountVectorizer(filename='myfile', path=Path(tmpdirname))
...     vect1.save(filename='myfile', compress=True, path=tmpdirname)  # doctest: +ELLIPSIS
File ...myfile.pkl.gz already exists! Use erase option to overwrite.
>>> dir_content
['myfile.pkl.gz']
>>> vect2.m
12
>>> from bof.fuzz import Process
>>> p1 = Process()
>>> p1.fit(["riri", "fifi", "rififi"])
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     p1.save(filename='myfile', compress=True, path=tmpdirname)
...     p2 = Process(filename='myfile', path=Path(tmpdirname))
>>> p2.extractOne("rififo")
('rififi', 63.1578947368421)
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     p1.save(filename='myfile', path=tmpdirname)
...     p1.save(filename='myfile', path=tmpdirname)  # doctest: +ELLIPSIS
File ...myfile.pkl already exists! Use erase option to overwrite.
>>> vect1.fit_transform(["titi"], reset=False).toarray()
array([[0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 1, 1, 1]], dtype=uint32)
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     vect1.save(filename='myfile', path=tmpdirname)
...     vect1.save(filename='myfile', path=tmpdirname, erase=True)
...     vect2.load(filename='myfile', path=tmpdirname)
...     dir_content = [f.name for f in Path(tmpdirname).glob('*')]
>>> dir_content
['myfile.pkl']
>>> vect2.features
['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi', 'rif', 't', 'ti', 'tit', 'it', 'iti']
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     vect2.load(filename='thisfilenamedoesnotexist', path=tmpdirname)  # doctest: +ELLIPSIS
Traceback (most recent call last):
    ...
FileNotFoundError: [Errno 2] No such file or directory: ...
- bof.common.default_preprocessor(txt)[source]
Default string preprocessor: trim extra spaces and convert the string txt to lower case.
Examples
>>> default_preprocessor(" LaTeX RuleZ ")
'latex rulez'
- bof.common.make_random_bool_generator(probability_true=0.5)[source]
Provides a (possibly biased) random generator of booleans.
- Parameters
  - probability_true (float, optional) – Probability to return True.
- Returns
random_boolean – A function that returns a random boolean when called.
- Return type
callable
Examples
>>> rb = make_random_bool_generator()
>>> set_seed(seed=42)
>>> [rb() for _ in range(10)]
[True, False, False, False, True, True, True, False, False, False]