Reference#

Fuzz#

The fuzz module mimics fuzzywuzzy-like packages such as fuzzywuzzy.

The main difference is that the Levenshtein distance is replaced by the Joint Complexity distance. The API is also slightly changed to enable new features:

  • The list of possible choices can be pre-trained (fit()) to accelerate the computation when a stream of queries is sent against the same list of choices.

  • Instead of a single query, a list of queries can be used; computations will then be parallelized.

The main fuzz entry point is the Process class.
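For example, one can pre-train on a list of choices and then process queries (single or batched) against it. The outputs below reuse the documented extract() example, since extract() falls back on the fitted choices when choices is None:

>>> p = Process()
>>> p.fit(["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"])
>>> p.extract(["new york", "atlanta"], limit=2, score_cutoff=0.0)
[[('New York Jets', 56.60377358490566), ('New York Giants', 47.61904761904762)],
[('Atlanta Falcons', 37.28813559322034), ('New York Giants', 7.594936708860759)]]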

class bof.fuzz.Process(n_range=5, preprocessor=None, length_impact=0.5, allow_updates=True)[source]#

The Process class is used to compute the closest choices from a list of queries, based on joint complexity.

Parameters:
  • n_range (int or None, optional) – Maximum factor size. If None, all factors will be extracted.

  • preprocessor (callable, optional) – Preprocessing function to apply to texts before adding them to the factor tree.

  • length_impact (float) – Importance of the length difference between two texts when computing the scores.

  • allow_updates (bool) – Whether new factors encountered while transforming queries are kept in the CountVectorizer.

  • filename (str, optional) – If set, load from corresponding file.

  • path (str or Path, optional) – If set, specify the directory where the file is located.

vectorizer#

The vectorizer used to compute factors.

Type:

CountVectorizer

choices#

The possible choices.

Type:

list of str

choices_matrix#

The factor matrix of the choices.

Type:

csr_matrix

choices_factors#

Number of unique factors of each choice.

Type:

ndarray

choices_length#

Number of choices.

Type:

int

dedupe(contains_dup, threshold=60.0)[source]#

Inspired by fuzzywuzzy’s dedupe function to remove (near) duplicates. Currently barely optimized (and probably buggy).

Parameters:
  • contains_dup (list of str) – A list of texts with possible nearly redundant entries.

  • threshold (float, optional) – Minimal score that a result must achieve to be considered redundant.

Return type:

list of str

Examples

>>> contains_dupes = ['Frodo Baggin', 'Frodo Baggins', 'F. Baggins', 'Samwise G.', 'Gandalf', 'Bilbo Baggins']
>>> p = Process()
>>> p.dedupe(contains_dupes)
['Frodo Baggins', 'F. Baggins', 'Samwise G.', 'Gandalf', 'Bilbo Baggins']

F. Baggins is kept because the length difference lowers its score below the threshold. Let us ignore the length:

>>> p.length_impact = 0.0
>>> p.dedupe(contains_dupes)
['Frodo Baggins', 'Samwise G.', 'Gandalf', 'Bilbo Baggins']
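For reference, here is a rough sketch of how such a dedupe can be built from the documented primitives (extract() and txt_max()). The name dedupe_sketch is hypothetical; this only illustrates the idea and is not the actual implementation:

>>> def dedupe_sketch(contains_dup, threshold=60.0):
...     p = Process()
...     # score every text against every other text of the list
...     matches = p.extract(contains_dup, contains_dup, limit=None, score_cutoff=threshold)
...     kept = []
...     for text, text_matches in zip(contains_dup, matches):
...         group = [t for t, score in text_matches]
...         # keep only the representative of each duplicate group (longest text, cf. txt_max)
...         if text == group[txt_max(group)]:
...             kept.append(text)
...     return kept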

extract(queries, choices=None, limit=5, score_cutoff=40.0)[source]#

Find the best matches amongst a list of choices.

Parameters:
  • queries (str or list of str) – Text (or list of texts) to match against the choices.

  • choices (list of str, optional) – Possible choices. If None, the previously used (or fitted) choices will be used.

  • limit (int or None) – Maximal number of results to give. If None, all choices will be returned (sorted).

  • score_cutoff (float, optional) – Minimal score that a result must achieve to be considered a match.

Returns:

If queries is a single text, the list of tuples containing the best choices and their scores. If queries is a list of texts, the list of lists of tuples containing the best choices and their scores.

Return type:

list

Examples

>>> p = Process()
>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> p.extract("new york jets", choices, limit=2)
[('New York Jets', 100.0), ('New York Giants', 46.835443037974684)]
>>> p.extract("new york jets", choices, limit=None)
[('New York Jets', 100.0),
('New York Giants', 46.835443037974684),
('Atlanta Falcons', 0.0),
('Dallas Cowboys', 0.0)]
>>> p.extract(["new york", "atlanta"], choices, limit=2, score_cutoff=0.0)
[[('New York Jets', 56.60377358490566), ('New York Giants', 47.61904761904762)],
[('Atlanta Falcons', 37.28813559322034), ('New York Giants', 7.594936708860759)]]

extractOne(queries, choices=None, score_cutoff=40.0)[source]#

Find the best match amongst a list of choices.

Parameters:
  • queries (str or list of str) – Text (or list of texts) to match against the choices.

  • choices (list of str, optional) – Possible choices. If None, the previously used (or fitted) choices will be used.

  • score_cutoff (float, optional) – Minimal score that a result must achieve to be considered a match.

Returns:

If queries is a single text, a tuple containing the best choice and its score. If queries is a list of texts, the list of tuples containing the best choices and their scores.

Return type:

tuple or list

Examples

>>> choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
>>> p = Process()
>>> p.extractOne("Cowboys", choices)
('Dallas Cowboys', 42.857142857142854)
>>> p.extractOne(["Cowboys", "falcon's"], choices)
[('Dallas Cowboys', 42.857142857142854), None]
>>> p.extractOne(["Cowboys", "falcon's"], choices, score_cutoff=30)
[('Dallas Cowboys', 42.857142857142854), ('Atlanta Falcons', 30.88235294117647)]

fit(choices)[source]#

Compute the factors of a list of choices.

Parameters:
  • choices (list of str) – The possible choices.

Return type:

None

Examples

>>> p = Process()
>>> p.fit(["riri", "fifi", "rififi"])

The choices:

>>> p.choices
['riri', 'fifi', 'rififi']

Number of unique factors for each choice:

>>> p.choices_factors.astype(int)
array([ 7,  7, 14])

The matrix that associates factors to choices (one row per factor, one column per choice):

>>> p.choices_matrix.toarray()
array([[2, 0, 1],
       [2, 0, 1],
       [1, 0, 0],
       [1, 0, 0],
       [2, 2, 3],
       [1, 0, 0],
       [1, 0, 0],
       [0, 2, 2],
       [0, 2, 2],
       [0, 1, 1],
       [0, 1, 1],
       [0, 1, 2],
       [0, 1, 2],
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1]], dtype=uint32)

The corresponding factors:

>>> p.vectorizer.features
['r', 'ri', 'rir', 'riri', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'fifi', 'if', 'ifi', 'rif', 'rifi', 'rifif', 'ifif', 'ififi']

reset()[source]#

Clear choices from the object.

Examples

>>> p = Process()
>>> p.fit(["riri", "fifi", "rififi"])
>>> p.choices
['riri', 'fifi', 'rififi']
>>> p.choices_factors.astype(int)
array([ 7,  7, 14])
>>> p.reset()
>>> p.choices is None
True
>>> p.choices_factors is None
True

transform(queries, threshold=0.0)[source]#

Compute the joint complexities of queries against choices.

Parameters:
  • queries (list of str) – The list of queries.

  • threshold (float) – Relative threshold below which joint complexity is not actually computed.

Examples

>>> p = Process()
>>> p.fit(["riri", "fifi", "rififi"])

Notice the number of factors:

>>> p.vectorizer.m
18
>>> p.transform(["rir", "fido", "rafifi", "screugneuhneu"])
array([[71.42857143,  9.09090909, 18.75      ],
       [ 6.25      , 21.42857143, 14.28571429],
       [ 9.09090909, 41.17647059, 34.7826087 ],
       [ 1.92307692,  0.        ,  1.69491525]])
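
As a side note, with the default length_impact=0.5 these scores appear to coincide with 100 times the Jaccard index of the factor sets. A hypothetical re-computation of the first entry (jaccard_sketch is an illustrative helper, not part of the API):

>>> def jaccard_sketch(query, choice, n_range=5):
...     # set of all substrings of size at most n_range
...     factors = lambda t: {t[i:i + k] for i in range(len(t)) for k in range(1, n_range + 1)}
...     q, c = factors(query), factors(choice)
...     return 100 * len(q & c) / len(q | c)
>>> jaccard_sketch("rir", "riri")
71.42857142857143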

The factors have been augmented with the ones from the queries:

>>> p.vectorizer.m
79

This could become a memory issue if you keep entering very different queries. To keep the factors clean after a transform, set allow_updates to False.

>>> p.allow_updates = False
>>> p.fit(["riri", "fifi", "rififi"])
>>> p.transform(["rir", "fido", "rafifi", "screugneuhneu"])
array([[71.42857143,  9.09090909, 18.75      ],
       [ 6.25      , 21.42857143, 14.28571429],
       [ 9.09090909, 41.17647059, 34.7826087 ],
       [ 1.92307692,  0.        ,  1.69491525]])
>>> p.vectorizer.m
18
>>> p.vectorizer.features
['r', 'ri', 'rir', 'riri', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'fifi', 'if', 'ifi', 'rif', 'rifi', 'rifif', 'ifif', 'ififi']

bof.fuzz.get_best_choice(choices, scores, score_cutoff)[source]#

Given a list of choices with scores, extract the best choice.

Parameters:
  • choices (list of str) – List of choices.

  • scores (ndarray) – Scores of the choices (must have the same size as choices).

  • score_cutoff (float) – Minimal score that the result must achieve to be considered a match.

Returns:

Tuple containing the choice and its score if the latter is above the cutoff, None otherwise.

Return type:

tuple or None

bof.fuzz.get_best_choices(choices, scores, limit)[source]#

Given a list of choices with scores, extract the best choices.

Parameters:
  • choices (list of str) – List of choices.

  • scores (ndarray) – Scores of the choices (must have the same size as choices).

  • limit (int or None) – Maximal number of results to give. If None, all choices will be returned (sorted).

Returns:

List of tuples containing the choices and their scores.

Return type:

list
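
A minimal sketch of the expected behavior (sort by decreasing score, then truncate to limit); the helper name best_choices_sketch is hypothetical:

>>> import numpy as np
>>> def best_choices_sketch(choices, scores, limit):
...     order = np.argsort(-scores)  # indices sorted by decreasing score
...     if limit is not None:
...         order = order[:limit]
...     return [(choices[i], float(scores[i])) for i in order]
>>> best_choices_sketch(["a", "b", "c"], np.array([10.0, 30.0, 20.0]), limit=2)
[('b', 30.0), ('c', 20.0)]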

bof.fuzz.jit_common_factors(queries_length, xind, xptr, choices_length, yind, yptr, m)[source]#

Jitted function to compute the common factors between a corpus of queries and a corpus of choices.

Parameters:
  • queries_length (int) – Number of documents in the corpus of queries

  • xind (ndarray) – Indices of the query factor matrix

  • xptr (ndarray) – Pointers of the query factor matrix

  • choices_length (int) – Number of documents in the corpus of choices

  • yind (ndarray) – Indices of the transposed choices factor matrix

  • yptr (ndarray) – Pointers of the transposed choices factor matrix

  • m (int) – Size of the factor space for choices

Returns:

A queries_length × choices_length matrix that contains the number of (unique) common factors between queries and choices.

Return type:

ndarray
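
Conceptually, this amounts to a binarized matrix product. A scipy-based sketch of the same computation, assuming x and y are the queries-by-factors and choices-by-factors csr_matrix (the jitted version works directly on the indices and pointers for speed; common_factors_sketch is an illustrative name):

>>> import numpy as np
>>> def common_factors_sketch(x, y):
...     # entry (i, j) counts the unique factors shared by query i and choice j
...     return ((x > 0).astype(np.int64) @ (y > 0).astype(np.int64).T).toarray()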

bof.fuzz.jit_jc(queries_factors, choices_factors, common_factors, length_impact, threshold=0.0)[source]#

Jitted function to compute a joint complexity between a corpus of queries and a corpus of choices.

Parameters:
  • queries_factors (ndarray) – Vector of the number of unique factors for each query.

  • choices_factors (ndarray) – Vector of the number of unique factors for each choice.

  • common_factors (ndarray) – Matrix of the number of common unique factors between queries and choices.

  • length_impact (float) – Importance of the length difference between two texts when computing the scores.

  • threshold (float) – Don’t compute the joint complexity if the number of common factors is less than threshold × (number of query factors).

Returns:

Joint Complexity matrix

Return type:

ndarray

bof.fuzz.jit_square_factors(xind, xptr, yind, yptr, n, length_impact)[source]#

Jitted function to compute the joint complexity between texts of a corpus.

Parameters:
  • xind (ndarray) – Indices of the factor matrix

  • xptr (ndarray) – Pointers of the factor matrix

  • yind (ndarray) – Indices of the transposed factor matrix

  • yptr (ndarray) – Pointers of the transposed factor matrix

  • n (int) – Corpus size

  • length_impact (float) – Importance of the length difference between two texts when computing the scores.

Returns:

An n × n matrix that contains the joint complexity scores of the corpus.

Return type:

ndarray

bof.fuzz.txt_max(txt_list)[source]#

Auxiliary function that returns the index of the longest text from a list. In case of ties, alphabetic order is used.

Parameters:

txt_list (list of str) – List of texts.

Returns:

The index of the longest text.

Return type:

int

Examples

>>> txt_max(["This is long!", "That is long!", "It's short"])
1

Feature Extraction#

The feature_extraction module mimics the sklearn.feature_extraction.text module (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text) with a focus on character-based extraction.

The main differences are:

  • it is slightly faster;

  • the features can be incrementally updated;

  • it is possible to fit only a random sample of factors to reduce space and computation time.

The main entry point for this module is the CountVectorizer class, which mimics its scikit-learn counterpart (also named CountVectorizer). It is in fact very similar to sklearn’s CountVectorizer with the char or char_wb analyzer option.
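For comparison, a rough scikit-learn sketch of a similar extraction (note that sklearn lowercases by default and orders its vocabulary differently):

>>> from sklearn.feature_extraction.text import CountVectorizer as SkCountVectorizer
>>> sk = SkCountVectorizer(analyzer="char", ngram_range=(1, 5), lowercase=False)
>>> matrix = sk.fit_transform(["riri", "fifi", "rififi"])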

class bof.feature_extraction.CountVectorizer(n_range=5, preprocessor=None)[source]#

Counts the factors of a list of documents.

Parameters:
  • preprocessor (callable, optional) – Preprocessing function to apply to texts before adding them to the factor tree.

  • n_range (int or None, optional) – Maximum factor size. If None, all factors will be extracted.

  • filename (str, optional) – If set, load from corresponding file.

  • path (str or Path, optional) – If set, specify the directory where the file is located.

features_#

Dictionary that maps factors to their index in the list.

Type:

dict of str -> int

Examples

Build a vectorizer limiting factor size to 3:

>>> vectorizer = CountVectorizer(n_range=3)

Build the factor matrix of a corpus of texts.

>>> corpus = ["riri", "fifi", "rififi"]
>>> vectorizer.fit_transform(corpus=corpus).toarray()
array([[2, 2, 1, 2, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 2, 0, 0, 2, 2, 1, 1, 1, 0],
       [1, 1, 0, 3, 0, 0, 2, 2, 1, 2, 2, 1]], dtype=uint32)

List the factors in the corpus:

>>> vectorizer.features
['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi', 'rif']

property features#

Get the list of features (internally, features are stored as a Numba Typed dict that associates factors to indexes).

Returns:

features – List of factors.

Return type:

list of str

fit(corpus, reset=True)[source]#

Build the features. Does not build the factor matrix.

Parameters:
  • corpus (list of str) – Texts to analyze.

  • reset (bool) – Clears current features and corpus. If False, features will be updated instead.

Return type:

None

Examples

We compute the factors of a corpus.

>>> vectorizer = CountVectorizer(n_range=3)
>>> vectorizer.fit(["riri", "fifi", "rififi"])

The fit method does not return anything, but the factors have been populated:

>>> vectorizer.features
['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi', 'rif']

We fit another corpus.

>>> vectorizer.fit(["riri", "fifi"])

The factors have been implicitly reset (rif is gone in this toy example):

>>> vectorizer.features
['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi']

We keep pre-existing factors by setting reset to False:

>>> vectorizer.fit(["rififi"], reset=False)

The list of features has been updated (with rif):

>>> vectorizer.features
['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi', 'rif']

fit_transform(corpus, reset=True)[source]#

Build the features and return the factor matrix.

Parameters:
  • corpus (list of str) – Texts to analyze.

  • reset (bool, optional) – Clears factors. If False, factors are updated instead.

Returns:

A sparse matrix that indicates for each document of the corpus its factors and their multiplicity.

Return type:

csr_matrix

Examples

Build a FactorTree from a corpus of three documents:

>>> vectorizer = CountVectorizer(n_range=3)
>>> vectorizer.fit_transform(["riri", "fifi", "rififi"]).toarray()
array([[2, 2, 1, 2, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 2, 0, 0, 2, 2, 1, 1, 1, 0],
       [1, 1, 0, 3, 0, 0, 2, 2, 1, 2, 2, 1]], dtype=uint32)

List of factors (of size at most 3):

>>> vectorizer.features
['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi', 'rif']

Build a FactorTree from a corpus of two documents.

>>> vectorizer.fit_transform(["fifi", "rififi"]).toarray()
array([[2, 2, 1, 2, 1, 1, 0, 0, 0],
       [2, 2, 1, 3, 2, 2, 1, 1, 1]], dtype=uint32)

Notice the implicit reset, as only factors from “fifi” and “rififi” are present:

>>> vectorizer.features
['f', 'fi', 'fif', 'i', 'if', 'ifi', 'r', 'ri', 'rif']
>>> vectorizer.m
9

With reset set to False, we can add another list without discarding pre-existing factors.

>>> vectorizer.fit_transform(["riri"], reset=False).toarray()
array([[0, 0, 0, 2, 0, 0, 2, 2, 0, 1, 1, 1]], dtype=uint32)

Notice the presence of empty columns, which correspond to pre-existing factors that do not appear in “riri”.

The size and list of factors:

>>> vectorizer.m
12
>>> vectorizer.features
['f', 'fi', 'fif', 'i', 'if', 'ifi', 'r', 'ri', 'rif', 'rir', 'ir', 'iri']

Setting n_range to None will compute all factors.

>>> vectorizer.n_range = None
>>> vectorizer.fit_transform(["riri", "fifi", "rififi"]).toarray()
array([[2, 2, 1, 1, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 2, 0, 0, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
       [1, 1, 0, 0, 3, 0, 0, 2, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1]], dtype=uint32)

property m#

Get the number of features.

Returns:

m – Number of factors.

Return type:

int

no_none_range()[source]#

Replace a None n_range by 0 before passing it to the jitted code.

Return type:

None

sampling_fit(corpus, reset=True, sampling_rate=0.5, seed=None)[source]#

Build a partial factor tree where only a random subset of factors is selected. Note that there is no sampling_fit_transform method, as merging the two steps would introduce inconsistencies in the factor description: you have to do a sampling_fit followed by a transform.

Parameters:
  • corpus (list of str) – Texts to analyze.

  • reset (bool) – Clears FactorTree. If False, FactorTree will be updated instead.

  • sampling_rate (float) – Probability to explore factors starting from one given position in the text.

  • seed (int, optional) – Seed of the random generator.

Return type:

None

Examples

We fit a corpus the normal way to see the complete list of factors (of size at most 5, the default):

>>> vectorizer = CountVectorizer()
>>> vectorizer.fit(["riri", "fifi", "rififi"])
>>> vectorizer.features
['r', 'ri', 'rir', 'riri', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'fifi', 'if', 'ifi', 'rif', 'rifi', 'rifif', 'ifif', 'ififi']

Now we use a sampling fit instead. Only a subset of the factors are selected.

>>> vectorizer.sampling_fit(["riri", "fifi", "rififi"], seed=42)
>>> vectorizer.features
['r', 'ri', 'rir', 'riri', 'f', 'fi', 'fif', 'fifi', 'i', 'if', 'ifi']

We sampling-fit another corpus, this time with a lower sampling rate (and no seed).

>>> vectorizer.sampling_fit(["riri", "fifi"], sampling_rate=.2)

The factors have been implicitly reset.

>>> vectorizer.features
['r', 'ri', 'rir', 'riri', 'i', 'ir', 'iri']

We add another corpus to the fit by setting reset to False:

>>> vectorizer.sampling_fit(["rififi"], reset=False, sampling_rate=.2)

The list of features has been updated:

>>> vectorizer.features
['r', 'ri', 'rir', 'riri', 'i', 'ir', 'iri', 'f', 'fi']
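
As there is no sampling_fit_transform, the factor matrix is obtained with a subsequent transform. A minimal sketch of the two-step pattern:

>>> vectorizer = CountVectorizer()
>>> vectorizer.sampling_fit(["riri", "fifi", "rififi"], seed=42)
>>> matrix = vectorizer.transform(["riri", "fifi", "rififi"])
>>> matrix.shape[1] == vectorizer.m
True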

transform(corpus)[source]#

Build the factor matrix from the factors already computed. New factors are discarded.

Parameters:

corpus (list of str) – Texts to analyze.

Returns:

The factor count of the input corpus. NB: the factor count of the pre-existing (fitted) corpus is not returned but is internally preserved.

Return type:

csr_matrix

Examples

To start, we fit a corpus:

>>> vectorizer = CountVectorizer(n_range=3)
>>> vectorizer.fit_transform(["riri", "fifi", "rififi"]).toarray()
array([[2, 2, 1, 2, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 2, 0, 0, 2, 2, 1, 1, 1, 0],
       [1, 1, 0, 3, 0, 0, 2, 2, 1, 2, 2, 1]], dtype=uint32)

The factors are:

>>> vectorizer.features
['r', 'ri', 'rir', 'i', 'ir', 'iri', 'f', 'fi', 'fif', 'if', 'ifi', 'rif']

We now apply a transform.

>>> vectorizer.transform(["fir", "rfi"]).toarray()
array([[1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0],
       [1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0]], dtype=uint32)

The features have not been updated. For example, the only factors reported for “rfi” are “r”, “i”, “f”, and “fi”. Factors that were not fit (e.g. rf) are discarded.

bof.feature_extraction.build_end(n_range=None)[source]#

Return a function of a starting position s and a text length tl that tells where to stop scanning the text from s. It avoids testing the value of n_range repeatedly when doing factor extraction.

Parameters:

n_range (int or None) – Maximal factor size. If 0 or None, all factors are considered.

Return type:

callable

Examples

>>> end = build_end()
>>> end(7, 15)
15
>>> end(13, 15)
15
>>> end = build_end(5)
>>> end(7, 15)
12
>>> end(13, 15)
15
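
A sketch of the equivalent logic (the name build_end_sketch is illustrative):

>>> def build_end_sketch(n_range=None):
...     # without a bound, scan to the end of the text;
...     # otherwise stop n_range characters after the starting position
...     if not n_range:
...         return lambda start, text_length: text_length
...     return lambda start, text_length: min(start + n_range, text_length)
>>> build_end_sketch(5)(7, 15)
12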

Common#

The common module contains miscellaneous classes and functions.

class bof.common.MixInIO[source]#

Provide basic save/load capacities to other classes.

dump(filename: str, path='.', overwrite=False, compress=True, stemize=True)[source]#

Save instance to file.

Parameters:
  • filename (str) – The stem of the filename.

  • path (str or Path, optional) – The location path.

  • overwrite (bool, default=False) – Should an existing file be overwritten?

  • compress (bool, default=True) – Should gzip compression be used?

  • stemize (bool, default=True) – Trim any extension (e.g. .xxx).

Examples

>>> import tempfile
>>> v1 = ToyClass(42)
>>> v2 = ToyClass()
>>> v2.value
0
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     v1.dump(filename='myfile', compress=True, path=tmpdirname)
...     dir_content = [file.name for file in Path(tmpdirname).glob('*')]
...     v2 = ToyClass.load(filename='myfile', path=Path(tmpdirname))
...     v1.dump(filename='myfile', compress=True, path=tmpdirname) # doctest: +ELLIPSIS
File ...myfile.pkl.gz already exists! Use overwrite option to overwrite.
>>> dir_content
['myfile.pkl.gz']
>>> v2.value
42
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     v1.dump(filename='myfile', compress=False, path=tmpdirname)
...     v1.dump(filename='myfile', compress=False, path=tmpdirname) # doctest: +ELLIPSIS
File ...myfile.pkl already exists! Use overwrite option to overwrite.
>>> v1.value = 51
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...     v1.dump(filename='myfile', path=tmpdirname, compress=False)
...     v1.dump(filename='myfile', path=tmpdirname, overwrite=True, compress=False)
...     v2 = ToyClass.load(filename='myfile', path=tmpdirname)
...     dir_content = [file.name for file in Path(tmpdirname).glob('*')]
>>> dir_content
['myfile.pkl']
>>> v2.value
51
>>> with tempfile.TemporaryDirectory() as tmpdirname:
...    v2 = ToyClass.load(filename='thisfilenamedoesnotexist')
Traceback (most recent call last):
 ...
FileNotFoundError: [Errno 2] No such file or directory: ...

classmethod load(filename: str, path='.')[source]#

Load instance from file.

Parameters:
  • filename (str) – The stem of the filename.

  • path (str or Path, optional) – The location path.

class bof.common.ToyClass(value=0)[source]#
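
The class is used in the dump/load examples above (e.g. ToyClass(42).value is 42), presumably inheriting its save/load capacities from MixInIO. A minimal sketch consistent with that usage:

>>> class ToyClass(MixInIO):
...     """Store a single value; dump/load come from MixInIO."""
...     def __init__(self, value=0):
...         self.value = value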