Landmarks

Introduced in v0.4, this high-level module allows deeper analysis of a small corpus by using individual query results for the embedding.

class gismo.landmarks.Landmarks(source=None, to_text=None, **kwargs)

The Landmarks class is a subclass of Corpus. It offers the capability to batch-rank all its entries against a Gismo instance. After it has been processed, a Landmarks instance can be used to analyze or classify Gismo queries, Cluster objects, or other Landmarks.

Landmarks also offers the possibility to reduce a source or a gismo to its neighborhood. This can be useful if the source is huge and one wants something smaller for performance.

Parameters:
  • source (list) – The list of items that form Landmarks.

  • to_text (function) – The function that transforms an item into text.

  • kwargs (dict) – Custom default runtime parameters. You just need to specify the parameters that differ from DEFAULT_LANDMARKS_PARAMETERS.
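
The override behaviour of kwargs can be pictured with a plain dictionary merge. This is only a minimal sketch: ILLUSTRATIVE_DEFAULTS and resolve_parameters are hypothetical names, and the values are placeholders rather than the actual content of DEFAULT_LANDMARKS_PARAMETERS.

```python
# Hypothetical defaults: the keys echo parameters mentioned on this page,
# but the values are placeholders, not gismo's real defaults.
ILLUSTRATIVE_DEFAULTS = {"x_density": 1000, "balance": 0.5, "resolution": 0.7}


def resolve_parameters(defaults, **overrides):
    """Return a copy of defaults where only the given keys are replaced."""
    return {**defaults, **overrides}


# Only the parameter that differs from the defaults is specified.
params = resolve_parameters(ILLUSTRATIVE_DEFAULTS, x_density=1)
print(params)
```

The defaults dictionary itself is left untouched, so several instances can override different keys independently.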

Examples

Landmarks lean on a Gismo. We can use a toy Gismo to start with.

>>> corpus = Corpus(toy_source_text)
>>> embedding = Embedding()
>>> embedding.fit_transform(corpus)
>>> gismo = Gismo(corpus, embedding)
>>> print(toy_source_text) 
['Gizmo is a Mogwaï.',
'This is a sentence about Blade.',
'This is another sentence about Shadoks.',
'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.',
'In chinese folklore, a Mogwaï is a demon.']

Landmarks are constructed exactly like a Corpus object, with a source and a to_text function.

>>> landmarks_source = [{'name': 'Movies', 'content': 'Star Wars, Gremlins, and Blade are movies.'},
... {'name': 'Gremlins', 'content': 'The Gremlins movie features a Mogwai.'},
... {'name': 'Star Wars', 'content': 'The Star Wars movies feature Yoda.'},
... {'name': 'Shadoks', 'content': 'Shadoks is a French sarcastic show.'},]
>>> landmarks = Landmarks(landmarks_source, to_text=lambda e: e['content'])

The fit() method computes gismo queries for all landmarks and retains the results.

>>> landmarks.fit(gismo)

We run the request Yoda and look at the key landmarks. Note that Gremlins comes before Star Wars. This is actually correct in this small dataset: the word Yoda only exists in one sentence, which contains the words Gremlins and Gizmo.

>>> success = gismo.rank('yoda')
>>> landmarks.get_landmarks_by_rank(gismo) 
[{'name': 'Gremlins', 'content': 'The Gremlins movie features a Mogwai.'},
{'name': 'Star Wars', 'content': 'The Star Wars movies feature Yoda.'},
{'name': 'Movies', 'content': 'Star Wars, Gremlins, and Blade are movies.'}]

For better readability, we set the post_item attribute to return the name of a landmark item.

>>> landmarks.post_item = lambda lmk, i: lmk[i]['name']
>>> landmarks.get_landmarks_by_rank(gismo)
['Gremlins', 'Star Wars', 'Movies']

The balance parameter sets the trade-off between the document space and the feature space. A balance set to 1.0 focuses only on documents.

>>> success = gismo.rank('blade')
>>> landmarks.get_landmarks_by_rank(gismo, balance=1)
['Movies']

A balance set to 0.0 focuses only on features. For blade, this triggers Shadoks as a secondary result, because of the shared word "sentence".

>>> landmarks.get_landmarks_by_rank(gismo, balance=0)
['Movies', 'Shadoks']

Landmarks can be used to analyze landmarks.

>>> landmarks.get_landmarks_by_rank(landmarks)
['Gremlins', 'Star Wars']

See again how balance can change things. Here a balance set to 0.0 (using only features) fully changes the results.

>>> landmarks.get_landmarks_by_rank(landmarks, balance=0)
['Shadoks']

Like for Gismo, landmarks can provide clusters.

>>> success = gismo.rank('gizmo')
>>> landmarks.get_landmarks_by_cluster(gismo) 
{'landmark': {'name': 'Gremlins',
              'content': 'The Gremlins movie features a Mogwai.'},
              'focus': np.float64(0.999998...),
              'children': [{'landmark': {'name': 'Gremlins',
              'content': 'The Gremlins movie features a Mogwai.'},
              'focus': 1.0, 'children': []},
              {'landmark': {'name': 'Star Wars', 'content': 'The Star Wars movies feature Yoda.'},
              'focus': 1.0, 'children': []},
              {'landmark': {'name': 'Movies', 'content': 'Star Wars, Gremlins, and Blade are movies.'},
              'focus': 1.0, 'children': []}]}

We can set the post_cluster attribute to customize the output. Gismo provides a simple display.

>>> from gismo.post_processing import post_landmarks_cluster_print
>>> landmarks.post_cluster = post_landmarks_cluster_print
>>> landmarks.get_landmarks_by_cluster(gismo) 
F: 1.00.
- Gremlins
- Star Wars
- Movies

Like for Gismo, parameters like k, distortion, or resolution can be used.

>>> landmarks.get_landmarks_by_cluster(gismo, k=4, distortion=False, resolution=.9) 
F: 0.03.
- F: 0.93.
-- F: 1.00.
--- Gremlins
--- Star Wars
-- Movies
- Shadoks

Note that a Cluster can also be used as reference for the get_landmarks_by_rank() and get_landmarks_by_cluster() methods.

>>> cluster = landmarks.get_landmarks_by_cluster(gismo, post=False)
>>> landmarks.get_landmarks_by_rank(cluster)
['Gremlins', 'Star Wars', 'Movies']

However, not every object can serve as a reference. For example, a plain string cannot be used.

>>> landmarks.get_landmarks_by_rank("Landmarks do not use external queries (pass them to a gismo")  # doctest.ELLIPSIS
Traceback (most recent call last):
...
TypeError: bad operand type for unary -: 'NoneType'

Last but not least, landmarks can be used to reduce the size of a source or a Gismo. The reduction is controlled by the x_density parameter, which sets the number of documents that each landmark allows to be kept.

>>> landmarks.parameters.x_density = 1
>>> reduced_gismo = landmarks.get_reduced_gismo(gismo)
>>> reduced_gismo.corpus.source 
['This is another sentence about Shadoks.',
'This very long sentence, with a lot of stuff about Star Wars inside, makes at some point a side reference to the Gremlins movie by comparing Gizmo and Yoda.']
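
One way to picture this reduction is as the union of the best x_density documents of each landmark. The sketch below assumes exactly that; reduce_indices is a hypothetical helper and the scores are made up, so it should not be read as the library's implementation.

```python
def reduce_indices(x_vectors, x_density):
    """Union of the x_density best document indices of each landmark."""
    kept = set()
    for vector in x_vectors:
        # Sort document indices by decreasing score, keep the first x_density.
        best = sorted(vector, key=vector.get, reverse=True)[:x_density]
        kept.update(best)
    return sorted(kept)


# Made-up relevance scores (document index -> score) for four landmarks.
toy_x_vectors = [
    {3: 0.6, 1: 0.4},
    {3: 0.7, 0: 0.3},
    {3: 0.9, 4: 0.1},
    {2: 0.8, 1: 0.2},
]
kept = reduce_indices(toy_x_vectors, x_density=1)
print(kept)
```

With these toy scores, three landmarks agree on the same best document, so four landmarks keep only two documents in total.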

Side remark #1: in the constructor, to_text indicates how to convert an item to str, while rank (the ranking function) specifies how to run a query on a Gismo. Yet, it is possible to have the text conversion handled by the ranking function.

>>> landmarks = Landmarks(landmarks_source, rank=lambda g, q: g.rank(q['content']))
>>> landmarks.fit(gismo)
>>> success = gismo.rank('yoda')
>>> landmarks.post_item = lambda lmk, i: lmk[i]['name']
>>> landmarks.get_landmarks_by_rank(gismo)
['Star Wars', 'Movies', 'Gremlins']

However, this is bad practice. When you only need to customize the way an item is converted to text, you should stick to to_text. The ranking function (rank) is for more elaborate filters that require changing the default way gismo runs queries.

Side remark #2: if a landmark item query fails (its text does not intersect the gismo features), the default uniform projection will be used and a warning will be issued. This may yield undesired results.

>>> landmarks_source.append({'name': 'unrelated', 'content': 'unrelated.'})
>>> landmarks = Landmarks(landmarks_source, to_text=lambda e: e['content'])
>>> landmarks.fit(gismo)
>>> success = gismo.rank('gizmo')
>>> landmarks.post_item = lambda lmk, i: lmk[i]['name']
>>> landmarks.get_landmarks_by_rank(gismo)
['Shadoks', 'unrelated']

fit(gismo, **kwargs)

Runs gismo queries on all landmarks. The relevance results are used to build two sets of vectors: x_vectors holds the vectors on the document space; y_vectors holds the vectors on the feature space. On each space, the vectors are summed to build a direction, which is a sort of vector summary of the landmarks.
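
The summation step can be pictured with plain dictionaries standing in for sparse vectors. This is a sketch under that assumption: build_direction is a hypothetical helper, the weights are made up, and the real vectors are matrix rows rather than dictionaries.

```python
from collections import Counter


def build_direction(vectors):
    """Sum sparse vectors (dicts mapping index -> weight) into one direction."""
    direction = Counter()
    for vector in vectors:
        direction.update(vector)  # Counter.update adds weights index-wise
    return dict(direction)


# Toy per-landmark relevance over documents; the values are made up.
x_vectors = [{0: 0.5, 3: 0.5}, {3: 1.0}, {2: 0.25, 3: 0.75}]
x_direction = build_direction(x_vectors)
print(x_direction)
```

The same helper would apply unchanged to vectors on the other space, giving one direction per space.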

Parameters:
  • gismo (Gismo) – The Gismo on which vectors will be computed.

  • kwargs (dict) – Custom Landmarks runtime parameters.

Return type:

None

gismo.landmarks.get_direction(reference, balance)

Converts a reference object into an n+m direction (dense or sparse depending on reference type).

Parameters:
  • reference (Gismo or Landmarks or Cluster or np.ndarray or csr_matrix) – The object from which a direction will be extracted.

  • balance (float in range [0.0, 1.0]) – The trade-off between documents and features. Set to 0.0, only the feature space will be used. Set to 1.0, only the document space will be used.

Returns:

An n+m direction.

Return type:

np.ndarray or csr_matrix
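
The effect of balance on an n+m direction can be sketched with plain lists. The linear weighting below is an assumption made for illustration (this page does not spell out the formula), and combine is a hypothetical helper, not part of gismo.

```python
def combine(x_direction, y_direction, balance):
    """Concatenate a document part (n) and a feature part (m) into n+m values."""
    if not 0.0 <= balance <= 1.0:
        raise ValueError("balance must lie in [0.0, 1.0]")
    # Assumed weighting: balance on documents, (1 - balance) on features.
    weighted_x = [balance * x for x in x_direction]
    weighted_y = [(1.0 - balance) * y for y in y_direction]
    return weighted_x + weighted_y


x_part = [0.5, 0.5]       # n = 2 documents, made-up values
y_part = [1.0, 0.0, 0.0]  # m = 3 features, made-up values
only_documents = combine(x_part, y_part, balance=1.0)
only_features = combine(x_part, y_part, balance=0.0)
print(only_documents, only_features)
```

At the two extremes one of the two parts vanishes entirely, which matches the documented meaning of 0.0 and 1.0.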