HALTools#

GisMap exposes two helpers to compare what different databases know about a researcher and to spot duplicates inside a single database:

  • diff_sources(a, b) lists publications present in a but not in b, and vice versa. Useful when you want to manually update one database using entries from another.

  • find_duplicates(a) groups publications that look like the same paper inside a (e.g. an entry created by you and a near-duplicate created by a co-author).
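To make the idea of a two-way diff concrete, here is a minimal standalone sketch. It is not GisMap's implementation: it assumes publications are matched on exact equality of a normalized title, and `normalize` and `two_way_diff` are hypothetical helpers operating on toy title lists.

```python
import unicodedata


def normalize(title: str) -> str:
    """Casefold, strip accents, replace punctuation by spaces, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title.casefold())
    kept = (
        c if c.isalnum() else " "
        for c in decomposed
        if not unicodedata.combining(c)  # drop accent marks
    )
    return " ".join("".join(kept).split())


def two_way_diff(a: list[str], b: list[str]) -> tuple[list[str], list[str]]:
    """Return (titles only in a, titles only in b), matching on normalized titles."""
    keys_a = {normalize(t) for t in a}
    keys_b = {normalize(t) for t in b}
    only_a = [t for t in a if normalize(t) not in keys_b]
    only_b = [t for t in b if normalize(t) not in keys_a]
    return only_a, only_b


# Toy data (assumption): two short title lists standing in for two databases.
hal = ["Dynamic load balancing with tokens", "À la racine du parallélisme"]
ldb = [
    "Dynamic Load Balancing with Tokens.",
    "Networks of multi-server queues with parallel processing.",
]

only_hal, only_ldb = two_way_diff(hal, ldb)
print("Only in hal:", only_hal)
print("Only in ldb:", only_ldb)
```

Case and trailing punctuation do not count as differences here, so the two spellings of Dynamic load balancing with tokens are treated as the same paper.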

A frequent use is to audit the database you can actually edit by comparing it with one you cannot edit directly. For researchers based in France, the editable database is usually HAL, hence the nickname HALTools. The methods themselves are not HAL-specific.

Typical use cases:

  • Spot duplicates of the same publication that you and a co-author registered independently.

  • Verify that all your DBLP-indexed publications have a counterpart in HAL (mandatory for most French academic labs, easy to forget).

  • Reconcile what is found under which identifier for researchers with a partial HAL footprint (e.g. several pids and no unified HAL-ID).

The rest of this notebook walks through illustrated examples.

Note

You do not need a full LabMap to use HALTools — a LabAuthor is enough (technically SourcedAuthor would do, but LabAuthor is more convenient).

Important

HALTools surfaces entries to check, not entries to change. As the examples below show, many flagged items are perfectly fine and need no action.

Céline — a clean researcher#

We start with Céline Comte, whose HAL and DBLP profiles are well-maintained. The differences we find should be explainable, not actionable.

[1]:
from gismap.lab.lab_author import LabAuthor

celine = LabAuthor("Céline Comte")
celine.auto_sources()

Comparing HAL and LDB#

[2]:
print(celine.diff_sources("hal", "ldb"))
=== Only in hal (8) ===
"Modèle de couplage stochastique non-biparti" (2021, conference) - https://hal.science/hal-03219422v1
"Online Stochastic Matching: A Polytope Perspective" (2025, report) - https://hal.science/hal-03502084v6
"Un seul serveur vous manque, et tout est découplé !" (2018, conference) - https://hal.science/hal-01773674v1
"À la racine du parallélisme" (2017, conference) - https://hal.science/hal-01517150v1
"0 = 0, c'est le truc du noyau ! Application aux files d'attente" (2019, conference) - https://hal.science/hal-02118156v1
"Performance of a Server Cluster with Parallel Processing and Randomized Load Balancing" (2016, report) - https://hal.science/hal-01306343v1
"Rien ne sert de prédire ; il faut servir ancien." (2019, conference) - https://hal.science/hal-02118170v1
"La Grille de Kleinberg, l’Univers et le Reste" (2017, conference) - https://hal.science/hal-01517123v1
=== Only in ldb (3) ===
"Score-Aware Policy-Gradient Methods and Performance Guarantees using Local Lyapunov Conditions: Applications to Product-Form Stochastic Networks and Queueing Systems." (2023, journal) - https://doi.org/10.48550/ARXIV.2312.02804
"Networks of multi-server queues with parallel processing." (2016, journal) - http://arxiv.org/abs/1604.06763
"Stochastic dynamic matching: A mixed graph-theory and linear-algebra approach." (2021, journal) - https://arxiv.org/abs/2112.14457

Reading the output:

  • Most HAL-only papers are written in French, typically Algotel conference papers aimed at the French-speaking community. DBLP rarely indexes French-only papers (with a few exceptions like JIAF 2023). Nothing wrong here.

  • Networks of multi-server queues with parallel processing (LDB-only): likely an early research report later absorbed into a journal paper with a different title.

  • Online Stochastic Matching: A Polytope Perspective (HAL) and Stochastic dynamic matching: A mixed graph-theory and linear-algebra approach (LDB) are actually the same article. HAL stores the latest version (newer date and title), DBLP/LDB keeps the original. Merging these would be a hard feature for a small payoff — don’t expect it soon.

  • Score-Aware Policy-Gradient Methods… (LDB-only): same pattern, a report later published in a journal under a slightly different title. Could be merged with a more lenient title-similarity threshold, but that knob is not exposed in the public API yet.
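The effect of a title-similarity threshold can be illustrated with the standard library. This is only a sketch: it assumes a `difflib.SequenceMatcher` ratio as the similarity measure, which is not necessarily what GisMap uses, and `clean`, `title_similarity`, `same_paper` and the 0.9 default are hypothetical.

```python
from difflib import SequenceMatcher


def clean(title: str) -> str:
    """Lowercase a title and drop the trailing period seen on LDB titles."""
    return title.casefold().strip().rstrip(".")


def title_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two cleaned titles."""
    return SequenceMatcher(None, clean(a), clean(b)).ratio()


def same_paper(a: str, b: str, threshold: float = 0.9) -> bool:
    """Treat two titles as the same paper when their similarity reaches the threshold."""
    return title_similarity(a, b) >= threshold


# A report and its conference version: a single word differs.
near = (
    "On the Manipulability of Voting Systems: Application to Multi-Operator Networks",
    "On the Manipulability of Voting Systems: Application to Multi-Carrier Networks",
)
# The same article, retitled between versions.
far = (
    "Online Stochastic Matching: A Polytope Perspective",
    "Stochastic dynamic matching: A mixed graph-theory and linear-algebra approach.",
)

for a, b in (near, far):
    print(f"{title_similarity(a, b):.2f}", same_paper(a, b))
```

The first pair clears a strict threshold while the retitled pair does not. A threshold low enough to catch the second pair would presumably also merge unrelated papers, which is why such a knob needs care.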

Bottom line: everything is fine.

Duplicates inside HAL#

[3]:
print(celine.find_duplicates("hal"))
=== Duplicates in hal (2 groups) ===
  Group 1:
  "À la racine du parallélisme" (2017, conference) - https://hal.science/hal-01517150v1
  "À la racine du parallélisme" (2017, report) - https://inria.hal.science/hal-01476889v3
  Group 2:
  "Dynamic load balancing with tokens" (2019, journal) - https://hal.science/hal-02340255v1
  "Dynamic Load Balancing with Tokens" (2018, conference) - https://hal.science/hal-01758912v2

Both groups are report-to-conference or conference-to-journal lifecycles — no actual issue. GisMap merges these duplicates automatically when building a map, so no action is needed.

François — a messier real-world case#

François Durand has a different history, with French papers, posters, and a few real gaps. We pin his HAL identifier explicitly to avoid homonyms.

[4]:
francois = LabAuthor("François Durand (hal:fradurand)")
francois.auto_sources()

Spotting actual gaps in HAL#

[5]:
print(francois.diff_sources("hal", "ldb"))
=== Only in hal (12) ===
"Coupure enharmonique, complétude et applications" (2012, report) - https://hal.science/hal-00694023v1
"Voter Autrement 2017 for the French Presidential Election — The data of the In Situ Experiments" (2020, report) - https://hal.science/hal-03223803v1
"Reducing Manipulability" (2014, poster) - https://hal.science/hal-01095992v1
"Trier des cochons sauvages" (2023, conference) - https://hal.science/hal-04077471v1
"Élection du Best Paper AlgoTel 2012 : étude de la manipulabilité" (2014, conference) - https://inria.hal.science/hal-00986060v1
"Démocratie à géométrie variable (à l'usage des algorithmes)" (2021, conference) - https://hal.science/hal-03213987v1
"Making a voting system depend only on orders of preference reduces its manipulability rate" (2014, report) - https://inria.hal.science/hal-01009136v1
"Shannon, Turing and Hats: Information Theory Incompleteness" (2017, conference) - https://inria.hal.science/hal-01675019v1
"Vers des modes de scrutin moins manipulables" (2015, thesis) - https://inria.hal.science/tel-01242440v2
"Sortus Interruptus : Trier des phoques en O(n log(n))" (2025, conference) - https://hal.science/hal-05003794v1
"Coalitional manipulation of voting rules: simulations on empirical data" (2023, journal) - https://hal.science/hal-05573536v1
"Élection d'un chemin dans un réseau : étude de la manipulabilité" (2014, conference) - https://inria.hal.science/hal-00986050v1
=== Only in ldb (6) ===
"Learning-Based Multiuser Scheduling in Mimo-Ofdm Systems With Hybrid Beamforming." (2025, conference) - https://doi.org/10.1109/EUCNC/6GSUMMIT63408.2025.11037174
"Post-Computing Analog Beams after User Selection in a Hybrid Beamforming System." (2024, conference) - https://doi.org/10.1109/EUCNC/6GSUMMIT60053.2024.10597109
"L'abus de comparaisons est mauvais pour la santé." (2023, conference) - https://hal.science/hal-04209856v1/document
"Detection of Horse Locomotion Modifications Due to Training with Inertial Measurement Units: A Proof-of-Concept." (2022, journal) - https://doi.org/10.3390/S22134981
"Probability of a Condorcet Winner for Large Electorates: An Analytic Combinatorics Approach." (2025, journal) - https://doi.org/10.48550/ARXIV.2505.06028
"Sorting wild pigs." (2023, journal) - https://doi.org/10.48550/ARXIV.2304.11952

Only in HAL — all explainable:

  • French papers (DBLP rarely indexes them), including the PhD thesis (written in French).

  • Items without proper proceedings (posters, Voter Autrement dataset reports).

  • Coalitional manipulation of voting rules… — published in an economics venue outside DBLP’s scope.

Only in LDB — mixed bag:

  • The two beamforming papers and the Condorcet paper should be in HAL but were forgotten. These are real action items.

  • L’abus de comparaisons est mauvais pour la santé: the HAL entry is the whole proceedings of the conference, with the PC chairs as authors. DBLP somehow extracted the individual papers from HAL even though HAL itself does not split them.

  • Detection of Horse Locomotion…: a rare DBLP confusion. The pid 38/11269 is supposed to point to a single person, but this particular paper was authored by a homonym. The only fix is to email DBLP directly (dblp@dagstuhl.de); response time can be slow.

  • Sorting wild pigs / Trier des cochons sauvages: one paper, listed in HAL under its French title and in LDB under its English one. It was OK to file it in French because it carries an English title and abstract. Nothing to patch.
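One quick triage for an "Only in ldb" list is to look at where each URL points: an entry whose link is hosted on hal.science already exists in HAL in some form, as with L'abus de comparaisons. A minimal sketch, assuming the report line shape printed above; `parse_entry` and `hosted_on_hal` are hypothetical helpers, not part of GisMap.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Line shape copied from the printed report: "title" (year, kind) - url
ENTRY = re.compile(
    r'^"(?P<title>.*)" \((?P<year>\d{4}), (?P<kind>[^)]+)\) - (?P<url>\S+)$'
)


def parse_entry(line: str) -> Optional[dict]:
    """Split one report line into title, year, kind and url (None if it does not match)."""
    match = ENTRY.match(line.strip())
    return match.groupdict() if match else None


def hosted_on_hal(url: str) -> bool:
    """True when the URL host is hal.science or one of its portals (e.g. inria.hal.science)."""
    host = urlparse(url).hostname or ""
    return host == "hal.science" or host.endswith(".hal.science")


# Two lines taken from the "Only in ldb" section of the report.
ldb_only = [
    '"L\'abus de comparaisons est mauvais pour la santé." (2023, conference)'
    " - https://hal.science/hal-04209856v1/document",
    '"Sorting wild pigs." (2023, journal) - https://doi.org/10.48550/ARXIV.2304.11952',
]

flagged = []
for line in ldb_only:
    entry = parse_entry(line)
    if entry and hosted_on_hal(entry["url"]):
        flagged.append(entry["title"])

print("Already on HAL in some form:", flagged)
```

Entries flagged this way still need a human look: here the HAL record is the whole proceedings, not the individual paper.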

Duplicates: same paper, different lifecycle stages#

[6]:
print(francois.find_duplicates("hal"))
=== Duplicates in hal (5 groups) ===
  Group 1:
  "Voter Autrement 2017 for the French Presidential Election — The data of the In Situ Experiments" (2020, report) - https://hal.science/hal-03223803v1
  "Voter Autrement 2017 for the French Presidential Election" (2019, report) - https://shs.hal.science/halshs-02379941v1
  "Voter Autrement 2017 - Online Experiment" (2018, report) - https://hal.science/hal-03223762v1
  Group 2:
  "Reducing Manipulability" (2014, poster) - https://hal.science/hal-01095992v1
  "Making most voting systems meet the Condorcet criterion reduces their manipulability" (2014, report) - https://inria.hal.science/hal-01009134v1
  Group 3:
  "Geometry on the Utility Space" (2015, conference) - https://inria.hal.science/hal-01222871v1
  "Geometry on the Utility Space" (2014, conference) - https://hal.science/hal-01096018v1
  Group 4:
  "SVVAMP: Simulator of Various Voting Algorithms in Manipulating Populations" (2016, conference) - https://hal.science/hal-01369835v1
  "SVVAMP: Simulator of Various Voting Algorithms in Manipulating Populations" (2015, report) - https://hal.science/hal-01135109v1
  Group 5:
  "On the Manipulability of Voting Systems: Application to Multi-Operator Networks" (2013, conference) - https://hal.science/hal-00874096v1
  "On the Manipulability of Voting Systems: Application to Multi-Carrier Networks" (2012, report) - https://inria.hal.science/hal-00692096v1

All five groups are benign:

  • Group 1 — three distinct documents around the same Voter Autrement 2017 experiment.

  • Group 2 — a poster summarizing a longer report.

  • Group 3 — two evolutions of the same paper.

  • Group 4 — short report vs. short announcement of the same tool.

  • Group 5 — report-to-conference lifecycle.

Fabien — many duplicates, one real issue#

On older HAL profiles such as Fabien Mathieu's, report → conference → journal lifecycles have had time to accumulate. The interesting question is whether a real duplicate hides among the noise.

[7]:
fabien = LabAuthor("Fabien Mathieu")
fabien.auto_sources()

Focus on duplicates:

[8]:
print(fabien.find_duplicates("hal"))
=== Duplicates in hal (14 groups) ===
  Group 1:
  "LiveRank: How to Refresh Old Datasets" (2015, journal) - https://inria.hal.science/hal-01251552v1
  "LiveRank: How to Refresh Old Crawls" (2014, conference) - https://inria.hal.science/hal-01093188v1
  Group 2:
  "Geometry on the Utility Space" (2015, conference) - https://inria.hal.science/hal-01222871v1
  "Geometry on the Utility Space" (2014, conference) - https://hal.science/hal-01096018v1
  Group 3:
  "Stratification in P2P networks, Application to BitTorrent" (2007, conference) - https://hal.science/hal-00159663v1
  "Stratification in P2P Networks - Application to BitTorrent" (2006, report) - https://inria.hal.science/inria-00121974v2
  Group 4:
  "SVVAMP: Simulator of Various Voting Algorithms in Manipulating Populations" (2016, conference) - https://hal.science/hal-01369835v1
  "SVVAMP: Simulator of Various Voting Algorithms in Manipulating Populations" (2015, report) - https://hal.science/hal-01135109v1
  Group 5:
  "Upper bounds for stabilization in acyclic preference-based systems" (2007, conference) - https://inria.hal.science/hal-00668356v1
  "Acyclic Preference-Based Systems" (2010, chapter) - https://inria.hal.science/hal-00667351v1
  Group 6:
  "On Using Matching Theory to Understand P2P Network Design" (2007, conference) - https://hal.science/hal-00159678v1
  "On Using Matching Theory to Understand P2P Network Design" (2006, report) - https://inria.hal.science/inria-00121604v2
  Group 7:
  "Acyclic Preference Systems in P2P Networks" (2007, conference) - https://inria.hal.science/inria-00471720v1
  "Acyclic Preference Systems in P2P Networks" (2007, report) - https://inria.hal.science/inria-00143790v2
  Group 8:
  "Local Aspects of the Global Ranking of Web Pages" (2006, conference) - https://inria.hal.science/inria-00160799v1
  "Local Aspects of the Global Ranking of Web Pages" (2004, report) - https://inria.hal.science/inria-00070800v1
  Group 9:
  "On resource aware algorithms in epidemic live streaming" (2010, conference) - https://inria.hal.science/hal-00668321v1
  "On Resource Aware Algorithms in Epidemic Live Streaming" (2009, report) - https://inria.hal.science/inria-00414706v1
  Group 10:
  "The Stable Configuration of Acyclic Preference-Based Systems" (2009, conference) - https://inria.hal.science/hal-00668292v1
  "The stable configuration in acyclic preference-based systems" (2008, report) - https://inria.hal.science/inria-00318621v1
  Group 11:
  "Reducing Manipulability" (2014, poster) - https://hal.science/hal-01095992v1
  "Making most voting systems meet the Condorcet criterion reduces their manipulability" (2014, report) - https://inria.hal.science/hal-01009134v1
  Group 12:
  "Deciding and verifying network properties locally with few output bits" (2020, journal) - https://inria.hal.science/hal-03100213v1
  "Deciding and verifying network properties locally with few output bits" (2020, journal) - https://hal.science/hal-02285014v1
  Group 13:
  "À la racine du parallélisme" (2017, conference) - https://hal.science/hal-01517150v1
  "À la racine du parallélisme" (2017, report) - https://inria.hal.science/hal-01476889v3
  Group 14:
  "On the Manipulability of Voting Systems: Application to Multi-Operator Networks" (2013, conference) - https://hal.science/hal-00874096v1
  "On the Manipulability of Voting Systems: Application to Multi-Carrier Networks" (2012, report) - https://inria.hal.science/hal-00692096v1

All groups except Group 12 follow the usual lifecycle pattern (report → conference → journal/chapter) and need no action.

Group 12 is the real issue: Deciding and verifying network properties locally with few output bits exists twice in HAL because two co-authors entered it independently. The fix is to ask HAL support to merge the two entries.
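What sets the Deciding and verifying… group apart in the output is that both entries share the same year and the same type, whereas lifecycle groups differ in at least one of the two. The sketch below turns that observation into a rough filter. It is a heuristic of our own, not a GisMap feature; `Entry` and `looks_like_true_duplicate` are hypothetical, and the sample groups are copied from the report above.

```python
from collections import namedtuple

Entry = namedtuple("Entry", "title year kind url")


def looks_like_true_duplicate(group) -> bool:
    """True when every entry of the group has the same year and the same type."""
    return len({(e.year, e.kind) for e in group}) == 1


# Three of the groups printed above, keyed by their group number.
groups = {
    7: [
        Entry("Acyclic Preference Systems in P2P Networks", 2007, "conference",
              "https://inria.hal.science/inria-00471720v1"),
        Entry("Acyclic Preference Systems in P2P Networks", 2007, "report",
              "https://inria.hal.science/inria-00143790v2"),
    ],
    12: [
        Entry("Deciding and verifying network properties locally with few output bits",
              2020, "journal", "https://inria.hal.science/hal-03100213v1"),
        Entry("Deciding and verifying network properties locally with few output bits",
              2020, "journal", "https://hal.science/hal-02285014v1"),
    ],
    14: [
        Entry("On the Manipulability of Voting Systems: Application to Multi-Operator Networks",
              2013, "conference", "https://hal.science/hal-00874096v1"),
        Entry("On the Manipulability of Voting Systems: Application to Multi-Carrier Networks",
              2012, "report", "https://inria.hal.science/hal-00692096v1"),
    ],
}

suspicious = [number for number, group in groups.items() if looks_like_true_duplicate(group)]
print("Groups worth a closer look:", suspicious)
```

Applied by hand to the 14 groups listed above, this filter keeps only the Deciding and verifying… group. It remains a hint: a legitimate pair can share a year and a type, so each hit still deserves a manual check.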

Élie — when the HAL pid is too strict#

Élie de Panafieu has a small HAL footprint and a more idiosyncratic profile. Default settings only catch part of his work, and HALTools is exactly the tool to diagnose what is missing and why.

By default, GisMap picks the most specific HAL identifier it can find:

[9]:
elie = LabAuthor("Élie de Panafieu")
elie.auto_sources("hal")
elie.sources
[9]:
[HALAuthor(name='Élie de Panafieu', key='1319887', key_type='pid')]
[10]:
print("\n".join(str(p) for p in elie.get_publications().values()))

The birth of the strong components, by Sergey Dovgal, Élie de Panafieu, Dimbinaina Ralaivaosaona, Vonjy Rasendrahasina, and Stephan Wagner. In Random Structures and Algorithms [journal], 2023.

Only one publication. The HAL pid is unambiguous but too narrow: many of Élie’s HAL entries are not attached to that pid. We can confirm this by comparing pid-based and fullname-based searches:

[11]:
elie = LabAuthor("Élie de Panafieu (hal:1319887, hal:fullname)")
print(elie.diff_sources(0, 1))
=== Only in hal (0) (0) ===
=== Only in hal (1) (5) ===
"Enumeration and structure of inhomogeneous graphs" (2015, conference) - https://hal.science/hal-01337762v1
"Counting connected graphs with large excess" (2016, conference) - https://hal.science/hal-02166341v1
"Complexity Estimates for Two Uncoupling Algorithms" (2013, conference) - https://inria.hal.science/hal-00780010v2
"Of Kernels and Queues: when network calculus meets analytic combinatorics" (2018, conference) - https://hal.science/hal-01889101v1
"0 = 0, c'est le truc du noyau ! Application aux files d'attente" (2019, conference) - https://hal.science/hal-02118156v1

The fullname search picks up five more HAL entries that the pid misses. The fix is to switch to fullname (the hal:fullname flag is documented in the FAQ):

[12]:
elie = LabAuthor("Élie de Panafieu (hal:fullname)")
elie.auto_sources()

Comparing HAL (now in fullname mode) against LDB:

[13]:
print(elie.diff_sources("hal", "ldb"))
=== Only in hal (3) ===
"Counting connected graphs with large excess" (2016, conference) - https://hal.science/hal-02166341v1
"Enumeration and structure of inhomogeneous graphs" (2015, conference) - https://hal.science/hal-01337762v1
"0 = 0, c'est le truc du noyau ! Application aux files d'attente" (2019, conference) - https://hal.science/hal-02118156v1
=== Only in ldb (17) ===
"torus packing for multisets." (2024, misc) - https://doi.org/10.4230/ARTIFACTS.22479
"Tree Walks and the Spectrum of Random Graphs." (2024, conference) - https://doi.org/10.4230/LIPICS.AOFA.2024.11
"Combinatorics of nondeterministic walks of the Dyck and Motzkin type." (2019, conference) - https://doi.org/10.1137/1.9781611975505.1
"Graphs with degree constraints." (2016, conference) - https://doi.org/10.1137/1.9781611974324.4
"Expressive Key-Policy Attribute-Based Encryption with Constant-Size Ciphertexts." (2011, conference) - https://doi.org/10.1007/978-3-642-19379-8_6
"Phase transition of random non-uniform hypergraphs." (2015, journal) - https://doi.org/10.1016/J.JDA.2015.01.009
"2-Xor Revisited: Satisfiability and Probabilities of Functions." (2016, journal) - https://doi.org/10.1007/S00453-016-0119-X
"Active clustering for labeling training data." (2021, conference) - https://proceedings.neurips.cc/paper/2021/hash/47841cc9e552bd5c40164db7073b817b-Abstract.html
"Robot Positioning Using Torus Packing for Multisets." (2024, conference) - https://doi.org/10.4230/LIPICS.ICALP.2024.43
"Threshold functions for small subgraphs in simple graphs and multigraphs." (2020, journal) - https://doi.org/10.1016/J.EJC.2020.103113
"Analytic description of the phase transition of inhomogeneous multigraphs." (2015, journal) - https://doi.org/10.1016/J.EJC.2015.02.020
"Attribute-based encryption schemes with constant-size ciphertexts." (2012, journal) - https://doi.org/10.1016/J.TCS.2011.12.004
"Exact enumeration of satisfiable 2-SAT formulae." (2023, journal) - https://doi.org/10.5070/C63261985
"Threshold functions for small subgraphs: an analytic approach." (2017, journal) - https://doi.org/10.1016/J.ENDM.2017.06.048
"Analytic combinatorics of connected graphs." (2019, journal) - https://doi.org/10.1002/RSA.20836
"Probability of a Condorcet Winner for Large Electorates: An Analytic Combinatorics Approach." (2025, journal) - https://doi.org/10.48550/ARXIV.2505.06028
"Counting directed acyclic and elementary digraphs." (2020, journal) - https://arxiv.org/abs/2001.08659

Élie’s exposure is mostly on DBLP/LDB — and that is fine. HAL deposit is only mandatory for researchers affiliated with a French academic institution; the LDB-only entries here are not action items.

When to act#

Across the four examples, only a handful of items called for action:

  • Missing HAL deposits (François: beamforming and Condorcet papers).

  • Genuine HAL duplicates entered by two co-authors (Fabien: Deciding and verifying network properties…).

  • Wrong DBLP attribution to a homonym (François: Horse Locomotion).

  • Identifier strategy (Élie: switch from pid to fullname).

Everything else — lifecycle duplicates, French-only papers, reports absorbed into journals — is normal database life that GisMap already handles when building a map.