A .NET 9 C# console application to detect and resolve duplicate BP identifiers in your idp.data/Biblio XML corpus. It:
- Discovers all XML files under your idp.data/Biblio archive (recursively).
- Extracts each file’s BP idno and other metadata via XMLEntryGatherer.
- Sorts & finds duplicates: any two files sharing the same BP idno, where one file has title[@level="a"] and the other has title[@level="m"]. The XSLT that drove the creation of PN Biblio created two files from a single BP fiche (an article/chapter and the book that contains it), which is the origin of the duplicates.
- Prompts you (via XmlComparerUI) to choose which of the two duplicate entries should have its seg[@resp="#BP"] or note[@resp="#BP"] elements removed.
- Deletes the selected segments in the chosen file, saves the updated XML, and logs every action.
The BPRemovingDuplicateIdnos project folder should be a sibling of idp.data in the local directory.
├── idp.data/
│   └── Biblio/
└── BPRemovingDuplicateIdnos/            ← C# console project
    ├── bin/
    ├── obj/
    ├── BPDataEntry.cs
    ├── BPEntryGatherer.cs
    ├── BPRemovingDuplicateIdnos.sln     ← Visual Studio solution
    ├── BPRemovingDuplicateIdnos.csproj
    ├── Program.cs                       ← entry point & orchestration
    ├── Logger.cs                        ← simple file‑and‑console logger
    ├── XMLEntryGatherer.cs              ← gathers & parses each TEI file
    ├── XMLDataEntry.cs                  ← model for parsed TEI fields
    ├── XmlComparerUI.cs                 ← console UI for duplicate resolution
    └── … (other helpers)
The tool locates idp.data by walking up from your current directory, then finds the first subdirectory whose name contains “Biblio.”
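The lookup described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class and method names (BiblioLocator, FindBiblio) are hypothetical, and the ordinal sort used to make "first" deterministic is an assumption.

```csharp
#nullable enable
using System;
using System.IO;
using System.Linq;

// Hypothetical helper; the real SetXMLFilepath() may be structured differently.
public static class BiblioLocator
{
    // Walks upward from startDir until some ancestor contains an "idp.data" folder,
    // then returns the first subdirectory of idp.data whose name contains "Biblio".
    // Returns null if no idp.data folder or no matching subdirectory is found.
    public static string? FindBiblio(string startDir)
    {
        var dir = new DirectoryInfo(startDir);
        while (dir != null)
        {
            var idpData = Path.Combine(dir.FullName, "idp.data");
            if (Directory.Exists(idpData))
            {
                // Assumption: sort by name so "first" does not depend on the file system.
                return Directory.EnumerateDirectories(idpData)
                    .OrderBy(p => p, StringComparer.Ordinal)
                    .FirstOrDefault(p => Path.GetFileName(p).Contains("Biblio"));
            }
            dir = dir.Parent; // null once the file system root has been passed
        }
        return null;
    }
}
```

Calling BiblioLocator.FindBiblio(Directory.GetCurrentDirectory()) from the project folder would, under the sibling layout shown above, return the path to idp.data/Biblio.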
- .NET 9 SDK: download & install from https://aka.ms/dotnet-download
- Clone & restore:
    git clone https://github.com/halosm1th/BPRemovingDuplicateIdnos.git
    cd BPRemovingDuplicateIdnos/BPRemovingDuplicateIdnos
    dotnet restore
From within the BPRemovingDuplicateIdnos project folder (where Program.cs lives):
    dotnet run
You’ll see console output:
- Current directory and location of idp.data & Biblio.
- List of duplicate pairs:
    Found match in 1932‑BP1234 and 1932‑BP1234
- Interactive prompt for each pair:
    Choose file to DELETE segments from:
    [A] 1932‑BP1234.xml (“a” title level)
    [M] 1932‑BP1234.xml (“m” title level)
    > M
- Deletes all <seg> and <note> elements of the losing file, saves it, and logs the change.
- Log file written by Logger (timestamped in working directory).
- SetXMLFilepath()
  - Starts in your CWD, walks upward to find an idp.data directory.
  - Within that, finds the first subfolder containing “Biblio.”
- XMLEntryGatherer
  - Recursively reads every .xml file.
  - Constructs an XMLDataEntry holding BP number (<idno type="bp">), “title level” (a vs. m), and full path.
- Duplicate detection
  - Sorts entries by numeric BP value.
  - Any adjacent entries with equal BP are duplicates.
- XmlComparerUI
  - Presents both entries’ key fields for side‑by‑side comparison.
  - Returns the entry whose segments should be purged.
- Deletion logic
  - In DeleteSegsOnFile(), loads the losing file as XmlDocument, finds all <seg> or <note> nodes of various @subtype and removes them.
  - Optionally prompts before removing illustrations.
  - Saves the updated XML in place.
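The duplicate detection and deletion steps above can be sketched as follows. This is an illustration under stated assumptions rather than the project's implementation: Entry is a hypothetical stand-in for XMLDataEntry, the BP number is assumed to parse to an int, and the removal matches on @resp="#BP" (as in the overview) instead of on the @subtype values that DeleteSegsOnFile() checks.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml;

// Hypothetical stand-in for XMLDataEntry: BP number, title level ("a" or "m"), file path.
public record Entry(int Bp, string Level, string FilePath);

public static class DuplicateSketch
{
    // Sorts by BP number, then pairs adjacent entries that share a BP number
    // and differ in title level (one "a", one "m").
    public static List<(Entry A, Entry M)> FindDuplicates(IEnumerable<Entry> entries)
    {
        var sorted = entries.OrderBy(e => e.Bp).ToList();
        var pairs = new List<(Entry A, Entry M)>();
        for (int i = 0; i + 1 < sorted.Count; i++)
        {
            var x = sorted[i];
            var y = sorted[i + 1];
            if (x.Bp != y.Bp) continue;
            if (x.Level == "a" && y.Level == "m") pairs.Add((x, y));
            else if (x.Level == "m" && y.Level == "a") pairs.Add((y, x));
        }
        return pairs;
    }

    // Removes every <seg resp="#BP"> and <note resp="#BP"> element and returns
    // how many were removed. local-name() is used so that the TEI namespace
    // does not have to be registered with an XmlNamespaceManager.
    public static int RemoveBpSegments(XmlDocument doc)
    {
        var nodes = doc.SelectNodes(
            "//*[(local-name()='seg' or local-name()='note') and @resp='#BP']");
        if (nodes == null) return 0;
        // Snapshot the matches before mutating the tree.
        var toRemove = nodes.Cast<XmlNode>().ToList();
        foreach (var node in toRemove)
            node.ParentNode?.RemoveChild(node);
        return toRemove.Count;
    }
}
```

To apply the removal to a file in place, one would load it with XmlDocument.Load(path), call RemoveBpSegments, and write it back with XmlDocument.Save(path). Setting PreserveWhitespace = true before loading should keep the original layout largely intact.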
- “Could not find idp.data directory”
  - Ensure you run dotnet run from a directory that is sibling to the idp.data folder.
- No duplicates found
  - All BP numbers are unique, so no action is needed.
Created with the help of ChatGPT.