Merge pull request #2217 from ipfs/cid-clarifications

mishmosh · web-flow · commit be4fa47d577e · 2025-12-15T11:43:58.000-05:00
Add caveat about CID determinism
diff --git a/.github/styles/README.md b/.github/styles/README.md
@@ -0,0 +1,25 @@
+# Vale Styles
+
+This directory contains Vale linting configuration for ipfs-docs.
+
+## Spelling Rules
+
+There are two spelling systems:
+
+1. **`Vocab/ipfs-docs-vocab/accept.txt`** - General Vale vocabulary
+2. **`pln-ignore.txt`** - Custom ignore file for `docs/PLNSpelling.yml`
+
+### Fixing PLNSpelling Errors
+
+When CI fails with `[docs.PLNSpelling] Did you really mean 'word'?`:
+
+1. Add the word to **`pln-ignore.txt`** (lowercase)
+2. Do NOT add it to `Vocab/ipfs-docs-vocab/accept.txt`; that file is for other Vale rules
+
+The `PLNSpelling.yml` rule explicitly references `pln-ignore.txt`:
+
+```yaml
+extends: spelling
+ignore:
+  - pln-ignore.txt
+```
diff --git a/.github/styles/pln-ignore.txt b/.github/styles/pln-ignore.txt
@@ -27,6 +27,7 @@ boolean
 Bootstrappers
 boxo
 browserify
+buzhash
 caddy
 Caddyfile
 callout
@@ -41,6 +42,7 @@ clis
 cmds
 cnames
 codec
+codecs
 codecov
 coinlist
 composable
@@ -56,6 +58,7 @@ dapps
 data('s)
 datastore
 deduplicate
+deduplication
 Denylist
 denylist
 dep
@@ -200,6 +203,7 @@ philz
 pinset
 pipeable
 plaintext
+PLNSpelling
 pluggable
 powergate
 powershell
diff --git a/docs/concepts/content-addressing.md b/docs/concepts/content-addressing.md
@@ -87,6 +87,47 @@ shasum: WARNING: 1 computed checksum did NOT match
 
 As we can see, the hash included in the CID does NOT match the hash of the input file `ubuntu-20.04.1-desktop-amd64.iso`.
 
+### Why the hashes differ
+
+The example above shows that the [Multihash](glossary.md#multihash) inside a CID does not match a simple file checksum. This is because the Multihash is the hash of the [root block](glossary.md#root), not a direct hash of the file's bytes.
+
+When you add a file to IPFS, the data goes through several transformations:
+
+1. **Chunking**: Large files are split into smaller [blocks](glossary.md#block) (typically 256KiB-1MiB each)
+2. **Structuring**: These blocks are organized into a [DAG](glossary.md#dag) (directed acyclic graph)
+3. **Encoding**: A [codec](glossary.md#codec) wraps the data with metadata describing its structure
+
+The root block contains links to all the other blocks, and it's this root block that gets hashed to produce the Multihash in your CID.
+
+#### When CID hash equals file hash
+
+There is one case where the Multihash does equal the file's hash: when the CID uses the `raw` [codec](glossary.md#codec) and the file fits in a single block. The `raw` codec stores bytes without any wrapper, so for small files added with `--raw-leaves`, the Multihash is a direct hash of the file contents.
+
+#### Same file, different CIDs
+
+Two identical files can produce different CIDs. The CID depends on both the content *and* how that content is structured:
+
+- **Chunk size**: Different chunking strategies produce different block trees
+- **DAG layout**: Balanced trees vs. trickle DAGs organize blocks differently
+- **Codec**: [UnixFS](glossary.md#unixfs) ([dag-pb](glossary.md#dag-pb)), [dag-cbor](glossary.md#dag-cbor), `raw`, and others each encode data differently
+- **CID version**: [CIDv0](glossary.md#cid-v0) vs [CIDv1](glossary.md#cid-v1) use different formats
+- **Hash algorithm**: sha2-256, blake3, and others produce different hashes
+
+#### Why this flexibility matters
+
+This is a feature, not a limitation. Different structures optimize for different use cases:
+
+- **DAG layout** trades off seeking against appending: balanced DAGs enable fast random access in large files like videos, while trickle DAGs optimize for sequential, append-only data like logs
+- **Chunking strategy** balances retrieval overhead against sync efficiency: large chunks mean fewer blocks for bulk downloads, while small chunks mean less data to transfer when syncing deltas. Strategies range from simple fixed-size chunking to content-defined algorithms like Rabin or Buzhash that fine-tune deduplication based on dataset characteristics
+- **Hash function** varies by system: legacy decisions, regulatory requirements, or interoperability needs may dictate which algorithm to use
+- **Directory sharding** threshold, in systems like [UnixFS](glossary.md#unixfs), determines when directories switch from flat listings to [HAMT](glossary.md#hamt-sharding) to seamlessly support huge directories with millions of files. This threshold also affects how much of the DAG needs to be recreated when a single file in the directory is modified
+
+[UnixFS](glossary.md#unixfs) is the default format for files and directories, but you can use other codecs or create custom ones for specialized needs.
+
+When you need reproducible CIDs across different tools, the community documents common parameter sets called [CID profiles](https://github.com/ipfs/specs/pull/499). These define standard combinations of chunking, DAG layout, and codec settings.
+
+To explore how a CID is structured, use the [CID Inspector](https://cid.ipfs.tech/#bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4). To see the DAG behind a CID, use the [DAG Explorer](https://explore.ipld.io/#/explore/bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4).
+
 
 ## CID versions
 
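Note on the `content-addressing.md` hunk above: the single-block `raw` case from "When CID hash equals file hash" and the hash-algorithm bullet from "Same file, different CIDs" can be checked by hand. The following is a minimal Python sketch and not part of this PR. The function name `raw_cidv1` is hypothetical, and the sketch assumes a CIDv1 with the `raw` codec, a file that fits in a single block, and multibase base32 encoding. Under those assumptions the CID is a short prefix followed by the plain digest of the file bytes, and switching the hash function changes the CID for the same bytes.

```python
import base64
import hashlib

# Multihash codes and digest lengths (both fit in a single varint byte).
HASHES = {
    "sha2-256": (0x12, hashlib.sha256),
    "sha2-512": (0x13, hashlib.sha512),
}

def raw_cidv1(data: bytes, hash_name: str = "sha2-256") -> str:
    """Build a CIDv1 string for one raw block (hypothetical helper)."""
    code, hasher = HASHES[hash_name]
    digest = hasher(data).digest()
    # 0x01 = CID version 1, 0x55 = raw codec, then the multihash:
    # hash function code, digest length, digest bytes.
    cid_bytes = bytes([0x01, 0x55, code, len(digest)]) + digest
    # Multibase base32: "b" prefix, lowercase RFC 4648 alphabet, no padding.
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + encoded

if __name__ == "__main__":
    content = b"hello ipfs\n"
    print(raw_cidv1(content, "sha2-256"))
    print(raw_cidv1(content, "sha2-512"))  # same bytes, different CID
```

For a file that fits in one block, the sha2-256 result should match what a tool reports when it is configured for CIDv1 with raw leaves. Larger files are chunked into a DAG, so this shortcut no longer applies to them.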
diff --git a/docs/concepts/how-ipfs-works.md b/docs/concepts/how-ipfs-works.md
@@ -45,13 +45,13 @@ IPFS represents data as content-addressed <VueCustomTooltip label="The term for
 
 In IPFS, data is chunked into <VueCustomTooltip label="The term for a single unit of data in IPFS." underlined multiline is-medium>blocks</VueCustomTooltip>, which are assigned a unique identifier called a <VueCustomTooltip label="An address used to point to data in IPFS, based on the content itself, as opposed to the location." underlined multiline is-medium>Content Identifier (CID)</VueCustomTooltip>.  In general, the CID is computed by combining the hash of the data with its <VueCustomTooltip label="Software capable of encoding and/or decoding data." underlined multiline is-medium>codec</VueCustomTooltip>. The codec is generated using <VueCustomTooltip label="A collection of interoperable, extensible protocols for making data self-describable." underlined multiline is-medium>Multiformats</VueCustomTooltip>.
 
-CIDs are unique to the data from which they were computed, which provides IPFS with the following benefits:
-- Data can be fetched based on its content, rather than its location. 
-- The CID of the data received can be computed and compared to the CID requested, to verify that the data is what was requested.
+Because CIDs are based on content, not location:
+- You can fetch data by *what it is*, not where it's stored.
+- You can verify data by recomputing the CID and comparing it to what you requested.
 
 :::callout
 **Learn more**
-Learn more about the concepts behind CIDs described here with the [the CID deep dive](../concepts/content-addressing.md#cid-versions).
+Learn more about CIDs in the [CID deep dive](../concepts/content-addressing.md#cid-versions).
 :::
 
 
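Note on the `how-ipfs-works.md` hunk above: "verify data by recomputing the CID" can be illustrated for the simplest case. This is a hedged Python sketch and not part of this PR. The function name `verify_raw_block` is hypothetical, and it only handles base32 CIDv1 strings that use the `raw` codec with sha2-256, where the digest inside the CID is the digest of the block bytes themselves. Real implementations parse varints and support many codecs and hash functions.

```python
import base64
import hashlib

# Expected prefix: CIDv1 (0x01), raw codec (0x55), sha2-256 (0x12), 32-byte digest (0x20).
RAW_SHA256_PREFIX = bytes([0x01, 0x55, 0x12, 0x20])

def verify_raw_block(requested_cid: str, received: bytes) -> bool:
    """Return True if `received` hashes to the digest inside `requested_cid`."""
    if not requested_cid.startswith("b"):
        raise ValueError("sketch only handles base32 (multibase 'b') CIDs")
    # Restore the padding that multibase base32 strips, then decode.
    body = requested_cid[1:].upper()
    body += "=" * (-len(body) % 8)
    cid_bytes = base64.b32decode(body)
    prefix, expected_digest = cid_bytes[:4], cid_bytes[4:]
    if prefix != RAW_SHA256_PREFIX:
        raise ValueError("sketch only handles CIDv1 + raw + sha2-256")
    # Recompute the hash of what was received and compare it to the request.
    return hashlib.sha256(received).digest() == expected_digest
```

A mismatch means the bytes received are not the bytes that were requested, regardless of which peer or gateway served them.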
diff --git a/docs/quickstart/pin-cli.md b/docs/quickstart/pin-cli.md
@@ -122,22 +122,22 @@ Each method will return a **CID** (Content Identifier) for your uploaded file. S
 
 ## CIDs explained
 
-In IPFS, every file and directory is identified with a Content Identifier ([CID](../concepts/content-addressing.md)). The CID serves as the **permanent address** of the file and can be used by anyone to find it on the IPFS network.
+In IPFS, every file and directory is identified with a Content Identifier ([CID](../concepts/content-addressing.md)), an identifier derived from the file's contents. The CID serves as the **permanent address** of the file and can be used by anyone to find it on any IPFS network or system.
 
-When a file is first added to an IPFS node (like the image used in this guide), it's first transformed into a content-addressable representation in which the file is split into smaller chunks (if above ~1MB) which are linked together and hashed to produce the CID.
+When you add a file to IPFS, the system generates its CID by hashing the contents. Larger files (above ~1MB) are split into smaller chunks, linked together, and hashed.
 
-For example, a CID might look like:
+The resulting CID might look like this:
 
 ```plaintext
 bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4
 ```
 
-You can now share the CID with anyone and they can fetch the file using IPFS.
+Once you have a CID, you can share it with anyone and they can fetch the file using IPFS.
 
-To dive deeper into the anatomy of the CID, check out the [CID inspector](https://cid.ipfs.tech/#bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4).
+To explore the anatomy of a CID, check out the [CID Inspector](https://cid.ipfs.tech/#bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4). To explore the anatomy of the DAG behind a CID, check out the [DAG Explorer](https://explore.ipld.io/#/explore/bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4).
 
 :::callout
-The transformation into a content-addressable representation is a local operation that doesn't require any network connectivity. Many CLI tools perform this transformation locally before uploading.
+**Important caveat:** Two identical files can produce different CIDs, because the CID reflects both the contents *and* how the file is processed: chunk size, DAG layout, hash algorithm, CID version, and other [UnixFS](https://specs.ipfs.tech/unixfs/) parameters. See [CIDs are not file hashes](../concepts/content-addressing.md#cids-are-not-file-hashes) for details.
 :::
 
 ## Retrieving with a gateway
@@ -167,6 +167,8 @@ curl https://[BUCKET_NAME].ipfs.filebase.io/ipfs/[CID]
 
 ### Using public gateways
 
+You can also use [public IPFS gateways](../concepts/public-utilities.md#public-ipfs-gateways):
+
 ```shell
 curl https://ipfs.io/ipfs/[CID]
 # or
@@ -192,4 +194,4 @@ Possible next steps include:
   - [Storacha CLI documentation](https://docs.storacha.network/cli/)
   - [Pinata API documentation](https://docs.pinata.cloud/)
   - [Filebase S3 API guide](https://docs.filebase.com/api-documentation/s3-compatible-api)
-  - [Filebase IPFS RPC API](https://docs.filebase.com/api-documentation/ipfs-rpc-api)
+  - [Filebase IPFS RPC API](https://docs.filebase.com/api-documentation/ipfs-rpc-api)