Commit be4fa47

Merge pull request #2217 from ipfs/cid-clarifications
Add caveat about CID determinism
2 parents 2b4a28b + 9c8406c commit be4fa47

5 files changed: +83 -11 lines changed
.github/styles/README.md

Lines changed: 25 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,25 @@
1+
# Vale Styles
2+
3+
This directory contains Vale linting configuration for ipfs-docs.
4+
5+
## Spelling Rules
6+
7+
There are two spelling systems:
8+
9+
1. **`Vocab/ipfs-docs-vocab/accept.txt`** - General Vale vocabulary
10+
2. **`pln-ignore.txt`** - Custom ignore file for `docs/PLNSpelling.yml`
11+
12+
### Fixing PLNSpelling Errors
13+
14+
When CI fails with `[docs.PLNSpelling] Did you really mean 'word'?`:
15+
16+
1. Add the word to **`pln-ignore.txt`** (lowercase)
17+
2. Do NOT add it to `Vocab/ipfs-docs-vocab/accept.txt` - that file is for other Vale rules
18+
19+
The `PLNSpelling.yml` rule explicitly references `pln-ignore.txt`:
20+
21+
```yaml
22+
extends: spelling
23+
ignore:
24+
- pln-ignore.txt
25+
```

.github/styles/pln-ignore.txt

Lines changed: 4 additions & 0 deletions
@@ -27,6 +27,7 @@ boolean
2727
Bootstrappers
2828
boxo
2929
browserify
30+
buzhash
3031
caddy
3132
Caddyfile
3233
callout
@@ -41,6 +42,7 @@ clis
4142
cmds
4243
cnames
4344
codec
45+
codecs
4446
codecov
4547
coinlist
4648
composable
@@ -56,6 +58,7 @@ dapps
5658
data('s)
5759
datastore
5860
deduplicate
61+
deduplication
5962
Denylist
6063
denylist
6164
dep
@@ -200,6 +203,7 @@ philz
200203
pinset
201204
pipeable
202205
plaintext
206+
PLNSpelling
203207
pluggable
204208
powergate
205209
powershell

docs/concepts/content-addressing.md

Lines changed: 41 additions & 0 deletions
@@ -87,6 +87,47 @@ shasum: WARNING: 1 computed checksum did NOT match
8787

8888
As we can see, the hash included in the CID does NOT match the hash of the input file `ubuntu-20.04.1-desktop-amd64.iso`.
8989

90+
### Why the hashes differ
91+
92+
The example above shows that the [Multihash](glossary.md#multihash) inside a CID does not match a simple file checksum. This is because the Multihash is the hash of the [root block](glossary.md#root), not a direct hash of the file's bytes.
93+
94+
When you add a file to IPFS, the data goes through several transformations:
95+
96+
1. **Chunking**: Large files are split into smaller [blocks](glossary.md#block) (typically 256KiB-1MiB each)
97+
2. **Structuring**: These blocks are organized into a [DAG](glossary.md#dag) (directed acyclic graph)
98+
3. **Encoding**: A [codec](glossary.md#codec) wraps the data with metadata describing its structure
99+
100+
The root block contains links to all the other blocks, and it's this root block that gets hashed to produce the Multihash in your CID.
101+
102+
#### When CID hash equals file hash
103+
104+
There is one case where the Multihash does equal the file's hash: when the CID uses the `raw` [codec](glossary.md#codec) and the file fits in a single block. The `raw` codec stores bytes without any wrapper, so for small files added with `--raw-leaves`, the Multihash is a direct hash of the file contents.
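The single-block `raw` case can be sketched in a few lines of Python. This is a minimal illustration, not how any IPFS implementation builds CIDs: it assumes a CIDv1 with the `raw` multicodec (`0x55`), a sha2-256 multihash (`0x12`), and base32 multibase encoding, and the helper names `raw_cidv1` and `multihash_digest` are hypothetical.

```python
import base64
import hashlib

RAW_CODEC = 0x55  # multicodec code for `raw`
SHA2_256 = 0x12   # multihash code for sha2-256


def raw_cidv1(data: bytes) -> str:
    """Build a base32 CIDv1 for a single raw block hashed with sha2-256."""
    digest = hashlib.sha256(data).digest()
    # Multihash = <hash function code><digest length><digest>
    multihash = bytes([SHA2_256, len(digest)]) + digest
    # CIDv1 = <version><content codec><multihash>
    cid_bytes = bytes([0x01, RAW_CODEC]) + multihash
    # Multibase base32: prefix "b", lowercase alphabet, no padding.
    encoded = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + encoded


def multihash_digest(cid: str) -> bytes:
    """Extract the digest from a CID produced by raw_cidv1 (hypothetical helper)."""
    body = cid[1:].upper()
    body += "=" * (-len(body) % 8)  # restore the padding base32 decoding expects
    cid_bytes = base64.b32decode(body)
    return cid_bytes[4:]  # skip version, codec, hash code, and digest length


data = b"hello ipfs\n"
cid = raw_cidv1(data)
print(cid)
# For a single raw block, the digest inside the CID is the plain file hash.
print(multihash_digest(cid) == hashlib.sha256(data).digest())
```

The one-byte prefixes work here only because every code involved is below `0x80`; larger codes would need proper varint encoding.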
105+
106+
#### Same file, different CIDs
107+
108+
Two identical files can produce different CIDs. The CID depends on both the content *and* how that content is structured:
109+
110+
- **Chunk size**: Different chunking strategies produce different block trees
111+
- **DAG layout**: Balanced trees vs. trickle DAGs organize blocks differently
112+
- **Codec**: [UnixFS](glossary.md#unixfs) ([dag-pb](glossary.md#dag-pb)), [dag-cbor](glossary.md#dag-cbor), `raw`, and others each encode data differently
113+
- **CID version**: [CIDv0](glossary.md#cid-v0) vs [CIDv1](glossary.md#cid-v1) use different formats
114+
- **Hash algorithm**: sha2-256, blake3, and others produce different hashes
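The effect of chunk size on the root hash can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not UnixFS: it hashes fixed-size chunks and then hashes the concatenated chunk hashes as a stand-in for a root block. The function name `toy_root_hash` is hypothetical.

```python
import hashlib


def toy_root_hash(data: bytes, chunk_size: int) -> str:
    """Hash fixed-size chunks, then hash the list of chunk hashes (toy model)."""
    leaves = [
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]
    # The "root block" in this toy model is just the concatenated leaf hashes.
    return hashlib.sha256(b"".join(leaves)).hexdigest()


data = bytes(range(256)) * 16  # 4096 bytes of sample input

print(toy_root_hash(data, 1024))           # root for four 1024-byte chunks
print(toy_root_hash(data, 2048))           # root for two 2048-byte chunks
print(hashlib.sha256(data).hexdigest())    # plain checksum of the same bytes
```

All three values differ even though the input bytes are identical, which mirrors why the same file can yield different CIDs under different import parameters.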
115+
116+
#### Why this flexibility matters
117+
118+
This is a feature, not a limitation. Different structures optimize for different use cases:
119+
120+
- **DAG layout** trades off seeking against appending: balanced DAGs enable fast random access in large files like videos, while trickle DAGs optimize for sequential, append-only data like logs
121+
- **Chunking strategy** balances retrieval overhead against sync efficiency: large chunks mean fewer blocks for bulk downloads, small chunks mean less data to transfer when syncing deltas. Strategies range from simple fixed-size chunking to content-defined algorithms like Rabin or Buzhash that fine-tune deduplication based on dataset characteristics
122+
- **Hash function** varies by system: legacy decisions, regulatory requirements, or interoperability needs may dictate which algorithm to use
123+
- **Directory sharding** threshold, in systems like [UnixFS](glossary.md#unixfs), determines when directories switch from flat listings to [HAMT](glossary.md#hamt-sharding) to seamlessly support huge directories with millions of files. This threshold also affects how much of the DAG needs to be recreated when a single file in the directory is modified
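To make the chunking trade-off concrete, the following sketch compares fixed-size chunking with a deliberately simple content-defined chunker when one byte is inserted at the start of the data. It is not Rabin or Buzhash: as an assumption for illustration, it decides boundaries by hashing a 16-byte window with sha2-256 instead of using a rolling hash, and the names `fixed_chunks`, `content_defined_chunks`, and `shared` are hypothetical.

```python
import hashlib
import random

WINDOW = 16  # bytes of context that decide whether to cut
MASK = 0xFF  # cut when the low 8 bits are zero (about 256-byte average chunks)


def fixed_chunks(data: bytes, size: int = 256) -> list:
    """Split data into fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes) -> list:
    """Cut wherever the hash of the preceding WINDOW bytes matches the mask."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data)):
        window_hash = hashlib.sha256(data[end - WINDOW:end]).digest()
        if int.from_bytes(window_hash[:4], "big") & MASK == 0:
            chunks.append(data[start:end])
            start = end
    chunks.append(data[start:])
    return chunks


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20_000))
modified = b"!" + original  # same content with one byte inserted at the front


def shared(chunker) -> int:
    """Count distinct chunks that appear in both versions."""
    return len(set(chunker(original)) & set(chunker(modified)))


print("fixed-size chunks shared:", shared(fixed_chunks))
print("content-defined chunks shared:", shared(content_defined_chunks))
```

Because content-defined boundaries depend only on nearby bytes, they shift along with the data, so nearly every chunk after the insertion is reused. Fixed-size boundaries stay at the same offsets, so every chunk's contents change.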
124+
125+
[UnixFS](glossary.md#unixfs) is the default format for files and directories, but you can use other codecs or create custom ones for specialized needs.
126+
127+
When you need reproducible CIDs across different tools, the community documents common parameter sets called [CID profiles](https://github.com/ipfs/specs/pull/499). These define standard combinations of chunking, DAG layout, and codec settings.
128+
129+
To explore how a CID is structured, use the [CID Inspector](https://cid.ipfs.tech/#bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4). To see the DAG behind a CID, use the [DAG Explorer](https://explore.ipld.io/#/explore/bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4).
130+
90131

91132
## CID versions
92133

docs/concepts/how-ipfs-works.md

Lines changed: 4 additions & 4 deletions
@@ -45,13 +45,13 @@ IPFS represents data as content-addressed <VueCustomTooltip label="The term for
4545

4646
In IPFS, data is chunked into <VueCustomTooltip label="The term for a single unit of data in IPFS." underlined multiline is-medium>blocks</VueCustomTooltip>, which are assigned a unique identifier called a <VueCustomTooltip label="An address used to point to data in IPFS, based on the content itself, as opposed to the location." underlined multiline is-medium>Content Identifier (CID)</VueCustomTooltip>. In general, the CID is computed by combining the hash of the data with its <VueCustomTooltip label="Software capable of encoding and/or decoding data." underlined multiline is-medium>codec</VueCustomTooltip>. The codec is generated using <VueCustomTooltip label="A collection of interoperable, extensible protocols for making data self-describable." underlined multiline is-medium>Multiformats</VueCustomTooltip>.
4747

48-
CIDs are unique to the data from which they were computed, which provides IPFS with the following benefits:
49-
- Data can be fetched based on its content, rather than its location.
50-
- The CID of the data received can be computed and compared to the CID requested, to verify that the data is what was requested.
48+
Because CIDs are based on content, not location:
49+
- You can fetch data by *what it is*, not where it's stored.
50+
- You can verify data by recomputing the CID and comparing it to what you requested.
5151

5252
:::callout
5353
**Learn more**
54-
Learn more about the concepts behind CIDs described here with the [the CID deep dive](../concepts/content-addressing.md#cid-versions).
54+
Learn more about CIDs in the [CID deep dive](../concepts/content-addressing.md#cid-versions).
5555
:::
5656

5757

docs/quickstart/pin-cli.md

Lines changed: 9 additions & 7 deletions
@@ -122,22 +122,22 @@ Each method will return a **CID** (Content Identifier) for your uploaded file. S
122122

123123
## CIDs explained
124124

125-
In IPFS, every file and directory is identified with a Content Identifier ([CID](../concepts/content-addressing.md)). The CID serves as the **permanent address** of the file and can be used by anyone to find it on the IPFS network.
125+
In IPFS, every file and directory is identified with a Content Identifier ([CID](../concepts/content-addressing.md)), an identifier derived from the file's contents. The CID serves as the **permanent address** of the file and can be used by anyone to find it on any IPFS network or system.
126126

127-
When a file is first added to an IPFS node (like the image used in this guide), it's first transformed into a content-addressable representation in which the file is split into smaller chunks (if above ~1MB) which are linked together and hashed to produce the CID.
127+
When you add a file to IPFS, the system generates its CID by hashing the contents. Larger files (above ~1MB) are split into smaller chunks, linked together, and hashed.
128128

129-
For example, a CID might look like:
129+
The resulting CID might look like this:
130130

131131
```plaintext
132132
bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4
133133
```
134134

135-
You can now share the CID with anyone and they can fetch the file using IPFS.
135+
Once you have a CID, you can share it with anyone and they can fetch the file using IPFS.
136136

137-
To dive deeper into the anatomy of the CID, check out the [CID inspector](https://cid.ipfs.tech/#bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4).
137+
To explore the anatomy of a CID, check out the [CID Inspector](https://cid.ipfs.tech/#bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4). To explore the anatomy of the DAG behind a CID, check out the [DAG Explorer](https://explore.ipld.io/#/explore/bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4).
138138

139139
:::callout
140-
The transformation into a content-addressable representation is a local operation that doesn't require any network connectivity. Many CLI tools perform this transformation locally before uploading.
140+
**Important caveat:** Two identical files can produce different CIDs. The CID reflects the contents *and* how the file is processed: chunk size, DAG layout, hash algorithm, CID version, and other [UnixFS](https://specs.ipfs.tech/unixfs/) parameters. The same file processed with different parameters will produce different CIDs. See [CIDs are not file hashes](../concepts/content-addressing.md#cids-are-not-file-hashes) for details.
141141
:::
142142

143143
## Retrieving with a gateway
@@ -167,6 +167,8 @@ curl https://[BUCKET_NAME].ipfs.filebase.io/ipfs/[CID]
167167

168168
### Using public gateways
169169

170+
You can also use [public IPFS gateways](../concepts/public-utilities.md#public-ipfs-gateways):
171+
170172
```shell
171173
curl https://ipfs.io/ipfs/[CID]
172174
# or
@@ -192,4 +194,4 @@ Possible next steps include:
192194
- [Storacha CLI documentation](https://docs.storacha.network/cli/)
193195
- [Pinata API documentation](https://docs.pinata.cloud/)
194196
- [Filebase S3 API guide](https://docs.filebase.com/api-documentation/s3-compatible-api)
195-
- [Filebase IPFS RPC API](https://docs.filebase.com/api-documentation/ipfs-rpc-api)
197+
- [Filebase IPFS RPC API](https://docs.filebase.com/api-documentation/ipfs-rpc-api)
