Ripping data from refractiveindex.info for nk function
README — Building and Maintaining the nk-data Database
This page explains how to (1) recreate the nk-data folder (CSV files + catalog) from the refractiveindex.info (RII) YAML database, and (2) add custom sources manually so the shared nk function can use them immediately.
(Let me know of any glaring errors in the README as I haven't scrutinised it nearly as much as I did the code it's talking about.)
TL;DR: Run rii2catalog.py once to build nk-data/. When you get a new dataset, drop a CSV into nk-data/data/ and add a small JSON block to nk-data/catalog/catalog.json.
0) Expected layout
project/
├─ nk.m                  % (or nk_sam.m) the MATLAB function
└─ nk-data/
   ├─ data/              % CSV files: wavelength_um,n,k
   └─ catalog/
      └─ catalog.json    % all sources + metadata
The function looks for nk-data/ next to itself (or via opts.data_root or the NK_DATA_ROOT environment variable).
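For anyone scripting around the database from Python, the lookup can be mirrored roughly as below. This is only a sketch: the function name resolve_data_root is hypothetical, and the precedence (explicit option, then NK_DATA_ROOT, then the folder next to the script) is an assumption, since the order nk.m actually uses is not stated here.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_data_root(script_dir: Path, data_root: Optional[str] = None) -> Path:
    """Return the nk-data folder to use.

    ASSUMPTION: the precedence below is a guess, not read from nk.m.
    """
    if data_root:  # explicit option, analogous to opts.data_root
        return Path(data_root)
    env_value = os.environ.get("NK_DATA_ROOT")
    if env_value:  # environment variable override
        return Path(env_value)
    return script_dir / "nk-data"  # default: folder next to the script
```

Calling resolve_data_root(Path(__file__).parent) from a helper script would then pick up the same folder that the environment variable points nk at.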
1) Rebuild nk-data from refractiveindex.info
Prerequisites
- Python 3.9+
- PyYAML (for parsing RII YAML)
- The RII database (release zip or cloned repo)
PowerShell:
python --version
pip install pyyaml
Get the RII database
Download/unzip the official RII database so you have a path like:
C:\...\rii-db\database\
  data\main\...
  data\organic\...
  data\glass\...
  data\other\...
  data\sopra\...
  data\3d\...
(Those subfolders are “shelves” of materials.)
Run the converter
Use the included rii2catalog.py to convert YAML → CSV + catalog JSON.
PowerShell (from your project folder that has nk.m and rii2catalog.py):
$DBROOT = "C:\Users\<you>\...\rii-db\database"
python .\rii2catalog.py --db-root "$DBROOT" --out-root .\nk-data --log-sampling `
  --shelves data/main data/organic data/glass data/other data/sopra data/3d
What it produces:
- CSV files into nk-data/data/ with the exact header:
  wavelength_um,n,k
  Wavelengths are in micrometers (µm). n or k may be blank if absent.
- catalog.json into nk-data/catalog/ with, per source:
  * source_id (stable identifier)
  * kind (thinfilm or bulk) and valid_thickness_nm if detected
  * path (relative to nk-data)
  * valid_wavelength_um = [wmin, wmax]
  * has_k (boolean)
  * year (parsed from the reference text; newer wins ties)
  * ref_string, notes, provenance fields
  * SHA-256 checksums for each CSV (optional but included)
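To make the CSV format concrete, here is a minimal sketch of a reader for it. The name read_nk_csv is hypothetical and not part of rii2catalog.py or nk.m; the sketch only assumes the header and the blank-cell convention described above.

```python
import csv
import io
from typing import List, Optional, Tuple

Row = Tuple[float, Optional[float], Optional[float]]


def read_nk_csv(text: str) -> List[Row]:
    """Parse CSV text with header wavelength_um,n,k; blank n or k becomes None."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    if header != ["wavelength_um", "n", "k"]:
        raise ValueError(f"unexpected header: {header}")
    rows: List[Row] = []
    for record in reader:
        if not record:  # skip completely empty lines
            continue
        wl, n, k = (cell.strip() for cell in record)
        rows.append((float(wl), float(n) if n else None, float(k) if k else None))
    return rows


sample = "wavelength_um,n,k\n0.40,1.47,\n0.50,1.46,0.001\n"
print(read_nk_csv(sample))
```

The first sample row has a blank k, so its third element comes back as None rather than 0; deciding whether to treat that as zero is left to the caller, as with opts.allow_k_zero.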
Supported formula sampling:
- Formula 1 (Sellmeier-like) and 4 (Cauchy) → sampled to n(λ)
- Other formula types are skipped (console shows [SKIP])
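As an illustration of what "sampled to n(λ)" means for a Sellmeier-type entry, the sketch below evaluates a dispersion formula on a logarithmically spaced wavelength grid. It is not the code in rii2catalog.py. It assumes RII's formula 1 has the form n² − 1 = c0 + Σ Bᵢ λ² / (λ² − Cᵢ²) with λ in µm, and that --log-sampling means log-spaced points; the function names and the grid size are hypothetical. The coefficients are the widely quoted Malitson (1965) values for fused silica.

```python
import math
from typing import List, Sequence


def sellmeier_n(wl_um: float, coeffs: Sequence[float]) -> float:
    """Evaluate n from n^2 - 1 = c0 + sum(B_i * wl^2 / (wl^2 - C_i^2)).

    ASSUMPTION: coeffs is laid out as [c0, B1, C1, B2, C2, ...] with C_i in um.
    """
    n_squared = 1.0 + coeffs[0]
    for b, c in zip(coeffs[1::2], coeffs[2::2]):
        n_squared += b * wl_um**2 / (wl_um**2 - c**2)
    return math.sqrt(n_squared)


def log_grid(wmin_um: float, wmax_um: float, count: int) -> List[float]:
    """Return count wavelengths spaced evenly in log(wavelength)."""
    if count < 2:
        raise ValueError("count must be at least 2")
    step = math.log(wmax_um / wmin_um) / (count - 1)
    return [wmin_um * math.exp(i * step) for i in range(count)]


# Fused silica (Malitson 1965), valid for roughly 0.21 to 6.7 um.
FUSED_SILICA = [0.0, 0.6961663, 0.0684043, 0.4079426, 0.1162414, 0.8974794, 9.896161]

for wl in log_grid(0.21, 6.7, 5):
    print(f"{wl:.4f},{sellmeier_n(wl, FUSED_SILICA):.5f},")
```

Each printed line follows the wavelength_um,n,k layout with k left blank, which is how a formula-only source would look in nk-data/data/.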
Notes
- All wavelengths are stored in µm (RII’s internal unit). The MATLAB caller can pass meters/nm/µm; nk converts internally.
- Pages with multiple DATA blocks (e.g., multiple thicknesses) produce multiple CSVs. Per-entry thickness is inferred where possible.
- Some families (e.g., ITO) live under umbrella groups in RII. The MATLAB function augments these at runtime (e.g., synthetic materials.ito built from mixed_crystals).
2) Manually adding a new dataset
You can add lab data or literature without re-running the scraper.
2.1 Prepare the CSV
Create a file in nk-data/data/ with this exact header:
wavelength_um,n,k
- Wavelengths: in µm, strictly increasing.
- If you only have n, leave the k column blank. The MATLAB function can treat missing k as zero with opts.allow_k_zero = true.
- Suggested filenames: <material>.<page>.<tag>.csv, e.g.:
  au.smith-2025.d1.csv
  sio2.custom-lab-2025.csv
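Before registering a hand-made file, it can help to check it against the rules above. The following is a small sketch of such a check; check_nk_csv is a hypothetical helper, not something shipped with the project, and it only tests the header, the column count, and that wavelengths are strictly increasing.

```python
import csv
import io
from typing import List


def check_nk_csv(text: str) -> List[str]:
    """Return a list of problems; an empty list means the basic rules hold."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows or rows[0] != ["wavelength_um", "n", "k"]:
        return ["header must be exactly wavelength_um,n,k"]
    problems: List[str] = []
    previous = None
    for lineno, record in enumerate(rows[1:], start=2):
        if len(record) != 3:
            problems.append(f"line {lineno}: expected 3 columns, got {len(record)}")
            continue
        try:
            wavelength = float(record[0])
        except ValueError:
            problems.append(f"line {lineno}: wavelength {record[0]!r} is not a number")
            continue
        if previous is not None and wavelength <= previous:
            problems.append(f"line {lineno}: wavelength {wavelength} is not strictly increasing")
        previous = wavelength
    return problems


print(check_nk_csv("wavelength_um,n,k\n0.5,1.5,\n0.4,1.5,\n"))
```

The example input has its two rows in descending order, so the check reports one problem for line 3.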
2.2 Add a source entry to catalog.json
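Open nk-data/catalog/catalog.json. Find the target material under materials. If it doesn’t exist, add a new object for it. Append a source record (the // comments below are annotations only; leave them out of the real file, since standard JSON does not allow comments):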
{
  "source_id": "au.smith-2025.d1",
  "kind": "thinfilm",                      // or "bulk"
  "path": "data/au.smith-2025.d1.csv",     // relative to nk-data/
  "valid_wavelength_um": [0.4, 1.1],       // µm
  "valid_thickness_nm": [100, 100],        // [tmin,tmax] in nm, or null for bulk
  "priority": 1,                           // not used by nk; keep 1
  "year": 2025,                            // tie-breaker (newer wins)
  "ref_string": "Smith et al., Opt. Lett. (2025) ...",
  "notes": "Ellipsometry, sputtered Au on glass, room temp.",
  "shelf": "lab",
  "book": "Au",
  "page": "smith-2025-d1",
  "has_k": true
}
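Appending a record like this can also be scripted instead of edited by hand. The sketch below is one possible approach, not part of the project: add_source is a hypothetical helper, and it assumes the catalog has a top-level "materials" object whose entries carry "aliases", "default_source_id", and a "sources" list.

```python
import json
from pathlib import Path
from typing import Iterable


def add_source(catalog_path: Path, material: str, source: dict,
               aliases: Iterable[str] = ()) -> None:
    """Append one source record to a material, creating the material if needed."""
    catalog = json.loads(catalog_path.read_text(encoding="utf-8"))
    materials = catalog.setdefault("materials", {})
    # ASSUMPTION: a new material starts with these three keys.
    entry = materials.setdefault(
        material,
        {"aliases": list(aliases), "default_source_id": None, "sources": []},
    )
    if any(s["source_id"] == source["source_id"] for s in entry["sources"]):
        raise ValueError(f"source_id already present: {source['source_id']}")
    entry["sources"].append(source)
    catalog_path.write_text(
        json.dumps(catalog, indent=2, ensure_ascii=False) + "\n", encoding="utf-8"
    )
```

Because json.dumps writes plain JSON, a record added this way never contains // annotations. The duplicate check guards the rule that source_id values must be unique.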
Material block (simplified):
"au": {
  "aliases": ["Au","gold"],
  "default_source_id": null,
  "sources": [
    { ... },                               // existing sources
    {
      "source_id": "au.smith-2025.d1",
      "kind": "thinfilm",
      "path": "data/au.smith-2025.d1.csv",
      "valid_wavelength_um": [0.4, 1.1],
      "valid_thickness_nm": [100, 100],
      "priority": 1,
      "year": 2025,
      "ref_string": "Smith et al., Opt. Lett. (2025) ...",
      "notes": "Ellipsometry, sputtered Au on glass, room temp.",
      "shelf": "lab",
      "book": "Au",
      "page": "smith-2025-d1",
      "has_k": true
    }
  ]
}
Checksums (optional, for info.data_checksum):
$h = (Get-FileHash .\nk-data\data\au.smith-2025.d1.csv -Algorithm SHA256).Hash.ToLower()
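If PowerShell is not available, the same lowercase SHA-256 hex digest can be computed with Python's standard library. This is a small sketch; checksum_entry is a hypothetical name, and the "sha256-" prefix simply follows the catalog format used in this README.

```python
import hashlib
from pathlib import Path


def checksum_entry(csv_path: Path) -> str:
    """Return the catalog checksum string for one CSV file."""
    digest = hashlib.sha256(csv_path.read_bytes()).hexdigest()  # lowercase hex
    return f"sha256-{digest}"
```

For example, checksum_entry(Path("nk-data/data/au.smith-2025.d1.csv")) returns the value to store for that file.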
Add to the "checksums" object:
"data/au.smith-2025.d1.csv": "sha256-<paste the hash here>"
2.3 Field rules (quick)
- source_id: unique; recommend <material>.<page>[.dN]
- kind: "thinfilm" if thickness known, else "bulk"
- valid_wavelength_um: matches your CSV’s min/max
- valid_thickness_nm: [t,t] for single thickness; null for bulk/unknown
- year: 4-digit year (tie-breaker)
- has_k: true if the k column exists (even if some rows are blank)
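The field rules lend themselves to being derived from the data rather than typed in. Below is a sketch of a record builder under a few stated assumptions: make_source is a hypothetical helper; rows are (wavelength_um, n, k) tuples with None for blank cells; and has_k is read here as "at least one k value is present", which is my interpretation of the rule above rather than something confirmed by nk.m.

```python
from typing import List, Optional, Tuple

Row = Tuple[float, Optional[float], Optional[float]]


def make_source(source_id: str, csv_rel_path: str, rows: List[Row],
                year: int, thickness_nm: Optional[float] = None) -> dict:
    """Build a catalog source record from parsed CSV rows."""
    wavelengths = [wl for wl, _n, _k in rows]
    known_thickness = thickness_nm is not None
    return {
        "source_id": source_id,
        "kind": "thinfilm" if known_thickness else "bulk",
        "path": csv_rel_path,  # relative to nk-data/
        "valid_wavelength_um": [min(wavelengths), max(wavelengths)],
        "valid_thickness_nm": [thickness_nm, thickness_nm] if known_thickness else None,
        "priority": 1,  # not used by nk; keep 1
        "year": year,
        # ASSUMPTION: has_k means at least one row carries a k value.
        "has_k": any(k is not None for _wl, _n, k in rows),
    }


rows = [(0.4, 1.47, None), (1.1, 0.25, 6.8)]
print(make_source("au.smith-2025.d1", "data/au.smith-2025.d1.csv", rows, 2025, 100))
```

Fields such as ref_string, notes, shelf, book, and page still need to be filled in by hand, since they cannot be inferred from the CSV.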
3) Sanity checks in MATLAB
Point nk at your database if it isn’t next to nk.m:
setenv('NK_DATA_ROOT', 'C:\path\to\project\nk-data')
Examples:
% Thin-film preference (e.g., Ag at 532 nm, 100 nm film)
[n,k,info] = nk(532e-9, 'ag', struct('units','m','thickness_nm',100));
disp(info.source_id); disp(info.selection_reason);

% Bulk fallback (thickness > 500 nm)
[n2,k2,info2] = nk(780e-9, 'au', struct('units','m','thickness_nm',600));

% Legacy names and inline paths
[n3,k3,info3] = nk(532e-9, 'fused silica', struct('units','m')); % → SiO2
[n4,k4,info4] = nk(532e-9, 'BK7', struct('units','m'));          % inline Sellmeier
[n5,k5,info5] = nk(532e-9, 'ITO', struct('units','m'));          % ITO aggregator
Expected behavior:
- For thickness ≤ 500 nm, thin-film datasets are preferred (closest thickness).
- Tie-breaks: has k → newer year → wider λ-span → source_id.
- Legacy names (e.g., water, fused silica, diamond, air, ITO) resolve.
- BK7 uses the inline Sellmeier to preserve legacy results.
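The selection rules above can be sketched in a few lines. This is an illustration of the described behavior, not a port of nk.m: the function names are hypothetical, records are assumed to look like the catalog entries in section 2.2 with an integer year, and what happens when no thickness is given (bulk preferred here) is an assumption.

```python
from typing import List, Optional

THIN_FILM_LIMIT_NM = 500.0  # thin-film data preferred at or below this thickness


def covers(src: dict, wl_um: float) -> bool:
    lo, hi = src["valid_wavelength_um"]
    return lo <= wl_um <= hi


def thickness_gap(src: dict, thickness_nm: float) -> float:
    """Distance in nm from the requested thickness to the source's range."""
    tmin, tmax = src["valid_thickness_nm"]
    if tmin <= thickness_nm <= tmax:
        return 0.0
    return min(abs(thickness_nm - tmin), abs(thickness_nm - tmax))


def tie_break_key(src: dict) -> tuple:
    """has k first, then newer year, then wider span, then source_id."""
    lo, hi = src["valid_wavelength_um"]
    return (not src["has_k"], -src["year"], -(hi - lo), src["source_id"])


def choose_source(sources: List[dict], wl_um: float,
                  thickness_nm: Optional[float] = None) -> dict:
    candidates = [s for s in sources if covers(s, wl_um)]
    if not candidates:
        raise LookupError(f"No source covers {wl_um} um")
    if thickness_nm is not None and thickness_nm <= THIN_FILM_LIMIT_NM:
        films = [s for s in candidates
                 if s["kind"] == "thinfilm" and s.get("valid_thickness_nm")]
        if films:  # closest thickness wins, ties fall through to tie_break_key
            return min(films, key=lambda s: (thickness_gap(s, thickness_nm),)
                       + tie_break_key(s))
    # ASSUMPTION: otherwise bulk data is preferred when any exists.
    bulk = [s for s in candidates if s["kind"] == "bulk"]
    return min(bulk or candidates, key=tie_break_key)
```

Encoding the tie-breaks as a single sort key keeps the ordering explicit: each tuple element is only consulted when all earlier elements are equal.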
4) Troubleshooting
- “No source covers …” – The wavelength must be inside a source’s [λmin, λmax]. The error lists the top partial overlaps; choose another wavelength or add a source.
- “CSV missing column ‘…’” – The CSV header must be exactly wavelength_um,n,k. Units are µm.
- “Chosen source lacks k-values” – Either add k to the CSV or call with opts.allow_k_zero = true.
- Legacy name not found – Add it to the alias table in nk.m (function local_alias_map()). For umbrella cases (e.g., ITO), the function augments the catalog at runtime.
- Default units – If old code passed meters, set the default in local_parse_opts:
  'units','m' % instead of 'um'
5) Conventions & credits
- Wavelength unit: all CSVs store wavelengths in µm. Callers can use opts.units ('m'/'nm'/'um').
- Thickness unit: nm in the catalog.
- Credit: Data derived from refractiveindex.info; cite the original publications listed in ref_string.
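For scripts that read the CSVs directly, the unit convention amounts to a single scale factor. A minimal sketch, with to_micrometers as a hypothetical helper that accepts the same unit names as opts.units:

```python
TO_UM = {"m": 1e6, "nm": 1e-3, "um": 1.0}  # multiply by this to get micrometers


def to_micrometers(wavelength: float, units: str = "um") -> float:
    """Convert a wavelength given in 'm', 'nm', or 'um' to micrometers."""
    try:
        factor = TO_UM[units]
    except KeyError:
        raise ValueError(f"unknown units {units!r}; use 'm', 'nm', or 'um'") from None
    return wavelength * factor


print(to_micrometers(532e-9, "m"), to_micrometers(532, "nm"))
```

Both calls describe the same 532 nm line, so both results are 0.532 µm up to floating-point rounding.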
6) One-line rebuild (for future you)
$DBROOT = "C:\...\rii-db\database"
python .\rii2catalog.py --db-root "$DBROOT" --out-root .\nk-data --log-sampling `
  --shelves data/main data/organic data/glass data/other data/sopra data/3d