Ripping data from refractiveindex.info for nk function

From Applied Optics Wiki
Revision as of 19:48, 4 September 2025 by Samuel Karet

README — Building and Maintaining the nk-data Database

This page explains how to (1) recreate the nk-data folder (CSV files + catalog) from the refractiveindex.info (RII) YAML database, and (2) add custom sources manually so the shared nk function can use them immediately. (Let me know of any glaring errors in the README as I haven't scrutinised it nearly as much as I did the code it's talking about.)

TL;DR: Run rii2catalog.py once to build nk-data/. When you get a new dataset, drop a CSV into nk-data/data/ and add a small JSON block to nk-data/catalog/catalog.json.


0) Expected layout

project/
├─ nk.m                    % (or nk_sam.m) the MATLAB function
└─ nk-data/
   ├─ data/                % CSV files: wavelength_um,n,k
   └─ catalog/
      └─ catalog.json      % all sources + metadata

The function looks for nk-data/ next to itself (or via opts.data_root or the NK_DATA_ROOT environment variable).
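
As a rough illustration of that lookup, here is a minimal Python sketch. The function name resolve_data_root and the precedence order (explicit option first, then NK_DATA_ROOT, then the folder next to the function) are assumptions for illustration; the page does not state which location nk.m checks first.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_data_root(explicit: Optional[str], script_dir: str) -> Path:
    """Return the nk-data folder to use (precedence order is an assumption)."""
    if explicit:                             # analogous to opts.data_root
        return Path(explicit)
    env = os.environ.get("NK_DATA_ROOT")     # environment override
    if env:
        return Path(env)
    return Path(script_dir) / "nk-data"      # default: next to the function


print(resolve_data_root(None, "project"))
```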


1) Rebuild nk-data from refractiveindex.info

Prerequisites

  • Python 3.9+
  • PyYAML (for parsing RII YAML)
  • The RII database (release zip or cloned repo)

PowerShell:

python --version
pip install pyyaml

Get the RII database

Download/unzip the official RII database so you have a path like:

C:\...\rii-db\database\
  data\main\...
  data\organic\...
  data\glass\...
  data\other\...
  data\sopra\...
  data\3d\...

(Those subfolders are “shelves” of materials.)

Run the converter

Use the included rii2catalog.py to convert YAML → CSV + catalog JSON.

PowerShell (from your project folder that has nk.m and rii2catalog.py):

$DBROOT = "C:\Users\<you>\...\rii-db\database"
python .\rii2catalog.py --db-root "$DBROOT" --out-root .\nk-data --log-sampling `
  --shelves data/main data/organic data/glass data/other data/sopra data/3d

What it produces:

  • CSV files into nk-data/data/ with the exact header:
wavelength_um,n,k
 Wavelengths are in micrometers (µm). n or k may be blank if absent.
  • catalog.json into nk-data/catalog/ with, per source:
 * source_id (stable identifier)
 * kind (thinfilm or bulk) and valid_thickness_nm if detected
 * path (relative to nk-data)
 * valid_wavelength_um = [wmin, wmax]
 * has_k (boolean)
 * year (parsed from the reference text; newer wins ties)
 * ref_string, notes, provenance fields
 * SHA-256 checksums for each CSV (optional but included)

Supported formula sampling:

  • Formula 1 (Sellmeier-like) and 4 (Cauchy) → sampled to n(λ)
  • Other formula types are skipped (console shows [SKIP])
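
To make "sampled to n(λ)" concrete, here is a minimal sketch of evaluating a Sellmeier-type formula on a log-spaced wavelength grid. It assumes the usual form n² − 1 = c₀ + Σ Bᵢλ²/(λ² − Cᵢ²) with λ in µm; the function names and the number of sample points are illustrative, and the actual rii2catalog.py may differ. The coefficients are the standard fused-silica (Malitson) values.

```python
import math
from typing import List, Sequence, Tuple


def sellmeier_n(wavelength_um: float, coeffs: Sequence[float]) -> float:
    """n from n^2 - 1 = c[0] + sum_i B_i * lam^2 / (lam^2 - C_i^2)."""
    lam2 = wavelength_um ** 2
    total = coeffs[0]
    # The remaining coefficients come in (B_i, C_i) pairs.
    for b, c in zip(coeffs[1::2], coeffs[2::2]):
        total += b * lam2 / (lam2 - c ** 2)
    return math.sqrt(1.0 + total)


def log_grid(wmin: float, wmax: float, count: int) -> List[float]:
    """Geometrically spaced wavelengths from wmin to wmax inclusive."""
    if count < 2:
        return [wmin]
    ratio = (wmax / wmin) ** (1.0 / (count - 1))
    return [wmin * ratio ** i for i in range(count)]


def sample_formula(coeffs: Sequence[float], wmin: float, wmax: float,
                   count: int = 5) -> List[Tuple[float, float]]:
    """Tabulate (wavelength_um, n) pairs on the log grid."""
    return [(w, sellmeier_n(w, coeffs)) for w in log_grid(wmin, wmax, count)]


# Fused-silica coefficients (Malitson), wavelengths in µm.
FUSED_SILICA = [0.0, 0.6961663, 0.0684043, 0.4079426, 0.1162414,
                0.8974794, 9.896161]

print("wavelength_um,n,k")
for w, n in sample_formula(FUSED_SILICA, 0.3, 2.0):
    print(f"{w:.6f},{n:.6f},")   # k left blank: the formula gives n only
```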

Notes

  • All wavelengths are stored in µm (RII’s internal unit). The MATLAB caller can pass meters/nm/µm; nk converts internally.
  • Pages with multiple DATA blocks (e.g., multiple thicknesses) produce multiple CSVs. Per-entry thickness is inferred where possible.
  • Some families (e.g., ITO) live under umbrella groups in RII. The MATLAB function augments these at runtime (e.g., synthetic materials.ito built from mixed_crystals).
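
The unit handling in the first note amounts to a scale factor into µm. A minimal sketch (the helper name to_micrometers is hypothetical; nk.m's internal code is not shown on this page):

```python
# Scale factors from the caller's unit into micrometers.
_TO_UM = {"m": 1e6, "nm": 1e-3, "um": 1.0}


def to_micrometers(value: float, units: str) -> float:
    """Convert a wavelength given in 'm', 'nm' or 'um' to micrometers."""
    try:
        return value * _TO_UM[units]
    except KeyError:
        raise ValueError(f"unknown units: {units!r}") from None


print(to_micrometers(532e-9, "m"))
```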

2) Manually adding a new dataset

You can add lab data or literature without re-running the scraper.

2.1 Prepare the CSV

Create a file in nk-data/data/ with this exact header:

wavelength_um,n,k
  • Wavelengths: in µm, strictly increasing.
  • If you only have n, leave the k column blank. The MATLAB function can treat missing k as zero with opts.allow_k_zero = true.
  • Suggested filenames: <material>.<page>.<tag>.csv, e.g.:
au.smith-2025.d1.csv
sio2.custom-lab-2025.csv
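
Before registering a hand-made file, it can help to check it against the rules above. This is a minimal sketch of such a check, not part of the shipped tooling; parse_nk_csv is a hypothetical helper that takes the CSV text and returns (wavelength, n, k) rows, with None for blank cells.

```python
import csv
import io
from typing import List, Optional, Tuple

Row = Tuple[float, Optional[float], Optional[float]]


def _num(cell: str) -> Optional[float]:
    """Parse a cell; blank cells (missing n or k) become None."""
    cell = cell.strip()
    return float(cell) if cell else None


def parse_nk_csv(text: str) -> List[Row]:
    """Validate the header and strictly increasing wavelengths."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header != ["wavelength_um", "n", "k"]:
        raise ValueError(f"bad header: {header!r}")
    rows: List[Row] = []
    for lineno, cells in enumerate(reader, start=2):
        if not cells:                 # skip empty lines
            continue
        if len(cells) != 3:
            raise ValueError(f"line {lineno}: expected 3 columns")
        w = float(cells[0])
        if rows and w <= rows[-1][0]:
            raise ValueError(f"line {lineno}: wavelengths must strictly increase")
        rows.append((w, _num(cells[1]), _num(cells[2])))
    return rows


sample = "wavelength_um,n,k\n0.4,1.5,\n0.5,1.49,0.01\n"
print(parse_nk_csv(sample))
```

To check a file on disk, read it with open(path, encoding="utf-8", newline="") and pass the text to parse_nk_csv.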

2.2 Add a source entry to catalog.json

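Open nk-data/catalog/catalog.json. Find the target material under materials. If it doesn’t exist, add a new object for it. Append a source record like the one below. The // comments are annotations for this page only; leave them out of the real file, because standard JSON does not allow comments.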

{
  "source_id": "au.smith-2025.d1",
  "kind": "thinfilm",                      // or "bulk"
  "path": "data/au.smith-2025.d1.csv",     // relative to nk-data/
  "valid_wavelength_um": [0.4, 1.1],       // µm
  "valid_thickness_nm": [100, 100],        // [tmin,tmax] in nm, or null for bulk
  "priority": 1,                           // not used by nk; keep 1
  "year": 2025,                            // tie-breaker (newer wins)
  "ref_string": "Smith et al., Opt. Lett. (2025) ...",
  "notes": "Ellipsometry, sputtered Au on glass, room temp.",
  "shelf": "lab",
  "book": "Au",
  "page": "smith-2025-d1",
  "has_k": true
}

Material block (simplified):

"au": {
  "aliases": ["Au","gold"],
  "default_source_id": null,
  "sources": [
    { ... },                               // existing sources
    {
      "source_id": "au.smith-2025.d1",
      "kind": "thinfilm",
      "path": "data/au.smith-2025.d1.csv",
      "valid_wavelength_um": [0.4, 1.1],
      "valid_thickness_nm": [100, 100],
      "priority": 1,
      "year": 2025,
      "ref_string": "Smith et al., Opt. Lett. (2025) ...",
      "notes": "Ellipsometry, sputtered Au on glass, room temp.",
      "shelf": "lab",
      "book": "Au",
      "page": "smith-2025-d1",
      "has_k": true
    }
  ]
}
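
If you would rather not edit the JSON by hand, the same append can be scripted. The sketch below assumes the catalog's top level has a "materials" object keyed by material name, as described above; add_source is a hypothetical helper, and the duplicate check reflects the rule that source_id must be unique.

```python
import json
from typing import Any, Dict


def add_source(catalog: Dict[str, Any], material: str,
               source: Dict[str, Any]) -> None:
    """Append a source record, creating the material block if needed."""
    materials = catalog.setdefault("materials", {})
    # source_id should be unique across the whole catalog.
    for block in materials.values():
        for existing in block.get("sources", []):
            if existing.get("source_id") == source["source_id"]:
                raise ValueError(f"duplicate source_id: {source['source_id']}")
    block = materials.setdefault(
        material, {"aliases": [], "default_source_id": None, "sources": []}
    )
    block.setdefault("sources", []).append(source)


# In-memory demo; for the real file, json.load() it, call add_source(),
# then json.dump() it back with indent=2.
catalog = {"materials": {"au": {"aliases": ["Au", "gold"],
                                "default_source_id": None, "sources": []}}}
add_source(catalog, "au", {
    "source_id": "au.smith-2025.d1",
    "kind": "thinfilm",
    "path": "data/au.smith-2025.d1.csv",
    "valid_wavelength_um": [0.4, 1.1],
    "valid_thickness_nm": [100, 100],
    "year": 2025,
    "has_k": True,
})
print(json.dumps(catalog, indent=2))
```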

Checksums (optional, for info.data_checksum):

$h = (Get-FileHash .\nk-data\data\au.smith-2025.d1.csv -Algorithm SHA256).Hash.ToLower()
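
On a machine without PowerShell, the same lowercase SHA-256 hex digest can be computed with Python's standard hashlib. This is a minimal sketch; the helper names are hypothetical, and the "sha256-" prefix follows the catalog entry format shown on this page.

```python
import hashlib


def checksum_entry(data: bytes) -> str:
    """Return the catalog-style checksum string for raw file bytes."""
    return "sha256-" + hashlib.sha256(data).hexdigest()


def file_checksum(path: str) -> str:
    """Read a file in binary mode and return its checksum string."""
    with open(path, "rb") as fh:
        return checksum_entry(fh.read())


print(checksum_entry(b"wavelength_um,n,k\n"))
```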

Add to the "checksums" object:

"data/au.smith-2025.d1.csv": "sha256-<paste the hash here>"

2.3 Field rules (quick)

  • source_id: unique; recommend <material>.<page>[.dN]
  • kind: "thinfilm" if thickness known, else "bulk"
  • valid_wavelength_um: matches your CSV’s min/max
  • valid_thickness_nm: [t,t] for single thickness; null for bulk/unknown
  • year: 4-digit year (tie-breaker)
  • has_k: true if the k column contains values (even if some rows are blank)
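
Several of these fields can be derived from the CSV itself rather than typed by hand. A minimal sketch, assuming rows are (wavelength_um, n, k) tuples with None for blank cells and that has_k means "at least one k value is present"; derive_fields is a hypothetical helper, not part of nk or rii2catalog.py.

```python
from typing import Any, Dict, List, Optional, Tuple

Row = Tuple[float, Optional[float], Optional[float]]


def derive_fields(rows: List[Row],
                  thickness_nm: Optional[float] = None) -> Dict[str, Any]:
    """Build the catalog fields that follow directly from the data."""
    if not rows:
        raise ValueError("CSV has no data rows")
    wavelengths = [r[0] for r in rows]
    known = thickness_nm is not None
    return {
        "kind": "thinfilm" if known else "bulk",
        "valid_wavelength_um": [min(wavelengths), max(wavelengths)],
        "valid_thickness_nm": [thickness_nm, thickness_nm] if known else None,
        "has_k": any(r[2] is not None for r in rows),
    }


print(derive_fields([(0.4, 1.5, None), (1.1, 1.4, 0.02)], thickness_nm=100))
```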

3) Sanity checks in MATLAB

Point nk at your database if it isn’t next to nk.m:

setenv('NK_DATA_ROOT', 'C:\path\to\project\nk-data')

Examples:

% Thin-film preference (e.g., Ag at 532 nm, 100 nm film)
[n,k,info] = nk(532e-9, 'ag', struct('units','m','thickness_nm',100));
disp(info.source_id); disp(info.selection_reason);

% Bulk fallback (thickness > 500 nm)
[n2,k2,info2] = nk(780e-9, 'au', struct('units','m','thickness_nm',600));

% Legacy names and inline paths
[n3,k3,info3] = nk(532e-9, 'fused silica', struct('units','m'));  % → SiO2
[n4,k4,info4] = nk(532e-9, 'BK7', struct('units','m'));          % inline Sellmeier
[n5,k5,info5] = nk(532e-9, 'ITO', struct('units','m'));          % ITO aggregator

Expected behavior:

  • For thickness ≤ 500 nm, thin-film datasets are preferred (closest thickness).
  • Tie-breaks, in order: has k → newer year → wider λ-span → source_id.
  • Legacy names (e.g., water, fused silica, diamond, air, ITO) resolve.
  • BK7 uses the inline Sellmeier to preserve legacy results.
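
The selection rules above can be sketched in Python as follows. This is only an illustration of the described behavior (thin-film preference for thickness ≤ 500 nm, closest thickness, then the tie-breaks has k, newer year, wider λ-span, source_id), not the code in nk.m. What happens when no thickness is given, or when no thin-film source covers the wavelength, is an assumption here (fall back to bulk, then to anything that covers).

```python
from typing import Any, Dict, List, Optional

THIN_FILM_LIMIT_NM = 500.0


def _thickness_gap(src: Dict[str, Any], thickness_nm: float) -> float:
    """Distance from the requested thickness to the source's valid range."""
    tmin, tmax = src["valid_thickness_nm"]
    if tmin <= thickness_nm <= tmax:
        return 0.0
    return min(abs(thickness_nm - tmin), abs(thickness_nm - tmax))


def select_source(sources: List[Dict[str, Any]], wavelength_um: float,
                  thickness_nm: Optional[float] = None) -> Dict[str, Any]:
    """Pick one source covering the wavelength, per the rules above."""
    covering = [s for s in sources
                if s["valid_wavelength_um"][0] <= wavelength_um
                <= s["valid_wavelength_um"][1]]
    if not covering:
        raise LookupError(f"No source covers {wavelength_um} um")
    films = [s for s in covering
             if s["kind"] == "thinfilm" and s.get("valid_thickness_nm")]
    bulk = [s for s in covering if s["kind"] != "thinfilm"]
    use_films = (thickness_nm is not None
                 and thickness_nm <= THIN_FILM_LIMIT_NM and bool(films))
    pool = films if use_films else (bulk or covering)  # fallback is assumed

    def key(s: Dict[str, Any]):
        lo, hi = s["valid_wavelength_um"]
        gap = _thickness_gap(s, thickness_nm) if use_films else 0.0
        # Smaller tuple wins: closest thickness, has k, newer year,
        # wider span, then source_id as the final deterministic tie-break.
        return (gap, not s.get("has_k", False), -(s.get("year") or 0),
                -(hi - lo), s["source_id"])

    return min(pool, key=key)
```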

4) Troubleshooting

  • “No source covers …” – The wavelength must be inside a source’s [λmin, λmax]. The error lists the top partial overlaps; choose another wavelength or add a source.
  • “CSV missing column ‘…’” – The CSV header must be exactly wavelength_um,n,k. Units are µm.
  • “Chosen source lacks k-values” – Either add k to the CSV or call with opts.allow_k_zero = true.
  • Legacy name not found – Add it to the alias table in nk.m (function local_alias_map()). For umbrella cases (e.g., ITO), the function augments the catalog at runtime.
  • Default units – If old code passed meters, set the default in local_parse_opts:
'units','m'   % instead of 'um'

5) Conventions & credits

  • Wavelength unit: all CSVs store wavelengths in µm. Callers can use opts.units ('m'/'nm'/'um').
  • Thickness unit: nm in the catalog.
  • Credit: Data derived from refractiveindex.info; cite the original publications listed in ref_string.

6) One-line rebuild (for future you)

$DBROOT = "C:\...\rii-db\database"
python .\rii2catalog.py --db-root "$DBROOT" --out-root .\nk-data --log-sampling `
  --shelves data/main data/organic data/glass data/other data/sopra data/3d