Computing resources

This page documents what is on the Vishwanath Grid cluster today, what is planned for the next two quarters, and how access works.

If you have not read About, the short version: Vishwanath Grid is a research-computing cluster that several Indian universities run together. The pilot site is live at MIT-WPU Pune; everything in the software stack is open-source.

Phase 0 — Pilot site at MIT-WPU Pune

The pilot is intentionally modest. The point is not headline FLOPS; the point is to run the full researcher workflow — login, submit, wait, retrieve, repeat — on production-grade software, end to end, so that the federation work in Phase 1 starts from a known-working base.

Hardware (pilot, May 2026)

Class | Count | Notes
Compute nodes (CPU) | small | Mixed-generation x86-64 hardware contributed by the MIT-WPU research-computing group
Memory per node | up to 128 GiB | Sufficient for typical PhD-scale workloads
Local scratch | NVMe, per node | Fast scratch for IO-heavy jobs
Shared storage | NFS pool | Home directories and small datasets
Network | gigabit Ethernet, on-campus | Inter-node MPI is feasible for modest-scale jobs

Specific node counts and exact CPU SKUs are kept off the public site, as at most institutional facilities: they change quarterly, published numbers go stale, and the question that matters to a researcher is "is there capacity for my workload" rather than "what is the maximum core count". Partners and prospective users can get the live numbers on request.

Software stack

Layer | Component | Why this choice
Operating system | Debian stable | Long support window, conservative defaults, no licence cost
Scheduler | HTCondor | Production scheduler used at CERN, NASA HEC, Fermilab, and a long list of university labs; familiar to anyone who has used a national HPC facility
Job submission | HTCondor condor_submit and a web GUI (Open OnDemand) | Power users get the terminal; everyone else gets a browser
Identity | Keycloak | OpenID Connect single sign-on; integrates with partner-university identity systems where they exist
Notebooks | JupyterHub | Spawns Jupyter sessions on cluster hardware on demand
Code & data | Forgejo | Self-hosted git for reproducibility artefacts
Monitoring | Grafana + Prometheus | Per-job, per-user, per-group accounting and dashboards
Configuration | Ansible via AWX | Reproducible site installs and routine maintenance

Every component is open-source, under either a permissive licence (Apache 2, MIT, BSD) or a copyleft one (GPL, AGPL). There is no commercial subscription in the path of a researcher's job.
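For terminal users, the Job submission row above comes down to writing a submit description file and passing it to condor_submit. The following is a minimal sketch of what such a file might contain, generated from Python for illustration. The helper name render_submit, the script name analyse.sh, and the resource figures are assumptions for the example, not site defaults.

```python
def render_submit(executable, arguments="", cpus=1, memory="2GB",
                  disk="1GB", name="job"):
    """Return the text of an HTCondor submit description file.

    The resource values are illustrative; request what your job needs.
    """
    lines = [
        f"executable     = {executable}",
        f"arguments      = {arguments}",
        # $(Cluster) and $(Process) are expanded by HTCondor at submit time,
        # so each job writes to its own output and error files.
        f"output         = {name}.$(Cluster).$(Process).out",
        f"error          = {name}.$(Cluster).$(Process).err",
        f"log            = {name}.$(Cluster).log",
        f"request_cpus   = {cpus}",
        f"request_memory = {memory}",
        f"request_disk   = {disk}",
        "queue 1",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job: a shell script that processes one CSV file.
text = render_submit("analyse.sh", "input.csv", cpus=4, memory="8GB")
print(text)
```

Saving the printed text as, say, analyse.sub and running condor_submit analyse.sub would queue one job; condor_q then shows its state.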

Queue policies and fair-share

The pilot uses a simple, transparent fair-share policy:

  • One queue, all jobs. No tiered priority, no "premium" queue.
  • Wall-time soft limit of 72 hours per job. Longer jobs need a short note ahead of time so we can plan the maintenance window.
  • Per-user resource cap, set low enough that one user cannot fill the cluster, even unintentionally.
  • Preemption is off at the pilot. The cluster is small; we prefer queueing to checkpoint-restart complexity.
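The rules above can be sketched as a small pre-submission check. This is an illustration only, not the scheduler's actual configuration: the 72-hour figure comes from the policy, while the per-user cap of 64 cores and the names JobRequest and review are hypothetical, since the real cap is not published here.

```python
from dataclasses import dataclass

WALLTIME_SOFT_LIMIT_HOURS = 72  # from the policy above
PER_USER_CORE_CAP = 64          # hypothetical value, for illustration only


@dataclass
class JobRequest:
    user: str
    cores: int
    walltime_hours: float


def review(job, cores_in_use_by_user):
    """Return a list of notes for the user; an empty list means no concerns."""
    notes = []
    if job.walltime_hours > WALLTIME_SOFT_LIMIT_HOURS:
        # Soft limit: the job is not rejected, but operators want notice.
        notes.append("over the 72 h soft limit: send a short note ahead of time")
    if cores_in_use_by_user + job.cores > PER_USER_CORE_CAP:
        # Preemption is off, so nothing is evicted; the new job simply waits.
        notes.append("would exceed the per-user cap: job waits in the queue")
    return notes


short_job = review(JobRequest("asha", cores=4, walltime_hours=10), 0)
long_job = review(JobRequest("asha", cores=4, walltime_hours=100), 0)
big_job = review(JobRequest("asha", cores=8, walltime_hours=10), 60)
print(short_job, long_job, big_job, sep="\n")
```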

As the federation grows, the policy will need formalising. The current rules will be replaced by a written policy maintained by the partner advisory committee; see Governance.

Software you can install yourself

Researchers do not need administrator access to install most tooling:

  • Python / R / Julia environments via conda, mamba, micromamba, uv, or pip --user — entirely in your home directory.
  • Containers via Apptainer (the successor to Singularity). Pull from Docker Hub, Quay, NVIDIA NGC, or build locally from a definition file.
  • Spack for HPC-style scientific software stacks where reproducibility matters more than convenience.
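As one concrete case of the first option, the sketch below shows where a pip --user install ends up and how the command might be built for the interpreter you are running. It uses only the Python standard library; the helper name pip_user_command and the package numpy are placeholders chosen for the example.

```python
import site
import sys


def pip_user_command(package):
    """Build the argv for a per-user install with the current interpreter."""
    return [sys.executable, "-m", "pip", "install", "--user", package]


# Per-user base directory (typically ~/.local on Linux) and the directory
# under it where importable packages are placed. Neither needs admin rights.
user_base = site.getuserbase()
user_site = site.getusersitepackages()

cmd = pip_user_command("numpy")  # placeholder package name
print("user base:", user_base)
print("user site:", user_site)
print("command:  ", " ".join(cmd))
```

Invoking pip through sys.executable rather than a bare pip keeps the install tied to the interpreter that will later import the package, which matters when several Python versions are on the path.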

Site-wide modules — the things that need administrator install — are deliberately kept minimal: compilers, MPI, CUDA toolkit, and the relevant runtime libraries.

Phase 1 — Federation (in progress)

In Phase 1, two additional Indian universities join as partner sites. Each partner runs the same software stack, configured by the same Ansible playbook, but on hardware they own and house. The federation work itself is mostly identity routing (so a job from campus A can be scheduled to campus B's nodes) and shared accounting (so each site sees who used what).
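The shared-accounting half of that work can be pictured as two views over the same job records: what each site's hardware delivered, and what each site's users consumed. The sketch below is illustrative only; the record layout, the site labels campus-a and campus-b, and the user names are assumptions, not the federation's actual schema.

```python
from collections import defaultdict


def summarise(records):
    """Aggregate core-hours per site.

    records: iterable of (home_site, run_site, user, core_hours), where
    home_site is the user's own campus and run_site is where the job ran.
    """
    provided = defaultdict(float)  # core-hours each site's hardware delivered
    consumed = defaultdict(float)  # core-hours each site's users consumed
    for home_site, run_site, _user, core_hours in records:
        provided[run_site] += core_hours
        consumed[home_site] += core_hours
    return dict(provided), dict(consumed)


# Hypothetical records: one job crosses from campus A to campus B.
records = [
    ("campus-a", "campus-a", "asha", 10.0),
    ("campus-a", "campus-b", "asha", 5.0),
    ("campus-b", "campus-b", "ravi", 2.5),
]
provided, consumed = summarise(records)
print("provided:", provided)
print("consumed:", consumed)
```

The difference between the two totals for a site shows whether it is a net contributor or a net consumer of federation capacity.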

Specific partner names will be added here as agreements close and hardware lands on campus. Until then, see Roadmap for the timing view.

Phase 2 — Wider research toolkit (planned)

Once the federation runs at three sites, the platform becomes the home for the everyday tools a research group already uses — listed in Services. The work in Phase 2 is connecting those tools to the single login and the cluster's accounting, not inventing them; every service named is an established open-source project in its own right.
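Connecting a tool to the single login usually starts from the identity provider's OpenID Connect discovery document, which lists the endpoints a client needs. The sketch below builds that URL for a Keycloak realm. The host sso.example.org and the realm name research are placeholders, not the cluster's real values, and older Keycloak releases put an extra /auth segment before /realms.

```python
from urllib.parse import quote


def oidc_discovery_url(base_url, realm):
    """Return the OpenID Connect discovery URL for a Keycloak realm.

    Assumes the path layout of current Keycloak releases; legacy
    installations serve the same document under <base>/auth/realms/...
    """
    base = base_url.rstrip("/")
    return f"{base}/realms/{quote(realm, safe='')}/.well-known/openid-configuration"


# Placeholder host and realm, for illustration only.
url = oidc_discovery_url("https://sso.example.org/", "research")
print(url)
```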

Reporting an issue, asking for help

The researcher-facing support channels are the group chat (available once your account is provisioned) and the wiki. For account requests and anything that does not have an obvious home, the general contact address reaches all operators.

Citing the cluster

If a computation that contributed to a paper ran on Vishwanath Grid, please include the suggested acknowledgement when your paper is submitted. The current text is available from [email protected].