Announcing the Kotoba API & SDK Alpha: Speech-to-Speech, Streaming STT, and TTS for Builders

May 28, 2026

Kotoba Technologies is launching the alpha release of the Kotoba API and SDK, opening up our state-of-the-art speech models to developers and enterprises for evaluation. The alpha gives builders direct access to the same foundation models trusted by large enterprises across the U.S. and Japan — now available to test, benchmark, and prototype against.

The alpha includes three model families designed to power the next generation of voice products:

Speech-to-speech translation: end-to-end, real-time translation that preserves conversational tempo
Streaming text-to-speech (TTS): natural, expressive synthesis built for agentic and conversational interfaces, with sub-50 ms latency on an H100
Streaming speech-to-text (STT): low-latency ASR built for voice agents and live note-taking
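
For teams planning to check streaming latency on their own workloads, here is a minimal sketch of a time-to-first-chunk measurement. It assumes only that a streaming client can be consumed as an iterator of byte chunks. The names time_to_first_chunk and fake_tts_stream are hypothetical, and the fake stream is a stand-in for illustration, not the Kotoba SDK.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_to_first_chunk(stream: Iterable[bytes]) -> Tuple[float, float, int]:
    """Consume a chunk stream and return
    (seconds until first chunk, total seconds, total bytes)."""
    start = time.perf_counter()
    first_latency = None
    total_bytes = 0
    for chunk in stream:
        if first_latency is None:
            # Time from starting to read until the first chunk arrives.
            first_latency = time.perf_counter() - start
        total_bytes += len(chunk)
    total = time.perf_counter() - start
    if first_latency is None:
        raise ValueError("stream produced no chunks")
    return first_latency, total, total_bytes


def fake_tts_stream(n_chunks: int = 5, delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS client (NOT a real API).
    Yields fixed-size silent chunks after a small artificial delay."""
    for _ in range(n_chunks):
        time.sleep(delay_s)
        yield b"\x00" * 320


if __name__ == "__main__":
    first, total, size = time_to_first_chunk(fake_tts_stream())
    print(f"first chunk: {first * 1000:.1f} ms, "
          f"total: {total * 1000:.1f} ms, bytes: {size}")
```

Because the generator body only runs once iteration begins, the timer covers the first artificial delay. With a real client, make sure the request is issued inside the timed region, and expect network round-trip time to be included in whatever you measure.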

Our core supported languages are Japanese, Chinese, Korean, and English, plus Spanish: the CJK + EN + ES coverage that matters most for cross-border business in Asia and the Americas.

Documentation is live at docs.kotoba.tech.

Why we built the API

For the last two years, Kotoba's models have powered our prosumer simultaneous interpretation app and a growing set of enterprise model-licensing deployments. The most consistent request we hear from teams shipping voice products and AI agents (call centers, in-car assistants, meeting tools, robotics, media localization) is the same: give us the model endpoint for evaluation and deployment, not just the app.

The alpha is the answer. Customers can now evaluate Kotoba models directly against their own workloads, with a path to production licensing and dedicated API capacity once evaluation is complete.

Speech-to-Speech Translation: Finest for East Asian Languages and Closing the Gap to Human Interpreters

We benchmarked our latest speech-to-speech model (Apr-26) against professional human interpreters, our own January-26 checkpoint, and a competitor's translation API (Company O) across three language pairs: English→Japanese, Japanese→English, and Korean→Japanese.

Figure 1. Speech-to-speech translation quality (LLM-as-a-judge, higher is better) and latency (lower is better) across En→Ja, Ja→En, and Ko→Ja.

Two things stand out. On quality, the Apr-26 Kotoba model leads Company O on Ja→En (0.40 vs 0.34) and Ko→Ja (0.35 vs 0.18), and is comparable on En→Ja (0.48 vs 0.47). On latency, Kotoba is about 1.6× faster on En→Ja (3.10s vs 4.90s) and roughly 2.6× faster on Ko→Ja (1.96s vs 5.06s). On those two pairs it is also faster than professional human interpreters while approaching their quality.

For CJK-centric workloads, Kotoba offers a better quality-latency frontier than Company O today, and the jump from our January to April checkpoint suggests that gap will keep widening.
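
The ratios and gaps quoted for the comparison with Company O can be reproduced from the published numbers. The sketch below uses only the figures stated in this section. The dictionary layout and the helper names speedup and quality_gap are illustrative choices, not part of any Kotoba tooling.

```python
# Scores and latencies copied from the text above (Apr-26 Kotoba vs Company O).
quality = {
    "En→Ja": {"kotoba": 0.48, "company_o": 0.47},
    "Ja→En": {"kotoba": 0.40, "company_o": 0.34},
    "Ko→Ja": {"kotoba": 0.35, "company_o": 0.18},
}
# Latency in seconds; the text quotes figures for these two pairs only.
latency_s = {
    "En→Ja": {"kotoba": 3.10, "company_o": 4.90},
    "Ko→Ja": {"kotoba": 1.96, "company_o": 5.06},
}


def speedup(pair: str) -> float:
    """How many times faster Kotoba is than Company O on a language pair."""
    return latency_s[pair]["company_o"] / latency_s[pair]["kotoba"]


def quality_gap(pair: str) -> float:
    """Kotoba's judge score minus Company O's, rounded to two decimals."""
    return round(quality[pair]["kotoba"] - quality[pair]["company_o"], 2)


if __name__ == "__main__":
    for pair in quality:
        line = f"{pair}: quality gap {quality_gap(pair):+.2f}"
        if pair in latency_s:
            line += f", speedup {speedup(pair):.2f}x"
        print(line)
```

Run as a script, this gives a speedup of about 1.58x on En→Ja and 2.58x on Ko→Ja, with quality gaps of +0.01, +0.06, and +0.17 on En→Ja, Ja→En, and Ko→Ja respectively.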

Japanese Agent TTS: Best-in-Class for Voice Agents

We evaluated Kotoba’s Japanese TTS against four competing systems on both naturalness and correctness.

Japanese Agent TTS Benchmark — higher is better

SYSTEM	MOS ↑	CORRECTNESS ↑
Kotoba	4.15	0.80
Company A	4.12	0.73
Company B	3.19	0.46
Company C	2.08	0.14
Company D	1.99	0.27

* MOS: Mean Opinion Score (1–5). Correctness: agent-task success rate (0–1). Competitor names anonymized.

Kotoba leads on naturalness (MOS 4.15) and outperforms every competitor on agent-task correctness (0.80). The correctness gap is the more telling number: Company A is close on raw audio quality (4.12 MOS) but trails by 7 percentage points (0.73 vs 0.80) on actually completing the agent task end-to-end. For builders shipping production voice agents in Japanese, this is the difference between a demo and a deployment.
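
To make the gaps in the benchmark table concrete, the sketch below recomputes each system's distance from Kotoba on both metrics. The values are copied from the table. The helper names correctness_gap_points and mos_gap are hypothetical and exist only for this illustration.

```python
# (MOS, correctness) per system, copied from the benchmark table above.
systems = {
    "Kotoba": (4.15, 0.80),
    "Company A": (4.12, 0.73),
    "Company B": (3.19, 0.46),
    "Company C": (2.08, 0.14),
    "Company D": (1.99, 0.27),
}


def correctness_gap_points(name: str, baseline: str = "Kotoba") -> int:
    """Baseline correctness minus the named system's, in percentage points."""
    return round((systems[baseline][1] - systems[name][1]) * 100)


def mos_gap(name: str, baseline: str = "Kotoba") -> float:
    """Baseline MOS minus the named system's MOS, rounded to two decimals."""
    return round(systems[baseline][0] - systems[name][0], 2)


if __name__ == "__main__":
    for name in systems:
        if name == "Kotoba":
            continue
        print(f"{name}: MOS gap {mos_gap(name):.2f}, "
              f"correctness gap {correctness_gap_points(name)} points")
```

The output shows Company A within 0.03 MOS of Kotoba but 7 points behind on correctness, while the remaining systems trail by 34 to 66 points on correctness.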

Who the alpha is for

The alpha is open to teams that want to evaluate Kotoba for production use cases including:

Voice agents and conversational AI — streaming STT + TTS optimized for full-duplex interaction
Real-time translation — meetings, broadcasts, call centers, live events

Kotoba models are already in use at large enterprises across the U.S. and Japan. The alpha API is the fastest way for new teams to evaluate the same technology against their own data, with model licensing and production API access available after evaluation.

Beyond the hosted API, we also support on-premise deployment for customers with data residency, security, or compliance requirements that rule out public cloud inference — common across finance, healthcare, government, and large enterprise IT in Japan and the U.S. Select models are additionally available in on-device form factors for mobile, automotive, robotics, and other edge use cases where latency, privacy, or offline operation matter. If on-prem or on-device deployment fits your roadmap, let us know during evaluation and we’ll scope it with you.

Get started

Docs: docs.kotoba.tech
Evaluation access: Contact us to request alpha credentials and discuss licensing
More from Kotoba: site.kotoba.tech

We’re excited to see what you build.
