Custom JS video player with ffmpeg HTTP streaming in PHP

The first real test of ShareBox with an actual file was an 8 GB MKV: HEVC video, DTS audio, PGS subtitles straight from a Japanese Blu-ray. The <video src="..."> I had put in place opened, spun for two seconds, and stopped in complete silence. No error event, no console message. Just a spinner going nowhere.

ShareBox started from a simple premise — share large files without a third-party cloud. Video streaming was supposed to be one feature among others. Except real-world files are not clean H.264/AAC MP4s: there are MKVs, HEVC encodes, Dolby audio tracks, bitmap subtitles. The native browser player gives up without a word. So I built a full player: on the server side, download.php orchestrates three ffmpeg streaming modes. On the client side, a JS state machine handles mode selection, stall recovery, image subtitle burn-in, and UX. This post documents what I built — and where it cost more time than expected.

Three server-side streaming modes

The entry point is download.php?stream=MODE. Three modes are available, selected dynamically by the JS based on the detected codec.

native — X-Accel-Redirect

For files already playable by the browser (MP4 H.264 + AAC), we delegate to nginx via X-Accel-Redirect. PHP never touches the bytes, no ffmpeg, no CPU cost. HTTP byte-range works normally, seeking is instant.

remux — zero-cost repackaging

An H.264 MKV cannot be streamed natively by the browser, but the codecs are compatible. The solution: repackage on the fly into a fragmented MP4 without re-encoding.

ffmpeg -i input.mkv \
  -c:v copy -c:a aac \
  -movflags frag_keyframe+empty_moov+default_base_moof \
  -min_frag_duration 300000 \
  -f mp4 pipe:1

-c:v copy: zero video re-encoding. -c:a aac: the audio track is re-encoded to AAC (handles DTS, AC3, anything the browser can't play). The result is a fragmented MP4 piped directly into the HTTP response. CPU cost: near zero for video, a few percent for audio conversion.

The flags frag_keyframe+empty_moov+default_base_moof are critical. Without empty_moov, the browser waits for the end of the file to read the moov atom (metadata) — which blocks indefinitely on a pipe. Without frag_keyframe, fragments don't start on keyframes and seeking breaks.

A pitfall that cost me an hour: I had added -fflags +genpts and first_pts=0 to normalize timestamps. The reasoning seemed sound — some MKVs have PTS values that don't start at zero, and I wanted to avoid surprises during seeking. In practice, these options modify PTS in a way that browsers misinterpret on a fragmented stream: video and audio gradually desync after each seek. Removed, the problem disappeared immediately.

Normalizing timestamps on a fragmented pipe means "fixing" something the browser already handles fine — and introducing a desync it has no way to correct.

transcode — for incompatible codecs

Unsupported HEVC, VP9, AV1, or Dolby audio that won't pass through: we re-encode.

ffmpeg -i input.mkv \
  -c:v libx264 -preset ultrafast -crf 23 \
  -c:a aac \
  -movflags frag_keyframe+empty_moov+default_base_moof \
  -min_frag_duration 300000 \
  -f mp4 pipe:1

-preset ultrafast sacrifices compression to minimize startup latency. -crf 23 gives acceptable quality without blowing up the bitrate. The video is playable within a few seconds, not after a full encode pass.

Concurrency semaphore and clean disconnect

Each ffmpeg process consumes CPU for the entire duration of the stream. Without a limit, ten concurrent downloads mean ten ffmpeg processes fighting over CPU cores. I added a PHP semaphore (via sem_acquire) capped at 4 concurrent processes.

$sem = sem_get(ftok(__FILE__, 'f'), 4);  // system-wide key, 4 slots
sem_acquire($sem);                        // blocks until a slot frees up

$proc   = proc_open($ffmpegCmd, [1 => ['pipe', 'w']], $pipes);
$stdout = $pipes[1];

register_shutdown_function(function () use ($sem, $proc) {
    if (is_resource($proc)) {
        proc_terminate($proc);   // kill ffmpeg if we exit mid-stream
    }
    sem_release($sem);           // free the slot in every exit path
});

// Streaming loop: 64 KiB chunks, flushed to the client immediately
while (!feof($stdout) && !connection_aborted()) {
    echo fread($stdout, 65536);
    flush();
}

The connection_aborted() check inside the loop is essential: when the client closes the tab, PHP detects the disconnect, exits the loop, the shutdown function kills the ffmpeg process, and releases the semaphore slot. Without this, orphan ffmpeg processes pile up until reboot.

Infrastructure adjustment: pm.max_children in PHP-FPM was bumped from 5 to 25. Each active stream monopolizes one FPM worker for its entire duration — unavoidable with synchronous streaming. With 5 workers, the 6th visitor waits in the dark.

SQLite ffprobe cache

Before choosing the streaming mode, the JS needs to know the file's codec. ffprobe answers that, but takes 2 to 12 seconds on a large file (disk access, header parsing). My first version did trial-and-error detection on the JS side: try native, wait 2 seconds, if nothing plays → try remux. In practice, 2 seconds is a long wait, and it generates false positives on slow connections. The right solution: know the codec before starting anything.

SQLite cache keyed by (path, mtime). The key includes the file's mtime to automatically invalidate the cache if the file is replaced. Cache hit: ~100ms. Cold: 2-12s (once per file, never again).

function probeVideo(string $path): array {
    $db  = new PDO('sqlite:' . PROBE_CACHE_DB);
    // mtime in the key invalidates the entry if the file is replaced
    $key = hash('xxh64', $path . '|' . filemtime($path));

    $stmt = $db->prepare("SELECT data FROM probe_cache WHERE key = ?");
    $stmt->execute([$key]);
    $row = $stmt->fetch();

    if ($row) {
        return json_decode($row['data'], true);   // cache hit: ~100 ms
    }

    // Cold path: run ffprobe once, then cache the JSON result
    $cmd    = 'ffprobe -v quiet -print_format json -show_streams ' . escapeshellarg($path);
    $result = json_decode(shell_exec($cmd), true);

    $db->prepare("INSERT OR REPLACE INTO probe_cache (key, data) VALUES (?, ?)")
       ->execute([$key, json_encode($result)]);

    return $result;
}

The JS state machine

The client-side player is built around a state object S that evolves as video events fire. No framework, no external state manager — just a mutable object watched by targeted event listeners.

const S = {
    step:        'native',   // mode currently being tried
    confirmed:   null,       // confirmed working mode (skip re-test on seek)
    stallCount:  0,          // watchdog retry counter
    seekPending: false,      // seek in progress (debounce)
};

Probe-driven mode selection

chooseModeFromProbe(streams) inspects the streams returned by ffprobe and picks the mode:

function chooseModeFromProbe(streams) {
    const video     = streams.find(s => s.codec_type === 'video');
    const audio     = streams.find(s => s.codec_type === 'audio');
    const vcodec    = video?.codec_name;
    const acodec    = audio?.codec_name;
    const container = currentFile.ext.toLowerCase();

    if (vcodec === 'h264' && container === 'mkv') return 'remux';
    if (vcodec === 'h264' && container === 'mp4' && acodec === 'aac') return 'native';
    if (vcodec === 'h264' && container === 'mp4') return 'remux';  // copy video, convert the audio track

    // VP9, AV1, HEVC: try native if the browser claims support
    const probe = document.createElement('video').canPlayType(mimeFor(vcodec));
    return probe !== '' ? 'native' : 'transcode';
}
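chooseModeFromProbe relies on a mimeFor() helper not shown in the post. A minimal sketch of what it might look like: the codec-to-MIME mappings below are illustrative defaults, not the actual ShareBox table.

```javascript
// Hypothetical mimeFor() helper used by chooseModeFromProbe above: maps an
// ffprobe codec_name to a MIME string that canPlayType() understands.
// The exact codecs parameter values here are illustrative guesses.
function mimeFor(codec) {
    const table = {
        h264: 'video/mp4; codecs="avc1.42E01E"',
        hevc: 'video/mp4; codecs="hvc1.1.6.L93.B0"',
        vp9:  'video/webm; codecs="vp9"',
        av1:  'video/mp4; codecs="av01.0.05M.08"',
    };
    return table[codec] ?? 'video/mp4';  // fall back to a generic MP4 type
}
```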

Silent HEVC hardware decode failure

Safari and Chrome on Windows sometimes claim HEVC support via canPlayType(). When hardware decoding silently fails on these browsers, it does so with extraordinary discretion: no error event, no stalled event, nothing in the console. Audio starts playing normally. The image freezes on the first frame — or worse, stays black. video.videoWidth stays at 0, and that's the only available signal.

setTimeout(() => {
    if (video.videoWidth === 0 && S.step === 'native') {
        console.warn('Silent HEVC failure → cascade to transcode');
        S.step = 'transcode';
        reloadStream();
    }
}, 1500);

A codec that "supports" a format but fails to decode silently is exactly as useful as one that doesn't support it — except it's much harder to detect.

Stall watchdog with exponential backoff

My first watchdog implementation used a fixed timeout per mode. The problem surfaced immediately on large HEVC files: ffmpeg takes several seconds to start a transcode (parsing, memory allocation, producing the first keyframe). The watchdog would lose patience and relaunch the stream. Two concurrent ffmpeg processes started up, competed for CPU cores, slowed each other down. The semaphore saturated on the third retry. The user saw a spinner forever with no explanation — while on the server side, several zombie ffmpeg processes queued for one of the four semaphore slots.

Exponential backoff per mode fixed that: the first retry waits 20s in transcode mode, the second 40s, capped at 120s. A legitimately slow-starting ffmpeg has time to produce its first fragments before the watchdog panics.

let stallTimer = null;
const BASE_TIMEOUTS = { native: 5, remux: 10, transcode: 20, burnSub: 30 };  // seconds

function startWatchdog() {
    clearTimeout(stallTimer);
    const base    = BASE_TIMEOUTS[S.confirmed ?? S.step] ?? 15;
    const timeout = Math.min(base * Math.pow(2, S.stallCount), 120) * 1000;

    stallTimer = setTimeout(() => {
        S.stallCount++;
        console.warn(`Stall #${S.stallCount}, retrying from ${video.currentTime}s`);
        reloadStream(video.currentTime);
    }, timeout);
}

video.addEventListener('timeupdate', () => startWatchdog());
video.addEventListener('waiting',    () => startWatchdog());

The watchdog resets on every timeupdate (active playback). If it fires, we restart the stream from the current position. stallCount doubles the delay on each retry, capped at 120s.
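For the transcode base of 20 s, the retry delays work out as follows — a one-liner sketch of the same formula, isolated from the player code:

```javascript
// Same backoff formula as startWatchdog() above:
// delay = min(base * 2^stallCount, 120) seconds.
const backoffSeconds = (base, stallCount) => Math.min(base * 2 ** stallCount, 120);

const transcodeDelays = [0, 1, 2, 3, 4].map(n => backoffSeconds(20, n));
// transcodeDelays is [20, 40, 80, 120, 120]: growth stops at the 120 s cap
```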

Subtitles: text and image tracks

Text subtitles (SRT/ASS) via WebVTT

Text subtitles are extracted by ffmpeg as WebVTT on demand and served as plain text. On the JS side, I don't use the native <track> element: positioning is too limited and ASS rendering is non-existent. Instead, an absolutely-positioned <div> overlay, placed with getBoundingClientRect plus an offset for the control bar.

For files with thousands of cues, finding the active cue on each timeupdate would be O(n). I use a binary search for the initial position, then a pointer that advances linearly — O(log n) on the first seek, O(1) afterwards.

function findCueIndex(cues, time) {
    let lo = 0, hi = cues.length - 1;
    while (lo < hi) {
        const mid = (lo + hi) >> 1;
        if (cues[mid].end < time) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

let cuePtr = 0;

video.addEventListener('timeupdate', () => {
    if (!cues.length) return;
    const t = video.currentTime;
    // Normal playback: advance the pointer linearly, O(1) per event
    while (cuePtr < cues.length - 1 && cues[cuePtr].end < t) cuePtr++;
    // Backward jump: fall back to the O(log n) binary search
    if (cues[cuePtr].start > t + 1) cuePtr = findCueIndex(cues, t);

    const cue = cues[cuePtr];
    subOverlay.textContent = (cue.start <= t && t <= cue.end) ? cue.text : '';
});

Font size is proportional to video width (2.5%, min 13px), recalculated by a ResizeObserver on the container element.
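That sizing rule fits in a pure function. subFontSize is an illustrative name; only the 2.5 % and 13 px figures come from the post.

```javascript
// Subtitle font size: 2.5 % of the rendered video width, floored at 13 px.
function subFontSize(videoWidth) {
    return Math.max(13, Math.round(videoWidth * 0.025));
}

// Browser wiring (sketch, names assumed; not runnable outside the DOM):
// new ResizeObserver(([entry]) => {
//     subOverlay.style.fontSize = subFontSize(entry.contentRect.width) + 'px';
// }).observe(playerContainer);
```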

Image subtitles (PGS/VOBSUB) — burn-in

PGS (Blu-ray) and VOBSUB (DVD) subtitles are bitmap images: CSS overlay is not an option. The only solution is to burn them directly into the video via ffmpeg, which triggers a full transcode with the subtitles filter.

This cost me half a day. On my test files — extracts I had prepared myself — the subtitles filter worked flawlessly. Then I tested on real Blu-ray files: subtitles shifted, cropped, sometimes absent. The problem: PGS subtitle canvas dimensions don't always match the video dimensions. A 1080p Blu-ray MKV can carry a subtitle canvas at 1920x1080, but also at 1920x816 or any value inherited from the original mastering. A plain subtitles filter stretches or crops without warning. The fix, found after a while in a 2019 ffmpeg-user mailing list thread, is scale2ref: fit the subtitle canvas to the video resolution before the overlay.

ffmpeg -i input.mkv \
  -filter_complex \
    "[0:s:0][0:v]scale2ref[sub][vid]; \
     [vid][sub]overlay" \
  -c:v libx264 -preset ultrafast -crf 23 \
  -c:a aac \
  -movflags frag_keyframe+empty_moov+default_base_moof \
  -f mp4 pipe:1

On the JS side, detecting an image subtitle track sets S.step = 'burnSub' with the track index (burnSub=N in the URL). The watchdog timeout rises to 30s for this combination — transcode startup with embedded subtitles is slower than a plain transcode.
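Putting the URL parameters quoted in the post together: stream= and burnSub= come from the post, while buildStreamUrl and the file parameter name are illustrative assumptions.

```javascript
// Builds download.php?stream=MODE, plus burnSub=N for bitmap subtitle burn-in.
// "stream" and "burnSub" are the parameter names quoted in the post;
// "file" and the function name are hypothetical.
function buildStreamUrl(file, mode, subTrack = null) {
    const params = new URLSearchParams({ file, stream: mode });
    if (mode === 'burnSub' && subTrack !== null) params.set('burnSub', subTrack);
    return 'download.php?' + params.toString();
}
```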

UX: controls, mode badge, fullscreen

A few UX decisions are worth explaining. The mode badge (green = REMUX, orange = TRANSCODE, grey = NATIVE) is clickable to cycle modes manually. It's primarily a debugging tool — when remux produces an audio artifact or transcode is inexplicably slow, one click lets you force the other mode without reloading the page. In practice, end users reach for it too when something isn't working.

Fullscreen is triggered on .player-card (the parent div), not on <video> directly. Calling element.requestFullscreen() on <video> causes the browser to render its own controls on top of the custom ones — on Firefox the result is particularly chaotic. On the parent div, custom controls stay visible and functional. Auto-hide kicks in after 3 seconds of mouse inactivity, with the cursor hidden. Single tap = play/pause, double tap = fullscreen, with a 250ms debounce to tell them apart. Keyboard shortcuts (Space/K, left/right arrow ±10s, F, M) map to the same actions.
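The single/double-tap disambiguation can be sketched like this. The 250 ms window comes from the post; onTap and the two toggle functions are illustrative names (shown here as stubs that just record the action).

```javascript
// A single tap toggles play/pause only if no second tap lands within 250 ms;
// a second tap cancels the pending toggle and goes fullscreen instead.
let tapTimer = null;
let lastAction = null;  // stub bookkeeping, for illustration only

function togglePlayPause()  { lastAction = 'playpause'; }
function toggleFullscreen() { lastAction = 'fullscreen'; }

function onTap() {
    if (tapTimer !== null) {            // second tap inside the window
        clearTimeout(tapTimer);
        tapTimer = null;
        toggleFullscreen();
    } else {
        tapTimer = setTimeout(() => {   // wait 250 ms before committing
            tapTimer = null;
            togglePlayPause();
        }, 250);
    }
}
```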

One iOS detail worth its weight in debugging time: Accept-Ranges: none in PHP response headers for remux and transcode modes. Safari on iOS sends Range requests on <video> elements even when the source is a pipe. Without this header, Safari attempts a byte-range on a non-seekable stream and gets a broken response — playback dies within the first second. Volume, mute state, and playback speed are persisted in localStorage across sessions. The requestAnimationFrame throttle on timeupdate prevents stacking progress bar updates at 60 fps during active playback.
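The throttle mentioned above can be written as a small wrapper. rafThrottle is an illustrative name, and the scheduler argument is injectable only so the sketch runs outside a browser; in the player the default requestAnimationFrame path is the one that matters.

```javascript
// Collapse bursts of events into at most one handler call per animation frame.
function rafThrottle(fn, schedule = cb => requestAnimationFrame(cb)) {
    let pending = false;
    return (...args) => {
        if (pending) return;             // a frame is already scheduled
        pending = true;
        schedule(() => {
            pending = false;
            fn(...args);                 // runs at most once per frame
        });
    };
}

// In the player (sketch):
// video.addEventListener('timeupdate', rafThrottle(updateProgressBar));
```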

Conclusion

A video player that "actually works" on a heterogeneous file library is much more surface area than expected. Most of the code isn't in the player itself — it's in handling the edge cases: silent codec failures, bitmap subtitles, orphan processes, iOS doing byte-range on a pipe.

If I were starting over: implement the ffprobe probe and SQLite cache first, before writing a single line of JS. Having codec truth available in 100ms completely changes the mode selection logic and simplifies everything downstream. The state machine, the watchdog, the subtitle burn-in — all of it becomes predictable once you know exactly what you're dealing with before you start.

Full source code is on GitHub (ohugonnot/sharebox).
