Reverse Engineering Arte's API

From scraping Next.js internals to intercepting Android traffic. How I mapped Arte's private API to build a Netflix-style frontend for their catalogue.

I’ve always been curious about how streaming platforms work under the hood. Not just the video playback, the whole pipeline. How the catalogue is structured, how editorial decisions flow through an API and end up as what you see on screen.

I wanted a spare-time project around that, with a real API behind it. Real content, real edge cases, editorial depth.. Something more than a toy dataset. I looked for public APIs but nothing had enough complexity to stress-test real UI patterns, so I started poking around streaming platforms I actually use. Arte caught my attention. Rich catalogue, well-structured pages.. and more importantly, when looking at the network tab, some things looked very promising.

On the design side, I didn’t want to come up with my own UI. I wanted to study and reproduce one that already works. Netflix was the obvious pick. Autoplay previews, carousels, cinematic layouts, hover states.. a lot of subtle UI patterns that you only really understand by rebuilding them.

So the idea became: Arte’s data, Netflix’s design, BOOM: Arteflix was born. Everything pulled live from Arte’s own API, no hardcoded content. You replace arte.tv with arteflix.kinoo.dev in any URL and you get the same catalogue in a different skin.

For the implementation, I went with ReScript and React. Sound type system, exhaustive pattern matching, variants.. all the functional programming beauty compiled to clean readable JavaScript. Underrated language in the era of typed JS IMHO.

The first problem to tackle: Arte has no public API.

Phase 1: Scraping Arte data

Arte’s website runs on Next.js. Until they migrated to the App Router, the Pages Router embedded all server-side props in a <script id="__NEXT_DATA__"> tag right in the HTML. A JSON blob with everything the page needed.

Pretty easy to spot. Just open devtools, navigate around, and you quickly see how the data flows. Arte uses an API-first approach where each page is composed of zones, each zone being a section of the page (hero, carousel, editorial picks..). A zone has its own title, content items, and a template field that tells the frontend how to render it: horizontal-landscape, slider-square, vertical-portrait.. The whole page is just an array of zones.

This structure felt very familiar, similar to what you’d design for generating sites from a headless CMS, which makes sense given how editorial-heavy the content is. But I wasn’t the one who designed the schemas. So I had to reverse engineer them by navigating pages and reading the responses. Each template comes with its own specific data shape, which maps perfectly to ReScript’s variant types and pattern matching. I was pretty happy about it: just had to browse a few pages to map out the different templates, and we were good to go.

The first version of Arteflix was front-end only: fetch the Arte page, parse the HTML, grab __NEXT_DATA__, done.
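A minimal sketch of that extraction step (the sample HTML is fabricated, and real pages obviously carry much bigger blobs):

```python
# Sketch: pull the JSON blob out of the __NEXT_DATA__ script tag
# that Next.js Pages Router embeds in the server-rendered HTML.
import json
import re


def extract_next_data(html: str) -> dict:
    # The Pages Router puts all server-side props in this one script tag.
    m = re.search(
        r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>',
        html,
        re.DOTALL,
    )
    if m is None:
        raise ValueError("__NEXT_DATA__ not found (App Router page?)")
    return json.loads(m.group(1))


sample = (
    '<html><script id="__NEXT_DATA__" type="application/json">'
    '{"props":{"zones":[]}}</script></html>'
)
print(extract_next_data(sample)["props"])
```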

Phase 2: CORS hits

I grabbed a Netflix design system from Figma, built a first implementation, and once I was happy enough with the result, posted it on LinkedIn.

Shortly after, Arte’s pages started returning CORS headers that blocked cross-origin requests from the browser. Whether related to my post or coincidence, the browser-side scraping approach was dead.

A quick word on CORS

Back in the early web, browsers had no restrictions on cross-origin requests. Any page could fetch any URL. That became a problem pretty fast: a malicious website could hit your router admin panel, your banking session.. The Same-Origin Policy was introduced to stop that: browsers started blocking pages from reading responses from a different origin.

CORS (Cross-Origin Resource Sharing) came later as a way to selectively relax that policy. A server can send headers saying “this origin is allowed to read my responses.” If you’re not on the list, the browser won’t give you the data. The request still happens though.

The thing is, CORS is purely a browser thing. It’s a convention, nothing more. Anything outside the browser, a script, a backend, curl, can make the exact same request and get the full response. So as a scraping protection it’s pretty weak. It stops random JavaScript from calling your API, but it won’t stop someone from setting up a proxy.
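To make that concrete, here’s roughly the decision a browser makes after the response arrives. This is a deliberate simplification that ignores preflights, credentials, and multiple-value headers:

```python
# Sketch: the core of the browser-side CORS check. The request has
# already happened by this point; CORS only decides whether the
# page's JavaScript is allowed to READ the response.
def can_read_response(page_origin: str, response_headers: dict) -> bool:
    allowed = response_headers.get("Access-Control-Allow-Origin")
    return allowed in ("*", page_origin)


# A server-side proxy, curl, or any script simply never runs this
# check, which is exactly why CORS is weak as scraping protection.
print(can_read_response(
    "https://arteflix.kinoo.dev",
    {"Access-Control-Allow-Origin": "https://www.arte.tv"},
))
```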

Phase 3: server-side proxy

No browser access? Fine. That kind of thing just makes me want to keep going, no way I’m stopping days after going live. I set up a proxy in my Next.js API routes that fetches Arte pages, parses the data, and serves it to the frontend. Same approach, just one layer back. Also a good occasion to add some SSR.

Some work later, a VPS with Next.js running the proxy, and Arteflix was back.

I knew that sooner or later, Arte would switch to the App Router and the JSON blob would disappear. Even if I didn’t plan to maintain this project ad vitam aeternam, I wanted it to survive that migration. And sure enough, it happened. The __NEXT_DATA__ script tag disappeared entirely.

My scraping approach was dead for real this time. Sure, I could start parsing the full HTML, scrape the rendered DOM, try to reconstruct the JSON from there.. but who wants to handle that? Boring, fragile, and not future-proof at all. So, was there another way to fetch their data?

Phase 4: finding the actual API

I already had a few endpoints bookmarked from earlier but hadn’t pursued them since scraping worked fine. Time to dig in. Between what I’d already found and some research on related open-source projects, I pieced together the API surface. Two main layers:

EMAC v4 (content/catalogue)

The main content API. “EMAC” probably stands for something like “Editorial Management And Content”, Arte’s internal system for managing their entire catalogue. Every page on Arte, categories, program pages.. all served by EMAC. The response structure is the same one I’d been scraping from __NEXT_DATA__: pages composed of zones, each zone with its own content items and a template that tells the frontend how to render it. Jackpot. I already had ReScript schemas for all of that, just had to point my proxy to these endpoints instead of scraping HTML.

What’s interesting is how the website accesses it. Arte doesn’t call the API directly from the browser. They route everything through their own reverse proxy at www.arte.tv/api/rproxy/emac/v4/.... This rproxy sits between the frontend and the actual API at api.arte.tv/api/emac/v4/... and handles authentication server-side. The Bearer token never touches the browser. From the client side, you just hit the rproxy path and get data back, no API key needed.

The actual API at api.arte.tv requires a static Bearer token on every request. Not per-user auth, just a gate. But through rproxy, all of that is transparent.

rproxy/emac/v4/{lang}/web/pages/HOME           → homepage
rproxy/emac/v4/{lang}/web/pages/{CODE}         → category (CIN, SER, ACT...)
rproxy/emac/v4/{lang}/web/programs/{id}/       → program details
rproxy/emac/v4/{lang}/web/collections/RC-{id}/ → collection
rproxy/emac/v4/{lang}/web/                     → full nav structure

The /web/ segment is a “support” identifier, the website uses /web/, the mobile app uses /app/. Same API behind both, but the loading strategy differs (more on that later).
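A small sketch of building these rproxy URLs, using only the path shapes and category codes listed above:

```python
# Sketch: assemble EMAC v4 rproxy URLs from the route patterns above.
# No authentication needed on this path; rproxy handles it server-side.
BASE = "https://www.arte.tv/api/rproxy/emac/v4"


def emac_url(lang: str, support: str, *segments: str) -> str:
    # support is "web" for the website, "app" for the mobile app.
    return "/".join((BASE, lang, support) + segments)


home = emac_url("fr", "web", "pages", "HOME")
cinema = emac_url("fr", "web", "pages", "CIN")
print(home)
print(cinema)
```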

Player v2 (video streams)

The second layer handles video playback. Give it a program ID, it returns stream URLs and metadata. JSON:API format, everything you need to build a player.

player/v2/config/{lang}/{id}      → stream config, HLS manifests
player/v2/trailer/{lang}/{id}     → trailer
player/v2/playlist/{lang}/{id}    → playlist
player/v2/config/{lang}/LIVE      → live stream

Unlike EMAC, Player v2 calls go directly to api.arte.tv, not through a reverse proxy. The endpoints currently work without a Bearer token, but the mobile app sends one on every request. If Arte starts enforcing it, anything calling without the token breaks.
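Here’s a hedged sketch of pulling a stream URL out of a config response. The nesting below follows generic JSON:API conventions (`data` / `attributes`); it is illustrative only, not a reproduction of Arte’s actual schema:

```python
# Sketch: extract the first stream URL from a Player v2-style config.
# The sample payload is fabricated; only the JSON:API top-level
# structure ("data" wrapping "attributes") is standard.
import json


def first_stream_url(config_json: str) -> str:
    doc = json.loads(config_json)
    streams = doc["data"]["attributes"]["streams"]
    return streams[0]["url"]


sample = json.dumps({
    "data": {"attributes": {"streams": [
        {"url": "https://example-cdn.test/manifest/v1/Generate/TOKEN/fr/XQ/123.m3u8"}
    ]}}
})
print(first_stream_url(sample))
```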

The stream URLs in the config response point to a dedicated CDN, where .m3u8 manifests are generated on-the-fly:

manifest/v1/Generate/{token}/{audioTrack}/{quality}/{id}.m3u8

The token is session-specific, the audio track can be fr, VF (version française), or VOF (version originale), and quality is something like XQ+CHEV1. The CDN requires no authentication, the token in the path is enough. Segments are served as CMAF (Common Media Application Format) over HLS.

HLS works by splitting a video into small chunks listed in a .m3u8 playlist. A master manifest points to variant playlists per quality level, and the player switches between them based on bandwidth. CMAF is a segment format based on fragmented MP4, compatible with both HLS and DASH. It’s the standard for most streaming platforms, and players like Video.js handle it out of the box.
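The quality-switching mechanism can be sketched with a toy master-manifest parser. The manifest content is fabricated, and the naive attribute split would break on quoted commas (e.g. CODECS lists), so treat it as a sketch, not a spec-compliant parser:

```python
# Sketch: read an HLS master manifest. Each #EXT-X-STREAM-INF line
# advertises one quality variant; the following line is that variant's
# playlist URL. BANDWIDTH is what drives the player's switching.
MASTER = """#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
360p.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720
720p.m3u8
"""


def parse_master(manifest: str) -> list[tuple[int, str]]:
    variants = []
    lines = manifest.strip().splitlines()
    for i, line in enumerate(lines):
        if line.startswith("#EXT-X-STREAM-INF"):
            # Naive split on "," — real attribute lists may contain
            # quoted commas, which a proper parser must handle.
            attrs = dict(
                kv.split("=", 1)
                for kv in line.split(":", 1)[1].split(",")
            )
            variants.append((int(attrs["BANDWIDTH"]), lines[i + 1]))
    # Highest bandwidth first, like a player picking the best variant.
    return sorted(variants, reverse=True)


print(parse_master(MASTER))
```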

Image CDN

Arte serves all images through a dedicated CDN with on-the-fly resizing:

api-cdn.arte.tv/img/v2/image/{hash}

Every content item in EMAC responses includes an image URL with a __SIZE__ placeholder. The frontend replaces it with the desired dimensions and gets a resized version on the fly. No authentication needed, but the CDN rate-limits aggressively. Load too many images at once and you get 429s. I ended up building a client-side asset queue with concurrency limits and exponential backoff, but that’s for another article.
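Resolving the placeholder is a one-liner. Note that the `{width}x{height}` format and the placeholder’s position in the URL are my assumptions for the example, not necessarily Arte’s actual size syntax:

```python
# Sketch: turn an EMAC image template into a concrete CDN URL by
# filling the __SIZE__ placeholder. Size format is an assumption.
def image_url(template: str, width: int, height: int) -> str:
    return template.replace("__SIZE__", f"{width}x{height}")


# Hypothetical template, following the api-cdn.arte.tv path shape above.
tpl = "https://api-cdn.arte.tv/img/v2/image/abc123/__SIZE__"
print(image_url(tpl, 400, 225))
```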

The switch

With that, I had everything I needed to reproduce the same data I was scraping before, but calling the API directly. No authentication to worry about on my side, the only thing to bypass is CORS, which is just fetching server-side like I already did. Rewrote the data layer to hit these endpoints instead of parsing HTML. Same proxy architecture, same flow, just no more scraping in the middle.

Phase 5: intercepting the Android app

I was fairly confident I had found most of the API surface from the website, but I’d only looked for what I needed: content and video streams. Since mobile doesn’t use SSR, all API calls happen directly from the client, which makes them much easier to observe. And mobile apps sometimes use different endpoints, or at least different paths. The website uses /web/ in EMAC routes; does the Android app do the same? Only one way to find out.

The plan: intercept the Arte Android app’s traffic with mitmproxy.

The setup

I tried patching the APK with apk-mitm first, but the Arte app ships as a split APK and apktool corrupted some resources during re-encoding. Ended up just pointing my phone’s Wi-Fi proxy at mitmproxy and installing the CA cert manually. Arte doesn’t do certificate pinning, so everything went through clean.

Phone proxy set, mitmproxy running, open the Arte app and browse around. Every API call shows up in real time. Went through the homepage, every category, programs, even Shorts, which aren’t present on the website.
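If you want to cut the noise during a session like this, a minimal mitmproxy addon does the trick. This is a sketch along those lines, not the exact script I ran; you’d load it with `mitmproxy -s arte_log.py`:

```python
# Sketch: mitmproxy addon that logs only Arte traffic while browsing
# the app. The host filter is plain Python so it's reusable/testable.
def is_arte_host(host: str) -> bool:
    return host == "arte.tv" or host.endswith(".arte.tv")


class ArteLogger:
    def request(self, flow):
        # mitmproxy calls this hook for every intercepted request.
        if is_arte_host(flow.request.pretty_host):
            print(flow.request.method, flow.request.pretty_url)


# mitmproxy picks up addons from this module-level list.
addons = [ArteLogger()]
```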

What I found

The mobile app uses /app/ instead of /web/ in EMAC paths: same API, different support identifier. But the interesting part is everything else the app calls, things the website doesn’t expose as clearly or that I hadn’t looked at closely.

I discovered SSO v3, a whole auth layer I hadn’t mapped yet. Watch history, next episode tracking, personalized content.. all powered by Keycloak (OpenID Connect) under the hood. The interesting part: it supports an anonymous grant type, so even without logging in, the app gets a JWT valid for 5 years and can access all user-scoped endpoints.
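You can check a lifetime claim like that yourself by decoding the JWT’s payload; no signature verification is needed just to read the claims. The token below is fabricated for the example:

```python
# Sketch: decode a JWT's claims to inspect its expiry. A JWT is three
# base64url parts (header.payload.signature); the payload is just JSON.
import base64
import json


def jwt_claims(token: str) -> dict:
    payload = token.split(".")[1]
    # base64url strips "=" padding; restore it before decoding.
    padded = payload + "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


# Fabricated token: dummy header/signature, real-looking payload.
body = (
    base64.urlsafe_b64encode(json.dumps({"exp": 1900000000}).encode())
    .decode()
    .rstrip("=")
)
token = f"eyJhbGciOiJSUzI1NiJ9.{body}.dummysig"
print(jwt_claims(token)["exp"])
```

Comparing `exp` against the token’s issue time is how you’d confirm the 5-year validity.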

New interesting endpoints:

  • /emac/v4/{lang}/app/zones/{uuid}/content is how the app lazy-loads individual zones. This one is interesting. The /app/ page response returns all zones of the homepage, but every single one comes back empty. The app then fetches each zone’s content individually as the user scrolls, with pagination and filtering by collection or content type. Makes total sense for a mobile app on cellular data, you don’t want to download every zone’s worth of content upfront. For Arteflix I’m sticking with /web/ though. One fat request is exactly what you want for SSR, and since everything goes through my proxy on a tiny VPS, one round-trip beats dozens.

  • /emac/v4/{lang}/app/pages/SHORTS and /player/v2/playlist/{lang}/SHORTS for short-form content. App-only, not exposed on the website. There’s also a SHORTS_ONBOARDING and SHORTS_END page for the intro and end screens. Could be interesting to add to Arteflix later.

  • /emac/v4/{lang}/app/pages/SEARCH?query={text} is the search endpoint. It’s also present on the website through rproxy, I just hadn’t looked at it before. Turns out search is just another EMAC page. Same zone/content structure as everything else, just with a query param on top. The app fires a request on every keystroke, no debounce, no separate autocomplete endpoint. For Arteflix this is great because it fits right into the existing data pipeline, same schema, same proxy path.
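Since the app fires one request per keystroke, a trailing-edge debounce in front of that SEARCH endpoint is the obvious improvement for Arteflix. Here’s a language-agnostic sketch in Python (the real frontend would do this in ReScript/JS with a timeout):

```python
# Sketch: trailing-edge debounce. Only the LAST call made within a
# quiet window of wait_s seconds actually fires; earlier timers are
# cancelled. Maps one burst of keystrokes to one search request.
import threading


def debounce(wait_s: float):
    def wrap(fn):
        timer = None

        def debounced(*args, **kwargs):
            nonlocal timer
            if timer is not None:
                timer.cancel()  # drop the previously scheduled call
            timer = threading.Timer(wait_s, fn, args, kwargs)
            timer.start()

        return debounced

    return wrap


queries = []


@debounce(0.05)
def search(q: str) -> None:
    queries.append(q)  # stand-in for hitting pages/SEARCH?query=...


search("a")
search("ar")
search("arte")  # only this one survives the quiet window
```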

Between the website and the app, I now had the full picture. Content, video streams, user data, search, navigation, even short-form content the website doesn’t expose. More than enough to build a complete Netflix-style frontend on top of it.

What’s next

I got attached to this project. What started as a weekend experiment turned into something I genuinely want to keep building. Rich content APIs, search, performance, editorial logic, UI patterns that need to feel right.. streaming platforms touch all of that.

There are still features to add and UI to polish on the frontend side. But I started this project curious about how the whole pipeline works. Now I want to build one.

Looking at it, the backend breaks down into five parts. First, the most obvious one: a CDN for image processing and resizing. Then something like EMAC: the content API itself, where editorial decisions become structured data that the frontend just renders. On top of that, a Player service for stream resolution, manifest generation and multi-language audio tracks. Then SSO for authentication, user profiles, watch history and personalized content, with anonymous support. And finally, search. From the consumer side, search is “just another EMAC page.” Building the thing that makes it just another page is a different story. Behind that, there’s indexing, ranking, relevance, a whole search engine abstracted away behind the same API contract. A good excuse to finally try Elixir for real.