Unlimited OCR: One-Shot Long-Horizon Parsing

(github.com)

201 points | by ingve 3 hours ago

16 comments

robotswantdata 2 hours ago
Very interesting.
The way I understand this works is that the researchers found a clever architectural hack to stop AI from hoarding memory when reading long documents.
Normally, when an AI transcribes a 100 page PDF, it tries to remember every single word it has already ingested. This short-term memory (the KV cache) grows linearly, O(N), until the model runs out of VRAM and crashes (or caps it). To avoid this, developers are forced to build janky code that chops PDFs into individual pages, processes them one by one, and glues the text back together.
Unlimited OCR uses Reference Sliding Window Attention (R-SWA) to split the AI's focus into two paths:
Global Reference: The AI keeps full, uncompromised sight of the original document image so it never loses context.
Local Generation: The AI restricts its memory of its own typed text to a tight, moving window (like the last 128 words) and safely forgets the rest.
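A minimal sketch of the two paths as described above, not the paper's implementation: each generated token sees every reference (image) token plus only the last `window` generated tokens. The function names, the boolean-mask layout, and the toy sizes are assumptions made for illustration.

```python
def rswa_mask(num_ref, num_gen, window):
    """Build a toy attention mask for the pattern described above.

    Columns 0..num_ref-1 are reference (image) tokens; the remaining
    columns are generated tokens. mask[i][j] is True when generated
    token i may attend to column j. (Hypothetical layout.)
    """
    mask = []
    for i in range(num_gen):
        row = [False] * (num_ref + num_gen)
        # Global reference: every image token stays visible.
        for j in range(num_ref):
            row[j] = True
        # Local generation: only the last `window` generated tokens,
        # including the current one (causal, so nothing after i).
        start = max(0, i - window + 1)
        for g in range(start, i + 1):
            row[num_ref + g] = True
        mask.append(row)
    return mask


def cached_positions(num_ref, num_gen, window):
    """Positions that must stay in the KV cache after num_gen steps."""
    return num_ref + min(num_gen, window)
```

With full causal attention the cache would hold `num_ref + num_gen` positions and keep growing with the output; in this sketch it is capped at `num_ref + window` no matter how long the transcription gets.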
Will be very interesting for local AI and can’t wait to see what the community builds and extends with it!
- _puk 41 minutes ago
  This hits a sweet spot I think for conversations too. I've been playing (for quite a while) with ways to encapsulate long-running conversations.
  You have the overriding context, facts that don't change very often at all. The participants' names, their backgrounds etc.
  Then you have some very fine grained facts (what they ate for breakfast this morning) which might be useful right now, but are irrelevant outside of a general trend over the longer term.
  When trying to reconstruct a conversation you really need to find the right balance without pulling in everything that has ever been discussed.
  This definitely is worth further investigation.
  - ewild 26 minutes ago
    This sounds like we are trying to add an LSTM into a transformer
- d675 1 hour ago
  See, leetcode is useful. As I do this leetcode grind, I’ve been learning why techniques exist and how they’re used IRL. Lots of interesting stuff there.
  - ai_fry_ur_brain 1 hour ago
    Who said it wasn't useful? Don't listen to those people.
    - Xevion 52 minutes ago
      People who are applying to jobs and are tested with LeetCode problems to assess their skill level, despite the two not really being correlated or relevant for the position
      - galbar 25 minutes ago
        As someone that gets very annoyed when having to do LeetCode in interviews...
        Knowing algorithms, data structures and their memory and time complexities is very relevant for SWE. I've had teammates that didn't understand them and everything was fine until it wasn't (scaling and performance issues).
        Or, as I put it to a teammate: "Would you rather review the PR of someone that understands the difference between a set and a list or the PR of someone who doesn't?". This was after we interviewed a candidate with ~15 YoE, on paper, that didn't know the difference.
peatmoss 1 hour ago
I recently bought a tablet for sheet music, mostly to replace a stack of jazz "Real Books" at jam sessions. And the phone camera scans I made are okay, but fixed in size and have a lot of artifacts. And it would be great to transpose on the fly for e.g. Bb or Eb instruments, but being a scan this is obviously not possible.
I got digging into the state of optical music recognition and came away concluding that music is basically a greenfield for AI wherever you look. Optical music recognition is pretty terrible. AI understanding of music theory is terrible (actually looking at music that is; LLMs do okay at text descriptions of theory concepts where you can imagine some online texts making it in).
I think the issue is that we still don't have great digital formats that encode the dots on paper that musicians read. Music notation is pretty rich. MIDI doesn't capture all of what's needed for symbolic understanding, because it was mostly made for capturing aspects relevant for playback or performance. MusicXML seems to be the closest for a digital format that encodes the information a musician would want, but there aren't great corpora of training data that would connect a MusicXML representation to sheet music images or to audio. I think that's because MusicXML falls short of encoding enough information to engrave music. Tools like MuseScore need to track a bunch of layout information that isn't encodable in MusicXML. LilyPond format is less verbose than MusicXML and contains a bit more information that is useful to the score creators, but most people don't create sheet music in LilyPond. (As an aside, LilyPond bums me out with the state of jazz fonts. I hate looking at "legit" scores in a jazz context.)
I realize this is mildly off topic, but every time I see people making incremental gains on OCR, which to my mind is pretty good, I am reminded of how abysmal OMR is.
- indiv0 22 minutes ago
  > music is basically a greenfield for AI wherever you look
  AIN'T THAT THE TRUTH.
  My girlfriend is studying musicology and she has some physical disabilities that make it difficult for her to write things down sometimes. So I try to help her by writing some AI-powered TTS/OCR/etc. apps here and there. It becomes painfully obvious that music was never considered an important part of any AI training dataset, anywhere.
  These days, I'm pleasantly surprised by how well Opus 4.8 understands/explains music theory (as you said). But ask him to transcribe/OCR/OMR some sheet music and he'll confidently give you the MusicXML/Lilypond equivalent of "2 + 2 = horse".
  I really hope this ignored area will be swept up with the rest of the rising AI wave, but it's still criminally undervalued.
- singpolyma3 1 hour ago
  What about sheet music typesetting formats like https://abcnotation.com/ ?
  - peatmoss 1 hour ago
    I forgot to mention ABC. I have seen a few LLMs look at that. There was a model / paper published a couple years back called ChatMusician that built around it.
    With the caveat that I'm not terribly fluent in ABC, it seems to me that simple things are simple, but hard things seem to be nearly pathological. And (again, maybe a lapse in my understanding) it seems like there may be a fair number of concepts that are impossible to convey in ABC?
    Lastly, if I understand correctly, ABC got its start and is mostly popular as a simplified format for church songbooks. I'd imagine that would, uh, influence the training corpora towards sounding a bit... church songbooky.
    EDIT: I may have been overly dismissive of ABC on first glance. It does seem like people have extended it quite a bit, and that it's at least, in theory, capable of encoding most of what I'd expect. And it's human readable, which is a benefit. Though, readability does take a stiff penalty the more richness you add (e.g. dynamics, articulations, stacked notes, etc)
- genxy 31 minutes ago
  Create a benchmark for this problem that researchers can easily run and the problem will solve itself.
- WhitneyLand 1 hour ago
  “there aren't great corpora of training data that would connect a MusicXML representation to sheet music images or to audio”
  It may not be necessary…a lot of the training pairs/data for this could probably be procedurally created via code.
  Would be pretty fun to work on and see it come to life.
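  A rough sketch of the "procedurally created" half of that idea, using ABC notation as the label format because it is short enough to generate by hand. Everything here is an assumption for illustration: the function name, the tiny note vocabulary, and the choice of ABC over MusicXML. The rendering step (turning each label into a score image with an engraving tool) is deliberately left out.

```python
import random

# One octave of C major in ABC notation (uppercase C is middle C,
# lowercase c is the octave above).
NOTES = ["C", "D", "E", "F", "G", "A", "B", "c"]


def random_abc_tune(index, bars=4, seed=None):
    """Return a small ABC tune: header fields plus `bars` bars of quarter notes."""
    rng = random.Random(seed)
    body = []
    for _ in range(bars):
        # Four quarter notes fill one bar of 4/4.
        body.append(" ".join(rng.choice(NOTES) for _ in range(4)))
    header = [
        f"X:{index}",            # reference number
        f"T:Synthetic {index}",  # title
        "M:4/4",                 # metre
        "L:1/4",                 # default note length: quarter note
        "K:C",                   # key, last header field
    ]
    return "\n".join(header) + "\n" + " | ".join(body) + " |]"
```

  Each generated string would be the ground-truth label; a training pair needs it rendered to an image as well, ideally with varied fonts, spacing, and scan-like noise so the model does not just learn one engraver's output.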
  - peatmoss 52 minutes ago
    I'd imagine that rendered audio that just used midi voices (even high quality "Real Instruments" midi voices) would be pretty brittle for e.g. stem separation or automatic transcription. In a best case, I think you'd start with a clean digital representation, render sheet music imagery, and then have lots of recordings by a bunch of real instrumentalists playing the same music.
    On the topic of stem separation, I've wondered about creating a quasi-synthetic dataset by taking chunks of recordings by real musicians playing them back in a real space in various combinations and recording the resulting analog-blended cacophony. Could repeat in various environments like cathedrals, basement bars, etc for realism :-)
- mcbetz 1 hour ago
  I watch that space regularly and the only really good solution is Soundslice. You scan, review some edge cases, and get really good results. It's a paid service by a small company, well worth supporting!
  - peatmoss 5 minutes ago
    I just signed up for a trial, and uploaded a messy Real Book scan. It did very well! It missed the coda markings, but then again the directive in the Real Book was nonstandard. I guess that's a case where a multimodal model might have been able to read the text ("after solos, D.C. al coda") and do something smarter.
novoreorx 2 minutes ago
FYI, "Unlimited OCR Works" is a Fate/stay night reference. The original "Unlimited Blade Works" is a magic whose entire premise is copying weapons other people forged
KitN 2 hours ago
"We would like to thank Deepseek-OCR, Deepseek-OCR-2, PaddleOCR for their valuable models and ideas."
Class Act.
- gcr 1 hour ago
  I don’t understand the shade being thrown?
  - nickspacek 1 hour ago
    It's the opposite of shade, unless GP is being sarcastic. "Class act" is normally a compliment, and in the context here it sounds to me like they're congratulating Baidu/the researchers for being transparent about where their ideas came from.
    - pbhjpbhj 50 minutes ago
      To be fair, I think I see "[real] class act" almost always used sarcastically.
janpeuker 41 minutes ago
Paper under https://arxiv.org/abs/2606.23050
(As a side note, I do OCR locally as a small RAG for citations I read in books and also chunk input, but merely to save RAM. Interesting that this natural approach also works in a streaming model.)
pmarreck 1 hour ago
my attempts at using AI to do OCR have always resulted in invented artifacts, which is not production feasible. does this suffer from that as well?
A simple example is words that are supposed to be in other languages being automatically translated to English, which ruins the effect
- drakmo 50 minutes ago
  If I wanted to achieve 100% recognition, I would combine this method with an image model that recreates the original document from the transcribed text, matching the layout. One can do that by using all but the page or paragraph you want to recreate (to avoid recreating the exact passage under test from the image artifact directly). After reconstructing, you can do an optical comparison that specifically matches misaligned characters and find the errors. Rinse and repeat. Expensive, but it would guarantee 100% recognition.
- pbhjpbhj 41 minutes ago
  You almost don't want [super-]word level ML (ie word-pair/phrase/sentence/document/corpus level).
  In transcription, you want near certainty, or you want marking that the word could not be read with certainty - yes, context lets you guess, but you want - for some OCR - to know when it's a guess based on other than the letters in order forming a word.
  Example, in a census document on familysearch.com the transcriber "corrected" a name as Joseph. The literal letters in the handwritten document spell Josepth ... and sure enough that's a local variant spelling (Eire).
  In another document the writer has used "Joh" as an abbreviation, a [human, I assume] transcriber put that as John ... which is most likely, but happens to be wrong.
  Sometimes you care that it's guessed, sometimes you want just the best guess.
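  One way to keep that distinction in the output, sketched under assumptions: the OCR stage is imagined to return, per word, the literal letters it read, a confidence score, and (separately) any context-based suggestion. `WordReading`, `transcribe`, the 0.9 threshold, and the bracket markup are all hypothetical, not part of any particular OCR tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WordReading:
    literal: str                         # letters as read from the page, unchanged
    confidence: float                    # hypothetical per-word score in [0, 1]
    context_guess: Optional[str] = None  # suggestion from context, kept separate


def transcribe(words: List[WordReading], threshold: float = 0.9) -> str:
    """Emit confident words as-is and mark uncertain ones instead of "correcting" them."""
    out = []
    for w in words:
        if w.confidence >= threshold:
            out.append(w.literal)
        elif w.context_guess and w.context_guess != w.literal:
            # Keep the literal reading and show the guess next to it.
            out.append(f"[{w.literal}|guess: {w.context_guess}]")
        else:
            out.append(f"[{w.literal}|unsure]")
    return " ".join(out)
```

  A caller who only wants the best guess can strip the markers and take the suggestion; a caller who cares about the literal text (the Josepth case) still has it.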
overflowy 1 hour ago
What are the requirements for running this locally?
piterrro 31 minutes ago
can someone explain how is this different than feeding the VLM model one page at a time?
manipalite 2 hours ago
Whatever happened to Reducto, was very promising 12-15 months ago
alansaber 1 hour ago
We've invented chunking? We are so back.
- ahknight 1 hour ago
  Streaming.
shevy-java 47 minutes ago
Is this an academic paper that is published in year xyz, but in +5 years nobody will remember it anymore?
ramon156 1 hour ago
I love that the entire goal is to push Deepseek OCR further. The west can learn greatly from these companies
Oras 2 hours ago
OCR has been solved long time ago with vision models. Solutions are consistent, reliable, and stable. What is the point of reinventing the wheel?
I would definitely understand post processing, like extracting data, answering question .. etc, but why re-doing the OCR engine itself?
- joss82 1 hour ago
  I've been working on Parseur for the last 10 years, and OCR has not been solved yet, let me tell you.
  OCR still sucks in 2026. Hopefully this might improve the situation but I haven't tested it yet.
- chpatrick 2 hours ago
  It absolutely hasn't been solved, it's just got pretty decent in recent years.
  - malfist 1 hour ago
    Pretty decent might be quite the stretch. I'd term it almost acceptable, but only if you're using commercial solutions like Amazon's Textract; doing it with open source tools is, at best, extremely painful and vaguely accurate.
- sscaryterry 2 hours ago
  Detecting characters almost, layout no.
  - wongarsu 2 hours ago
    Exactly my experience. If you try to OCR hand-filled forms with a fixed structure, traditional OCR models are great. Vision-llms can improve a bit on character recognition, but at the cost of harder to detect failure modes.
    But if you are trying to ingest diverse documents with headings, multi-column layouts, headers and footers, ad space in the middle of your text, etc, vision-llms are a giant step forward. But you need the context of the previous page to make good decisions about the current page, which is where things quickly get janky (or slow, if you choose the naive approach)
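    The page-by-page approach with carried-over context can be sketched roughly like this. `ocr_page` is a hypothetical callable standing in for whatever vision model is used (it takes a page image and some prior text and returns that page's transcription); the 500-character tail is an arbitrary assumption.

```python
from typing import Callable, Iterable, List


def parse_pages(
    pages: Iterable[bytes],
    ocr_page: Callable[[bytes, str], str],
    context_chars: int = 500,
) -> List[str]:
    """Run a page-level OCR callable over pages, carrying a short text tail forward."""
    results: List[str] = []
    context = ""
    for image in pages:
        text = ocr_page(image, context)
        results.append(text)
        # Keep only the tail so the prompt does not grow with document length.
        # (Guard against context_chars == 0, where text[-0:] would be the whole text.)
        context = text[-context_chars:] if context_chars > 0 else ""
    return results
```

    The janky part is everything this leaves out: a truncated tail can cut a table or heading in half, and passing the full previous page instead is the slow, naive variant.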
    Vision-llms also seem to deal much better with variance in scripts. Cursive, random Japanese in the middle of the text, weird math symbols, handwriting from three centuries ago, all "just works" without you even having to remember that this can happen
- vulture916 2 hours ago
  I haven't done much long-run OCR, so unsure of the current state, but it would seem they overcome this (from their paper):
  "A widely held view is that employing a large language model (LLM) as the decoder allows the model to leverage the prior distribution of language, leading to improved OCR performance. However, the downside is equally evident: as the output sequence lengthens, the accumulated KV cache drives up memory consumption and progressively slows down generation."
- ljouhet 1 hour ago
  Real question: what tool do you use? (for long/complex documents with tables, code, maths)
  - marker (with --force-ocr) gives me the best results
  - Mistral OCR (seems really great, but I never managed to get it work)
  - Mathpix (tried a long time ago)
  - docling (gives me garbage, I must use it wrong)
  - Unlimited OCR (will try it)
  - ???
  - Oras 1 hour ago
    - Azure Document Intelligence (has an option to return markdown too including headers and footers).
    - AWS Textract
    - badlibrarian 48 minutes ago
      Exactly. They're both very expensive and prone to surprising you. Sometimes in a good way, sometimes in a bad way. I'd rate them 85%, but you have to run a test because they both fail in different ways on the 15%.
  - ai_fry_ur_brain 1 hour ago
    poma-ai has really great chunking techniques that chunk the document based on the document structure/hierarchy.
    We use it on 200 page IEEE standards that are notoriously complex, filled with tables and diagrams. Highly recommend.
- cannonpalms 2 hours ago
  I guess, in theory, the prior distribution of language would allow for improved performance in some cases, especially where input quality is low.
  - ta988 2 hours ago
    This is already used in OCR; Tesseract does it.
- Aboutplants 2 hours ago
  lol nope it hasn’t been solved. I deal with this constantly and we still have a longggg ways to go
- ta988 2 hours ago
  Cost, throughput, latency...
  - Oras 2 hours ago
    Traditional OCR is faster, cheaper, and much more reliable than LLMs
    - j16sdiz 2 hours ago
      If you consider non-English scripts, traditional OCR is not more reliable.
      CJK has lots of characters and a high confusion rate.
      Arabic scripts are complex and have lots of morphs.
      Vietnamese has easily confused diacritics.
      Thai has lots of non-standard fonts.
    - ta988 2 hours ago
      I don't think that's a universal statement that applies to every kind of document and language. Mistral OCR is able to do things no "traditional" OCR was ever able to.
    - JodieBenitez 1 hour ago
      I wish it were. Alas...
- mschuster91 1 hour ago
  > I would definitely understand post processing, like extracting data, answering question .. etc, but why re-doing the OCR engine itself?
  Well... the idea seems to be (as far as I understand it, at least) that optical errors and artifacts can now be compensated as the OCR engine is now context-aware.
  Say, for example, some random long ass name chemical. It's not going to be in a word correction database, but a context-aware engine (ideally, one that has been supplemented with chemistry data) can now correct "bad" reads of the chemical's name.
  Of course, there remains the issue of how to prevent the infamous Xerox bug [1]...
  [1] https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...
- JohnKemeny 2 hours ago
  OCR has definitely not "been solved long time ago", what are you talking about?
  In your opinion, what is SOTA here?