I don't really see how the vacuum can effectively clean a whole room or flat using only a CNN of the current image in front of the robot. This would help detect obstacles, but a bumper sensor would do that as well.
All but the most basic vacuum robots map their work area and devise plans how to clean them systematically. The others just bump into obstacles, rotate a random amount and continue forward.
Don't get me wrong, I love this project and the idea to build it yourself. I just feel like that (huge) part is missing in the article?
Not saying that it’s viable here to build a world map since things like furniture can move but some systems, e.g. warehouse robots do use things like lights to triangulate on the assumption that the lights on the tall ceiling are fixed and consistent.
The classic Roombas from a decade or so ago worked without any sort of mapping or camera at all -- they basically did a version of the "run and tumble" algorithm used by many bacteria -- go in one direction until you can't any more then go off in a random new one. It may not be efficient but it does work for covering territory.
Cool project! That validation loss curve screams train set memorization without generalization ability.
Too little train data, and/or data of insufficient quality. Maybe let the robot run autonomously with an (expensive) VLM operating it to bootstrap a larger train dataset without needing to annotate it yourself.
Or maybe the problem itself is poorly specified, or intractable with your chosen network architecture. But if you see that a vision llm can pilot the bot, at least you know you have a fighting chance.
Check out using maybe some kind of monocular depth estimation models, like Apple's Depth Pro (https://github.com/apple/ml-depth-pro) and use the depth map to predict a path?
All but the most basic vacuum robots map their work area and devise plans how to clean them systematically. The others just bump into obstacles, rotate a random amount and continue forward.
Don't get me wrong, I love this project and the idea to build it yourself. I just feel like that (huge) part is missing in the article?
Not saying that it’s viable here to build a world map since things like furniture can move but some systems, e.g. warehouse robots do use things like lights to triangulate on the assumption that the lights on the tall ceiling are fixed and consistent.
It could easily understand so much about the environment with even a small multimodal model.
Too little train data, and/or data of insufficient quality. Maybe let the robot run autonomously with an (expensive) VLM operating it to bootstrap a larger train dataset without needing to annotate it yourself.
Or maybe the problem itself is poorly specified, or intractable with your chosen network architecture. But if you see that a vision llm can pilot the bot, at least you know you have a fighting chance.
Very cool project though!
(Lidar can of course also be echolocation).