Member of Technical Staff - Multimodal VLM/LLM

Freiburg (Elbe)
Black Forest Labs
Posted online: 8 November
Description

What if the future of generative AI isn't just better images or better text, but models that understand both, and use that understanding to create in ways neither modality could alone? We're the ~50-person team behind Stable Diffusion, Stable Video Diffusion, and FLUX.1: models with 400M downloads. But here's the frontier we're exploring: vision-language models that don't just caption images or generate from prompts, but truly understand the relationship between visual and linguistic information. Models that can enhance prompts intelligently, moderate content contextually, and unlock generative capabilities we haven't imagined yet. That's the research you'll lead.

What You'll Pioneer

You'll run cutting-edge projects in multimodal vision-language and large language models, integrating them into our media generation pipeline in ways that push beyond what either modality could achieve alone. This isn't about implementing existing VLMs; it's about developing novel approaches that make FLUX more powerful, more controllable, and more aligned with what creators actually need.

You'll be the person who:

  • Leads the development and training of state-of-the-art multimodal vision-language models within the FLUX technology stack, not just applying existing architectures but innovating on them
  • Designs and implements specialized fine-tuning strategies for VLMs to address specific use cases and performance requirements that general-purpose models can't handle
  • Develops and optimizes LLM implementations for prompt enhancement, content moderation, and novel applications that improve how people interact with generative models
  • Drives innovation by integrating VLM/LLM capabilities into our media generation pipeline in creative ways that enhance generative capabilities
  • Conducts research to creatively combine vision and language models, exploring questions about how these modalities can inform and improve each other
  • Maintains cutting-edge knowledge of the latest developments in multimodal AI and LLM research, evaluating emerging models and architectures for potential integration
  • Collaborates with cross-functional teams to implement and deploy models at scale, contributing to architectural decisions and technical roadmap planning
  • Documents and shares research findings with the broader team, translating breakthroughs into practical improvements

Questions We're Wrestling With

  • How can vision-language models improve prompt understanding in ways that make generation more controllable and aligned with user intent?
  • What's the right architecture for integrating VLMs into diffusion model workflows without creating computational bottlenecks?
  • How do you fine-tune vision-language models for specialized creative tasks that weren't in the training data?
  • Where can LLMs enhance the generative pipeline (prompt rewriting, content moderation, parameter suggestion), and where would they add more friction than value?
  • What novel capabilities emerge when you deeply integrate vision and language understanding into generative workflows?
  • How do you evaluate whether multimodal models are actually improving generation quality versus just adding complexity?

These aren't solved problems; they're research directions we're actively exploring.

Who Thrives Here

You've trained and fine-tuned large-scale vision-language models and understand the nuances of multimodal learning. You have strong intuitions about what makes VLMs work well, backed by either publications or practical projects that pushed the field forward. You're comfortable operating at the intersection of research and production, where models need to be both innovative and deployable.

You likely have:

  • Demonstrated expertise in training and fine-tuning large-scale vision-language models, not just using pre-trained ones but developing them
  • A strong publication record or practical experience with relevant projects in multimodal AI research that shows you can push the frontier
  • Proficiency in PyTorch or similar deep learning frameworks, with a deep understanding of their capabilities and limitations
  • Experience with distributed training systems and large-scale model optimization, because VLMs don't fit on one GPU
  • A track record of implementing and scaling AI models in production environments where research meets real-world constraints

We'd be especially excited if you:

  • Have experience with diffusion models and generative AI architectures alongside autoregressive modeling, understanding how different paradigms can complement each other
  • Bring a background in computer vision that informs your approach to multimodal models
  • Contribute to open-source AI projects and understand the community
  • Have worked in fast-paced startup environments where iteration speed matters
  • Bring strong software engineering practices and system design skills
  • Have experience with open-source VLM inference frameworks like vLLM

What We're Building Toward

We're not just adding VLMs to our stack; we're exploring fundamental questions about how vision and language understanding can make generative models more powerful and more aligned with human intent. Every model you train teaches us something about multimodal learning. Every integration reveals new capabilities. Every research finding shapes where the field goes next. If that sounds more compelling than applying existing techniques, we should talk. We're based in Europe and value depth over noise, collaboration over hero culture, and honest technical conversations over hype. Our models have been downloaded hundreds of millions of times, but we're still a ~50-person team learning what's possible at the edge of generative AI.
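The pipeline role the posting describes, LLM-driven prompt rewriting and contextual moderation running ahead of generation, can be sketched in plain Python. Every name below (`GenerationRequest`, `prepare`, the stubbed `rewrite` callable) is a hypothetical illustration of where such a stage could sit, not part of FLUX or any real API:

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical sketch: the injected `rewrite` callable stands in for a
# served LLM; `moderate` is a toy keyword check standing in for an
# LLM-based contextual moderator.

@dataclass
class GenerationRequest:
    prompt: str
    enhanced_prompt: Optional[str] = None
    flagged: bool = False

def moderate(prompt: str, blocklist: set) -> bool:
    """Return True if the prompt trips the (toy) moderation check."""
    return any(term in prompt.lower() for term in blocklist)

def prepare(request: GenerationRequest,
            rewrite: Callable[[str], str],
            blocklist: set) -> GenerationRequest:
    """Run moderation, then prompt enhancement, before generation."""
    request.flagged = moderate(request.prompt, blocklist)
    if not request.flagged:
        request.enhanced_prompt = rewrite(request.prompt)
    return request

# Stub LLM call; a real pipeline would query a served VLM/LLM here.
stub_rewrite = lambda p: p + ", highly detailed, natural lighting"

req = prepare(GenerationRequest("a red fox in fresh snow"),
              stub_rewrite, {"forbidden"})
print(req.enhanced_prompt)
# a red fox in fresh snow, highly detailed, natural lighting
```

The design choice the sketch illustrates is injection: the generation pipeline depends only on a `Callable[[str], str]`, so the rewriting model can be swapped (or removed, answering the "friction versus value" question above) without touching pipeline code.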


© 2025 Jobijoba - All rights reserved