At a time when AI technology is booming, medical Agents, embedded as intelligent modules in hospital homepages, should provide patients with efficient and convenient inquiry and consultation services, yet many such products struggle with low usage. Having personally participated in the research, development, and rollout of a medical Agent product, the author reviews its journey from 0 to 1 to launch in nearly 1,000 hospitals across the country.
Recently, the Agent product I worked on went live in nearly 1,000 hospitals across the country~ (scattering flowers)
During research, piloting, and rollout we stepped into many pitfalls and accumulated a lot of experience. I am writing these down in the hope that they offer some reference to friends working on AI adoption.
Note: This Agent is a module embedded in each hospital’s homepage. Whenever patients have questions about the hospital, such as “Is there an emergency department?”, “Can I rent a wheelchair?”, or “Which department should I take my child to for stomach pain?”, they can consult it directly or register online.
1. As an AI product manager, finding real user needs matters more than understanding the technology
At first, we weren’t planning to build an AI Agent; it was the questions that kept appearing in our backend that caught our attention:
“What should I do about hyperthyroidism when I’m four months pregnant?”
“Which floor is blood drawn on?”
“Is there a painless gastroscopy?”
These questions seem simple, yet almost no one can answer them. It’s not that the doctors are unprofessional; the information is simply too scattered and too long-tailed. Many answers are hidden in the hospital’s HIS system, official WeChat accounts, or even paper pamphlets, and users are either unwilling to look or don’t know where to look.
So we wondered: if even hospital staff can’t answer these clearly, could AI become the “unified answerer”? We actually held plenty of trump cards: a medical knowledge graph, doctor Q&A data, hospital service information, and appointment slots…… A rough assessment suggested we could cover most scenarios.
But we didn’t just dive in; we first checked whether anyone in the market had already done this. Platform products (Kangkang, Anjier) lean toward health consultation and cannot answer service-information questions; vertical products (iFLYTEK Xiaoyi) focus on consultations, leaving hospital-operations questions almost blank.
The conclusion was clear: users have the demand and no one in the market serves it, so we decided to give it a try.
2. Use an MVP to verify the requirement; no need to go all in on model architecture from day one
We didn’t start building right away. Instead, following MVP thinking, we verified the core user value with minimum investment in the shortest time before formally setting up the project.
At that time, Myta AI Search already supported uploading knowledge bases, so I uploaded several hospital-related documents to Myta’s knowledge base and had it answer questions based on them, completing the preliminary MVP.
Guess how long it took to build the entire knowledge base?
10 minutes.
Building an MVP that quickly used to be unthinkable, but now, with the help of various AI products, even someone who can’t code can assemble a usable product in minutes.
After building it, I sent this hospital “jack-of-all-trades” AI to colleagues and users to try. Although it was a little rough, with no prompt optimization of the answers at all, they still felt it made obtaining hospital information much more efficient, because that information used to require digging through official accounts and Xiaohongshu, asking acquaintances, or even calling the hospital.
Beyond faster information access, some users even wanted to find a suitable doctor directly within the MVP for online consultation, medicine purchase, or registration. Conveniently, we also happened to have online consultation doctors, medicines, and appointment slots, which could perfectly absorb these demands.
After verifying the real user needs, we got to work.
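To make the MVP idea concrete, here is a minimal sketch of what such a knowledge-base Q&A does under the hood: match a patient question against a handful of hospital FAQ entries. The entries and the word-overlap scoring are purely illustrative; the real MVP simply used Myta’s knowledge-base upload, with no code at all.

```python
# Illustrative sketch of the MVP: retrieve the best-matching hospital FAQ
# entry for a patient question. All entries are made up for this example.

HOSPITAL_KB = {
    "Is there an emergency department?":
        "Yes, the emergency department is open 24 hours, Building 1, floor 1.",
    "Can I rent a wheelchair?":
        "Wheelchairs can be borrowed free of charge at the main lobby service desk.",
    "Which department for a child's stomach pain?":
        "Please register with pediatrics, Building 2, floor 3.",
}

def answer(question: str) -> str:
    """Return the KB answer whose question shares the most words with the query."""
    q_words = set(question.lower().split())
    best_entry, best_overlap = None, 0
    for kb_q, kb_a in HOSPITAL_KB.items():
        overlap = len(q_words & set(kb_q.lower().split()))
        if overlap > best_overlap:
            best_entry, best_overlap = kb_a, overlap
    return best_entry or "Sorry, please ask the service desk."

print(answer("Can I rent a wheelchair at the hospital?"))
```

A production system would of course use embedding-based retrieval rather than word overlap, but the shape of the MVP is the same: a small corpus plus a matcher.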
3. There are no hard boundaries between product, operations, and R&D
In fact, our team had never set up a dedicated AI project before; mostly we had sub-features that introduced large-model capabilities where appropriate. This time the product itself was built on a large model, so the product, operations, R&D, and testing teams were all feeling out their work boundaries and how to collaborate as the project went on.
Looking back now, beyond completing their own work, everyone more or less “intervened” in other functions’ work. For example:
- The MVP was built by product alone, with no R&D involvement at all;
- Operations joined in writing prompts and iterating the workflow together with product and R&D;
- R&D took part in user research, conducting interviews and distilling insights;
- Product wrote the knowledge-base structure directly, completing the knowledge-base design together with R&D.
Beyond all this, everyone went the extra mile to polish the product, even traveling to suburban hospitals to photograph the hospitals’ notice boards by hand in order to verify the accuracy of the model’s answers. It was hard work, but everyone felt it was worth it.
4. You don’t need the best model, but the right process + the right model
Many people think that to build an Agent you must use the “strongest” large model, such as GPT-4o: many parameters, strong reasoning, deep understanding. But in real engineering practice, what we need is not the strongest model, but the most suitable model in the most suitable position.
For example, in our Agent product, a single user question may trigger these models in turn:
【1】Intent recognition model → judges the patient’s intention.
This model’s responsibility is very clear: quickly, stably, and cheaply determine what the patient wants. Which department for this symptom? Where is a given floor? Is there parking at the hospital? Here we chose a model with fast response and low cost; it doesn’t need much “thinking power”, but it must route the intent to the corresponding workflow steadily, accurately, and quickly.
【2】Information retrieval model → finds hospital information, doctors, and appointment slots.
This part has extremely high accuracy requirements; getting a floor or a doctor’s information wrong directly misleads users. So we rely mainly on structured databases plus retrieval capabilities, rather than letting a generative model improvise.
【3】Content generation model → gives clear, understandable answers.
Once the retrieved information comes back in structured form, we need a model to “polish” it: make the content more colloquial, relatable, and in line with the tone of a medical setting. Here we used a mid-capability large model (not the one with the most parameters), with detailed prompt design to ensure replies were “accurate, concise, and friendly”.
【4】Safety review model → ensures the content carries no risk.
Healthcare is a highly sensitive scenario; the model cannot casually recommend, guess, or skirt the rules, so content filtering plus safety-rule review is also required. This part needs multiple fallback mechanisms, such as keyword filtering, whitelists, and grayscale release controls.
So our Agent product is not “stuff in one large model and let it take over”. We act as the commander, letting the right model do the right thing at the right node.
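As a rough illustration, the four nodes above can be chained like this. Every stage here is a simplified stand-in (keyword rules and hard-coded facts); in production, each stage would call a differently sized model or a structured database.

```python
# Sketch of the "right model at the right node" pipeline; all rules and
# facts below are illustrative stand-ins for real model/database calls.

def classify_intent(question: str) -> str:
    # Stage 1: a fast, cheap classifier routes the question to a workflow.
    if "parking" in question or "floor" in question:
        return "facility_info"
    if "pain" in question or "cough" in question:
        return "symptom_to_department"
    return "general"

def retrieve(intent: str) -> str:
    # Stage 2: structured lookup, not free generation, to keep facts accurate.
    facts = {
        "facility_info": "Parking lot: underground, Building 3; 5 RMB/hour.",
        "symptom_to_department": "Suggested department: internal medicine, Building 2 floor 4.",
        "general": "Service desk: main lobby, 8:00-17:30.",
    }
    return facts[intent]

def generate(fact: str) -> str:
    # Stage 3: a mid-size model polishes the retrieved fact into friendly wording.
    return f"Hello! {fact} Hope that helps."

def safety_check(reply: str) -> str:
    # Stage 4: rule-based review; block anything that reads like a treatment decision.
    banned = ["surgery", "diagnosis"]
    if any(word in reply.lower() for word in banned):
        return "For this question, please consult a doctor in person."
    return reply

def agent(question: str) -> str:
    return safety_check(generate(retrieve(classify_intent(question))))

print(agent("Is there parking at the hospital?"))
```

The key design choice is that each stage is swappable on its own: a cheaper intent model or a stricter safety rule can be replaced without touching the rest of the chain.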
5. The dataset and evaluation system are the lifeline of a production Agent
When we first launched, we didn’t actually invest much time in the “evaluation system”. We thought that as long as the large model answered a few typical questions correctly, the results couldn’t be far off.
But here’s the truth: we stepped right into a pit.
We ran into many answers that “looked right but were actually wrong”, and on closer inspection the cause was insufficient evaluation-set coverage. For example:
The user asked: “I’ve had chest tightness for 3 days and now I keep coughing, what should I do?”
The model answered: “It is recommended to call the hospital for consultation.”
Analysis: the intent was understood correctly, but the model conservatively refused to answer and replied coldly, showing a lack of SFT fine-tuning, humanistic warmth, and fallback mechanisms.
The user asked: “I have a sore throat; can I get the HPV vaccine while I’m there?”
The model answered: “It is recommended that you go to otolaryngology.”
Analysis: the user had multiple intents but only one was answered, showing the model lacks multi-intent recognition, or lacks a mechanism for judging primary vs. secondary intents.
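For the HPV example above, multi-intent handling could be sketched like this: detect every intent present in the question and answer each, rather than stopping at the first match. The keyword rules and replies are illustrative placeholders for a real multi-intent recognition model.

```python
# Illustrative multi-intent handling: collect a reply for every detected
# intent instead of only the first one. Rules are toy placeholders.

INTENT_RULES = {
    "sore throat": "For a sore throat, please visit otolaryngology (Building 2, floor 2).",
    "hpv vaccine": "The HPV vaccine is given at the vaccination clinic; "
                   "vaccination is usually postponed during acute illness.",
}

def answer_all_intents(question: str) -> list[str]:
    q = question.lower()
    replies = [reply for keyword, reply in INTENT_RULES.items() if keyword in q]
    return replies or ["Sorry, could you describe your question in more detail?"]

for line in answer_all_intents("I have a sore throat, can I get the HPV vaccine?"):
    print(line)
```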
Later, we realized we had to establish a systematic evaluation method:
【1】Build an evaluation set. Cover all core intent types (symptom-to-department, services, health education, etc.).
【2】Evaluate by split dimensions. Under each intent, the question styles should cover standard phrasing, vague phrasing, colloquial phrasing, single-turn and multi-turn questions, and even distinguish whether the patient is elderly, middle-aged, or a child.
【3】Finely label the expected output. Distinguish levels such as correct / incomplete / misaligned / nonsense / refusal.
【4】Attribute every error. Is it an intent-recognition problem? A retrieval failure? A poorly written prompt? Insufficient corpus coverage?
With this systematic evaluation in place, we could move the model from “able to answer” to “answering correctly and stably”.
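The four steps above can be sketched as a tiny evaluation harness: each case records intent type, questioning style, and an expected-answer check, so every failure can be counted and attributed. The cases and the mock agent are illustrative.

```python
# Tiny evaluation harness for the steps above. Cases and the mock agent
# are illustrative; a real eval set would have thousands of labeled cases.
from collections import Counter

EVAL_SET = [
    {"question": "Which floor is blood drawn on?",
     "intent": "facility_info", "style": "standard", "must_contain": "floor"},
    {"question": "uh my kid tummy hurts which dept",
     "intent": "symptom_to_department", "style": "colloquial", "must_contain": "pediatrics"},
]

def mock_agent(question: str) -> str:
    # Stand-in for the real Agent under test.
    if "blood" in question:
        return "Blood is drawn on floor 2 of Building 1."
    return "Please call the hospital for consultation."

def evaluate(agent) -> Counter:
    """Count results per (intent, style, label) so errors can be attributed."""
    results = Counter()
    for case in EVAL_SET:
        reply = agent(case["question"])
        label = "correct" if case["must_contain"] in reply.lower() else "misaligned"
        results[(case["intent"], case["style"], label)] += 1
    return results

print(evaluate(mock_agent))
```

Grouping results by intent and style is what makes attribution possible: if “colloquial” cases fail while “standard” cases pass, the problem is recognition, not retrieval.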
6. Medical scenarios require SFT, otherwise the risk is uncontrollable
Now that general-purpose large models perform so well, we too had illusions at first: “The model is already very strong; maybe it can answer well without tuning?” But once we actually put the Agent into a medical setting, we realized that healthcare is not about being “roughly right” but about being “absolutely right”.
What problems arise without SFT (Supervised Fine-Tuning)?
- The model recommends departments the hospital doesn’t even have, because it “makes them up” from internet knowledge;
- A patient said “I have a stomach ache in my first trimester”, and it said “register with gastroenterology”, ignoring the risk word “pregnancy”;
- The model occasionally outputs “I recommend you have ×× surgery”, which is a medical no-go zone; ordinary models have no sense of such boundaries at all.
Such problems are well hidden and may not surface at the demo stage, but once the product is live they can trigger medical-malpractice-level public backlash. So what did we do?
- Built our own medical Q&A dataset, with manual annotation and fine-tuning;
- Established a high-risk keyword database, combined with a content-safety model for multi-layer filtering;
- Limited all generation to low-risk tasks such as “answering service information and recommending consultation or registration”;
- Ran multiple rounds of grayscale testing to ensure the model would “rather not answer than answer wrong”.
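The high-risk keyword fallback in the list above might look like this in its simplest form: if a risk word appears in the question, the model’s draft reply is discarded in favor of a safe template. The keyword list and the fallback wording are illustrative.

```python
# Simplest form of the high-risk keyword guardrail: "rather not answer
# than answer wrong". Keyword list and wording are illustrative.

HIGH_RISK_WORDS = {"pregnant", "pregnancy", "chest pain", "bleeding"}
SAFE_FALLBACK = ("Your situation may need a doctor's judgment. "
                 "Please visit the hospital or call the service line directly.")

def guarded_reply(question: str, model_reply: str) -> str:
    """Discard the model's draft reply whenever a risk word appears."""
    q = question.lower()
    if any(risk in q for risk in HIGH_RISK_WORDS):
        return SAFE_FALLBACK
    return model_reply

# The first-trimester example from the list above: without the guard,
# the draft reply would have ignored the pregnancy risk word.
print(guarded_reply(
    "I have a stomach ache in the first trimester of pregnancy",
    "Please register with gastroenterology."))
```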
Looking back on this Agent project, my two biggest takeaways are:
First, even as an AI product manager, the most important thing is still finding real user needs.
Not competing on model parameters or complex frameworks, but down-to-earth observation of users, understanding their problems, and verifying requirements.
Second, I completely let go of my blind worship of “large models”.
AI is not a magic wand; it is just a powerful tool. What really makes a product land has always been:
- Real user research
- Systematic data evaluation
- Reviewing “why was this answer wrong”, again and again