A generic AI chatbot can answer "what is a UCAS tariff point?" A chatbot trained on your school's data can answer "do I meet the entry requirements for your MSc Marketing if I have a 2:2 and three years of agency experience?" That distinction is the difference between a widget and a recruitment tool. This guide covers every step: what data to use, how to structure it, the difference between RAG and fine-tuning, UK GDPR boundaries, and how to improve the chatbot continuously once it is live.
For the full strategic picture, start with our AI Chatbot for Schools: The Complete Guide.
Why Training Matters: Generic vs Institution-Specific
A generic large language model (LLM) has broad knowledge but zero institutional context. It does not know your tuition fees, your application deadlines, whether your undergraduate programmes are UCAS-listed, or how your Clearing process works. Left untrained, it will either refuse to answer or — worse — hallucinate plausible-sounding but incorrect details about your institution.
Schools with an AI chatbot trained on their own data reduce first-contact drop-off from 91% to 76%, generating 167% more first contacts (Source: funnel analysis across 30 schools, 2025–2026 cohort). That lift does not come from deploying any chatbot — it comes from deploying one that actually knows your institution.
The training process — more precisely, the knowledge-base configuration process — is what creates that difference. It takes a working LLM and grounds its responses in your authoritative content: no invented fees, no hallucinated entry criteria, no fabricated Open Day dates.
What Data to Use: Your School's Knowledge Assets
The best source material for a school chatbot is the content your admissions team already produces and maintains. The question is not "where do we get this data?" but "which of what we already have should we prioritise?"
| Data Source | Content Type | Priority | Format |
|---|---|---|---|
| Programme pages | Degree titles, modules, duration, outcomes | High | Web pages / PDF |
| Entry requirements | UCAS tariff points, A-level grades, foundation routes, mature student criteria | High | Web pages / PDF |
| Tuition fees | Undergraduate, postgraduate, international, part-time | High | Web pages / PDF |
| Application process | How to apply via UCAS, direct application, clearing routes | High | Web pages |
| Open Day schedule | Dates, registration links, campus locations | High | Web pages / structured list |
| Existing FAQ | Questions your admissions team already answers repeatedly | High | Text document |
| Scholarship and bursary pages | Eligibility criteria, amounts, deadlines | Medium | Web pages / PDF |
| Accommodation information | Campus halls, private options, costs | Medium | Web pages |
| Student services | Wellbeing, disability support, careers | Medium | Web pages |
| Accreditation and rankings | TEF rating, professional body accreditations | Medium | Web pages |
| Alumni outcomes | Graduate employment rates, typical salaries, case studies | Low | Web pages |
| Staff profiles | Key contacts in admissions and student services | Low | Web pages |
Focus the first version of your knowledge base on the top four or five categories. Automated classification of 12,000 Skolbot conversations in 2025 found that 72% of prospect questions are simple FAQ questions answerable with that core content — fees, entry requirements, application process, and programme descriptions. Expanding to the full list matters, but it should not delay deployment.
What to exclude from the knowledge base: anything not publicly intended for prospects. Internal appeals procedures, staff HR documents, financial reporting, strategic planning documents, and individual student records have no place in a chatbot knowledge base — and including them creates both GDPR risk (covered below) and reputational exposure if the chatbot surfaces confidential content.
Data Preparation: Cleaning, Formatting, Structuring
Raw documents do not become a knowledge base automatically. Unstructured PDFs, outdated web content, and internally inconsistent fee schedules will produce a chatbot that gives confident but wrong answers. Data preparation is unglamorous work, but it is the single biggest determinant of chatbot quality.
Step 1: Audit What You Have
List every document and web page you plan to include. For each one, confirm:
- Is it current? Fees for the 2024–25 cycle are not the fees for 2026–27. Entry requirements change. Open Day dates are updated every year. Any document that has not been reviewed in the last 12 months needs verification before ingestion.
- Is it authoritative? Your admissions team's informal FAQs in a shared Drive folder may contain useful questions, but the answers need to match your official programme pages. Use the official version as the source of truth.
- Does it conflict with anything else? A common problem in multi-campus institutions is that two programme pages list slightly different entry criteria for what is nominally the same course. Resolve conflicts before ingestion, not after.
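The staleness check in the audit above is easy to automate. A minimal sketch, assuming you keep a last-reviewed date against each source document (the document list and field names here are hypothetical):

```python
from datetime import date

# Hypothetical audit records: each source document with its last review date.
documents = [
    {"title": "BSc Business Studies programme page", "last_reviewed": date(2026, 1, 10)},
    {"title": "2024-25 fee schedule PDF", "last_reviewed": date(2024, 9, 1)},
]

def needs_verification(doc: dict, today: date, max_age_days: int = 365) -> bool:
    """Flag any document not reviewed within the last 12 months."""
    return (today - doc["last_reviewed"]).days > max_age_days

today = date(2026, 2, 1)
stale = [d["title"] for d in documents if needs_verification(d, today)]
# 'stale' now lists every document that needs human verification before ingestion.
```

Run this against your full audit list before each ingestion pass; anything it flags goes back to the content owner, not into the knowledge base.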
Step 2: Convert to Clean Text
LLMs process text. PDFs can be ingested, but complex layouts — multi-column brochures, graphic-heavy prospectuses, scanned images of documents — produce garbled text when extracted. For each source document:
- Export PDFs to clean text or structured HTML where possible
- Remove headers, footers, page numbers, and boilerplate legalese that will confuse the model
- Break long documents into logical sections (each programme should be its own chunk, not buried in a 200-page prospectus)
- Standardise terminology — if some pages say "UCAS tariff points" and others say "UCAS points", pick one
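The cleaning steps above can be scripted. A minimal sketch (the boilerplate pattern and terminology map are illustrative; adjust both to match your own documents):

```python
import re

# Assumed boilerplate pattern and terminology map; adapt to your documents.
PAGE_NUMBER = re.compile(r"^\s*Page \d+ of \d+\s*$", re.MULTILINE)
TERMINOLOGY = {"UCAS points": "UCAS tariff points"}

def clean_extracted_text(raw: str) -> str:
    """Remove page-number boilerplate and standardise terminology."""
    text = PAGE_NUMBER.sub("", raw)
    for variant, canonical in TERMINOLOGY.items():
        text = text.replace(variant, canonical)
    # Collapse any blank-line runs left behind by removed boilerplate.
    return re.sub(r"\n{3,}", "\n\n", text).strip()

example = clean_extracted_text(
    "Entry requirements\nPage 4 of 36\nA minimum of 112 UCAS points\n"
)
```

The same pattern extends to headers, footers, and legal boilerplate: one compiled pattern per artefact, applied before chunking.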
Step 3: Structure for Retrieval
When using RAG (see the next section), the knowledge base is divided into chunks that the system retrieves in response to a question. How you chunk your data affects retrieval quality significantly.
- Chunk by logical unit, not by arbitrary word count. A programme description with entry requirements, modules, and career outcomes should stay together — splitting it mid-way through entry requirements produces incomplete answers.
- Add metadata to each chunk: programme name, level (undergraduate/postgraduate), subject area, campus location. Metadata allows the retrieval system to filter before it searches, improving accuracy.
- Keep chunks between 300 and 600 words. Too short, and the chatbot lacks enough context to answer well. Too long, and retrieval becomes noisy.
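Putting the three rules together, a chunk is a logical unit of text plus filterable metadata plus a size check. A sketch under those assumptions (the field names are illustrative, not any platform's schema):

```python
TARGET_MIN, TARGET_MAX = 300, 600  # target chunk size in words, per the guideline above

def build_chunk(text: str, metadata: dict) -> dict:
    """Wrap one logical unit (e.g. a full programme description) with
    metadata the retrieval layer can filter on before searching."""
    word_count = len(text.split())
    return {
        "text": text,
        "word_count": word_count,
        "in_target_range": TARGET_MIN <= word_count <= TARGET_MAX,
        "metadata": metadata,
    }

chunk = build_chunk(
    "BSc Business Studies: entry requirements, modules, outcomes ...",
    {"programme": "BSc Business Studies", "level": "undergraduate",
     "subject": "business", "campus": "London"},
)
```

Chunks flagged outside the target range should be merged with neighbours or split at a section boundary, never mid-section.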
Step 4: Version Control
Your knowledge base will need to be updated regularly. Build a simple version-control habit from day one: record what was changed, when, and why. When the chatbot gives a wrong answer in six months' time, you need to trace whether the source data was wrong or the retrieval logic failed.
Training Methodology: RAG vs Fine-Tuning
This is the first question most admissions teams ask when they engage with AI chatbot projects — and for school chatbots, the answer is almost always the same.
RAG (Retrieval-Augmented Generation): How It Works
RAG does not change the underlying language model. Instead, it gives the model access to a searchable knowledge base at the moment a question is asked. The process is:
- A prospect types a question
- The system searches your knowledge base for the most relevant chunks
- Those chunks are passed to the LLM as context
- The LLM generates a response grounded in that context
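The retrieval step above can be illustrated with simple word overlap. Production systems use embedding similarity rather than this toy scoring, but the pipeline shape is identical (search, select, pass as context); all names here are illustrative:

```python
def retrieve(question: str, chunks: list[dict], top_k: int = 2) -> list[dict]:
    """Toy retrieval: rank chunks by word overlap with the question."""
    q_words = set(question.lower().split())
    return sorted(
        chunks,
        key=lambda c: len(q_words & set(c["text"].lower().split())),
        reverse=True,
    )[:top_k]

chunks = [
    {"text": "BSc Business Studies tuition fee is 14500 pounds per year"},
    {"text": "Open Day registration opens in March on campus"},
]
context = retrieve("what is the tuition fee for business studies", chunks)
# The selected chunks are then passed to the LLM as grounding context.
```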
The key property of RAG is that the model only answers based on what is in your knowledge base. If a prospect asks about a programme you do not offer, the chatbot says so — it does not invent one. If the fee for your BSc Business Studies is £14,500, the chatbot quotes £14,500 — not a plausible-sounding figure it generated from general knowledge.
RAG is the right approach for school chatbots because:
- It is updatable without retraining. When your fees change, you update a document in the knowledge base. The chatbot reflects the new figure immediately, with no model retraining required.
- It is auditable. Every response can be traced to a source chunk. This is essential for QAA and Office for Students compliance contexts — you can verify that the chatbot's answer came from your official published content.
- It reduces hallucination. Grounded responses are significantly less likely to fabricate information than free-generation responses from an uncontrolled LLM.
Fine-Tuning: When It Is and Is Not Appropriate
Fine-tuning means retraining the underlying model on your data — teaching it to write like you, understand your terminology, and adopt your institutional voice. It is expensive (compute costs, specialist ML engineering), slow (weeks of iteration), and requires retraining every time your content changes materially.
For a school chatbot answering admissions questions, fine-tuning adds cost and complexity without a meaningful accuracy advantage over a well-constructed RAG system. Fine-tuning is appropriate when:
- You need the model to adopt a very specific linguistic register or brand voice at a deep level
- You are building a system that generates long-form content (programme descriptions, email drafts) rather than answering questions
- Your volume of interactions justifies the engineering investment
For the vast majority of UK private higher education institutions, RAG with a well-maintained knowledge base is the right choice. Gartner's analysis of enterprise AI adoption consistently shows that data quality and retrieval architecture outperform model sophistication as drivers of accuracy in domain-specific chatbot deployments.
UK GDPR Compliance: What Data You Can and Cannot Use
Training a chatbot on school data intersects directly with data protection law. The Information Commissioner's Office (ICO) has published guidance on AI and data protection that applies directly to how institutions build and deploy AI systems.
What You Can Use
Publicly available content you authored. Your website pages, your published prospectus, your official FAQ — all of this is data your institution created and published for the purpose of informing prospective students. Using it to train an AI that informs prospective students is consistent with the original purpose. No new legal basis is required.
Anonymised and aggregated internal data. If you extract patterns from past enquiries — "the 30 most common questions asked by applicants in 2025" — and use those questions to improve your knowledge base, you are not processing personal data. Aggregate patterns are not personal data under UK GDPR.
Consented operational data. If a prospect interacts with your chatbot and gives explicit consent for their conversation to be used to improve the service, that conversation log can be used for knowledge-base gap analysis. Consent must be freely given, specific, informed, and unambiguous — generic privacy policy acceptance does not constitute valid consent for this purpose.
What You Cannot Use
Individual student records. Academic transcripts, personal statements, reference letters, disability disclosures, tuition fee payment records — these are personal data processed under specific legal bases (typically contractual necessity or legal obligation). Using them to train an AI system would constitute secondary processing that is incompatible with the original purpose, violating the purpose limitation principle under UK GDPR Article 5(1)(b).
Unmoderated live conversations containing third-party personal data. Live chat transcripts often contain personal information that no one consented to having used for AI training — a prospect's A-level results, their disability status, a parent's email address. Before using any conversation data, it must be reviewed for personal data, stripped of identifiers, and confirmed against your data retention schedule.
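Automated redaction can handle the obvious direct identifiers before human review. A sketch with two illustrative patterns; this supplements, and never replaces, manual review against your retention schedule:

```python
import re

# Illustrative patterns only: real transcripts need human review too.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
UK_PHONE = re.compile(r"\b(?:\+44\s?|0)\d{4}\s?\d{6}\b")

def redact(transcript: str) -> str:
    """Strip obvious direct identifiers before gap-analysis use."""
    text = EMAIL.sub("[email]", transcript)
    return UK_PHONE.sub("[phone]", text)

clean = redact("Reach my mum at jane.doe@example.com or on 07700 900123")
```

Note that regexes cannot catch indirect identifiers (grade combinations, disability disclosures, named schools); those are exactly why the human review step is non-negotiable.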
Staff personal data. Individual staff profiles, employment terms, internal communications — none of this has a place in a prospect-facing knowledge base or in training data.
Data from third parties without a data-sharing agreement. UCAS data, student finance records from Student Loans Company, school-reported reference information — your institution is a data processor or controller for this data under specific terms. Using it for AI training almost certainly falls outside those terms.
The ICO's guidance on AI and data protection is the definitive reference. Your Data Protection Officer should review the knowledge-base content list before deployment. GDPR compliance for chatbot data is also covered in detail in our guide to protecting prospect data.
Transparency Requirements
Under UK GDPR, prospects interacting with your chatbot must be informed that they are interacting with an automated system. The chatbot must identify itself as an AI. If it collects personal data (name, email, programme interest), the privacy notice must cover this processing. This is a legal requirement, not a design choice.
Continuous Improvement: Learning From Real Conversations
A chatbot knowledge base is not a one-time project. The first version will have gaps. Prospects will ask questions your knowledge base does not cover. Some answers will be incomplete. The value of a deployed chatbot is that it tells you, in real time, where those gaps are.
The Unanswered Question Log
The single most important continuous improvement input is the list of questions the chatbot failed to answer satisfactorily. Most chatbot platforms expose this data as "unresolved conversations" or "fallback triggers". Review this log weekly for the first three months of deployment, then monthly thereafter.
For each unanswered question category, the response is always one of three things:
- Add content to the knowledge base. The question is reasonable; the content simply does not exist yet. Write it and ingest it.
- Improve existing content. The content exists but is not being retrieved correctly — often because it is buried in a long document or uses different terminology than prospects use. Restructure or rewrite the relevant chunk.
- Flag for human escalation. Some question types should never be handled by the chatbot. If prospects are repeatedly asking about individual circumstances that require adviser judgement, configure those triggers to route to a human. Our article on AI Chatbot vs Human Advisor: When to Hand Over covers those escalation triggers in detail.
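Triage is easier when the log is tallied by topic first, so the biggest gap gets fixed first. A sketch, assuming your reviewer tags each unresolved conversation with a topic label (the labels here are hypothetical):

```python
from collections import Counter

# Hypothetical weekly export of unresolved conversations, tagged by topic.
unresolved = [
    "scholarship deadlines", "scholarship deadlines",
    "RPL for mature students", "scholarship deadlines",
]

def top_gaps(log: list[str], n: int = 3) -> list[tuple[str, int]]:
    """Rank missing-content topics by frequency."""
    return Counter(log).most_common(n)
```

The top entry each week becomes the next knowledge-base writing task.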
Seasonal Updates: The Admissions Calendar
Higher education in the UK runs on a predictable annual cycle. Your knowledge base needs to track it:
- September–October: new cohort entry requirements confirmed, updated programme pages live
- October–January: UCAS application window; fee and scholarship information must be accurate
- February–June: postgraduate recruitment peaks; Clearing preparation begins
- August: Clearing; the chatbot must know current-cycle Clearing vacancies and amended grade thresholds
Build these update cycles into your calendar alongside your standard content review process. A chatbot quoting last year's UCAS deadline during January is not a minor error — it is an active recruitment liability.
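The calendar above can be encoded as a small lookup so the due tasks for any month are one function call away. A sketch (the month ranges mirror the list above; the task wording is our own):

```python
# Month ranges from the UK admissions cycle mapped to knowledge-base tasks.
CYCLE_TASKS = {
    (9, 10): "confirm new-cohort entry requirements and programme pages",
    (10, 1): "verify fee and scholarship information for the UCAS window",
    (2, 6): "refresh postgraduate content; begin Clearing preparation",
    (8, 8): "load current-cycle Clearing vacancies and grade thresholds",
}

def tasks_for(month: int) -> list[str]:
    """Return tasks due in a month; ranges may wrap the year end."""
    due = []
    for (start, end), task in CYCLE_TASKS.items():
        if start <= end:
            in_range = start <= month <= end
        else:  # range wraps December into the new year
            in_range = month >= start or month <= end
        if in_range:
            due.append(task)
    return due
```

Wire `tasks_for` into a monthly reminder and the seasonal review stops depending on anyone's memory.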
Conversation Analysis for Programme Intelligence
Beyond gap-filling, conversation data reveals what prospective students actually care about — as opposed to what your marketing team assumes they care about. If the chatbot logs show that 40% of prospective postgraduate applicants ask about funded places before they ask about programme content, that is a signal about how you position your postgraduate offer. If first-generation applicants consistently ask about student support before fees, that tells you something about the order in which your programme pages should present information.
For a practical guide to deploying the chatbot correctly on your website and ensuring it is surfaced to the right visitors, see our article on How to Integrate an AI Chatbot into Your School Website.
FAQ
How long does it take to build a school chatbot knowledge base from scratch?
For a typical private higher education institution with 10–20 programmes, the knowledge base preparation takes one to two working days. This covers auditing existing content, converting documents to clean text, resolving inconsistencies, and structuring content into retrieval-ready chunks. The initial chatbot configuration and testing adds another half day. The constraint is almost always content review time — confirming that every fee figure and entry requirement is accurate for the current cycle — not technical work.
What is the difference between RAG and fine-tuning in plain terms?
RAG gives the AI access to a searchable library of your documents at the moment it answers a question. Fine-tuning rewrites the AI's internal knowledge by training it on your data. For school chatbots, RAG is almost always the right choice: it is faster to deploy, cheaper to maintain, easier to update, and produces auditable, grounded responses that can be traced to specific source documents.
How do we prevent the chatbot from hallucinating incorrect fee or entry information?
The core prevention mechanism is RAG with a high-quality, up-to-date knowledge base. A RAG-based chatbot is instructed to answer only from its retrieved context — if the context does not contain an answer, it says so rather than generating one. Additional safeguards include: configuring the chatbot to cite its source ("according to our 2026–27 fee schedule"), setting a fallback response for questions outside the knowledge base, and running monthly accuracy audits on a sample of conversations. Hallucination risk is not zero, but with a well-maintained knowledge base it is far lower than with an unconstrained LLM.
Can we use UCAS data or Student Loans Company records in the knowledge base?
No. Data processed under your agreements with UCAS, Student Loans Company, or any other third party is subject to specific data-sharing terms. Using it as AI training data almost certainly falls outside those terms and would constitute a breach of both your contractual obligations and UK GDPR's purpose limitation principle. Consult your Data Protection Officer before including any third-party data in your knowledge base. The ICO's guidance on AI and data protection applies here.
How often should the knowledge base be updated?
At minimum, once per admissions cycle — before the UCAS application window opens in September. In practice, the highest-impact updates happen four times a year: at the start of the application cycle, after January deadline season, during the postgraduate recruitment peak (March–May), and at the start of Clearing preparation (July). Open Day dates and scholarship deadlines need updating as they change, regardless of the cycle. A well-run knowledge base is a live document, not an annual refresh.
Does a QAA-registered institution have specific obligations around AI chatbot accuracy?
QAA's UK Quality Code requires that information provided to applicants about programmes, entry requirements, and fees is accurate, up to date, and not misleading. This applies to information delivered by any channel — including AI chatbots. The practical implication is that a RAG-based chatbot with auditable source documents gives you a stronger compliance position than a free-generation LLM: every response can be traced to the authoritative published content it was drawn from.
Test your school's AI visibility for free: test Skolbot on your school in 30 seconds.



