By Thisara Priyamal in AWS AI — 13 Oct 2025

AWS AI Model Deployment & Inference - Real-time, Batch, A/B Testing | Sinhala Guide

Hello, friends!

These days, Machine Learning (ML) and Artificial Intelligence (AI) have become indispensable in software engineering. Everyone is keen to build and train AI models, but there is far less know-how about what comes next: how to deploy the finished model to a production environment and how to run inference on it. That is really the hardest part. Putting a model to work on real-world problems (deploying it) matters just as much as training it.

So in this tutorial, let's see how to deploy an AI model on AWS (Amazon Web Services) and get inference results from it. We'll cover the difference between Real-time and Batch inference, how to configure an endpoint, advanced strategies such as A/B testing and shadow deployments, and ways to optimize the inference pipeline and cut costs. Hopefully you'll pick up something new along the way!

AI Model Deployment - Basic Concept

Simply put, AI model deployment means taking the Machine Learning model we built and trained, putting it into a production environment, and making it ready for users or other applications to use. Imagine you trained a model on your home computer. Now it has to be set up so that anyone, anywhere in the world, can use it at any time. That is what deployment is for.

On AWS, the main service for this job is Amazon SageMaker. SageMaker is a service that simplifies every stage of the Machine Learning lifecycle, from data labeling through building, training, tuning, and deploying a model. With it, we can expose our model as an API endpoint. Other applications can then send requests to that API and get predictions, recommendations, or classifications back from the model.
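To make the "expose the model as an API endpoint" step a bit more concrete, here is a minimal sketch using boto3. It assumes you already have a container image URI, a model artifact in S3, and an IAM execution role; all the names and defaults below (the "-config" and "-endpoint" suffixes, ml.m5.large, the AllTraffic variant name) are placeholders I picked for illustration, not something SageMaker requires.

```python
def build_deployment_requests(model_name, image_uri, model_data_url, role_arn,
                              instance_type="ml.m5.large", instance_count=1):
    """Build the three request payloads needed to stand up a real-time endpoint."""
    config_name = f"{model_name}-config"      # hypothetical naming scheme
    endpoint_name = f"{model_name}-endpoint"  # hypothetical naming scheme
    create_model = {
        "ModelName": model_name,
        "ExecutionRoleArn": role_arn,
        "PrimaryContainer": {"Image": image_uri, "ModelDataUrl": model_data_url},
    }
    create_endpoint_config = {
        "EndpointConfigName": config_name,
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": instance_count,
        }],
    }
    create_endpoint = {
        "EndpointName": endpoint_name,
        "EndpointConfigName": config_name,
    }
    return create_model, create_endpoint_config, create_endpoint


def deploy(requests):
    """Send the payloads to SageMaker. Needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    sagemaker = boto3.client("sagemaker")
    model_req, config_req, endpoint_req = requests
    sagemaker.create_model(**model_req)
    sagemaker.create_endpoint_config(**config_req)
    sagemaker.create_endpoint(**endpoint_req)
    # Endpoint creation is asynchronous, so wait until it reports InService.
    waiter = sagemaker.get_waiter("endpoint_in_service")
    waiter.wait(EndpointName=endpoint_req["EndpointName"])
    return endpoint_req["EndpointName"]
```

Keeping the payload construction separate from the AWS calls makes it easy to review or unit-test the configuration before anything is created (and billed).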

Real-time Inference vs. Batch Inference - Crucial Choice

Once our model is in production, there are two main ways to get predictions from it: Real-time Inference and Batch Inference. Let's look at how they differ and when to use each.

Real-time Inference

What is it? Real-time Inference is for situations where we need a prediction from the model immediately. As soon as a user takes some action, the matching prediction is returned.
Use cases:
- Online recommendation systems (e.g., Netflix's 'recommended for you' suggestions)
- Fraud detection (spotting fraudulent transactions as they happen)
- Chatbots (answering user questions instantly)
- Live image/video processing (e.g., self-driving cars)
Characteristics: low latency (response time), high availability, immediate responses. On AWS, this is done with SageMaker Real-time Endpoints.

Batch Inference

What is it? Batch Inference means handing the model a large amount of data at once (a batch) and getting the predictions for all of it in one go. No immediate response is needed here.
Use cases:
- Generating daily/weekly reports (e.g., customer churn predictions for marketing campaigns)
- Data enrichment (updating a large data store with ML predictions)
- Offline recommendation engines (predictions based on the user's past behavior)
Characteristics: high throughput (the ability to process a large amount of data at once), lower cost (in most cases), and latency that doesn't matter much. On AWS, this is done with SageMaker Batch Transform jobs.
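For comparison with the real-time case, here is a rough sketch of what starting a Batch Transform job could look like with boto3. It assumes the model is already registered in SageMaker and that the input is line-delimited CSV under an S3 prefix; the job name, bucket paths, and instance type are made-up placeholders.

```python
def build_transform_job_request(job_name, model_name, input_s3_uri, output_s3_uri,
                                instance_type="ml.m5.xlarge", instance_count=1):
    """Build a CreateTransformJob payload for line-delimited CSV input."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            },
            "ContentType": "text/csv",
            "SplitType": "Line",  # send the file to the model record by record
        },
        "TransformOutput": {
            "S3OutputPath": output_s3_uri,
            "AssembleWith": "Line",  # write one prediction per output line
        },
        "TransformResources": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
        },
    }


def start_transform_job(request):
    """Submit the job. Needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("sagemaker").create_transform_job(**request)
```

Unlike an endpoint, the instances here exist only while the job runs, which is a big part of why batch inference tends to be cheaper.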

Practical Example: Real-time Inference with SageMaker Endpoint

Let's assume we have a trained model that is already deployed as a SageMaker Endpoint. We can get a prediction from it with the Python boto3 library like this:


import boto3
import json

# Your SageMaker endpoint name
endpoint_name = 'your-model-endpoint-name'

# Input data for inference. This usually needs to be formatted
# according to what your model expects (e.g., CSV, JSON, protobuf).
# Let's assume a simple JSON input for this example.
# Make sure your content type matches your model's expected input.
input_data = {"features": [5.1, 3.5, 1.4, 0.2]}

# Create a SageMaker runtime client
sagemaker_runtime = boto3.client('sagemaker-runtime')

try:
    # Invoke the endpoint
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='application/json', # Or 'text/csv', etc., depending on your model
        Body=json.dumps(input_data) # Convert dictionary to JSON string
    )

    # Read the response body
    result = response['Body'].read().decode('utf-8')
    print(f"Prediction: {result}")

except Exception as e:
    print(f"Error invoking endpoint: {e}")

This boto3 code sends our input data to the deployed SageMaker Endpoint and prints the prediction it returns. Replace 'your-model-endpoint-name' with the name of your own endpoint, and shape input_data to match what your model expects.

SageMaker Endpoint Configuration - Let's Maintain Your Model Super Well

When you deploy your model, configuring it properly matters a lot. The configuration determines the model's performance, its scalability, and even its cost.

Instance Types

When deploying an endpoint, we have to choose which compute instance type it will run on. SageMaker's ml.* instance types are backed by Amazon EC2 instances.

CPU-based instances (e.g., ml.m5, ml.c5): These suit general-purpose workloads, text-based models, and models of lower complexity. They are generally cheaper than GPU instances.
GPU-based instances (e.g., ml.p3, ml.g4dn): These are a very good fit for models that need a lot of computational power, such as Deep Learning models, computer vision tasks, and heavier NLP models. They are faster than CPU instances for such workloads, but they cost more.

Choose according to your model's requirements, your data traffic, and your budget. Pick well and you can get the best performance at the lowest cost.
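One way to compare instance types is to look at cost per prediction rather than price per hour. The little calculation below is only a sketch: the hourly prices and throughput figures in the usage lines are hypothetical numbers chosen for illustration, not real AWS prices or benchmarks, so plug in your own measurements.

```python
def cost_per_million_predictions(hourly_price_usd, predictions_per_second):
    """Cost of one million predictions on a fully utilized instance."""
    if predictions_per_second <= 0:
        raise ValueError("predictions_per_second must be positive")
    predictions_per_hour = predictions_per_second * 3600
    return hourly_price_usd / predictions_per_hour * 1_000_000


# Hypothetical figures, for illustration only.
cpu_cost = cost_per_million_predictions(hourly_price_usd=0.20, predictions_per_second=50)
gpu_cost = cost_per_million_predictions(hourly_price_usd=1.00, predictions_per_second=400)
print(f"CPU: ${cpu_cost:.2f} per million, GPU: ${gpu_cost:.2f} per million")
```

With these made-up numbers the GPU instance comes out cheaper per prediction despite the higher hourly price, but only if it is kept busy. At low traffic the idle GPU time dominates and the CPU instance wins.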

Auto-scaling

Say your application receives 100 requests at one moment and 10,000 the next. A fixed number of instances can't cope with that. This is exactly where Auto-scaling comes in.

What is it? Auto-scaling automatically increases (scale out) or decreases (scale in) the number of deployed instances according to the amount of traffic reaching your model endpoint.
Why it matters:
- Performance: Even when heavy traffic arrives, your model keeps working well without slowing down.
- Cost Savings: The instance count drops when traffic is low, so money isn't wasted unnecessarily. 'Better than throwing money away for nothing, right!'
- High Availability: Even if one instance runs into trouble, the others are still there, so the service isn't affected.

In SageMaker, we can set auto-scaling policies based on the metrics we care about (e.g., CPU utilization, memory utilization, invocations per instance).


{
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/your-endpoint-name/variant/AllTraffic",
    "MinCapacity": 1,
    "MaxCapacity": 5,
    "TargetTrackingScalingPolicyConfiguration": {
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 300,
        "ScaleInCooldown": 600
    }
}

This JSON config shows how to let the endpoint's instance count vary between 1 and 5 (MinCapacity and MaxCapacity). The TargetValue of 70.0 applies to the SageMakerVariantInvocationsPerInstance metric, so it means roughly 70 invocations per instance per minute, not a CPU percentage. When the average climbs above that target, auto-scaling adds instances; when it stays below, instances are removed. ScaleOutCooldown (300 seconds, i.e. 5 minutes) and ScaleInCooldown (600 seconds, i.e. 10 minutes) are the waiting periods after a scaling activity before the next one of that kind can start. Note that this is a combined view for illustration: in the Application Auto Scaling API, the capacity limits are set when registering the scalable target, while the target tracking configuration belongs to the scaling policy.
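If you'd rather apply these settings from Python than from a JSON file, the same values can be passed through boto3's application-autoscaling client in two calls. This is a sketch under the same assumptions as the config above (a variant named AllTraffic, 1 to 5 instances, a target of 70); the policy name is a placeholder I made up.

```python
def build_autoscaling_requests(endpoint_name, variant_name="AllTraffic",
                               min_capacity=1, max_capacity=5,
                               target_invocations=70.0):
    """Build payloads for register_scalable_target and put_scaling_policy."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    register_target = {
        **common,
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }
    scaling_policy = {
        **common,
        "PolicyName": f"{endpoint_name}-invocations-target",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Average invocations per instance per minute to aim for.
            "TargetValue": target_invocations,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleOutCooldown": 300,  # seconds
            "ScaleInCooldown": 600,   # seconds
        },
    }
    return register_target, scaling_policy


def apply_autoscaling(register_target, scaling_policy):
    """Apply both settings. Needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("application-autoscaling")
    client.register_scalable_target(**register_target)
    client.put_scaling_policy(**scaling_policy)
```

The scalable target has to be registered before the policy is attached, which is why the two calls are made in that order.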

Advanced Deployment Strategies - Deploying Smartly

Deploying a model isn't a 'push it out once and you're done' affair. Especially for mission-critical applications, we need to think about deployment strategies. A/B Testing and Shadow Deployments are two strategies that are very useful in these situations.

A/B Testing

What is it? A/B Testing means putting two versions of the same model, or two models for the same use case (model A and model B), into production at the same time, sending part of the traffic to one and the rest to the other, and comparing how the two perform. It's a 'let's experiment a little!' kind of thing.
Why is it needed?
- To check whether a new model works better than the old one.
- To evaluate the performance of different hyperparameters, algorithms, or feature sets.
- To understand how to improve the user experience.
In SageMaker: When creating a SageMaker endpoint, we can deploy several production variants. For each variant we can define the model it serves, its instance type, and what share of incoming traffic it should receive (e.g., 80% to Model A, 20% to Model B).
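Here is a small sketch of how an 80/20 split like the one above could be written as a ProductionVariants list for create_endpoint_config. The variant names, instance type, and instance counts are placeholders of mine. The helper at the bottom shows the arithmetic: weights are relative, so each variant's share is its weight divided by the sum of all weights.

```python
def build_ab_variants(model_a, model_b, weight_a=0.8, weight_b=0.2,
                      instance_type="ml.m5.large"):
    """Build a two-variant ProductionVariants list for an A/B test."""
    def variant(name, model_name, weight):
        return {
            "VariantName": name,
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "InitialVariantWeight": weight,
        }

    return [
        variant("ModelA", model_a, weight_a),
        variant("ModelB", model_b, weight_b),
    ]


def traffic_shares(variants):
    """Return the fraction of traffic each variant receives."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    if total <= 0:
        raise ValueError("at least one variant needs a positive weight")
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}
```

To use it, pass the list as ProductionVariants when creating the endpoint config. When testing, invoke_endpoint also accepts a TargetVariant parameter, which lets you send a specific request to one variant regardless of the weights.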

Shadow Deployments (Dark Launch)

What is it? In a Shadow Deployment, a new model is deployed to production, but its predictions are not shown to users. A copy of the production traffic is sent to the new model purely to observe how it behaves. It's a 'testing where nobody can see' kind of thing.
Why is it needed?
- To see how a new model's performance, latency, and errors look under real production traffic, without putting users at risk.
- To catch performance regressions. (Is the new model slower than the old one?)
- To test system stability.
How it works: Typically, a duplicate of the traffic reaching the production endpoint is sent to the new shadow endpoint. The predictions from the two are compared, and the new model is put on live traffic only if no problems show up.
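The mirroring idea can be sketched at the application level in a few lines. This is only an illustration of the pattern, not how SageMaker implements it: primary_predict and shadow_predict are hypothetical callables standing in for calls to the two endpoints, and the mismatch log is a plain list.

```python
def serve_with_shadow(payload, primary_predict, shadow_predict, mismatches):
    """Answer from the primary model and silently compare with the shadow model."""
    primary_result = primary_predict(payload)
    try:
        shadow_result = shadow_predict(payload)
        if shadow_result != primary_result:
            mismatches.append({
                "payload": payload,
                "primary": primary_result,
                "shadow": shadow_result,
            })
    except Exception as exc:
        # A failing shadow model must never affect what the user sees.
        mismatches.append({
            "payload": payload,
            "primary": primary_result,
            "shadow_error": repr(exc),
        })
    # Only the primary model's answer ever reaches the caller.
    return primary_result
```

In a real system the shadow call would normally run asynchronously (a queue or background task), so that it adds no latency to the user-facing request.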

Inference Pipeline Optimization & Cost Reduction - Let's Increase Speed, Reduce Costs

Deploying a model and running inference costs money. So let's see how we can bring those costs down while improving performance.

Optimization Techniques

Model Compression:
- Quantization: Reducing the precision of the model's weights and activations (e.g., from 32-bit floats to 8-bit integers). This shrinks the model, usually speeds up inference, and lowers memory usage, typically at the price of a small loss in accuracy. Small, but it still gets the job done.
- Pruning: Removing connections or neurons that contribute little to the model's output.
- Knowledge Distillation: 'Teaching' the knowledge of a large, complex model (the teacher) to a small, simple model (the student).
Efficient Instance Types: As discussed earlier, choose the CPU/GPU instance type that fits your workload, and avoid GPU instances when they aren't really needed.
Batching Requests: Even with real-time inference, if you can send several requests to the model together as a small batch instead of one at a time, throughput can improve.
Optimized Inference Code: Optimizing how the model is loaded and how pre-processing and post-processing are done matters too.
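To show what quantization does to the numbers, here is a toy sketch of symmetric 8-bit quantization in plain Python. Real frameworks do this per tensor or per channel with calibrated ranges; this version simply maps the largest absolute weight to 127 and is meant only to illustrate the idea.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] plus one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.03, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
# Each restored weight is off by at most about half a quantization step.
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, worst_error)
```

Each weight now needs 1 byte instead of 4, and the rounding error per weight is bounded by roughly scale / 2. Whether that error is acceptable depends on the model, so accuracy should be re-checked after quantizing.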

Cost Reduction Techniques

Effective Auto-scaling: As discussed earlier, configuring auto-scaling properly reduces the instance count when traffic is low, which can save a lot of money.
Right-sizing Instances: Avoid instances that have more resources (CPU, memory, GPU) than your model actually needs. 'Why carry more than you need?'
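Spot Instances (for batch workloads): Spot capacity is much cheaper than on-demand instances, but AWS can reclaim it at short notice when it needs the capacity back. Batch jobs tolerate that well, as long as interrupted work can be retried. One caveat: as far as I know, SageMaker's managed Spot option applies to training jobs, and Batch Transform jobs do not expose a Spot setting, so check the current SageMaker documentation before counting on Spot savings for batch inference. (Spot instances are generally not used for real-time endpoints, because uptime matters.)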
Monitoring and Alerts: It's important to keep monitoring your endpoint's performance and cost. With services such as CloudWatch, you can receive alerts when there is unusual spending or a drop in performance.
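As an example of such an alert, the sketch below builds a CloudWatch alarm on an endpoint's ModelLatency metric. The threshold, evaluation window, alarm name, and optional SNS topic are illustrative choices of mine; adjust them to whatever counts as "too slow" for your application.

```python
def build_latency_alarm(endpoint_name, variant_name="AllTraffic",
                        threshold_ms=500, sns_topic_arn=None):
    """Build a put_metric_alarm payload for high average model latency."""
    alarm = {
        "AlarmName": f"{endpoint_name}-high-model-latency",  # hypothetical name
        "Namespace": "AWS/SageMaker",
        "MetricName": "ModelLatency",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Average",
        "Period": 60,            # evaluate one-minute averages
        "EvaluationPeriods": 5,  # alarm after 5 consecutive breaches
        # ModelLatency is reported in microseconds, so convert from ms.
        "Threshold": threshold_ms * 1000,
        "ComparisonOperator": "GreaterThanThreshold",
    }
    if sns_topic_arn:
        alarm["AlarmActions"] = [sns_topic_arn]  # notify via SNS when it fires
    return alarm


def create_alarm(alarm):
    """Create the alarm. Needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

For the cost side, AWS Budgets or a billing alarm plays a similar role; the pattern of "metric, threshold, notification" stays the same.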

Conclusion

So friends, in this tutorial we learned a lot about AI model deployment and inference on AWS. By now you can probably see that getting a model into production successfully and maintaining it properly matters just as much as training it.

We covered the differences between Real-time and Batch inference, the importance of instance types and auto-scaling when configuring a SageMaker Endpoint, and advanced strategies such as A/B Testing and Shadow Deployments. We also talked about ways to improve the model's performance while cutting unnecessary costs.

I believe this knowledge will be very useful for your next project. Building AI models is only half of it; bringing them to the world is just as important. 'Whatever you do, do it properly!'

Leave your thoughts, questions, and experiences in the comments below! Try implementing this yourself, because that is the best way to learn. See you in another tutorial like this one!
