The overall process of building an evaluation pipeline looks like this:
Select Your Dataset: Choose or upload datasets to serve as the basis for your evaluations, whether for scoring, regression testing, or bulk job processing.
Build Your Pipeline: Start by visually constructing your evaluation pipeline, defining each step from input data processing to final evaluation.
Run Evaluations: Execute your pipeline, observe the results in a spreadsheet-like interface, and make informed decisions based on comprehensive metrics and scores.
Initiate a Batch Run: Start by creating a new batch run, which requires specifying a name and selecting a dataset.
Dataset Selection: Upload a CSV/JSON dataset, or create a dataset from historical data using filters like time range, prompt template logs, scores, and metadata.
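To make the filtering idea concrete, here is a minimal sketch in plain Python of selecting historical records by time range, score, and metadata. It is not the product's API: the record shape (timestamp, score, metadata, input) and the helper name filter_history are assumptions made for illustration only.

```python
from datetime import datetime
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]  # hypothetical shape of one logged request


def filter_history(
    records: List[Record],
    start: datetime,
    end: datetime,
    min_score: Optional[float] = None,
    metadata: Optional[Dict[str, Any]] = None,
) -> List[Record]:
    """Keep records inside [start, end) that meet the score and metadata filters."""
    selected = []
    for rec in records:
        # Time range filter: start is inclusive, end is exclusive.
        if not (start <= rec["timestamp"] < end):
            continue
        # Score filter: records without a score are dropped when a minimum is set.
        score = rec.get("score")
        if min_score is not None and (score is None or score < min_score):
            continue
        # Metadata filter: every requested key must match exactly.
        rec_meta = rec.get("metadata", {})
        if metadata and any(rec_meta.get(k) != v for k, v in metadata.items()):
            continue
        selected.append(rec)
    return selected


# Illustrative data only; these are not real logs.
history = [
    {"timestamp": datetime(2024, 1, 5), "score": 90, "metadata": {"env": "prod"}, "input": "a"},
    {"timestamp": datetime(2024, 1, 20), "score": 40, "metadata": {"env": "prod"}, "input": "b"},
    {"timestamp": datetime(2024, 2, 2), "score": 95, "metadata": {"env": "prod"}, "input": "c"},
    {"timestamp": datetime(2024, 1, 9), "score": 85, "metadata": {"env": "dev"}, "input": "d"},
]

dataset = filter_history(
    history,
    start=datetime(2024, 1, 1),
    end=datetime(2024, 2, 1),
    min_score=80,
    metadata={"env": "prod"},
)
```

With the sample data above, only the record with input "a" passes all three filters.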
You now have a pipeline. Preview mode lets you iterate with live feedback and make adjustments in real time.
Click ‘Add Step’ to start building your pipeline. Each column represents a step in the evaluation process. Steps execute in order from left to right, so if a column depends on another column, place it to the right of that dependency.
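The left-to-right rule can be sketched in a few lines of Python. This is an illustration of the ordering constraint, not the product's implementation: the Step tuple, check_order, run_row, and the column names are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[str, object]
# (column name, columns it reads, function that computes the cell)
Step = Tuple[str, Sequence[str], Callable[[Row], object]]


def check_order(input_columns: Sequence[str], steps: List[Step]) -> None:
    """Raise ValueError if a step reads a column that is not to its left."""
    available = set(input_columns)
    for name, deps, _ in steps:
        missing = [d for d in deps if d not in available]
        if missing:
            raise ValueError(f"step {name!r} must appear to the right of {missing}")
        available.add(name)


def run_row(row: Row, steps: List[Step]) -> Row:
    """Evaluate each step in order, adding one new column per step."""
    result = dict(row)
    for name, _, fn in steps:
        result[name] = fn(result)
    return result


# Hypothetical two-step pipeline: normalize the answer, then compare it.
steps: List[Step] = [
    ("answer_clean", ["answer"], lambda r: str(r["answer"]).strip().lower()),
    ("is_correct", ["answer_clean", "expected"],
     lambda r: r["answer_clean"] == r["expected"]),
]

check_order(["answer", "expected"], steps)  # passes: dependencies are to the left
out = run_row({"answer": " Paris ", "expected": "paris"}, steps)
```

Reversing the two steps would make check_order raise, because is_correct would then sit to the left of the answer_clean column it depends on.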
If the last step of your evaluation pipeline contains only boolean values or only numeric values, that step is treated as the score for each row. Your full evaluation report will include a scorecard showing the average of this last step. NOTE: All cells in the last column must be boolean, or all must be numeric. If any cell deviates, the score will not be calculated.
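The scoring rule can be approximated as follows. This sketch assumes that a boolean column is averaged as the fraction of true cells and that an empty column yields no score; neither detail is stated above, and the function name scorecard is hypothetical.

```python
from numbers import Real
from typing import Optional, Sequence


def scorecard(last_column: Sequence[object]) -> Optional[float]:
    """Average the final column, or return None when no score can be computed."""
    if not last_column:
        return None  # assumption: an empty column has no score
    # bool is a subclass of int in Python, so test for booleans first.
    if all(isinstance(v, bool) for v in last_column):
        # Assumption: the average of booleans is the fraction that are True.
        return sum(1 for v in last_column if v) / len(last_column)
    if all(isinstance(v, Real) and not isinstance(v, bool) for v in last_column):
        return sum(last_column) / len(last_column)
    # Mixed types (or non-numeric cells): the score is not calculated.
    return None


all_bool = scorecard([True, False, True, True])
all_numeric = scorecard([1, 2, 3])
mixed = scorecard([True, 1, 0.5])
```

Here all_bool is 0.75, all_numeric is 2.0, and mixed is None because the column combines boolean and numeric cells.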