Evaluation Setup
Simulation of 1,350 conversations across varied attack scenarios and victim profiles
Benchmarks:
- SE-VSim Generated Dataset (Social Engineering Simulation) [New]
Metrics:
- Number of conversations
- Fleiss' Kappa (Annotation Agreement)
- Statistical methodology: Fleiss' Kappa was computed to measure inter-annotator agreement on the labels.
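For readers unfamiliar with the agreement metric, the following is a minimal sketch of the standard Fleiss' Kappa formula. It is not the paper's own code; the function name `fleiss_kappa` and the count-matrix input layout are assumptions made for illustration.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' Kappa for a count matrix (hypothetical helper, not from the paper).

    counts[i][j] is the number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_cats = len(counts[0])

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement: sum of squared overall category proportions.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_e = sum(p * p for p in p_cat)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")

    return (p_bar - p_e) / (1.0 - p_e)


# Toy example: 2 conversations, 3 annotators, labels (success, failure).
print(fleiss_kappa([[3, 0], [0, 3]]))  # perfect agreement
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so a reported value near 0.8 sits toward the high end of that range.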
Key Results
The paper primarily presents the construction and validation of the SE-VSim dataset.

| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| SE-VSim Dataset | Total Conversations | 0 | 1,350 | +1,350 |
| Attack Success Labeling | Fleiss' Kappa | 0 | 0.796 | +0.796 |
Main Takeaways
- Generated a balanced dataset covering three professional attacker roles (Recruiter, Journalist, Funding Agency) and three target information types (PII, Financial, Patents).
- Reported substantial annotation agreement on social engineering success labels (Fleiss' Kappa = 0.796), which the paper takes as evidence that an LLM (GPT-4o-mini) can stand in for human annotators on this task.
- Established a simulation framework that successfully integrates Big Five personality traits into victim agents, allowing for diverse conversation trajectories.
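The scenario coverage described above can be sketched as a simple grid. This is an illustration only: the summary does not say how the 1,350 conversations are distributed, so the even split across role and target-information pairs, along with the names `scenario_grid`, `ATTACKER_ROLES`, and `TARGET_INFO`, are assumptions.

```python
from itertools import product

# Values taken from the summary; the variable names are hypothetical.
ATTACKER_ROLES = ["Recruiter", "Journalist", "Funding Agency"]
TARGET_INFO = ["PII", "Financial", "Patents"]
TOTAL_CONVERSATIONS = 1350


def scenario_grid(total: int) -> dict[tuple[str, str], int]:
    """Split `total` conversations evenly over (role, target info) cells.

    ASSUMPTION: an even split per cell; the actual SE-VSim allocation
    may differ (for example, it may also vary by victim personality).
    """
    cells = list(product(ATTACKER_ROLES, TARGET_INFO))
    per_cell, remainder = divmod(total, len(cells))
    if remainder:
        raise ValueError("total does not divide evenly across cells")
    return {cell: per_cell for cell in cells}


grid = scenario_grid(TOTAL_CONVERSATIONS)
for (role, info), n in grid.items():
    print(f"{role:15s} {info:10s} {n}")
```

Under this assumed layout, the three roles and three information types give nine cells, and 1,350 conversations divide evenly among them.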