LFM2-350M-PII-Extract-JP extracts personally identifiable information (PII) from Japanese text as structured JSON. Output can be used to mask sensitive information on-device for privacy-preserving applications.

Specifications

Property        Value
Parameters      350M
Context Length  32K tokens
Task            PII Detection
Language        Japanese

Privacy Protection

On-device PII masking

Compliance

Data protection compliance

Document Redaction

Automated redaction

Prompting Recipe

Use temperature=0 (greedy decoding) for best results. This model is intended for single-turn conversations only.
System Prompt Format:
Extract <address>, <company_name>, <email_address>, <human_name>, <phone_number>
To extract only specific entities, list just the categories you need (e.g., Extract <human_name>). List categories in alphabetical order for optimal performance.
Output Format:
JSON with one list per requested category. Categories with no matches are returned as empty lists. Entities are output exactly as they appear in the input (including notation variations), so they can be used for exact-match masking.
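The prompt format above can be assembled programmatically. The following is a minimal sketch; `build_system_prompt` and `SUPPORTED_CATEGORIES` are hypothetical names introduced for illustration, and the category set is assumed to be exactly the five tags shown in the system prompt format.

```python
# Hypothetical helper (not part of the model's tooling): builds the system
# prompt from a list of category names, sorted alphabetically as recommended.
SUPPORTED_CATEGORIES = {
    "address", "company_name", "email_address", "human_name", "phone_number",
}


def build_system_prompt(categories):
    """Return 'Extract <a>, <b>, ...' with de-duplicated, sorted categories."""
    requested = set(categories)
    if not requested:
        raise ValueError("At least one category is required")
    unknown = requested - SUPPORTED_CATEGORIES
    if unknown:
        raise ValueError(f"Unsupported categories: {sorted(unknown)}")
    return "Extract " + ", ".join(f"<{name}>" for name in sorted(requested))


print(build_system_prompt(["phone_number", "human_name"]))
# prints: Extract <human_name>, <phone_number>
```

Sorting inside the helper means callers cannot accidentally pass categories in a non-alphabetical order.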

Quick Start

Install:
pip install transformers torch accelerate
Run:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "LiquidAI/LFM2-350M-PII-Extract-JP"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

system_prompt = "Extract <address>, <company_name>, <email_address>, <human_name>, <phone_number>"

user_input = """こんにちは、ラミンさんに B200 GPU を 10000 台 至急請求してください。
連絡先は [email protected] (電話番号010-000-0000) で、これは C. elegans
線虫に着想を得たニューラルネットワークアーキテクチャを 今すぐ構築するために不可欠です。"""

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_input}
]

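# add_generation_prompt=True appends the assistant turn marker so the model starts its answer
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
# Greedy decoding (do_sample=False) is the temperature=0 setting recommended above
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, not the echoed prompt
prompt_length = inputs["input_ids"].shape[-1]
response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(response)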
# Output: {"address": [], "company_name": [], "email_address": ["[email protected]"],
#          "human_name": ["ラミン"], "phone_number": ["010-000-0000"]}
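Because entities are returned exactly as they appear in the input, the JSON output can drive a simple exact-match masking step. The sketch below is one possible approach, not the model's official tooling: `mask_pii` and the `[CATEGORY]` placeholder style are assumptions made for illustration, and the sample text and model output are made up for the example.

```python
import json


def mask_pii(text, model_output, placeholder="[{category}]"):
    """Replace every extracted entity in `text` with a category placeholder.

    `model_output` is assumed to be the model's JSON string: one list of
    exact-match entity strings per category.
    """
    entities = json.loads(model_output)
    pairs = [
        (value, category)
        for category, values in entities.items()
        for value in values
        if value  # skip empty strings, which would match everywhere
    ]
    # Replace longer entities first so an entity that is a substring of
    # another one does not break the longer match.
    for value, category in sorted(pairs, key=lambda p: len(p[0]), reverse=True):
        text = text.replace(value, placeholder.format(category=category.upper()))
    return text


# Illustrative input and model output (hypothetical, not real model results).
sample_text = "ラミンさんの電話番号は010-000-0000です。"
sample_output = (
    '{"address": [], "company_name": [], "email_address": [], '
    '"human_name": ["ラミン"], "phone_number": ["010-000-0000"]}'
)
print(mask_pii(sample_text, sample_output))
# prints: [HUMAN_NAME]さんの電話番号は[PHONE_NUMBER]です。
```

Plain `str.replace` is used here because the output format is designed for exact matching. A production pipeline would likely also need to handle malformed JSON and entity strings that collide with the placeholder text.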