文学検索 (Bungaku Kensaku)

文学検索
(Bungaku Kensaku)

AIを利用して日本作家の著作を検索いたします
(AI-powered semantic search for Japanese literature)

Scenario:

A sophisticated semantic search platform designed specifically for modern Japanese literature from the Meiji and Taisho eras. This system handles the unique challenges of searching philosophical content based on ideas and concepts, rather than simple keywords.

Built with Spring Boot and Python microservices on a modern AWS infrastructure, this application demonstrates advanced natural language processing, vector database integration, and scalable cloud architecture. The system is designed from the ground up to handle growth from 5 books to 300+ volumes.

Homepage:
– Clean, scholarly interface aesthetic
– Search with integrated book/series filtering
– Responsive design with Thymeleaf elements

Vector Database:
– Pinecone vector database integration
– Text chunk embedding with Open AI API
– Semantic search across texts

Database Schema:
– JPA entities for books, chunks, and authors
– Apache PDFBox for text processing
– Scalable architecture for large libraries

Infrastructure:
– Spring Boot application deployed on EC2
– PostgreSQL for metadata and book information
– Route 53 DNS routing and SSL management

Search Results:
– Search results showing book/chapter context
– Text summaries & relavance with Open AI API
– Relevance scoring optimization

Dig into the details at the Github page.

Full Summary:

A multilingual AI-powered semantic search engine for Japanese literature, featuring intelligent context-aware search results and natural language understanding.

Features

Semantic Search: AI-powered understanding of queries in both Japanese and English
Context-Aware Results: Each result includes intelligent summaries explaining relevance and context
Multilingual Support: Search Japanese literature with queries in multiple languages
Intuitive UI: Clean, scholarly interface inspired by traditional Japanese design
Vector Database: Uses embeddings for sophisticated content matching beyond keyword search
Scalable Architecture: Designed to grow from 5 to 100+ books seamlessly

Architecture

Technology Stack

Backend: Java Spring Boot
Frontend: Thymeleaf templates with responsive CSS/JavaScript
Database: PostgreSQL (metadata & book information)
Vector Search: Pinecone vector database
AI Integration: OpenAI embeddings and ChatGPT for intelligent summaries
Cloud: AWS EC2

System Design

Frontend (Thymeleaf) → Spring Boot API → Vector Search Service → PostgreSQL
                                      ↘ Pinecone Vector DB
                                      ↘ OpenAI API (embeddings + summaries)

Getting Started

Prerequisites

Java 17+
Maven 3.6+
PostgreSQL 12+
API keys for:
OpenAI (for embeddings and AI summaries)
Pinecone (for vector search)

Installation

Clone the repository

git clone https://github.com/yourusername/bungaku-kensaku.git
cd bungaku-kensaku

Set up configuration

# Copy example configuration
cp src/main/resources/application-example.properties src/main/resources/application-local.properties

# Edit application-local.properties with your actual credentials

Configure environment variables

export OPENAI_API_KEY="your-openai-api-key"
export PINECONE_API_KEY="your-pinecone-api-key"
export DB_PASSWORD="your-database-password"

Set up PostgreSQL database

CREATE DATABASE bungaku_kensaku_db;
CREATE USER your_username WITH PASSWORD 'your_password';
GRANT ALL PRIVILEGES ON DATABASE bungaku_kensaku_db TO your_username;

Run the application
```
mvn spring-boot:run
```
Access the application
- Navigate to http://localhost:8080
- Default demo credentials:
  - Username: demo
  - Password: changeme
- Or configure your own via environment variables:
```
export DEMO_USERNAME=your-username
export DEMO_PASSWORD=your-password
```

📁 Project Structure

src/main/java/com/senseisearch/
├── controller/        # REST API endpoints
├── service/           # Business logic (search, AI, document processing)
├── repository/        # JPA data access layer
├── model/             # Entity classes
├── config/            # Configuration classes
└── util/              # Utility classes

src/main/resources/
├── templates/               # Thymeleaf HTML templates
├── static/                  # CSS, JavaScript, images
└── application*.properties  # Configuration files

How It Works

Document Processing Pipeline

PDF Upload: Documents are uploaded and stored in PostgreSQL
Text Extraction: Content is extracted and chunked (500 tokens with overlap)
Embedding Generation: Each chunk is converted to vector embeddings using OpenAI
Vector Storage: Embeddings are indexed in Pinecone for fast similarity search
Metadata Storage: Book information and chunk metadata stored in PostgreSQL

Search Process

Query Processing: User query is converted to embeddings
Similarity Search: Pinecone finds most relevant text chunks
Context Assembly: Retrieved chunks are enriched with book/chapter context
AI Summary: OpenAI generates intelligent explanations for each result
Result Presentation: Clean, organized results with context and relevance explanations

Use Cases

Academic Research: Deep exploration of philosophical and literary texts
Study Groups: Finding relevant passages for discussion topics
Cross-Reference Search: Discovering connections between different works
Multilingual Access: Non-Japanese speakers accessing Japanese literature
Contextual Learning: Understanding passages within their broader narrative context

AIを利用して日本作家の著作を検索いたします (AI-powered semantic search for Japanese literature)