ConnAIsseur: Cross-Domain Recipe Recommendation via BERT Embedding Transfer
Overview
ConnAIsseur is a full-stack, AI-driven recipe recommendation system. Its core idea is cross-domain semantic alignment: instead of requiring users to rate recipes directly, which runs into the cold-start problem, the system infers culinary preferences from the restaurants a user selects. Restaurant reviews and recipe descriptions are mapped into a shared 768-dimensional BERT embedding space, and the centroid of a user’s selected restaurant embeddings becomes their “latent taste profile”. Recipes are then retrieved by KNN search over L2-normalized embeddings with a scikit-learn KD-tree, which ranks candidates by cosine similarity.
The NLP pipeline combines domain-adaptive DistilBERT pre-training (Masked Language Modeling on Yelp reviews), multi-strategy keyword extraction (a KeyBERT + YAKE + RAKE ensemble), and a Yelp web scraping pipeline built on BeautifulSoup4. The application pairs a React SPA (Material-UI, React Router v6, token-based auth) with a Django REST Framework backend (4 models, 15+ API endpoints, SearchFilter-based text search), deployed on AWS EC2 behind an Nginx reverse proxy with HTTPS.
System Architecture
[Yelp Web Scraper] → [NLP Pipeline: BERT Pre-training + Keyword Extraction + Embeddings]
↓
[Pre-computed 768-dim Embeddings in CSV]
↓
[React SPA] ←──REST API──→ [Django REST Framework] ←→ [SQLite Database]
↓
[KNN Recommendation Engine (scikit-learn KD-tree)]
The system operates in two phases: (1) offline — data scraping, domain-adaptive BERT pre-training, embedding computation, and database population; (2) online — user profile construction from restaurant selections, real-time KNN-based recipe retrieval, and recipe search/browse.
NLP Pipeline: Domain-Adaptive Pre-training
DistilBERT Masked Language Modeling
Domain-adaptive pre-training fine-tunes distilbert-base-uncased (6 transformer layers, 12 attention heads, 768 hidden dim, ~66M parameters) on Yelp review text via Masked Language Modeling: 15% of tokens are randomly selected for masking, and the model learns to predict them from context. This adapts the general-purpose language model to culinary vocabulary and semantics (e.g., “umami”, “sous vide”, “al dente”) before embedding extraction.
| Parameter | Value |
|---|---|
| Base model | distilbert-base-uncased |
| Task | MLM (15% masking probability) |
| Optimizer | AdamW |
| Learning rate | 1e-4 |
| Weight decay | 0.01 |
| Epochs | 10 |
| Block size | 128 tokens |
| Data split | 90% train / 10% validation |
| Evaluation | Perplexity (exp(eval_loss)) per epoch |
| Best model selection | load_best_model_at_end=True |
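The block size, data split, and perplexity rows can be illustrated without loading a model. The following is a minimal sketch, assuming the corpus has already been tokenized into one flat list of token IDs; `group_into_blocks`, `split_train_val`, and `perplexity` are hypothetical helper names, and dropping the ragged tail and shuffling with a fixed seed are assumptions rather than documented project behavior.

```python
import math
import random

BLOCK_SIZE = 128      # "Block size" row of the table
VAL_FRACTION = 0.10   # 90% train / 10% validation


def group_into_blocks(token_ids, block_size=BLOCK_SIZE):
    """Cut a flat token-id list into fixed-length blocks; the ragged tail is dropped."""
    usable = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, usable, block_size)]


def split_train_val(blocks, val_fraction=VAL_FRACTION, seed=0):
    """Shuffle a copy of the blocks and hold out val_fraction of them for validation."""
    shuffled = list(blocks)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def perplexity(eval_loss):
    """Perplexity is the exponential of the mean cross-entropy evaluation loss."""
    return math.exp(eval_loss)


# Stand-in token ids; a real run would use tokenizer output.
blocks = group_into_blocks(list(range(1300)))
train_blocks, val_blocks = split_train_val(blocks)
```

Lower perplexity on the held-out blocks indicates that the model predicts masked review tokens better, which is the signal the per-epoch evaluation tracks.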
Text Embedding Generation (Three Strategies)
| Strategy | Model | Method | Dimensions |
|---|---|---|---|
| Primary | textattack/bert-base-uncased-yelp-polarity | [CLS] token from last hidden layer | 768 |
| Alternative | distilbert-base-nli-mean-tokens | Mean pooling (Sentence-BERT) | 768 |
| Fallback | glove-wiki-gigaword-300 | Mean of word vectors | 300 |
The production system uses a BERT model fine-tuned on Yelp polarity classification, extracting the [CLS] token representation (outputs.last_hidden_state[:, 0, :]) as a 768-dimensional embedding for both restaurant reviews and recipe descriptions — projecting them into a shared semantic space.
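To make the difference between the [CLS] and mean-pooling strategies concrete, here is a small NumPy sketch that operates on an array shaped like a transformer's last hidden state. The random array stands in for real model output, and `cls_embedding` and `mean_pool` are hypothetical helper names; no model is loaded.

```python
import numpy as np


def cls_embedding(last_hidden_state):
    """Take the vector at position 0 (the [CLS] token) of every sequence."""
    return last_hidden_state[:, 0, :]


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors while ignoring padded positions (mask == 0)."""
    mask = attention_mask[:, :, None].astype(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against all-padding rows
    return summed / counts


rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 5, 768))               # (batch, tokens, hidden_dim)
mask = np.array([[1, 1, 1, 0, 0],                   # first text has two padding tokens
                 [1, 1, 1, 1, 1]])

cls_vecs = cls_embedding(hidden)                    # shape (2, 768)
mean_vecs = mean_pool(hidden, mask)                 # shape (2, 768)
```

Both strategies yield one 768-dimensional vector per text, so either can feed the same downstream similarity search.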
Multi-Strategy Keyword Extraction Ensemble
Three complementary approaches capture different facets of keyword relevance:
| Method | Library | Config | Approach |
|---|---|---|---|
| KeyBERT | keybert | 3-gram, max-sum diversity, top 20 | Contextual (BERT cosine similarity) |
| YAKE | yake | 3-gram, dedup threshold 0.9, top 20 | Statistical (TF, position, co-occurrence) |
| RAKE | rake_nltk | Default | Rule-based (phrase frequency/degree) |
Cross-Domain Recommendation Engine
Latent Taste Profile Construction
When a user selects restaurants on the Preference page:
- Retrieve the 768-dim review_embedding for each selected restaurant from the database
- Compute the centroid: profile_embedding = mean(restaurant_embeddings); the average embedding becomes the user’s latent taste profile
- If food preference embeddings exist, the final profile averages the restaurant and food preference centroids
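The steps above can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the project's code: `centroid` and `build_profile` are hypothetical names, the two centroids are combined with equal weight, and toy 3-dimensional vectors stand in for the 768-dimensional embeddings.

```python
import numpy as np


def centroid(embeddings):
    """Mean of a non-empty collection of equal-length vectors."""
    return np.mean(np.asarray(embeddings, dtype=float), axis=0)


def build_profile(restaurant_embeddings, food_pref_embeddings=None):
    """Restaurant centroid, averaged with the food-preference centroid when present."""
    profile = centroid(restaurant_embeddings)
    if food_pref_embeddings is not None and len(food_pref_embeddings) > 0:
        profile = (profile + centroid(food_pref_embeddings)) / 2.0
    return profile


# Toy 3-dim vectors stand in for the 768-dim embeddings.
restaurants = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
foods = [[0.0, 0.0, 2.0]]

profile_embedding = build_profile(restaurants, foods)
restaurants_only = build_profile(restaurants)
```

Because the profile is a plain average, adding or removing a restaurant only requires recomputing the mean over the stored embeddings.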
KNN Retrieval with Cosine-via-Euclidean
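
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def knn(point, data, n_neighbors, metric="cosine"):
        # metric is accepted for interface compatibility; inputs are always normalized here.
        # L2-normalize for cosine: ||a-b||² = 2(1 - cos(a,b)) on the unit sphere
        point = point / np.linalg.norm(point)
        data = data / np.linalg.norm(data, axis=1, keepdims=True)
        model = NearestNeighbors(n_neighbors=n_neighbors, algorithm='kd_tree', metric='euclidean')
        model.fit(data)
        # Returns an index array of shape (1, n_neighbors), nearest first
        return model.kneighbors(np.expand_dims(point, 0), n_neighbors, return_distance=False)
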
Cosine similarity is implemented by L2-normalizing both the query and the candidates and then using Euclidean distance on the unit hypersphere. For unit vectors, ‖a−b‖² = 2(1−cos(a,b)), so Euclidean distance decreases monotonically as cosine similarity increases and both produce the same neighbor ranking. This allows the use of scikit-learn’s KD-tree algorithm, which does not support cosine distance.
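The identity is easy to check numerically. The sketch below uses random vectors as stand-ins for real embeddings. It verifies the squared-distance identity for one pair, then confirms that sorting by Euclidean distance and sorting by cosine similarity give the same order for a small random dataset.

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(size=768)
b = rng.normal(size=768)

# Project both vectors onto the unit hypersphere.
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)

cos_sim = float(a_hat @ b_hat)
squared_dist = float(np.sum((a_hat - b_hat) ** 2))

# For unit vectors: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b = 2 (1 - cos(a, b))
identity_holds = bool(np.isclose(squared_dist, 2.0 * (1.0 - cos_sim)))

# Ranking check: nearest-by-Euclidean equals highest-by-cosine after normalization.
data = rng.normal(size=(50, 768))
data_hat = data / np.linalg.norm(data, axis=1, keepdims=True)
by_euclidean = np.argsort(np.linalg.norm(data_hat - a_hat, axis=1))
by_cosine = np.argsort(-(data_hat @ a_hat))
same_ranking = bool(np.array_equal(by_euclidean, by_cosine))
```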
Stochastic candidate sampling: Rather than searching the entire recipe database, the system randomly samples NUM_RANDOM_RECIPES=100 candidates and retrieves the top NUM_RECIPES_RECOMMEND=10 — trading recall for sub-second latency, suitable for real-time web serving.
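A minimal sketch of the sampling step follows, assuming recipe embeddings are held in a single NumPy array. For brevity it scores the sampled pool with a direct dot product rather than the KD-tree shown earlier; `recommend` is a hypothetical function name, and the random arrays stand in for real embeddings.

```python
import numpy as np

NUM_RANDOM_RECIPES = 100     # size of the random candidate pool
NUM_RECIPES_RECOMMEND = 10   # number of recommendations returned


def recommend(profile, recipe_embeddings, rng):
    """Return database indices of the best-matching recipes within a random pool."""
    n_total = recipe_embeddings.shape[0]
    pool_size = min(NUM_RANDOM_RECIPES, n_total)
    pool = rng.choice(n_total, size=pool_size, replace=False)

    candidates = recipe_embeddings[pool]
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    query = profile / np.linalg.norm(profile)

    scores = candidates @ query                    # cosine similarity per candidate
    top = np.argsort(-scores)[:NUM_RECIPES_RECOMMEND]
    return pool[top]                               # map pool positions back to database indices


rng = np.random.default_rng(7)
recipes = rng.normal(size=(1000, 768))             # stand-in for the recipe table
profile = rng.normal(size=768)                     # stand-in for a user profile
picked = recommend(profile, recipes, rng)
```

Because the pool is redrawn on every call, repeated requests for the same profile can surface different recipes, at the cost of sometimes missing the globally best matches.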
Web Application
React SPA Frontend
| Component | Technology |
|---|---|
| Framework | React 17.0.2 (class components) |
| Routing | React Router DOM v6 (HashRouter) |
| Styling | Material-UI v5.3.1 + Emotion + Bootstrap 4.6 |
| HTTP | Axios 0.25.0 with token-based auth headers |
| Layout | react-horizontal-scrolling-menu (recipe carousels) |
Component hierarchy: App.js (global auth state via this.state + localStorage persistence) → LoginPage | NavBar + HomePage (saved + recommended carousels) | PreferencePage (restaurant search + selection) | RecipePage (detail view) | SearchPage (results grid) | ProfilePage.
Django REST Framework Backend
4 Django models:
| Model | Key Fields |
|---|---|
| Recipe | title, duration, ingredients, instructions, keywords, keyword_embedding, text_embedding (768-dim) |
| Restaurant | title, address, description, link, keywords, keyword_embedding, review_embedding (768-dim) |
| Profile | user (OneToOne→User), bio, dietary, food_prefs, restaurants, recipes, profile_embedding (768-dim) |
| FoodImage | title, description, link |
15+ REST API endpoints via DRF’s DefaultRouter:
| Category | Endpoints | Key Features |
|---|---|---|
| Recipes | list, detail, recommend, add, remove, get_recipe | SearchFilter on title/instructions/keywords |
| Restaurants | list, search, add, remove, get_restaurant | SearchFilter on title/address/description/keywords |
| Profiles | get_profile, update_profile, update_preferences | Recompute profile embedding on preference change |
| Auth | login, register, logout | Token authentication, password validation |
Dynamic serializer selection per action (get_serializer_class()) with separate list vs. detail serializers — list views exclude embedding fields for bandwidth efficiency.
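The idea behind per-action serializer selection can be shown without Django. The sketch below is a framework-free analogy rather than DRF code. The field subsets, the `id` field, and the `get_fields` and `serialize` helpers are assumptions chosen for illustration; `get_fields` plays the role that get_serializer_class() plays in a viewset.

```python
LIST_FIELDS = ("id", "title", "duration", "keywords")
DETAIL_FIELDS = LIST_FIELDS + ("ingredients", "instructions", "text_embedding")

# Action names follow the usual viewset vocabulary: "list" and "retrieve".
FIELDS_BY_ACTION = {"list": LIST_FIELDS, "retrieve": DETAIL_FIELDS}


def get_fields(action):
    """Pick the field set for the current action, defaulting to the detail set."""
    return FIELDS_BY_ACTION.get(action, DETAIL_FIELDS)


def serialize(recipe, action):
    """Build the response payload containing only the fields chosen for this action."""
    return {name: recipe[name] for name in get_fields(action)}


recipe = {
    "id": 1,
    "title": "Miso ramen",
    "duration": 45,
    "keywords": "umami, broth",
    "ingredients": "noodles, miso, stock",
    "instructions": "Simmer the stock, then add miso and noodles.",
    "text_embedding": [0.0] * 768,
}

list_payload = serialize(recipe, "list")
detail_payload = serialize(recipe, "retrieve")
```

Leaving the 768-element embedding out of every list item is where the bandwidth saving comes from, since list responses contain many recipes.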
Yelp Web Scraping Pipeline
BeautifulSoup4-based scraper targeting Yelp restaurant pages:
- Review text extraction via CSS class targeting (raw__09f24__T4Ezm)
- Metadata: restaurant title, category tags (parsed from /c href anchors), star ratings
- POS tagging via nltk.pos_tag() to categorize extracted keywords by part of speech (NN, JJ, etc.)
- Multiprocessing with configurable chunk sizes for parallel scraping
- Progress tracking via text files for resumable execution across 150+ restaurants
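The chunking and resumable-progress ideas in the list above can be sketched with the standard library alone. The file layout (one restaurant identifier per line), the helper names `load_done`, `mark_done`, and `chunked`, and the identifiers themselves are all hypothetical; the actual scraping and worker processes are omitted.

```python
import os
import tempfile


def load_done(progress_path):
    """Read the set of already-scraped restaurant identifiers (one per line)."""
    if not os.path.exists(progress_path):
        return set()
    with open(progress_path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}


def mark_done(progress_path, restaurant_id):
    """Append one identifier so that a restarted run can skip it."""
    with open(progress_path, "a", encoding="utf-8") as fh:
        fh.write(restaurant_id + "\n")


def chunked(items, chunk_size):
    """Split work into lists of at most chunk_size items, e.g. one list per worker task."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


progress_file = os.path.join(tempfile.mkdtemp(), "progress.txt")
all_ids = [f"restaurant-{i}" for i in range(7)]

# Pretend an earlier, interrupted run finished the first two restaurants.
mark_done(progress_file, "restaurant-0")
mark_done(progress_file, "restaurant-1")

done = load_done(progress_file)
pending = [rid for rid in all_ids if rid not in done]
chunks = chunked(pending, chunk_size=2)
```

Each chunk could then be handed to a worker process, with every worker calling `mark_done` after a restaurant is scraped so that an interrupted run resumes from the remaining identifiers.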
Deployment (AWS EC2)
| Component | Specification |
|---|---|
| Cloud | AWS EC2 (t2.micro, Ubuntu 20.04) |
| IP | Elastic IP for persistence |
| Reverse proxy | Nginx |
| TLS | HTTPS via SSL certificate |
| CORS | Nginx header management |
| Memory | 2GB swap file for Node.js + Python |
| Database | SQLite3 |
Data loading scripts (save_recipes_to_database.py, save_restaurants_to_database.py) read pre-computed embedding CSVs and populate the Django ORM, decoupling the expensive NLP pipeline from the real-time serving infrastructure.
Tech Stack
Python (3.8+), Django (3.1) + Django REST Framework (3.12), React (17.0.2), Material-UI (5.3), HuggingFace Transformers (BERT, DistilBERT), Sentence-Transformers, scikit-learn (KNN/KD-tree), KeyBERT, YAKE, RAKE-NLTK, NLTK, Gensim (GloVe), BeautifulSoup4, Pandas, NumPy, Axios, Nginx, AWS EC2