Spaces: vespa-engine/colpali-vespa-visual-retrieval


thomasht86 committed on Oct 30, 2024

Commit 5d22e58 (verified) · 1 parent: fa270d9

Upload folder using huggingface_hub


Files changed (4)

README.md +10 -0
prepare_feed_deploy.py +59 -46
requirements.txt +84 -2
vespa_feed_to_hf_dataset.py +42 -0

README.md CHANGED

@@ -126,6 +126,16 @@ python main.py
 ## Deploy to huggingface 🤗
 To deploy, run
 ```bash

 ## Deploy to huggingface 🤗
+### Compiling dependencies
+Before a deploy, make sure to run this to compile the `uv` lock file to `requirements.txt` if you have made changes to the dependencies:
+```bash
+uv pip compile pyproject.toml -o requirements.txt
+```
+### Deploying to huggingface
 To deploy, run
 ```bash

prepare_feed_deploy.py CHANGED

@@ -1,16 +1,16 @@
 # %% [markdown]
 # # Visual PDF Retrieval - demo application
-#
 # In this notebook, we will prepare the Vespa backend application for our visual retrieval demo.
 # We will use ColPali as the model to extract patch vectors from images of pdf pages.
 # At query time, we use MaxSim to retrieve and/or (based on the configuration) rank the page results.
-#
 # To see the application in action, visit TODO:
-#
 # The web application is written in FastHTML, meaning the complete application is written in python.
-#
 # The steps we will take in this notebook are:
-#
 # 0. Setup and configuration
 # 1. Download the data
 # 2. Prepare the data
@@ -18,14 +18,14 @@
 # 4. Deploy the Vespa application
 # 5. Create the Vespa application
 # 6. Feed the data to the Vespa application
-#
 # All the steps that are needed to provision the Vespa application, including feeding the data, can be done from this notebook.
 # We have tried to make it easy for others to run this notebook, to create your own PDF Enterprise Search application using Vespa.
-#
 # %% [markdown]
 # ## 0. Setup and Configuration
-#
 # %%
 import os
@@ -83,11 +83,11 @@ os.environ["TOKENIZERS_PARALLELISM"] = "false"
 # %% [markdown]
 # ### Create a free trial in Vespa Cloud
-#
 # Create a tenant from [here](https://vespa.ai/free-trial/).
 # The trial includes $300 credit.
 # Take note of your tenant name.
-#
 # %%
 VESPA_TENANT_NAME = "vespa-team"
@@ -95,17 +95,17 @@ VESPA_TENANT_NAME = "vespa-team"
 # %% [markdown]
 # Here, set your desired application name. (Will be created in later steps)
 # Note that you can not have hyphen `-` or underscore `_` in the application name.
-#
 # %%
-VESPA_APPLICATION_NAME = "colpalidemo2"
 VESPA_SCHEMA_NAME = "pdf_page"
 # %% [markdown]
 # Next, you need to create some tokens for feeding data, and querying the application.
 # We recommend separate tokens for feeding and querying, (the former with write permission, and the latter with read permission).
 # The tokens can be created from the [Vespa Cloud console](https://console.vespa-cloud.com/) in the 'Account' -> 'Tokens' section.
-#
 # %%
 VESPA_TOKEN_ID_WRITE = "colpalidemo_write"
@@ -113,7 +113,7 @@ VESPA_TOKEN_ID_READ = "colpalidemo_read"
 # %% [markdown]
 # We also need to set the value of the write token to be able to feed data to the Vespa application.
-#
 # %%
 VESPA_CLOUD_SECRET_TOKEN = os.getenv("VESPA_CLOUD_SECRET_TOKEN") or input(
@@ -124,7 +124,7 @@ VESPA_CLOUD_SECRET_TOKEN = os.getenv("VESPA_CLOUD_SECRET_TOKEN") or input(
 # We will also use the Gemini API to create sample queries for our images.
 # You can also use other VLM's to create these queries.
 # Create a Gemini API key from [here](https://aistudio.google.com/app/apikey).
-#
 # %%
 GEMINI_API_KEY = os.getenv("GEMINI_API_KEY") or input(
@@ -152,21 +152,21 @@ processor = ColPaliProcessor.from_pretrained(MODEL_NAME)
 # %% [markdown]
 # ## 1. Download PDFs
-#
 # We are going to use public reports from the Norwegian Government Pension Fund Global (also known as the Oil Fund).
 # The fund puts transparency at the forefront and publishes reports on its investments, holdings, and returns, as well as its strategy and governance.
-#
 # These reports are the ones we are going to use for this showcase.
 # Here are some sample images:
-#
 # ![Sample1](./static/img/gfpg-sample-1.png)
 # ![Sample2](./static/img/gfpg-sample-2.png)
-#
 # %% [markdown]
 # As we can see, a lot of the information is in the form of tables, charts and numbers.
 # These are not easily extractable using pdf-readers or OCR tools.
-#
 # %%
 import requests
@@ -180,16 +180,20 @@ html_content = response.text
 soup = BeautifulSoup(html_content, "html.parser")
 links = []
-# Find all <a> elements with the specific classes
-for a_tag in soup.find_all("a", href=True):
-    classes = a_tag.get("class", [])
-    if "button" in classes and "button--download-secondary" in classes:
         href = a_tag["href"]
         full_url = urljoin(url, href)
         links.append(full_url)
-links
 # %%
 # Limit the number of PDFs to download
@@ -274,7 +278,8 @@ pdfs
 # %% [markdown]
 # ## 2. Convert PDFs to Images
-#
 # %%
 def get_pdf_images(pdf_path):
@@ -300,6 +305,7 @@ for pdf in tqdm(pdfs):
         pdf_pages.append(
             {
                 "title": title,
                 "url": pdf["url"],
                 "path": pdf_file,
                 "image": image,
@@ -324,17 +330,17 @@ print(f"Number of text with length == 0: {Counter(text_lengths)[0]}")
 # %% [markdown]
 # ## 3. Generate Queries
-#
 # In this step, we want to generate queries for each page image.
 # These will be useful for 2 reasons:
-#
 # 1. We can use these queries as typeahead suggestions in the search bar.
 # 2. We can use the queries to generate an evaluation dataset. See [Improving Retrieval with LLM-as-a-judge](https://blog.vespa.ai/improving-retrieval-with-llm-as-a-judge/) for a deeper dive into this topic.
-#
 # The prompt for generating queries is taken from [this](https://danielvanstrien.xyz/posts/post-with-code/colpali/2024-09-23-generate_colpali_dataset.html#an-update-retrieval-focused-prompt) wonderful blog post by Daniel van Strien.
-#
 # We will use the Gemini API to generate these queries, with `gemini-1.5-flash-8b` as the model.
-#
 # %%
 from pydantic import BaseModel
@@ -413,6 +419,7 @@ def generate_queries(image, prompt_text, pydantic_model):
         }
     return queries
 # %%
 for pdf in tqdm(pdf_pages):
     image = pdf.get("image")
@@ -488,9 +495,10 @@ with open("output/pdf_pages.json", "w") as f:
 # %% [markdown]
 # ## 4. Generate embeddings
-#
 # Now that we have the queries, we can use the ColPali model to generate embeddings for each page image.
-#
 # %%
 def generate_embeddings(images, model, processor, batch_size=2) -> np.ndarray:
@@ -530,6 +538,7 @@ def generate_embeddings(images, model, processor, batch_size=2) -> np.ndarray:
     all_embeddings = np.concatenate(embeddings_list, axis=0)
     return all_embeddings
 # %%
 # Generate embeddings for all images
 images = [pdf["image"] for pdf in pdf_pages]
@@ -540,9 +549,10 @@ embeddings.shape
 # %% [markdown]
 # ## 5. Prepare Data on Vespa Format
-#
 # Now, that we have all the data we need, all that remains is to make sure it is in the right format for Vespa.
-#
 # %%
 def float_to_binary_embedding(float_query_embedding: dict) -> dict:
@@ -555,10 +565,12 @@ def float_to_binary_embedding(float_query_embedding: dict) -> dict:
         binary_query_embeddings[k] = binary_vector
     return binary_query_embeddings
 # %%
 vespa_feed = []
 for pdf, embedding in zip(pdf_pages, embeddings):
     url = pdf["url"]
     title = pdf["title"]
     image = pdf["image"]
     text = pdf.get("text", "")
@@ -580,6 +592,7 @@ for pdf, embedding in zip(pdf_pages, embeddings):
             "id": id_hash,
             "url": url,
             "title": title,
             "page_number": page_no,
             "blur_image": base_64_image,
             "full_image": base_64_full_image,
@@ -616,7 +629,7 @@ len(vespa_feed)
 # %% [markdown]
 # ## 5. Prepare Vespa Application
-#
 # %%
 # Define the Vespa schema
@@ -631,6 +644,7 @@ colpali_schema = Schema(
                 match=["word"],
             ),
             Field(name="url", type="string", indexing=["summary", "index"]),
             Field(
                 name="title",
                 type="string",
@@ -720,9 +734,7 @@ colpali_schema = Schema(
         DocumentSummary(
             name="suggestions",
             summary_fields=[
-                Summary(
-                    name="questions"
-                ),
             ],
             from_disk=True,
         ),
@@ -756,11 +768,12 @@ mapfunctions = [
 # Define the 'bm25' rank profile
 colpali_bm25_profile = RankProfile(
     name="bm25",
-    inputs=[("query(qt)", "tensor<float>(querytoken{}, v[128])")],
     first_phase="bm25(title) + bm25(text)",
     functions=mapfunctions,
 )
 # A function to create an inherited rank profile which also returns quantized similarity scores
 def with_quantized_similarity(rank_profile: RankProfile) -> RankProfile:
     return RankProfile(
@@ -770,6 +783,7 @@ def with_quantized_similarity(rank_profile: RankProfile) -> RankProfile:
         summary_features=["quantized"],
     )
 colpali_schema.add_rank_profile(colpali_bm25_profile)
 colpali_schema.add_rank_profile(with_quantized_similarity(colpali_bm25_profile))
@@ -941,7 +955,7 @@ vespa_application_package = ApplicationPackage(
 # %% [markdown]
 # ## 6. Deploy Vespa Application
-#
 # %%
 VESPA_TEAM_API_KEY = os.getenv("VESPA_TEAM_API_KEY") or input(
@@ -966,17 +980,18 @@ print(f"Application deployed. Token endpoint URL: {endpoint_url}")
 # %% [markdown]
 # Make sure to take note of the token endpoint_url.
 # You need to put this in your `.env` file - `VESPA_APP_URL=https://abcd.vespa-app.cloud` - to access the Vespa application from your web application.
-#
 # %% [markdown]
 # ## 8. Feed Data to Vespa
-#
 # %%
 # Instantiate Vespa connection using token
 app = Vespa(url=endpoint_url, vespa_cloud_secret_token=VESPA_CLOUD_SECRET_TOKEN)
 app.get_application_status()
 # %%
 def callback(response: VespaResponse, id: str):
     if not response.is_successful():
@@ -987,5 +1002,3 @@ def callback(response: VespaResponse, id: str):
 # Feed data into Vespa asynchronously
 app.feed_async_iterable(vespa_feed, schema=VESPA_SCHEMA_NAME, callback=callback)

 # %% [markdown]
 # # Visual PDF Retrieval - demo application
+#
 # In this notebook, we will prepare the Vespa backend application for our visual retrieval demo.
 # We will use ColPali as the model to extract patch vectors from images of pdf pages.
 # At query time, we use MaxSim to retrieve and/or (based on the configuration) rank the page results.
+#
 # To see the application in action, visit TODO:
+#
 # The web application is written in FastHTML, meaning the complete application is written in python.
+#
 # The steps we will take in this notebook are:
+#
 # 0. Setup and configuration
 # 1. Download the data
 # 2. Prepare the data
 # 4. Deploy the Vespa application
 # 5. Create the Vespa application
 # 6. Feed the data to the Vespa application
+#
 # All the steps that are needed to provision the Vespa application, including feeding the data, can be done from this notebook.
 # We have tried to make it easy for others to run this notebook, to create your own PDF Enterprise Search application using Vespa.
+#
 # %% [markdown]
 # ## 0. Setup and Configuration
+#
 # %%
 import os
 # %% [markdown]
 # ### Create a free trial in Vespa Cloud
+#
 # Create a tenant from [here](https://vespa.ai/free-trial/).
 # The trial includes $300 credit.
 # Take note of your tenant name.
+#
 # %%
 VESPA_TENANT_NAME = "vespa-team"
 # %% [markdown]
 # Here, set your desired application name. (Will be created in later steps)
 # Note that you can not have hyphen `-` or underscore `_` in the application name.
+#
 # %%
+VESPA_APPLICATION_NAME = "colpalidemo"
 VESPA_SCHEMA_NAME = "pdf_page"
 # %% [markdown]
 # Next, you need to create some tokens for feeding data, and querying the application.
 # We recommend separate tokens for feeding and querying, (the former with write permission, and the latter with read permission).
 # The tokens can be created from the [Vespa Cloud console](https://console.vespa-cloud.com/) in the 'Account' -> 'Tokens' section.
+#
 # %%
 VESPA_TOKEN_ID_WRITE = "colpalidemo_write"
 # %% [markdown]
 # We also need to set the value of the write token to be able to feed data to the Vespa application.
+#
 # %%
 VESPA_CLOUD_SECRET_TOKEN = os.getenv("VESPA_CLOUD_SECRET_TOKEN") or input(
 # We will also use the Gemini API to create sample queries for our images.
 # You can also use other VLM's to create these queries.
 # Create a Gemini API key from [here](https://aistudio.google.com/app/apikey).
+#
 # %%
 GEMINI_API_KEY = os.getenv("GEMINI_API_KEY") or input(
 # %% [markdown]
 # ## 1. Download PDFs
+#
 # We are going to use public reports from the Norwegian Government Pension Fund Global (also known as the Oil Fund).
 # The fund puts transparency at the forefront and publishes reports on its investments, holdings, and returns, as well as its strategy and governance.
+#
 # These reports are the ones we are going to use for this showcase.
 # Here are some sample images:
+#
 # ![Sample1](./static/img/gfpg-sample-1.png)
 # ![Sample2](./static/img/gfpg-sample-2.png)
+#
 # %% [markdown]
 # As we can see, a lot of the information is in the form of tables, charts and numbers.
 # These are not easily extractable using pdf-readers or OCR tools.
+#
 # %%
 import requests
 soup = BeautifulSoup(html_content, "html.parser")
 links = []
+url_to_year = {}
+# Find all 'div's with id starting with 'year-'
+for year_div in soup.find_all("div", id=lambda x: x and x.startswith("year-")):
+    year_id = year_div.get("id", "")
+    year = year_id.replace("year-", "")
+    # Within this div, find all 'a' elements with the specific classes
+    for a_tag in year_div.select("a.button.button--download-secondary[href]"):
         href = a_tag["href"]
         full_url = urljoin(url, href)
         links.append(full_url)
+        url_to_year[full_url] = year
+links, url_to_year
 # %%
 # Limit the number of PDFs to download
 # %% [markdown]
 # ## 2. Convert PDFs to Images
+#
 # %%
 def get_pdf_images(pdf_path):
         pdf_pages.append(
             {
                 "title": title,
+                "year": int(url_to_year[pdf["url"]]),
                 "url": pdf["url"],
                 "path": pdf_file,
                 "image": image,
 # %% [markdown]
 # ## 3. Generate Queries
+#
 # In this step, we want to generate queries for each page image.
 # These will be useful for 2 reasons:
+#
 # 1. We can use these queries as typeahead suggestions in the search bar.
 # 2. We can use the queries to generate an evaluation dataset. See [Improving Retrieval with LLM-as-a-judge](https://blog.vespa.ai/improving-retrieval-with-llm-as-a-judge/) for a deeper dive into this topic.
+#
 # The prompt for generating queries is taken from [this](https://danielvanstrien.xyz/posts/post-with-code/colpali/2024-09-23-generate_colpali_dataset.html#an-update-retrieval-focused-prompt) wonderful blog post by Daniel van Strien.
+#
 # We will use the Gemini API to generate these queries, with `gemini-1.5-flash-8b` as the model.
+#
 # %%
 from pydantic import BaseModel
         }
     return queries
 # %%
 for pdf in tqdm(pdf_pages):
     image = pdf.get("image")
 # %% [markdown]
 # ## 4. Generate embeddings
+#
 # Now that we have the queries, we can use the ColPali model to generate embeddings for each page image.
+#
 # %%
 def generate_embeddings(images, model, processor, batch_size=2) -> np.ndarray:
     all_embeddings = np.concatenate(embeddings_list, axis=0)
     return all_embeddings
 # %%
 # Generate embeddings for all images
 images = [pdf["image"] for pdf in pdf_pages]
 # %% [markdown]
 # ## 5. Prepare Data on Vespa Format
+#
 # Now, that we have all the data we need, all that remains is to make sure it is in the right format for Vespa.
+#
 # %%
 def float_to_binary_embedding(float_query_embedding: dict) -> dict:
         binary_query_embeddings[k] = binary_vector
     return binary_query_embeddings
 # %%
 vespa_feed = []
 for pdf, embedding in zip(pdf_pages, embeddings):
     url = pdf["url"]
+    year = pdf["year"]
     title = pdf["title"]
     image = pdf["image"]
     text = pdf.get("text", "")
             "id": id_hash,
             "url": url,
             "title": title,
+            "year": year,
             "page_number": page_no,
             "blur_image": base_64_image,
             "full_image": base_64_full_image,
 # %% [markdown]
 # ## 5. Prepare Vespa Application
+#
 # %%
 # Define the Vespa schema
                 match=["word"],
             ),
             Field(name="url", type="string", indexing=["summary", "index"]),
+            Field(name="year", type="int", indexing=["summary", "attribute"]),
             Field(
                 name="title",
                 type="string",
         DocumentSummary(
             name="suggestions",
             summary_fields=[
+                Summary(name="questions"),
             ],
             from_disk=True,
         ),
 # Define the 'bm25' rank profile
 colpali_bm25_profile = RankProfile(
     name="bm25",
+    inputs=[("query(qt)", "tensor<float>(querytoken{}, v[128])")],
     first_phase="bm25(title) + bm25(text)",
     functions=mapfunctions,
 )
 # A function to create an inherited rank profile which also returns quantized similarity scores
 def with_quantized_similarity(rank_profile: RankProfile) -> RankProfile:
     return RankProfile(
         summary_features=["quantized"],
     )
 colpali_schema.add_rank_profile(colpali_bm25_profile)
 colpali_schema.add_rank_profile(with_quantized_similarity(colpali_bm25_profile))
 # %% [markdown]
 # ## 6. Deploy Vespa Application
+#
 # %%
 VESPA_TEAM_API_KEY = os.getenv("VESPA_TEAM_API_KEY") or input(
 # %% [markdown]
 # Make sure to take note of the token endpoint_url.
 # You need to put this in your `.env` file - `VESPA_APP_URL=https://abcd.vespa-app.cloud` - to access the Vespa application from your web application.
+#
 # %% [markdown]
 # ## 8. Feed Data to Vespa
+#
 # %%
 # Instantiate Vespa connection using token
 app = Vespa(url=endpoint_url, vespa_cloud_secret_token=VESPA_CLOUD_SECRET_TOKEN)
 app.get_application_status()
 # %%
 def callback(response: VespaResponse, id: str):
     if not response.is_successful():
 # Feed data into Vespa asynchronously
 app.feed_async_iterable(vespa_feed, schema=VESPA_SCHEMA_NAME, callback=callback)
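
Review note on the scraping change in this file: the new loop ties each download link to the year of its enclosing `div` (ids starting with `year-`). Below is a minimal sketch of that idea, assuming the page nests the download buttons inside those year containers. It uses the standard library's `html.parser` in place of BeautifulSoup, purely so the snippet runs without extra packages. The class name `YearLinkParser`, the sample HTML, and the `example.com` base URL are hypothetical and not part of the commit.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class YearLinkParser(HTMLParser):
    """Collect download links plus the year of the enclosing 'year-*' div.

    Hypothetical stand-in for the BeautifulSoup loop in the diff.
    Unlike the CSS selector `[href]`, this skips links with an empty href.
    """

    REQUIRED_CLASSES = {"button", "button--download-secondary"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.url_to_year = {}
        self._year = None     # year of the div we are currently inside, if any
        self._div_depth = 0   # nesting depth of <div> tags within that year div

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            div_id = attrs.get("id") or ""
            if self._year is None and div_id.startswith("year-"):
                self._year = div_id.replace("year-", "")
                self._div_depth = 1
            elif self._year is not None:
                self._div_depth += 1
        elif tag == "a" and self._year is not None:
            classes = set((attrs.get("class") or "").split())
            href = attrs.get("href")
            if href and self.REQUIRED_CLASSES <= classes:
                full_url = urljoin(self.base_url, href)
                self.links.append(full_url)
                self.url_to_year[full_url] = self._year

    def handle_endtag(self, tag):
        # Leave the year context once its outermost div closes.
        if tag == "div" and self._year is not None:
            self._div_depth -= 1
            if self._div_depth == 0:
                self._year = None


# Made-up markup that mimics the assumed page structure.
SAMPLE_HTML = """
<div id="year-2023">
  <div class="row">
    <a class="button button--download-secondary" href="/files/annual-2023.pdf">Download</a>
    <a class="button" href="/files/ignored.pdf">Other</a>
  </div>
</div>
<div id="year-2022">
  <a class="button button--download-secondary" href="/files/annual-2022.pdf">Download</a>
</div>
<a class="button button--download-secondary" href="/files/no-year.pdf">Outside</a>
"""

parser = YearLinkParser("https://example.com/reports/")
parser.feed(SAMPLE_HTML)
links, url_to_year = parser.links, parser.url_to_year
```

In this sketch a link outside any year container is skipped. In the diff, `int(url_to_year[pdf["url"]])` would raise `KeyError` for any URL that never entered the mapping, so a `url_to_year.get(...)` guard may be worth considering if the page layout changes.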

requirements.txt CHANGED

@@ -24,8 +24,15 @@ attrs==24.2.0
     # via aiohttp
 beautifulsoup4==4.12.3
     # via python-fasthtml
 cachetools==5.5.0
     # via google-auth
 certifi==2024.8.30
     # via
     #   httpcore
@@ -39,16 +46,27 @@ click==8.1.7
     # via
     #   typer
     #   uvicorn
 colpali-engine==0.3.1
     # via
     #   visual-retrieval-colpali (pyproject.toml)
     #   vidore-benchmark
 contourpy==1.3.0
     # via matplotlib
 cryptography==43.0.1
     # via pyvespa
 cycler==0.12.1
     # via matplotlib
 datasets==2.21.0
     # via
     #   mteb
@@ -168,11 +186,16 @@ itsdangerous==2.2.0
 jinja2==3.1.4
     # via
     #   pyvespa
     #   torch
 joblib==1.4.2
     # via scikit-learn
 kiwisolver==1.4.7
     # via matplotlib
 loguru==0.7.2
     # via vidore-benchmark
 lucide-fasthtml==0.0.9
@@ -181,6 +204,8 @@ lxml==5.3.0
     # via
     #   lucide-fasthtml
     #   pyvespa
 markdown-it-py==3.0.0
     # via rich
 markupsafe==2.1.5
@@ -201,11 +226,17 @@ multidict==6.1.0
     #   yarl
 multiprocess==0.70.16
     # via datasets
 networkx==3.3
     # via torch
 numpy==1.26.4
     # via
     #   accelerate
     #   colpali-engine
     #   contourpy
     #   datasets
@@ -217,6 +248,8 @@ numpy==1.26.4
     #   scikit-learn
     #   scipy
     #   seaborn
     #   transformers
     #   vidore-benchmark
 oauthlib==3.2.2
@@ -229,7 +262,10 @@ packaging==24.1
     #   huggingface-hub
     #   matplotlib
     #   peft
     #   transformers
 pandas==2.2.3
     # via
     #   datasets
@@ -247,8 +283,14 @@ pillow==10.4.0
     #   pdf2image
     #   sentence-transformers
     #   vidore-benchmark
 polars==1.9.0
     # via mteb
 proto-plus==1.24.0
     # via
     #   google-ai-generativelanguage
@@ -277,8 +319,12 @@ pycparser==2.22
     # via cffi
 pydantic==2.9.2
     # via
     #   google-generativeai
     #   mteb
 pydantic-core==2.23.4
     # via pydantic
 pygments==2.18.0
@@ -334,7 +380,9 @@ requests==2.32.3
     #   mteb
     #   pyvespa
     #   requests-toolbelt
     #   transformers
 requests-toolbelt==1.0.0
     # via pyvespa
 rich==13.9.2
@@ -366,27 +414,47 @@ sentence-transformers==3.1.1
 sentencepiece==0.2.0
     # via vidore-benchmark
 setuptools==75.1.0
-    # via visual-retrieval-colpali (pyproject.toml)
 shad4fast==1.2.1
     # via visual-retrieval-colpali (pyproject.toml)
 shellingham==1.5.4
     # via typer
 six==1.16.0
     # via python-dateutil
 sniffio==1.3.1
     # via
     #   anyio
     #   httpx
 soupsieve==2.6
     # via beautifulsoup4
 sqlite-minutils==3.37.0.post3
     # via fastlite
 starlette==0.39.2
     # via python-fasthtml
 sympy==1.13.3
     # via torch
 tenacity==9.0.0
     # via pyvespa
 threadpoolctl==3.5.0
     # via scikit-learn
 tokenizers==0.20.0
@@ -408,6 +476,7 @@ tqdm==4.66.5
     #   mteb
     #   peft
     #   sentence-transformers
     #   transformers
 transformers==4.45.1
     # via
@@ -416,10 +485,14 @@ transformers==4.45.1
     #   sentence-transformers
     #   vidore-benchmark
 typer==0.12.5
-    # via vidore-benchmark
 typing-extensions==4.12.2
     # via
     #   anyio
     #   google-generativeai
     #   huggingface-hub
     #   mteb
@@ -448,10 +521,19 @@ vespacli==8.391.23
     # via visual-retrieval-colpali (pyproject.toml)
 vidore-benchmark==4.0.0
     # via visual-retrieval-colpali (pyproject.toml)
 watchfiles==0.24.0
     # via uvicorn
 websockets==13.1
     # via uvicorn
 xxhash==3.5.0
     # via datasets
 yarl==1.13.1

     # via aiohttp
 beautifulsoup4==4.12.3
     # via python-fasthtml
+blis==0.7.11
+    # via thinc
 cachetools==5.5.0
     # via google-auth
+catalogue==2.0.10
+    # via
+    #   spacy
+    #   srsly
+    #   thinc
 certifi==2024.8.30
     # via
     #   httpcore
     # via
     #   typer
     #   uvicorn
+cloudpathlib==0.20.0
+    # via weasel
 colpali-engine==0.3.1
     # via
     #   visual-retrieval-colpali (pyproject.toml)
     #   vidore-benchmark
+confection==0.1.5
+    # via
+    #   thinc
+    #   weasel
 contourpy==1.3.0
     # via matplotlib
 cryptography==43.0.1
     # via pyvespa
 cycler==0.12.1
     # via matplotlib
+cymem==2.0.8
+    # via
+    #   preshed
+    #   spacy
+    #   thinc
 datasets==2.21.0
     # via
     #   mteb
 jinja2==3.1.4
     # via
     #   pyvespa
+    #   spacy
     #   torch
 joblib==1.4.2
     # via scikit-learn
 kiwisolver==1.4.7
     # via matplotlib
+langcodes==3.4.1
+    # via spacy
+language-data==1.2.0
+    # via langcodes
 loguru==0.7.2
     # via vidore-benchmark
 lucide-fasthtml==0.0.9
     # via
     #   lucide-fasthtml
     #   pyvespa
+marisa-trie==1.2.1
+    # via language-data
 markdown-it-py==3.0.0
     # via rich
 markupsafe==2.1.5
     #   yarl
 multiprocess==0.70.16
     # via datasets
+murmurhash==1.0.10
+    # via
+    #   preshed
+    #   spacy
+    #   thinc
 networkx==3.3
     # via torch
 numpy==1.26.4
     # via
     #   accelerate
+    #   blis
     #   colpali-engine
     #   contourpy
     #   datasets
     #   scikit-learn
     #   scipy
     #   seaborn
+    #   spacy
+    #   thinc
     #   transformers
     #   vidore-benchmark
 oauthlib==3.2.2
     #   huggingface-hub
     #   matplotlib
     #   peft
+    #   spacy
+    #   thinc
     #   transformers
+    #   weasel
 pandas==2.2.3
     # via
     #   datasets
     #   pdf2image
     #   sentence-transformers
     #   vidore-benchmark
+pip==24.3.1
+    # via visual-retrieval-colpali (pyproject.toml)
 polars==1.9.0
     # via mteb
+preshed==3.0.9
+    # via
+    #   spacy
+    #   thinc
 proto-plus==1.24.0
     # via
     #   google-ai-generativelanguage
     # via cffi
 pydantic==2.9.2
     # via
+    #   confection
     #   google-generativeai
     #   mteb
+    #   spacy
+    #   thinc
+    #   weasel
 pydantic-core==2.23.4
     # via pydantic
 pygments==2.18.0
     #   mteb
     #   pyvespa
     #   requests-toolbelt
+    #   spacy
     #   transformers
+    #   weasel
 requests-toolbelt==1.0.0
     # via pyvespa
 rich==13.9.2
 sentencepiece==0.2.0
     # via vidore-benchmark
 setuptools==75.1.0
+    # via
+    #   visual-retrieval-colpali (pyproject.toml)
+    #   marisa-trie
+    #   spacy
+    #   thinc
 shad4fast==1.2.1
     # via visual-retrieval-colpali (pyproject.toml)
 shellingham==1.5.4
     # via typer
 six==1.16.0
     # via python-dateutil
+smart-open==7.0.5
+    # via weasel
 sniffio==1.3.1
     # via
     #   anyio
     #   httpx
 soupsieve==2.6
     # via beautifulsoup4
+spacy==3.7.5
+    # via visual-retrieval-colpali (pyproject.toml)
+spacy-legacy==3.0.12
+    # via spacy
+spacy-loggers==1.0.5
+    # via spacy
 sqlite-minutils==3.37.0.post3
     # via fastlite
+srsly==2.4.8
+    # via
+    #   confection
+    #   spacy
+    #   thinc
+    #   weasel
 starlette==0.39.2
     # via python-fasthtml
 sympy==1.13.3
     # via torch
 tenacity==9.0.0
     # via pyvespa
+thinc==8.2.5
+    # via spacy
 threadpoolctl==3.5.0
     # via scikit-learn
 tokenizers==0.20.0
     #   mteb
     #   peft
     #   sentence-transformers
+    #   spacy
     #   transformers
 transformers==4.45.1
     # via
     #   sentence-transformers
     #   vidore-benchmark
 typer==0.12.5
+    # via
+    #   spacy
+    #   vidore-benchmark
+    #   weasel
 typing-extensions==4.12.2
     # via
     #   anyio
+    #   cloudpathlib
     #   google-generativeai
     #   huggingface-hub
     #   mteb
     # via visual-retrieval-colpali (pyproject.toml)
 vidore-benchmark==4.0.0
     # via visual-retrieval-colpali (pyproject.toml)
+wasabi==1.1.3
+    # via
+    #   spacy
+    #   thinc
+    #   weasel
 watchfiles==0.24.0
     # via uvicorn
+weasel==0.4.1
+    # via spacy
 websockets==13.1
     # via uvicorn
+wrapt==1.16.0
+    # via smart-open
 xxhash==3.5.0
     # via datasets
 yarl==1.13.1
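
Review note on the compiled requirements: each pin is followed by `# via` annotations naming the packages that pull it in, which makes it possible to see that `spacy` and `pip` are the new direct dependencies and the rest are transitive. The helper below is a rough sketch for reading those annotations. The function name `parse_via` and the trimmed sample text are assumptions for illustration, and the parsing only covers the annotation layout visible in this diff.

```python
def parse_via(requirements_text):
    """Map each pinned package to the list of entries in its '# via' block."""
    via = {}
    current = None
    for line in requirements_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if not stripped.startswith("#"):
            # A pin such as "thinc==8.2.5" starts a new entry.
            current = stripped.split("==")[0]
            via[current] = []
        elif current is not None:
            entry = stripped.lstrip("#").strip()
            # Handle both "# via thinc" and a bare "# via" header line.
            if entry == "via" or entry.startswith("via "):
                entry = entry[len("via"):].strip()
            if entry:
                via[current].append(entry)
    return via


# Sample lines copied from the added side of the diff.
SAMPLE = """\
blis==0.7.11
    # via thinc
catalogue==2.0.10
    # via
    #   spacy
    #   srsly
    #   thinc
spacy==3.7.5
    # via visual-retrieval-colpali (pyproject.toml)
"""

via = parse_via(SAMPLE)
# Packages requested directly by the project rather than by another package.
direct = [pkg for pkg, parents in via.items()
          if any("(pyproject.toml)" in parent for parent in parents)]
```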

vespa_feed_to_hf_dataset.py ADDED

@@ -0,0 +1,42 @@

+import pandas as pd
+from dotenv import load_dotenv
+import os
+import base64
+from PIL import Image
+import io
+from datasets import Dataset, Image as HFImage
+from pathlib import Path
+from tqdm import tqdm
+load_dotenv()
+df = pd.read_json("output/vespa_feed_full.jsonl", lines=True)
+df = pd.json_normalize(df["fields"].tolist())
+dataset_dir = Path("hf_dataset")
+image_dir = dataset_dir / "images"
+os.makedirs(image_dir, exist_ok=True)
+def save_image(image_data, filename):
+    img_data = base64.b64decode(image_data)
+    img = Image.open(io.BytesIO(img_data))
+    img.save(filename)
+for idx, row in tqdm(df.iterrows()):
+    blur_filename = os.path.join(image_dir, f"blur_{idx}.jpg")
+    full_filename = os.path.join(image_dir, f"full_{idx}.jpg")
+    save_image(row["blur_image"], blur_filename)
+    save_image(row["full_image"], full_filename)
+    df.at[idx, "blur_image"] = blur_filename
+    df.at[idx, "full_image"] = full_filename
+# Step 3: Convert to Hugging Face Dataset
+dataset = (
+    Dataset.from_dict(df.to_dict(orient="list"))
+    .cast_column("blur_image", HFImage())
+    .cast_column("full_image", HFImage())
+)
+dataset.push_to_hub("vespa-engine/gpfg-QA", private=True)
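
Review note on `save_image`: the script decodes each base64 field and re-saves it through PIL. The sketch below shows only the decode-and-write step using the standard library, so it writes the decoded bytes unchanged instead of re-encoding them. The helper name `save_base64_file`, the placeholder payload, and the sample `row` are hypothetical; the real feed rows carry actual image data.

```python
import base64
import tempfile
from pathlib import Path


def save_base64_file(b64_data: str, filename: Path) -> int:
    """Decode a base64 string and write the raw bytes to `filename`.

    Returns the number of bytes written.
    """
    raw = base64.b64decode(b64_data)
    filename.write_bytes(raw)
    return len(raw)


# Placeholder bytes standing in for an encoded image (not a valid JPEG).
payload = b"\xff\xd8\xff\xe0 placeholder bytes, not a real image"
# Hypothetical record shaped like one row of the normalized feed.
row = {"id": "abc", "blur_image": base64.b64encode(payload).decode("ascii")}

with tempfile.TemporaryDirectory() as tmp:
    image_dir = Path(tmp) / "images"
    image_dir.mkdir(parents=True, exist_ok=True)
    target = image_dir / "blur_0.jpg"
    written = save_base64_file(row["blur_image"], target)
    # Confirm the file on disk matches the original bytes.
    roundtrip_ok = target.read_bytes() == payload
```

Writing the bytes directly preserves the original encoding, whereas going through PIL, as the script does, also validates that the payload is a readable image.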