Krystian Safjan's Bloghttps://www.safjan.com/2024-02-22T00:00:00+01:00Data Scientist | Researcher | Team Leader<br><br> working at Ernst & Young and writing about <a href="/category/machine-learning.html">Data Science and Visualization</a>, on <a href="/category/machine-learning.html">Machine Learning, Deep Learning</a> and <a href="/tag/nlp/">NLP</a>. There are also some <a href="/category/howto.html">howto</a> posts on tools and workflows.Open Source LLM Observability Tools and Platforms2024-02-22T00:00:00+01:002024-02-22T00:00:00+01:00Krystian Safjantag:www.safjan.com,2024-02-22:/open-source-llm-observability-tools-and-platforms/<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<p><a id="llm-observability-in-the-context-of-llmops-for-generative-ai"></a></p>
<h2>LLM Observability in the Context of LLMOps for Generative AI</h2>
<p>AI is transforming the world, and one area where it has made significant strides is generative models, particularly Large Language Models (LLMs) such as GPT-3, which are built on the transformer architecture. However, as impressive as these models are, managing, monitoring, and understanding their behavior and output remains a challenge. Enter LLMOps, a new field focused on the management and deployment of LLMs; a key aspect of it is LLM Observability. </p>
<ul>
<li><a href="#llm-observability-in-the-context-of-llmops-for-generative-ai">LLM Observability in the Context of LLMOps for Generative AI</a></li>
<li><a href="#what-is-llm-observability">What is LLM Observability?</a></li>
<li><a href="#expected-functionalities-of-an-llm-observability-solution">Expected Functionalities of an LLM Observability Solution</a></li>
<li><a href="#open-source-llm-observability-tools-and-platforms">Open Source LLM Observability Tools and Platforms</a></li>
<li><a href="#other---related">Other - related</a></li>
<li><a href="#references">References</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="what-is-llm-observability"></a></p>
<h2>What is LLM Observability?</h2>
<p>LLM Observability is the ability to understand, monitor, and infer the internal state of an LLM from its external outputs. It encompasses several areas including model health monitoring, performance tracking, debugging, and evaluating model fairness and safety. </p>
<p>In the context of LLMOps, LLM Observability is critical. LLMs are complex and can be unpredictable, producing outputs that range from harmless to potentially harmful or biased. It's therefore essential to have the right tools and methodologies for observing and understanding these models' behaviors in real-time, during training, testing, and after deployment.</p>
<p><a id="expected-functionalities-of-an-llm-observability-solution"></a></p>
<h2>Expected Functionalities of an LLM Observability Solution</h2>
<ol>
<li>
<p><strong>Model Performance Monitoring</strong>: An observability solution should be able to track and monitor the performance of an LLM in real-time. This includes tracking metrics like accuracy, precision, recall, and F1 score, as well as more specific metrics like perplexity or token costs in the case of language models.</p>
</li>
<li>
<p><strong>Model Health Monitoring</strong>: The solution should be capable of monitoring the overall health of the model, identifying and alerting on anomalies or potentially problematic patterns in the model's behavior.</p>
</li>
<li>
<p><strong>Debugging and Error Tracking</strong>: If something does go wrong, the solution should provide debugging and error tracking functionalities, helping developers identify, trace, and fix issues.</p>
</li>
<li>
<p><strong>Fairness, Bias, and Safety Evaluation</strong>: Given the potential for bias and ethical issues in AI, any observability solution should include features for evaluating fairness and safety, helping ensure that the model's outputs are unbiased and ethically sound.</p>
</li>
<li>
<p><strong>Interpretability</strong>: LLMs can often be "black boxes", producing outputs without clear reasoning. A good observability solution should help make the model's decision-making process more transparent, providing insights into why a particular output was produced.</p>
</li>
<li>
<p><strong>Integration with Existing LLMOps Tools</strong>: Finally, the solution should be capable of integrating with existing LLMOps tools and workflows, from model development and training to deployment and maintenance.</p>
</li>
</ol>
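<p>To make the first two functionalities concrete, here is a minimal sketch of the idea (all names are illustrative, not taken from any particular tool): a wrapper that records per-call latency and rough token counts, the kind of raw signal an observability backend would aggregate and alert on:</p>

```python
import time
from dataclasses import dataclass, field


@dataclass
class CallRecord:
    """One observed LLM call: latency plus rough token counts."""
    latency_s: float
    prompt_tokens: int
    completion_tokens: int


@dataclass
class LLMMonitor:
    """Collects per-call metrics; a real tool would export these to a backend."""
    records: list = field(default_factory=list)

    def observe(self, llm_fn, prompt: str) -> str:
        start = time.perf_counter()
        completion = llm_fn(prompt)  # the model call being observed
        self.records.append(CallRecord(
            latency_s=time.perf_counter() - start,
            prompt_tokens=len(prompt.split()),        # crude whitespace tokenization
            completion_tokens=len(completion.split()),
        ))
        return completion

    def avg_latency(self) -> float:
        return sum(r.latency_s for r in self.records) / len(self.records)
```

<p>A stub model (e.g. <code>lambda p: p.upper()</code>) is enough to see the mechanics: the wrapper is transparent to callers while metrics accumulate on the side, which is the basic pattern the tools listed below build on.</p>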
<blockquote>
<p>LLM Observability is a crucial aspect of LLMOps for generative AI. It provides the <strong>visibility</strong> and <strong>control</strong> needed <strong>to effectively manage, deploy, and maintain Large Language Models</strong>, ensuring they <strong>perform as expected, are free from bias, and are safe to use</strong>.</p>
</blockquote>
<p><a id="open-source-llm-observability-tools-and-platforms"></a></p>
<h2>Open Source LLM Observability Tools and Platforms</h2>
<ol>
<li><a href="https://github.com/aavetis/azure-openai-logger">Azure OpenAI Logger</a> - <img alt="github stars shield" src="https://img.shields.io/github/stars/aavetis/azure-openai-logger.svg?logo=github"> - "Batteries included" logging solution for your Azure OpenAI instance.</li>
<li><a href="https://github.com/deepchecks/deepchecks">Deepchecks</a> - <img alt="github stars shield" src="https://img.shields.io/github/stars/deepchecks/deepchecks.svg?logo=github"> - Tests for Continuous Validation of ML Models & Data. Deepchecks is a Python package for comprehensively validating your machine learning models and data with minimal effort.</li>
<li><a href="https://github.com/evidentlyai/evidently">Evidently</a> - <img alt="github stars shield" src="https://img.shields.io/github/stars/evidentlyai/evidently.svg?logo=github"> - Evaluate and monitor ML models from validation to production.</li>
<li><a href="https://github.com/Giskard-AI/giskard">Giskard</a> - <img alt="github stars shield" src="https://img.shields.io/github/stars/Giskard-AI/giskard.svg?logo=github"> - Testing framework dedicated to ML models, from tabular to LLMs. Detect risks of biases, performance issues and errors in 4 lines of code.</li>
<li><a href="https://github.com/whylabs/whylogs">whylogs</a> - <img alt="github stars shield" src="https://img.shields.io/github/stars/whylabs/whylogs.svg?logo=github"> - The open standard for data logging</li>
<li><a href="https://github.com/lunary-ai/lunary">lunary</a> - <img alt="github stars shield" src="https://img.shields.io/github/stars/lunary-ai/lunary.svg?logo=github"> - The production toolkit for LLMs. observability, prompt management, and evaluations.</li>
<li><a href="https://github.com/traceloop/openllmetry">openllmetry</a> - <img alt="github stars shield" src="https://img.shields.io/github/stars/traceloop/openllmetry.svg?logo=github"> - Open-source observability for your LLM application, based on OpenTelemetry</li>
<li><a href="https://github.com/Arize-ai/phoenix">phoenix (Arize Ai)</a> - <img alt="github stars shield" src="https://img.shields.io/github/stars/Arize-ai/phoenix.svg?logo=github"> - AI Observability & Evaluation - Evaluate, troubleshoot, and fine-tune your LLM, CV, and NLP models in a notebook.</li>
<li><a href="https://github.com/langfuse/langfuse">langfuse</a> - <img alt="github stars shield" src="https://img.shields.io/github/stars/langfuse/langfuse.svg?logo=github"> - Open source LLM engineering platform. observability, metrics, evals, prompt management SDKs + integrations for Typescript, Python</li>
<li><a href="https://github.com/whylabs/langkit">LangKit</a> - <img alt="github stars shield" src="https://img.shields.io/github/stars/whylabs/langkit.svg?logo=github"> - An open-source toolkit for monitoring Large Language Models (LLMs). Extracts signals from prompts & responses, ensuring safety & security. Features include text quality, relevance metrics, & sentiment analysis. Comprehensive tool for LLM observability.</li>
<li><a href="https://github.com/AgentOps-AI/agentops">agentops</a> - <img alt="github stars shield" src="https://img.shields.io/github/stars/AgentOps-AI/agentops.svg?logo=github"> - Python SDK for agent evals and observability</li>
<li><a href="https://github.com/pezzolabs/pezzo">pezzo</a> - <img alt="github stars shield" src="https://img.shields.io/github/stars/pezzolabs/pezzo.svg?logo=github"> - Open-source, developer-first LLMOps platform designed to streamline prompt design, version management, instant delivery, collaboration, troubleshooting, observability and more.</li>
<li><a href="https://github.com/fiddler-labs/fiddler-auditor">Fiddler AI</a> - <img alt="github stars shield" src="https://img.shields.io/github/stars/fiddler-labs/fiddler-auditor.svg?logo=github"> - Evaluate, monitor, analyse, and improve machine learning and generative models from pre-production to production. Ship more ML and LLMs into production, and monitor ML and LLM metrics like hallucination, PII, and toxicity.</li>
<li><a href="https://github.com/Theodo-UK/OmniLog">OmniLog</a> - <img alt="github stars shield" src="https://img.shields.io/github/stars/Theodo-UK/OmniLog.svg?logo=github"> - Observability tool for your LLM prompts.</li>
</ol>
<p><a id="other---related"></a></p>
<h2>Other - related</h2>
<ul>
<li><a href="https://github.com/great-expectations/great_expectations">Great Expectations</a> - Always know what to expect from your data.</li>
<li><a href="https://github.com/AgentOps-AI/tokencost">AgentOps-AI/tokencost</a> - Easy token price estimates for LLMs</li>
<li><a href="https://github.com/YANG-DB/observability-prompots">observability prompts</a> - LLM observability related prompts</li>
<li><a href="https://github.com/AstronomerAmber/LLM_Observability">LLM Observability</a> </li>
<li><a href="https://github.com/BoundaryML/baml">baml</a> - A programming language to build strongly-typed LLM functions. Testing and observability included</li>
<li><a href="https://github.com/fluxninja/aperture">aperture</a> - Rate limiting, caching, and request prioritization for modern workloads</li>
</ul>
<p><a id="references"></a></p>
<h2>References</h2>
<ul>
<li><a href="https://towardsdatascience.com/llm-monitoring-and-observability-c28121e75c2f">LLM Monitoring and Observability — A Summary of Techniques and Approaches for Responsible AI | by Josh Poduska | Towards Data Science</a></li>
<li><a href="https://www.aporia.com/learn/how-to-monitor-large-language-models/">Monitoring LLMs: Metrics, challenges, & hallucinations</a></li>
<li><a href="https://github.com/mattcvincent/intro-llm-observability">mattcvincent/intro-<em>llm</em>-<em>observability</em></a> - Intro to LLM Observability</li>
<li><a href="https://www.33rdsquare.com/what-is-perplexity-ai/">Demystifying Perplexity: An AI Expert‘s Comprehensive Guide - 33rd Square</a></li>
<li><a href="https://huggingface.co/spaces/evaluate-metric/perplexity">Perplexity - a Hugging Face Space by evaluate-metric</a></li>
</ul>The Most Powerful Mac Productivity and Automation Apps2024-01-24T00:00:00+01:002024-01-24T00:00:00+01:00Krystian Safjantag:www.safjan.com,2024-01-24:/the-most-powerful-mac-productivity-and-automation-apps/<ol>
<li><a href="https://www.alfredapp.com/">Alfred</a>: A productivity app for Mac OS X, which boosts your efficiency with hotkeys, keywords, text expansion, and more. </li>
<li><a href="https://folivora.ai/">BetterTouchTool</a>: Allows you to configure many types of gestures for your Mac’s Trackpad, Magic Mouse, and Keyboard.</li>
<li><a href="https://www.noodlesoft.com/">Hazel</a>: A system preference pane that works silently in the background, automatically filing, organizing, and cleaning up your desktop.</li>
<li><a href="https://support.apple.com/guide/automator/welcome/mac">Automator</a>: A built-in Mac utility for automating tasks. You can create workflows, watch folders, and set up automated actions.</li>
<li><a href="https://www.keyboardmaestro.com/">Keyboard Maestro</a>: Enhances the power of your keyboard by creating macros that can automate virtually anything on your Mac. </li>
<li><a href="https://qsapp.com/">QuickSilver</a>: A light, fast, and free Mac application launcher that also replaces your task switcher.</li>
<li><a href="https://culturedcode.com/things/">Things</a>: Task management software that makes it easy to stay organized and get things done.</li>
<li><a href="https://ulysses.app/">Ulysses</a>: A feature-rich text editor for writers that allows you to manage and organize all your writing in a single app. </li>
<li><a href="https://www.folivora.ai/bettersnaptool">BetterSnapTool</a>: Allows users to quickly and easily manage their window positions and sizes by either dragging them to one of the screen's corners or to the top, left or right side of the screen.</li>
<li><a href="https://www.macbartender.com/">Bartender</a>: Lets you organize your menu bar apps by hiding them, rearranging them, or moving them to the Bartender Bar. </li>
<li><a href="http://magnet.crowdcafe.com/">Magnet</a>: Keeps your workspace organized and allows you to snap application windows in different halves or quarters of your screen.</li>
</ol>Avoid using curl -u “username:secret”!2024-01-20T00:00:00+01:002024-01-20T00:00:00+01:00Krystian Safjantag:www.safjan.com,2024-01-20:/avoid-using-curl-u-usernamesecret/<p>When invoking an endpoint guarded by Basic Authentication, you might resort to the -u username:password feature in curl.</p>
<p><code>curl -u "jane@examplewebsite.com:mySecretGuard" http://api.myawesomeapp.com/information</code></p>
<p>However, this approach is not the most efficient or secure.</p>
<p>In executing this command, the credentials are archived in your shell history, posing a considerable security threat.</p>
<p>On the bright side, there's a straightforward solution to this issue!</p>
<p>Now you can generate a file in your home directory titled <code>.netrc</code> as shown below:</p>
<div class="highlight"><pre><span></span><code><span class="n">machine</span><span class="w"> </span><span class="n">api</span><span class="p">.</span><span class="n">myawesomeapp</span><span class="p">.</span><span class="n">com</span><span class="w"> </span>
<span class="w"> </span><span class="n">login</span><span class="w"> </span><span class="n">jane</span><span class="nv">@examplewebsite</span><span class="p">.</span><span class="n">com</span><span class="w"> </span>
<span class="w"> </span><span class="n">password</span><span class="w"> </span><span class="n">mySecretGuard</span><span class="w"> </span>
</code></pre></div>
<p>Afterwards, when running the curl command, just include -n and the credentials will be fetched from the file you just created.</p>
<p><code>curl -n http://api.myawesomeapp.com/information</code></p>
<p>To give you more context, curl is a command-line tool for getting or sending data using URL syntax. It supports various protocols, including but not limited to HTTP, HTTPS, FTP, and FTPS. Curl is widely used for making API requests.</p>
<p>In addition, the <code>.netrc</code> file is a special file that stores login and initialisation information used by the auto-login process. It generally resides in the user's home directory. This file can contain information like the name of the machine to which to connect, and any necessary usernames and passwords.</p>
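<p>As an aside, <code>.netrc</code> is understood by more than curl: Python's standard <code>netrc</code> module parses the same format, which is convenient for scripts that share credentials with command-line tools. A small sketch using the illustrative host and credentials from this post (written to a temporary file here rather than the real <code>~/.netrc</code>):</p>

```python
import netrc
import tempfile

# The same machine/login/password format that curl's -n option reads.
content = (
    "machine api.myawesomeapp.com\n"
    "  login jane@examplewebsite.com\n"
    "  password mySecretGuard\n"
)

with tempfile.NamedTemporaryFile("w", suffix=".netrc", delete=False) as f:
    f.write(content)
    path = f.name

# authenticators() returns a (login, account, password) tuple for the machine.
login, account, password = netrc.netrc(path).authenticators("api.myawesomeapp.com")
print(login, password)  # jane@examplewebsite.com mySecretGuard
```

<p>For the real <code>~/.netrc</code>, keep permissions tight (e.g. <code>chmod 600 ~/.netrc</code>): the file stores a plaintext password, and some tools refuse to read a netrc that other users can access.</p>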
<p>On a final note, remember that this method works only with the curl command. Other command-line tools may require different approaches to secure authentication. Always prioritise data security by opting for methods that safeguard your login credentials.</p>HTML5 interactive elements2024-01-04T00:00:00+01:002024-01-04T00:00:00+01:00Krystian Safjantag:www.safjan.com,2024-01-04:/html5-interactive-elements/<h1>HTML5 Interactive Elements: An Overview and Usage Guide</h1>
<p>HyperText Markup Language (HTML) is the standard markup language for documents designed to be rendered in a web browser. Over the years, HTML has evolved to keep up with the growing need for better structure and interactivity. </p>
<p>HTML5, the latest version, introduces several interactive tags or elements, which makes building interactive, dynamic web content easier without having to resort to JavaScript or CSS. Let's dive into these interactive elements and have a look at some examples to understand their usage better.</p>
<h2>The <code><details></code> and <code><summary></code> Elements</h2>
<p>The <code><details></code> and <code><summary></code> tags allow us to create an interactive widget that the user can open or close. The <code><summary></code> tag is a child of the <code><details></code> tag, representing the summary or brief description of the content in <code><details></code>.</p>
<div class="highlight"><pre><span></span><code><span class="p"><</span><span class="nt">details</span><span class="p">></span>
<span class="p"><</span><span class="nt">summary</span><span class="p">></span>The Solar System<span class="p"></</span><span class="nt">summary</span><span class="p">></span>
<span class="p"><</span><span class="nt">p</span><span class="p">></span>The Solar System includes the Sun, the Earth (where you are now!) and all the other planets.<span class="p"></</span><span class="nt">p</span><span class="p">></span>
<span class="p"></</span><span class="nt">details</span><span class="p">></span>
</code></pre></div>
<details>
<summary>The Solar System</summary>
<p>The Solar System includes the Sun, the Earth (where you are now!) and all the other planets.</p>
</details>
<h2>The <code><dialog></code> Element</h2>
<p>The <code><dialog></code> element presents content in a dialogue box or a window. You can toggle the visibility of the <code><dialog></code> by changing the 'open' attribute.</p>
<div class="highlight"><pre><span></span><code><span class="p"><</span><span class="nt">dialog</span> <span class="na">open</span><span class="p">></span>
This is a dialog box!<span class="p"><</span><span class="nt">br</span><span class="p">></span>
<span class="p"><</span><span class="nt">button</span> <span class="na">onclick</span><span class="o">=</span><span class="s">"this.parentElement.close()"</span><span class="p">></span>Close<span class="p"></</span><span class="nt">button</span><span class="p">></span>
<span class="p"></</span><span class="nt">dialog</span><span class="p">></span>
</code></pre></div>
<p><dialog open>
This is a dialog box!<br>
<button onclick="this.parentElement.close()">Close</button>
</dialog></p>
<h2>The <code><datalist></code> Element</h2>
<p>The <code><datalist></code> element permits the creation of pre-defined options for an <code><input></code> element. Users can either select one of the options or type their own value.</p>
<div class="highlight"><pre><span></span><code><span class="p"><</span><span class="nt">label</span> <span class="na">for</span><span class="o">=</span><span class="s">"browsers"</span><span class="p">></span>Choose a browser from the list:<span class="p"></</span><span class="nt">label</span><span class="p">></span>
<span class="p"><</span><span class="nt">input</span> <span class="na">list</span><span class="o">=</span><span class="s">"browsers"</span> <span class="na">name</span><span class="o">=</span><span class="s">"browser"</span> <span class="na">id</span><span class="o">=</span><span class="s">"browser"</span><span class="p">></span>
<span class="p"><</span><span class="nt">datalist</span> <span class="na">id</span><span class="o">=</span><span class="s">"browsers"</span><span class="p">></span>
<span class="p"><</span><span class="nt">option</span> <span class="na">value</span><span class="o">=</span><span class="s">"Chrome"</span><span class="p">></span>
<span class="p"><</span><span class="nt">option</span> <span class="na">value</span><span class="o">=</span><span class="s">"Firefox"</span><span class="p">></span>
<span class="p"><</span><span class="nt">option</span> <span class="na">value</span><span class="o">=</span><span class="s">"Internet Explorer"</span><span class="p">></span>
<span class="p"><</span><span class="nt">option</span> <span class="na">value</span><span class="o">=</span><span class="s">"Opera"</span><span class="p">></span>
<span class="p"><</span><span class="nt">option</span> <span class="na">value</span><span class="o">=</span><span class="s">"Safari"</span><span class="p">></span>
<span class="p"></</span><span class="nt">datalist</span><span class="p">></span>
</code></pre></div>
<p><label for="browsers">Choose a browser from the list:</label>
<input list="browsers" name="browser" id="browser">
<datalist id="browsers">
<option value="Chrome">
<option value="Firefox">
<option value="Internet Explorer">
<option value="Opera">
<option value="Safari">
</datalist>
<h2>The <code><progress></code> Element</h2>
<p>The <code><progress></code> element serves to represent the progress of a task. Use the <code>value</code> attribute to specify the current progress and the <code>max</code> attribute to indicate the progress bar's maximum value.</p>
<div class="highlight"><pre><span></span><code><span class="p"><</span><span class="nt">progress</span> <span class="na">value</span><span class="o">=</span><span class="s">"70"</span> <span class="na">max</span><span class="o">=</span><span class="s">"100"</span><span class="p">></</span><span class="nt">progress</span><span class="p">></span>
</code></pre></div>
<progress value="70" max="100"></progress>
<h2>The <code><meter></code> Element</h2>
<p>The <code><meter></code> tag is used to represent a scalar measurement within a known range, or a fractional value. This could be disk usage, the relevance of a query result, or any other form of gauge.</p>
<div class="highlight"><pre><span></span><code>Disk usage: <span class="p"><</span><span class="nt">meter</span> <span class="na">value</span><span class="o">=</span><span class="s">"0.6"</span><span class="p">></span>60%<span class="p"></</span><span class="nt">meter</span><span class="p">></span>
</code></pre></div>
Disk usage: <meter value="0.6">60%</meter>
<h2>The <code><output></code> Element</h2>
<p>The <code><output></code> tag is a container for calculation results. To link the output element with other elements, you can use the <code>for</code> attribute.</p>
<div class="highlight"><pre><span></span><code><span class="p"><</span><span class="nt">form</span> <span class="na">oninput</span><span class="o">=</span><span class="s">"x.value=parseInt(a.value)+parseInt(b.value)"</span><span class="p">></span>
0<span class="p"><</span><span class="nt">input</span> <span class="na">type</span><span class="o">=</span><span class="s">"range"</span> <span class="na">id</span><span class="o">=</span><span class="s">"a"</span> <span class="na">value</span><span class="o">=</span><span class="s">"50"</span><span class="p">></span>100 +
0<span class="p"><</span><span class="nt">input</span> <span class="na">type</span><span class="o">=</span><span class="s">"range"</span> <span class="na">id</span><span class="o">=</span><span class="s">"b"</span> <span class="na">value</span><span class="o">=</span><span class="s">"50"</span><span class="p">></span>100 =
<span class="p"><</span><span class="nt">output</span> <span class="na">name</span><span class="o">=</span><span class="s">"x"</span> <span class="na">for</span><span class="o">=</span><span class="s">"a b"</span><span class="p">></</span><span class="nt">output</span><span class="p">></span>
<span class="p"></</span><span class="nt">form</span><span class="p">></span>
</code></pre></div>
<form oninput="x.value=parseInt(a.value)+parseInt(b.value)">
0<input type="range" id="a" value="50">100 +
0<input type="range" id="b" value="50">100 =
<output name="x" for="a b"></output>
</form>
<h2>The <code><canvas></code> Element</h2>
<p>The <code><canvas></code> tag allows for dynamic and scriptable rendering of shapes and bitmap images. It's a low-level, procedural model that updates a bitmap.</p>
<div class="highlight"><pre><span></span><code><span class="p"><</span><span class="nt">canvas</span> <span class="na">id</span><span class="o">=</span><span class="s">"myCanvas"</span> <span class="na">width</span><span class="o">=</span><span class="s">"200"</span> <span class="na">height</span><span class="o">=</span><span class="s">"100"</span> <span class="na">style</span><span class="o">=</span><span class="s">"border:1px solid #000000;"</span><span class="p">></span>
<span class="p"></</span><span class="nt">canvas</span><span class="p">></span>
</code></pre></div>
<p>You can then use JavaScript to interact with this element:</p>
<div class="highlight"><pre><span></span><code><span class="kd">var</span><span class="w"> </span><span class="nx">c</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">document</span><span class="p">.</span><span class="nx">getElementById</span><span class="p">(</span><span class="s2">"myCanvas"</span><span class="p">);</span>
<span class="kd">var</span><span class="w"> </span><span class="nx">ctx</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nx">c</span><span class="p">.</span><span class="nx">getContext</span><span class="p">(</span><span class="s2">"2d"</span><span class="p">);</span>
<span class="nx">ctx</span><span class="p">.</span><span class="nx">fillStyle</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s2">"#FF0000"</span><span class="p">;</span>
<span class="nx">ctx</span><span class="p">.</span><span class="nx">fillRect</span><span class="p">(</span><span class="mf">0</span><span class="p">,</span><span class="w"> </span><span class="mf">0</span><span class="p">,</span><span class="w"> </span><span class="mf">80</span><span class="p">,</span><span class="w"> </span><span class="mf">80</span><span class="p">);</span>
</code></pre></div>
<canvas id="myCanvas" width="200" height="100" style="border:1px solid #000000;">
</canvas>
<script>
var c = document.getElementById("myCanvas");
var ctx = c.getContext("2d");
ctx.fillStyle = "#FF0000";
ctx.fillRect(0, 0, 80, 80);
</script>
</p>entr - run arbitrary command when files change2024-01-01T00:00:00+01:002024-01-01T00:00:00+01:00Krystian Safjantag:www.safjan.com,2024-01-01:/entr-run-arbitrary-command-when-files-change/<p><code>entr</code> is a UNIX utility which runs arbitrary commands when files change. It helps in automating tasks during development such as rebuilding projects, running tests, or syncing files.</p>
<p>Here's a simple usage example:</p>
<div class="highlight"><pre><span></span><code>ls *.c | entr make
</code></pre></div>
<p>In the above example, <code>ls *.c</code> lists all C files in the directory. This list is piped (<code>|</code>) into <code>entr</code>. When any of these files changes, <code>entr</code> executes the <code>make</code> command.</p>
<p>Some key features of <code>entr</code> include:</p>
<ul>
<li>It frees up developers to focus on the code by automating rebuild tasks.</li>
<li>It doesn't require a configuration file or a list of tasks to run. It just reruns the command you provide it each time a file changes.</li>
<li>You can use it with any command that needs to operate on a file. This might be shell commands, like <code>ls</code> or <code>echo</code>, or any other CLI tool you have in your system. </li>
</ul>
<p>Useful options for <code>entr</code> include:</p>
<ul>
<li><code>-r</code> : To restart a long running process like a server when a file changes.</li>
<li><code>-p</code> : Postpone execution until files are updated.</li>
<li><code>-s</code> : Evaluate the first argument using the interpreter specified by the SHELL environment variable.</li>
<li><code>-d</code> : Track directories recursively and include files that are created after the utility starts.</li>
</ul>
<p>Please note that <code>entr</code> requires a list of files as input. It does not discover files on its own; it expects to receive a list of files from stdin, which is usually supplied with command-line utilities like <code>ls</code>, <code>find</code> or <code>git ls-files</code>.</p>Tverski Similarity Metrics2023-12-10T00:00:00+01:002023-12-10T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-12-10:/tverski-similarity-metrics/<p>Tversky similarity and <a href="https://www.safjan.com/jaro-winkler-similarity/">Jaro-Winkler similarity</a> are two different similarity metrics used to compare strings or sequences. They are designed for specific purposes and have different mathematical formulas and applications.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#tversky-similarity">Tversky Similarity</a></li>
<li><a href="#formula">Formula</a></li>
<li><a href="#python-example">Python Example</a></li>
<li><a href="#jaro-winkler-similarity-for-reference">Jaro-Winkler Similarity (for reference)</a></li>
<li><a href="#summary">Summary</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="tversky-similarity"></a></p>
<h2>Tversky Similarity</h2>
<p><strong>Tversky similarity is a metric used to compare sets</strong>, typically in the context of information retrieval, retrieval evaluation, and recommendation systems. It was introduced by Amos Tversky in his 1977 work on features of similarity. Tversky similarity takes into account the <strong>number of common elements</strong> between two sets as well as the <strong>differences in elements between them</strong>. It has two parameters, alpha and beta, which control the balance between precision and recall.</p>
<p>Let's dive into the mathematical formula, explanation, and Python examples for Tversky similarity.</p>
<p><a id="formula"></a></p>
<h3>Formula</h3>
<p>Tversky similarity measures the similarity between two sets A and B, considering the trade-off between false positives and false negatives. The formula for Tversky similarity is:</p>
<div class="math">$$
Tversky(A, B) = \frac{|A \cap B|}{|A \cap B| + \alpha |A - B| + \beta |B - A|}
$$</div>
<p>Where:
- <span class="math">\(|A \cap B|\)</span> is the size of the intersection of sets A and B.
- <span class="math">\(|A - B|\)</span> is the size of the set difference of A minus B.
- <span class="math">\(|B - A|\)</span> is the size of the set difference of B minus A.
- <span class="math">\(\alpha\)</span> and <span class="math">\(\beta\)</span> are parameters that control the trade-off between precision and recall. When <span class="math">\(\alpha = \beta = 1\)</span>, the Tversky similarity becomes the Jaccard similarity.</p>
<p><a id="python-example"></a></p>
<h3>Python Example</h3>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">tversky_similarity</span><span class="p">(</span><span class="n">set_a</span><span class="p">,</span> <span class="n">set_b</span><span class="p">,</span> <span class="n">alpha</span><span class="p">,</span> <span class="n">beta</span><span class="p">):</span>
<span class="n">intersection</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">set_a</span><span class="o">.</span><span class="n">intersection</span><span class="p">(</span><span class="n">set_b</span><span class="p">))</span>
<span class="n">a_minus_b</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">set_a</span><span class="o">.</span><span class="n">difference</span><span class="p">(</span><span class="n">set_b</span><span class="p">))</span>
<span class="n">b_minus_a</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">set_b</span><span class="o">.</span><span class="n">difference</span><span class="p">(</span><span class="n">set_a</span><span class="p">))</span>
<span class="n">similarity</span> <span class="o">=</span> <span class="n">intersection</span> <span class="o">/</span> <span class="p">(</span><span class="n">intersection</span> <span class="o">+</span> <span class="n">alpha</span> <span class="o">*</span> <span class="n">a_minus_b</span> <span class="o">+</span> <span class="n">beta</span> <span class="o">*</span> <span class="n">b_minus_a</span><span class="p">)</span>
<span class="k">return</span> <span class="n">similarity</span>
<span class="n">set1</span> <span class="o">=</span> <span class="p">{</span><span class="s2">"apple"</span><span class="p">,</span> <span class="s2">"banana"</span><span class="p">,</span> <span class="s2">"cherry"</span><span class="p">}</span>
<span class="n">set2</span> <span class="o">=</span> <span class="p">{</span><span class="s2">"banana"</span><span class="p">,</span> <span class="s2">"cherry"</span><span class="p">,</span> <span class="s2">"date"</span><span class="p">,</span> <span class="s2">"elderberry"</span><span class="p">}</span>
<span class="n">alpha</span> <span class="o">=</span> <span class="mf">0.5</span>
<span class="n">beta</span> <span class="o">=</span> <span class="mf">0.5</span>
<span class="n">similarity</span> <span class="o">=</span> <span class="n">tversky_similarity</span><span class="p">(</span><span class="n">set1</span><span class="p">,</span> <span class="n">set2</span><span class="p">,</span> <span class="n">alpha</span><span class="p">,</span> <span class="n">beta</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Tversky Similarity:"</span><span class="p">,</span> <span class="n">similarity</span><span class="p">)</span>
</code></pre></div>
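<p>As a quick sanity check of the function, the example values above can be worked out by hand (this standalone sketch mirrors the snippet, with <code>alpha = beta = 0.5</code>):</p>

```python
def tversky_similarity(set_a, set_b, alpha, beta):
    # Size of the overlap and of each side's unique elements
    intersection = len(set_a & set_b)
    a_minus_b = len(set_a - set_b)
    b_minus_a = len(set_b - set_a)
    return intersection / (intersection + alpha * a_minus_b + beta * b_minus_a)

set1 = {"apple", "banana", "cherry"}
set2 = {"banana", "cherry", "date", "elderberry"}

# |A ∩ B| = 2, |A - B| = 1, |B - A| = 2, so with alpha = beta = 0.5:
# 2 / (2 + 0.5*1 + 0.5*2) = 2 / 3.5 ≈ 0.5714
print(round(tversky_similarity(set1, set2, 0.5, 0.5), 4))  # 0.5714
```

<p>With <code>alpha = beta = 1</code> the same call returns 2/5 = 0.4, the Jaccard similarity of the two sets.</p>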
<p><a id="jaro-winkler-similarity-for-reference"></a></p>
<h2>Jaro-Winkler Similarity (for reference)</h2>
<p>Jaro-Winkler similarity is a metric used to compare two strings, often used in record linkage and fuzzy string matching tasks. It builds on the Jaro distance developed by Matthew A. Jaro, which William E. Winkler later extended. Jaro-Winkler similarity calculates a score between 0 and 1, where 1 indicates a perfect match and 0 indicates no similarity. It considers the number of matching characters between two strings and the positions of those matching characters, and gives more weight to a common prefix, making it particularly useful for comparing names and short strings. For more information about Jaro-Winkler similarity see: <a href="https://www.safjan.com/jaro-winkler-similarity/">Jaro-Winkler Similarity</a>.</p>
<p><a id="summary"></a></p>
<h2>Summary</h2>
<p>The main differences between Tversky similarity and Jaro-Winkler similarity are:</p>
<ul>
<li><strong>Application Domain:</strong> Tversky similarity is used to compare sets, while Jaro-Winkler similarity is used to compare strings.</li>
<li><strong>Parameters:</strong> Tversky similarity has parameters alpha and beta to control precision and recall, while Jaro-Winkler similarity does not have such parameters.</li>
<li><strong>Target Data:</strong> Tversky similarity works with sets of items, while Jaro-Winkler similarity works with individual strings.</li>
<li><strong>Use Cases:</strong> Tversky similarity is commonly used in information retrieval and recommendation systems, while Jaro-Winkler similarity is used in fuzzy string matching and record linkage tasks.</li>
</ul>
<p>X::<a href="https://www.safjan.com/jaro-winkler-similarity/">Jaro-Winkler Similarity</a></p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>GitHub Search Techniques2023-12-07T00:00:00+01:002023-12-07T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-12-07:/github-search-techniques/<ol>
<li>
<p><strong>Search By Name</strong>: Use "in:name" along with your search term to find repositories with that name. Example: "Ruby-Projects in:name".</p>
</li>
<li>
<p><strong>Search By Description</strong>: Use "in:description" along with your search term to find repositories with that term in their description. Example …</p></li></ol><ol>
<li>
<p><strong>Search By Name</strong>: Use "in:name" along with your search term to find repositories with that name. Example: "Ruby-Projects in:name".</p>
</li>
<li>
<p><strong>Search By Description</strong>: Use "in:description" along with your search term to find repositories with that term in their description. Example: "machine learning in:description".</p>
</li>
<li>
<p><strong>Search By Readme</strong>: Use "in:readme" along with your search term to find repositories with that term in their README file. Example: "learn ruby in:readme".</p>
</li>
<li>
<p><strong>Search By Topic</strong>: Use "topic:" followed by a topic name to find repositories tagged with that topic. Example: "topic:mobile-development".</p>
</li>
<li>
<p><strong>Search By Organization</strong>: Use "org:" along with your search term to find repositories from a specific organization. Example: "org:Microsoft".</p>
</li>
<li>
<p><strong>Search By License</strong>: Use "license:" along with your search term to find open-source repositories that match a certain license. Example: "license:Apache-2.0".</p>
</li>
<li>
<p><strong>Search By Stars</strong>: Use "stars:>" followed by a number to find repositories with more than that number of stars (use "stars:>=" to include the number itself). Example: "stars:>1000".</p>
</li>
<li>
<p><strong>Search By Date</strong>: Use "created:" or "pushed:" followed by a date in the format "YYYY-MM-DD" to find repositories created or updated after a certain date. Example: "created:>2023-06-01".</p>
</li>
<li>
<p><strong>Search By Forks</strong>: Use "forks:>" followed by a number to find repositories that have been forked more than that number of times. Example: "forks:>1000".</p>
</li>
<li>
<p><strong>Search By Language</strong>: Use "language:" with your search term to find repositories in a specific programming language. Example: "language:ruby".</p>
</li>
<li>
<p><strong>Search by Last Push</strong>: Use "pushed:>" followed by a date to find repositories updated after a certain date. Example: "pushed:>2023-03-01 rails".</p>
</li>
</ol>
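<p>These qualifiers also work programmatically. As a rough sketch (the query terms are arbitrary examples), a combined query can be percent-encoded for GitHub's repository search API endpoint:</p>

```python
from urllib.parse import quote

# Combine several qualifiers into a single search query
query = "language:ruby stars:>1000 pushed:>2023-03-01"

# Percent-encode it for the GitHub search API; fetch with any HTTP client
url = "https://api.github.com/search/repositories?q=" + quote(query, safe="")
print(url)
```

<p>The same encoded query string works in the browser address bar or with <code>curl</code>.</p>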
<p>These qualifiers can be combined freely, turning the task of finding the right repository among millions into a quick and productive one.</p>Databricks Curriculum - From Zero to Hero2023-12-04T00:00:00+01:002023-12-04T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-12-04:/databricks-curriculum-from-zero-to-hero/<h2>Stage 1: Beginner</h2>
<h3>Topic 1: Introduction to Databricks</h3>
<ul>
<li><strong>Prerequisites:</strong> None</li>
<li><strong>Enables:</strong> Understanding of what Databricks is and what it can do.</li>
<li>
<p><strong>Reasoning:</strong> As a starting point, you need to understand what Databricks is and why it's used.</p>
</li>
<li>
<p>Understand the concept of Databricks …</p></li></ul><h2>Stage 1: Beginner</h2>
<h3>Topic 1: Introduction to Databricks</h3>
<ul>
<li><strong>Prerequisites:</strong> None</li>
<li><strong>Enables:</strong> Understanding of what Databricks is and what it can do.</li>
<li>
<p><strong>Reasoning:</strong> As a starting point, you need to understand what Databricks is and why it's used.</p>
</li>
<li>
<p>Understand the concept of Databricks</p>
</li>
<li>Learn about the history and evolution of Databricks</li>
<li>Understand the benefits and use-cases of Databricks</li>
<li>Explore the architecture of Databricks</li>
</ul>
<h3>Topic 2: Setting up Databricks</h3>
<ul>
<li><strong>Prerequisites:</strong> Introduction to Databricks</li>
<li><strong>Enables:</strong> Ability to setup and navigate the Databricks environment.</li>
<li>
<p><strong>Reasoning:</strong> Before you can use Databricks, you need to know how to set it up and navigate the platform.</p>
</li>
<li>
<p>Create a Databricks account</p>
</li>
<li>Understand the Databricks workspace</li>
<li>Learn how to create a Databricks cluster</li>
<li>Learn how to create notebooks and libraries</li>
<li>Understand how to manage and monitor clusters</li>
</ul>
<h3>Topic 3: Introduction to Apache Spark</h3>
<ul>
<li><strong>Prerequisites:</strong> Setting up Databricks</li>
<li><strong>Enables:</strong> Understanding of Apache Spark and its importance in Databricks.</li>
<li>
<p><strong>Reasoning:</strong> Databricks is built on Apache Spark, so understanding Spark is crucial.</p>
</li>
<li>
<p>Understand the concept of Apache Spark</p>
</li>
<li>Learn about the history and evolution of Apache Spark</li>
<li>Understand the architecture of Apache Spark</li>
<li>Explore the core components of Spark: Spark SQL, Spark Streaming, MLlib, and GraphX</li>
<li>Understand how Spark integrates with Databricks</li>
</ul>
<h3>Topic 4: Basic Data Processing with Databricks</h3>
<ul>
<li><strong>Prerequisites:</strong> Introduction to Apache Spark</li>
<li><strong>Enables:</strong> Ability to perform basic data processing tasks in Databricks.</li>
<li>
<p><strong>Reasoning:</strong> Data processing is a key function of Databricks.</p>
</li>
<li>
<p>Understand the concept of data processing</p>
</li>
<li>Learn how to load and inspect data in Databricks</li>
<li>Understand the basic operations on data such as filtering, aggregation, and transformation</li>
<li>Learn how to visualize data in Databricks</li>
<li>Understand how to save and export processed data</li>
</ul>
<h2>Stage 2: Intermediate</h2>
<h3>Topic 5: DataFrames and SQL in Databricks</h3>
<ul>
<li><strong>Prerequisites:</strong> Basic Data Processing with Databricks</li>
<li><strong>Enables:</strong> Ability to use DataFrames and SQL for data manipulation in Databricks.</li>
<li>
<p><strong>Reasoning:</strong> DataFrames and SQL are essential tools for data manipulation in Databricks.</p>
</li>
<li>
<p>Understand the concept of DataFrames in Spark</p>
</li>
<li>Learn how to create DataFrames from different data sources</li>
<li>Perform operations on DataFrames such as select, filter, and aggregate</li>
<li>Understand the concept of SQL in Spark</li>
<li>Learn how to perform SQL queries on DataFrames</li>
<li>Understand how to convert between DataFrames and SQL</li>
</ul>
<h3>Topic 6: ETL Processes in Databricks</h3>
<ul>
<li><strong>Prerequisites:</strong> DataFrames and SQL in Databricks</li>
<li><strong>Enables:</strong> Understanding and implementation of ETL processes in Databricks.</li>
<li>
<p><strong>Reasoning:</strong> ETL (Extract, Transform, Load) processes are a key part of data processing in Databricks.</p>
</li>
<li>
<p>Understand the concept of ETL (Extract, Transform, Load)</p>
</li>
<li>Learn how to extract data from different sources in Databricks</li>
<li>Understand how to transform data using Spark transformations</li>
<li>Learn how to load data into different destinations</li>
<li>Perform a complete ETL process on a sample dataset</li>
</ul>
<h3>Topic 7: Machine Learning with Databricks</h3>
<ul>
<li><strong>Prerequisites:</strong> ETL Processes in Databricks</li>
<li><strong>Enables:</strong> Ability to use Databricks for machine learning tasks.</li>
<li>
<p><strong>Reasoning:</strong> Machine learning is a powerful tool for data analysis, and Databricks provides robust support for machine learning tasks.</p>
</li>
<li>
<p>Understand the concept of machine learning</p>
</li>
<li>Learn about the machine learning library in Spark (MLlib)</li>
<li>Understand the machine learning workflow: data preparation, model training, model evaluation, and model deployment</li>
<li>Learn how to prepare data for machine learning</li>
<li>Train and evaluate a machine learning model on a sample dataset</li>
</ul>
<h2>Stage 3: Advanced</h2>
<h3>Topic 8: Stream Processing in Databricks</h3>
<ul>
<li><strong>Prerequisites:</strong> Machine Learning with Databricks</li>
<li><strong>Enables:</strong> Ability to handle real-time data streams in Databricks.</li>
<li>
<p><strong>Reasoning:</strong> Real-time data processing is a critical capability in many data-intensive applications.</p>
</li>
<li>
<p>Understand the concept of stream processing</p>
</li>
<li>Learn about Spark Streaming and its integration with Databricks</li>
<li>Understand how to ingest real-time data streams</li>
<li>Learn how to perform transformations and actions on data streams</li>
<li>Understand how to output data streams to various destinations</li>
</ul>
<h3>Topic 9: Advanced Spark Programming in Databricks</h3>
<ul>
<li><strong>Prerequisites:</strong> Stream Processing in Databricks</li>
<li><strong>Enables:</strong> Mastery of advanced Spark programming techniques in Databricks.</li>
<li>
<p><strong>Reasoning:</strong> To fully leverage the power of Databricks, you need to be proficient in advanced Spark programming techniques.</p>
</li>
<li>
<p>Deepen understanding of Spark's core concepts</p>
</li>
<li>Learn about Spark's advanced features such as Spark's Catalyst Optimizer, Tungsten Execution Engine, and GraphX for graph processing</li>
<li>Understand how to optimize Spark applications for performance</li>
<li>Learn how to debug and troubleshoot Spark applications</li>
<li>Understand how to manage and monitor Spark applications in Databricks</li>
</ul>
<h3>Topic 10: Databricks for Data Science</h3>
<ul>
<li><strong>Prerequisites:</strong> Advanced Spark Programming in Databricks</li>
<li><strong>Enables:</strong> Ability to use Databricks as a tool for advanced data science tasks.</li>
<li>
<p><strong>Reasoning:</strong> Databricks is a powerful tool for data science, and mastering its use for these tasks will enable you to tackle complex data science problems.</p>
</li>
<li>
<p>Understand how Databricks can be used for advanced data science tasks</p>
</li>
<li>Learn about Databricks' integration with popular data science libraries and tools</li>
<li>Understand how to perform exploratory data analysis in Databricks</li>
<li>Learn how to build, evaluate, and tune advanced machine learning models</li>
<li>Understand how to deploy machine learning models in Databricks</li>
</ul>
<p>This curriculum provides a comprehensive path from beginner to advanced user of Databricks. By following this path, you will gain a deep understanding of Databricks and be able to use it effectively for a wide range of data processing and data science tasks.</p>Databricks - key concepts2023-12-04T00:00:00+01:002023-12-04T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-12-04:/databricks-key-concepts/<script type="module"> import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs'; mermaid.initialize({ startOnLoad: true }); </script>
<pre class="mermaid">
mindmap
Databricks
Databricks Workspace
Databricks Runtime
Databricks File System (DBFS)
Databricks Clusters
Databricks Notebooks
Databricks Jobs
Databricks Tables
</pre>
<p>Here are some of the …</p><script type="module"> import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs'; mermaid.initialize({ startOnLoad: true }); </script>
<pre class="mermaid">
mindmap
Databricks
Databricks Workspace
Databricks Runtime
Databricks File System (DBFS)
Databricks Clusters
Databricks Notebooks
Databricks Jobs
Databricks Tables
</pre>
<p>Here are some of the key features and components of Databricks:</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#databricks-workspace">Databricks Workspace</a></li>
<li><a href="#databricks-runtime">Databricks Runtime</a></li>
<li><a href="#databricks-file-system-dbfs">Databricks File System (DBFS)</a></li>
<li><a href="#databricks-clusters">Databricks Clusters</a></li>
<li><a href="#databricks-notebooks">Databricks Notebooks</a></li>
<li><a href="#databricks-jobs">Databricks Jobs</a></li>
<li><a href="#databricks-tables">Databricks Tables</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="databricks-workspace"></a></p>
<h2>Databricks Workspace</h2>
<p>This is the collaborative environment where you can write code, create visualizations, and share your work with others. It supports several languages including Python, SQL, R, and Scala.
Read more: <a href="https://docs.databricks.com/en/administration-guide/workspace/index.html#what-is-a-workspace">Create and manage your Databricks workspaces | Databricks on AWS</a></p>
<p><a id="databricks-runtime"></a></p>
<h2>Databricks Runtime</h2>
<p>This is the set of core components that run on the clusters in Databricks. It includes Apache Spark but also includes other enhancements maintained by Databricks like performance optimizations, security, and integration with other tools like Delta Lake and MLflow.
Read more: <a href="https://www.databricks.com/glossary/what-is-databricks-runtime">What is Databricks Runtime?</a></p>
<p><a id="databricks-file-system-dbfs"></a></p>
<h2>Databricks File System (DBFS)</h2>
<p>This is a distributed file system installed on Databricks clusters. It allows you to store data and share it across all nodes in a cluster.
Read more: <a href="https://docs.databricks.com/en/dbfs/index.html">What is the Databricks File System (DBFS)?</a></p>
<p><a id="databricks-clusters"></a></p>
<h2>Databricks Clusters</h2>
<p>These are the compute resources that run your code. You can create clusters of different sizes and types depending on your workload.
Read more: <a href="https://learn.microsoft.com/en-us/azure/databricks/clusters/">Compute - Azure Databricks</a></p>
<p><a id="databricks-notebooks"></a></p>
<h2>Databricks Notebooks</h2>
<p>These are collaborative documents that contain code, visualizations, and text. They're great for exploratory data analysis, data science, and machine learning workflows.
Read more: <a href="https://docs.databricks.com/en/notebooks/index.html">Introduction to Databricks notebooks</a></p>
<p><a id="databricks-jobs"></a></p>
<h2>Databricks Jobs</h2>
<p>These are the tasks or computations you run on Databricks. You can schedule jobs to run periodically, or run them on demand.
Read more: <a href="https://docs.databricks.com/en/workflows/jobs/create-run-jobs.html">Create and run Databricks Jobs</a></p>
<p><a id="databricks-tables"></a></p>
<h2>Databricks Tables</h2>
<p>These are the structured data sources that you can query using SQL or data frame APIs in Python, R, and Scala.
Read more: <a href="https://www.databricks.com/product/delta-live-tables">Delta Live Tables</a></p>Semantic Type Detection2023-12-01T00:00:00+01:002023-12-01T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-12-01:/semantic-type-detection/<p>Semantic type detection is an important task in table representation learning, as it involves labeling table columns with standardized semantic types. This can help with <strong>understanding the contents of a table</strong> and can be used for various applications such as data discovery …</p><p>Semantic type detection is an important task in table representation learning, as it involves labeling table columns with standardized semantic types. This can help with <strong>understanding the contents of a table</strong> and can be used for various applications such as data discovery, data validation, and data integration. By accurately detecting the semantic types of columns, machine learning <strong>models can better understand the relationships between columns</strong> and <strong>improve their performance</strong> on tasks like table comprehension and data discovery. Additionally, semantic type detection can help with data integration, as it can help map columns from different sources that may have different naming conventions or formats.</p>
<p>X::<a href="https://www.safjan.com/table-representation-learning/">Table Representation Learning</a></p>Table Representation Learning2023-12-01T00:00:00+01:002023-12-01T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-12-01:/table-representation-learning/<p>Table representation learning is an exciting field that focuses on understanding the structure and relationships within tabular data. This can involve <strong>learning embeddings for individual columns</strong> or <strong>entire tables</strong>, and can be used for various applications such as data discovery, data validation …</p><p>Table representation learning is an exciting field that focuses on understanding the structure and relationships within tabular data. This can involve <strong>learning embeddings for individual columns</strong> or <strong>entire tables</strong>, and can be used for various applications such as data discovery, data validation, and data integration.</p>
<p>One key aspect of table representation learning is <strong>understanding the semantics of column</strong>s, which can be used to <strong>generate metadata</strong> and help with tasks like <strong>table comprehension</strong> and <strong>data discovery</strong>.</p>
<p>By accurately representing columns and their relationships, table representation learning can help improve machine learning models and enable more complex analysis of tabular data.</p>
<p>X::<a href="https://www.safjan.com/semantic-type-detection/">Semantic Type Detection</a></p>Using Mermaid Diagrams in Pelican Blog Post2023-11-28T00:00:00+01:002023-11-28T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-11-28:/mermaid-in-pelican-post/<p>Sometimes, you might want to embed the mermaid diagram in your blogpost written in markdown. Here is how to do it.</p>
<h2>Embed the HTML code (recommended)</h2>
<p>In your markdown file, you can embed HTML code loading mermaid code and initialising it, then …</p><p>Sometimes, you might want to embed the mermaid diagram in your blogpost written in markdown. Here is how to do it.</p>
<h2>Embed the HTML code (recommended)</h2>
<p>In your markdown file, you can embed HTML code that loads mermaid and initialises it, then include the mermaid diagram you want.</p>
<div class="highlight"><pre><span></span><code><span class="p"><</span><span class="nt">script</span> <span class="na">type</span><span class="o">=</span><span class="s">"module"</span><span class="p">></span><span class="w"> </span><span class="k">import</span><span class="w"> </span><span class="nx">mermaid</span><span class="w"> </span><span class="kr">from</span><span class="w"> </span><span class="s1">'https://cdn.jsdelivr.net/npm/mermaid@10/dist/mermaid.esm.min.mjs'</span><span class="p">;</span><span class="w"> </span><span class="nx">mermaid</span><span class="p">.</span><span class="nx">initialize</span><span class="p">({</span><span class="w"> </span><span class="nx">startOnLoad</span><span class="o">:</span><span class="w"> </span><span class="kc">true</span><span class="w"> </span><span class="p">});</span><span class="w"> </span><span class="p"></</span><span class="nt">script</span><span class="p">></span>
Here is a mermaid diagram:
<span class="p"><</span><span class="nt">pre</span> <span class="na">class</span><span class="o">=</span><span class="s">"mermaid"</span><span class="p">></span>
graph TD
A[Client] --> B[Load Balancer]
B --> C[Server01]
B --> D[Server02]
<span class="p"></</span><span class="nt">pre</span><span class="p">></span>
</code></pre></div>
<h2>Extension</h2>
<p>There is also a Markdown extension, though it may not work with recent versions:</p>
<p><a href="https://github.com/Lee-W/md_mermaid">Lee-W/md_mermaid</a> - mermaid extension to add support for mermaid graph inside markdown file. NOTE: you need Markdown<3.2 (e.g. 3.1.1)</p>Store Output of the Command Into Array in Bash2023-11-13T00:00:00+01:002023-11-13T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-11-13:/store-output-of-the-command-into-array-in-bash/<p>Both <code>mapfile</code> and <code>read -a</code> can be used to store the output of a command or a list of values into an array. However, the <code>mapfile</code> command is generally preferred when reading lines from a file, while <code>read -a</code> is well-suited for …</p><p>Both <code>mapfile</code> and <code>read -a</code> can be used to store the output of a command or a list of values into an array. However, the <code>mapfile</code> command is generally preferred when reading lines from a file, while <code>read -a</code> is well-suited for reading space-separated values from a string.</p>
<p>Let's assume that we want to store all top-level directories located in the projects folder, i.e. keep each project directory name as an array element.</p>
<div class="highlight"><pre><span></span><code><span class="nv">projects</span><span class="o">=(</span><span class="s2">"</span><span class="nv">$HOME</span><span class="s2">"</span>/projects/*<span class="o">)</span>
<span class="c1"># Using 'find' command with '-print0' to handle directory names with special characters</span>
<span class="k">while</span><span class="w"> </span><span class="nv">IFS</span><span class="o">=</span><span class="w"> </span><span class="nb">read</span><span class="w"> </span>-r<span class="w"> </span>-d<span class="w"> </span><span class="s1">$'\0'</span><span class="w"> </span>line<span class="p">;</span><span class="w"> </span><span class="k">do</span>
<span class="w"> </span><span class="nv">projects</span><span class="o">+=(</span><span class="s2">"</span><span class="nv">$line</span><span class="s2">"</span><span class="o">)</span>
<span class="k">done</span><span class="w"> </span><<span class="w"> </span><<span class="o">(</span>find<span class="w"> </span><span class="s2">"</span><span class="si">${</span><span class="nv">projects</span><span class="p">[@]</span><span class="si">}</span><span class="s2">"</span><span class="w"> </span>-maxdepth<span class="w"> </span><span class="m">0</span><span class="w"> </span>-type<span class="w"> </span>d<span class="w"> </span>-print0<span class="o">)</span>
</code></pre></div>
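<p>As mentioned in the introduction, <code>mapfile</code> is often the cleaner tool for reading lines into an array. A sketch of the same task using <code>mapfile</code> (this assumes bash 4.4+ for the <code>-d</code> option; <code>PROJECTS_DIR</code> is a hypothetical override added so the snippet is self-contained):</p>

```shell
# Read null-delimited directory names from find straight into an array
base="${PROJECTS_DIR:-$HOME/projects}"
mapfile -d '' -t projects < <(find "$base" -mindepth 1 -maxdepth 1 -type d -print0)
printf 'found %d project(s)\n' "${#projects[@]}"
```

<p>The <code>-t</code> flag strips the trailing delimiter from each entry, so the array elements are clean directory paths.</p>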
<p>In the provided code, the <code>read</code> command is used together with some parameters. Here is a brief explanation:</p>
<ul>
<li>
<p><code>-a</code> : This option is used when we want <code>read</code> to split its input into words and store them in an array. Note that the snippet above does not use <code>-a</code>; instead, each line read in the loop is appended to the array manually.</p>
</li>
<li>
<p><code>-r</code> : This option prevents backslash escapes from being interpreted. It helps you to read the strings "as is".</p>
</li>
<li>
<p><code>-d $'\0'</code> : This tells <code>read</code> to continue until it encounters a null byte (<code>\0</code>), which is the delimiter used by <code>find . -print0</code>.</p>
</li>
</ul>
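<p>To see these flags in isolation, here is a minimal self-contained sketch (the sample strings are made up for the demonstration):</p>

```shell
# -a: split a whitespace-separated line into array elements
read -r -a words <<< "alpha beta gamma"
echo "${#words[@]} ${words[1]}"   # 3 beta

# -d '': read up to the first null byte (as produced by find -print0)
IFS= read -r -d '' first < <(printf 'one\0two\0')
echo "$first"   # one
```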
<p>So <code>read -r -d $'\0' line</code> reads input separated by null characters into the variable <code>line</code>. This is done inside a <code>while</code> loop, which repeats the read for each directory returned by <code>find</code>; each one is then appended to the <code>projects</code> array.</p>
<p>The while loop structure <code>while IFS= read -r -d $'\0' line; do</code> is commonly used in shell scripting to read lines from a file (or in this case, results from a command substitution) in a safe manner that preserves whitespace and special characters.</p>
<p><code>IFS=</code> is used to temporarily clear the Internal Field Separator variable, which is used by <code>read</code> to split the input line into separate fields. By clearing it, we ensure that <code>read</code> treats each line as a whole, even if it includes spaces.</p>
<p>In this script, the <code>find</code> command is used with the <code>-print0</code> option to output names using a null character as a delimiter, which helps in dealing with directory names that include spaces or other special characters. The <code>-maxdepth 0</code> option ensures that only the directories themselves (not their subdirectories) are listed. The <code>-type d</code> filter ensures that only directories are returned.</p>
<p>The <code>while</code> loop with <code>IFS= read -r -d $'\0'</code> handles the null-delimited output from <code>find</code>. Within the loop, each line is appended to the <code>projects</code> array, so when the loop finishes the array holds one entry per directory.</p>The Importance of Adding a `py.typed` File to Your Typed Package2023-11-13T00:00:00+01:002023-11-13T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-11-13:/the-importance-of-adding-py-typed-file-to-your-typed-package/<p>In Python programming, type checking is an important practice that helps ensure the correctness of your code. The <code>mypy</code> type checker is a powerful tool that uses type annotations to verify your code. However, it might not recognize the type …</p><p>In Python programming, type checking is an important practice that helps ensure the correctness of your code. The <code>mypy</code> type checker is a powerful tool that uses type annotations to verify your code. However, it will not recognize the type hints provided by your package unless you include a <code>py.typed</code> file. This is a common oversight that can lead to incorrectly published packages.</p>
<h2>Understanding <code>py.typed</code></h2>
<p><strong>The <code>py.typed</code> file is a marker file that indicates to type checkers like <code>mypy</code> that your package comes with type annotations.</strong> Without this file, the type checker won't use the type hints provided by your package, leading to potential type errors. This requirement is outlined in <a href="https://www.python.org/dev/peps/pep-0561/#packaging-type-information">PEP-561</a> and the <a href="https://mypy.readthedocs.io/en/stable/installed_packages.html#making-pep-561-compatible-packages">mypy documentation</a>.</p>
<h2>Adding <code>py.typed</code> to Your Package</h2>
<p>Adding a <code>py.typed</code> file to your package is straightforward. Simply create an empty file named <code>py.typed</code> in your package directory and include it in your distribution.</p>
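<p>Assuming your package lives in a directory named <code>mypackage</code> (a placeholder name), creating the marker is a one-liner:</p>

```shell
# "mypackage" is a placeholder for your actual package directory.
mkdir -p mypackage          # the directory already exists in a real project
touch mypackage/py.typed    # the empty marker file
```
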
<p>If you're using <a href="https://python-poetry.org/">poetry</a>, you can add the following lines under the <code>[tool.poetry]</code> section of <code>pyproject.toml</code>:</p>
<div class="highlight"><pre><span></span><code><span class="n">packages</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="p">{</span><span class="n">include</span><span class="w"> </span><span class="p">=</span><span class="w"> </span><span class="s2">"mypackage"</span><span class="p">},</span>
<span class="w"> </span><span class="p">{</span><span class="n">include</span><span class="w"> </span><span class="p">=</span><span class="w"> </span><span class="s2">"mypackage/py.typed"</span><span class="p">},</span>
<span class="p">]</span>
</code></pre></div>
<p>For those using <code>setup.py</code>, you can add <code>package_data</code> to the <code>setup</code> call:</p>
<div class="highlight"><pre><span></span><code><span class="n">setup</span><span class="p">(</span>
<span class="n">package_data</span><span class="o">=</span><span class="p">{</span><span class="s2">"mypackage"</span><span class="p">:</span> <span class="p">[</span><span class="s2">"py.typed"</span><span class="p">]},</span>
<span class="p">)</span>
</code></pre></div>
<p>After adding the <code>py.typed</code> file, release <a href="https://github.com/whtsky/pixelmatch-py/commit/9c6297cedd10232ffbe23cc54a4e46e76d1fa13a">a new version of your package</a>. This ensures that the type information shipped with your package is picked up as expected.</p>
<h2>Conclusion</h2>
<p>If you're a Python package maintainer, don't forget to include a <code>py.typed</code> file in your typed package. This simple step can make a significant difference in ensuring the correctness of your code and the usability of your package. It's a small effort that goes a long way in maintaining the quality and reliability of your Python package.</p>
<p><strong>Credits</strong> to <a href="https://dev.to/whtsky">Wu Haotian</a> for the article <a href="https://dev.to/whtsky/don-t-forget-py-typed-for-your-typed-python-package-2aa3">Don't forget <code>py.typed</code> for your typed Python package - DEV Community</a>, from which I learned about this mechanism.</p>In the Python project made with Poetry shall I add poetry.lock to the git repo or ignore it?2023-11-12T00:00:00+01:002023-11-12T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-11-12:/python-project-with-Poetry-add-poetry-lock-to-the-git-repo-or-ignore-it/
<p>In a Python project managed with Poetry, you should definitely add the <code>poetry.lock</code> file to your Git repository. The <code>poetry.lock</code> file ensures that all project dependencies are specified with fixed versions, providing deterministic builds across different environments.</p>
<p>By …</p>
<p>In a Python project managed with Poetry, you should definitely add the <code>poetry.lock</code> file to your Git repository. The <code>poetry.lock</code> file ensures that all project dependencies are specified with fixed versions, providing deterministic builds across different environments.</p>
<p>By including the <code>poetry.lock</code> file in your repository, you ensure that anyone cloning or checking out your project will have the exact same versions of the dependencies installed. This guarantees that they will have a consistent development environment and can reproduce the same build and execution results.</p>
<p>Including the <code>poetry.lock</code> file also serves as documentation for the specific versions of the dependencies used in your project. This information can be helpful for troubleshooting and debugging purposes.</p>
<p>When working with Poetry, you can also add the <code>pyproject.toml</code> file to your Git repository. This file contains the project metadata and the dependencies specified in a readable format, giving a high-level overview of your project's requirements.</p>
<p>Including both the <code>poetry.lock</code> and <code>pyproject.toml</code> files ensures that others can easily set up and work with your project while maintaining consistency across different development environments.</p>Git change remote origin (replace with new)2023-11-11T00:00:00+01:002023-11-11T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-11-11:/Git-change-remote-origin-replace-with-new/<h2>Git - Replace remote origin</h2>
<p>To change the remote origin in Git and replace it with a new one, you can use the following steps:</p>
<p><strong>Verify the existing remote origin</strong></p>
<p>Check the current remote URL for the origin repository by running the command …</p><h2>Git - Replace remote origin</h2>
<p>To change the remote origin in Git and replace it with a new one, you can use the following steps:</p>
<p><strong>Verify the existing remote origin</strong></p>
<p>Check the current remote URL for the origin repository by running the command:</p>
<div class="highlight"><pre><span></span><code>git<span class="w"> </span>remote<span class="w"> </span>-v<span class="w"> </span>
</code></pre></div>
<p>This command will display the fetch and push URLs for all the remotes.</p>
<p><strong>Remove the existing remote origin</strong></p>
<p>In order to replace the remote origin, you need to remove the current one. Run the command:</p>
<p><code>git remote remove origin</code>.</p>
<p>This will remove the old origin from your local Git repository.</p>
<p><strong>Add the new remote origin</strong></p>
<p>Once you have removed the existing remote origin, you can add the new one by running the command: <code>git remote add origin <new_remote_url></code>. Replace <code><new_remote_url></code> with the URL of the new remote repository you want to set as the origin.</p>
<p><strong>Verify the changes</strong>
You can ensure that the new remote origin is set correctly by running</p>
<p><code>git remote -v</code></p>
<p><strong>Push the branch to the new origin</strong></p>
<p>Finally, you can push your branch to the new remote origin using:</p>
<p><code>git push -u origin <branch_name></code>.</p>
<p>Replace <code><branch_name></code> with the name of the branch you want to push.</p>
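<p>Put together, the whole sequence looks like this. The snippet below is a self-contained sketch that uses a local bare repository as a stand-in for the new remote URL; in practice you would use your hosting provider's HTTPS or SSH URL:</p>

```shell
#!/usr/bin/env bash
set -e
# Demo setup: a throwaway local repo, plus a bare repo standing in for
# the new remote URL (hypothetical; normally an HTTPS/SSH URL).
new_remote=$(mktemp -d)/new.git
git init --bare -q "$new_remote"
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email you@example.com
git config user.name "You"
git commit --allow-empty -q -m "initial commit"
git remote add origin https://old.example/repo.git  # the outdated origin

git remote -v                        # 1. verify the existing origin
git remote remove origin             # 2. remove it
git remote add origin "$new_remote"  # 3. add the new origin
git remote -v                        # 4. verify the change
git push -q -u origin HEAD           # 5. push the current branch and set upstream
```

<p>Alternatively, <code>git remote set-url origin <new_remote_url></code> replaces the URL in a single step, without removing and re-adding the remote.</p>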
<h2>When you might need to perform this operation</h2>
<p>There are several situations where you might want to change the remote origin (replace it with a new one) in Git. Some common examples include:</p>
<ol>
<li>
<p>Changing the repository hosting provider: If you are migrating your codebase from one hosting provider to another (e.g., from GitHub to GitLab), you would need to update the remote origin URL to point to the new provider.</p>
</li>
<li>
<p>Moving the repository from a personal account to an organization account: If you initially created a repository under your personal account and later decide to move it to an organization account, you would change the remote origin to point to the new organization repository.</p>
</li>
<li>
<p>Renaming the repository: If you decide to change the name of your repository, you may want to update the remote origin URL to reflect the new name.</p>
</li>
<li>
<p>Collaborating with multiple repositories: In some cases, you might want to work with multiple remote repositories, perhaps to collaborate with different teams or maintain several mirrored repositories. Changing the remote origin allows you to switch between these repositories easily.</p>
</li>
<li>
<p>Fixing an incorrect or outdated remote origin: If you accidentally set the wrong remote origin URL or if the previous URL has become outdated, you can change it to point to the correct one.</p>
</li>
</ol>
<p>Remember, changing the remote origin should be done with caution, especially in collaborative environments, as it affects the repository's remote connections. Make sure to communicate the changes to your team and consider any implications before making the switch.</p>SPLADE sparse vectors - explanation, properties2023-11-10T00:00:00+01:002023-12-08T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-11-10:/splade-sparse-vectors/
<h2>TL; DR</h2>
<p>SPLADE is a neural retrieval model which learns query/document <strong>sparse</strong> expansion via the BERT MLM head and sparse regularization. Sparse representations benefit from several advantages compared to dense approaches: efficient use …</p>
<h2>TL; DR</h2>
<p>SPLADE is a neural retrieval model which learns query/document <strong>sparse</strong> expansion via the BERT MLM head and sparse regularization. Sparse representations benefit from several advantages compared to dense approaches: efficient use of inverted index, explicit lexical match, interpretability... They also seem to be better at generalizing on out-of-domain data (BEIR benchmark).</p>
<h2>Intro</h2>
<p>I learned about SPLADE from the article <a href="https://www.pinecone.io/learn/splade/">SPLADE for Sparse Vector Search Explained | Pinecone</a>. Below are the key concepts from that article (an LLM-generated summary).</p>
<p>The article discusses the evolution of search and recommendation systems, focusing on the shift from traditional "bag of words" methods to modern vector search. It explains how big tech companies like Google, Netflix, and Amazon use vector search to power their systems.</p>
<p>The traditional <strong>bag of words</strong> methods transform documents into a set of words, populating a sparse "frequency vector". While these methods are <strong>efficient and interpretable</strong>, they are <strong>not perfect</strong> due to their <strong>reliance on exact term matching,</strong> which doesn't align with human nature.</p>
<p><strong>Dense embedding</strong> models offer a solution by allowing search based on <strong>semantic meaning</strong>. However, they require <strong>vast amounts of data for fine-tuning</strong> and don't perform well in niche domains where data is scarce.</p>
<p>The article introduces a solution to these problems: <strong>a merger of sparse and dense retrieval through hybrid search and learnable sparse embeddings</strong>. It focuses on <strong>SPLADE</strong> (Sparse Lexical and Expansion model), a <strong>model that uses a pretrained language model like BERT to enhance sparse vector embedding.</strong></p>
<h2>How it works</h2>
<p>The idea behind the <strong>Sp</strong>arse <strong>L</strong>exical <strong>a</strong>n<strong>d</strong> <strong>E</strong>xpansion models is that a pretrained language model like BERT can identify connections between words/sub-words (called <em>word-pieces</em> or “terms” in this article) and use that knowledge to enhance our sparse vector embedding.</p>
<p>This works in two ways. First, it allows us to weigh the relevance of different terms: a common term like <em>the</em> will carry less relevance than a less common word like <em>orangutan</em>. Second, it enables <em>term expansion</em>: the inclusion of alternative but relevant terms beyond those found in the original sequence.</p>
<p><img alt="Term expansion allows us to identify relevant but different terms and use them in the sparse vector retrieval step." src="https://cdn.sanity.io/images/vr8gru94/production/17f0aac1f34b4475121744b672156a611dd8aed6-1029x331.png"></p>
<p>Term expansion allows us to identify relevant but different terms and use them in the sparse vector retrieval step.</p>
<p>The most significant advantage of SPLADE is not necessarily that it can <em>do</em> term expansion but instead that it can <em>learn</em> term expansions. Traditional methods required rule-based term expansion which is time-consuming <em>and</em> fundamentally limited. Whereas SPLADE can use the best language models to learn term expansions and even tweak them based on the sentence context.</p>
<p>The article also discusses the pros and cons of sparse and dense vectors, the concept of two-stage retrieval, and the drawbacks of this approach. It then delves into the workings of SPLADE, explaining how it builds sparse embeddings and how it can be implemented using Hugging Face and PyTorch or the official SPLADE library.</p>
<p>The article concludes by acknowledging the <strong>limitations of SPLADE</strong>, such as its <strong>slower retrieval speed compared to other sparse methods</strong>, and suggests solutions to these problems. It also highlights the potential of mixing both dense and sparse representations using hybrid search indexes to make vector search more accurate and accessible.</p>
<p>X::<a href="https://www.safjan.com/tfidf-with-examples/">TF-IDF with examples</a></p>
<h2>References</h2>
<ul>
<li><a href="https://github.com/naver/splade">GitHub - naver/splade: SPLADE: sparse neural search (SIGIR21, SIGIR22)</a></li>
</ul>
<p>[1] T. Formal, B. Piwowarski, S. Clinchant, <a href="https://arxiv.org/abs/2107.05720">SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking</a> (2021), SIGIR 21</p>
<p>[2] T. Formal, C. Lassance, B. Piwowarski, S. Clinchant, <a href="https://export.arxiv.org/abs/2109.10086">SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval</a> (2021)</p>
<ul>
<li>https://www.linkedin.com/posts/prithivirajdamodaran_%3F%3F%3F%3F%3F%3F-%3F%3F%3F%3F-%3F%3F%3F%3F%3F%3F%3F%3F-activity-7164581754270400512-Aa87?utm_source=share&utm_medium=member_desktop</li>
</ul>TF-IDF with examples2023-11-10T00:00:00+01:002023-11-10T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-11-10:/tfidf-with-examples/<p>See also: <a href="https://www.safjan.com/splade-sparse-vectors/">SPLADE sparse vectors - explanation, properties</a></p>
<p><strong>TF-IDF</strong> stands for <strong>Term Frequency-Inverse Document Frequency</strong>. It's a numerical statistic used to reflect how important a word is to a document in a collection or corpus. It's often used in information retrieval and text mining …</p><p>See also: <a href="https://www.safjan.com/splade-sparse-vectors/">SPLADE sparse vectors - explanation, properties</a></p>
<p><strong>TF-IDF</strong> stands for <strong>Term Frequency-Inverse Document Frequency</strong>. It's a numerical statistic used to reflect how important a word is to a document in a collection or corpus. It's often used in information retrieval and text mining.</p>
<p>TF-IDF is composed of two parts:</p>
<ol>
<li>
<p><strong>Term Frequency (TF)</strong>: This measures the frequency of a word in a document. It's the ratio of the number of times a word appears in a document compared to the total number of words in that document. It increases as the number of occurrences of that word within the document increases. Each document has its own term frequency.</p>
</li>
<li>
<p><strong>Inverse Document Frequency (IDF)</strong>: This measures the importance of the word in the entire corpus. The IDF of a rare word is high, whereas the IDF of a frequent word is likely to be low. Thus, words that occur frequently across many documents will have a lower IDF, and rare words will have a high IDF.</p>
</li>
</ol>
<p>The TF-IDF value is calculated by multiplying these two metrics: TF and IDF.</p>
<h2>Minimal example</h2>
<h3>High TF-IDF</h3>
<p>Consider a document containing 100 words wherein the word 'cat' appears 3 times.</p>
<p>The term frequency (TF) for 'cat' is then (3 / 100) = 0.03.</p>
<p>Now, assume we have 10 million documents and the word 'cat' appears in one thousand of these. Then, the inverse document frequency (IDF) is calculated as log(10,000,000 / 1,000) = 4.</p>
<p>So, the TF-IDF weight is the product of these quantities: 0.03 * 4 = 0.12.</p>
<h3>Low TF-IDF</h3>
<p>Now, let's consider a common word like 'the'. Assume it appears 20 times in a document of 100 words. So, TF for 'the' is (20/100) = 0.2.</p>
<p>Assume 'the' appears in 1 million out of 10 million documents. So, IDF for 'the' is log(10,000,000 / 1,000,000) = 1.</p>
<p>The TF-IDF weight for 'the' is 0.2 * 1 = 0.2.</p>
<p>Even though 'the' appeared more times than 'cat' in the document, the TF-IDF weight for 'cat' is higher than 'the'. This is because IDF gives a higher weight to words that are less frequent in the corpus, making 'cat' more important than 'the' in the context of our corpus.</p>
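<p>Both worked examples can be verified with a few lines of Python (using the base-10 logarithm, which is what the numbers above assume):</p>

```python
import math

def tf(count: int, total_terms: int) -> float:
    """Term frequency: occurrences of the term over total terms in the document."""
    return count / total_terms

def idf(n_docs: int, docs_with_term: int) -> float:
    """Inverse document frequency, base-10 log as in the examples above."""
    return math.log10(n_docs / docs_with_term)

def tfidf(count, total_terms, n_docs, docs_with_term):
    return tf(count, total_terms) * idf(n_docs, docs_with_term)

# 'cat': 3 occurrences in a 100-word document, appears in 1,000 of 10M docs
print(round(tfidf(3, 100, 10_000_000, 1_000), 4))       # 0.12
# 'the': 20 occurrences in the same document, appears in 1M of 10M docs
print(round(tfidf(20, 100, 10_000_000, 1_000_000), 4))  # 0.2
```
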
<h2>The formula</h2>
<p>Written out formally, the components are defined as follows.</p>
<p>The term frequency <span class="math">\(TF\)</span> is calculated as:</p>
<div class="math">$$
TF(t, d) = \frac{f_{t, d}}{\sum_{t' \in d} f_{t', d}}
$$</div>
<p>Where:</p>
<ul>
<li><span class="math">\(f_{t, d}\)</span> is the frequency of term <span class="math">\(t\)</span> in document <span class="math">\(d\)</span></li>
<li>The denominator is the sum of frequencies of all terms in document <span class="math">\(d\)</span></li>
</ul>
<p>The inverse document frequency <span class="math">\(IDF\)</span> is calculated as:</p>
<div class="math">$$
IDF(t, D) = \log \frac{|D|}{|\{d \in D: t \in d\}|}
$$</div>
<p>Where:</p>
<ul>
<li><span class="math">\(|D|\)</span> is the total number of documents in the corpus</li>
<li>The denominator is the number of documents where the term <span class="math">\(t\)</span> appears</li>
</ul>
<p>Finally, the TF-IDF is calculated as:</p>
<div class="math">$$
TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D)
$$</div>
<p>Where:</p>
<ul>
<li><span class="math">\(t\)</span> is the term</li>
<li><span class="math">\(d\)</span> is the document</li>
<li><span class="math">\(D\)</span> is the corpus (set of all documents)</li>
</ul>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Growth Hacking Methodology2023-11-07T00:00:00+01:002023-11-07T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-11-07:/growth-hacking-methodology/<p>Growth Hacking is a marketing strategy primarily used by startups and small businesses, which focuses on rapid growth within a short time frame. It involves experimenting with and implementing creative, low-cost strategies to acquire and retain customers.</p>
<p>Here are some key points …</p><p>Growth Hacking is a marketing strategy primarily used by startups and small businesses, which focuses on rapid growth within a short time frame. It involves experimenting with and implementing creative, low-cost strategies to acquire and retain customers.</p>
<p>Here are some key points about Growth Hacking:</p>
<ol>
<li>
<p><strong>Experimentation</strong>: Growth hacking involves constant experimentation across various channels and product development paths to identify the most effective ways to grow a business.</p>
</li>
<li>
<p><strong>Creativity</strong>: Growth hackers often use unconventional marketing strategies to get maximum growth. This could include viral marketing, social media, targeted advertising, SEO, email marketing, and more.</p>
</li>
<li>
<p><strong>Data-Driven</strong>: Growth hacking is heavily reliant on data analysis. Growth hackers track and analyze user data to understand behavior, test hypotheses, and make informed decisions.</p>
</li>
<li>
<p><strong>Agility</strong>: Growth hacking requires agility and adaptability. Growth hackers must be willing to pivot quickly, change strategies, and try new things based on what the data is telling them.</p>
</li>
<li>
<p><strong>Product Development</strong>: Growth hacking isn't just about marketing. It often involves tweaking the product itself to make it more appealing or to encourage users to spread the word about it.</p>
</li>
<li>
<p><strong>Customer Retention</strong>: While much of growth hacking focuses on customer acquisition, it's also about customer retention. Growth hackers look for ways to increase customer loyalty and encourage repeat business.</p>
</li>
<li>
<p><strong>Viral Loops</strong>: Growth hackers often aim to create viral loops, where existing users naturally attract new users, creating a self-perpetuating cycle of growth.</p>
</li>
</ol>
<p>An example of a successful growth hack is Dropbox's referral program. They offered extra storage space to users who referred their friends, which led to a significant increase in user sign-ups. This is a classic example of a growth hack – a simple, cost-effective solution that led to substantial growth.</p>
<h2>References</h2>
<ol>
<li>
<p>Book: "Growth Hacker Marketing: A Primer on the Future of PR, Marketing, and Advertising" by Ryan Holiday. This book is a good starting point for understanding the concept of growth hacking.</p>
</li>
<li>
<p>Book: "Hacking Growth: How Today's Fastest-Growing Companies Drive Breakout Success" by Sean Ellis and Morgan Brown. Sean Ellis is the person who coined the term "growth hacking," and this book provides a deep dive into the methodology.</p>
</li>
<li>
<p><a href="https://growthhackers.com/">GrowthHackers</a> - An online community where growth hackers share case studies, articles, and resources.</p>
</li>
<li>
<p><a href="https://www.quicksprout.com/growth-hacking/">The Growth Hacking Starter Guide - Real Examples</a> - An online guide that provides a comprehensive overview of growth hacking.</p>
</li>
<li>
<p><a href="https://neilpatel.com/what-is-growth-hacking/">Growth Hacking Made Simple: Definition</a> by Neil Patel</p>
</li>
<li>
<p><a href="https://growthrocks.com/blog/growth-hacking-books/">Top 17 Growth Hacking Books to Read in 2022</a></p>
</li>
</ol>
<p>X::<a href="https://www.safjan.com/criticism-of-the-lean-startup/">Criticism of the Lean Startup</a>
X::<a href="https://www.safjan.com/product-led-growth/">Product Led Growth</a></p>Product Led Growth2023-11-07T00:00:00+01:002023-11-07T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-11-07:/product-led-growth/<p>Product Led Growth (PLG) is a business methodology in which the product itself serves as the primary driver of customer acquisition, conversion, and expansion. It's a model that prioritizes product usage as the key growth driver, rather than traditional marketing or sales …</p><p>Product Led Growth (PLG) is a business methodology in which the product itself serves as the primary driver of customer acquisition, conversion, and expansion. It's a model that prioritizes product usage as the key growth driver, rather than traditional marketing or sales efforts.</p>
<p>Here are some key points about Product Led Growth:</p>
<ol>
<li>
<p><strong>User-Centric</strong>: PLG focuses on the user experience. The product is designed to be so user-friendly and intuitive that it sells itself. The aim is to create a product that users love and can't live without.</p>
</li>
<li>
<p><strong>Viral Growth</strong>: PLG often relies on viral growth. This means that current users recommend the product to others, creating a network effect. This can be facilitated by incorporating features that naturally encourage sharing or collaboration.</p>
</li>
<li>
<p><strong>Freemium or Free Trial Models</strong>: Many PLG companies offer a freemium model or free trial to attract users. This allows users to try the product and see its value before deciding to pay for premium features.</p>
</li>
<li>
<p><strong>Self-Service</strong>: PLG products are typically self-service, meaning users can sign up, use, and even upgrade the product without needing to interact with a sales team.</p>
</li>
<li>
<p><strong>Data-Driven</strong>: PLG companies use data to understand user behavior, identify opportunities for improvement, and make informed decisions. They often use metrics like daily active users (DAU), monthly active users (MAU), and net promoter score (NPS) to measure success.</p>
</li>
<li>
<p><strong>Customer Success Focus</strong>: In a PLG model, customer success is crucial. Companies need to ensure users are getting maximum value from the product, which often involves providing educational resources, responsive support, and regular product updates.</p>
</li>
</ol>
<p>Examples of successful PLG companies include Slack, Dropbox, and Zoom. These companies have created products that users love, leading to rapid, organic growth.</p>
<p>X::<a href="https://www.safjan.com/criticism-of-the-lean-startup/">Criticism of the Lean Startup</a></p>
<p>X::<a href="https://www.safjan.com/growth-hacking-methodology/">Growth Hacking Methodology</a></p>
<p>X::<a href="https://www.safjan.com/design-thinking/">Design Thinking</a></p>
<h2>References</h2>
<ul>
<li><a href="https://www.productled.org/foundations/what-is-product-led-growth">What is product-led growth?</a></li>
<li><a href="https://www.appcues.com/blog/pirate-metric-saas-growth">Why activation is the most important pirate metric for SaaS growth | Appcues Blog</a></li>
</ul>RAG-Fusion - Enhancing Information Retrieval in Large Language Models2023-11-06T00:00:00+01:002023-11-06T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-11-06:/rag-fusion-enhancing-information-retrieval-in-large-language-models/<p>In the realm of Large Language Models (LLMs) such as ChatGPT, a new technique known as Retrieval Augmented Generation (RAG) is gaining prominence. This technique is designed to enhance a user's input by incorporating additional information from an external source. This supplementary …</p><p>In the realm of Large Language Models (LLMs) such as ChatGPT, a new technique known as Retrieval Augmented Generation (RAG) is gaining prominence. This technique is designed to enhance a user's input by incorporating additional information from an external source. This supplementary data is then leveraged by the LLM to enrich the response it generates. In this blog post, we will delve deeper into the core concept of RAG-fusion, which revolves around multiple query generation and re-ranking of results. For other methods that can improve RAG performance see my other <a href="https://www.safjan.com/techniques-to-boost-rag-performance-in-production/">Techniques to Boost RAG Performance in Production</a>.</p>
<h2>What is RAG-fusion?</h2>
<p><strong>The principle behind RAG-fusion is to generate multiple versions of the user's original query using a LLM, and then re-rank the results to select the most relevant retrieved parts.</strong></p>
<blockquote>
<p>NOTE: The term RAG in the name of the technique might be a bit misleading, since "RAG-fusion" concerns only the first part of RAG: the retrieval process.</p>
</blockquote>
<p>How does it work? For instance, the prompt template for this task might look something like this: "Generate multiple search queries related to: {original_query}", where <code>{original_query}</code> is a placeholder for the user's original query. This step enables the model to explore different perspectives and interpretations of the original query, thereby broadening the range of potential responses.</p>
<h2>Re-ranking: A Crucial Step</h2>
<p>The next vital step in the RAG-fusion process is re-ranking. This step is critical in determining the most pertinent answers to the user's query. The re-ranking process, often referred to as Reciprocal Rank Fusion (RRF), involves collecting ranked search outcomes from multiple strategies.</p>
<p>Each document is assigned a reciprocal rank score. These scores are then merged to create a new ranking. The underlying principle here is that documents that consistently appear in top positions across diverse search strategies are likely more pertinent and should, therefore, receive a higher rank in the consolidated result.</p>
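<p>The scoring just described can be sketched in a few lines of Python. The constant <code>k=60</code> is the value commonly used for RRF; the document IDs and rankings below are made up for illustration:</p>

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document receives sum(1 / (k + rank)) over the lists it appears
    in, so documents ranked high across many lists score best.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)

# Three retrieval runs (e.g. three variants of the user query) over docs A-D
runs = [["A", "B", "C", "D"],
        ["B", "A", "D", "C"],
        ["A", "C", "B", "D"]]
print(reciprocal_rank_fusion(runs))  # ['A', 'B', 'C', 'D']
```

<p>Document A tops the fused ranking because it is ranked first in two of the three runs, which matches the intuition that consistency across strategies signals relevance.</p>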
<p><img alt="RAG Fusion" src="https://miro.medium.com/v2/resize:fit:1400/1*tDALPmWxwAPf7UADpZwjWQ@2x.jpeg">
Figure 1. RAG-fusion process flow for ranking four documents A, B, C, D against three retrieval sources (e.g., three variants of the original user query). Source of image: <a href="https://towardsdatascience.com/forget-rag-the-future-is-rag-fusion-1147298d8ad1">Forget RAG, the Future is RAG-Fusion article by Adrian H. Raudaschl</a></p>
<h2>Why RAG-fusion Matters?</h2>
<p>RAG-fusion boosts the LLM's ability to generate accurate, contextually relevant responses. By considering multiple interpretations of the original query and re-ranking the results, it ensures that the model's output is aligned as closely as possible with the user's intent.</p>
<p>RAG-fusion is a powerful technique that brings together the strengths of large language models and advanced information retrieval strategies. By employing multiple query generation and re-ranking, it takes a leap towards making AI-powered systems more responsive, accurate, and user-friendly.</p>
<p><strong>NOTE 1:</strong> For more methods that can improve RAG performance, see my other article <a href="https://www.safjan.com/techniques-to-boost-rag-performance-in-production/">Techniques to Boost RAG Performance in Production</a>.
<strong>NOTE 2:</strong> This technique is also referred to as query rewriting. You can find a section on it in the LlamaIndex documentation (<a href="https://docs.llamaindex.ai/en/stable/examples/query_transformations/query_transform_cookbook.html">Query Transformation Cookbook</a>).</p>
<p>See also: <a href="https://www.safjan.com/understanding-retrieval-augmented-generation-rag-empowering-llms/">Understanding Retrieval-Augmented Generation (RAG) empowering LLMs</a></p>
<p><strong>Edits:</strong>
2024-02-01 - added reference to the LlamaIndex Query Transform Cookbook</p>
<h2>References</h2>
<ul>
<li><a href="https://github.com/Raudaschl/rag-fusion/tree/master">GitHub - Raudaschl/rag-fusion</a> - exemplary implementation</li>
<li><a href="https://towardsdatascience.com/forget-rag-the-future-is-rag-fusion-1147298d8ad1">Forget RAG, the Future is RAG-Fusion | by Adrian H. Raudaschl | Oct, 2023 | Towards Data Science</a></li>
<li>RAG-fusion in LangChain: <a href="https://python.langchain.com/docs/templates/rag-fusion">usage</a>, template <a href="https://github.com/langchain-ai/langchain/tree/master/templates/rag-fusion">code</a></li>
<li><a href="https://docs.llamaindex.ai/en/stable/examples/query_transformations/query_transform_cookbook.html">Query Transformation Cookbook</a></li>
</ul>What Is the Key Difference Between PCA and SVD?2023-11-06T00:00:00+01:002023-11-06T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-11-06:/what-is-the-key-difference-between-pca-and-svd/<p>Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) are two matrix factorization methods used in machine learning and data analysis for dimensionality reduction. Though they are used for similar purposes, there are some key differences between the two. The key difference …</p><p>Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) are two matrix factorization methods used in machine learning and data analysis for dimensionality reduction. Though they are used for similar purposes, there are some key differences between the two. The key difference between Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) lies in their respective applications and the matrices they operate on.</p>
<h2>Dealing with the data</h2>
<ul>
<li><strong>PCA</strong> primarily deals with the covariance structure of the data. It's a statistical procedure that transforms the coordinates of a dataset into a new coordinate system. In the new system, the first axis corresponds to the first principal component that accounts for the maximum variance in the data. The second axis, perpendicular to the first, aligns with the direction of the second largest variance, and so on. PCA effectively tries to find orthogonal axes (the principal components) along which the variance of the data is maximized.</li>
<li><strong>SVD</strong>, on the other hand, does not rely on a covariance matrix. It is a factorization of the original data matrix that decomposes it into three matrices. This can be done without computing the covariance matrix, and it even makes it possible to work with missing data.</li>
</ul>
<h2>Computations</h2>
<ul>
<li>
<p>Both PCA and SVD involve eigen-decomposition. For PCA, the eigen-decomposition is on the covariance matrix of the data which is a square symmetric matrix of size <code>d x d</code> (where <code>d</code> is the number of features). This could be an issue if <code>d</code> is large, since calculating the covariance matrix and performing subsequent eigen-decomposition could be computationally expensive.</p>
</li>
<li>
<p>In contrast, SVD performs the decomposition on the data matrix itself (of size <code>n x d</code> where <code>n</code> is the number of observations and <code>d</code> is the number of features), theoretically making the computation more efficient, especially when <code>d</code> is much larger than <code>n</code>.</p>
</li>
</ul>
<p>In summary, while the two techniques are related (PCA can actually be solved using SVD), they approach the problem of dimensionality reduction differently. PCA focuses on the covariance structure and tries to maximize variance along orthogonal axes, while SVD focuses on matrix factorization and can handle cases where data is missing. However, from an application perspective, they are generally used interchangeably.</p>
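<p>The relationship is easy to verify numerically: the eigenvalues of the covariance matrix equal the squared singular values of the centered data matrix divided by <code>n - 1</code>. A minimal check with NumPy (random data, purely illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # n = 100 observations, d = 3 features
Xc = X - X.mean(axis=0)                # PCA requires centered data

# Route 1: eigen-decomposition of the d x d covariance matrix (classic PCA)
cov = Xc.T @ Xc / (len(Xc) - 1)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]   # descending order

# Route 2: SVD of the n x d data matrix itself (no covariance needed)
singular_values = np.linalg.svd(Xc, compute_uv=False)
variances = singular_values**2 / (len(Xc) - 1)

# Both routes yield identical principal-component variances
assert np.allclose(eigvals, variances)
```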
<p><strong>PCA is a specific application of SVD, primarily used for dimensionality reduction, while SVD is a more general matrix decomposition technique with broader applications in linear algebra and data analysis.</strong></p>Choosing technology for the LLM knowledge graph2023-11-05T00:00:00+01:002023-11-05T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-11-05:/choosing-technology-for-the-lmm-knowledge-graph/<p>There are several technologies that can be used to implement a knowledge graph, depending on the specific requirements of your project. Here are three commonly used technologies for implementing knowledge graphs:</p>
<ol>
<li>
<p><a href="https://en.wikipedia.org/wiki/Resource_Description_Framework"><strong>Resource Description Framework (RDF)</strong></a> is a widely adopted standard for …</p></li></ol><p>There are several technologies that can be used to implement a knowledge graph, depending on the specific requirements of your project. Here are three commonly used technologies for implementing knowledge graphs:</p>
<ol>
<li>
<p><a href="https://en.wikipedia.org/wiki/Resource_Description_Framework"><strong>Resource Description Framework (RDF)</strong></a> is a widely adopted standard for representing data in the form of triples (subject-predicate-object). It provides a flexible and extensible way to model graph data. RDF-based technologies like <a href="https://db-engines.com/en/article/RDF+Stores">RDF stores</a> or <a href="https://en.wikipedia.org/wiki/Triplestore">triplestores</a> (e.g., <a href="https://jena.apache.org/">Apache Jena</a>, <a href="https://virtuoso.openlinksw.com/">Virtuoso</a>, <a href="https://www.stardog.com/">Stardog</a>) are commonly used to store and query knowledge graphs.</p>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/Graph_database"><strong>Graph Databases</strong></a> are purpose-built to store, manage, and query graph data efficiently. These databases are optimized for traversing relationships between entities and provide fast graph-based queries. Examples of popular graph databases include <a href="https://neo4j.com/">Neo4j</a>, <a href="https://aws.amazon.com/neptune/">Amazon Neptune</a>, and <a href="https://janusgraph.org/">JanusGraph</a>.</p>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/Triplestore"><strong>Triplestores</strong></a> are specialized databases designed specifically for RDF data. They store and query data using the RDF data model. Triplestores like <a href="https://jena.apache.org/">Apache Jena</a>, <a href="https://virtuoso.openlinksw.com/">Virtuoso</a>, and <a href="https://www.allegrograph.com/">AllegroGraph</a> provide features for storing and querying large-scale RDF knowledge graphs effectively.</p>
</li>
</ol>
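<p>To make the triple model concrete, here is a deliberately naive in-memory store with wildcard matching, in the spirit of a triplestore query. This is illustrative only; real systems such as Apache Jena or Neo4j index and query far more efficiently.</p>

```python
class TinyTripleStore:
    """A toy subject-predicate-object store; None acts as a wildcard."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def query(self, subject=None, predicate=None, obj=None):
        # None behaves like a SPARQL variable: it matches anything.
        pattern = (subject, predicate, obj)
        return [
            triple for triple in self.triples
            if all(p is None or p == t for p, t in zip(pattern, triple))
        ]
```

<p>For example, after adding <code>("GPT-4", "is_a", "LLM")</code>, the call <code>store.query(predicate="is_a")</code> retrieves every "is_a" relationship in the graph.</p>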
<p>Implementing a knowledge graph using these technologies typically involves defining a schema or ontology that describes the entities, their properties, and the semantic relationships between them. The triples or statements representing the data are then stored and indexed by the chosen technology for efficient retrieval and querying.</p>Prompt Discovery2023-11-04T00:00:00+01:002023-11-04T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-11-04:/prompt-discovery/<p>Learn prompt discovery to uncover the most effective prompts and combinations thereof to achieve specific tasks, while also considering factors like response quality, model performance, and computational efficiency</p><p>Prompt discovery, in the context of large language models and prompt engineering, refers to the systematic process of identifying, optimizing, and fine-tuning prompts that elicit desired responses from the language model. It involves a blend of linguistic, computational, and experimental techniques to formulate prompts that yield accurate and contextually relevant outputs from the model.</p>
<blockquote>
<p>The goal of prompt discovery is to uncover the most effective prompts and combinations thereof to achieve specific tasks, while also considering factors like response quality, model performance, and computational efficiency.</p>
</blockquote>
<p>In highly technical terms, prompt discovery encompasses several complex problems and activities:</p>
<ol>
<li>
<p><strong>Prompt Formulation</strong>: This involves crafting prompts that are clear, unambiguous, and tailored to the desired task. Different phrasings and structures might lead to variations in model behavior, so prompt engineers need to experiment with syntax and semantics to achieve optimal results.</p>
</li>
<li>
<p><strong>Prompt Permutations</strong>: Researchers need to explore various permutations of prompts by altering wording, adding context, or using different query types. Systematically generating and testing different prompt variations is a crucial part of prompt discovery to identify which specific formulations generate the desired outputs.</p>
</li>
<li>
<p><strong>Fine-tuning Parameters</strong>: Discovering the ideal fine-tuning parameters for each prompt and model combination is a complex optimization problem. Researchers must experiment with factors like learning rates, batch sizes, and optimization algorithms to fine-tune the model for specific prompts.</p>
</li>
<li>
<p><strong>Benchmarking and Comparison</strong>: Comparing response quality across different prompt permutations, models, and settings is essential. This involves devising appropriate evaluation metrics to quantitatively assess the performance of the model in response to different prompts and making informed decisions based on these metrics.</p>
</li>
<li>
<p><strong>Generalization and Transfer Learning</strong>: Investigating the extent to which prompts can be generalized across tasks or domains is a challenging problem. Researchers need to explore how prompts can be adapted or transferred to different tasks without sacrificing performance.</p>
</li>
<li>
<p><strong>Exploration of Novel Prompts</strong>: As the field evolves, prompt engineers must continuously come up with innovative prompt formulations that push the boundaries of the model's capabilities. This might involve experimenting with new query structures, linguistic constructs, or contextual cues.</p>
</li>
</ol>
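<p>Step 2 (prompt permutations) lends itself to simple automation. A sketch that enumerates variants by crossing interchangeable instruction phrasings with context prefixes; all strings here are invented for illustration.</p>

```python
from itertools import product

def permute_prompts(task, instructions, contexts):
    """Cross every instruction phrasing with every context prefix."""
    return [
        f"{context}{instruction.format(task=task)}"
        for context, instruction in product(contexts, instructions)
    ]

variants = permute_prompts(
    "summarize this article",
    instructions=["Please {task}.", "Your job is to {task}."],
    contexts=["", "You are a helpful editor. "],
)
```

<p>Each variant can then be sent to the model and scored with the benchmarking metrics discussed in step 4.</p>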
<p><img alt="process" src="/images/prompt_discovery/prompt_discovery_process.png"></p>
<p><em><strong>Figure 1:</strong> Flowchart illustrating the steps in prompt discovery. Starting with prompt formulation, it progresses through prompt permutations, fine-tuning parameters, benchmarking and comparison, generalization and transfer learning, to the exploration of novel prompts.</em></p>
<p>For prompt discovery, a range of tools, both existing and potentially developed in the future, can be instrumental:</p>
<ol>
<li>
<p><strong>Automated Prompt Generation</strong>: AI-assisted tools that automatically generate prompt variations based on input specifications could expedite the discovery process.</p>
</li>
<li>
<p><strong>Prompt Optimization Algorithms</strong>: Advanced optimization algorithms tailored for prompt discovery, including genetic algorithms or reinforcement learning approaches, could efficiently explore the prompt space.</p>
</li>
<li>
<p><strong>Interactive Prompt Testing Environments</strong>: User-friendly interfaces that allow prompt engineers to interactively test and fine-tune prompts with real-time model feedback can facilitate rapid iteration.</p>
</li>
<li>
<p><strong>Prompt Benchmarking Platforms</strong>: Comprehensive platforms for benchmarking prompt performance across various tasks, models, and settings could aid in making informed prompt selection decisions.</p>
</li>
<li>
<p><strong>Semantic Analysis Tools</strong>: Tools that provide detailed semantic analysis of prompt-response pairs can help identify patterns and nuances in model behavior, guiding prompt formulation.</p>
</li>
<li>
<p><strong>Natural Language Understanding Frameworks</strong>: Advanced NLU frameworks that provide insights into model comprehension and reasoning processes can inform prompt design for better results.</p>
</li>
<li>
<p><strong>Transfer Learning Techniques</strong>: Techniques that enable efficient transfer of knowledge from one prompt to another could support prompt generalization across tasks.</p>
</li>
<li>
<p><strong>Continuous Model Monitoring</strong>: Real-time monitoring tools that track model performance in response to different prompts can aid in prompt discovery over time.</p>
</li>
</ol>
<p><img alt="mindmap" src="/images/prompt_discovery/prompt_discovery_mindmap.png"></p>
<p><em><strong>Figure 2:</strong> Mindmap illustrating the key aspects of prompt discovery. It includes formulation, permutations, fine-tuning, benchmarking, generalization, novel prompts, and the different tools involved in the process.</em></p>
<p>In summary, prompt discovery is a process that involves intricate prompt formulation, thorough benchmarking, fine-tuning, and adaptation. The tools mentioned above, along with future advancements, will play a vital role in shaping the efficiency and effectiveness of prompt discovery efforts.</p>Techniques to Boost RAG Performance in Production2023-11-01T00:00:00+01:002023-11-04T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-11-01:/techniques-to-boost-rag-performance-in-production/<p>This article discusses several advanced techniques that can be applied at different stages of the RAG pipeline to enhance its performance in a production setting.</p><p>Retrieval-Augmented Generation (RAG) is a powerful tool in the domain of machine learning, offering significant potential for improving the quality of text generation in various applications. However, optimizing its performance can be a challenging task. For the introductory text on RAG see my other <a href="https://safjan.com/understanding-retrieval-augmented-generation-rag-empowering-llms/">article</a>. This article discusses several advanced techniques that can be applied at different stages of the RAG pipeline to enhance its performance in a production setting.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#leveraging-hybrid-search">Leveraging Hybrid Search</a></li>
<li><a href="#utilizing-summaries-for-data-chunks">Utilizing Summaries for Data Chunks</a></li>
<li><a href="#applying-query-transformations">Applying Query Transformations</a></li>
<li><a href="#query-compression">Query Compression</a></li>
<li><a href="#optimal-chunking-strategy">Optimal Chunking Strategy</a></li>
<li><a href="#fine-tuning-embedding-models">Fine-tuning Embedding Models</a></li>
<li><a href="#enriching-metadata">Enriching Metadata</a></li>
<li><a href="#employing-re-ranking">Employing Re-ranking</a></li>
<li><a href="#addressing-the-lost-in-the-middle-problem">Addressing the 'Lost in the Middle' Problem</a></li>
<li><a href="#meta-data-filtering">Meta-data Filtering</a></li>
<li><a href="#query-routing">Query Routing</a></li>
<li><a href="#references">References</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="leveraging-hybrid-search"></a></p>
<h2>Leveraging Hybrid Search</h2>
<p>Hybrid search, a fusion of semantic search and keyword search, can be employed to retrieve pertinent data from a vector store. This method often yields superior results across a range of use cases. It essentially combines the strength of keyword search (precision) and semantic search (recall), providing a more comprehensive search solution.</p>
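<p>In its simplest form, hybrid search is a convex combination of two pre-normalized score lists. The weighting parameter <code>alpha</code> below is a tunable assumption; production systems typically use rank-based fusion or a vector database's built-in hybrid mode instead.</p>

```python
def hybrid_rank(doc_ids, keyword_scores, semantic_scores, alpha=0.5):
    """Rank documents by alpha * semantic + (1 - alpha) * keyword.

    Both score dicts are assumed to be normalized to [0, 1].
    """
    fused = {
        d: alpha * semantic_scores[d] + (1 - alpha) * keyword_scores[d]
        for d in doc_ids
    }
    return sorted(doc_ids, key=fused.get, reverse=True)
```

<p>Raising <code>alpha</code> favors semantic recall; lowering it favors keyword precision.</p>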
<p><a id="utilizing-summaries-for-data-chunks"></a></p>
<h2>Utilizing Summaries for Data Chunks</h2>
<p>An efficient way to enhance the quality of generation and reduce the number of tokens in the input is by summarizing the chunks of data and storing these summaries in the vector store. This technique is especially useful when dealing with data that includes numerous filler words. By summarizing the chunks, we can eliminate these superfluous elements, thereby refining the quality of the input data.
</p>
<p><a id="applying-query-transformations"></a></p>
<h2>Applying Query Transformations</h2>
<p>Query transformations can significantly enhance the quality of responses. For instance, if a system does not find relevant context for a query, the LLM can rephrase the query and try again. See the <a href="https://www.safjan.com/rag-fusion-enhancing-information-retrieval-in-large-language-models/">RAG-Fusion - Enhancing Information Retrieval in Large Language Models</a>.</p>
<p>Similarly, the <a href="http://boston.lti.cs.cmu.edu/luyug/HyDE/HyDE.pdf">HyDE</a> strategy generates a hypothetical response to a query and uses both for embedding lookup, which has been found to dramatically enhance performance.</p>
<p>Another technique involves breaking down complex queries into sub-queries, a process that LLMs tend to handle better. This approach can be integrated into the RAG system to decompose a query into multiple simpler questions.</p>
<p><a id="query-compression"></a></p>
<h2>Query Compression</h2>
<p>Query compression, (see a tool like <a href="https://www.microsoft.com/en-us/research/project/llmlingua/longllmlingua/">LongLLMLingua</a>) is a technique for improving RAG's performance in long context scenarios where large language models often face challenges such as increased computational and financial costs, longer latency, and inferior performance. By enhancing the density and optimizing the position of key information in the input prompt, LongLLMLingua improves LLMs' perception of key information, which in turn, reduces computational load, decreases latency, and improves performance. This strategy ensures that vital information is not lost or diluted in lengthy contexts, thereby enhancing the relevance and quality of the generated output.
</p>
<p><a id="optimal-chunking-strategy"></a></p>
<h2>Optimal Chunking Strategy</h2>
<p>There are multiple strategies that can be applied to chunking; see <a href="https://safjan.com/from-fixed-size-to-nlp-chunking-a-deep-dive-into-text-chunking-techniques/#from-fixed-size-to-nlp-chunking-a-deep-dive-into-text-chunking-techniques">Chunking strategies</a>. One key aspect is controlling the chunk overlap. Semantic retrieval may pose a challenge when a selected chunk has meaningful context in adjacent chunks that could be missed. To mitigate this, an overlap of chunks can be implemented, whereby neighboring chunks are also passed to the Language Model (LLM) for generation. This guarantees that the surrounding context is incorporated, thus enhancing the output's quality.</p>
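<p>A sliding-window chunker with overlap can be sketched as follows (sizes are in words for simplicity; real pipelines usually count tokens):</p>

```python
def chunk_with_overlap(words, chunk_size=200, overlap=50):
    """Split a word list into chunks sharing `overlap` words with their neighbor."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks
```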
<p><a id="fine-tuning-embedding-models"></a></p>
<h2>Fine-tuning Embedding Models</h2>
<p>While off-the-shelf embedding models such as BERT and Ada may suffice for many use cases, they might not adequately represent specific domains in the vector space, leading to suboptimal retrieval quality. In such instances, it would be advantageous to fine-tune an embedding model using domain-specific data to significantly improve retrieval quality.</p>
<p><a id="enriching-metadata"></a></p>
<h2>Enriching Metadata</h2>
<p>The provision of metadata like source information about the chunks being processed can enhance the LLM's comprehension of the context, leading to a better output generation. This additional layer of information can provide the LLM with a more holistic understanding of the data, enabling it to generate more accurate and relevant responses.</p>
<p><a id="employing-re-ranking"></a></p>
<h2>Employing Re-ranking</h2>
<p>Semantic search may yield top-k results that are too similar to each other. To ensure a wider array of snippets, it is beneficial to <a href="https://www.sbert.net/examples/applications/retrieve_rerank/README.html">re-rank</a> the results based on other factors such as metadata and keyword matches. This diversification of snippets can lead to a more nuanced and comprehensive context for the LLM to generate responses. A re-ranker can be based on a cross-encoder.</p>
<p><a id="addressing-the-lost-in-the-middle-problem"></a></p>
<h2>Addressing the 'Lost in the Middle' Problem</h2>
<p>LLMs tend not to assign equal weight to all tokens in the input, often overlooking tokens located in the middle. This phenomenon, known as the <a href="https://arxiv.org/abs/2307.03172">'lost in the middle' problem</a>, can be addressed by reordering the context snippets to place the most vital snippets at the beginning and end of the input, with less important snippets situated in the middle.</p>
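<p>The reordering can be done with a simple interleave that pushes the highest-ranked snippets to both edges of the context. This mirrors the behavior of tools such as LangChain's <code>LongContextReorder</code>, though the sketch below is an independent illustration:</p>

```python
def reorder_for_middle_loss(snippets_best_first):
    """Place top-ranked snippets at the start and end, weakest in the middle."""
    front, back = [], []
    for i, snippet in enumerate(snippets_best_first):
        # Alternate snippets between the front and the back of the context
        (front if i % 2 == 0 else back).append(snippet)
    return front + back[::-1]
```

<p>Given snippets ranked <code>s1..s5</code> (best first), the result is <code>[s1, s3, s5, s4, s2]</code>: the two most important snippets sit at the edges of the prompt.</p>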
<p><a id="meta-data-filtering"></a></p>
<h2>Meta-data Filtering</h2>
<p>Meta-data, such as date tags, can be added to your chunks to improve retrieval. For example, filtering by recency can be beneficial when querying email history. Recent emails may not necessarily be the most similar from an embedding standpoint, but they are more likely to be relevant.</p>
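<p>For the email example, a recency filter over chunk metadata might look like this; the <code>metadata["date"]</code> layout is an assumed convention, not a specific framework's schema:</p>

```python
from datetime import datetime, timedelta

def filter_recent(chunks, days=30, now=None):
    """Keep only chunks whose metadata date is within the last `days` days."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=days)
    return [c for c in chunks if c["metadata"]["date"] >= cutoff]
```

<p>The surviving chunks can then be ranked by embedding similarity as usual, so recency acts as a hard pre-filter rather than a similarity signal.</p>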
<p><a id="query-routing"></a></p>
<h2>Query Routing</h2>
<p>Having multiple indexes and routing queries to the appropriate index can be beneficial. For instance, different indexes could handle summarization questions, pointed questions, and date-sensitive questions. Trying to optimize one index for all these behaviors may compromise its effectiveness.</p>
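<p>A minimal keyword-based router illustrates the idea; real systems often use an LLM or a trained classifier to pick the route, and the route names here are invented:</p>

```python
def route_query(query, routes, default="general"):
    """Send the query to the index whose trigger keywords match it best."""
    q = query.lower()
    best_route, best_hits = default, 0
    for route, keywords in routes.items():
        hits = sum(keyword in q for keyword in keywords)
        if hits > best_hits:
            best_route, best_hits = route, hits
    return best_route

routes = {
    "summaries": ["summarize", "overview", "tl;dr"],
    "dates": ["when", "date", "recent", "latest"],
}
```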
<p>The performance of RAG in production can be significantly improved by applying a range of techniques, including hybrid search, chunk summarization, overlapping chunks, fine-tuned embedding models, metadata enhancement, re-ranking, addressing the 'lost in the middle' problem, query transformations, meta-data filtering, and query routing. These strategies will help to optimize the RAG pipeline, ensuring higher quality output and improved overall performance.</p>
<p><a id="references"></a></p>
<h2>References</h2>
<ol>
<li><a href="https://llmstack.ai/blog/retrieval-augmented-generation">Retrieval Augmented Generation (RAG): What, Why and How? | LLMStack</a></li>
<li><a href="https://arxiv.org/abs/2307.03172">[2307.03172] Lost in the Middle: How Language Models Use Long Contexts</a></li>
<li><a href="https://towardsdatascience.com/10-ways-to-improve-the-performance-of-retrieval-augmented-generation-systems-5fa2cee7cd5c">10 Ways to Improve the Performance of Retrieval Augmented Generation Systems | by Matt Ambrogi | Sep, 2023 | Towards Data Science</a></li>
<li>Hypothetical Document Embeddings (HyDE) - <a href="http://boston.lti.cs.cmu.edu/luyug/HyDE/HyDE.pdf">Precise Zero-Shot Dense Retrieval without Relevance Labels</a></li>
<li><a href="https://www.sbert.net/examples/applications/retrieve_rerank/README.html">Retrieve & Re-Rank — Sentence-Transformers documentation</a></li>
<li><a href="https://blog.llamaindex.ai/improving-rag-effectiveness-with-retrieval-augmented-dual-instruction-tuning-ra-dit-01e73116655d">Improving RAG effectiveness with Retrieval-Augmented Dual Instruction Tuning (RA-DIT) | by Emanuel Ferreira | Oct, 2023 | LlamaIndex Blog</a></li>
<li><a href="https://medium.com/towards-generative-ai/improving-rag-retrieval-augmented-generation-answer-quality-with-re-ranker-55a19931325">Improving RAG (Retrieval Augmented Generation) Answer Quality with Re-ranker | by Shivam Solanki | Towards Generative AI | Medium</a></li>
<li>SingleStore (db), finetuning embeddings model, CacheGPT, Nemo-Guardrails, <a href="https://madhukarkumar.medium.com/secrets-to-optimizing-rag-llm-apps-for-better-accuracy-performance-and-lower-cost-da1014127c0a">Secrets to Optimizing RAG LLM Apps for Better Performance, Accuracy and Lower Costs! | by Madhukar Kumar | madhukarkumar | Sep, 2023 | Medium</a></li>
<li><a href="https://github.com/run-llama/finetune-embedding">run-llama/finetune-embedding: Fine-Tuning Embedding for RAG with Synthetic Data</a></li>
<li><a href="https://github.com/zilliztech/GPTCache">zilliztech/GPTCache: Semantic cache for LLMs. Fully integrated with LangChain and llama_index.</a></li>
<li><a href="https://github.com/NVIDIA/NeMo-Guardrails">NVIDIA/NeMo-Guardrails: NeMo Guardrails is an open-source toolkit for easily adding programmable guardrails to LLM-based conversational systems.</a></li>
<li>library to evaluate the context retrieved from your enterprise corpus of data (how do you know if the context being retrieved is accurate) <a href="https://github.com/explodinggradients/ragas">GitHub - explodinggradients/ragas: Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines</a></li>
<li>LangSmith, introduced by LangChain - a highly effective tool for monitoring and examining the responses between the app and the LLM.</li>
<li><a href="https://arxiv.org/abs/2310.15123">[2310.15123] Branch-Solve-Merge Improves Large Language Model Evaluation and Generation</a></li>
</ol>Python Expertise Level - Self-Assessment2023-10-17T00:00:00+02:002023-10-17T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-10-17:/python-expertise-level-self-assessment/<p>Sometimes you need to assess your own or a candidate's level of expertise in Python programming. I have created some statements that roughly correspond to the various levels of expertise. Note that knowing programming language techniques contributes to expertise but does not make …</p><p>Sometimes you need to assess your own or a candidate's level of expertise in Python programming. I have created some statements that roughly correspond to the various levels of expertise. Note that knowing programming language techniques contributes to expertise but does not automatically make someone a great programmer. Knowledge of algorithms and data structures, programming patterns, and software architectures are some other important factors, to mention a few.</p>
<p>That said, I still find this simple classification of Python programmers into three categories useful: beginners, advanced, and experts.</p>
<h2>Beginners</h2>
<ol>
<li>Familiar with basic Python syntax and data types (strings, integers, lists, dictionaries).</li>
<li>Can write simple functions and use control flow statements (if, for, while).</li>
<li>Understands the concept of variables and variable scope.</li>
<li>Can use basic Python libraries like <code>math</code> and <code>random</code>.</li>
<li>Knows how to handle errors and exceptions using try/except blocks.</li>
<li>Can read from and write to files.</li>
<li>Understands the basics of object-oriented programming: classes, objects, methods.</li>
<li>Can use basic string and list methods for manipulation.</li>
<li>Knows how to use basic Python data structures like lists, tuples, and dictionaries.</li>
<li>Can write simple Python scripts to automate tasks.</li>
</ol>
<h2>Advanced</h2>
<ol>
<li>Understands and uses generators and decorators.</li>
<li>Can write complex functions and classes with multiple methods and attributes.</li>
<li>Understands and uses list comprehensions and lambda functions.</li>
<li>Can use regular expressions for pattern matching in strings (note: this is more a regex skill than a Python one)</li>
<li>Understands and uses context managers for resource management.</li>
<li>Can use advanced Python data structures like sets and frozensets.</li>
<li>Understands and uses Python's memory management and optimization techniques.</li>
<li>Can use Python's built-in functions like <code>map()</code>, <code>filter()</code>, <code>reduce()</code>.</li>
<li>Understands and uses Python's module and package system.</li>
</ol>
<h2>Experts</h2>
<ol>
<li>Understands and uses metaclasses and descriptors.</li>
<li>Can write and understand asynchronous code using <code>asyncio</code>.</li>
<li>Understands and uses Python's concurrency and parallelism features.</li>
<li>Can use Python's C API to extend Python with C/C++ code.</li>
<li>Understands and uses Python's dynamic typing system to its full extent.</li>
<li>Can write and understand complex decorators and context managers.</li>
<li>Proficient in Python's debugging and profiling, using tools like <code>pdb</code> for debugging and <code>cProfile</code> for profiling to optimize their code.</li>
<li>Have a deep understanding of Python's Global Interpreter Lock (GIL) and how it affects multithreaded programs.</li>
</ol>
<p>There is <a href="https://news.ycombinator.com/item?id=38032092">HN</a> discussion on this note.</p>
<p><strong>Edits:</strong></p>
<ul>
<li>2023-10-30: remove from Experts: 7. Understands and uses Python's garbage collection system.</li>
<li>2023-10-30: remove from Experts: Have a good understanding of Python's internals, such as bytecode, the Python interpreter's execution model, and how Python's data types are implemented at the C level.</li>
<li>2023-10-30: remove from Advanced: Can use advanced Python libraries like <code>numpy</code>, <code>pandas</code>, <code>matplotlib</code> not a python std lib.</li>
<li>added note:</li>
</ul>Understanding the Differences in Language Models - Transformers vs. Markov Models2023-10-07T00:00:00+02:002023-10-07T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-10-07:/understanding-differences-gpt-transformers-markov-models/<p>This article explores distinguishing details of Markov Models and Transformer-based models like GPT, focusing on how they predict the next character in a sequence. It explores the fundamental differences between these models, with a particular emphasis on how the self-attention mechanism in Transformer models makes a difference compared to the fixed context length in Markov models.</p><p>In the field of machine learning and natural language processing (NLP), different models have been developed to understand and generate human language. Two such models that have gained significant attention are the Markov Models and the Transformer-based models like GPT (<a href="https://en.wikipedia.org/wiki/Generative_pre-trained_transformer">Generative Pretrained Transformer</a>). While both types of models can predict the next character in a sequence, they differ significantly in their underlying mechanisms and capabilities. This article aims to delve into the intricacies of these models, with a particular focus on how the self-attention mechanism in Transformer models makes a difference compared to the fixed context length in Markov models.</p>
<h2>Markov Models: A Brief Overview</h2>
<p><a href="https://en.wikipedia.org/wiki/Markov_model">Markov Models</a>, named after the Russian mathematician <a href="https://en.wikipedia.org/wiki/Andrey_Markov">Andrey Markov</a>, are a class of models that predict future states based solely on the current state, disregarding all past states. This property is known as the Markov Property, and it is the fundamental assumption that underlies all Markov models.</p>
<p>In the context of language modeling, a Markov Model might predict the next word or character in a sentence based on the current word or character. For instance, given the word "The", a Markov Model might predict that the next word is "cat" based on the probability distribution of words that follow "The" in its training data.</p>
<p>The main limitation of Markov Models is their lack of memory. Since they only consider the current state, they are unable to capture long-term dependencies in a sequence. For example, in the sentence "I grew up in France... I speak fluent ___", a Markov Model might struggle to fill in the blank correctly because the relevant context ("France") is several words back.</p>
<p><img alt="Markov Chain text generation" src="/images/transformers_vs_markov/markov_model_text_generation.png"></p>
<p><strong>Figure 1.</strong> <em>Markov Model might predict the next word based on the probability distribution of words in its training data. Image Source: <a href="https://jaroslawwiosna.github.io/markov-chain-text/">markov-chain-text | Modern C++ Markov chain text generator</a> by Jarosław Wiosna</em></p>
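<p>To make the Markov Property concrete, here is a minimal sketch of a first-order (bigram) word-level Markov model. The corpus and function names are illustrative, not a production implementation; the point is that the prediction depends only on the current word:</p>

```python
import random
from collections import defaultdict

def build_bigram_model(text):
    """Count word transitions: for each word, how often each next word follows it."""
    words = text.split()
    model = defaultdict(lambda: defaultdict(int))
    for current, nxt in zip(words, words[1:]):
        model[current][nxt] += 1
    return model

def predict_next(model, word):
    """Predict the most likely next word given ONLY the current word (Markov Property)."""
    followers = model.get(word)
    if not followers:
        return None
    return max(followers, key=followers.get)

corpus = "the cat sat on the mat and the cat slept"
model = build_bigram_model(corpus)
print(predict_next(model, "the"))  # prints "cat" ("cat" follows "the" twice, "mat" once)
```

<p>Note that the model keeps no memory of anything before the current word, which is exactly why the "I grew up in France... I speak fluent ___" example above defeats it.</p>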
<h2>Transformer Models: An Introduction</h2>
<p>Transformer models, on the other hand, were introduced in the seminal paper <a href="https://arxiv.org/abs/1706.03762">"Attention is All You Need"</a> by Vaswani et al. (2017). They represent a significant departure from previous sequence-to-sequence models, eschewing recurrent and convolutional layers in favor of self-attention mechanisms.</p>
<p>GPT, developed by OpenAI, is a prominent example of a Transformer model. It is a generative model that can generate human-like text by predicting the next word in a sequence. Unlike Markov Models, GPT considers the entire context of a sequence when making predictions, allowing it to capture long-term dependencies.</p>
<h2>The Power of Self-Attention</h2>
<p>The key innovation of Transformer models is the self-attention mechanism. This mechanism allows the model to weigh the importance of different words in the context when predicting the next word. For instance, in the sentence "The cat, which was black and white, jumped over the ___", the model might assign more importance to "cat" and "jumped" when predicting the next word.</p>
<p>Self-attention is calculated using the dot product of the query and key vectors, which are learned representations of the input. The resulting attention scores are then used to weight the value vectors, which are also learned representations of the input. This weighted sum forms the output of the self-attention layer.</p>
<p>The self-attention mechanism allows Transformer models to consider the entire context of a sequence, rather than just the current state. This is a significant advantage over Markov Models, which are limited by their fixed context length.</p>
<p><img alt="Transformer model - Context and Attention" src="/images/transformers_vs_markov/transformers_context_and_atention.png"></p>
<p><strong>Figure 2.</strong> <em>The self-attention mechanism allows Transformer models to consider the entire context of a sequence, rather than just the current state. Image Source: <a href="https://dzone.com/articles/a-deep-dive-into-the-transformer-architecture-the">A Deep Dive Into the Transformer Architecture – The Development of Transformer Models</a> by Kevin Hooke</em></p>
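<p>The computation described above can be sketched with NumPy. This is a single attention head without masking or multi-head machinery, and the dimensions and random weight matrices are illustrative stand-ins for learned parameters:</p>

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of input vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv       # learned projections: queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # dot product of queries and keys, scaled
    weights = softmax(scores, axis=-1)     # attention weights; each row sums to 1
    return weights @ V, weights            # weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                            # 5 tokens, embedding dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
output, weights = self_attention(X, Wq, Wk, Wv)
```

<p>Each row of <code>weights</code> shows how much each token attends to every other token in the sequence, which is how the model can pull in context from arbitrarily far back.</p>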
<h2>Fixed Context Length vs. Variable Context Length</h2>
<p>Markov Models, due to their inherent design, have a fixed context length. They only consider the current state when making predictions, which limits their ability to capture long-term dependencies. This can lead to less accurate predictions, especially in complex sequences where the relevant context might be several states back.</p>
<p>Transformer models, on the other hand, have a variable context length. Thanks to the self-attention mechanism, they can consider the entire context of a sequence when making predictions. This allows them to capture long-term dependencies and make more accurate predictions.</p>
<p>Moreover, the self-attention mechanism allows Transformer models to dynamically adjust the context length based on the input. For instance, in a sentence with many irrelevant words, the model might focus on a few key words, effectively reducing the context length. This dynamic context length is another advantage of Transformer models over Markov Models.</p>
<h2>Conclusion</h2>
<p>While both Markov Models and Transformer models like GPT can predict the next character in a sequence, they differ significantly in their underlying mechanisms and capabilities. Markov Models, with their fixed context length, are limited in their ability to capture long-term dependencies. Transformer models, with their self-attention mechanism, can consider the entire context of a sequence, allowing them to capture long-term dependencies and make more accurate predictions.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>
<h2>References</h2>
<ol>
<li>Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). <a href="https://arxiv.org/abs/1706.03762">Attention is all you need</a>. In Advances in neural information processing systems (pp. 5998-6008).</li>
<li>Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). <a href="https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf">Language Models are Unsupervised Multitask Learners</a>. OpenAI Blog.</li>
<li>Bishop, C. M. (2006). <a href="https://www.springer.com/gp/book/9780387310732">Pattern Recognition and Machine Learning</a>. Springer.</li>
<li>Alammar, J. (2018). <a href="http://jalammar.github.io/illustrated-transformer/">The Illustrated Transformer</a>. Jay Alammar's Blog.</li>
<li>Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). <a href="https://arxiv.org/abs/2005.14165">Language Models are Few-Shot Learners</a>. In Advances in Neural Information Processing Systems.</li>
<li>Chollet, F. (2018). <a href="https://www.manning.com/books/deep-learning-with-python">Deep Learning with Python</a>. Manning Publications Co.</li>
<li>Jurafsky, D., & Martin, J. H. (2019). <a href="https://web.stanford.edu/~jurafsky/slp3/">Speech and Language Processing</a>. Stanford University.</li>
<li>Al-Rfou, R., Choe, D., Constant, N., Guo, M., & Jones, L. (2019). <a href="https://arxiv.org/abs/1808.04444">Character-Level Language Modeling with Deeper Self-Attention</a>. In Proceedings of the AAAI Conference on Artificial Intelligence.</li>
<li>Goodfellow, I., Bengio, Y., & Courville, A. (2016). <a href="http://www.deeplearningbook.org/">Deep Learning</a>. MIT press.</li>
<li>Manning, C. D., & Schütze, H. (1999). <a href="https://mitpress.mit.edu/books/foundations-statistical-natural-language-processing">Foundations of Statistical Natural Language Processing</a>. MIT Press.</li>
<li>Jurafsky, D., & Martin, J. H. (2009). <a href="https://www.pearson.com/us/higher-education/program/Jurafsky-Speech-and-Language-Processing-An-Introduction-to-Natural-Language-Processing-Computational-Linguistics-and-Speech-Recognition-2nd-Edition/PGM319216.html">Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition</a>. Prentice Hall.</li>
<li>Jelinek, F. (1997). <a href="https://mitpress.mit.edu/books/statistical-methods-speech-recognition">Statistical Methods for Speech Recognition</a>. MIT Press.</li>
<li>Russell, S., & Norvig, P. (2016). <a href="http://aima.cs.berkeley.edu/">Artificial Intelligence: A Modern Approach</a>. Pearson.</li>
<li>Charniak, E. (1993). <a href="https://mitpress.mit.edu/books/statistical-language-learning">Statistical Language Learning</a>. MIT Press.</li>
<li>Lin, T. (2015). <a href="https://towardsdatascience.com/markov-chains-and-text-generation-162fd4ec8f26">Markov Chains and Text Generation</a>. Towards Data Science Blog.</li>
<li>Goodman, J. (2001). <a href="https://www.microsoft.com/en-us/research/publication/a-bit-of-progress-in-language-modeling/">A bit of progress in language modeling</a>. Microsoft Research.</li>
<li>Rosenfeld, R. (2000). <a href="https://www.cs.cmu.edu/~roni/papers/SLM-hlt01.pdf">Two Decades of Statistical Language Modeling: Where Do We Go From Here?</a>. Proceedings of the IEEE.</li>
<li>Nazarko, K. (2021). <a href="https://towardsdatascience.com/text-generation-gpt-2-lstm-markov-chain-9ea371820e1e">Word-level text generation using GPT-2, LSTM and Markov Chain</a>. Towards Data Science Blog.</li>
<li>Adyatama, A. (2020). <a href="https://algotech.netlify.app/blog/text-generating-with-markov-chains/">Text Generation with Markov Chains</a>. Algoritma Technical Blog.</li>
</ol>How Agile Can Kill Creativity in Data Science team?2023-09-29T00:00:00+02:002023-09-29T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-09-29:/how-agile-can-kill-creativity-in-data-science-team/<p>Discover the delicate balance between Agile methodologies and imagination in the domain of data science and analytics. Uncover the impact of Agile approaches on creativity within data science teams. Explore how these practices shape the innovative landscape of data science and analytics.</p><p>Agile methodologies can provide numerous benefits to data science and analytics teams, such as quicker delivery, enhanced collaboration, and increased customer satisfaction. However, if not implemented effectively, Agile may unintentionally impede creativity in these teams. Here are a few ways Agile can potentially hinder creativity in data science/analytics.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#potential-problems">Potential problems</a><ul>
<li><a href="#tight-deadlines-and-sprints">Tight deadlines and sprints</a></li>
<li><a href="#focus-on-deliverables">Focus on deliverables</a></li>
<li><a href="#lack-of-autonomy">Lack of autonomy</a></li>
<li><a href="#constant-and-sudden-changes">Constant and sudden changes</a></li>
<li><a href="#overemphasis-on-standardized-processes">Overemphasis on standardized processes</a></li>
</ul>
</li>
<li><a href="#mitigation">Mitigation</a><ul>
<li><a href="#complementary-practices">Complementary practices</a></li>
<li><a href="#frameworks-tailored-for-data-science-projects">Frameworks tailored for data science projects</a></li>
</ul>
</li>
<li><a href="#references">References</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="potential-problems"></a></p>
<h2>Potential problems</h2>
<p><a id="tight-deadlines-and-sprints"></a></p>
<h3>Tight deadlines and sprints</h3>
<p>Agile typically operates on tight timelines with fixed sprints. This can limit the time available for exploration, experimentation, and creative thinking. The emphasis on adhering to strict schedules may discourage innovative approaches that require more time to develop.</p>
<p><a id="focus-on-deliverables"></a></p>
<h3>Focus on deliverables</h3>
<p>Agile methodologies often prioritize delivering functioning solutions over long-term exploration. This focus on short-term goals can discourage team members from taking the time to explore complex problems creatively, resulting in a more practical, rather than innovative, approach.</p>
<p><a id="lack-of-autonomy"></a></p>
<h3>Lack of autonomy</h3>
<p>In some Agile implementations, teams may be closely supervised or required to adhere to preset workflows. This kind of micromanagement limits individual creativity, as team members may not have the freedom to experiment, propose alternative solutions, or take calculated risks.</p>
<p><a id="constant-and-sudden-changes"></a></p>
<h3>Constant and sudden changes</h3>
<p>Agile projects often involve iterative development with frequent changes in priorities and requirements. While this adaptability is beneficial in many cases, it can disrupt the creative process and impede the ability to think deeply about problems. Constantly switching gears may hinder the exploration of unconventional solutions.</p>
<p><a id="overemphasis-on-standardized-processes"></a></p>
<h3>Overemphasis on standardized processes</h3>
<p>Agile frameworks provide standardized processes and practices that ensure consistency and predictability. While these are essential for efficient project management, a strict adherence to these processes can stifle creativity as it may discourage deviation from the prescribed methods.</p>
<p><a id="mitigation"></a></p>
<h2>Mitigation</h2>
<p><a id="complementary-practices"></a></p>
<h3>Complementary practices</h3>
<p>To prevent the potential negative impact on creativity, Agile methodologies should be complemented with the following practices:</p>
<ul>
<li>Allow dedicated time for <strong>exploration</strong> and <strong>learning</strong> outside of fixed sprints.</li>
<li>Encourage <strong>cross-functional collaboration</strong> and knowledge sharing to foster creativity.</li>
<li>Provide opportunities for <strong>innovation-driven initiatives</strong> alongside project-driven ones.</li>
<li>Support a <strong>psychologically safe environment</strong> that allows for experimentation and failure.</li>
<li><strong>Recognize</strong> and reward <strong>creative thinking</strong> and experimentation within the team.</li>
</ul>
<p><a id="frameworks-tailored-for-data-science-projects"></a></p>
<h3>Frameworks tailored for data science projects</h3>
<p>Data science teams need to adapt Agile practices to suit their specific needs and contexts, and to balance the trade-offs between speed, flexibility, and quality. They can adopt or modify a framework that is tailored for data science projects, such as the <strong><a href="https://learn.microsoft.com/en-us/azure/architecture/data-science-process/overview">Team Data Science Process</a></strong> (TDSP) or the <strong><a href="https://www.datascience-pm.com/agile-data-science/">Agile Data Science Process</a></strong>. These frameworks provide guidance on how to structure, execute, and manage data science projects using Agile principles and practices.</p>
<p>By adjusting Agile practices to accommodate these considerations, data science/analytics teams can create a balance between efficient project management and fostering creativity and innovation.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>
<p><a id="references"></a></p>
<h2>References</h2>
<ol>
<li><a href="https://learn.microsoft.com/en-us/azure/architecture/data-science-process/overview">What is the Team Data Science Process? - Azure Architecture Center | Microsoft Learn</a></li>
<li><a href="https://www.datascience-pm.com/agile-data-science/">Agile Data Science - Data Science Process Alliance</a></li>
<li><a href="https://eugeneyan.com/writing/data-science-and-agile-what-works-and-what-doesnt/">Data Science and Agile (What Works, and What Doesn't)</a> (read about poor resource planning)</li>
<li><a href="https://www.geeksforgeeks.org/agile-methodology-advantages-and-disadvantages/">Agile Methodology Advantages and Disadvantages - GeeksforGeeks</a></li>
</ol>The Right Way to Job-Hop2023-09-29T00:00:00+02:002023-09-29T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-09-29:/the-right-way-to-job-hop/<p>NOTE: The advice below was extracted from the podcast transcript using an LLM.</p>
<p>Based on the podcast "The right way to Job-hop" <a href="https://stackoverflow.blog/2022/10/11/the-right-way-to-job-hop-ai-generated-pokemon-ep-495/">transcript</a>, here are some key pieces of advice on how to do "job hopping" the right way:</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#key-pieces-of-advice">Key Pieces …</a></li></ul><p>NOTE: The advice below was extracted from the podcast transcript using an LLM.</p>
<p>Based on the podcast "The right way to Job-hop" <a href="https://stackoverflow.blog/2022/10/11/the-right-way-to-job-hop-ai-generated-pokemon-ep-495/">transcript</a>, here are some key pieces of advice on how to do "job hopping" the right way:</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#key-pieces-of-advice">Key Pieces of Advice</a></li>
<li><a href="#follow-new-tech-trends">Follow New Tech Trends</a></li>
<li><a href="#dont-stay-too-long-in-one-place">Don't Stay Too Long in One Place</a></li>
<li><a href="#use-job-hopping-to-gain-a-variety-of-experience">Use Job Hopping to Gain a Variety of Experience</a></li>
<li><a href="#leave-a-job-for-a-good-reason">Leave a Job for a Good Reason</a></li>
<li><a href="#stay-at-a-job-for-at-least-a-year">Stay at a Job for at Least a Year</a></li>
<li><a href="#be-prepared-to-explain-your-job-hopping">Be Prepared to Explain Your Job Hopping</a></li>
<li><a href="#consider-remote-opportunities">Consider Remote Opportunities</a></li>
<li><a href="#focus-on-increasing-your-personal-wealth">Focus on Increasing Your Personal Wealth</a></li>
<li><a href="#ensure-youre-moving-up-with-each-job-change">Ensure You're Moving Up With Each Job Change</a></li>
<li><a href="#ask-the-right-questions-during-interviews">Ask the Right Questions During Interviews</a></li>
<li><a href="#additional-pieces-of-advice">Additional pieces of advice</a></li>
<li><a href="#understand-the-impact-of-job-hopping-on-your-resume">Understand the Impact of Job Hopping on Your Resume</a></li>
<li><a href="#avoid-short-stints">Avoid Short Stints</a></li>
<li><a href="#use-job-hopping-as-a-negotiation-tool">Use Job Hopping as a Negotiation Tool</a></li>
<li><a href="#consider-the-company-culture">Consider the Company Culture</a></li>
<li><a href="#be-transparent-and-honest">Be Transparent and Honest</a></li>
<li><a href="#keep-learning-and-updating-your-skills">Keep Learning and Updating Your Skills</a></li>
<li><a href="#maintain-professional-relationships">Maintain Professional Relationships</a></li>
<li><a href="#consider-the-impact-on-your-long-term-career-goals">Consider the Impact on Your Long-Term Career Goals</a></li>
<li><a href="#take-advantage-of-remote-work-opportunities">Take Advantage of Remote Work Opportunities</a></li>
<li><a href="#always-leave-on-good-terms-dont-burn-bridges">Always Leave on Good Terms, Don't Burn Bridges</a></li>
<li><a href="#evaluate-the-companys-stability">Evaluate the Company's Stability</a></li>
<li><a href="#consider-the-impact-on-your-work-life-balance">Consider the Impact on Your Work-Life Balance</a></li>
<li><a href="#take-time-to-reflect-on-each-job-change">Take Time to Reflect on Each Job Change</a></li>
<li><a href="#be-prepared-for-potential-negative-perceptions">Be Prepared for Potential Negative Perceptions</a></li>
<li><a href="#dont-job-hop-just-for-the-sake-of-it">Don't Job Hop Just for the Sake of It</a></li>
<li><a href="#consider-the-benefits-and-drawbacks">Consider the Benefits and Drawbacks</a></li>
<li><a href="#keep-your-skills-up-to-date">Keep Your Skills Up to Date</a></li>
<li><a href="#network-effectively">Network Effectively</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="key-pieces-of-advice"></a></p>
<h2>Key Pieces of Advice</h2>
<p><a id="follow-new-tech-trends"></a></p>
<h3>Follow New Tech Trends</h3>
<p>The tech industry is characterized by rapid and constant evolution. As such, it's crucial to stay abreast of emerging technologies and trends. By doing so, you can identify opportunities to gain experience in these new areas, which can enhance your skill set and make you more marketable. Job hopping can be a strategic way to follow these trends, allowing you to move between companies that are at the forefront of these changes, thereby ensuring your skills remain relevant and in-demand.</p>
<p><a id="dont-stay-too-long-in-one-place"></a></p>
<h3>Don't Stay Too Long in One Place</h3>
<p>Unlike many other industries where longevity in a role is often rewarded, the tech industry values adaptability and diverse experience. Given the high demand for tech skills, employers are often willing to offer competitive compensation packages to attract talent, even if the candidate has a history of changing jobs frequently. Therefore, don't hesitate to change jobs every few years if it means advancing your career, gaining new skills, or improving your compensation.</p>
<p><a id="use-job-hopping-to-gain-a-variety-of-experience"></a></p>
<h3>Use Job Hopping to Gain a Variety of Experience</h3>
<p>Job hopping can provide a wealth of diverse experiences. By moving between different companies, roles, and projects, you can acquire a broad range of skills and insights. This variety can not only enhance your professional development and accelerate your career progression but also make you a more attractive candidate to potential employers who value such diverse experience.</p>
<p><a id="leave-a-job-for-a-good-reason"></a></p>
<h3>Leave a Job for a Good Reason</h3>
<p>While job hopping is more accepted in the tech industry, it's still important to have a valid reason for leaving each job. This could be to pursue a new opportunity, acquire new skills, seek a higher salary, or aim for a promotion. Leaving a job without a good reason could raise concerns for potential employers, who may question your commitment or reliability. Therefore, always ensure you can articulate your reasons for job changes in a positive and professional manner.</p>
<p><a id="stay-at-a-job-for-at-least-a-year"></a></p>
<h3>Stay at a Job for at Least a Year</h3>
<p>While frequent job changes can be beneficial, it's advisable to stay at each job for at least a year. This duration allows you sufficient time to fully understand your role, contribute meaningfully to the company, and leave a positive impression. It also demonstrates to future employers that you can commit to a role and see projects through to completion.</p>
<p><a id="be-prepared-to-explain-your-job-hopping"></a></p>
<h3>Be Prepared to Explain Your Job Hopping</h3>
<p>If your resume shows frequent job changes, be prepared to explain this during interviews. Honesty is key here. Focus on the positive aspects of job hopping, such as the diverse skills and experiences you've gained, the opportunities you've had to work on different projects or with different technologies, and how these experiences have contributed to your professional growth.</p>
<p><a id="consider-remote-opportunities"></a></p>
<h3>Consider Remote Opportunities</h3>
<p>The rise of remote work has significantly expanded job opportunities. You can now work for companies based in different cities, states, or even countries without having to relocate. This can make job hopping more convenient and less disruptive to your personal life, while also opening up a wider range of potential job opportunities.</p>
<p><a id="focus-on-increasing-your-personal-wealth"></a></p>
<h3>Focus on Increasing Your Personal Wealth</h3>
<p>While loyalty to an employer is important, it's also crucial to focus on your personal financial growth. If changing jobs can help you achieve higher compensation, whether through a higher salary, better benefits, or equity options, then it's a move worth considering. Remember, your primary professional obligation is to your own career development and financial stability.</p>
<p><a id="ensure-youre-moving-up-with-each-job-change"></a></p>
<h3>Ensure You're Moving Up With Each Job Change</h3>
<p>Each job change should represent a step forward in your career. Whether it's a higher role, more responsibilities, or the opportunity to work with new technologies, each move should contribute to your career progression. This upward trajectory can demonstrate to potential employers your ambition, your ability to take on new challenges, and your commitment to professional growth.</p>
<p><a id="ask-the-right-questions-during-interviews"></a></p>
<h3>Ask the Right Questions During Interviews</h3>
<p>When interviewing for a new job, it's important to ask questions that can help you understand the company's culture and whether it aligns with your values and career goals. This can help you avoid accepting a job that isn't a good fit for you. Ask about the company's values, their approach to work-life balance, opportunities for professional development, and their expectations for the role you're applying for. This can give you a clearer picture of what it would be like to work for the company and help you make an informed decision.</p>
<p><a id="additional-pieces-of-advice"></a></p>
<h2>Additional pieces of advice</h2>
<p><a id="understand-the-impact-of-job-hopping-on-your-resume"></a></p>
<h3>Understand the Impact of Job Hopping on Your Resume</h3>
<p>It's important to recognize that the perception of frequent job changes can vary across industries. In the tech sector, it's generally accepted and can even be seen as a sign of adaptability and a desire to acquire diverse skills. However, in other industries, it might raise questions about your stability or commitment. Therefore, when crafting your resume and cover letter, tailor them to address any potential concerns. Highlight the skills and experiences you've gained through job hopping and how they've contributed to your professional growth.</p>
<p><a id="avoid-short-stints"></a></p>
<h3>Avoid Short Stints</h3>
<p>While job hopping can offer numerous benefits, extremely short stints (like three to nine months) at multiple companies can raise red flags for potential employers. It might suggest that you struggle to commit to a role or adapt to a new environment. Aim to stay at each job for at least a year, which shows that you can contribute meaningfully to a company and see projects through to completion.</p>
<p><a id="use-job-hopping-as-a-negotiation-tool"></a></p>
<h3>Use Job Hopping as a Negotiation Tool</h3>
<p>Job hopping can serve as a powerful negotiation tool. If you receive a job offer with a higher salary or better benefits from another company, you can use this as leverage to negotiate better terms with your current employer. This strategy can help you maximize your earning potential and benefits without necessarily having to change jobs.</p>
<p><a id="consider-the-company-culture"></a></p>
<h3>Consider the Company Culture</h3>
<p>Before deciding to hop to a new job, take the time to understand the company's culture. If the company values loyalty and long-term commitment, frequent job hopping might be viewed negatively. Conversely, if the company values diverse experiences and skills, job hopping might be seen as a positive attribute. Understanding a company's culture can help you make informed decisions about job hopping.</p>
<p><a id="be-transparent-and-honest"></a></p>
<h3>Be Transparent and Honest</h3>
<p>During interviews, be transparent and honest about your reasons for job hopping. If you're leaving a job due to dissatisfaction, explain your reasons professionally and constructively. This can demonstrate to potential employers that you're thoughtful about your career decisions and are not simply leaving jobs on a whim.</p>
<p><a id="keep-learning-and-updating-your-skills"></a></p>
<h3>Keep Learning and Updating Your Skills</h3>
<p>The tech industry is characterized by rapid and continuous evolution. Therefore, it's crucial to keep learning and updating your skills to stay relevant. This commitment to continuous learning can make you more attractive to potential employers and open up more opportunities for job hopping.</p>
<p><a id="maintain-professional-relationships"></a></p>
<h3>Maintain Professional Relationships</h3>
<p>Even if you change jobs frequently, it's important to maintain positive relationships with your former employers and colleagues. They can provide valuable references in the future and might even offer you new opportunities. Networking is a key aspect of career development, and maintaining these professional relationships can be beneficial in the long run.</p>
<p><a id="consider-the-impact-on-your-long-term-career-goals"></a></p>
<h3>Consider the Impact on Your Long-Term Career Goals</h3>
<p>While job hopping can provide immediate benefits such as higher pay or a more desirable role, it's important to consider how it aligns with your long-term career goals. If a new job offers valuable experience or skills that align with your long-term objectives, it might be worth making the move. Always consider the long-term implications of job hopping on your career trajectory.</p>
<p><a id="take-advantage-of-remote-work-opportunities"></a></p>
<h3>Take Advantage of Remote Work Opportunities</h3>
<p>The rise of remote work has significantly expanded the job market. This means you can job hop without the geographical constraints that traditionally limited job opportunities. This can allow you to access opportunities in different cities, states, or even countries, broadening your career prospects.</p>
<p><a id="always-leave-on-good-terms-dont-burn-bridges"></a></p>
<h3>Always Leave on Good Terms, Don't Burn Bridges</h3>
<p>Regardless of your reasons for leaving a job, always strive to leave on good terms. This includes giving proper notice, completing all outstanding tasks, and offering to assist with the transition. Doing so protects your professional reputation, which is crucial when job hopping, and leaves a positive lasting impression with your former colleagues and managers. Those relationships can be valuable for networking, references, and potential future collaborations, so express gratitude for the experience and keep the lines of communication open.</p>
<p><a id="evaluate-the-companys-stability"></a></p>
<h3>Evaluate the Company's Stability</h3>
<p>Before making a decision to switch jobs, it's crucial to assess the stability of the prospective company. If the company exhibits signs of instability or has a high employee turnover rate, it might not be the best choice for your next move, even if the job offers a higher salary or better benefits. A stable work environment can provide a sense of security and allow for long-term growth and development.</p>
<p><a id="consider-the-impact-on-your-work-life-balance"></a></p>
<h3>Consider the Impact on Your Work-Life Balance</h3>
<p>Job hopping can sometimes disrupt your work-life balance, particularly if you're constantly adapting to new roles, teams, and work environments. When considering a new job, think about how it will affect your personal life, including your family, hobbies, and personal commitments. Ensure that the new job aligns with your work-life balance goals and won't negatively impact your personal life.</p>
<p><a id="take-time-to-reflect-on-each-job-change"></a></p>
<h3>Take Time to Reflect on Each Job Change</h3>
<p>After each job change, take some time to reflect on your experiences. Consider what you learned, what you liked and disliked, and how these experiences can inform your future career decisions. This reflection can help you understand your career preferences, strengths, and areas for improvement, enabling you to make more informed decisions when job hopping.</p>
<p><a id="be-prepared-for-potential-negative-perceptions"></a></p>
<h3>Be Prepared for Potential Negative Perceptions</h3>
<p>While job hopping is more accepted in the tech industry, it may still be viewed negatively by some people. Be prepared to address any potential negative perceptions during interviews. Explain why job hopping has been beneficial for your career, focusing on the diverse skills and experiences you've gained.</p>
<p><a id="dont-job-hop-just-for-the-sake-of-it"></a></p>
<h3>Don't Job Hop Just for the Sake of It</h3>
<p>While job hopping can offer many benefits, it's important not to do it without a clear purpose. Ensure that each job change aligns with your career goals and offers valuable experience or skills. Aimless job hopping can lead to a disjointed career path and may raise red flags for potential employers.</p>
<p><a id="consider-the-benefits-and-drawbacks"></a></p>
<h3>Consider the Benefits and Drawbacks</h3>
<p>Before deciding to job hop, weigh the benefits and drawbacks. While job hopping can offer higher salaries, diverse experiences, and faster career progression, it can also lead to instability, a lack of deep expertise in one area, and potential negative perceptions. Make sure that the benefits outweigh the drawbacks before making a move.</p>
<p><a id="keep-your-skills-up-to-date"></a></p>
<h3>Keep Your Skills Up to Date</h3>
<p>In the fast-paced tech industry, keeping your skills up to date is crucial. By continuously learning and adapting to new technologies and trends, you'll be more attractive to potential employers and better equipped to take on new roles. Consider professional development opportunities, online courses, and industry certifications to keep your skills fresh.</p>
<p><a id="network-effectively"></a></p>
<h3>Network Effectively</h3>
<p>Networking is key when job hopping. Maintain your professional relationships and make new connections in the industry. Attend industry events, join professional organizations, and leverage social media platforms like LinkedIn to expand your network. A strong professional network can open up new opportunities and make job hopping easier and more successful.</p>LangChain RecursiveCharacterTextSplitter - Split by Tokens instead of characters2023-09-27T00:00:00+02:002023-09-27T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-09-27:/langchain-recursivecharactertextsplitter-split-by-tokens-instead-of-characters/<h1>LangChain RecursiveCharacterTextSplitter - Split by Tokens instead of Characters</h1>
<p>The LangChain <a href="https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter">RecursiveCharacterTextSplitter</a> is a tool that allows you to split text on predefined characters that are considered potential division points. By default, the size of the chunk is in characters but …</p><h1>LangChain RecursiveCharacterTextSplitter - Split by Tokens instead of Characters</h1>
<p>The LangChain <a href="https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter">RecursiveCharacterTextSplitter</a> is a tool that allows you to split text on predefined characters that are considered potential division points. By default, the chunk size is measured in characters, but by using the <code>from_tiktoken_encoder()</code> method you can easily split text to achieve a given chunk size in tokens instead. This is especially useful since LLMs have context limits expressed in tokens, not characters. This kind of split can be useful in various natural language processing tasks, such as language modeling or text classification.</p>
<p>To use the RecursiveCharacterTextSplitter, follow these steps:</p>
<ol>
<li>
<p>Import the required module: <code>from langchain.text_splitter import RecursiveCharacterTextSplitter</code></p>
</li>
<li>
<p>Set the desired chunk size (in tokens): <code>CHUNK_SIZE_TOKENS = 1_500</code></p>
</li>
<li>
<p>Instantiate the RecursiveCharacterTextSplitter using the <code>from_tiktoken_encoder</code> method and provide the chunk size and overlap values:</p>
</li>
</ol>
<div class="highlight"><pre><span></span><code><span class="n">text_splitter</span> <span class="o">=</span> <span class="n">RecursiveCharacterTextSplitter</span><span class="o">.</span><span class="n">from_tiktoken_encoder</span><span class="p">(</span>
<span class="n">chunk_size</span><span class="o">=</span><span class="n">CHUNK_SIZE_TOKENS</span><span class="p">,</span>
<span class="n">chunk_overlap</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div>
<ol start="4">
<li>Once the text_splitter object is created, you can use the <code>create_documents</code> method to split your text into documents. Make sure to pass the text to be split as a parameter in a list format:</li>
</ol>
<div class="highlight"><pre><span></span><code><span class="n">docs</span> <span class="o">=</span> <span class="n">text_splitter</span><span class="o">.</span><span class="n">create_documents</span><span class="p">([</span><span class="n">text</span><span class="p">])</span>
</code></pre></div>
<p>For alternative solutions and further discussion, you can refer to the following GitHub issue: <a href="https://github.com/langchain-ai/langchain/issues/4678#issuecomment-1704305645">LangChain Issue #4678</a>.</p>From Fixed-Size to NLP Chunking - A Deep Dive into Text Chunking Techniques2023-09-11T00:00:00+02:002023-11-06T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-09-11:/from-fixed-size-to-nlp-chunking-a-deep-dive-into-text-chunking-techniques/<p>Discover text chunking - the secret sauce behind accurate search results and smarter language models! By understanding how to effectively chunk text, we can improve the way we index documents, handle user queries, and utilize search results. Ready to uncover the secrets of text chunking?</p><h2>Understanding Chunking</h2>
<p>Chunking is a process that aims to embed a piece of content with as little noise as possible while maintaining semantic relevance[^2]. This process is particularly useful in semantic search, where we index a corpus of documents, each containing valuable information on a specific topic.</p>
<p>An effective chunking strategy ensures that search results accurately capture the essence of a user's query. If our chunks are too small or too large, it may lead to imprecise search results or missed opportunities to surface relevant content. As a <strong>rule of thumb</strong>, if the <strong>chunk of text makes sense without the surrounding context to a human</strong>, it will likely make sense to the language model as well[^2]. Therefore, finding the optimal chunk size for the documents in the corpus is crucial to ensuring that the search results are accurate and relevant.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#factors-influencing-chunking-strategy">Factors Influencing Chunking Strategy</a><ul>
<li><a href="#size-of-the-texts-to-be-indexed">Size of the Texts to be Indexed</a></li>
<li><a href="#length-and-complexity-of-user-queries">Length and Complexity of User Queries</a></li>
<li><a href="#utilization-of-the-retrieved-results-in-the-application">Utilization of the Retrieved Results in the Application</a></li>
</ul>
</li>
<li><a href="#chunking-methods">Chunking Methods</a><ul>
<li><a href="#fixed-size-in-characters-overlapping-sliding-window">Fixed-size (in characters) Overlapping Sliding Window</a></li>
<li><a href="#fixed-size-in-tokens-overlapping-sliding-window">Fixed-size (in tokens) Overlapping Sliding Window</a></li>
<li><a href="#recursive-structure-aware-splitting">Recursive Structure Aware Splitting</a></li>
<li><a href="#structure-aware-splitting-by-sentence-paragraph-section-chapter">Structure Aware Splitting (by Sentence, Paragraph, Section, Chapter)</a></li>
<li><a href="#nlp-chunking-tracking-topic-changes">NLP Chunking: Tracking Topic Changes</a></li>
<li><a href="#content-aware-splitting-markdown-latex-html">Content-Aware Splitting (Markdown, LaTeX, HTML)</a></li>
<li><a href="#adding-extra-context-to-the-chunk-metadata-summaries">Adding Extra Context to the Chunk (metadata, summaries)</a></li>
</ul>
</li>
<li><a href="#references">References</a></li>
<li><a href="#further-reading">Further Reading</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="factors-influencing-chunking-strategy"></a></p>
<h2>Factors Influencing Chunking Strategy</h2>
<p>There are three main factors to consider when determining a chunking strategy for a specific use case and application:</p>
<ol>
<li>The size of the texts to be indexed and chunked</li>
<li>The length and complexity of user queries</li>
<li>The utilization of the retrieved results in the application</li>
</ol>
<p><a id="size-of-the-texts-to-be-indexed"></a></p>
<h3>Size of the Texts to be Indexed</h3>
<p>The chunking unit and size should be adjusted according to the nature of the text. The chunk should be long enough to contain the relevant semantic load. For instance, individual words may not convey a specific message or piece of information, while putting an entire encyclopedia in one chunk may result in a chunk that is "about everything."</p>
<p><a id="length-and-complexity-of-user-queries"></a></p>
<h3>Length and Complexity of User Queries</h3>
<ul>
<li><strong>Longer queries</strong> or those with greater complexity typically <strong>benefit from a smaller chunk length</strong>. This helps to narrow down the search space and improve the precision of the search results. Smaller chunks allow more focused matching against embeddings, reducing the impact of irrelevant parts within the query.</li>
<li><strong>Shorter and simpler queries</strong> might not require chunking at all, as they can be processed as a single unit. Chunking may introduce unnecessary overhead in these cases, potentially hampering search performance.</li>
</ul>
<p><a id="utilization-of-the-retrieved-results-in-the-application"></a></p>
<h3>Utilization of the Retrieved Results in the Application</h3>
<p>In cases where search results are only an intermediate step in the application's chain, chunk size can be critical to the seamless operation of the application. For example, if results from multiple search queries form the input context for an LLM prompt, small chunks make it easier to fit all inputs within the maximum allowed context size for a given LLM. Conversely, if the search result is presented directly to the user, larger chunks may be more appropriate.</p>
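<p>To make the context-budget point concrete, here is a minimal sketch (the function names and the ~4-characters-per-token heuristic are illustrative assumptions, not a production implementation): given retrieved chunks already sorted by relevance, greedily keep chunks until the prompt's token budget is exhausted.</p>

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for common English text."""
    return max(1, len(text) // 4)


def pack_context(chunks: list[str], max_tokens: int) -> list[str]:
    """Greedily keep top-ranked chunks (assumed sorted by relevance)
    until the token budget for the LLM prompt is exhausted."""
    packed, used = [], 0
    for chunk in chunks:
        cost = approx_tokens(chunk)
        if used + cost > max_tokens:
            break
        packed.append(chunk)
        used += cost
    return packed
```

<p>With smaller chunks, the leftover budget after packing is smaller, so less retrieved material is wasted.</p>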
<p><a id="chunking-methods"></a></p>
<h2>Chunking Methods</h2>
<p>There are several methods for chunking text, each with its own advantages and disadvantages. The choice of method depends on the specific requirements of the use case and application.</p>
<p><a id="fixed-size-in-characters-overlapping-sliding-window"></a></p>
<h3>Fixed-size (in characters) Overlapping Sliding Window</h3>
<p>The Fixed-size overlapping sliding window method is a naive approach to text chunking: the text is divided into fixed-size chunks based on character count, which makes the method straightforward to implement. The overlap aids in preserving the integrity of sentences or thoughts, ensuring they are not cut in the middle; if one window truncates a thought, another window might contain the complete thought.</p>
<p>However, this method presents certain limitations. One significant drawback is the lack of precise control over the context size. Most language models operate on the basis of tokens rather than characters or words, making this method less efficient. The strict and fixed-size nature of the window might also result in severing words, sentences, or paragraphs in the middle, which could impede comprehension and disrupt the flow of information.</p>
<p>Furthermore, this method does not take semantics into account, providing no guarantee that the semantic unit of the text capturing a given idea or thought will be accurately encapsulated within a chunk. Consequently, one chunk may not be semantically distinct from another.</p>
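<p>A minimal sketch of this method (names are illustrative): slide a fixed-size character window over the text with a configurable overlap.</p>

```python
def sliding_window_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character chunks; consecutive chunks
    share `overlap` characters, so a thought cut at one boundary may
    appear whole in a neighboring chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

<p>Note how the sketch exhibits exactly the drawbacks discussed above: chunk boundaries fall at arbitrary character positions, regardless of words or semantics.</p>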
<h4>Use Cases</h4>
<p>The Fixed-size overlapping sliding window method can be beneficial in certain scenarios. It is especially useful in preliminary exploratory data analysis, where the goal is to obtain a general understanding of the text structure rather than a deep semantic analysis. Additionally, it could be employed in scenarios where the text data does not have a strong semantic structure, such as in certain types of raw data or logs.</p>
<p>However, for tasks that require semantic understanding and precise context, such as sentiment analysis, question-answering systems, or text summarization, more sophisticated text chunking methods would be more appropriate.</p>
<h4>Summary</h4>
<p><strong>Pros:</strong></p>
<ul>
<li>Counting characters makes implementation easy</li>
<li>Using overlap helps to avoid having sentences or thoughts cut in the middle - if one window cuts the thought, perhaps another will contain it in one piece.</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>No precise control over the context size - models work on and measure text in tokens, not in characters or words</li>
<li>Having a strict, fixed-size window might lead to cutting words, sentences, or paragraphs in the middle.</li>
<li>Doesn't take semantics into account - no guarantee that the semantic unit of text capturing a given idea or thought will be accurately captured in one chunk, with another chunk dedicated to another idea</li>
</ul>
<p><strong>Use cases:</strong></p>
<ul>
<li>Preliminary exploratory data analysis where a general understanding of the text is required</li>
<li>Scenarios where the text does not have a strong semantic structure, such as certain types of raw data or logs</li>
<li>Not recommended for tasks requiring semantic understanding and precise contexts like sentiment analysis, question-answering systems, or text summarization</li>
</ul>
<p><a id="fixed-size-in-tokens-overlapping-sliding-window"></a></p>
<h3>Fixed-size (in tokens) Overlapping Sliding Window</h3>
<p>The Fixed-size sliding window method in tokens is another approach to text chunking. Unlike the character-based method, this approach divides the text into chunks based on the count of tokens produced by the tokenizer, making it more aligned with the way language models operate.</p>
<p>In this method, the size of the context is more precisely controlled, as it works on tokens rather than characters. A helpful rule of thumb is that one token generally corresponds to ~4 characters of common English text. Counting tokens makes cutting words in the middle somewhat less likely than counting characters, but the problem still persists: the method can still sever sentences or thoughts in the middle, disrupting the flow of information. Moreover, like the character-based method, this approach does not take semantics into account. There is no guarantee that a chunk accurately captures a unique thought or idea, so chunks may be semantically inconsistent.</p>
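<p>A sketch of the token-based window. The tokenizer is passed in as a parameter; the whitespace tokenizer used in the demo is only a stand-in - in practice you would plug in a real BPE tokenizer such as tiktoken, as LangChain's <code>from_tiktoken_encoder()</code> does.</p>

```python
from typing import Callable


def token_window_chunks(
    text: str,
    chunk_size: int,
    overlap: int,
    tokenize: Callable[[str], list[str]],
    detokenize: Callable[[list[str]], str],
) -> list[str]:
    """Slide a fixed-size window over the token sequence rather than
    over raw characters, so chunk sizes line up with LLM context limits."""
    tokens = tokenize(text)
    step = chunk_size - overlap
    return [detokenize(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]


# Demo with a naive whitespace tokenizer (a stand-in for a real BPE tokenizer).
chunks = token_window_chunks(
    "one two three four five six",
    chunk_size=3,
    overlap=1,
    tokenize=str.split,
    detokenize=" ".join,
)
```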
<h4>Where to Use It</h4>
<p>The use cases are similar to those of the fixed-size window based on character count, with one difference: when the count is based on tokens, the method works better for tasks where we are limited by the LLM context size.</p>
<h4>Summary</h4>
<p><strong>Pros:</strong></p>
<ul>
<li>More precise control over LLM context size as it operates on tokens, not characters.</li>
<li>Still relatively easy to implement</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>Can still sever sentences or thoughts in the middle</li>
<li>Does not take semantics into account, hence no guarantee that a chunk accurately captures a unique thought or idea</li>
</ul>
<p><strong>Use cases:</strong></p>
<ul>
<li>For exploratory, initial work with LLMs</li>
<li>Not recommended for tasks requiring a deep understanding of the semantics and context of the text, like sentiment analysis or text summarization</li>
</ul>
<p><a id="recursive-structure-aware-splitting"></a></p>
<h3>Recursive Structure Aware Splitting</h3>
<p>Recursive Structure-Aware Splitting is a hybrid approach to text chunking, combining elements of the fixed-size sliding window method and the structure-aware splitting method. This method attempts to create chunks of approximately fixed sizes, either in characters or tokens, while also trying to preserve the original units of text such as words, sentences, or paragraphs.</p>
<p>In this method, the text is recursively split using various separators such as paragraph breaks ("\n\n"), new lines ("\n"), or spaces (" "), moving to the next level of granularity only when necessary. This allows the method to balance the need for a fixed chunk size with the desire to respect the natural linguistic boundaries of the text.</p>
<p>The major advantage of this method is its flexibility. It provides more precise control over context size compared to fixed-size methods, while also ensuring that semantic units of text are not unnecessarily severed.</p>
<p>However, this method also has its drawbacks. The complexity of implementation is higher due to the recursive nature of the splitting. There's also the risk of ending up with chunks of highly variable sizes, especially with texts of varying structural complexity.</p>
<blockquote>
<p>NOTE: <a href="https://www.langchain.com/">LangChain</a> provides an implementation of this method: <a href="https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter">Recursively split</a></p>
</blockquote>
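<p>The recursive idea can be sketched in a few lines of plain Python. This is a deliberate simplification of what LangChain implements (it drops the separators from the output and does not merge small pieces back up toward the target size): try the coarsest separator first, and recurse with finer separators only on pieces that are still too long.</p>

```python
def recursive_split(text: str, max_len: int, seps=("\n\n", "\n", " ")) -> list[str]:
    """Split on the coarsest separator first; recurse with finer
    separators only on pieces that still exceed max_len."""
    if len(text) <= max_len:
        return [text]
    if not seps:
        # No separators left: fall back to a hard character cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    head, *rest = seps
    out = []
    for piece in (p for p in text.split(head) if p):
        out.extend(recursive_split(piece, max_len, tuple(rest)))
    return out
```

<p>Paragraphs that already fit are kept whole; only oversized pieces get broken down at the next level of granularity, which is exactly the balance this method aims for.</p>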
<h4>Where to Use It</h4>
<p>Recursive Structure Aware Splitting is particularly useful in tasks where both the granularity of tokens and the preservation of semantic integrity are crucial. This includes tasks such as text summarization, sentiment analysis, and document classification.</p>
<p>However, due to its complexity, it might not be the best fit for tasks that require quick and simple text chunking, or for tasks involving texts with inconsistent or unclear structural divisions.</p>
<h4>Summary</h4>
<p><strong>Pros:</strong></p>
<ul>
<li>Balances the need for fixed chunk sizes with the preservation of natural linguistic boundaries</li>
<li>Provides more precise control over the context size</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>Higher complexity of implementation due to the recursive nature of the splitting</li>
<li>Risk of ending up with chunks of highly variable sizes</li>
</ul>
<p><strong>Use cases:</strong></p>
<ul>
<li>Useful in tasks where both the granularity of tokens and the preservation of semantic integrity are crucial, such as text summarization, sentiment analysis, and document classification</li>
<li>Not recommended for tasks requiring quick and simple text chunking, or tasks involving texts with inconsistent or unclear structural divisions</li>
</ul>
<p><a id="structure-aware-splitting-by-sentence-paragraph-section-chapter"></a></p>
<h3>Structure Aware Splitting (by Sentence, Paragraph, Section, Chapter)</h3>
<p>Structure Aware Splitting is an advanced approach to text chunking, which takes into account the inherent structure of the text. Instead of using a fixed-size window, this method divides the text into chunks based on its natural divisions such as sentences, paragraphs, sections, or chapters.</p>
<p>This method is particularly beneficial as it respects the natural linguistic boundaries of the text, ensuring that words, sentences, and thoughts are not cut in the middle. This aids in preserving the semantic integrity of the information within each chunk.</p>
<p>However, this method does have certain limitations. Handling text of varying structural complexity can be challenging: some texts do not have clearly defined sections or chapters, e.g. text extracted from OCR output, unformatted speech-to-text transcripts, or text extracted from tables. Also, while it is more semantically aware than the fixed-size methods, it still doesn't guarantee perfect semantic consistency within chunks, especially for larger structural units like sections or chapters.</p>
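<p>A minimal structure-aware sketch using regular expressions to split by paragraph and then by sentence. The sentence pattern is naive; a production system should use a proper sentence segmenter (e.g. NLTK or spaCy).</p>

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Paragraphs are blocks separated by one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph: str) -> list[str]:
    """Naive sentence split after ., ! or ? followed by whitespace.
    Abbreviations like "e.g." will break this - use a real segmenter."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]
```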
<h4>Where to Use It</h4>
<p>Structure Aware Splitting is highly effective for tasks that require a good understanding of the context and semantics of the text. It is particularly useful for text summarization, sentiment analysis, and document classification tasks.</p>
<p>However, it might not be the best fit for tasks involving texts that lack defined structural divisions, or for tasks that require a finer granularity, such as word-level Named Entity Recognition (NER).</p>
<h4>Summary</h4>
<p><strong>Pros:</strong></p>
<ul>
<li>Respects natural linguistic boundaries, avoiding severing words, sentences, or thoughts</li>
<li>Preserves the semantic integrity of information within each chunk</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>Challenging to handle text with varying structural complexity</li>
<li>Does not guarantee perfect semantic consistency within chunks, especially for larger structural units</li>
<li>No control over chunk size - chunks from a given document might vary significantly in size</li>
</ul>
<p><strong>Use cases:</strong></p>
<ul>
<li>Effective for tasks requiring good understanding of context and semantics, such as text summarization, sentiment analysis, and document classification</li>
<li>Not recommended for tasks involving texts that lack defined structural divisions, or tasks needing finer granularity, like word-level NER</li>
</ul>
<p><a id="nlp-chunking-tracking-topic-changes"></a></p>
<h3>NLP Chunking: Tracking Topic Changes</h3>
<p>NLP Chunking with Topic Tracking is a sophisticated approach to text chunking. This method divides the text into chunks based on semantic understanding, specifically by detecting significant shifts in the topics of sentences. If the topic of a sentence significantly differs from the topic of the previous chunk, this sentence is considered the beginning of a new chunk.</p>
<p>This method has the distinct advantage of maintaining semantic consistency within each chunk. By tracking the changes in topics, this method ensures that each chunk is semantically distinct from the others, thereby capturing the inherent structure and meaning of the text.</p>
<p>However, this method is not without its challenges. It requires advanced NLP techniques to accurately detect topic shifts, which adds to the complexity of implementation. Additionally, the accuracy of chunking heavily depends on the effectiveness of the topic modeling and detection techniques used.</p>
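<p>A toy illustration of the idea (not a real topic model): here, word overlap between a sentence and the current chunk stands in for topic similarity, and the threshold is an arbitrary assumption. In practice you would compare sentence embeddings or topic-model outputs instead.</p>

```python
def jaccard(a: set, b: set) -> float:
    """Word-overlap similarity between two sets of words."""
    return len(a & b) / len(a | b) if a | b else 1.0


def topic_chunks(sentences: list[str], threshold: float = 0.1) -> list[list[str]]:
    """Start a new chunk when a sentence's word overlap with the
    current chunk falls below `threshold` - a crude stand-in for
    detecting a topic shift."""
    chunks: list[list[str]] = []
    chunk_words: set[str] = set()
    for sent in sentences:
        words = set(sent.lower().split())
        if chunks and jaccard(words, chunk_words) >= threshold:
            chunks[-1].append(sent)
            chunk_words |= words
        else:
            chunks.append([sent])
            chunk_words = words
    return chunks
```

<p>Swapping the Jaccard score for cosine similarity between sentence embeddings turns this toy into the semantic-chunking approach discussed above, at the cost of the heavier NLP machinery.</p>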
<h4>Where to Use It</h4>
<p>NLP Chunking with Topic Tracking is highly effective for tasks that require an understanding of the semantic context and topic continuity. It is particularly useful for text summarization, sentiment analysis, and document classification tasks.</p>
<p>This method might not be the best fit for tasks involving texts that have a high degree of topic overlap or for tasks that require simple text chunking without the need for deep semantic understanding.</p>
<h4>Summary</h4>
<p><strong>Pros:</strong></p>
<ul>
<li>Maintains semantic consistency within each chunk</li>
<li>Captures the inherent structure and meaning of the text by tracking topic changes</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>Requires advanced NLP techniques, increasing the complexity of implementation</li>
<li>The accuracy of chunking heavily depends on the effectiveness of the topic modeling and detection techniques used</li>
</ul>
<p><strong>Use cases:</strong></p>
<ul>
<li>Highly effective for tasks requiring semantic context and topic continuity, such as text summarization, sentiment analysis, and document classification</li>
<li>Not recommended for tasks involving texts with high degrees of topic overlap or tasks requiring simple text chunking without the need for deep semantic understanding</li>
</ul>
<p><a id="content-aware-splitting-markdown-latex-html"></a></p>
<h3>Content-Aware Splitting (Markdown, LaTeX, HTML)</h3>
<p>Content-Aware Splitting is a method of text chunking that focuses on the type and structure of the content, particularly in structured documents like those written in Markdown, LaTeX, or HTML. This method identifies and respects the inherent structure and divisions of the content, such as headings, code blocks, and tables, to create distinct chunks.</p>
<p>The primary advantage of this method is that it ensures different types of content are not mixed within a single chunk. For instance, a chunk containing a code block will not also contain a part of a table. This helps maintain the integrity and context of the content within each chunk.</p>
<p>However, this method also presents certain challenges. It requires understanding and parsing the specific syntax of the structured document format, which can increase the complexity of implementation. Moreover, it might not be suitable for documents that lack clear structural divisions or those written in plain text without any specific format.</p>
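<p>A minimal content-aware sketch for Markdown: split the document into sections at top- and second-level headings. This is simplified - a robust implementation must also track fenced code blocks, so that a <code>#</code> inside a code fence is not mistaken for a heading.</p>

```python
import re


def markdown_sections(md: str) -> list[str]:
    """Split a Markdown document into sections at #/## heading lines,
    keeping each heading together with the body that follows it."""
    parts = re.split(r"(?m)^(?=#{1,2} )", md)
    return [p.strip() for p in parts if p.strip()]
```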
<h4>Where to Use It</h4>
<p>Content Aware Splitting is especially useful when dealing with structured documents or content with clear formatting, such as technical documentation, academic papers, or web pages. It helps ensure that the chunks created are meaningful and contextually consistent.</p>
<p>However, this method might not be the best fit for unstructured or plain text documents, or for tasks that do not require a deep understanding of the content structure.</p>
<h4>Summary</h4>
<p><strong>Pros:</strong></p>
<ul>
<li>Ensures different types of content are not mixed within a single chunk</li>
<li>Respects and maintains the integrity and context of the content within each chunk</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>Requires understanding and parsing the specific syntax of the structured document format</li>
<li>Might not be suitable for unstructured or plain text documents</li>
</ul>
<p><strong>Where to Use It:</strong></p>
<ul>
<li>Particularly useful for structured documents or content with clear formatting, such as technical documentation, academic papers, or web pages</li>
<li>Not recommended for unstructured or plain text documents, or tasks that do not require a deep understanding of the content structure</li>
</ul>
<p><a id="adding-extra-context-to-the-chunk-metadata-summaries"></a></p>
<h3>Adding Extra Context to the Chunk (metadata, summaries)</h3>
<p>Adding extra context to the chunks in the form of metadata or summaries can significantly enhance the value of each chunk and improve the overall understanding of the text[^3]. Here are two strategies:</p>
<h4>Adding Metadata to Each Chunk</h4>
<p>This strategy involves adding relevant metadata to each chunk. Metadata could include information such as the source of the text, the author, the date of publication, or even data about the content of the chunk itself, like its topic or keywords. This extra context can provide valuable insights and make the chunks more meaningful and easier to analyze.</p>
<blockquote>
<p>NOTE: For chunks that are vectorized using text embeddings, be aware that vector databases typically allow storing metadata alongside the embedding vectors.</p>
</blockquote>
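<p>A small sketch of this strategy (the field names are illustrative): wrap each chunk with document-level metadata plus its position - the same shape a vector database would store as the payload next to the embedding.</p>

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def attach_metadata(chunks: list[str], source: str, author: str) -> list[Chunk]:
    """Wrap raw text chunks with document-level metadata plus a
    per-chunk index; the dict can later be used for filtering."""
    return [
        Chunk(text, {"source": source, "author": author, "chunk_index": i})
        for i, text in enumerate(chunks)
    ]
```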
<p><strong>Pros:</strong></p>
<ul>
<li>Provides additional information about each chunk</li>
<li>Enhances the value of each chunk, making them more meaningful and easier to analyze</li>
<li>Can help produce more effective embeddings by anchoring the chunk in its broader context</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>Requires additional processing to generate and attach the metadata</li>
<li>The usefulness of the metadata depends on its relevance and accuracy</li>
</ul>
<p><strong>Where to Use It:</strong></p>
<ul>
<li>Especially useful in tasks that involve analyzing the origin, authorship, or content of the chunks, such as text classification, document clustering, or information retrieval</li>
<li>Can be used to filter the sources used to provide context to LLMs.</li>
</ul>
<p>You can get an intuition of what is possible from the llama_index documentation on metadata extraction and usage: <a href="https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/usage_metadata_extractor.html">Metadata Extraction Usage Pattern - LlamaIndex 🦙 0.9.30</a></p>
<h4>Passing on Chunk Summaries</h4>
<p>In this strategy, each chunk is summarized, and that summary is passed on to the next chunk. This method provides a 'running context' that can enhance the understanding of the text and maintain the continuity of information.</p>
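<p>A toy sketch of the running-context idea. The <code>summarize</code> function here is a naive first-sentence stand-in; in practice an LLM or a summarization model would produce the summary.</p>

```python
def summarize(text: str) -> str:
    """Naive stand-in for a real summarizer: take the first sentence."""
    return text.split(".")[0].strip() + "."


def chunks_with_running_context(chunks: list[str]) -> list[str]:
    """Prepend each chunk (after the first) with a summary of the
    previous chunk, preserving continuity of information."""
    out = []
    prev_summary = ""
    for chunk in chunks:
        out.append(f"[Context: {prev_summary}]\n{chunk}" if prev_summary else chunk)
        prev_summary = summarize(chunk)
    return out
```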
<p><strong>Pros:</strong></p>
<ul>
<li>Enhances the understanding of the text by maintaining a running context</li>
<li>Helps to ensure the continuity of information across chunks</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>Requires advanced NLP techniques to generate accurate and meaningful summaries</li>
<li>The effectiveness of this method depends on the quality of the summaries</li>
</ul>
<p><strong>Where to Use It:</strong></p>
<ul>
<li>Particularly useful in tasks where understanding the continuity and context of the text is crucial, such as text summarization or reading comprehension tasks</li>
</ul>
<h4>Other Experimental Strategies for Adding Context to the Chunks</h4>
<ol>
<li>
<p><strong>Keyword Tagging:</strong> This method involves identifying and tagging the most important keywords or phrases in each chunk. These tags then serve as a quick reference or summary of the chunk's content. Advanced NLP techniques can be used to identify these keywords based on their relevance and frequency.</p>
</li>
<li>
<p><strong>Sentiment Analysis:</strong> For text that contains opinions or reviews, performing sentiment analysis on each chunk and attaching the sentiment score (positive, negative, neutral) as metadata can provide valuable context. This can be particularly useful in tasks such as customer feedback analysis or social media monitoring.</p>
</li>
<li>
<p><strong>Entity Recognition:</strong> Applying Named Entity Recognition (NER) techniques to each chunk can identify and label entities such as names of people, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. This entity information can be added to each chunk, providing valuable context, especially in tasks like information extraction or knowledge graph construction.</p>
</li>
<li>
<p><strong>Topic Classification:</strong> Each chunk can be classified into one or more topics using machine learning or NLP techniques. This topic label can provide a quick understanding of what each chunk is about, adding valuable context, especially for tasks like document classification or recommendation.</p>
</li>
<li>
<p><strong>Chunk Linking:</strong> This method involves creating links between related chunks based on shared keywords, entities, or topics. These links can provide a 'map' of the content, showing how different chunks relate to each other. This can be particularly useful in tasks involving large and complex texts, where understanding the overall structure and relations between different parts is important.</p>
</li>
</ol>
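<p>As a toy illustration of chunk linking (strategy 5), the sketch below links chunks that share at least a couple of content words; a real system would link on shared entities or topics instead, and the stopword list here is a minimal placeholder.</p>

```python
from itertools import combinations

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def keywords(text: str) -> set[str]:
    """Content words of a chunk: lowercase tokens minus stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def link_chunks(chunks: list[str], min_shared: int = 2) -> list[tuple[int, int]]:
    """Return index pairs of chunks sharing at least `min_shared`
    content words - a crude 'map' of related content."""
    kw = [keywords(c) for c in chunks]
    return [
        (i, j)
        for i, j in combinations(range(len(chunks)), 2)
        if len(kw[i] & kw[j]) >= min_shared
    ]
```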
<h2>Conclusions</h2>
<p>In the field of Natural Language Processing, text chunking emerges as a powerful technique that significantly enhances the performance of semantic search and language models. By breaking down text into manageable, contextually relevant chunks, we can ensure more accurate and meaningful search results.</p>
<p>The choice of chunking method, whether it's fixed-size, structure-aware, or NLP chunking, depends on the specific requirements of the use case and application. Each method has its own strengths and limitations, and understanding these is crucial to implementing an effective chunking strategy.</p>
<p>Moreover, adding extra context to the chunks, such as metadata or summaries, can further enhance the value of each chunk and improve the overall understanding of the text. Experimental strategies like keyword tagging, sentiment analysis, entity recognition, topic classification, and chunk linking offer promising avenues for further exploration.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em>
<a id="references"></a></p>
<h2>References</h2>
<ul>
<li>[^1] <a href="https://blog.abacus.ai/blog/2023/08/10/create-your-custom-chatgpt-pick-the-best-llm-that-works-for-you/">Create a CustomGPT And Supercharge your Company with AI - Pick the Best LLM - The Abacus.AI Blog</a></li>
<li>[^2] <a href="https://www.pinecone.io/learn/chunking-strategies/">Chunking Strategies for LLM Applications | Pinecone</a></li>
<li>[^3] <a href="https://actalyst.medium.com/optimize-llm-enterprise-applications-through-embeddings-and-chunking-strategy-1bbdb03bedae">Optimize LLM Enterprise Applications through Embeddings and Chunking Strategy. | by Actalyst | Aug, 2023 | Medium</a></li>
<li>[^4] <a href="https://vectara.com/grounded-generation-done-right-chunking/">Retrieval Augmented Generation (RAG) Done Right: Chunking - Vectara</a> (NLP chunking, compare chunking strategies) + <a href="https://github.com/vectara/example-notebooks/blob/main/notebooks/chunking-demo.ipynb">notebook</a></li>
</ul>
<p><a id="further-reading"></a></p>
<h2>Further Reading</h2>
<ul>
<li><a href="https://medium.com/aimonks/simple-guide-to-text-chunking-for-your-llm-applications-bddfe8ad7892">Simple guide to Text Chunking for Your LLM Applications | by NoCode AI | 𝐀𝐈 𝐦𝐨𝐧𝐤𝐬.𝐢𝐨 | Medium</a></li>
<li><a href="https://arxiv.org/abs/2307.03172">[2307.03172] Lost in the Middle: How Language Models Use Long Contexts</a></li>
<li><a href="https://community.openai.com/t/the-length-of-the-embedding-contents/111471">The length of the embedding contents - API - OpenAI Developer Forum</a></li>
<li><a href="https://www.anyscale.com/blog/a-comprehensive-guide-for-building-rag-based-llm-applications-part-1">Building RAG-based LLM Applications for Production (Part 1)</a></li>
<li>expanding context, hierarchical search, ...: <a href="https://reframe.is/wiki/Effects-of-Chunk-Sizes-on-Retrieval-Augmented-Generation-RAG-Applications-8b728c36d005434dba39ad19be9b82cc/">Effects of Chunk Sizes on Retrieval Augmented Generation (RAG) Applications</a></li>
<li><a href="https://dl.acm.org/doi/10.1007/s10579-013-9250-3">A novel method for performance evaluation of text chunking | Language Resources and Evaluation</a></li>
<li><a href="https://www.mattambrogi.com/posts/chunk-size-matters/">Matt Ambrogi</a> "Chunk Size Matters"</li>
<li><a href="https://blog.llamaindex.ai/evaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5">Evaluating the Ideal Chunk Size for a RAG System using LlamaIndex | by Ravi Theja | Oct, 2023 | LlamaIndex Blog</a></li>
<li>short (4min 25s) overview of chunking methods from Weaviate: <a href="https://www.youtube.com/watch?v=h5id4erwD4s">Chunking Methods to use Custom Data with LLMs</a></li>
<li><a href="https://www.youtube.com/watch?v=8OJC21T2SL4">The 5 Levels Of Text Splitting For Retrieval - YouTube</a> (Fixed Size Chunking, Recursive Chunking, Document Based Chunking, <strong>Semantic Chunking</strong>, Agentic Chunking - a chunking strategy that explores using an LLM to decide how much and which text should go into a chunk based on context) + <a href="https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb">notebook</a></li>
<li>Visualization of chunking - <a href="https://chunkviz.up.railway.app/">ChunkViz</a></li>
</ul>
<p><strong>Edits:</strong></p>
<ul>
<li>2023-11-06 - added reference: Evaluating the Ideal Chunk Size for a RAG System using LlamaIndex</li>
<li>2023-11-13 - added video from Weaviate</li>
</ul>
<p>X::<a href="https://www.safjan.com/from-fixed-size-to-nlp-chunking-a-deep-dive-into-text-chunking-techniques/">From Fixed-Size to NLP Chunking - A Deep Dive into Text Chunking Techniques</a></p>Criticism of the Lean Startup2023-09-04T00:00:00+02:002023-11-07T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-09-04:/criticism-of-the-lean-startup/<p>X::<a href="https://www.safjan.com/product-led-growth/">Product Led Growth</a>
X::<a href="https://www.safjan.com/growth-hacking-methodology/">Growth Hacking Methodology</a></p>
<p>The Lean Startup method is still considered a valuable and relevant approach to launching and managing startups. However, it's important to recognize that the business and entrepreneurial landscape is dynamic, and the applicability of …</p><p>X::<a href="https://www.safjan.com/product-led-growth/">Product Led Growth</a>
X::<a href="https://www.safjan.com/growth-hacking-methodology/">Growth Hacking Methodology</a></p>
<p>The Lean Startup method is still considered a valuable and relevant approach to launching and managing startups. However, it's important to recognize that the business and entrepreneurial landscape is dynamic, and the applicability of any methodology can evolve over time.</p>
<p>The Lean Startup method, popularized by Eric Ries, emphasizes a systematic and iterative approach to building and scaling a startup by validating assumptions, minimizing waste, and staying agile. Many principles of the Lean Startup, such as customer-centricity, rapid experimentation, and continuous learning, remain highly relevant in today's business environment.</p>
<p>However, there are some criticisms and challenges associated with the Lean Startup method, including:</p>
<ol>
<li>
<p><strong>Oversimplification</strong>: Critics argue that the Lean Startup method can sometimes oversimplify the complexity of building a successful business. While it encourages rapid experimentation, it may not address all the intricacies and industry-specific nuances that startups may encounter.</p>
</li>
<li>
<p><strong>Overemphasis on MVP (Minimum Viable Product)</strong>: Some argue that an overemphasis on building MVPs can lead to premature scaling or neglecting long-term vision and product quality. In some industries, especially those requiring substantial upfront investment or regulatory compliance, an MVP might not be appropriate.</p>
</li>
<li>
<p><strong>Bias Toward Tech Startups</strong>: The Lean Startup method was initially designed with tech startups in mind and may not be as applicable to businesses in other industries, such as healthcare, biotech, or manufacturing, which have longer development cycles and higher regulatory barriers.</p>
</li>
<li>
<p><strong>Market Saturation</strong>: In some markets, especially in technology hubs like Silicon Valley, there's a concern that the Lean Startup method has led to an oversaturation of similar ideas and startups, making it more challenging for any single company to stand out.</p>
</li>
<li>
<p><strong>Evolving Landscape</strong>: As technology and business landscapes evolve, new methodologies and approaches may emerge that complement or surpass the Lean Startup method. For example, concepts like <a href="https://www.safjan.com/design-thinking/">Design Thinking</a>, <a href="https://www.safjan.com/growth-hacking-methodology/">Growth Hacking Methodology</a> and <a href="https://www.safjan.com/product-led-growth/">Product Led Growth</a> have gained traction in recent years.</p>
</li>
</ol>
<p>To assess the current validity and relevance of the Lean Startup method, it's essential to consider the specific context, industry, and maturity of your startup. While the core principles of customer-centricity, iteration, and learning remain valuable, startups should also be open to adapting and combining methodologies based on their unique circumstances and challenges. Additionally, staying updated with the latest trends and methodologies in entrepreneurship is crucial to making informed decisions.</p>Design Thinking2023-09-04T00:00:00+02:002023-11-07T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-09-04:/design-thinking/<p>X::<a href="https://www.safjan.com/criticism-of-the-lean-startup/">Criticism of the Lean Startup</a></p>
<p>X::<a href="https://www.safjan.com/growth-hacking-methodology/">Growth Hacking Methodology</a>
X::<a href="https://www.safjan.com/product-led-growth/">Product Led Growth</a></p>
<p>Design thinking is a human-centered and problem-solving approach to innovation and product development that has gained significant traction in the business world in recent years. It places a …</p><p>X::<a href="https://www.safjan.com/criticism-of-the-lean-startup/">Criticism of the Lean Startup</a></p>
<p>X::<a href="https://www.safjan.com/growth-hacking-methodology/">Growth Hacking Methodology</a>
X::<a href="https://www.safjan.com/product-led-growth/">Product Led Growth</a></p>
<p>Design thinking is a human-centered and problem-solving approach to innovation and product development that has gained significant traction in the business world in recent years. It places a strong emphasis on empathy, creativity, and iterative processes to tackle complex problems and create user-centric solutions. Here's a comprehensive exploration of design thinking in the context of business and product development:</p>
<p><strong>Introduction to Design Thinking:</strong> Design thinking is a methodology that originated in the world of design but has since transcended its origins to become a widely adopted approach in various industries, including technology, healthcare, finance, and more. At its core, design thinking is about understanding and addressing the needs of users or customers by fostering a deep sense of empathy, engaging in creative problem-solving, and iterating on solutions to continuously improve them.</p>
<p><strong>Key Principles of Design Thinking:</strong></p>
<ol>
<li>
<p><strong>Empathy:</strong> Design thinking starts with empathizing with the end-users or customers to gain a deep understanding of their needs, desires, and pain points. This empathetic approach helps teams uncover insights that might not be apparent through traditional data analysis.</p>
</li>
<li>
<p><strong>Define:</strong> Once user needs are understood, the next step is to define the problem clearly and succinctly. This step involves synthesizing the information gathered during the empathy phase to create a user-centered problem statement.</p>
</li>
<li>
<p><strong>Ideate:</strong> In this phase, teams brainstorm and generate a wide range of potential solutions without judgment. It's a creative and often collaborative process that encourages thinking outside the box.</p>
</li>
<li>
<p><strong>Prototype:</strong> Prototyping involves creating low-fidelity representations of the proposed solutions. These prototypes can be anything from simple sketches to interactive mock-ups, depending on the context. The goal is to quickly visualize and test ideas.</p>
</li>
<li>
<p><strong>Test:</strong> The testing phase involves gathering feedback from users by exposing them to the prototypes. This feedback loop allows teams to refine and improve their solutions based on real-world insights.</p>
</li>
</ol>
<p><strong>Benefits of Design Thinking in Business/Product Development:</strong></p>
<ol>
<li>
<p><strong>User-Centric Innovation:</strong> Design thinking places the user at the center of the development process, leading to products and services that genuinely meet user needs and preferences.</p>
</li>
<li>
<p><strong>Enhanced Creativity:</strong> By encouraging ideation without constraints in the early stages, design thinking fosters creative thinking, which can lead to breakthrough solutions.</p>
</li>
<li>
<p><strong>Reduced Risk:</strong> Iterative testing and prototyping help identify and address issues early in the development process, reducing the risk of costly mistakes later on.</p>
</li>
<li>
<p><strong>Improved Collaboration:</strong> Design thinking often involves cross-functional teams collaborating to solve problems, breaking down silos and fostering a culture of cooperation.</p>
</li>
<li>
<p><strong>Adaptability:</strong> The iterative nature of design thinking allows businesses to adapt to changing circumstances and emerging trends more effectively.</p>
</li>
</ol>
<p><strong>Real-World Examples:</strong></p>
<p>Numerous successful companies have embraced design thinking to drive innovation and improve their products and services. For instance:</p>
<ul>
<li>
<p><strong>Apple:</strong> Apple is renowned for its commitment to user-centric design. Products like the iPhone and MacBook exemplify how design thinking has been instrumental in creating highly intuitive and visually appealing devices.</p>
</li>
<li>
<p><strong>IBM:</strong> IBM's design thinking transformation has led to the creation of IBM Design Studios, which apply design thinking principles to a wide range of projects, from software development to organizational strategy.</p>
</li>
<li>
<p><strong>Airbnb:</strong> Airbnb uses design thinking to create memorable experiences for its users. The platform continuously iterates on its website and app to enhance user satisfaction.</p>
</li>
</ul>
<p><strong>Conclusion:</strong></p>
<p>In today's fast-paced and ever-changing business landscape, design thinking offers a structured yet flexible approach to innovation and problem-solving. By prioritizing empathy, creativity, and iterative development, organizations can create products and services that resonate with users, drive growth, and stay adaptable in an increasingly competitive marketplace. As design thinking continues to evolve, it remains a valuable methodology for businesses seeking to stay customer-focused and innovative.</p>Problems with Langchain and how to minimize their impact2023-09-01T00:00:00+02:002023-10-19T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-09-01:/problems-with-Langchain-and-how-to-minimize-their-impact/<p>Beyond the Hype - LangChain's Hidden Flaws and How to Master AI Development.</p><h2>Introduction</h2>
<p><a href="https://docs.langchain.com/docs/">LangChain</a>, a popular framework for building applications with <a href="https://en.wikipedia.org/wiki/Large_language_model">large language models</a> (LLMs), has been touted as a game-changer in the world of AI-driven development. However, as more users dive into the library and its capabilities, some have found that it falls short of expectations. In this section, we'll discuss ten issues with LangChain that have left users underwhelmed and questioning its value proposition.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#problems">Problems</a><ul>
<li><a href="#1-overly-complex-and-unnecessary-abstractions">1. Overly complex and unnecessary abstractions</a></li>
<li><a href="#2-easy-breakable-and-unreliable">2. Easily breakable and unreliable</a></li>
<li><a href="#3-poor-documentation">3. Poor documentation</a></li>
<li><a href="#4-a-high-level-of-abstraction-hinders-customization">4. A high level of abstraction hinders customization</a></li>
<li><a href="#5-inefficient-token-usage">5. Inefficient token usage</a></li>
<li><a href="#6-difficult-integration-with-existing-tools">6. Difficult integration with existing tools</a></li>
<li><a href="#7-limited-value-proposition">7. Limited value proposition</a></li>
<li><a href="#8-inconsistent-behavior-and-hidden-details">8. Inconsistent behavior and hidden details</a></li>
<li><a href="#9-better-alternatives-available">9. Better alternatives available</a></li>
<li><a href="#10-primarily-optimized-for-demos">10. Primarily optimized for demos</a></li>
</ul>
</li>
<li><a href="#takeaways---how-to-use-the-langchain-right-way">Takeaways - How to Use the LangChain Right Way?</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="problems"></a></p>
<h2>Problems</h2>
<p><a id="1-overly-complex-and-unnecessary-abstractions"></a></p>
<h3>1. Overly complex and unnecessary abstractions</h3>
<p>LangChain has been criticized for having too many layers of abstraction, making it difficult to understand and modify the underlying code. These layers can lead to confusion, especially for those who are new to LLMs or LangChain itself. The complexity can also make it challenging to adapt the library to specific use cases or integrate it with existing tools and scripts. In some cases, users have found that they can achieve their goals more easily by using simpler, more straightforward code.</p>
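<p>For comparison, a framework-free call to a chat-completion HTTP API can be just a few lines. The sketch below uses only the Python standard library; the endpoint and payload follow the OpenAI chat-completions format, and the model name and key placeholder are illustrative:</p>

```python
# A framework-free sketch: building a chat-completion HTTP request directly
# with the standard library. Endpoint and payload follow the OpenAI chat
# format; the model name and key placeholder below are illustrative.
import json
import urllib.request

API_URL = "https://api.openai.com/v1/chat/completions"

def build_chat_request(prompt: str, model: str = "gpt-3.5-turbo") -> urllib.request.Request:
    """Build (but do not send) a chat-completion request for one user prompt."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer YOUR_API_KEY",  # replace with a real key
        },
        method="POST",
    )

# To actually call the API (network access and a valid key required):
# with urllib.request.urlopen(build_chat_request("Hello")) as resp:
#     answer = json.load(resp)["choices"][0]["message"]["content"]
```

<p>With the request construction this explicit, there are no hidden defaults: every parameter sent to the model is visible in your own code.</p>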
<p><a id="2-easy-breakable-and-unreliable"></a></p>
<h3>2. Easily breakable and unreliable</h3>
<p>Some users have found LangChain to be unreliable and difficult to fix due to its complex structure. The framework's fragility can lead to unexpected issues in production systems, making it challenging to maintain and scale applications built with LangChain. Users have reported that the deeper and more complex their application becomes, the more LangChain seems to become a risk to its maintainability.</p>
<p><a id="3-poor-documentation"></a></p>
<h3>3. Poor documentation</h3>
<p>LangChain's documentation has been described as confusing and lacking in key details, making it challenging for users to fully understand the library's capabilities and limitations. The documentation often omits explanations of default parameters and important details, leaving users to piece together information from various sources. This lack of clarity can hinder users' ability to effectively leverage LangChain in their projects.</p>
<p><a id="4-a-high-level-of-abstraction-hinders-customization"></a></p>
<h3>4. A high level of abstraction hinders customization</h3>
<p>Users have reported that LangChain's high level of abstraction makes it difficult to modify and adapt the library for specific use cases. This can be particularly problematic when users want to make small changes to the default behavior of LangChain or integrate it with other tools and scripts. In these cases, users may find it easier to bypass LangChain altogether and build their own solutions from scratch.</p>
<p><a id="5-inefficient-token-usage"></a></p>
<h3>5. Inefficient token usage</h3>
<p>LangChain has been criticized for inefficient token usage in its API calls, which can result in higher costs. This can be particularly problematic for users who are trying to minimize their expenses while working with LLMs. Some users have found that they can achieve better results with fewer tokens by using custom Python code or other alternative libraries.</p>
<p><a id="6-difficult-integration-with-existing-tools"></a></p>
<h3>6. Difficult integration with existing tools</h3>
<p>Users have reported difficulties integrating LangChain with their existing Python tools and scripts. This can be especially challenging for those who have complex analytics or other advanced functionality built into their applications. The high level of abstraction in LangChain can make it difficult to interface with these existing tools, forcing users to build workarounds or abandon LangChain in favor of more compatible solutions.</p>
<p><a id="7-limited-value-proposition"></a></p>
<h3>7. Limited value proposition</h3>
<p>Some users feel that LangChain does not provide enough value compared to the effort required to implement and maintain it. They argue that the library's primary use case is to quickly create demos or prototypes, rather than building production-ready applications. In these cases, users may find it more efficient to build their own solutions or explore alternative libraries that offer a better balance of ease of use and functionality.</p>
<p><a id="8-inconsistent-behavior-and-hidden-details"></a></p>
<h3>8. Inconsistent behavior and hidden details</h3>
<p>LangChain has been criticized for hiding important details and having inconsistent behavior, which can lead to unexpected issues in production systems. Users have reported that LangChain's default settings and behaviors are often undocumented or poorly explained, making it difficult to predict how the library will behave in different scenarios. This lack of transparency can lead to frustration and wasted time troubleshooting issues that could have been avoided with better documentation.</p>
<p><a id="9-better-alternatives-available"></a></p>
<h3>9. Better alternatives available</h3>
<p>Users have mentioned other libraries, such as <a href="https://github.com/microsoft/semantic-kernel">Semantic Kernel</a>, <a href="https://github.com/jerryjliu/llama_index">LlamaIndex</a>, <a href="https://haystack.deepset.ai/">Deepset Haystack</a>, or <a href="https://github.com/TransformerOptimus/SuperAGI">SuperAGI</a>, as more suitable alternatives to LangChain. These alternatives often provide clearer documentation, more flexible customization options, and better integration with existing tools and scripts. In some cases, users have found that they can achieve their goals more easily and efficiently by using these alternative libraries instead of LangChain. See <a href="https://github.com/kyrolabs/awesome-langchain#other-llm-frameworks">awesome-langchain</a> for a list of LLM frameworks.</p>
<p><a id="10-primarily-optimized-for-demos"></a></p>
<h3>10. Primarily optimized for demos</h3>
<p>LangChain has been described as being primarily optimized for quickly creating demos, rather than for building production-ready applications. <a href="https://blog.streamlit.io/langchain-streamlit/">Partnership</a> with <a href="https://streamlit.io/generative-ai?ref=blog.streamlit.io">Streamlit</a> should ease demo creation even more. While this can be useful for those who want to quickly experiment with LLMs or showcase their ideas, it can be limiting for users who want to build more robust, scalable applications. In these cases, users may find that LangChain's focus on demos and prototypes hinders their ability to build high-quality, production-ready applications.</p>
<p><a id="takeaways---how-to-use-the-langchain-right-way"></a></p>
<h2>Takeaways - How to Use the LangChain Right Way?</h2>
<p>Based on the community comments and experiences shared, here are some pieces of advice on how to create apps using LangChain that will be reliable, easy to maintain and debug:</p>
<ol>
<li>
<p><strong>Use LangChain for prototyping and experimentation</strong>: LangChain can be useful for quickly creating prototypes and validating ideas. However, for more complex and production-level applications, you might want to consider implementing the functionality you need yourself.</p>
</li>
<li>
<p><strong>Understand the underlying concepts</strong>: Before using LangChain, make sure to understand the core concepts of LLMs, prompts, and how the different components of the framework interact. This will help you make informed decisions about which parts of LangChain to use and which to replace with custom implementations.</p>
</li>
<li>
<p><strong>Focus on the value of the ecosystem</strong>: LangChain provides integrations with various tools, indexes, and prompt templates. Leverage these resources to build your application, but be aware of the limitations and potential issues that might arise from using the default settings and abstractions.</p>
</li>
<li>
<p><strong>Be prepared to write custom code</strong>: LangChain might not cover all use cases or provide the level of control and customization you need for your application. Be prepared to write custom code to better suit your specific requirements and use case.</p>
</li>
<li>
<p><strong>Keep an eye on alternative tools and libraries</strong>: As the field of LLMs is rapidly evolving, new tools and libraries are being developed that might better suit your needs. Stay informed about the latest developments and consider using alternative libraries like <a href="https://haystack.deepset.ai/">Deepset Haystack</a>, <a href="https://github.com/stanfordnlp/dspy">DSPy</a>, or Microsoft tools like <a href="https://learn.microsoft.com/en-us/semantic-kernel/overview/">semantic-kernel</a> and <a href="https://github.com/microsoft/autogen">AutoGen</a> if they better align with your project requirements. The <a href="https://github.com/kyrolabs/awesome-langchain#other-llm-frameworks">list</a> is huge and growing!</p>
</li>
<li>
<p><strong>Learn from LangChain's source code</strong>: If you find that LangChain's abstractions and documentation are not sufficient for your needs, you can learn from the source code itself. Use the provided prompts and implementation details as inspiration and adapt them to your own project.</p>
</li>
<li>
<p><strong>Consider local LLM models</strong>: While LangChain primarily focuses on using OpenAI's models, you might want to explore using local LLM models like Llama, Galpaca, Vicuna, or Koala. These models can offer benefits in terms of cost, privacy, and offline capabilities. However, be aware that they might not be as powerful or accurate as GPT-3.5 Turbo.</p>
</li>
<li>
<p><strong>Integrate with existing tools and scripts</strong>: If you need to interface with existing Python tools or scripts, make sure to understand how LangChain interacts with them and how you can best integrate them into your application.</p>
</li>
<li>
<p><strong>Test and measure the performance of your application</strong>: When using LangChain, ensure that you thoroughly test your application and measure its performance against different prompts and configurations. This will help you identify potential issues and areas for improvement.</p>
</li>
<li>
<p><strong>Keep an eye on the costs</strong>: Be mindful of the API costs associated with using LangChain and consider optimizing your application to reduce the number of API calls and tokens used.</p>
</li>
</ol>
<p>My favourite item from this list is #6 - learning from the tools and techniques LangChain implements by reading its source code.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>
<p><a id="conclusion"></a></p>
<h2>Conclusion</h2>
<p>Before adopting LangChain, it's vital to acknowledge its limitations and challenges. Although LangChain has garnered significant attention and investment, users have pinpointed various drawbacks that could impede its effectiveness in more intricate, production-ready applications. Developers should understand these issues to make well-informed decisions about LangChain's suitability for their projects.</p>
<p>In the ever-evolving landscape of LLM-driven development, assessing the available tools and libraries is crucial to determining which aligns best with your specific needs and requirements. The ideal solution might not yet exist, which may require adapting or customizing existing tools, or even building your own, to realize your vision for AI-driven applications.</p>
<p><strong>edits:</strong></p>
<ul>
<li>2023-10-19: Added AutoGen and semantic-kernel, removed GPTi,</li>
<li>2023-10-19: Added link to list of alternative frameworks</li>
</ul>Jaro-Winkler Similarity2023-08-29T00:00:00+02:002023-08-29T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-08-29:/jaro-winkler-similarity/<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#jaro-winkler-similarity">Jaro-Winkler Similarity</a></li>
<li><a href="#python-example">Python Example:</a></li>
<li><a href="#valuable-properties-of-jaro-winkler-similarity">Valuable Properties of Jaro-Winkler Similarity:</a></li>
<li><a href="#recommendations-for-usage">Recommendations for Usage:</a></li>
<li><a href="#cases-to-consider-alternatives">Cases to Consider Alternatives:</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="jaro-winkler-similarity"></a></p>
<h2>Jaro-Winkler Similarity</h2>
<p>Jaro-Winkler similarity is designed to compare two strings, giving more weight to the common prefix of the strings. The formula for Jaro-Winkler similarity is …</p><!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#jaro-winkler-similarity">Jaro-Winkler Similarity</a></li>
<li><a href="#python-example">Python Example:</a></li>
<li><a href="#valuable-properties-of-jaro-winkler-similarity">Valuable Properties of Jaro-Winkler Similarity:</a></li>
<li><a href="#recommendations-for-usage">Recommendations for Usage:</a></li>
<li><a href="#cases-to-consider-alternatives">Cases to Consider Alternatives:</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="jaro-winkler-similarity"></a></p>
<h2>Jaro-Winkler Similarity</h2>
<p>Jaro-Winkler similarity is designed to compare two strings, giving more weight to the common prefix of the strings. The formula for Jaro-Winkler similarity is:</p>
<div class="math">$$
JW(s1, s2) = J(s1, s2) + L \cdot p \cdot (1 - J(s1, s2))
$$</div>
<p>Where:</p>
<ul>
<li><span class="math">\(J(s1, s2)\)</span> is the Jaro similarity between strings (s1) and (s2).</li>
<li><span class="math">\(L\)</span> is the length of the common prefix between the strings (capped at 4 characters in Winkler's original definition).</li>
<li><span class="math">\(p\)</span> is a constant scaling factor (typically 0.1) that increases the similarity for strings that share a common prefix.</li>
</ul>
<p>The Jaro similarity <span class="math">\(J(s1, s2)\)</span> is calculated as:</p>
<div class="math">$$
J(s1, s2) = \begin{cases} 0 &amp; \text{if } m = 0 \\ \frac{1}{3}\left(\frac{m}{\text{len}(s1)} + \frac{m}{\text{len}(s2)} + \frac{m - t}{m}\right) &amp; \text{otherwise} \end{cases}
$$</div>
<p>
Where:</p>
<ul>
<li><span class="math">\(m\)</span> is the number of matching characters</li>
<li><span class="math">\(t\)</span> is the number of transpositions, i.e. half the number of matching characters that appear in a different order in the two strings</li>
</ul>
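<p>As a worked example, take the classic pair <em>MARTHA</em> / <em>MARHTA</em>: there are <span class="math">\(m = 6\)</span> matching characters, of which <em>TH</em>/<em>HT</em> are out of order, giving one transposition (<span class="math">\(t = 1\)</span>). The Jaro similarity is therefore <span class="math">\(J = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{6 - 1}{6}\right) \approx 0.944\)</span>. The strings share the prefix <em>MAR</em>, so with <span class="math">\(L = 3\)</span> and <span class="math">\(p = 0.1\)</span>, <span class="math">\(JW \approx 0.944 + 3 \cdot 0.1 \cdot (1 - 0.944) \approx 0.961\)</span>.</p>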
<p><a id="python-example"></a></p>
<h3>Python Example</h3>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">jaro_similarity</span><span class="p">(</span><span class="n">s1</span><span class="p">,</span> <span class="n">s2</span><span class="p">):</span>
<span class="n">len_s1</span><span class="p">,</span> <span class="n">len_s2</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">s1</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">s2</span><span class="p">)</span>
<span class="n">match_distance</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">len_s1</span><span class="p">,</span> <span class="n">len_s2</span><span class="p">)</span> <span class="o">//</span> <span class="mi">2</span> <span class="o">-</span> <span class="mi">1</span>
<span class="n">common_chars_s1</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">common_chars_s2</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">char</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">s1</span><span class="p">):</span>
<span class="n">start</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">i</span> <span class="o">-</span> <span class="n">match_distance</span><span class="p">)</span>
<span class="n">end</span> <span class="o">=</span> <span class="nb">min</span><span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">match_distance</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">len_s2</span><span class="p">)</span>
<span class="k">if</span> <span class="n">char</span> <span class="ow">in</span> <span class="n">s2</span><span class="p">[</span><span class="n">start</span><span class="p">:</span><span class="n">end</span><span class="p">]:</span>
<span class="n">common_chars_s1</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">char</span><span class="p">)</span>
            <span class="n">common_chars_s2</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">s2</span><span class="p">[</span><span class="n">start</span><span class="p">:</span><span class="n">end</span><span class="p">][</span><span class="n">s2</span><span class="p">[</span><span class="n">start</span><span class="p">:</span><span class="n">end</span><span class="p">]</span><span class="o">.</span><span class="n">index</span><span class="p">(</span><span class="n">char</span><span class="p">)])</span>
    <span class="n">m</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">common_chars_s1</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">m</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">return</span> <span class="mf">0.0</span>
    <span class="n">transpositions</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">c1</span> <span class="o">!=</span> <span class="n">c2</span> <span class="k">for</span> <span class="n">c1</span><span class="p">,</span> <span class="n">c2</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">common_chars_s1</span><span class="p">,</span> <span class="n">common_chars_s2</span><span class="p">))</span> <span class="o">//</span> <span class="mi">2</span>
    <span class="n">jaro_similarity</span> <span class="o">=</span> <span class="p">(</span><span class="n">m</span> <span class="o">/</span> <span class="n">len_s1</span> <span class="o">+</span> <span class="n">m</span> <span class="o">/</span> <span class="n">len_s2</span> <span class="o">+</span> <span class="p">(</span><span class="n">m</span> <span class="o">-</span> <span class="n">transpositions</span><span class="p">)</span> <span class="o">/</span> <span class="n">m</span><span class="p">)</span> <span class="o">/</span> <span class="mi">3</span>
    <span class="k">return</span> <span class="n">jaro_similarity</span>
<span class="k">def</span> <span class="nf">jaro_winkler_similarity</span><span class="p">(</span><span class="n">s1</span><span class="p">,</span> <span class="n">s2</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mf">0.1</span><span class="p">):</span>
    <span class="n">jaro_sim</span> <span class="o">=</span> <span class="n">jaro_similarity</span><span class="p">(</span><span class="n">s1</span><span class="p">,</span> <span class="n">s2</span><span class="p">)</span>
    <span class="n">common_prefix_len</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="p">(</span><span class="n">c1</span><span class="p">,</span> <span class="n">c2</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">s1</span><span class="p">,</span> <span class="n">s2</span><span class="p">)):</span>
        <span class="k">if</span> <span class="n">c1</span> <span class="o">==</span> <span class="n">c2</span><span class="p">:</span>
            <span class="n">common_prefix_len</span> <span class="o">+=</span> <span class="mi">1</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="k">break</span>
    <span class="n">jaro_winkler_sim</span> <span class="o">=</span> <span class="n">jaro_sim</span> <span class="o">+</span> <span class="p">(</span><span class="n">common_prefix_len</span> <span class="o">*</span> <span class="n">p</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">jaro_sim</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">jaro_winkler_sim</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">string1</span> <span class="o">=</span> <span class="s2">"apple"</span>
<span class="n">string2</span> <span class="o">=</span> <span class="s2">"applet"</span>
<span class="n">jw_similarity</span> <span class="o">=</span> <span class="n">jaro_winkler_similarity</span><span class="p">(</span><span class="n">string1</span><span class="p">,</span> <span class="n">string2</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Jaro-Winkler Similarity:"</span><span class="p">,</span> <span class="n">jw_similarity</span><span class="p">)</span>
</code></pre></div>
<div class="highlight"><pre><span></span><code>Jaro-Winkler Similarity: 0.9722222222222223
</code></pre></div>
<p>The Jaro-Winkler similarity metric has several properties that make it well suited to specific use cases; no single similarity metric, however, is best for every scenario. Below are its most valuable properties, recommendations for when to use it, and cases where another metric may be a better fit.
<a id="valuable-properties-of-jaro-winkler-similarity"></a></p>
<h3>Valuable Properties of Jaro-Winkler Similarity</h3>
<ol>
<li>
<p><strong>String Comparison with Common Prefix:</strong> The Jaro-Winkler metric gives higher weight to common prefixes, making it effective for comparing strings that often have a prefix or abbreviation. This is particularly useful for names and addresses.</p>
</li>
<li>
<p><strong>Adjustable Scaling Factor:</strong> The Jaro-Winkler metric allows for tuning the impact of the common prefix on the similarity score using the scaling factor <span class="math">\(p\)</span>. This allows you to emphasize or de-emphasize the common prefix based on your needs.</p>
</li>
<li>
<p><strong>Simple to Understand and Implement:</strong> The calculation of Jaro-Winkler similarity involves straightforward string matching and prefix length consideration, making it relatively easy to implement and understand.</p>
</li>
</ol>
<p><a id="recommendations-for-usage"></a></p>
<h3>Recommendations for Usage</h3>
<ol>
<li>
<p><strong>Names and Addresses:</strong> Jaro-Winkler similarity is highly recommended when comparing names, addresses, and other strings with common prefixes or abbreviations. It's often used in record linkage, deduplication, and fuzzy matching tasks in databases.</p>
</li>
<li>
<p><strong>Fuzzy String Matching:</strong> When dealing with noisy or misspelled data, the Jaro-Winkler metric can be effective in finding approximate matches. It's suitable for scenarios where small typographical errors or variations are common.</p>
</li>
<li>
<p><strong>Short Texts:</strong> Jaro-Winkler is well-suited for comparing short texts like product names, usernames, and titles, where the common prefix is an important aspect of similarity.</p>
</li>
</ol>
<p><a id="cases-to-consider-alternatives"></a></p>
<h3>Cases to Consider Alternatives</h3>
<ol>
<li>
<p><strong>Long Texts:</strong> For comparing long texts or documents, <strong>cosine similarity</strong> or <strong>Jaccard similarity</strong> of term frequencies might be more appropriate, as they consider the distribution of terms across the entire text.</p>
</li>
<li>
<p><strong>Semantic Similarity:</strong> If you're interested in capturing semantic meaning rather than character-level similarity, <strong>word embeddings</strong>-based metrics like cosine similarity between vector representations might be more suitable.</p>
</li>
<li>
<p><strong>Numerical Data:</strong> For comparing numerical data, other similarity metrics such as <strong>Euclidean distance</strong>, <strong>Manhattan distance</strong>, or <strong>Pearson correlation coefficient</strong> might be more meaningful.</p>
</li>
<li>
<p><strong>Customized Weights:</strong> If you have specific domain knowledge about feature importance, you might opt for a customized similarity metric that incorporates these weights effectively.</p>
</li>
<li>
<p><strong>Language-Specific Features:</strong> If the text includes language-specific features, phonetic differences, or linguistic nuances, other specialized metrics like <strong>Soundex</strong> or <strong>Levenshtein distance</strong> might be considered.</p>
</li>
</ol>
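<p>To make the "long texts" alternative concrete, here is a minimal sketch of cosine similarity over term-frequency vectors, using only the Python standard library. Tokenization by lowercased whitespace splitting is a simplifying assumption; real pipelines typically normalize punctuation and may weight terms with TF-IDF:</p>

```python
from collections import Counter
import math

def cosine_term_similarity(text1, text2):
    # Build term-frequency vectors from whitespace tokens (illustrative sketch)
    tf1 = Counter(text1.lower().split())
    tf2 = Counter(text2.lower().split())
    # Dot product over the shared vocabulary
    dot = sum(tf1[term] * tf2[term] for term in set(tf1) & set(tf2))
    # Product of the Euclidean norms of both vectors
    norm = math.sqrt(sum(v * v for v in tf1.values())) * math.sqrt(sum(v * v for v in tf2.values()))
    return dot / norm if norm else 0.0
```

<p>Unlike Jaro-Winkler, this score depends on shared vocabulary rather than character positions, so it scales naturally to documents of any length.</p>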
<h2>Examples</h2>
<p>Here are some concrete pairs of strings that demonstrate the properties of the Jaro-Winkler similarity metric (<span class="math">\(p=0.2\)</span> unless stated otherwise):</p>
<p><strong>Common Prefix Emphasis:</strong></p>
<ul>
<li>String 1: "Michael"</li>
<li>String 2: "Michelle"</li>
<li>Jaro-Winkler similarity: 0.963</li>
</ul>
<p>Explanation: The common prefix "Mich" contributes significantly to the similarity score in Jaro-Winkler, resulting in a high similarity even though the rest of the strings differ.</p>
<p><strong>Case Sensitivity and Scaling Factor:</strong></p>
<ul>
<li>String 1: "McDonald's"</li>
<li>String 2: "Mcdonells"</li>
<li>Jaro-Winkler similarity: 0.853</li>
</ul>
<p>Explanation: Because of the case difference ("D" vs. "d"), the shared prefix is only "Mc", which limits the prefix bonus; the metric is case-sensitive, so normalizing case beforehand usually raises the score. The scaling factor then controls how strongly this prefix affects the result.</p>
<p><strong>No Common Prefix:</strong></p>
<ul>
<li>String 1: "hello"</li>
<li>String 2: "world"</li>
<li>Jaro-Winkler similarity: 0.433</li>
</ul>
<p>Explanation: Without a common prefix, the Jaro-Winkler similarity is low, even if the strings share some characters.</p>
<p><strong>Short vs. Long Strings:</strong></p>
<ul>
<li>String 1: "AI"</li>
<li>String 2: "Artificial Intelligence"</li>
<li>Jaro-Winkler similarity: 0.623</li>
</ul>
<p>Explanation: Only the one-character common prefix "A" ties the two strings together, and the large length difference keeps the overall similarity moderate.</p>
<p><strong>Typographical Errors:</strong></p>
<ul>
<li>String 1: "telephone"</li>
<li>String 2: "telephne"</li>
<li>Jaro-Winkler similarity: 0.967</li>
</ul>
<p>Explanation: Despite the missing "o," the common prefix "teleph" contributes to a high Jaro-Winkler similarity score.</p>
<p><strong>Short and Noisy Data:</strong></p>
<ul>
<li>String 1: "abacus"</li>
<li>String 2: "abaxus"</li>
<li>Jaro-Winkler similarity: 0.956</li>
</ul>
<p>Explanation: The common prefix "aba" keeps the score high, while the single mismatched character ("c" vs. "x") is penalized only lightly.</p>
<p><strong>Significance of Scaling Factor:</strong></p>
<ul>
<li>String 1: "Thompson"</li>
<li>String 2: "Thomson"</li>
<li>Jaro-Winkler similarity with <span class="math">\(p=0.1\)</span>: 0.975</li>
<li>Jaro-Winkler similarity with <span class="math">\(p=0.2\)</span>: 0.992</li>
</ul>
<p>Explanation: The scaling factor <span class="math">\(p\)</span> affects the similarity score. A higher <span class="math">\(p\)</span> gives more emphasis to the common prefix, leading to a higher similarity.</p>
<p>These examples illustrate how the Jaro-Winkler similarity metric behaves based on different characteristics of input strings, such as common prefixes, case sensitivity, typos, length, and the scaling factor <span class="math">\(p\)</span>.</p>
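<p>The trends above can be sanity-checked with a compact, self-contained implementation. This sketch uses the standard two-pass matching procedure and caps the prefix bonus at four characters (a common convention that the simplified code earlier in this post omits), so some scores may differ slightly from the figures quoted in the examples:</p>

```python
def jaro(s1, s2):
    # Jaro similarity via the standard two-pass matching procedure
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    window = max(len1, len2) // 2 - 1
    match1, match2 = [False] * len1, [False] * len2
    m = 0
    # Pass 1: find matching characters within the allowed window
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(i + window + 1, len2)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Pass 2: count transpositions among the matched characters
    chars1 = [c for c, f in zip(s1, match1) if f]
    chars2 = [c for c, f in zip(s2, match2) if f]
    transpositions = sum(a != b for a, b in zip(chars1, chars2)) // 2
    return (m / len1 + m / len2 + (m - transpositions) / m) / 3

def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    # Boost the Jaro score by the common-prefix length, capped at 4 characters
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)
```

<p>With this sketch, jaro_winkler("Michael", "Michelle") scores well above jaro_winkler("hello", "world"), raising <span class="math">\(p\)</span> from 0.1 to 0.2 increases the "Thompson"/"Thomson" score, and the <span class="math">\(p=0.1\)</span> value for that pair reproduces the 0.975 quoted above.</p>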
<h2>Summary</h2>
<p>Jaro-Winkler similarity is highly valuable when dealing with short strings, names, and addresses, especially when common prefixes play a significant role. However, for longer texts, semantic similarity, numerical data, and specialized linguistic considerations, other metrics might be more appropriate. Always consider the specific characteristics of your data and the goals of your analysis when choosing a similarity metric.</p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Bearer Token Authentication for API2023-08-24T00:00:00+02:002023-08-24T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-08-24:/bearer-token-authentication-for-api/<h2>Bearer Token Authentication</h2>
<p>Bearer authentication is a method of API authentication that involves including a "bearer token" in the request header. This token is typically a long string of characters, often encoded in a specific format like JSON Web Token (JWT) or …</p><h2>Bearer Token Authentication</h2>
<p>Bearer authentication is a method of API authentication that involves including a "bearer token" in the request header. This token is typically a long string of characters, often encoded in a specific format like JSON Web Token (JWT) or OAuth token. Bearer authentication is commonly used to secure APIs by allowing only authorized users or applications to access protected resources.</p>
<p>Here's how the process generally works:</p>
<ol>
<li>
<p><strong>Authentication</strong>: The user or application requests access to a protected resource by sending a request to the API server.</p>
</li>
<li>
<p><strong>Token Generation</strong>: Upon successful authentication, the server generates a bearer token, which serves as proof of the user's or application's identity and permissions.</p>
</li>
<li>
<p><strong>Token Inclusion</strong>: The generated bearer token is then included in the "Authorization" header of subsequent requests to the API. The header typically looks like this:</p>
</li>
</ol>
<p><code>Authorization: Bearer &lt;token&gt;</code></p>
<p>Here, <code>&lt;token&gt;</code> represents the actual bearer token.</p>
<ol start="4">
<li>
<p><strong>Authorization</strong>: The API server receives the request and extracts the bearer token from the header. It then validates the token to determine if the user or application is authorized to access the requested resource.</p>
</li>
<li>
<p><strong>Access Control</strong>: If the bearer token is valid and the user or application has the necessary permissions, the API server grants access to the requested resource. If the token is invalid or expired, the server denies access.</p>
</li>
</ol>
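<p>As a minimal sketch, the header from step 3 can be attached using only Python's standard library. The URL and token below are placeholders, not a real endpoint or credential:</p>

```python
import urllib.request

def build_authenticated_request(url, token):
    # Attach the bearer token in the Authorization header of the request
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Bearer {token}")
    return request

# Placeholder endpoint and token; a real client obtains the token from the
# authentication step, then sends the request with urllib.request.urlopen(request)
request = build_authenticated_request("https://api.example.com/resource", "example-token")
```

<p>Sending this request over HTTPS transmits the header exactly as shown in step 3, with the token encrypted in transit.</p>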
<p>Bearer authentication is often preferred due to its simplicity and ease of implementation. It allows the server to validate the token without needing to store any session information, making it suitable for stateless architectures like RESTful APIs. However, securing bearer tokens is crucial since anyone in possession of a valid token can access the associated resources. This is why HTTPS and token encryption are recommended to protect the token during transmission.</p>
<p><strong>NOTE</strong>: Bearer tokens should be handled carefully. They can be exposed if not properly secured, and their use should be combined with other security measures, such as <strong>rate limiting</strong>, <strong>token expiration</strong>, and regular <strong>token rotation</strong>, to enhance the overall security of an API.</p>
<h2>Token Encryption</h2>
<p>Token encryption plays a crucial role in securing bearer tokens used for API authentication. Encrypting bearer tokens ensures that the token's content remains confidential and tamper-proof while it's being transmitted or stored. Here's an overview of how token encryption works:</p>
<ol>
<li>
<p><strong>Token Content</strong>: Bearer tokens often contain important information such as user identity, permissions, and expiration time. This information should be protected from unauthorized access.</p>
</li>
<li>
<p><strong>Choose Encryption Algorithm</strong>: A strong encryption algorithm is selected for securing the token. Common choices include AES (Advanced Encryption Standard) and RSA (Rivest-Shamir-Adleman).</p>
</li>
<li>
<p><strong>Generate Encryption Keys</strong>: Encryption requires keys: a public key for encryption and a private key for decryption (in the case of asymmetric encryption like RSA) or a shared key (in the case of symmetric encryption like AES). These keys must be kept secret.</p>
</li>
<li>
<p><strong>Encryption Process</strong>:</p>
<ul>
<li>
<p><strong>Asymmetric Encryption (e.g., RSA)</strong>: If using asymmetric encryption, the sender uses the recipient's public key to encrypt the token. Only the recipient possessing the corresponding private key can decrypt and access the original token.</p>
</li>
<li>
<p><strong>Symmetric Encryption (e.g., AES)</strong>: In symmetric encryption, both the sender and receiver share the same secret key. The sender uses this key to encrypt the token, and the recipient uses the same key to decrypt it.</p>
</li>
</ul>
</li>
<li>
<p><strong>Transmission</strong>: The encrypted token can now be safely transmitted over the network. Even if intercepted by malicious actors, the encrypted content should be meaningless without the decryption key.</p>
</li>
<li>
<p><strong>Decryption Process</strong>:</p>
<ul>
<li>
<p><strong>Asymmetric Encryption (e.g., RSA)</strong>: The recipient uses their private key to decrypt the token, revealing its original content.</p>
</li>
<li>
<p><strong>Symmetric Encryption (e.g., AES)</strong>: The recipient uses the shared secret key to decrypt the token and access its original content.</p>
</li>
</ul>
</li>
</ol>
<p>Encryption adds an additional layer of security to bearer tokens. Even if an attacker gains access to the encrypted token, they won't be able to decipher its contents without the appropriate decryption key.</p>
<p>It's important to note a few considerations:</p>
<ul>
<li>
<p><strong>Key Management</strong>: The security of encrypted tokens depends heavily on proper key management. Keys should be stored securely and rotated periodically.</p>
</li>
<li>
<p><strong>Algorithm and Key Length</strong>: The choice of encryption algorithm and key length impacts the security of the encrypted token. Strong algorithms with sufficient key lengths should be used.</p>
</li>
<li>
<p><strong>HTTPS</strong>: While encryption protects the token in transit, using HTTPS (TLS/SSL) for communication further ensures the confidentiality and integrity of the entire data exchange, including the token.</p>
</li>
<li>
<p><strong>Token Validation</strong>: Even when using encrypted tokens, the receiving server must still validate the decrypted token to ensure its authenticity, integrity, and authorization.</p>
</li>
</ul>
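<p>A closely related and widely used pattern is signing rather than encrypting: a JWT-style token protected by HMAC-SHA256 guarantees integrity and authenticity, though not confidentiality. The sketch below uses only the standard library to illustrate the validation step; a production system should use a maintained JWT library rather than this hand-rolled version:</p>

```python
import base64
import hashlib
import hmac
import json

def _b64url(data: bytes) -> str:
    # URL-safe base64 without padding, as used in JWTs
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_token(payload: dict, secret: bytes) -> str:
    # Encode header and payload, then sign both with HMAC-SHA256
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64url(json.dumps(payload).encode())
    signature = _b64url(hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest())
    return f"{header}.{body}.{signature}"

def verify_token(token: str, secret: bytes):
    # Recompute the signature and compare; return the payload, or None if invalid
    try:
        header, body, signature = token.split(".")
    except ValueError:
        return None
    expected = _b64url(hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest())
    if not hmac.compare_digest(signature, expected):
        return None
    padded = body + "=" * (-len(body) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))
```

<p>Note the use of <code>hmac.compare_digest</code> instead of <code>==</code>, which avoids leaking timing information when comparing signatures.</p>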
<p>Combining token encryption with other security practices, such as secure token storage and token expiration, provides a comprehensive approach to securing bearer tokens and API authentication.</p>Understanding Retrieval-Augmented Generation (RAG) empowering LLMs2023-08-24T00:00:00+02:002023-10-23T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-08-24:/understanding-retrieval-augmented-generation-rag-empowering-llms/<p>Understand an innovative artificial intelligence framework that empowers large language models (LLMs) by anchoring them to external knowledge sources with accurate, current information.</p><h2>TLDR</h2>
<p>Retrieval augmented generation refers to the method of enhancing a user's input to a large language model (LLM) such as ChatGPT by incorporating extra information obtained from an external source. This additional data can then be utilized by the LLM to enrich the response it produces.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#introduction-understanding-retrieval-augmented-generation-rag">Introduction: Understanding Retrieval-Augmented Generation (RAG)</a></li>
<li><a href="#the-need-for-rag-in-large-language-models">The Need for RAG in Large Language Models</a></li>
<li><a href="#the-open-book-approach-of-rag">The 'Open Book' Approach of RAG</a></li>
<li><a href="#personalized-and-verifiable-responses-with-rag">Personalized and Verifiable Responses with RAG</a></li>
<li><a href="#challenges-and-future-directions">Challenges and Future Directions</a></li>
<li><a href="#conclusion">Conclusion</a></li>
<li><a href="#references">References</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="introduction-understanding-retrieval-augmented-generation-rag"></a></p>
<h2>Introduction: Understanding Retrieval-Augmented Generation (RAG)</h2>
<p>Retrieval-Augmented Generation, commonly referred to as RAG, and sometimes called Grounded Generation (GG), represents an ingenious integration of a pretrained dense passage retriever (DPR) and <a href="https://en.wikipedia.org/wiki/Seq2seq">sequence-to-sequence</a> models.</p>
<blockquote>
<p>Transformer architecture (used in GPT models) is a member of sequence-to-sequence (Seq2Seq) architectures. Seq2Seq models are designed to handle tasks that involve transforming an input sequence into an output sequence, such as machine translation, text summarization, and dialogue generation.</p>
</blockquote>
<p>The process involves retrieving documents using DPR and subsequently transmitting them to a seq2seq model. Through a process of marginalization, these models then produce desired outputs. The retriever and seq2seq modules commence their operations as pretrained models, and through a joint fine-tuning process, they adapt collaboratively, thus enhancing both retrieval and generation for specific downstream tasks. <strong>This innovative artificial intelligence framework serves as a means to empower large language models (LLMs) by anchoring them to external knowledge sources.</strong> Consequently, this strategic approach ensures the availability of accurate, current information, thereby granting users valuable insights into the generative mechanisms of these models. For a comprehensive understanding of the RAG technique, we offer an in-depth exploration, commencing with a simplified overview and progressively delving into more intricate technical facets.</p>
<p><img alt="Data processing in RAG" src="https://learn.microsoft.com/en-us/azure/machine-learning/media/concept-retrieval-augmented-generation/retrieval-augmented-generation-walkthrough.png?view=azureml-api-2#lightbox"></p>
<p>Figure 1. Data processing, storage and referencing in RAG method. Source: <a href="https://learn.microsoft.com/en-us/azure/machine-learning/concept-retrieval-augmented-generation?view=azureml-api-2">Microsoft</a></p>
<p><a id="the-need-for-rag-in-large-language-models"></a></p>
<h2>The Need for RAG in Large Language Models</h2>
<p>Large language models, while powerful, can sometimes be inconsistent in their responses. They may provide accurate answers to certain questions but struggle with others, often regurgitating random facts from their training data. This inconsistency stems from the fact that LLMs understand the statistical relationships between words but not their actual meanings.</p>
<p>To address this issue, researchers have developed the RAG <strong>framework, which improves the quality of LLM-generated responses by grounding the model in external sources of knowledge.</strong> This approach not only ensures access to the most current and reliable facts but also allows users to verify the model's claims for accuracy and trustworthiness.</p>
<p><a id="the-open-book-approach-of-rag"></a></p>
<h2>The 'Open Book' Approach of RAG</h2>
<p>RAG operates in <strong>two main phases: retrieval and content generation</strong>. During the retrieval phase, algorithms search for and retrieve relevant snippets of information based on the user's prompt or question. These facts can come from various sources, such as indexed documents on the internet or a closed-domain enterprise setting for added security and reliability.</p>
<p>In the generative phase, the LLM uses the retrieved information and its internal representation of training data to synthesize a tailored answer for the user.</p>
<blockquote>
<p>This approach is akin to an "open book" exam, where the model can browse through content in a book rather than relying solely on its memory.</p>
</blockquote>
<p><img alt="RAG Operation" src="/images/retrieval_augmented_generation/RAG.png">
Figure 2. RAG operation. Information preparation and storage. Augmenting prompt with external information.</p>
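<p>The two phases can be sketched in a few lines. The keyword-overlap retriever below is a deliberately simple stand-in for a dense retriever such as DPR, and the prompt format is an illustrative assumption:</p>

```python
def retrieve(question, documents, top_k=2):
    # Toy keyword-overlap retriever standing in for a dense retriever (illustrative only)
    q_terms = set(question.lower().split())
    ranked = sorted(documents, key=lambda d: len(q_terms & set(d.lower().split())), reverse=True)
    return ranked[:top_k]

def build_augmented_prompt(question, documents, top_k=2):
    # Retrieval phase: fetch relevant snippets; generation phase would pass this prompt to the LLM
    context = "\n".join(retrieve(question, documents, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

<p>In a production pipeline the retriever would query a vector index and the augmented prompt would go to the LLM, but the shape of the flow is the same: retrieve, assemble context, generate.</p>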
<p><a id="personalized-and-verifiable-responses-with-rag"></a></p>
<h2>Personalized and Verifiable Responses with RAG</h2>
<p>RAG allows LLM-powered chatbots to provide more personalized answers without the need for human-written scripts. By reducing the need to continuously train the model on new data, RAG can lower the computational and financial costs of running LLM-powered chatbots in an enterprise setting.</p>
<p>Moreover, RAG enables LLMs to generate more specific, diverse, and factual language compared to traditional parametric-only seq2seq models. This feature is particularly useful for businesses that require up-to-date information and verifiable responses.</p>
<p><a id="challenges-and-future-directions"></a></p>
<h2>Challenges and Future Directions</h2>
<p>Despite its advantages, RAG is not without its challenges. For instance, <strong>LLMs may struggle to recognize when they don't know the answer</strong> to a question, leading to incorrect or misleading information. To address this issue, researchers are working on fine-tuning LLMs to recognize unanswerable questions and probe for more detail until they can provide a definitive answer.</p>
<p>Furthermore, there is ongoing research to improve both the retrieval and generation aspects of RAG. This includes <strong>finding and fetching the most relevant information possible and structuring that information</strong> to elicit the richest responses from the LLM.</p>
<p><a id="conclusion"></a></p>
<h2>Conclusion</h2>
<p>Retrieval-Augmented Generation offers a promising solution to the limitations of large language models by grounding them in external knowledge sources. By adopting RAG, businesses can achieve customized solutions, maintain data relevance, and optimize costs while harnessing the reasoning capabilities of LLMs. As research continues to advance in this area, we can expect even more powerful and efficient language models in the future.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>
<p><a id="references"></a></p>
<h2>References</h2>
<ul>
<li>Original paper <a href="https://arxiv.org/abs/2005.11401">Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks</a> by Patrick Lewis et al. (available as <a href="https://paperswithcode.com/method/rag">paper with code</a>)</li>
<li>Example notebooks on Amazon SageMaker:<ul>
<li><a href="https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/jumpstart-foundation-models/question_answering_retrieval_augmented_generation/question_answering_jumpstart_knn.html">Retrieval-Augmented Generation: Question Answering based on Custom Dataset</a></li>
<li><a href="https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_amazon_algorithms/jumpstart-foundation-models/question_answering_retrieval_augmented_generation/question_answering_langchain_jumpstart.html">Retrieval-Augmented Generation: Question Answering based on Custom Dataset with Open-sourced LangChain Library</a></li>
</ul>
</li>
<li>Python library with RAG implementation: <a href="https://github.com/llmware-ai/llmware">GitHub - llmware-ai/llmware: Providing enterprise-grade LLM-based development framework, tools and fine-tuned models.</a></li>
<li>Analytics: <a href="https://www.vectorview.ai/">Vectorview</a></li>
<li>Deep-dive into specific use-case of RAG with scaling in mind: <a href="https://www.anyscale.com/blog/a-comprehensive-guide-for-building-rag-based-llm-applications-part-1">Building RAG-based LLM Applications for Production (Part 1)</a></li>
<li>Good section on possible improvements to RAG: <a href="https://llmstack.ai/blog/retrieval-augmented-generation">Retrieval Augmented Generation (RAG): What, Why and How? | LLMStack</a></li>
<li>General intro to RAG: <a href="https://scriv.ai/guides/retrieval-augmented-generation-overview/">How do domain-specific chatbots work? An Overview of Retrieval Augmented Generation (RAG) | Scriv</a></li>
<li>Optimization, async, using summaries: <a href="https://madhukarkumar.medium.com/secrets-to-optimizing-rag-llm-apps-for-better-accuracy-performance-and-lower-cost-da1014127c0a">Secrets to Optimizing RAG LLM Apps for Better Performance, Accuracy and Lower Costs! | by Madhukar Kumar | madhukarkumar | Sep, 2023 | Medium</a></li>
<li>Check the GitHub for the RAG-related projects: <a href="https://github.com/topics/retrieval-augmented-generation?l=python">retrieval-augmented-generation · GitHub Topics</a></li>
<li><a href="https://www.reddit.com/r/LocalLLaMA/comments/16cbimi/yet_another_rag_system_implementation_details_and/">Yet another RAG system - implementation details and lessons learned : r/LocalLLaMA</a></li>
<li><a href="https://www.deeplearning.ai/short-courses/building-evaluating-advanced-rag/">Building and Evaluating Advanced RAG Applications - DeepLearning.AI</a> - recent course from deeplearning.ai (Andrew Ng). Instructors: Jerry Liu and Anupam Datta.<ul>
<li>In this course, we’ll explore:<ul>
<li>Two advanced retrieval methods: sentence-window retrieval and auto-merging retrieval, which perform better than the baseline RAG pipeline.</li>
<li>Evaluation and experiment tracking: A way to evaluate and iteratively improve your RAG pipeline’s performance.</li>
<li>The RAG triad: Context Relevance, Groundedness, and Answer Relevance, which are methods to evaluate the relevance and truthfulness of your LLM’s response.</li>
</ul>
</li>
</ul>
</li>
</ul>
<p>X::<a href="https://www.safjan.com/techniques-to-boost-rag-performance-in-production/">Techniques to Boost RAG Performance in Production</a></p>
<p><strong>Edits:</strong></p>
<ul>
<li>2023-10-23 - added link to LLMStack</li>
<li>2023-11-06 - added TLDR section</li>
<li>2023-11-06 - added ToC</li>
</ul>Create Self-Hosted Python Package Repository - General Guide2023-08-12T00:00:00+02:002023-08-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-08-12:/Create Self-Hosted Python Package Repository/<p>X::<a href="https://www.safjan.com/lesser-known-python-package-repository-managers/">Lesser-known Python Package Repository Managers</a>
X::<a href="https://www.safjan.com/storing-private-python-packages-with-local-nas-and-lightweight-servers/">Storing Private Python Packages with Local NAS and Lightweight Servers</a></p>
<p>Creating a self-hosted Python package repository allows you to host and manage your own Python packages, making them accessible to your team or the public …</p><p>X::<a href="https://www.safjan.com/lesser-known-python-package-repository-managers/">Lesser-known Python Package Repository Managers</a>
X::<a href="https://www.safjan.com/storing-private-python-packages-with-local-nas-and-lightweight-servers/">Storing Private Python Packages with Local NAS and Lightweight Servers</a></p>
<p>Creating a self-hosted Python package repository allows you to host and manage your own Python packages, making them accessible to your team or the public without relying on external services like PyPI. Here's a general guide on how to set up a self-hosted Python package repository.</p>
<h2>General Guide</h2>
<h3>Choose a Repository Manager</h3>
<p>You need a repository manager to host and manage your Python packages. Two popular options are:</p>
<ul>
<li><strong>Devpi</strong>: A powerful and customizable Python package repository server.</li>
<li><strong>Artifactory</strong>: A general-purpose repository manager that can host various types of packages, including Python.</li>
</ul>
<h3>Set Up a Server</h3>
<p>You will need a server to host your package repository. This could be a dedicated server, a cloud instance (AWS, GCP, Azure), or even a local machine if the repository is for internal use.</p>
<h3>Install and Configure the Repository Manager</h3>
<h4>Devpi</h4>
<ul>
<li>Install Devpi using pip: <code>pip install devpi-server devpi-web</code></li>
<li>Configure Devpi: Follow the instructions in the <a href="https://devpi.net/docs/devpi/devpi/stable/+doc/quickstart-server.html">official documentation</a>.</li>
</ul>
<h4>Artifactory</h4>
<p>Download and install Artifactory: Follow the instructions on the <a href="https://www.jfrog.com/confluence/display/JFROG/Installing+Artifactory">Artifactory website</a>.</p>
<h3>Create a Virtual Environment (optional but recommended)</h3>
<p>Set up a Python virtual environment on your server to keep your package repository isolated from the system Python.</p>
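<p>As a quick sketch (the directory name is an example), creating and activating such an environment looks like this:</p>

```shell
# create an isolated environment for the repository tooling
# (the path is an example; pick any location you like)
python3 -m venv "$HOME/pkg-repo-venv"

# activate it before installing devpi-server, twine, etc.
. "$HOME/pkg-repo-venv/bin/activate"
```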
<h3>Upload Packages</h3>
<p>Once your repository manager is set up, use tools like <code>twine</code> to upload your Python packages. Make sure to specify your self-hosted repository URL.</p>
<h3>Accessing Packages</h3>
<p>To use packages from your self-hosted repository, users can modify their <code>pip.conf</code> or <code>.pypirc</code> configuration file to include your repository's URL.</p>
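<p>As an illustration, a minimal user-level <code>pip.conf</code> pointing at a self-hosted Devpi index might look like the following (the URL, user, and index names are examples; Devpi serves its simple index under <code>+simple/</code>):</p>

```shell
# write a user-level pip.conf that prefers the self-hosted index
# (back up any existing file first; all names below are examples)
mkdir -p "$HOME/.config/pip"
cat > "$HOME/.config/pip/pip.conf" <<'EOF'
[global]
index-url = http://localhost:3141/myuser/myindex/+simple/
extra-index-url = https://pypi.org/simple/
EOF
```

<p>The <code>extra-index-url</code> keeps the public PyPI available as a fallback for packages you do not host yourself.</p>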
<h3>Security and Access Control</h3>
<p>Configure user authentication and access control to restrict who can upload and access packages in your repository.</p>
<h3>Maintenance and Backup</h3>
<ul>
<li>Regularly back up your package repository data to prevent data loss.</li>
<li>Keep your repository manager and server updated with security patches.</li>
</ul>
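<p>A backup can be as simple as archiving the repository manager's data directory. For Devpi, the server data lives under <code>~/.devpi</code> by default (a sketch; adjust the path if you started the server with <code>--serverdir</code>):</p>

```shell
# archive Devpi's data directory into a dated tarball
SRC="$HOME/.devpi"
mkdir -p "$SRC"   # for illustration only; on a real server this already exists
tar czf "devpi-backup-$(date +%F).tar.gz" -C "$HOME" .devpi
```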
<h3>Documentation</h3>
<p>Provide clear documentation to your team on how to access, upload, and manage packages in your self-hosted repository.</p>
<h2>Artifactory vs. Devpi - pros & cons and setup instructions</h2>
<p>Let's explore two popular options for creating a self-hosted Python package repository: Devpi (free and open source) and Artifactory (which offers a free Community Edition), along with their pros, cons, use cases, and a setup tutorial for each.</p>
<h3>Option 1: Devpi</h3>
<p><strong>Pros:</strong></p>
<ul>
<li>Designed specifically for Python package management.</li>
<li>Provides features like caching, replication, and access control.</li>
<li>Supports easy package versioning and management.</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>Limited support for non-Python packages.</li>
<li>Web interface might not be as polished as other solutions.</li>
</ul>
<p><strong>Use Cases:</strong></p>
<ul>
<li>Small to medium-sized teams working exclusively with Python.</li>
<li>Projects where ease of setup and simple usage is preferred.</li>
</ul>
<p><strong>Tutorial:</strong></p>
<ol>
<li><strong>Install Devpi</strong>:</li>
</ol>
<div class="highlight"><pre><span></span><code>pip install devpi-server devpi-web
</code></pre></div>
<ol>
<li><strong>Create and Configure Devpi Server</strong>:</li>
<li>Initialize a new Devpi server:</li>
</ol>
<div class="highlight"><pre><span></span><code>devpi-server --init
</code></pre></div>
<ul>
<li>Start the Devpi server:</li>
</ul>
<div class="highlight"><pre><span></span><code>devpi-server
</code></pre></div>
<ol>
<li><strong>Create Users and Indexes</strong>:</li>
<li>Create a user:</li>
</ol>
<div class="highlight"><pre><span></span><code>devpi use http://localhost:3141
devpi user -c <username>
</code></pre></div>
<ul>
<li>Create an index:</li>
</ul>
<div class="highlight"><pre><span></span><code>devpi index -c <indexname>
</code></pre></div>
<ol>
<li><strong>Upload and Use Packages</strong>:</li>
<li>Upload a package:</li>
</ol>
<div class="highlight"><pre><span></span><code><span class="n">devpi</span><span class="w"> </span><span class="n">upload</span>
</code></pre></div>
<ul>
<li>Install a package from your Devpi index:</li>
</ul>
<div class="highlight"><pre><span></span><code><span class="nx">pip</span><span class="w"> </span><span class="nx">install</span><span class="w"> </span><span class="o">-</span><span class="nx">i</span><span class="w"> </span><span class="nx">http</span><span class="p">:</span><span class="c1">//localhost:3141/<username>/<indexname>/simple/ <package></span>
</code></pre></div>
<h3>Option 2: Artifactory</h3>
<p><strong>Pros:</strong></p>
<ul>
<li>Versatile repository manager supporting multiple package types.</li>
<li>Robust access control and security features.</li>
<li>Highly configurable and scalable.</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>More complex setup compared to Devpi.</li>
<li>Heavier resource requirements.</li>
</ul>
<p><strong>Use Cases:</strong></p>
<ul>
<li>Large organizations with diverse technology stacks.</li>
<li>Projects needing advanced access control and security features.</li>
</ul>
<p><strong>Tutorial:</strong></p>
<ol>
<li><strong>Install Artifactory</strong>:</li>
<li>
<p>Follow the installation guide for <a href="https://www.jfrog.com/confluence/display/JFROG/Installing+Artifactory">Artifactory Community Edition</a>.</p>
</li>
<li>
<p><strong>Configure Artifactory</strong>:</p>
</li>
<li>
<p>Access Artifactory's web interface and set up your repository.</p>
</li>
<li>
<p><strong>Create a Virtual Repository</strong>:</p>
</li>
<li>
<p>Create a new virtual repository and include a "PyPI" remote repository as a source.</p>
</li>
<li>
<p><strong>Upload Packages</strong>:</p>
</li>
<li>Use <code>twine</code> to upload your Python packages to your virtual repository:</li>
</ol>
<div class="highlight"><pre><span></span><code><span class="n">twine</span><span class="w"> </span><span class="n">upload</span><span class="w"> </span><span class="o">--</span><span class="n">repository</span><span class="o">-</span><span class="n">url</span><span class="w"> </span><span class="o"><</span><span class="n">Artifactory_URL</span><span class="o">>/<</span><span class="n">repository_name</span><span class="o">></span><span class="w"> </span><span class="n">dist</span><span class="o">/*</span>
</code></pre></div>
<ol>
<li><strong>Access and Use Packages</strong>:</li>
<li>Configure your pip to use your Artifactory repository as an index:</li>
</ol>
<div class="highlight"><pre><span></span><code><span class="nx">pip</span><span class="w"> </span><span class="nx">config</span><span class="w"> </span><span class="nx">set</span><span class="w"> </span><span class="nx">global</span><span class="p">.</span><span class="nx">index</span><span class="o">-</span><span class="nx">url</span><span class="w"> </span><span class="p"><</span><span class="nx">Artifactory_URL</span><span class="p">></span><span class="o">/</span><span class="p"><</span><span class="nx">repository_name</span><span class="p">></span><span class="o">/</span><span class="nx">simple</span><span class="o">/</span>
</code></pre></div>
<ul>
<li>Install packages as usual using pip.</li>
</ul>
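<p>As a sketch of the client-side setup (the URL and repository name are examples), the same <code>pip config</code> mechanism can keep the public PyPI as a fallback for packages not hosted in Artifactory:</p>

```shell
# prefer the private Artifactory index, but fall back to public PyPI
# (both URLs are examples; substitute your own Artifactory address)
pip config set global.index-url "https://artifactory.example.com/artifactory/api/pypi/pypi-local/simple"
pip config set global.extra-index-url "https://pypi.org/simple"
```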
<h2>Closing thoughts</h2>
<ul>
<li>Setting up a self-hosted Python package repository requires careful consideration of your team's needs and technical expertise. Choose the option that best aligns with your requirements and resources.</li>
<li>Remember, setting up and maintaining a self-hosted package repository requires technical expertise and ongoing maintenance. If you're not experienced with server management and administration, consider starting with a simpler approach or seeking help from someone with relevant experience.</li>
</ul>Cookiecutter alternatives2023-08-12T00:00:00+02:002023-08-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-08-12:/cookiecutter-alternatives/<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#alternative-tools-to-cookiecutter-for-scaffolding-projects">Alternative Tools to Cookiecutter for Scaffolding Projects</a><ul>
<li><a href="#1-yeoman">1. <strong>Yeoman</strong></a></li>
<li><a href="#2-hygen">2. <strong>Hygen</strong></a></li>
<li><a href="#3-plop">3. <strong>Plop</strong></a></li>
<li><a href="#4-hyde">4. <strong>Hyde</strong></a></li>
<li><a href="#5-slush">5. <strong>Slush</strong></a></li>
<li><a href="#6-blueprint">6. <strong>Blueprint</strong></a></li>
<li><a href="#7-sao">7. <strong>Sao</strong></a></li>
<li><a href="#8-plopdown">8. <strong>Plopdown</strong></a></li>
<li><a href="#9-jolt">9. <strong>Jolt</strong></a></li>
<li><a href="#10-boilr">10. <strong>Boilr</strong></a></li>
</ul>
</li>
<li><a href="#recommendations-for-various-use-cases">Recommendations for various use-cases</a><ul>
<li><a href="#use-case-1-rapid-prototyping-and-small-projects---plop">Use Case 1: Rapid Prototyping and Small Projects - Plop</a></li>
<li><a href="#use-case-2-large-scale-projects-with-opinionated-conventions---yeoman">Use Case 2: Large-Scale Projects with Opinionated Conventions - Yeoman</a></li>
<li><a href="#use-case-3-advanced-file-processing-and-task-automation---slush">Use Case 3: Advanced File Processing and Task Automation - Slush</a></li>
</ul>
</li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="alternative-tools-to-cookiecutter-for-scaffolding-projects"></a></p>
<h2>Alternative Tools to Cookiecutter for Scaffolding Projects</h2>
<p>Scaffolding tools are essential for accelerating the process of project setup and code generation by providing predefined templates and structures. One of the popular tools for this purpose is Cookiecutter, which allows developers to create projects from project templates. However, the software development ecosystem is diverse, and there are several alternative tools to Cookiecutter, each with unique features and characteristics that differentiate them from one another.</p>
<p>In this article, we will explore ten alternative tools to Cookiecutter and highlight their standout features and best-suited use cases.</p>
<p><a id="1-yeoman"></a></p>
<h3>1. <strong>Yeoman</strong></h3>
<p>Yeoman is a robust scaffolding tool with a vast collection of generators for creating projects across various languages and frameworks, and it benefits from strong community support and an extensive library of community-contributed generators.</p>
<p>Unlike Cookiecutter, Yeoman focuses on a more opinionated approach, meaning it enforces best practices and conventions for specific frameworks, which can speed up development. Additionally, it supports interactive user prompts, making project setup more user-friendly.</p>
<p>Yeoman's wide range of generators and its integration with popular build tools like Grunt and Gulp make it suitable for large-scale projects and complex workflows.</p>
<p><strong>Best Use Case:</strong> Yeoman is best suited for developers who want a structured and opinionated approach to project generation, especially in scenarios where adherence to specific conventions is crucial.</p>
<p><a id="2-hygen"></a></p>
<h3>2. <strong>Hygen</strong></h3>
<p>Hygen is a fast and flexible code generator that allows developers to create custom templates for their projects. It offers a template language with conditional logic and supports both built-in and custom helpers.</p>
<p>Hygen focuses on simplicity and allows developers to create templates using their preferred language, making it highly customizable. Unlike Cookiecutter, which relies on Jinja2 templates, Hygen's customizability extends to both the template language and directory structure.</p>
<p>The ability to generate code snippets and templates quickly and effortlessly makes Hygen ideal for scenarios where rapid prototyping and iterative development are essential.</p>
<p><strong>Best Use Case:</strong> Hygen is best suited for developers who need a lightweight, customizable, and language-agnostic solution for scaffolding projects.</p>
<p><a id="3-plop"></a></p>
<h3>3. <strong>Plop</strong></h3>
<p>Plop is a simple yet powerful micro-generator tool that focuses on creating small and reusable templates. It allows developers to define custom generators with ease, making it a popular choice for smaller projects.</p>
<p>Plop stands out from Cookiecutter due to its minimalistic approach and single-purpose philosophy. Instead of managing complex project structures, Plop concentrates on code generation for specific components or modules.</p>
<p>Plop's ability to create small, self-contained generators with custom logic and prompts is ideal for developers who require lightweight scaffolding tools for repetitive tasks.</p>
<p><strong>Best Use Case:</strong> Plop is best suited for developers who work on component-based architectures and require a quick and straightforward way to generate components, modules, or boilerplate code.</p>
<p><a id="4-hyde"></a></p>
<h3>4. <strong>Hyde</strong></h3>
<p>Hyde is a lightweight scaffolding tool that allows developers to create projects using a simple YAML configuration file. It offers a minimalist approach to project generation, making it easy to use and understand.</p>
<p>Unlike Cookiecutter, which relies on templates and prompts, Hyde uses a declarative configuration file to define project structures. This simplicity enables developers to get started quickly without the need for a dedicated template engine.</p>
<p>Hyde's unique feature lies in its simplicity, making it an excellent choice for small to medium-sized projects and developers who prefer a configuration-driven approach.</p>
<p><strong>Best Use Case:</strong> Hyde is best suited for developers who want a straightforward and lightweight solution for setting up projects without the complexity of template languages.</p>
<p><a id="5-slush"></a></p>
<h3>5. <strong>Slush</strong></h3>
<p>Slush is a streaming scaffolding tool built on top of Gulp.js, providing a pipeline-based approach to project generation. It allows developers to compose complex generators using Gulp plugins, offering powerful customization capabilities.</p>
<p>Unlike Cookiecutter, which operates on static templates, Slush leverages Gulp's streaming capabilities to process files, enabling developers to manipulate and modify the project structure during generation.</p>
<p>Slush's streaming nature and its compatibility with Gulp plugins make it stand out for projects that require advanced file processing and task automation during scaffolding.</p>
<p><strong>Best Use Case:</strong> Slush is best suited for developers who are already familiar with Gulp and need to integrate project generation with complex build processes.</p>
<p><a id="6-blueprint"></a></p>
<h3>6. <strong>Blueprint</strong></h3>
<p>Blueprint is a modern scaffolding tool designed for simplicity and flexibility. It allows developers to create project templates using Handlebars templates, YAML configuration, or JavaScript code, providing multiple options for customizing templates.</p>
<p>Unlike Cookiecutter, which mainly relies on Jinja2 templates, Blueprint gives developers the freedom to choose their preferred template language. It also offers a straightforward CLI interface for generating projects.</p>
<p>Blueprint's versatility and support for various template creation methods make it suitable for developers who have existing Handlebars or JavaScript templates they wish to reuse for project generation.</p>
<p><strong>Best Use Case:</strong> Blueprint is best suited for developers who want a lightweight and flexible scaffolding tool with support for multiple template languages.</p>
<p><a id="7-sao"></a></p>
<h3>7. <strong>Sao</strong></h3>
<p>Sao is a pluggable and customizable scaffolding tool that provides a simple JSON-based template definition. It enables developers to create their own template plugins and extend existing ones seamlessly.</p>
<p>Sao's plugin system and JSON-based templates offer a high level of customization, allowing developers to tailor project generation to their specific requirements without being tied to a specific template language.</p>
<p>The ability to create custom plugins and extend existing templates makes Sao a powerful choice for developers who value modularity and plugin support.</p>
<p><strong>Best Use Case:</strong> Sao is best suited for developers who need a versatile and extensible scaffolding tool with the option to create and share their custom plugins.</p>
<p><a id="8-plopdown"></a></p>
<h3>8. <strong>Plopdown</strong></h3>
<p>Plopdown is a powerful scaffolding tool that generates templates using a JSON configuration file. It offers advanced features like dynamic prompts, glob pattern matching, and custom logic for template generation.</p>
<p>Plopdown's support for dynamic prompts and glob patterns sets it apart from Cookiecutter. It allows developers to generate project files based on complex conditions and patterns, making it suitable for projects with dynamic requirements.</p>
<p>Plopdown's ability to handle dynamic inputs and patterns makes it ideal for projects that require a high degree of customization and flexibility during the scaffolding process.</p>
<p><strong>Best Use Case:</strong> Plopdown is best suited for developers who need a flexible and powerful scaffolding tool capable of handling dynamic inputs and complex project structures.</p>
<p><a id="9-jolt"></a></p>
<h3>9. <strong>Jolt</strong></h3>
<p>Jolt is a lightweight and straightforward scaffolding tool that allows developers to create templates using a concise YAML syntax. It emphasizes minimal configuration and aims to reduce boilerplate code.</p>
<p>Unlike Cookiecutter, which may require extensive configuration, Jolt's YAML syntax simplifies the template creation process, making it a fast and efficient choice for smaller projects.</p>
<p>Jolt's simplicity and focus on reducing boilerplate code make it stand out for quick prototyping and smaller projects with straightforward requirements.</p>
<p><strong>Best Use Case:</strong> Jolt is best suited for developers who prefer a lightweight and minimalistic scaffolding tool for rapid project setup.</p>
<p><a id="10-boilr"></a></p>
<h3>10. <strong>Boilr</strong></h3>
<p>Boilr is a command-line scaffolding tool that utilizes a template registry, allowing developers to share and discover templates easily. It provides a curated list of templates for various languages and frameworks.</p>
<p>Unlike Cookiecutter, Boilr's template registry simplifies the process of finding and using project templates, making it an excellent choice for developers who want a seamless experience with pre-built templates.</p>
<p>Boilr's extensive template registry and its command-line interface make it stand out for its accessibility and ease of use.</p>
<p><strong>Best Use Case:</strong> Boilr is best suited for developers who prefer a command-line tool with access to a wide variety of pre-built templates for different project types.</p>
<p><a id="recommendations-for-various-use-cases"></a></p>
<h2>Recommendations for various use-cases</h2>
<p>Here are three distinct use cases with specific requirements, along with a recommended tool for each:</p>
<p><a id="use-case-1-rapid-prototyping-and-small-projects---plop"></a></p>
<h3>Use Case 1: Rapid Prototyping and Small Projects - Plop</h3>
<p><strong>Requirements:</strong></p>
<ul>
<li>Lightweight and easy-to-use tool.</li>
<li>Minimal configuration and setup.</li>
<li>Ability to quickly generate boilerplate code and components.</li>
</ul>
<p><strong>Recommended Tool: Plop</strong></p>
<p><strong>Reasoning:</strong> Plop is an excellent choice for rapid prototyping and small projects due to its simplicity and focus on generating small, reusable templates. Its straightforward plopfile-based configuration (generators are defined in a small JavaScript file) allows developers to get started quickly without the overhead of extensive setup. Plop's ability to create self-contained generators with custom logic and prompts makes it perfect for generating boilerplate code and components in a fast and efficient manner.</p>
<p><a id="use-case-2-large-scale-projects-with-opinionated-conventions---yeoman"></a></p>
<h3>Use Case 2: Large-Scale Projects with Opinionated Conventions - Yeoman</h3>
<p><strong>Requirements:</strong></p>
<ul>
<li>Strong community support and a wide range of templates.</li>
<li>Ability to enforce best practices and conventions for specific frameworks.</li>
<li>Interactive user prompts for customizable project setups.</li>
</ul>
<p><strong>Recommended Tool: Yeoman</strong></p>
<p><strong>Reasoning:</strong> Yeoman is a powerful scaffolding tool with an extensive library of community-contributed generators, making it suitable for large-scale projects. It enforces opinionated conventions, which is beneficial for maintaining consistency and best practices across the codebase. Yeoman's interactive user prompts make project setup user-friendly, allowing developers to customize the generated code according to their specific requirements.</p>
<p><a id="use-case-3-advanced-file-processing-and-task-automation---slush"></a></p>
<h3>Use Case 3: Advanced File Processing and Task Automation - Slush</h3>
<p><strong>Requirements:</strong></p>
<ul>
<li>Integration with build tools for advanced file processing.</li>
<li>Flexibility to manipulate and modify project structure during generation.</li>
<li>Support for custom plugins and extensibility.</li>
</ul>
<p><strong>Recommended Tool: Slush</strong></p>
<p><strong>Reasoning:</strong> Slush is an ideal choice for projects that require advanced file processing and task automation during scaffolding. Built on top of Gulp.js, Slush leverages Gulp's streaming capabilities to process files, allowing developers to manipulate and modify the project structure during generation. Its pipeline-based approach and compatibility with Gulp plugins provide high flexibility and customization possibilities. Developers who are already familiar with Gulp will find Slush seamless to integrate into their existing build processes.</p>
<p>These recommended tools cater to different use cases, ensuring that developers can find the most suitable scaffolding tool based on their project requirements and preferences.</p>
<p><a id="conclusion"></a></p>
<h2>Conclusion</h2>
<p>While Cookiecutter is a popular choice for scaffolding projects, developers have several alternative tools to consider, each with its own unique features and characteristics. Depending on the project requirements, preferences, and familiarity with specific tools, developers can choose the one that best fits their needs. Whether it's Yeoman's opinionated approach, Plop's focus on micro-generators, or Sao's pluggable architecture, there is a suitable alternative for every scenario. Experimenting with these tools can significantly enhance the development workflow and productivity.</p>Lesser-known Python Package Repository Managers2023-08-12T00:00:00+02:002023-08-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-08-12:/lesser-known-python-package-repository-managers/<p>X::<a href="https://www.safjan.com/Create Self-Hosted Python Package Repository/">Create Self-Hosted Python Package Repository - General Guide</a>
X::<a href="https://www.safjan.com/storing-private-python-packages-with-local-nas-and-lightweight-servers/">Storing Private Python Packages with Local NAS and Lightweight Servers</a></p>
<p>The <a href="https://jfrog.com/artifactory/">Artifactory</a> (paid) and <a href="https://devpi.net/docs/devpi/devpi/stable/%2Bd/index.html">Devpi</a> (free, open source) are most widely used python package repository managers, but there are some other interesting projects …</p><p>X::<a href="https://www.safjan.com/Create Self-Hosted Python Package Repository/">Create Self-Hosted Python Package Repository - General Guide</a>
X::<a href="https://www.safjan.com/storing-private-python-packages-with-local-nas-and-lightweight-servers/">Storing Private Python Packages with Local NAS and Lightweight Servers</a></p>
<p>The <a href="https://jfrog.com/artifactory/">Artifactory</a> (paid) and <a href="https://devpi.net/docs/devpi/devpi/stable/%2Bd/index.html">Devpi</a> (free, open source) are most widely used python package repository managers, but there are some other interesting projects. Here are a few lesser-known Python package repository managers along with links to their source code or home websites.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#warehouse">Warehouse</a></li>
<li><a href="#pypiserver">pypiserver</a></li>
<li><a href="#bandersnatch">Bandersnatch</a></li>
<li><a href="#eggbasket">EggBasket</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="warehouse"></a></p>
<h2>Warehouse</h2>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/pypa/warehouse.svg?logo=github"></p>
<p>Warehouse is the codebase that powers the official Python Package Index (PyPI). While not exactly lesser-known, it is worth mentioning as the reference implementation of a full-featured package index.</p>
<ul>
<li>Source Code: <a href="https://github.com/pypa/warehouse">https://github.com/pypa/warehouse</a></li>
</ul>
<p><a id="pypiserver"></a></p>
<h2>pypiserver</h2>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/pypiserver/pypiserver.svg?logo=github"></p>
<p>pypiserver is a minimal Python package server that's easy to set up and use for hosting private packages.</p>
<ul>
<li>Source Code: <a href="https://github.com/pypiserver/pypiserver">https://github.com/pypiserver/pypiserver</a></li>
</ul>
<p><a id="bandersnatch"></a></p>
<h2>Bandersnatch</h2>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/pypa/bandersnatch.svg?logo=github"></p>
<p>A PyPI mirror client that can be used to create a complete copy of the Python Package Index (PyPI) locally or in a private network.</p>
<ul>
<li>Home: <a href="https://pypi.org/project/bandersnatch/">https://pypi.org/project/bandersnatch/</a></li>
<li>Source Code: <a href="https://github.com/pypa/bandersnatch">https://github.com/pypa/bandersnatch</a></li>
</ul>
<p><a id="eggbasket"></a></p>
<h2>EggBasket</h2>
<p>EggBasket is a lightweight, easily-configurable Python package server designed for simplicity and ease of use.</p>
<ul>
<li>Home: <a href="https://pypi.org/project/eggbasket/">https://pypi.org/project/eggbasket/</a></li>
</ul>
<p>Please note that the popularity and maintenance status of these repositories may vary, so it's a good idea to review the documentation and GitHub repositories to ensure they meet your requirements before setting up a self-hosted package repository.</p>Split glued or joined words2023-08-12T00:00:00+02:002023-08-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-08-12:/split-glued-or-joined-words/<h2>wordninja package</h2>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/keredson/wordninja.svg?logo=github"></p>
<p>install <a href="https://github.com/keredson/wordninja">wordninja</a> package: <code>pip install wordnija</code></p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">wordninja</span>
<span class="o">>>></span> <span class="n">wordninja</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">'bettergood'</span><span class="p">)</span>
<span class="p">[</span><span class="s1">'better'</span><span class="p">,</span> <span class="s1">'good'</span><span class="p">]</span>
</code></pre></div>
<h2>wordsegment package</h2>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/grantjenks/python-wordsegment.svg?logo=github"></p>
<p>install the <a href="https://github.com/grantjenks/python-wordsegment">wordsegment</a> package: <code>pip install wordsegment</code>.</p>
<p>use it programmatically:</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">wordsegment</span> <span class="kn">import</span> <span class="n">load</span><span class="p">,</span> <span class="n">segment</span>
<span class="o">>>></span> <span class="n">load</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">segment</span><span class="p">(</span><span class="s1">'thisisatest'</span><span class="p">)</span>
<span class="p">[</span><span class="s1">'this'</span><span class="p">,</span> <span class="s1">'is'</span><span class="p">,</span> <span class="s1">'a'</span><span class="p">,</span> <span class="s1">'test'</span><span class="p">]</span>
</code></pre></div>
<p>or from CLI</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span><span class="nb">echo …</span></code></pre></div><h2>wordninja package</h2>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/keredson/wordninja.svg?logo=github"></p>
<p>install <a href="https://github.com/keredson/wordninja">wordninja</a> package: <code>pip install wordnija</code></p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">wordninja</span>
<span class="o">>>></span> <span class="n">wordninja</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">'bettergood'</span><span class="p">)</span>
<span class="p">[</span><span class="s1">'better'</span><span class="p">,</span> <span class="s1">'good'</span><span class="p">]</span>
</code></pre></div>
<h2>wordsegment package</h2>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/grantjenks/python-wordsegment.svg?logo=github"></p>
<p>install the <a href="https://github.com/grantjenks/python-wordsegment">wordsegment</a> package: <code>pip install wordsegment</code>.</p>
<p>use it programmatically:</p>
<div class="highlight"><pre><span></span><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">wordsegment</span> <span class="kn">import</span> <span class="n">load</span><span class="p">,</span> <span class="n">segment</span>
<span class="o">>>></span> <span class="n">load</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">segment</span><span class="p">(</span><span class="s1">'thisisatest'</span><span class="p">)</span>
<span class="p">[</span><span class="s1">'this'</span><span class="p">,</span> <span class="s1">'is'</span><span class="p">,</span> <span class="s1">'a'</span><span class="p">,</span> <span class="s1">'test'</span><span class="p">]</span>
</code></pre></div>
<p>or from CLI</p>
<div class="highlight"><pre><span></span><code>$<span class="w"> </span><span class="nb">echo</span><span class="w"> </span>thisisatest<span class="w"> </span><span class="p">|</span><span class="w"> </span>python<span class="w"> </span>-m<span class="w"> </span>wordsegment
this<span class="w"> </span>is<span class="w"> </span>a<span class="w"> </span><span class="nb">test</span>
</code></pre></div>
<p>Solutions from: <a href="https://stackoverflow.com/a/58010290">string - How can I split multiple joined words? - Stack Overflow</a></p>Storing Private Python Packages with Local NAS and Lightweight Servers2023-08-12T00:00:00+02:002023-08-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-08-12:/storing-private-python-packages-with-local-nas-and-lightweight-servers/<p>X::<a href="https://www.safjan.com/Create Self-Hosted Python Package Repository/">Create Self-Hosted Python Package Repository - General Guide</a>
X::<a href="https://www.safjan.com/lesser-known-python-package-repository-managers/">Lesser-known Python Package Repository Managers</a></p>
<p>There are simple ways to store private Python packages on a local NAS (Network Attached Storage) without setting up a full-fledged package repository manager like Devpi or Artifactory …</p><p>X::<a href="https://www.safjan.com/Create Self-Hosted Python Package Repository/">Create Self-Hosted Python Package Repository - General Guide</a>
X::<a href="https://www.safjan.com/lesser-known-python-package-repository-managers/">Lesser-known Python Package Repository Managers</a></p>
<p>There are simple ways to store private Python packages on a local NAS (Network Attached Storage) without setting up a full-fledged package repository manager like Devpi or Artifactory. Here are a couple of straightforward alternatives:</p>
<h3>Option 1: Local File System Repository</h3>
<p>This approach involves creating a directory on your NAS to store your Python packages. You can use the <code>pip</code> command's <code>--find-links</code> option to specify the location of your custom package directory.</p>
<p><strong>Pros:</strong></p>
<ul>
<li>Very simple setup and usage.</li>
<li>Well-suited for small teams or personal projects.</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>Lack of advanced features like access control, versioning, and replication.</li>
</ul>
<p><strong>Tutorial:</strong></p>
<ol>
<li>
<p><strong>Create a Packages Directory on NAS</strong>: Create a directory on your NAS where you will store your Python packages.</p>
</li>
<li>
<p><strong>Upload Packages to NAS</strong>: Copy or move your Python packages into the NAS directory.</p>
</li>
<li>
<p><strong>Install Packages from NAS</strong>: Install packages from your NAS using the <code>pip</code> command with the <code>--find-links</code> option:</p>
<p><code>pip install --find-links=file:///path/to/nas/packages/ &lt;package&gt;</code></p>
</li>
</ol>
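<p>For illustration only, the install step above can also be scripted from Python; the NAS path below is a placeholder, and <code>--no-index</code> is an optional flag that prevents pip from falling back to public PyPI:</p>

```python
import subprocess
import sys

def nas_install_command(package, nas_dir="/path/to/nas/packages"):
    """Build a pip command that installs from a plain directory on a NAS."""
    return [
        sys.executable, "-m", "pip", "install",
        "--no-index",                      # optional: never fall back to PyPI
        f"--find-links=file://{nas_dir}/",
        package,
    ]

# Run it once the NAS path points at a real directory of wheels/sdists:
# subprocess.check_call(nas_install_command("mypackage"))
```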
<h3>Option 2: Local PyPI Server</h3>
<p>You can set up a lightweight local PyPI server like <code>pypiserver</code> to serve your private Python packages.</p>
<p><strong>Pros:</strong></p>
<ul>
<li>Simple setup with basic package management features.</li>
<li>Suitable for small teams and projects.</li>
</ul>
<p><strong>Cons:</strong></p>
<ul>
<li>May lack advanced features like access control and versioning compared to full repository managers.</li>
</ul>
<p><strong>Tutorial:</strong></p>
<ol>
<li>
<p><strong>Install <code>pypiserver</code></strong>: Install <code>pypiserver</code> using pip:</p>
<p><code>pip install pypiserver</code></p>
</li>
<li>
<p><strong>Create a Packages Directory</strong>: Create a directory to store your Python packages.</p>
</li>
<li>
<p><strong>Start <code>pypiserver</code></strong>: Start <code>pypiserver</code> with the command:</p>
<p><code>pypi-server -p 8080 /path/to/packages/</code></p>
</li>
<li>
<p><strong>Upload and Install Packages</strong>: Copy your Python packages to the packages directory, then install them using the <code>pip</code> command with the local PyPI server URL:</p>
<p><code>pip install --index-url=http://localhost:8080/simple/ &lt;package&gt;</code></p>
</li>
</ol>
<p>These simpler approaches provide a way to store private Python packages on a local NAS without the overhead of setting up a comprehensive repository manager. Choose the option that best fits your needs and resources. Keep in mind that while these methods are simpler, they lack some advanced features and may not be as scalable or secure as full repository managers.</p>Prompt Discovery in the Context of Large Language Models (LLMs) and Prompt Engineering2023-08-08T00:00:00+02:002023-08-08T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-08-08:/prompt_discivery-large-language-models-llms-prompt-engineering/<p>Prompt discovery in the context of large language models refers to the systematic process of identifying and optimizing prompts to elicit desired responses from the model. It involves formulating prompts in a way that effectively guides the model's generation towards accurate, relevant …</p><p>Prompt discovery in the context of large language models refers to the systematic process of identifying and optimizing prompts to elicit desired responses from the model. It involves formulating prompts in a way that effectively guides the model's generation towards accurate, relevant, and high-quality outputs. Prompt engineering is a critical component of this process, as it encompasses the design and refinement of prompts to achieve specific tasks or goals.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#technical-aspects-of-prompt-discovery">Technical Aspects of Prompt Discovery</a></li>
<li><a href="#activities-and-challenges-in-prompt-discovery">Activities and Challenges in Prompt Discovery</a></li>
<li><a href="#types-of-tools-and-technologies-for-prompt-discovery">Types of Tools and Technologies for Prompt Discovery</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="technical-aspects-of-prompt-discovery"></a></p>
<h2>Technical Aspects of Prompt Discovery</h2>
<ol>
<li>
<p><strong>Prompt Formulation and Structure</strong>: This involves crafting prompts using appropriate syntax, keywords, and context to provide clear instructions to the model. Experimentation with different sentence structures, question formats, and contextual cues can impact the model's understanding and response.</p>
</li>
<li>
<p><strong>Semantic Representation</strong>: Developing prompts that capture the desired semantic meaning and intent is crucial. This may involve exploring semantic role labeling, syntactic analysis, and dependency parsing to create prompts that effectively guide the model's reasoning.</p>
</li>
<li>
<p><strong>Prompt Permutations</strong>: Generating a diverse set of prompt variations can help in identifying which phrasings or formulations yield the best results. This could involve systematically modifying sentence structure, word order, or incorporating synonyms and paraphrases.</p>
</li>
<li>
<p><strong>Prompt Length and Complexity</strong>: Analyzing the impact of prompt length and complexity on model performance. Longer prompts may provide more context but risk confusing the model, while shorter prompts might lack necessary context.</p>
</li>
<li>
<p><strong>Multi-step Prompts</strong>: Crafting prompts that involve multi-step instructions or conditional logic to guide the model through a series of steps to reach a desired conclusion.</p>
</li>
<li>
<p><strong>Prompt Contextualization</strong>: Incorporating relevant context or domain-specific information within prompts to enhance the model's knowledge and improve response quality.</p>
</li>
<li>
<p><strong>Prompt Targeting</strong>: Experimenting with prompts that explicitly mention the desired answer or output, guiding the model toward a specific response.</p>
</li>
</ol>
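<p>The permutation idea above can be sketched with plain string templates. The templates and style suffixes below are illustrative assumptions, not part of any specific tool:</p>

```python
from itertools import product

# Hypothetical prompt templates and style suffixes, used only for illustration.
templates = [
    "Summarize the following text: {text}",
    "Provide a brief summary of: {text}",
    "{text}\n\nTL;DR:",
]
styles = ["", " Answer in one sentence.", " Answer in bullet points."]

def prompt_permutations(text):
    """Generate prompt variants by crossing templates with style suffixes."""
    return [t.format(text=text) + s for t, s in product(templates, styles)]

variants = prompt_permutations("Large language models are ...")
print(len(variants))  # 3 templates x 3 styles = 9 variants
```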
<p><a id="activities-and-challenges-in-prompt-discovery"></a></p>
<h2>Activities and Challenges in Prompt Discovery</h2>
<ol>
<li>
<p><strong>Prompt Effectiveness Evaluation</strong>: Developing methodologies to quantitatively and qualitatively assess the effectiveness of different prompts in eliciting accurate and relevant responses.</p>
</li>
<li>
<p><strong>Prompt Generalization</strong>: Investigating how well a well-optimized prompt can generalize across different models, architectures, and datasets.</p>
</li>
<li>
<p><strong>Prompt Adaptation</strong>: Identifying techniques to adapt prompts for various domains, languages, or tasks, considering nuances in language and context.</p>
</li>
<li>
<p><strong>Adversarial Prompt Design</strong>: Exploring methods to create prompts that challenge the model's limitations and encourage robustness against adversarial inputs.</p>
</li>
<li>
<p><strong>Active Learning for Prompt Refinement</strong>: Developing algorithms that iteratively learn and refine prompts based on model performance, aiming to reduce human intervention in the prompt engineering process.</p>
</li>
<li>
<p><strong>Prompt Diversity Exploration</strong>: Analyzing the impact of diverse prompts on model behavior, uncovering potential biases, and ensuring fairness in responses.</p>
</li>
</ol>
<p><a id="types-of-tools-and-technologies-for-prompt-discovery"></a></p>
<h2>Types of Tools and Technologies for Prompt Discovery</h2>
<ol>
<li>
<p><strong>Prompt Generation Assistants</strong>: AI-driven tools that provide prompt suggestions, permutations, and optimizations based on user-defined criteria and objectives.</p>
</li>
<li>
<p><strong>Prompt Evaluation Metrics</strong>: Novel metrics that quantitatively measure the quality, relevance, and correctness of model responses based on different prompts.</p>
</li>
<li>
<p><strong>Semantic Prompt Analysis</strong>: Advanced natural language understanding tools capable of dissecting prompt semantics, identifying key components, and suggesting improvements.</p>
</li>
<li>
<p><strong>Prompt Optimization Algorithms</strong>: Algorithms that leverage reinforcement learning, genetic algorithms, or neural architecture search to automatically discover effective prompts.</p>
</li>
<li>
<p><strong>Prompt-Aware Model Architectures</strong>: Model architectures explicitly designed to leverage and incorporate prompt information effectively during the generation process.</p>
</li>
<li>
<p><strong>Contextualization Modules</strong>: Modules that enhance prompts with contextual information, leveraging external knowledge sources or domain-specific databases.</p>
</li>
<li>
<p><strong>Bias and Fairness Detection Tools</strong>: Tools that analyze prompts for potential bias and fairness issues, ensuring the generated responses align with ethical and unbiased standards.</p>
</li>
<li>
<p><strong>Interactive Prompt Refinement Interfaces</strong>: Interfaces allowing users to interactively refine and experiment with prompts, providing real-time feedback on model responses.</p>
</li>
</ol>
<p>As the field of prompt engineering and large language models evolves, these tools and techniques will likely become more sophisticated, enabling more efficient and effective prompt discovery processes. A few tools were available at the time of writing (Aug 2023):</p>
<ul>
<li>
<p><a href="https://github.com/ianarawjo/ChainForge">ianarawjo/ChainForge</a> - An open-source visual programming environment for LLM experimentation and prompt evaluation.
<img alt="github stars shield" src="https://img.shields.io/github/stars/ianarawjo/ChainForge.svg?logo=github"></p>
</li>
<li>
<p><a href="https://github.com/logspace-ai/langflow">logspace-ai/langflow</a> - Langflow is a UI for LangChain, designed with react-flow to provide an effortless way to experiment and prototype flows.
<img alt="github stars shield" src="https://img.shields.io/github/stars/logspace-ai/langflow.svg?logo=github"></p>
</li>
<li>
<p><a href="https://github.com/FlowiseAI/Flowise">FlowiseAI/Flowise</a> - Drag &amp; drop UI to build your customized LLM flow
<img alt="github stars shield" src="https://img.shields.io/github/stars/FlowiseAI/Flowise.svg?logo=github"></p>
</li>
</ul>Azure OpenAI Langchain configuration2023-08-02T00:00:00+02:002023-10-23T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-08-02:/azure-openai-langchain-configuration/<p>This note contains a recipe for how to configure LangChain to use Azure OpenAI.</p>
<p>NOTE: requires the <code>python-dotenv</code> Python package to be installed</p>
<h2>create <code>.env</code> with configuration and secrets</h2>
<div class="highlight"><pre><span></span><code>OPENAI_API_TYPE="azure"
OPENAI_API_KEY="***"
OPENAI_API_BASE="***"
OPENAI_API_VERSION="***"
</code></pre></div>
<h2>initialize langchain</h2>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">dotenv</span> <span class="kn">import</span> <span class="n">load_dotenv</span><span class="p">,</span><span class="n">find_dotenv</span>
<span class="kn">from</span> <span class="nn">langchain.llms</span> <span class="kn">import …</span></code></pre></div><p>This note contains a recipe for how to configure LangChain to use Azure OpenAI.</p>
<p>NOTE: requires the <code>python-dotenv</code> Python package to be installed</p>
<h2>create <code>.env</code> with configuration and secrets</h2>
<div class="highlight"><pre><span></span><code>OPENAI_API_TYPE="azure"
OPENAI_API_KEY="***"
OPENAI_API_BASE="***"
OPENAI_API_VERSION="***"
</code></pre></div>
<h2>initialize langchain</h2>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">dotenv</span> <span class="kn">import</span> <span class="n">load_dotenv</span><span class="p">,</span><span class="n">find_dotenv</span>
<span class="kn">from</span> <span class="nn">langchain.llms</span> <span class="kn">import</span> <span class="n">AzureOpenAI</span>
<span class="n">load_dotenv</span><span class="p">(</span><span class="n">find_dotenv</span><span class="p">())</span>
<span class="n">deployment_name</span> <span class="o">=</span> <span class="s2">"text-davinci-003"</span>
<span class="n">model_name</span> <span class="o">=</span> <span class="s2">"text-davinci-003"</span>
<span class="n">llm</span> <span class="o">=</span> <span class="n">AzureOpenAI</span><span class="p">(</span><span class="n">deployment_name</span><span class="o">=</span><span class="n">deployment_name</span><span class="p">,</span> <span class="n">model_name</span><span class="o">=</span><span class="n">model_name</span><span class="p">)</span>
<span class="c1"># check if it works</span>
<span class="nb">print</span><span class="p">(</span><span class="n">llm</span><span class="p">(</span><span class="s2">"What is the capital of France?"</span><span class="p">))</span>
</code></pre></div>
<p>NOTE: the purpose of <code>find_dotenv</code> is to locate the <code>.env</code> file in your project directory or its parent directories. It starts the search from the directory of the file that calls it (or from the current working directory when <code>usecwd=True</code>) and moves up the directory tree until it finds the <code>.env</code> file; if none is found, it returns an empty string. This function is beneficial because it ensures your code can locate the <code>.env</code> file regardless of the directory from which your script is executed.</p>
<p>Rank fusion is a fundamental technique used in various domains, including data science and search engine optimization, to combine multiple ranked lists into a single, more reliable ranking. This process aims to exploit the strengths of individual ranking algorithms and mitigate …</p><h2>Introduction</h2>
<p>Rank fusion is a fundamental technique used in various domains, including data science and search engine optimization, to combine multiple ranked lists into a single, more reliable ranking. This process aims to exploit the strengths of individual ranking algorithms and mitigate their weaknesses, leading to improved overall performance. In this blog post, we will explore a range of rank fusion algorithms, starting from simple yet effective methods to advanced techniques employed by tech giants to achieve state-of-the-art results.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#algorithms">Algorithms</a><ul>
<li><a href="#borda-algorithm">Borda Algorithm</a></li>
<li><a href="#combining-probability-mass-function-cpmf">Combining Probability Mass Function (CPMF)</a></li>
<li><a href="#rank-biased-precision-rbp">Rank-Biased Precision (RBP)</a></li>
<li><a href="#lambdamart">LambdaMART</a></li>
<li><a href="#neural-rank-fusion">Neural Rank Fusion</a></li>
<li><a href="#reciprocal-rank-fusion">Reciprocal rank fusion</a></li>
</ul>
</li>
<li><a href="#conclusion">Conclusion</a></li>
<li><a href="#references">References</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="algorithms"></a></p>
<h2>Algorithms</h2>
<p><a id="borda-algorithm"></a></p>
<h3>Borda Algorithm</h3>
<p>The Borda algorithm is one of the simplest rank fusion techniques. It assigns scores to items based on their positions in the individual rankings and then combines these scores to obtain a fused ranking. In the context of search engine results, each document receives points based on its position in the ranked lists. The points are then summed up to form the final rank.</p>
<p>Consider <span class="math">\(n\)</span> ranked lists <span class="math">\(\{R_1, R_2, \ldots, R_n\}\)</span> with <span class="math">\(m\)</span> items. The Borda algorithm assigns points to each item <span class="math">\(i\)</span> in the following way:</p>
<div class="math">$$
\text{Borda Score}(i) = \sum_{j=1}^{n} (m - \text{rank}_j(i))
$$</div>
<p>
Where <span class="math">\(\text{rank}_j(i)\)</span> denotes the position of item <span class="math">\(i\)</span> in the <span class="math">\(j\)</span>th ranked list.</p>
<p>The Borda algorithm is easy to implement, but the quality of its fused ranking can degrade for large datasets or when the individual rankings are significantly diverse.</p>
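<p>As a rough sketch (not from the original post), the Borda score above can be computed directly from a list of rankings:</p>

```python
def borda_scores(rankings):
    """Fuse ranked lists with the Borda count.

    Each ranking is a list of items ordered from best (rank 1) to worst.
    An item at rank r in a list of m items contributes m - r points.
    """
    scores = {}
    for ranking in rankings:
        m = len(ranking)
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0) + (m - rank)
    # Sort items by descending total score to obtain the fused ranking.
    return sorted(scores, key=scores.get, reverse=True)

fused = borda_scores([["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]])
print(fused)  # ['a', 'b', 'c'] - 'a' accumulates the most points (5)
```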
<p><a id="combining-probability-mass-function-cpmf"></a></p>
<h3>Combining Probability Mass Function (CPMF)</h3>
<p>CPMF is a probabilistic rank fusion method that incorporates the probability of an item being at a certain rank in individual lists. It assumes that the rankings are probabilistic and uses the Probability Mass Function (PMF) to calculate the fused ranking. CPMF outperforms Borda for diverse and noisy datasets.</p>
<p>Let <span class="math">\(p_{ij}\)</span> be the probability that item <span class="math">\(i\)</span> appears at rank <span class="math">\(j\)</span> in the <span class="math">\(n\)</span> lists. The CPMF score for item <span class="math">\(i\)</span> is given by:</p>
<div class="math">$$
\text{CPMF Score}(i) = \sum_{j=1}^{m} p_{ij}
$$</div>
<p>The probabilities <span class="math">\(p_{ij}\)</span> can be estimated using techniques like the <a href="https://hturner.github.io/PlackettLuce/articles/Overview.html">Plackett-Luce model</a> or the Thurstone-Mosteller model.</p>
<p><a id="rank-biased-precision-rbp"></a></p>
<h3>Rank-Biased Precision (RBP)</h3>
<p>RBP is a rank fusion method widely used in information retrieval systems. It incorporates a user-defined persistence parameter <span class="math">\(p\)</span> to reflect the probability that a user will examine the search results up to a certain rank. This parameter allows the search engine to optimize rankings based on user behavior.</p>
<p>For a given ranked list <span class="math">\(R_j\)</span>, the RBP score is calculated as follows:</p>
<div class="math">$$
\text{RBP Score}(R_j) = (1 - p) \sum_{k=1}^{m} p^{k-1} \text{rel}(R_j[k])
$$</div>
<p>
Where <span class="math">\(\text{rel}(R_j[k])\)</span> is an indicator function representing the relevance of the item at rank <span class="math">\(k\)</span> in list <span class="math">\(R_j\)</span>.</p>
<p>RBP provides more flexibility in tuning the importance of different ranks based on user preferences.</p>
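<p>To make the formula concrete, here is a minimal sketch of the RBP score for a single ranked list, assuming binary 0/1 relevance judgments (an illustrative simplification):</p>

```python
def rbp_score(relevances, p=0.8):
    """Rank-Biased Precision for one ranked list.

    relevances: 0/1 relevance judgments ordered by rank (rank 1 first).
    p: persistence - probability the user continues to the next result.
    """
    # (1 - p) * sum over ranks k of p^(k-1) * rel(k), with k starting at 1.
    return (1 - p) * sum(rel * p ** k for k, rel in enumerate(relevances))

# A list whose relevant items sit near the top scores higher.
print(rbp_score([1, 1, 0, 0]))                           # 0.36
print(rbp_score([1, 1, 0, 0]) > rbp_score([0, 0, 1, 1])) # True
```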
<p><a id="lambdamart"></a></p>
<h3>LambdaMART</h3>
<p>LambdaMART is an advanced algorithm used by tech giants like Microsoft and Yahoo for learning-to-rank tasks. It is based on the gradient boosting framework and employs LambdaRank gradients, which scale RankNet's pairwise cost by the change a pair swap would cause in a ranking metric such as NDCG.</p>
<p>The LambdaMART algorithm involves constructing a set of weak rankers (usually decision trees) that are iteratively refined to minimize the LambdaRank objective, which directly measures the pairwise disagreement between ranks.</p>
<div class="math">$$
\text{LambdaRank Objective} = \sum_{i=1}^{m} \sum_{j=1}^{m} \text{DCG gain}(i, j) \cdot \text{Lambda}(i, j)
$$</div>
<p>Where <span class="math">\(\text{DCG gain}(i, j)\)</span> is the gain of swapping items at ranks <span class="math">\(i\)</span> and <span class="math">\(j\)</span> in the ranking, and <span class="math">\(\text{Lambda}(i, j)\)</span> is a weight function that depends on the gradients of the individual models.</p>
<p>LambdaMART's ability to optimize for ranking measures directly contributes to its superior performance in learning-to-rank scenarios.</p>
<p><a id="neural-rank-fusion"></a></p>
<h3>Neural Rank Fusion</h3>
<p>With the rise of deep learning, neural rank fusion methods have gained popularity due to their ability to learn complex patterns from data. Neural rank fusion models typically employ techniques like siamese networks or transformer-based architectures to process individual rankings and generate a fused ranking.</p>
<p>In a siamese network-based approach, the individual rankings are fed into two parallel networks with shared weights. The networks learn to map the rankings into a common embedding space, where the fused ranking is generated based on similarity scores.</p>
<p>On the other hand, transformer-based rank fusion models utilize attention mechanisms to process and combine individual rankings effectively.</p>
<p>Neural rank fusion methods often outperform traditional algorithms when sufficient training data is available, but they may require substantial computational resources.</p>
<p><a id="reciprocal-rank-fusion"></a></p>
<h3>Reciprocal rank fusion</h3>
<p>The <a href="https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf">Reciprocal Rank Fusion (RRF)</a> is an advanced algorithmic technique designed to amalgamate multiple result sets, each having distinct relevance indicators, into a unified result set. One of the key advantages of RRF is its ability to deliver high-quality results without the necessity for any tuning. Moreover, it does not mandate the relevance indicators to be interconnected or similar in nature.</p>
<p>Diving deeper into the algorithm, RRF is based on the concept of reciprocal rank. The reciprocal rank of a document is the multiplicative inverse of its rank. In the context of information retrieval, the rank of a document is its position in a list of documents sorted by relevance. The reciprocal rank is used to give higher weight to documents that appear earlier in the list.</p>
<p>The RRF algorithm combines the reciprocal ranks of the same document from different result sets to compute a combined score. The combined score is then used to rank the documents in the final result set. The formula used in the RRF algorithm is as follows:</p>
<div class="math">$$
\text{RRF Score} = \frac{1}{k + rank}
$$</div>
<p>Where <span class="math">\(k\)</span> is a constant (usually set to 60), and <span class="math">\(rank\)</span> is the rank of the document in a particular result set. The RRF score is calculated for each document in each result set, and the scores are then summed up to get the final score for each document.</p>
<p>The properties of the RRF algorithm include its simplicity, effectiveness, and robustness. It is simple because it only requires the ranks of the documents and does not need any tuning. It is effective because it can combine result sets with different relevance indicators and still produce high-quality results. It is robust because it is not sensitive to the choice of <span class="math">\(k\)</span> and can handle a large number of result sets.
<a id="conclusion"></a></p>
<h2>Conclusion</h2>
<p>Rank fusion serves as a potent tool in the arsenal of data scientists and search engine experts, enhancing the efficacy of ranking performance. The spectrum of rank fusion algorithms ranges from the <strong>straightforward Borda algorithm</strong> to the more complex Neural Rank Fusion, each tailored to meet specific scenarios and data attributes. While the <strong>Borda</strong> algorithm is <strong>appreciated</strong> for its <strong>simplicity</strong> and <strong>ease of implementation</strong>, more advanced techniques like <strong>LambdaMART</strong> and <strong>Neural Rank Fusion</strong> are capable of delivering <strong>cutting-edge results for large-scale applications</strong>.</p>
<p>Incorporating the Reciprocal Rank Fusion (RRF) into this discussion, it stands out for its ability to <strong>combine multiple result sets with varying relevance indicators</strong> <strong>without the need for tuning</strong>. This makes it a robust and effective choice for many applications.</p>
<p><strong>Edits</strong>:</p>
<ul>
<li>2023-10-09 - Added the "Reciprocal rank fusion" section, rewrote the conclusion
<a id="references"></a></li>
</ul>
<h2>References</h2>
<ol>
<li>Wikipedia article: <a href="https://en.wikipedia.org/wiki/Borda_count">Borda algorithm</a></li>
<li><a href="https://www.microsoft.com/en-us/research/publication/from-ranknet-to-lambdarank-to-lambdamart-an-overview/">Burges, Christopher. "From RankNet to LambdaRank to LambdaMART: An overview." Learning 11.23-581 (2010): 81.</a></li>
<li><a href="https://people.eng.unimelb.edu.au/jzobel/fulltext/acmtois08.pdf">Rank-Biased Precision for Measurement of Retrieval Effectiveness</a></li>
<li><a href="https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf">Reciprocal rank fusion (RRF)</a></li>
</ol>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Implementing Reciprocal Rank Fusion (RRF) in Python2023-07-28T00:00:00+02:002023-10-09T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-28:/implementing-rank-fusion-in-python/<p>In the world of Information Retrieval, ranking is one of the most crucial aspects. It prioritizes the matching information according to its relevancy. In many cases, having a single ranking model may not satisfy the diverse needs of users. This is where …</p><p>In the world of Information Retrieval, ranking is one of the most crucial aspects. It prioritizes the matching information according to its relevancy. In many cases, having a single ranking model may not satisfy the diverse needs of users. This is where the idea of Rank Fusion comes in; combining various ranking models to enhance the retrieval performance.
Let's learn how to implement a simple rank fusion approach in Python.</p>
<h2>Understanding the RRF Ranking Process</h2>
<p>The Reciprocal Rank Fusion (RRF) operates by collecting search outcomes from various strategies, assigning each document in the results a reciprocal rank score, and subsequently merging these scores to generate a new ranking. The underlying principle is that documents that consistently appear in top positions across diverse search strategies are likely more pertinent and should thus receive a higher rank in the consolidated result.</p>
<p>Here's a simplified breakdown of the RRF process:</p>
<ol>
<li>
<p>Collect ranked search outcomes from multiple simultaneous queries.</p>
</li>
<li>
<p>Assign reciprocal rank scores to each result in the ranked lists. The RRF process generates a new search score for each match in each result set. For each document in the search results, the algorithm assigns a reciprocal rank score based on its position in the list. This score is computed as 1/(rank + k), where 'rank' is the document's position in the list, and 'k' is a constant. Empirical observation suggests that 'k' performs best when set to a small value, such as 60. Note that this 'k' value is a constant in the RRF algorithm and is entirely distinct from the 'k' that regulates the number of nearest neighbors.</p>
</li>
<li>
<p>Combine scores. The algorithm adds up the reciprocal rank scores acquired from each search strategy for each document, thereby generating a combined score for each document.</p>
</li>
<li>
<p>The algorithm ranks documents based on the combined scores and arranges them accordingly. The resulting list constitutes the fused ranking.</p>
</li>
</ol>
<p>To depict the Reciprocal Rank Fusion (RRF) process, we can use a flowchart.
<img alt="Reciprocal Rank Fusion (RRF) process flow chart" src="/images/Reciprocal_Rank_Fusion/Reciprocal_Rank_Fusion.png"></p>
<p><strong>Figure 1:</strong> Reciprocal Rank Fusion (RRF) Process Flowchart. The diagram illustrates the steps involved in the RRF ranking process.</p>
<h2>Implementing Reciprocal Rank Fusion</h2>
<p>The <a href="https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf">Reciprocal Rank Fusion (RRF)</a> is an advanced algorithmic technique designed to amalgamate multiple result sets, each having distinct relevance indicators, into a unified result set. One of the key advantages of RRF is its ability to deliver high-quality results without the necessity for any tuning. Moreover, it does not mandate the relevance indicators to be interconnected or similar in nature.</p>
<p>RRF uses the following formula to determine the score for ranking each document:</p>
<div class="highlight"><pre><span></span><code><span class="n">score</span> <span class="o">=</span> <span class="mf">0.0</span>
<span class="k">for</span> <span class="n">q</span> <span class="ow">in</span> <span class="n">queries</span><span class="p">:</span>
<span class="k">if</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">result</span><span class="p">(</span><span class="n">q</span><span class="p">):</span>
<span class="n">score</span> <span class="o">+=</span> <span class="mf">1.0</span> <span class="o">/</span> <span class="p">(</span> <span class="n">k</span> <span class="o">+</span> <span class="n">rank</span><span class="p">(</span> <span class="n">result</span><span class="p">(</span><span class="n">q</span><span class="p">),</span> <span class="n">d</span> <span class="p">)</span> <span class="p">)</span>
<span class="k">return</span> <span class="n">score</span>
<span class="c1"># where</span>
<span class="c1"># k is a ranking constant</span>
<span class="c1"># q is a query in the set of queries</span>
<span class="c1"># d is a document in the result set of q</span>
<span class="c1"># result(q) is the result set of q</span>
<span class="c1"># rank( result(q), d ) is d's rank within the result(q) starting from 1</span>
</code></pre></div>
<p>(code from <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html">Elasticsearch documentation</a>)</p>
<p>The same computation can be expressed more compactly, and often slightly faster, as a list comprehension:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">reciprocal_rank_fusion</span><span class="p">(</span><span class="n">queries</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span> <span class="n">result_func</span><span class="p">,</span> <span class="n">rank_func</span><span class="p">):</span>
<span class="k">return</span> <span class="nb">sum</span><span class="p">([</span><span class="mf">1.0</span> <span class="o">/</span> <span class="p">(</span><span class="n">k</span> <span class="o">+</span> <span class="n">rank_func</span><span class="p">(</span><span class="n">result_func</span><span class="p">(</span><span class="n">q</span><span class="p">),</span> <span class="n">d</span><span class="p">))</span> <span class="k">if</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">result_func</span><span class="p">(</span><span class="n">q</span><span class="p">)</span> <span class="k">else</span> <span class="mi">0</span> <span class="k">for</span> <span class="n">q</span> <span class="ow">in</span> <span class="n">queries</span><span class="p">])</span>
</code></pre></div>
<p>This function takes as arguments:</p>
<ul>
<li>A collection of queries</li>
<li>A document <code>d</code></li>
<li>A ranking constant <code>k</code></li>
<li>A function <code>result_func</code> that represents the <code>result(q)</code> operation from the pseudocode above.</li>
<li>A function <code>rank_func</code> that represents the <code>rank(result(q), d)</code> operation from the pseudocode above.</li>
</ul>
<blockquote>
<p><strong>NOTE:</strong> a list comprehension is used to perform the operations that the for-loop did, allowing Python to compute the result in a more optimized way. However, this isn't truly "vectorized" computing as you would find in libraries like NumPy or in languages like R, where computations are performed concurrently rather than sequentially.</p>
</blockquote>
<p>The <code>result_func</code> function takes a query <code>q</code>, and returns a list of documents that are the results of the query. For simplicity, let's assume that each query corresponds to a list of documents in a dictionary called <code>database</code>.</p>
<p>The <code>rank_func</code> function takes a list of documents (results of a query) and a specific document <code>d</code>, and returns the rank of <code>d</code> in the list.</p>
<div class="highlight"><pre><span></span><code><span class="n">database</span> <span class="o">=</span> <span class="p">{</span> <span class="c1"># assuming your queries and results are stored in a dictionary</span>
<span class="s1">'query1'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'doc1'</span><span class="p">,</span> <span class="s1">'doc2'</span><span class="p">,</span> <span class="s1">'doc3'</span><span class="p">],</span>
<span class="s1">'query2'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'doc3'</span><span class="p">,</span> <span class="s1">'doc1'</span><span class="p">,</span> <span class="s1">'doc2'</span><span class="p">],</span>
<span class="c1"># more queries and their document results...</span>
<span class="p">}</span>
<span class="k">def</span> <span class="nf">result_func</span><span class="p">(</span><span class="n">q</span><span class="p">):</span>
<span class="k">return</span> <span class="n">database</span><span class="p">[</span><span class="n">q</span><span class="p">]</span>
<span class="k">def</span> <span class="nf">rank_func</span><span class="p">(</span><span class="n">results</span><span class="p">,</span> <span class="n">d</span><span class="p">):</span>
<span class="k">return</span> <span class="n">results</span><span class="o">.</span><span class="n">index</span><span class="p">(</span><span class="n">d</span><span class="p">)</span> <span class="o">+</span> <span class="mi">1</span> <span class="c1"># adding 1 because ranks start from 1</span>
</code></pre></div>
<p>Then, the <code>reciprocal_rank_fusion</code> function can be called like this:</p>
<div class="highlight"><pre><span></span><code><span class="n">k</span> <span class="o">=</span> <span class="mi">5</span>
<span class="n">d</span> <span class="o">=</span> <span class="s1">'doc1'</span>
<span class="n">queries</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'query1'</span><span class="p">,</span> <span class="s1">'query2'</span><span class="p">]</span> <span class="c1"># fill this with your actual query keys</span>
<span class="nb">print</span><span class="p">(</span><span class="n">reciprocal_rank_fusion</span><span class="p">(</span><span class="n">queries</span><span class="p">,</span> <span class="n">d</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span> <span class="n">result_func</span><span class="p">,</span> <span class="n">rank_func</span><span class="p">))</span>
</code></pre></div>
<p>This assumes that queries and their corresponding results are uniquely stored in a dictionary, and that your document ranks are determined by their order in the list of results. With the sample <code>database</code> above, the call prints ≈0.3095 (that is, 1/6 + 1/7), since <code>doc1</code> ranks first in <code>query1</code> and second in <code>query2</code>.</p>
<p>Please modify the functions <code>result_func</code>, <code>rank_func</code>, and <code>database</code> to fit your specific application details and data.</p>
<h2>Conclusion</h2>
<p>The concept of Rank Fusion, particularly the Reciprocal Rank Fusion (RRF) method, offers a promising approach to amalgamate multiple result sets into a unified one. This article has demonstrated how to implement a simple RRF in Python.</p>
<p>While the example provided in this article is simplified, it provides a solid foundation for understanding the RRF process and how to implement it in Python. Depending on the specific application and data, the functions and database structure may need to be modified. However, the core concept and approach remain the same.</p>
<p>The RRF method is a powerful tool in the field of Information Retrieval, providing a robust and efficient way to combine multiple ranking models to enhance retrieval performance. By understanding and implementing this method, one can significantly improve the quality and relevance of search results, thereby enhancing user satisfaction and system effectiveness.</p>
<p><strong>Edits:</strong></p>
<ul>
<li>2023-11-06: changed title to: Implementing Reciprocal Rank Fusion and Borda Count in Python</li>
<li>2023-11-06: added RRF description</li>
<li>2023-11-06: added optimized implementation</li>
</ul>
<p>X::<a href="https://www.safjan.com/Rank-fusion-algorithms-from-simple-to-advanced/">Rank Fusion Algorithms - From Simple to Advanced</a></p>gitignore-style exclusion for restic2023-07-27T00:00:00+02:002023-07-27T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-27:/gitignore-style-exclusion-for-restic/<p>X::<a href="https://www.safjan.com/verify-backups-restic-example/">Don't Just Create Backups, Verify Them - How Restic Can Help?</a></p>
<p>Restic is a popular backup tool that supports the use of <code>.gitignore</code>-style exclusion patterns to ignore certain files and directories during the backup process. This feature is useful when you …</p><p>X::<a href="https://www.safjan.com/verify-backups-restic-example/">Don't Just Create Backups, Verify Them - How Restic Can Help?</a></p>
<p>Restic is a popular backup tool that supports the use of <code>.gitignore</code>-style exclusion patterns to ignore certain files and directories during the backup process. This feature is useful when you want to exclude specific files or directories from being backed up, such as temporary files, caches, or build artifacts.</p>
<p>To use exclusion patterns with Restic, you can create a file called <code>.resticignore</code> in the root of your repository (where you run Restic). The name is just a convention - Restic does not look for this file automatically; you pass it explicitly on the command line. The file should contain the patterns for the files and directories you want to exclude, just like you would do with a <code>.gitignore</code> file.</p>
<p>Here's how you can use exclusion patterns in Restic:</p>
<ol>
<li>
<p>Create a <code>.resticignore</code> file:
Inside your project's root directory (or the directory you're backing up), create a file named <code>.resticignore</code>. You can use any text editor to create this file.</p>
</li>
<li>
<p>Add patterns to ignore:
In the <code>.resticignore</code> file, list the files and directories you want to ignore during the backup. Each pattern should be on a separate line. You can use the same syntax as you would in a <code>.gitignore</code> file.</p>
</li>
</ol>
<p>For example, a simple <code>.resticignore</code> file might look like this:</p>
<div class="highlight"><pre><span></span><code>*.log
temp/
cache/
build/
</code></pre></div>
<p>The above example would ignore all files with the <code>.log</code> extension and the <code>temp</code>, <code>cache</code>, and <code>build</code> directories.</p>
<ol>
<li>Run Restic backup with the <code>--exclude-file</code> option:
When running Restic to perform the backup, specify the <code>.resticignore</code> file using the <code>--exclude-file</code> option. This tells Restic to use the patterns in that file to exclude certain files and directories.</li>
</ol>
<p>Here's an example command:</p>
<p><code>restic backup /path/to/your/data --exclude-file /path/to/.resticignore</code></p>
<p>Replace <code>/path/to/your/data</code> with the actual path of the data you want to back up and <code>/path/to/.resticignore</code> with the path to your <code>.resticignore</code> file.</p>
<p>By using the <code>.resticignore</code> file, you can customize what gets backed up and what gets excluded. This can be particularly useful to avoid backing up large or unnecessary files, reducing storage space and backup time.</p>Location of Python Virtual Environments - Choosing Between Project-Folder and Centralized Folder2023-07-27T00:00:00+02:002023-07-27T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-27:/location-of-python-virtual-environments-choosing-between-project-folder-and-central-folder/<h2>Project-folder Virtual Environments</h2>
<blockquote>
<p>In this approach, you create a virtual environment <strong>within the project directory</strong> itself. This means that each project has its isolated Python environment, and you manage dependencies specific to that project.</p>
</blockquote>
<p>With this approach you have clarity where the …</p><h2>Project-folder Virtual Environments</h2>
<blockquote>
<p>In this approach, you create a virtual environment <strong>within the project directory</strong> itself. This means that each project has its isolated Python environment, and you manage dependencies specific to that project.</p>
</blockquote>
<p>With this approach it is clear where the associated virtual environment resides, which is helpful when doing cleanup or backup.</p>
<h2>Centralized Location for Virtual Environments</h2>
<blockquote>
<p>In this approach, you create a <strong>centralized directory</strong> where <strong>all virtual environments reside</strong>. This directory can be outside your projects, e.g., <code>~/.virtualenvs</code> or any other location you prefer.</p>
</blockquote>
<p>With this approach the project directory contains mainly code, while the replicable content - the virtual environment files - lives outside the project. This makes it easy to, for example, back up the whole project directory without having to exclude the virtual environment, which is typically not worth backing up.</p>
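<p>As a small illustration of the centralized layout, the standard-library <code>venv</code> module can create an environment anywhere. The <code>~/.virtualenvs</code> path and project name below are assumptions, not a convention enforced by Python:</p>

```python
import venv
from pathlib import Path

def create_central_env(project: str,
                       central: Path = Path.home() / ".virtualenvs") -> Path:
    """Create a virtual environment for `project` under a central folder."""
    env_dir = central / project
    # with_pip=True would additionally bootstrap pip into the new environment
    venv.create(env_dir, with_pip=False)
    return env_dir

# e.g. create_central_env("myproject")  -> ~/.virtualenvs/myproject
```

The project folder itself then stays free of environment files; tools such as <code>virtualenvwrapper</code> automate the same layout.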
<p>X: <a href="https://www.safjan.com/python-create-virtualenv-methods/">Creating Virtual Environments in Python</a></p>Cookiecutters for the python package with poetry2023-07-26T00:00:00+02:002023-07-26T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-26:/cookiecutter-for-the-python-package-with-poetry/<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#cookiecutter-and-poetry-for-python-project-scaffolding">Cookiecutter and Poetry for Python Project Scaffolding</a></li>
<li><a href="#benefits-of-using-cookiecutter-for-project-scaffolding">Benefits of Using Cookiecutter for Project Scaffolding</a></li>
<li><a href="#advantages-of-using-poetry-for-dependency-management">Advantages of Using Poetry for Dependency Management</a></li>
<li><a href="#cookiecutters">Cookiecutters</a></li>
<li><a href="#cjolowiczcookiecutter-hypermodern-python">cjolowicz/cookiecutter-hypermodern-python</a></li>
<li><a href="#fpgmaascookiecutter-poetry">fpgmaas/cookiecutter-poetry</a></li>
<li><a href="#radix-aipoetry-cookiecutter">radix-ai/poetry-cookiecutter</a></li>
<li><a href="#albertorioscookiecutter-poetry-pypackage">albertorios/cookiecutter-poetry-pypackage</a></li>
<li><a href="#timhughescookiecutter-poetry">timhughes/cookiecutter-poetry</a></li>
<li><a href="#johanvergeercookiecutter-poetry">johanvergeer/cookiecutter-poetry</a></li>
<li><a href="#elbakramercookiecutter-poetry">elbakramer/cookiecutter-poetry</a></li>
<li><a href="#wboxx1cookiecutter-pypackage-poetry">wboxx1/cookiecutter-pypackage-poetry</a></li>
<li><a href="#cookiecutter-wrapper">cookiecutter wrapper</a></li>
<li><a href="#tools-and-services-often-used-in-python-project-cookiecutters">Tools …</a></li></ul><!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#cookiecutter-and-poetry-for-python-project-scaffolding">Cookiecutter and Poetry for Python Project Scaffolding</a></li>
<li><a href="#benefits-of-using-cookiecutter-for-project-scaffolding">Benefits of Using Cookiecutter for Project Scaffolding</a></li>
<li><a href="#advantages-of-using-poetry-for-dependency-management">Advantages of Using Poetry for Dependency Management</a></li>
<li><a href="#cookiecutters">Cookiecutters</a></li>
<li><a href="#cjolowiczcookiecutter-hypermodern-python">cjolowicz/cookiecutter-hypermodern-python</a></li>
<li><a href="#fpgmaascookiecutter-poetry">fpgmaas/cookiecutter-poetry</a></li>
<li><a href="#radix-aipoetry-cookiecutter">radix-ai/poetry-cookiecutter</a></li>
<li><a href="#albertorioscookiecutter-poetry-pypackage">albertorios/cookiecutter-poetry-pypackage</a></li>
<li><a href="#timhughescookiecutter-poetry">timhughes/cookiecutter-poetry</a></li>
<li><a href="#johanvergeercookiecutter-poetry">johanvergeer/cookiecutter-poetry</a></li>
<li><a href="#elbakramercookiecutter-poetry">elbakramer/cookiecutter-poetry</a></li>
<li><a href="#wboxx1cookiecutter-pypackage-poetry">wboxx1/cookiecutter-pypackage-poetry</a></li>
<li><a href="#cookiecutter-wrapper">cookiecutter wrapper</a></li>
<li><a href="#tools-and-services-often-used-in-python-project-cookiecutters">Tools and services often used in python project cookiecutters</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="introduction"></a></p>
<h2>Introduction</h2>
<p><a id="cookiecutter-and-poetry-for-python-project-scaffolding"></a></p>
<h3>Cookiecutter and Poetry for Python Project Scaffolding</h3>
<p>In the world of Python development, efficient project setup and management are essential for streamlined and successful software development. Two powerful tools that aid in this process are <strong>Cookiecutter</strong> and <strong>Poetry</strong>.</p>
<p><strong>Cookiecutter</strong> is a command-line utility that enables developers to create project templates, or "cookiecutters," which serve as scaffolds for new projects. These cookiecutters are pre-configured templates that include project structures, file layouts, and even code snippets to kickstart the development process. With its simplicity and flexibility, Cookiecutter allows developers to easily generate consistent and well-organized projects without reinventing the wheel each time.</p>
<p>On the other hand, <strong>Poetry</strong> is a modern package manager and build tool for Python projects. It simplifies dependency management, packaging, and publishing while ensuring reproducible builds and version control. Poetry provides a user-friendly interface for managing project dependencies and virtual environments, making it a valuable asset for Python developers looking for an efficient way to manage their project's requirements.</p>
<p><a id="benefits-of-using-cookiecutter-for-project-scaffolding"></a></p>
<h3>Benefits of Using Cookiecutter for Project Scaffolding</h3>
<p>Using Cookiecutter for project scaffolding offers several key advantages:</p>
<ol>
<li>
<p><strong>Consistency</strong>: Cookiecutter promotes consistency across projects by providing a standardized and repeatable starting point. This consistency ensures that developers adhere to best practices and maintain a clean project structure throughout the development process.</p>
</li>
<li>
<p><strong>Time Savings</strong>: With Cookiecutter, developers can avoid the repetitive and time-consuming task of setting up a new project from scratch. By using pre-defined templates, the initial project setup becomes quick and hassle-free, allowing developers to focus on writing code and implementing features.</p>
</li>
<li>
<p><strong>Community-Driven Templates</strong>: The open-source nature of Cookiecutter means that developers can access a vast repository of community-contributed templates. This diverse collection covers various project types and frameworks, making it easy to find a suitable starting point for almost any Python project.</p>
</li>
<li>
<p><strong>Flexibility and Customization</strong>: While offering pre-configured templates, Cookiecutter also allows developers to customize their project scaffolds. This flexibility ensures that developers can tailor the project structure to fit their specific needs and project requirements.</p>
</li>
</ol>
<p><a id="advantages-of-using-poetry-for-dependency-management"></a></p>
<h3>Advantages of Using Poetry for Dependency Management</h3>
<p>Poetry's features complement the benefits of Cookiecutter, making it an ideal companion for Python project development:</p>
<ol>
<li>
<p><strong>Dependency Management Made Easy</strong>: Poetry simplifies the management of project dependencies, handling both direct dependencies and their dependencies, providing a single-source-of-truth for the project's requirements.</p>
</li>
<li>
<p><strong>Virtual Environments</strong>: Poetry creates isolated virtual environments for projects, ensuring that each project has its own set of dependencies, avoiding version conflicts and promoting project stability.</p>
</li>
<li>
<p><strong>Publication and Distribution</strong>: Poetry streamlines the process of publishing packages to the Python Package Index (PyPI), simplifying the distribution of Python packages and making them accessible to a wider audience.</p>
</li>
<li>
<p><strong>Version Control and Reproducibility</strong>: Poetry's <code>pyproject.toml</code> file allows for clear specification of package versions, ensuring reproducible builds and making it easier to manage version updates.</p>
</li>
</ol>
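<p>For illustration, a minimal <code>pyproject.toml</code> along these lines shows how Poetry records metadata and version constraints in one place (the package name, author, and versions below are made-up):</p>

```toml
[tool.poetry]
name = "example-package"          # made-up project name
version = "0.1.0"
description = "Demo of Poetry dependency pinning"
authors = ["Jane Doe <jane@example.com>"]

[tool.poetry.dependencies]
python = "^3.9"                   # caret constraint: >=3.9, <4.0
requests = "^2.31"                # >=2.31, <3.0

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```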
<p><a id="cookiecutters"></a></p>
<h2>Cookiecutters</h2>
<p><a id="cjolowiczcookiecutter-hypermodern-python"></a></p>
<h3>cjolowicz/cookiecutter-hypermodern-python</h3>
<p><a href="https://github.com/cjolowicz/cookiecutter-hypermodern-python">https://github.com/cjolowicz/cookiecutter-hypermodern-python</a></p>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/cjolowicz/cookiecutter-hypermodern-python.svg?logo=github"></p>
<p><a href="https://cookiecutter-hypermodern-python.readthedocs.io/en/2021.3.14/guide.html">User Guide — Hypermodern Python Cookiecutter documentation</a></p>
<ul>
<li>Packaging and dependency management with <a href="https://python-poetry.org/">Poetry</a></li>
<li>Test automation with <a href="https://nox.thea.codes/">Nox</a></li>
<li>Linting with <a href="https://pre-commit.com/">pre-commit</a> and <a href="http://flake8.pycqa.org/">Flake8</a></li>
<li>Continuous integration with <a href="https://github.com/features/actions">GitHub Actions</a></li>
<li>Documentation with <a href="http://www.sphinx-doc.org/">Sphinx</a> and <a href="https://readthedocs.org/">Read the Docs</a></li>
<li>Automated uploads to <a href="https://pypi.org/">PyPI</a> and <a href="https://test.pypi.org/">TestPyPI</a></li>
<li>Automated release notes with <a href="https://github.com/release-drafter/release-drafter">Release Drafter</a></li>
<li>Automated dependency updates with <a href="https://dependabot.com/">Dependabot</a></li>
<li>Code formatting with <a href="https://github.com/psf/black">Black</a> and <a href="https://prettier.io/">Prettier</a></li>
<li>Testing with <a href="https://docs.pytest.org/en/latest/">pytest</a></li>
<li>Code coverage with <a href="https://coverage.readthedocs.io/">Coverage.py</a></li>
<li>Coverage reporting with <a href="https://codecov.io/">Codecov</a></li>
<li>Command-line interface with <a href="https://click.palletsprojects.com/">Click</a></li>
<li>Static type-checking with <a href="http://mypy-lang.org/">mypy</a></li>
<li>Runtime type-checking with <a href="https://github.com/agronholm/typeguard">Typeguard</a></li>
<li>Security audit with <a href="https://github.com/PyCQA/bandit">Bandit</a> and <a href="https://github.com/pyupio/safety">Safety</a></li>
<li>Check documentation examples with <a href="https://github.com/Erotemic/xdoctest">xdoctest</a></li>
<li>Generate API documentation with <a href="https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html">autodoc</a> and <a href="https://www.sphinx-doc.org/en/master/usage/extensions/napoleon.html">napoleon</a></li>
<li>Generate command-line reference with <a href="https://sphinx-click.readthedocs.io/">sphinx-click</a></li>
<li>Manage project labels with <a href="https://github.com/marketplace/actions/github-labeler">GitHub Labeler</a></li>
</ul>
<p><a id="fpgmaascookiecutter-poetry"></a></p>
<h3>fpgmaas/cookiecutter-poetry</h3>
<p><a href="https://github.com/fpgmaas/cookiecutter-poetry">https://github.com/fpgmaas/cookiecutter-poetry</a></p>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/fpgmaas/cookiecutter-poetry.svg?logo=github"></p>
<ul>
<li><a href="https://python-poetry.org/">Poetry</a> for dependency management</li>
<li>CI/CD with <a href="https://github.com/features/actions">GitHub Actions</a></li>
<li>Pre-commit hooks with <a href="https://pre-commit.com/">pre-commit</a></li>
<li>Code quality with <a href="https://pypi.org/project/black/">black</a>, <a href="https://github.com/charliermarsh/ruff">ruff</a>, <a href="https://mypy.readthedocs.io/en/stable/">mypy</a>, and <a href="https://github.com/fpgmaas/deptry/">deptry</a></li>
<li>Publishing to <a href="https://pypi.org/">Pypi</a> or <a href="https://jfrog.com/artifactory">Artifactory</a> by creating a new release on GitHub</li>
<li>Testing and coverage with <a href="https://docs.pytest.org/en/7.1.x/">pytest</a> and <a href="https://about.codecov.io/">codecov</a></li>
<li>Documentation with <a href="https://www.mkdocs.org/">MkDocs</a></li>
<li>Compatibility testing for multiple versions of Python with <a href="https://tox.wiki/en/latest/">Tox</a></li>
<li>Containerization with <a href="https://www.docker.com/">Docker</a></li>
</ul>
<p><a id="radix-aipoetry-cookiecutter"></a></p>
<h3>radix-ai/poetry-cookiecutter</h3>
<p><a href="https://github.com/radix-ai/poetry-cookiecutter">https://github.com/radix-ai/poetry-cookiecutter</a></p>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/radix-ai/poetry-cookiecutter.svg?logo=github"></p>
<p><a id="albertorioscookiecutter-poetry-pypackage"></a></p>
<h3>albertorios/cookiecutter-poetry-pypackage</h3>
<p><a href="https://github.com/albertorios/cookiecutter-poetry-pypackage">https://github.com/albertorios/cookiecutter-poetry-pypackage</a></p>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/albertorios/cookiecutter-poetry-pypackage.svg?logo=github"></p>
<ul>
<li>Develop, build, and release Python packages via <a href="https://python-poetry.org/">Poetry</a></li>
<li>Test against multiple Python versions via <a href="https://tox.readthedocs.io/en/latest/">Tox</a></li>
<li>Bump semantic version via <a href="https://github.com/c4urself/bump2version">bump2version</a></li>
<li>Optional command-line interface via <a href="https://click.palletsprojects.com/">Click</a></li>
<li>Repeatable build environments via <a href="https://www.docker.com/">Docker</a></li>
</ul>
<p><a id="timhughescookiecutter-poetry"></a></p>
<h3>timhughes/cookiecutter-poetry</h3>
<p><a href="https://github.com/timhughes/cookiecutter-poetry">https://github.com/timhughes/cookiecutter-poetry</a></p>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/timhughes/cookiecutter-poetry.svg?logo=github"></p>
<p>Cookiecutter template configured with the following:</p>
<ul>
<li>poetry</li>
<li>pytest</li>
<li>black</li>
<li>bandit</li>
<li>pyinstaller</li>
<li>jupyterlab</li>
<li>click</li>
</ul>
<p><a id="johanvergeercookiecutter-poetry"></a></p>
<h3>johanvergeer/cookiecutter-poetry</h3>
<p><a href="https://github.com/johanvergeer/cookiecutter-poetry">https://github.com/johanvergeer/cookiecutter-poetry</a></p>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/johanvergeer/cookiecutter-poetry.svg?logo=github"></p>
<ul>
<li>Testing setup with <code>pytest</code></li>
<li><a href="https://github.com/features/actions">GitHub Actions</a>: Ready for GitHub actions</li>
<li><a href="http://sphinx-doc.org/">Sphinx</a> docs: Documentation ready for generation with, for example, <a href="https://readthedocs.io/">ReadTheDocs</a></li>
<li>Auto-release to <a href="https://pypi.python.org/pypi">PyPI</a> when you push a new tag to master (optional)</li>
<li>Command-line interface using Click (optional)</li>
<li>GitHub Issue templates for bug reports and feature requests</li>
</ul>
<p><a id="elbakramercookiecutter-poetry"></a></p>
<h3>elbakramer/cookiecutter-poetry</h3>
<p><a href="https://github.com/elbakramer/cookiecutter-poetry">https://github.com/elbakramer/cookiecutter-poetry</a>
(fork from johanvergeer/cookiecutter-poetry)</p>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/elbakramer/cookiecutter-poetry.svg?logo=github"></p>
<ul>
<li>Package and dependency management using <a href="https://python-poetry.org/">Poetry</a></li>
<li>Has the option to stick with setuptools (setup.py)</li>
<li><a href="https://github.com/features/actions">GitHub Actions</a>: Ready for GitHub Actions</li>
<li>Build and test on push or pull request for continuous integration (CI) (+badge)</li>
<li>Build documentation on push, publish the built documentation to Github Pages (+badge)</li>
<li>Draft release on push, this draft can be published manually or even automatically when a new tag is pushed</li>
<li>Build and release Python package to <a href="https://pypi.org/">PyPI</a> when a new release tag is published on GitHub</li>
<li>Many <a href="https://pre-commit.com/">pre-commit</a> hook-based formatting, linting, testing tools</li>
<li>Upgrade syntax for newer Python with <a href="https://github.com/asottile/pyupgrade">pyupgrade</a></li>
<li>Formatting with <a href="https://github.com/psf/black">black</a></li>
<li>Import sorting with <a href="https://github.com/PyCQA/isort">isort</a></li>
<li>Linting with <a href="https://github.com/PyCQA/flake8">flake8</a> and <a href="https://github.com/PyCQA/pylint/">pylint</a></li>
<li>Static typing with <a href="https://github.com/python/mypy">mypy</a></li>
<li>Testing with <a href="https://docs.pytest.org/en/stable/contents.html">pytest</a></li>
<li>Git hooks that run all the above with <a href="https://pre-commit.com/">pre-commit</a></li>
<li>Other integrations with external sites/services</li>
<li><a href="http://sphinx-doc.org/">Sphinx</a> docs serving with <a href="https://readthedocs.io/">ReadTheDocs</a> (+badge)</li>
<li>Coverage report with <a href="https://about.codecov.io/">Codecov</a> (+badge)</li>
<li>Monitoring dependency version updates with <a href="https://requires.io/">Requires.io</a> or <a href="https://pyup.io/">PyUp</a> (+badge)</li>
<li>Version bumping using <a href="https://github.com/c4urself/bump2version">bump2version</a></li>
<li>Dynamic versioning using <a href="https://github.com/mtkennerly/dunamai">dunamai</a></li>
<li>Command-line interface using <a href="https://click.palletsprojects.com/en/7.x/">Click</a></li>
</ul>
<p><a id="wboxx1cookiecutter-pypackage-poetry"></a></p>
<h3>wboxx1/cookiecutter-pypackage-poetry</h3>
<p><a href="https://github.com/wboxx1/cookiecutter-pypackage-poetry">https://github.com/wboxx1/cookiecutter-pypackage-poetry</a></p>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/wboxx1/cookiecutter-pypackage-poetry.svg?logo=github"></p>
<ul>
<li>Testing setup with <code>unittest</code> and <code>python setup.py test</code> or <code>pytest</code></li>
<li><a href="http://travis-ci.org/">Travis-CI</a>: Ready for Travis Continuous Integration testing</li>
<li><a href="http://appveyor.com/">Appveyor</a>: Ready for Appveyor Continuous Integration testing</li>
<li><a href="http://testrun.org/tox/">Tox</a> testing: Setup to easily test for Python 2.7, 3.4, 3.5, 3.6, 3.7</li>
<li><a href="http://sphinx-doc.org/">Sphinx</a> docs: Documentation ready for generation with, for example, <a href="https://readthedocs.io/">ReadTheDocs</a></li>
<li><a href="https://github.com/c4urself/bump2version">Bump2version</a>: Pre-configured version bumping with a single command</li>
<li>Auto-release to <a href="https://pypi.python.org/pypi">PyPI</a> when you push a new tag to master (optional)</li>
<li>Command-line interface using Click (optional)</li>
</ul>
<p><a id="cookiecutter-wrapper"></a></p>
<h2>cookiecutter wrapper</h2>
<p><a href="https://pypi.org/project/cookiecutter-poetry/">https://pypi.org/project/cookiecutter-poetry/</a></p>
<p><a id="tools-and-services-often-used-in-python-project-cookiecutters"></a></p>
<h2>Tools and services often used in python project cookiecutters</h2>
<ul>
<li><a href="https://cookiecutter.readthedocs.io/">Cookiecutter</a>: Command-line utility for creating project templates.</li>
<li><a href="https://python-poetry.org/">Poetry</a>: Package manager and build tool for Python projects.</li>
<li><a href="https://pre-commit.com/">Pre-commit</a>: Framework for managing and maintaining multi-language pre-commit hooks.</li>
<li><a href="https://github.com/psf/black">Black</a>: Opinionated code formatter for Python.</li>
<li><a href="https://tox.readthedocs.io/">Tox</a>: Generic virtualenv management and test command line tool.</li>
<li><a href="https://nox.thea.codes/">Nox</a>: Flexible test automation tool.</li>
<li><a href="https://github.com/charliermarsh/ruff">Ruff</a>: Fast linter and code-quality tool for Python projects.</li>
<li><a href="https://github.com/features/actions">GitHub Actions</a>: Continuous integration and continuous deployment service by GitHub.</li>
<li><a href="https://about.codecov.io/">Codecov</a>: Code coverage reporting tool.</li>
<li><a href="https://github.com/c4urself/bump2version">Bump2version</a>: Version-bumping utility for software projects.</li>
<li><a href="https://www.docker.com/">Docker</a>: Platform for building, shipping, and running applications in containers.</li>
<li><a href="http://www.sphinx-doc.org/">Sphinx</a>: Documentation generator for Python projects.</li>
<li><a href="https://readthedocs.org/">Read the Docs</a>: Hosting service for software documentation.</li>
<li><a href="https://github.com/release-drafter/release-drafter">Release Drafter</a>: Automated release notes generation tool.</li>
<li><a href="https://dependabot.com/">Dependabot</a>: Automated dependency updates tool.</li>
<li><a href="https://prettier.io/">Prettier</a>: Opinionated code formatter for JavaScript, TypeScript, CSS, Markdown, and other languages.</li>
<li><a href="https://docs.pytest.org/en/latest/">pytest</a>: Framework for writing and running Python tests.</li>
<li><a href="https://coverage.readthedocs.io/">Coverage.py</a>: Code coverage measurement tool for Python.</li>
<li><a href="https://github.com/agronholm/typeguard">Typeguard</a>: Runtime type checking for Python functions.</li>
<li><a href="https://github.com/PyCQA/bandit">Bandit</a>: Security linter for Python code.</li>
<li><a href="https://github.com/pyupio/safety">Safety</a>: Security dependency checker for Python packages.</li>
<li><a href="https://github.com/Erotemic/xdoctest">xdoctest</a>: Tool for running code examples in docstrings.</li>
<li><a href="https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html">autodoc</a>: Sphinx extension for automatic documentation generation from docstrings.</li>
<li><a href="https://www.sphinx-doc.org/en/master/usage/extensions/napoleon.html">napoleon</a>: Sphinx extension for NumPy and Google style docstrings.</li>
<li><a href="https://sphinx-click.readthedocs.io/">sphinx-click</a>: Sphinx extension for Click-based command-line interfaces.</li>
<li><a href="https://github.com/marketplace/actions/github-labeler">GitHub Labeler</a>: GitHub Action for managing project labels.</li>
</ul>Simplifying Data Download from Google Drive in Google Colab Using gdown2023-07-24T00:00:00+02:002023-07-24T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-24:/download-data-google-drive-colab-gdown/<h2>Introduction</h2>
<p>In this blog post, we will explore a straightforward method to download data from Google Drive into your Google Colab notebook using the 'gdown' command. Google Colab is a popular platform for running Python code, especially for machine learning and data …</p><h2>Introduction</h2>
<p>In this blog post, we will explore a straightforward method to download data from Google Drive into your Google Colab notebook using the 'gdown' command. Google Colab is a popular platform for running Python code, especially for machine learning and data analysis tasks. By leveraging 'gdown,' a handy Python library, you can seamlessly access your files stored on Google Drive without any hassle. Let's dive right into the process!</p>
<h2>Steps</h2>
<h3>Step 1: Import gdown and Authenticate Google Drive</h3>
<p>To begin, ensure you have 'gdown' installed in your Colab environment. If it isn't already available, you can install it with the following code snippet:</p>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="n">pip</span> <span class="n">install</span> <span class="n">gdown</span>
</code></pre></div>
<h3>Step 2: Obtain the File's Shareable Link</h3>
<p>To download data from your Google Drive, you must first ensure the file or folder is publicly accessible. To do this, right-click on the file or folder in your Google Drive, select "Get Shareable Link," and set the sharing settings to "Anyone with the link."</p>
<h3>Step 3: Retrieve the ID from the Shareable Link</h3>
<p>Upon obtaining the shareable link, extract the file's ID from the link. The ID is typically found after "<a href="https://drive.google.com/file/d/">https://drive.google.com/file/d/</a>". For instance, if your link is "<a href="https://drive.google.com/file/d/ABC12345XYZ/view">https://drive.google.com/file/d/ABC12345XYZ/view</a>," then "ABC12345XYZ" is the file's ID.</p>
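<p>The ID can also be extracted programmatically. A minimal sketch (the regex and the helper name are illustrative, not part of gdown):</p>

```python
import re

def extract_drive_id(share_link: str) -> str:
    """Extract the file ID from a Google Drive shareable link."""
    match = re.search(r"/file/d/([\w-]+)", share_link)
    if match is None:
        raise ValueError(f"No file ID found in: {share_link}")
    return match.group(1)

print(extract_drive_id("https://drive.google.com/file/d/ABC12345XYZ/view"))  # ABC12345XYZ
```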
<h3>Step 4: Download the Data</h3>
<p>Using the gdown command, you can now effortlessly download the data from your Google Drive into your Colab notebook. The following code demonstrates how to do this:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">gdown</span>
<span class="n">file_id</span> <span class="o">=</span> <span class="s2">"ABC12345XYZ"</span> <span class="c1"># Replace this with your file's ID</span>
<span class="n">output_file</span> <span class="o">=</span> <span class="s2">"data_file.ext"</span> <span class="c1"># Replace "data_file.ext" with the desired output filename and extension</span>
<span class="n">gdown</span><span class="o">.</span><span class="n">download</span><span class="p">(</span><span class="sa">f</span><span class="s2">"https://drive.google.com/uc?id=</span><span class="si">{</span><span class="n">file_id</span><span class="si">}</span><span class="s2">"</span><span class="p">,</span> <span class="n">output_file</span><span class="p">)</span>
</code></pre></div>
<h2>Conclusion</h2>
<p>In this brief guide, we have explored the process of downloading data from Google Drive into Google Colab using the 'gdown' command. By following these simple steps, you can seamlessly access and utilize your data for various machine learning, data analysis, or other Python-based projects in Google Colab. Happy coding!</p>Add VSCode to PATH2023-07-21T00:00:00+02:002023-07-21T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-21:/add-vscode-to-path/<p>If you get a <code>code: command not found</code> error even though VS Code is installed:</p>
<div class="highlight"><pre><span></span><code>><span class="w"> </span>code
zsh:<span class="w"> </span><span class="nb">command</span><span class="w"> </span>not<span class="w"> </span>found:<span class="w"> </span>code
</code></pre></div>
<p>it means that the <code>code</code> command is not in your system PATH. You need to add it.</p>
<p>To do that, follow these steps:</p>
<ol>
<li>
<p>Launch Visual …</p></li></ol><p>If you get a <code>code: command not found</code> error even though VS Code is installed:</p>
<div class="highlight"><pre><span></span><code>><span class="w"> </span>code
zsh:<span class="w"> </span><span class="nb">command</span><span class="w"> </span>not<span class="w"> </span>found:<span class="w"> </span>code
</code></pre></div>
<p>it means that the <code>code</code> command is not in your system PATH. You need to add it.</p>
<p>To do that, follow these steps:</p>
<ol>
<li>
<p>Launch Visual Studio Code.</p>
</li>
<li>
<p>Open the Command Palette by pressing <code>Cmd+Shift+P</code> (or <code>Ctrl+Shift+P</code> on Windows/Linux).</p>
</li>
<li>
<p>Type "shell command" in the Command Palette search bar.</p>
</li>
<li>
<p>You should see an option that says "Shell Command: Install 'code' command in PATH." Select it to add the <code>code</code> command to your system PATH.</p>
</li>
</ol>
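<p>If the Command Palette route isn't available, you can also add the launcher to PATH manually. On macOS the <code>code</code> script lives inside the app bundle (the path below assumes the default install location); add a line like this to your <code>~/.zshrc</code>:</p>

```shell
# Append VS Code's bin directory to PATH (default macOS install location)
export PATH="$PATH:/Applications/Visual Studio Code.app/Contents/Resources/app/bin"
```

<p>After reloading your shell (<code>source ~/.zshrc</code>), the <code>code</code> command should resolve.</p>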
<p>After completing these steps, you should be able to open Visual Studio Code directly from the terminal using the <code>code</code> command.</p>What is downstream task2023-07-21T00:00:00+02:002023-07-21T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-21:/what-is-downstream-task/<blockquote>
<p>In the context of data science and business, the term "downstream task" refers to a task or process that occurs after the completion of an initial or preceding task in a data pipeline or workflow. In this data flow, information is processed …</p></blockquote><blockquote>
<p>In the context of data science and business, the term "downstream task" refers to a task or process that occurs after the completion of an initial or preceding task in a data pipeline or workflow. In this data flow, information is processed and refined as it moves from one stage to another.</p>
</blockquote>
<p>To understand the concept better, let's consider a simplified data science workflow:</p>
<ol>
<li>
<p><strong>Data Collection</strong>: The first step is to gather and collect raw data from various sources, such as databases, APIs, or files.</p>
</li>
<li>
<p><strong>Data Preprocessing</strong>: Once the data is collected, it often needs to be cleaned, transformed, and structured in a way that makes it suitable for analysis. This step is known as data preprocessing.</p>
</li>
<li>
<p><strong>Feature Engineering</strong>: After preprocessing, relevant features (variables) are extracted from the data, and new features might be created to enhance the predictive power of the models.</p>
</li>
<li>
<p><strong>Model Training</strong>: With the prepared data, machine learning models are trained to make predictions or classifications based on patterns found in the data.</p>
</li>
<li>
<p><strong>Model Evaluation</strong>: After the models are trained, they need to be evaluated on a separate dataset to assess their performance and identify any issues such as overfitting or underfitting.</p>
</li>
</ol>
<p>Now, let's introduce the notion of "downstream tasks":</p>
<ol>
<li>
<p><strong>Model Deployment</strong>: Once the trained model(s) have been evaluated and deemed satisfactory, they are deployed into a production environment where they can be used to make predictions on new, unseen data.</p>
</li>
<li>
<p><strong>Decision Making</strong>: In a business context, the model's predictions are often used as inputs for making data-driven decisions. These decisions could be related to marketing strategies, customer segmentation, risk assessment, product recommendations, etc.</p>
</li>
<li>
<p><strong>Performance Monitoring</strong>: After the model has been deployed, its performance needs to be continually monitored to ensure that it maintains accuracy and relevance over time.</p>
</li>
<li>
<p><strong>Model Updating and Retraining</strong>: As new data becomes available and the model's performance deteriorates or needs improvement, it might be necessary to update or retrain the model to keep it up-to-date and accurate.</p>
</li>
</ol>
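<p>The relationship between upstream and downstream stages can be sketched as plain function composition, where each downstream step consumes the output of the one before it. The stage functions below are illustrative stubs, not a real pipeline:</p>

```python
def collect() -> list[dict]:
    """Upstream: gather raw records (stubbed data)."""
    return [{"age": 35, "bought": 1}, {"age": 22, "bought": 0}]

def preprocess(raw: list[dict]) -> list[dict]:
    """Upstream: clean/transform the raw records."""
    return [r for r in raw if r["age"] is not None]

def train(rows: list[dict]) -> float:
    """Upstream: the 'model' here is just the observed purchase rate."""
    return sum(r["bought"] for r in rows) / len(rows)

def decide(purchase_rate: float) -> str:
    """Downstream: a business decision driven by the model's output."""
    return "run campaign" if purchase_rate < 0.6 else "hold"

# Downstream value depends on every upstream stage having run first.
decision = decide(train(preprocess(collect())))
print(decision)  # purchase rate is 0.5 -> "run campaign"
```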
<p>In this workflow, <strong>"downstream tasks" are those that happen after the initial data preprocessing, model training, and evaluation stages. These tasks utilize the output of the earlier stages to make informed decisions and provide value to the business.</strong></p>Alternatives for Building Python CLI Apps2023-07-17T00:00:00+02:002023-07-17T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-17:/alternatives_for_building_python_cli_apps/<p>Discover the best tools and frameworks for building Python CLI apps. Explore Click, argparse, Typer, and more. Master the art of command-line application development.</p><p>Python provides several libraries and frameworks for building command-line interface (CLI) applications, each with its own set of features and advantages. In this article, we will explore some of the popular alternatives to build Python CLI apps, including Click, argparse, and Typer, among others.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#click">Click</a></li>
<li><a href="#argparse">argparse</a></li>
<li><a href="#typer">Typer</a></li>
<li><a href="#other-alternatives">Other Alternatives</a></li>
<li><a href="#fire">Fire</a></li>
<li><a href="#cement">cement</a></li>
<li><a href="#docopt">Docopt</a></li>
<li><a href="#plumbum">Plumbum</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="click"></a></p>
<h2>Click</h2>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/pallets/click.svg?logo=github"></p>
<p>Click is a powerful and widely used Python library for creating command-line interfaces. It focuses on simplicity and aims to make it easy to write and maintain CLI applications. Click provides a decorator-based approach for defining commands, options, and arguments, making it intuitive and straightforward to use. It supports complex command hierarchies, automatic help page generation, and customization options for output formatting. Click also offers advanced features such as context passing, callback handling, and parameter types. It has a large and active community, ensuring ongoing support and continuous development.</p>
<p>Click is an excellent choice for both simple and complex CLI applications. Its simplicity and intuitive API make it a great option for beginners, while its advanced features cater to more complex use cases. Whether you are building a small script or a full-fledged CLI tool, Click provides a solid foundation for developing robust and user-friendly applications.</p>
<p><strong>Stand-out Features:</strong></p>
<ul>
<li>Simple and intuitive API.</li>
<li>Decorator-based command definition.</li>
<li>Support for complex command hierarchies.</li>
<li>Automatic help page generation.</li>
<li>Advanced features like context passing and parameter types.</li>
</ul>
<p><strong>Use-case:</strong>
Click is suitable for a wide range of CLI applications, from small scripts to large-scale tools. It is a popular choice for building command-line interfaces in Python due to its simplicity, flexibility, and extensive feature set.</p>
<p>To learn more about Click, visit the <a href="https://click.palletsprojects.com/">official documentation</a> or explore the <a href="https://github.com/pallets/click">GitHub repository</a>.</p>
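<p>A minimal Click command, to give a feel for the decorator-based style described above (the command and option names are just for illustration):</p>

```python
import click

@click.command()
@click.option("--name", default="world", help="Who to greet.")
@click.option("--count", default=1, type=int, help="Number of greetings.")
def hello(name: str, count: int) -> None:
    """Greet NAME a total of COUNT times."""
    for _ in range(count):
        click.echo(f"Hello, {name}!")
```

<p>Saved as a script and invoked with <code>hello()</code> under a main guard, this would respond to e.g. <code>python hello.py --name Ada --count 2</code>, and <code>--help</code> output is generated automatically.</p>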
<p><a id="argparse"></a></p>
<h2>argparse</h2>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/python/cpython.svg?logo=github"></p>
<p>argparse is a standard library included in Python, making it readily available for CLI application development without any external dependencies. It provides a flexible and comprehensive framework for defining command-line arguments, options, and sub-commands. argparse supports automatic help generation, argument type checking, default values, and various customization options. It also handles error reporting and displays error messages with usage information. argparse's design promotes code reusability, making it easy to build CLI applications with modular components.</p>
<p>argparse is a versatile library suitable for a wide range of CLI applications. Its standard inclusion in Python ensures compatibility and ease of use, making it a popular choice for developers. Whether you are building a simple script or a complex application with multiple sub-commands, argparse provides a robust foundation for handling command-line arguments.</p>
<p><strong>Stand-out Features:</strong></p>
<ul>
<li>Standard library inclusion, no external dependencies.</li>
<li>Comprehensive framework for defining arguments and options.</li>
<li>Automatic help generation.</li>
<li>Error reporting and usage information display.</li>
<li>Code reusability and modular design.</li>
</ul>
<p><strong>Use-case:</strong>
argparse is well-suited for a variety of CLI applications, from basic scripts to more complex tools with sub-commands. Its standard library nature and comprehensive feature set make it a reliable choice for command-line argument handling in Python.</p>
<p>For detailed information about argparse, refer to the <a href="https://docs.python.org/3/library/argparse.html">official documentation</a> or explore the <a href="https://github.com/python/cpython">GitHub repository</a>.</p>
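<p>A small argparse sketch showing a positional argument and a flag (the argument names are illustrative; <code>parse_args</code> is given an explicit list here so the snippet is self-contained):</p>

```python
import argparse

# Build a parser with one positional argument and one boolean flag.
parser = argparse.ArgumentParser(prog="greet", description="Greeting tool")
parser.add_argument("name", help="Who to greet")
parser.add_argument("--shout", action="store_true", help="Uppercase the greeting")

args = parser.parse_args(["Ada", "--shout"])
greeting = f"Hello, {args.name}!"
if args.shout:
    greeting = greeting.upper()
print(greeting)  # HELLO, ADA!
```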
<p><a id="typer"></a></p>
<h2>Typer</h2>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/tiangolo/typer.svg?logo=github"></p>
<p>Typer is a modern, fast, and efficient CLI framework built on top of Click. It offers a simple and concise API for building command-line interfaces in Python, with an emphasis on code readability and type hints. Typer automatically infers the types of arguments and options from their default values or annotations, reducing the need for boilerplate code. It provides features such as automatic help generation, completion generation for shells, and support for asynchronous execution.</p>
<p>Typer's simplicity and seamless integration with Click make it an appealing choice for developers who prioritize code clarity and conciseness. It leverages Python's type hints to improve developer productivity and reduce the likelihood of runtime errors. With its performance optimizations, Typer can handle large CLI applications efficiently.</p>
<p><strong>Stand-out Features:</strong></p>
<ul>
<li>Simple and concise API with emphasis on code readability.</li>
<li>Automatic type inference from default values or annotations.</li>
<li>Automatic help and completion generation.</li>
<li>Asynchronous execution support.</li>
<li>Performance optimizations for handling large applications.</li>
</ul>
<p><strong>Use-case:</strong>
Typer is particularly well-suited for developers who value code readability and conciseness. It is a great choice for building CLI applications of any size, ranging from small scripts to complex tools, with a focus on leveraging Python's type hints.</p>
<p>To learn more about Typer, refer to the <a href="https://typer.tiangolo.com/">official documentation</a> or explore the <a href="https://github.com/tiangolo/typer">GitHub repository</a>.</p>
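<p>The same greeting command in Typer style: parameters with defaults become CLI options, and their type hints drive parsing. A sketch (names are illustrative):</p>

```python
import typer

app = typer.Typer()

@app.command()
def hello(name: str = "world", count: int = 1) -> None:
    """Greet NAME a total of COUNT times."""
    for _ in range(count):
        typer.echo(f"Hello, {name}!")
```

<p>With a single registered command, Typer exposes it as the top-level interface, so <code>--name</code> and <code>--count</code> work directly without a subcommand name.</p>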
<p><a id="fire"></a></p>
<h2>Fire</h2>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/google/python-fire.svg?logo=github"></p>
<p>Fire is a library developed by Google that automatically generates a command-line interface from Python objects. It allows you to turn any Python class or module into a CLI application without the need for explicit command definitions. Fire uses introspection to infer the available methods and attributes of an object, which are then exposed as CLI commands and arguments. This automatic generation of the CLI interface makes Fire incredibly convenient for quickly building command-line tools from existing code.</p>
<p>Fire's simplicity and automatic CLI generation make it an excellent choice for rapidly prototyping CLI applications. It eliminates the need for manually defining command structures and allows you to focus on the core functionality of your Python objects. While it may not offer the same level of customization as some other libraries, Fire excels in its ability to generate a functional CLI interface with minimal effort.</p>
<p><strong>Stand-out Features:</strong></p>
<ul>
<li>Automatic CLI generation from Python objects.</li>
<li>No explicit command definitions required.</li>
<li>Rapid prototyping of CLI applications.</li>
<li>Eliminates the need for manual command structure definitions.</li>
</ul>
<p><strong>Use-case:</strong>
Fire is best suited for quickly creating simple CLI tools based on existing Python code. It is ideal for situations where you want to expose the functionality of your Python objects through a command-line interface without the need for explicit command definitions.</p>
<p>To learn more about Fire, refer to the <a href="https://google.github.io/python-fire/">official documentation</a> or explore the <a href="https://github.com/google/python-fire">GitHub repository</a>.</p>
<p><a id="cement"></a></p>
<h2>cement</h2>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/datafolklabs/cement.svg?logo=github"></p>
<p>cement is a powerful and extensible CLI framework for Python. It provides a complete set of features for building CLI applications, including command-line argument parsing, command line completion, output rendering, and plugin support. cement follows a modular design, allowing you to choose and configure only the components you need for your application. It offers support for both single-command and multi-command applications, making it versatile and adaptable to various use cases.</p>
<p>One of the standout features of cement is its plugin architecture, which enables easy integration of third-party functionality into your CLI application. It also provides a powerful and customizable output handler system, allowing you to define how the application's output is rendered and formatted. cement's extensive documentation and active community make it a reliable choice for developing robust CLI applications.</p>
<p><strong>Stand-out Features:</strong></p>
<ul>
<li>Comprehensive CLI framework with modular design.</li>
<li>Command-line argument parsing.</li>
<li>Command line completion.</li>
<li>Customizable output rendering.</li>
<li>Plugin architecture for easy integration of third-party functionality.</li>
</ul>
<p><strong>Use-case:</strong>
cement is suitable for building CLI applications of any complexity. Its modular design and extensive feature set make it an excellent choice for projects that require advanced customization, plugin support, and flexible output rendering.</p>
<p>For detailed information about cement, refer to the <a href="https://builtoncement.com/">official documentation</a> or explore the <a href="https://github.com/datafolklabs/cement">GitHub repository</a>.</p>
<p><a id="docopt"></a></p>
<h2>Docopt</h2>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/docopt/docopt.svg?logo=github"></p>
<p>Docopt is a command-line interface description language and Python library that generates a CLI parser from human-readable usage patterns. It allows you to define the command-line interface by writing usage patterns and associated descriptions. Docopt then automatically generates a parser based on these patterns, handling argument parsing and help generation.</p>
<p>The simplicity and readability of Docopt's usage patterns make it a unique and user-friendly approach to building CLI applications. By using natural language to describe the command-line interface, Docopt simplifies the process of defining and maintaining CLI specifications. It supports both positional arguments and options and provides support for complex command hierarchies.</p>
<p>Docopt is an excellent choice for projects where a human-readable and self-documenting CLI interface is a priority. It allows developers to focus on writing clear usage patterns while leaving the parsing and help generation to the library.</p>
<p><strong>Stand-out Features:</strong></p>
<ul>
<li>Command-line interface description language.</li>
<li>Automatic parser generation from human-readable usage patterns.</li>
<li>Simplifies the process of defining and maintaining CLI specifications.</li>
<li>Support for positional arguments and options.</li>
<li>Natural language approach for clear usage patterns.</li>
</ul>
<p><strong>Use-case:</strong>
Docopt is best suited for projects where a human-readable and self-documenting CLI interface is desired. It is a good choice for developers who prefer a more descriptive and expressive way of defining the command-line interface.</p>
<p>For more information about Docopt, refer to the <a href="http://docopt.org/">official documentation</a> or explore the <a href="https://github.com/docopt/docopt">GitHub repository</a>.</p>
<p><a id="plumbum"></a></p>
<h2>Plumbum</h2>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/tomerfiliba/plumbum.svg?logo=github"></p>
<p>Plumbum is a library that aims to simplify the process of writing shell-like scripts and command-line tools in Python. It provides an intuitive and concise API for executing shell commands, capturing their output, and handling command-line arguments. Plumbum allows you to seamlessly mix shell-like syntax and Python code, providing a powerful and flexible approach to command-line application development.</p>
<p>One of Plumbum's standout features is its ability to create reusable command templates. These templates encapsulate the common functionality of a command, allowing you to easily define and reuse complex command structures. Plumbum also offers support for input/output redirection, background execution, and shell pipeline operations.</p>
<p>Plumbum is an excellent choice for developers who want to combine the power of shell commands with the flexibility and expressiveness of Python. It simplifies the process of interacting with the command line and enables the creation of robust and maintainable CLI applications.</p>
<p><strong>Stand-out Features:</strong></p>
<ul>
<li>Intuitive and concise API for executing shell commands.</li>
<li>Seamless integration of shell-like syntax and Python code.</li>
<li>Reusable command templates for defining complex command structures.</li>
<li>Support for input/output redirection, background execution, and shell pipelines.</li>
</ul>
<p><strong>Use-case:</strong>
Plumbum is suitable for developers who want to leverage the power of shell commands while maintaining the flexibility and expressiveness of Python. It is a good choice for building command-line applications that require extensive interaction with the command line and complex command structures.</p>
<p>To learn more about Plumbum, refer to the <a href="https://plumbum.readthedocs.io/">official documentation</a> or explore the <a href="https://github.com/tomerfiliba/plumbum">GitHub repository</a>.</p>
<h2>Which Tool Should I Use in My Case?</h2>
<p>When choosing a tool for building Python CLI apps, it's important to consider the specific requirements of your project. Different tools excel in different scenarios. Here, we'll discuss three common use-cases with divergent requirements and suggest the best tools for each case along with justifications.</p>
<h3>1. Simple Script or Rapid Prototyping</h3>
<p>If you're building a simple script or need to rapidly prototype a CLI application, <strong>Click</strong> and <strong>Fire</strong> are excellent choices.</p>
<p><strong>Click</strong> offers a simple and intuitive API with decorator-based command definition, making it easy to create CLI apps quickly. It provides advanced features like context passing and parameter types, which can enhance the functionality of your script. Additionally, Click's extensive documentation and active community support make it a reliable choice.</p>
<p><strong>Fire</strong> is perfect for converting existing Python code into a CLI application effortlessly. With Fire, you can generate a command-line interface from any Python object without explicit command definitions. It prioritizes simplicity and allows you to focus on the core functionality of your code, making it ideal for rapid prototyping.</p>
<h3>2. Complex CLI Application with Advanced Customization</h3>
<p>For complex CLI applications that require advanced customization, <strong>argparse</strong> and <strong>cement</strong> are robust options.</p>
<p><strong>argparse</strong> is a Python standard library, providing a comprehensive framework for defining command-line arguments, options, and sub-commands. It supports automatic help generation, type checking, and error reporting. argparse's modular design promotes code reusability and is suitable for projects with multiple sub-commands and extensive customization requirements.</p>
<p><strong>cement</strong> is a powerful CLI framework that offers a complete set of features, including argument parsing, command line completion, output rendering, and plugin support. It follows a modular design, allowing you to choose the components you need. cement's plugin architecture enables easy integration of third-party functionality, and its customizable output rendering system provides flexibility.</p>
<h3>3. Human-Readable CLI Interface</h3>
<p>If you prioritize a human-readable and self-documenting CLI interface, consider <strong>Typer</strong> and <strong>Docopt</strong>.</p>
<p><strong>Typer</strong> is a modern CLI framework built on top of Click, emphasizing code readability and type hints. It automatically infers argument types, reducing boilerplate code. Typer's simplicity and integration with Python's type hints make it an appealing choice for developers who value code clarity.</p>
<p><strong>Docopt</strong> takes a unique approach, allowing you to define the command-line interface using human-readable usage patterns. It automatically generates a parser based on these patterns, handling argument parsing and help generation. Docopt's natural language approach simplifies the process of defining and maintaining CLI specifications, resulting in a clear and readable CLI interface.</p>Creating a PowerPoint Presentation with a Language Model2023-07-17T00:00:00+02:002023-07-17T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-17:/creating-a-powerpoint-presentation-with-a-language-model/<p>In this article, we'll explore how to generate a PowerPoint presentation using the OpenAI Azure API and provide additional advanced features to enhance the process.</p>
<h2>Prerequisites</h2>
<p>Before we begin, make sure you have the following prerequisites set up:</p>
<ul>
<li>Python 3.x installed …</li></ul><p>In this article, we'll explore how to generate a PowerPoint presentation using the OpenAI Azure API and provide additional advanced features to enhance the process.</p>
<h2>Prerequisites</h2>
<p>Before we begin, make sure you have the following prerequisites set up:</p>
<ul>
<li>Python 3.x installed on your machine</li>
<li>OpenAI API key</li>
<li>Required Python libraries: <code>python-pptx</code> and <code>openai</code></li>
</ul>
<p>You can install the libraries using the <code>pip</code> package manager:</p>
<div class="highlight"><pre><span></span><code>pip<span class="w"> </span>install<span class="w"> </span>python-pptx<span class="w"> </span>openai
</code></pre></div>
<h2>Step 1: Setting up the OpenAI API</h2>
<p>To get started, you'll need to sign up for the OpenAI API and obtain an API key. The API key allows you to interact with the GPT model. Follow the instructions in the OpenAI documentation to sign up and retrieve your API key.</p>
<h2>Step 2: Importing the Required Modules</h2>
<p>To work with PowerPoint and the OpenAI API, we need to import the necessary modules in our Python script. Specifically, we'll import the <code>Presentation</code> class from the <code>python-pptx</code> library and the <code>openai</code> module.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">pptx</span> <span class="kn">import</span> <span class="n">Presentation</span>
<span class="kn">import</span> <span class="nn">openai</span>
</code></pre></div>
<h2>Step 3: Authenticating with the OpenAI API</h2>
<p>Next, we need to authenticate with the OpenAI API by providing our API key. This step ensures that we have the necessary permissions to access and utilize the GPT model.</p>
<div class="highlight"><pre><span></span><code><span class="n">openai</span><span class="o">.</span><span class="n">api_key</span> <span class="o">=</span> <span class="s1">'YOUR_API_KEY'</span>
</code></pre></div>
<p>Replace <code>'YOUR_API_KEY'</code> with the API key you obtained in Step 1.</p>
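<p>Rather than hard-coding the key in the script, it is safer to read it from an environment variable. A sketch (the <code>setdefault</code> placeholder is only so the snippet runs standalone; in practice, export <code>OPENAI_API_KEY</code> in your shell or Colab secrets and delete that line):</p>

```python
import os

# Placeholder so the snippet is self-contained; normally the variable
# is already exported in your environment.
os.environ.setdefault("OPENAI_API_KEY", "sk-your-key-here")
api_key = os.environ["OPENAI_API_KEY"]
# openai.api_key = api_key  # then assign it as shown above
```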
<h2>Step 4: Generating the Presentation Outline with ChatGPT</h2>
<p>With the necessary setup complete, we can now use the ChatGPT model to generate an outline for our PowerPoint presentation. We'll provide a description of the presentation as input and receive a list of slides as output. The slides will form the basis of our presentation structure.</p>
<div class="highlight"><pre><span></span><code><span class="n">description</span> <span class="o">=</span> <span class="s2">"This presentation is about the benefits of exercise."</span>
<span class="n">outline_prompt</span> <span class="o">=</span> <span class="s2">"Create an outline for this presentation, one slide title per line. "</span> <span class="o">+</span> <span class="n">description</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">openai</span><span class="o">.</span><span class="n">Completion</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
    <span class="n">engine</span><span class="o">=</span><span class="s2">"text-davinci-003"</span><span class="p">,</span>
    <span class="n">prompt</span><span class="o">=</span><span class="n">outline_prompt</span><span class="p">,</span>
    <span class="n">max_tokens</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span>
    <span class="n">n</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="c1"># one completion containing the whole outline</span>
    <span class="n">stop</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span>
    <span class="n">temperature</span><span class="o">=</span><span class="mf">0.7</span>
<span class="p">)</span>
<span class="n">slides</span> <span class="o">=</span> <span class="n">response</span><span class="o">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
</code></pre></div>
<p>In this example, the <code>description</code> variable is folded into a prompt that asks the model for one slide title per line. The <code>max_tokens</code> parameter limits the response length, and <code>n=1</code> requests a single completion; the number of slides is determined by the outline the model returns (you can also state the desired slide count in the prompt). Feel free to adjust these parameters based on your specific needs.</p>
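<p>The raw completion often comes back with list numbering or bullet markers ("1. Introduction", "- Summary"). A small standard-library helper (hypothetical, not part of the original script) can normalize the lines into clean slide titles before the next step:</p>

```python
import re

def normalize_titles(raw_lines):
    """Strip leading numbering/bullets and drop blank lines from model output."""
    titles = []
    for line in raw_lines:
        cleaned = re.sub(r'^\s*(?:[-*]|\d+[.)])\s*', '', line).strip()
        if cleaned:
            titles.append(cleaned)
    return titles

print(normalize_titles(["1. Introduction", "2) Benefits", "", "- Summary"]))
# → ['Introduction', 'Benefits', 'Summary']
```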
<h2>Step 5: Generating Content for Each Slide</h2>
<p>To make our presentation informative, we'll use the ChatGPT model to generate body content for each slide in the outline. We iterate through the <code>slides</code> list and generate content for each slide title.</p>
<div class="highlight"><pre><span></span><code><span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">slide</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">slides</span><span class="p">):</span>
    <span class="n">response</span> <span class="o">=</span> <span class="n">openai</span><span class="o">.</span><span class="n">Completion</span><span class="o">.</span><span class="n">create</span><span class="p">(</span>
        <span class="n">engine</span><span class="o">=</span><span class="s2">"text-davinci-003"</span><span class="p">,</span>
        <span class="n">prompt</span><span class="o">=</span><span class="n">slide</span><span class="p">,</span>
        <span class="n">max_tokens</span><span class="o">=</span><span class="mi">150</span><span class="p">,</span>
        <span class="n">n</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
        <span class="n">stop</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span>
        <span class="n">temperature</span><span class="o">=</span><span class="mf">0.7</span>
    <span class="p">)</span>
    <span class="n">content</span> <span class="o">=</span> <span class="n">response</span><span class="o">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">text</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
    <span class="c1"># Store the title and generated content for the slide</span>
    <span class="n">slides</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'title'</span><span class="p">:</span> <span class="n">slide</span><span class="p">,</span> <span class="s1">'content'</span><span class="p">:</span> <span class="n">content</span><span class="p">}</span>
</code></pre></div>
<p>Here, we iterate through each slide in the <code>slides</code> list, generate the content using the ChatGPT model, and store the title and content in a dictionary. Adjust the <code>max_tokens</code> parameter based on the desired length of each slide's content.</p>
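<p>Because this loop makes one API call per slide, transient failures (rate limits, timeouts) become more likely. A simple retry wrapper with exponential backoff (a hedged sketch; the helper and its defaults are my own, not part of the OpenAI library) keeps the loop robust:</p>

```python
import time

def with_retries(call, attempts=3, base_delay=1.0):
    """Invoke a zero-argument callable, retrying with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the last error
            time.sleep(base_delay * (2 ** attempt))

# Usage inside the loop (sketch):
# response = with_retries(lambda: openai.Completion.create(
#     engine="text-davinci-003", prompt=slide, max_tokens=150))
```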
<h2>Step 6: Creating the PowerPoint Presentation</h2>
<p>With the slide titles and content generated, it's time to create the PowerPoint presentation using the <code>python-pptx</code> library. We'll iterate through the slides and add them to the presentation with the appropriate titles and content.</p>
<div class="highlight"><pre><span></span><code><span class="n">presentation</span> <span class="o">=</span> <span class="n">Presentation</span><span class="p">()</span>
<span class="k">for</span> <span class="n">slide_data</span> <span class="ow">in</span> <span class="n">slides</span><span class="p">:</span>
    <span class="n">slide_layout</span> <span class="o">=</span> <span class="n">presentation</span><span class="o">.</span><span class="n">slide_layouts</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="c1"># Choose the layout for the slide</span>
    <span class="n">slide</span> <span class="o">=</span> <span class="n">presentation</span><span class="o">.</span><span class="n">slides</span><span class="o">.</span><span class="n">add_slide</span><span class="p">(</span><span class="n">slide_layout</span><span class="p">)</span>
    <span class="n">slide</span><span class="o">.</span><span class="n">shapes</span><span class="o">.</span><span class="n">title</span><span class="o">.</span><span class="n">text</span> <span class="o">=</span> <span class="n">slide_data</span><span class="p">[</span><span class="s1">'title'</span><span class="p">]</span>
    <span class="n">slide</span><span class="o">.</span><span class="n">placeholders</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">text</span> <span class="o">=</span> <span class="n">slide_data</span><span class="p">[</span><span class="s1">'content'</span><span class="p">]</span>
<span class="n">presentation</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s2">"generated_presentation.pptx"</span><span class="p">)</span>
</code></pre></div>
<p>In this example, we create a new slide for each item in the <code>slides</code> list. We set the title and content for each slide and save the presentation as a PowerPoint file named "generated_presentation.pptx". You can adjust the slide layout by choosing a different index from the <code>slide_layouts</code> list.</p>
<h2>Possible Next Features for the Presentation Generation Script</h2>
<p>While the script we've created is already capable of generating PowerPoint presentations, we can enhance it further with additional features. Here are a few possible next steps to consider:</p>
<ol>
<li>
<p><strong>Slide Customization</strong>: Allow users to specify different slide layouts, fonts, colors, and background images to customize the visual appearance of their presentation.</p>
</li>
<li>
<p><strong>Image Integration</strong>: Extend the script to generate slides with images. This can involve using AI models to automatically search and retrieve relevant images based on the content of each slide.</p>
</li>
<li>
<p><strong>Interactive Presentations</strong>: Utilize technologies like Jupyter Notebook or web-based frameworks to create interactive presentations that allow viewers to engage with the content dynamically.</p>
</li>
<li>
<p><strong>Natural Language Processing</strong>: Incorporate natural language processing techniques to analyze the generated content and provide suggestions for improvements, such as grammar corrections, more concise wording, or alternative phrasing.</p>
</li>
</ol>
<p>By implementing these features, the presentation generation script can become more versatile and provide a richer experience for users.</p>
<h2>Alternative approach: let the LLM generate a Visual Basic script</h2>
<p>In this article we use Python to generate the slides. Alternatively, you can ask the model (ChatGPT) for a VBA (Visual Basic for Applications) script that builds the presentation for you. You can learn this approach from the video: <a href="https://www.youtube.com/watch?v=JoedhPPi3O0">Create Beautiful PowerPoint Slides with ChatGPT + VBA: Quick Tip! - YouTube</a></p>
<h2>Conclusion</h2>
<p>In this article, we've explored how to create a PowerPoint presentation using a language model, specifically OpenAI's GPT model accessed through the OpenAI API. We've covered the steps from setting up the OpenAI API to generating an outline and filling the slides with content. Additionally, we discussed possible next features to enhance the script, such as slide customization, image integration, interactive presentations, and natural language processing. By expanding upon these features, you can create powerful presentation automation tools tailored to your specific needs.</p>
<p>Automating presentation generation not only saves time and effort but also opens up new possibilities for creating engaging and informative presentations. With the help of AI and language models, we can revolutionize the way presentations are created, allowing presenters to focus more on refining their ideas and delivering impactful content.</p>Time Travel in Git - Creating a Branch from the Past and Crafting a New Future2023-07-14T00:00:00+02:002023-07-14T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-14:/time-travel-in-git-creating-a-branch-from-the-past-and-crafting-a-new future/<h2>Introduction</h2>
<p>In this guide, we will learn how to create a new branch in a Git repository based on a previous commit. We have commit history as below.
<img alt="before" src="images/git_time_travel/git-time-travel-1.png"></p>
<!--
<div class="highlight"><pre><span></span><code>gitGraph
commit id: "A"
commit id: "B"
commit id: "C"
commit id: "D"
commit id: "E"
</code></pre></div>
-->
<p>We are not happy with the changes C, D and E. We would like …</p><h2>Introduction</h2>
<p>In this guide, we will learn how to create a new branch in a Git repository based on a previous commit. We have commit history as below.
<img alt="before" src="images/git_time_travel/git-time-travel-1.png"></p>
<!--
<div class="highlight"><pre><span></span><code>gitGraph
commit id: "A"
commit id: "B"
commit id: "C"
commit id: "D"
commit id: "E"
</code></pre></div>
-->
<p>We are not happy with the changes C, D and E. We would like to start again from B, but we want to keep changes C, D and E in a new branch. Specifically, we will create a new branch starting from commit B in the main branch. We'll move the subsequent commits C, D, and E to the new branch and continue working on the main branch from the state of commit B - new commits F and G.
<img alt="after" src="images/git_time_travel/git-time-travel-2.png"></p>
<!--
<div class="highlight"><pre><span></span><code>gitGraph
commit id: "A"
commit id: "B"
branch feature-1
commit id: "C"
commit id: "D"
commit id: "E"
checkout main
commit id: "F"
commit id: "G"
</code></pre></div>
-->
<p>This guide assumes you have a basic understanding of Git commands and are familiar with the command line interface.</p>
<h2>Step-by-Step Guide</h2>
<h3>Determine the current branch and commit</h3>
<p>Open the terminal and navigate to the Git repository where you want to perform this operation. Use the following commands to check the current branch and to find the hash of commit B, which you will need later:</p>
<div class="highlight"><pre><span></span><code>git<span class="w"> </span>status
git<span class="w"> </span>log<span class="w"> </span>--oneline
</code></pre></div>
<h3>Create a new branch that keeps commits C, D, and E</h3>
<p>The simplest way to preserve commits C, D, and E is to create the new branch at the current tip of the main branch (commit E):</p>
<div class="highlight"><pre><span></span><code>git<span class="w"> </span>branch<span class="w"> </span>new-branch-name
</code></pre></div>
<p>Replace <code>new-branch-name</code> with the desired name for your new branch. Because the branch is created at the current tip, it already contains commits C, D, and E. This command creates the branch without switching to it.</p>
<h3>Reset the main branch to commit B</h3>
<p>With commits C, D, and E safely referenced by the new branch, move the main branch pointer back to commit B:</p>
<div class="highlight"><pre><span></span><code>git<span class="w"> </span>checkout<span class="w"> </span>main
git<span class="w"> </span>reset<span class="w"> </span>--hard<span class="w"> </span>commit-B-hash
</code></pre></div>
<p>Replace <code>commit-B-hash</code> with the hash or unique identifier of commit B. Be careful: <code>git reset --hard</code> discards any uncommitted changes in your working directory. If the main branch has already been pushed, you will also need to force-push (for example with <code>git push --force-with-lease</code>) and coordinate with your collaborators.</p>
<p>Your working directory is now on the main branch, at the state of commit B, while commits C, D, and E live on the new branch.</p>
<h3>Make changes to the main branch based on commit B</h3>
<p>You are now on the main branch, as it was at commit B. Make the necessary changes or improvements.</p>
<h3>Commit the changes on the main branch</h3>
<p>Stage your changes using the following command:</p>
<div class="highlight"><pre><span></span><code>git<span class="w"> </span>add<span class="w"> </span>.
</code></pre></div>
<p>Commit the changes with a descriptive message using the following command:</p>
<div class="highlight"><pre><span></span><code>git<span class="w"> </span>commit<span class="w"> </span>-m<span class="w"> </span><span class="s2">"Describe your changes or improvements"</span>
</code></pre></div>
<h3>Continue development on the main branch</h3>
<p>At this point, you can continue making new commits on the main branch, just as you would in any normal development workflow.</p>
<blockquote>
<p><strong>NOTE</strong>: committing directly to the main branch is generally not considered a best practice; you can learn more about this from the various Git branching strategies. We use this schema here for the sake of simplicity.</p>
</blockquote>
<h2>Conclusion</h2>
<p>Congratulations! You have successfully created a new branch that keeps commits C, D, and E, and reset the main branch to its state at commit B, allowing you to continue development from that point. Remember to use history-rewriting Git commands with caution, and make sure to create backups or push your changes to a remote repository for safety.</p>
<h2>Why Use Temporary Files and Directories?</h2>
<p>Temporary files and directories are essential when you need to store intermediate results, cache data, or hold information during the execution of a program. They can help you minimize memory usage and improve performance by reducing the need to recompute expensive operations. Moreover, temporary files can be useful in scenarios like <a href="https://en.wikipedia.org/wiki/Unit_testing">unit testing</a>, where you need to create mock files and directories for testing purposes.</p>
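<p>As a minimal illustration of the unit-testing use-case, a test can build its fixture files inside a temporary directory that is cleaned up automatically (a sketch using only the standard library; <code>count_lines</code> is a made-up function under test):</p>

```python
import os
import tempfile
import unittest

def count_lines(path):
    """Count the lines in a text file."""
    with open(path) as f:
        return sum(1 for _ in f)

class CountLinesTest(unittest.TestCase):
    def test_counts_lines_in_fixture(self):
        # The fixture file lives in a throwaway directory that is
        # removed automatically when the context manager exits.
        with tempfile.TemporaryDirectory() as tmp_dir:
            fixture = os.path.join(tmp_dir, "sample.txt")
            with open(fixture, "w") as f:
                f.write("one\ntwo\nthree\n")
            self.assertEqual(count_lines(fixture), 3)
```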
<h2>Creating Temporary Files</h2>
<p>The <code>tempfile</code> module provides several functions to create temporary files, including <code>TemporaryFile</code>, <code>NamedTemporaryFile</code>, and <code>SpooledTemporaryFile</code>.</p>
<h3>TemporaryFile</h3>
<p>The <a href="https://docs.python.org/3/library/tempfile.html#tempfile.TemporaryFile"><code>TemporaryFile</code></a> function creates an anonymous temporary file that is deleted when it is closed. This function returns a file-like object that can be used with Python's standard I/O operations:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">tempfile</span>
<span class="k">with</span> <span class="n">tempfile</span><span class="o">.</span><span class="n">TemporaryFile</span><span class="p">()</span> <span class="k">as</span> <span class="n">temp_file</span><span class="p">:</span>
    <span class="n">temp_file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="sa">b</span><span class="s1">'This is a temporary file.'</span><span class="p">)</span>
    <span class="n">temp_file</span><span class="o">.</span><span class="n">seek</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">temp_file</span><span class="o">.</span><span class="n">read</span><span class="p">())</span>
</code></pre></div>
<h3>NamedTemporaryFile</h3>
<p>The <a href="https://docs.python.org/3/library/tempfile.html#tempfile.NamedTemporaryFile"><code>NamedTemporaryFile</code></a> function is similar to <code>TemporaryFile</code>, but the file has a visible name in the file system. The file is deleted when it is closed:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">tempfile</span>
<span class="k">with</span> <span class="n">tempfile</span><span class="o">.</span><span class="n">NamedTemporaryFile</span><span class="p">()</span> <span class="k">as</span> <span class="n">temp_file</span><span class="p">:</span>
    <span class="n">temp_file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="sa">b</span><span class="s1">'This is a named temporary file.'</span><span class="p">)</span>
    <span class="n">temp_file</span><span class="o">.</span><span class="n">seek</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">temp_file</span><span class="o">.</span><span class="n">read</span><span class="p">())</span>
</code></pre></div>
<h3>SpooledTemporaryFile</h3>
<p>The <a href="https://docs.python.org/3/library/tempfile.html#tempfile.SpooledTemporaryFile"><code>SpooledTemporaryFile</code></a> function creates a temporary file that is stored in memory (using <code>io.BytesIO</code> or <code>io.StringIO</code>) until it reaches a specified size. Once the size is exceeded, the data is automatically written to disk:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">tempfile</span>
<span class="k">with</span> <span class="n">tempfile</span><span class="o">.</span><span class="n">SpooledTemporaryFile</span><span class="p">(</span><span class="n">max_size</span><span class="o">=</span><span class="mi">1024</span><span class="p">)</span> <span class="k">as</span> <span class="n">temp_file</span><span class="p">:</span>
    <span class="n">temp_file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="sa">b</span><span class="s1">'This is a spooled temporary file.'</span><span class="p">)</span>
    <span class="n">temp_file</span><span class="o">.</span><span class="n">seek</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">temp_file</span><span class="o">.</span><span class="n">read</span><span class="p">())</span>
</code></pre></div>
<h2>Creating Temporary Directories</h2>
<p>The <code>tempfile</code> module provides the <a href="https://docs.python.org/3/library/tempfile.html#tempfile.TemporaryDirectory"><code>TemporaryDirectory</code></a> function to create temporary directories. These directories, along with their contents, are automatically deleted when the context manager exits:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">tempfile</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="k">with</span> <span class="n">tempfile</span><span class="o">.</span><span class="n">TemporaryDirectory</span><span class="p">()</span> <span class="k">as</span> <span class="n">temp_dir</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'Temporary directory: </span><span class="si">{</span><span class="n">temp_dir</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
    <span class="n">temp_file_path</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">temp_dir</span><span class="p">,</span> <span class="s1">'temp_file.txt'</span><span class="p">)</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">temp_file_path</span><span class="p">,</span> <span class="s1">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">temp_file</span><span class="p">:</span>
        <span class="n">temp_file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s1">'This file is inside the temporary directory.'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Temporary directory and file have been deleted.'</span><span class="p">)</span>
</code></pre></div>
<h2>Customizing Temporary File and Directory Names</h2>
<p>You can customize the names of temporary files and directories using the <code>prefix</code>, <code>suffix</code>, and <code>dir</code> arguments. For example:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">tempfile</span>
<span class="k">with</span> <span class="n">tempfile</span><span class="o">.</span><span class="n">NamedTemporaryFile</span><span class="p">(</span><span class="n">prefix</span><span class="o">=</span><span class="s1">'my_temp_'</span><span class="p">,</span> <span class="n">suffix</span><span class="o">=</span><span class="s1">'.txt'</span><span class="p">,</span> <span class="nb">dir</span><span class="o">=</span><span class="s1">'/tmp'</span><span class="p">)</span> <span class="k">as</span> <span class="n">temp_file</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'Temporary file path: </span><span class="si">{</span><span class="n">temp_file</span><span class="o">.</span><span class="n">name</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
</code></pre></div>
<h2>Managing File and Directory Lifetimes</h2>
<p>By default, temporary files and directories are deleted when their corresponding file-like objects are closed. However, you can use the <code>delete</code> argument to control this behavior:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">tempfile</span>
<span class="k">with</span> <span class="n">tempfile</span><span class="o">.</span><span class="n">NamedTemporaryFile</span><span class="p">(</span><span class="n">delete</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">temp_file</span><span class="p">:</span>
    <span class="n">temp_file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="sa">b</span><span class="s1">'This temporary file will not be deleted.'</span><span class="p">)</span>
    <span class="n">temp_file_path</span> <span class="o">=</span> <span class="n">temp_file</span><span class="o">.</span><span class="n">name</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">temp_file_path</span><span class="p">,</span> <span class="s1">'rb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">temp_file</span><span class="p">:</span>
    <span class="nb">print</span><span class="p">(</span><span class="n">temp_file</span><span class="o">.</span><span class="n">read</span><span class="p">())</span>
</code></pre></div>
<h2>Securely Creating Files and Directories with Unique Names</h2>
<p>The <code>tempfile</code> module also provides the lower-level <a href="https://docs.python.org/3/library/tempfile.html#tempfile.mkstemp"><code>mkstemp</code></a> and <a href="https://docs.python.org/3/library/tempfile.html#tempfile.mkdtemp"><code>mkdtemp</code></a> functions, which securely create a temporary file or directory with a randomly generated, unique name and return its path. Unlike the functions above, they perform no automatic cleanup, so you are responsible for deleting the file or directory yourself:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">tempfile</span>
<span class="c1"># mkstemp returns an OS-level file descriptor and the file path</span>
<span class="n">fd</span><span class="p">,</span> <span class="n">temp_file_path</span> <span class="o">=</span> <span class="n">tempfile</span><span class="o">.</span><span class="n">mkstemp</span><span class="p">()</span>
<span class="n">os</span><span class="o">.</span><span class="n">close</span><span class="p">(</span><span class="n">fd</span><span class="p">)</span> <span class="c1"># close the descriptor; the file itself remains on disk</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'Temporary file path: </span><span class="si">{</span><span class="n">temp_file_path</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
<span class="n">temp_dir_path</span> <span class="o">=</span> <span class="n">tempfile</span><span class="o">.</span><span class="n">mkdtemp</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s1">'Temporary directory path: </span><span class="si">{</span><span class="n">temp_dir_path</span><span class="si">}</span><span class="s1">'</span><span class="p">)</span>
</code></pre></div>
<h2>Conclusion</h2>
<p>In this article, we've explored the powerful features of Python's <code>tempfile</code> module, covering common use-cases and some lesser-known features. With these tools at your disposal, you can easily create and manage temporary files and directories in your Python applications.</p>Exploring Python Packages for Loading and Processing YAML Front Matter in Markdown Documents2023-07-11T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-11:/python-packages-yaml-front-matter-markdown/<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#pyyaml">PyYAML</a></li>
<li><a href="#frontmatter">Frontmatter</a></li>
<li><a href="#yaml-front-matter">YAML Front Matter</a></li>
<li><a href="#python-markdown">Python Markdown</a></li>
<li><a href="#mistune">mistune</a></li>
<li><a href="#commonmark">Commonmark</a></li>
<li><a href="#which-one-to-use-in-my-case">Which one to use in my case?</a></li>
<li><a href="#simple-front-matter-extraction">Simple Front Matter Extraction</a></li>
<li><a href="#advanced-front-matter-manipulation">Advanced Front Matter Manipulation</a></li>
<li><a href="#seamless-integration-with-markdown-parsing">Seamless Integration with Markdown Parsing</a></li>
<li><a href="#performance-and-speed">Performance and Speed</a></li>
<li><a href="#commonmark-compliance">CommonMark Compliance</a></li>
<li><a href="#minimalistic-approach">Minimalistic Approach</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="introduction"></a></p>
<h2>Introduction</h2>
<p>Markdown has gained …</p><!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#pyyaml">PyYAML</a></li>
<li><a href="#frontmatter">Frontmatter</a></li>
<li><a href="#yaml-front-matter">YAML Front Matter</a></li>
<li><a href="#python-markdown">Python Markdown</a></li>
<li><a href="#mistune">mistune</a></li>
<li><a href="#commonmark">Commonmark</a></li>
<li><a href="#which-one-to-use-in-my-case">Which one to use in my case?</a></li>
<li><a href="#simple-front-matter-extraction">Simple Front Matter Extraction</a></li>
<li><a href="#advanced-front-matter-manipulation">Advanced Front Matter Manipulation</a></li>
<li><a href="#seamless-integration-with-markdown-parsing">Seamless Integration with Markdown Parsing</a></li>
<li><a href="#performance-and-speed">Performance and Speed</a></li>
<li><a href="#commonmark-compliance">CommonMark Compliance</a></li>
<li><a href="#minimalistic-approach">Minimalistic Approach</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="introduction"></a></p>
<h2>Introduction</h2>
<p>Markdown has gained popularity as a lightweight markup language for creating structured documents. It is widely used in various domains, including blogging, documentation, and note-taking. Markdown documents often include front matter, which is a metadata section at the beginning of the document. This front matter typically contains YAML (YAML Ain't Markup Language) formatted data that provides additional information about the document. In this blog post, we will explore several Python packages that can help you load and process YAML front matter in Markdown documents, providing you with the necessary tools to extract and work with this valuable metadata.</p>
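<p>Before looking at specific packages, it helps to see the shape of the problem: front matter sits between a pair of <code>---</code> fences at the top of the file, and splitting it off needs nothing beyond the standard library (a minimal sketch; real-world documents may need sturdier parsing, which is what the packages below provide):</p>

```python
def split_front_matter(text):
    """Split a Markdown document into (front_matter, body).

    Assumes the document starts with a '---' fence; returns an
    empty front matter string otherwise.
    """
    if not text.startswith('---'):
        return '', text
    _, front_matter, body = text.split('---', 2)
    return front_matter.strip(), body.lstrip('\n')

doc = "---\ntitle: Hello\n---\n# Heading\n"
fm, body = split_front_matter(doc)
print(fm)    # title: Hello
print(body)  # # Heading
```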
<p><a id="pyyaml"></a></p>
<h3>PyYAML</h3>
<p>PyYAML is a powerful YAML parser and emitter for Python. It allows you to easily read and write YAML files, making it a suitable choice for extracting YAML front matter from Markdown documents.</p>
<ul>
<li>PyPI: <a href="https://pypi.org/project/PyYAML/">PyYAML</a></li>
<li>GitHub: <a href="https://github.com/yaml/pyyaml">PyYAML on GitHub</a></li>
</ul>
<p>Example on how to load, modify and save front matter to markdown document:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">yaml</span>
<span class="c1"># Read the document and split off the front matter</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'article.md'</span><span class="p">,</span> <span class="s1">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="n">file</span><span class="p">:</span>
    <span class="n">content</span> <span class="o">=</span> <span class="n">file</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
<span class="n">_</span><span class="p">,</span> <span class="n">front_matter</span><span class="p">,</span> <span class="n">body</span> <span class="o">=</span> <span class="n">content</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s1">'---'</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">yaml</span><span class="o">.</span><span class="n">safe_load</span><span class="p">(</span><span class="n">front_matter</span><span class="p">)</span>
<span class="c1"># Modify front matter</span>
<span class="n">data</span><span class="p">[</span><span class="s1">'Modified'</span><span class="p">]</span> <span class="o">=</span> <span class="s1">'2023-07-12'</span>
<span class="c1"># Write the updated front matter and the original body back</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s1">'article.md'</span><span class="p">,</span> <span class="s1">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">file</span><span class="p">:</span>
    <span class="n">file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s1">'---</span><span class="se">\n</span><span class="s1">'</span><span class="p">)</span>
    <span class="n">file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">yaml</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">default_flow_style</span><span class="o">=</span><span class="kc">False</span><span class="p">))</span>
    <span class="n">file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s1">'---'</span><span class="p">)</span>
    <span class="n">file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">body</span><span class="p">)</span>
</code></pre></div>
<p><a id="frontmatter"></a></p>
<h3>python-frontmatter</h3>
<p><a href="http://jekyllrb.com/">Jekyll</a>-style YAML front matter offers a useful way to add arbitrary, structured metadata to text documents, regardless of type.
This is a small package to load and parse files (or just text) with YAML (or JSON, TOML or other) front matter.</p>
<ul>
<li>PyPI: <a href="https://pypi.org/project/python-frontmatter/">python-frontmatter</a></li>
<li>GitHub: <a href="https://github.com/eyeseast/python-frontmatter">python-frontmatter on GitHub</a></li>
</ul>
<p>Example of how to load, modify, and save front matter in a Markdown document:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">frontmatter</span>
<span class="c1"># Read front matter from a Markdown file</span>
<span class="n">post</span> <span class="o">=</span> <span class="n">frontmatter</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="s1">'article.md'</span><span class="p">)</span>
<span class="c1"># Modify front matter</span>
<span class="n">post</span><span class="p">[</span><span class="s1">'modified'</span><span class="p">]</span> <span class="o">=</span> <span class="s1">'2023-07-12'</span>
<span class="c1"># Write front matter back to the Markdown file</span>
<span class="n">frontmatter</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">post</span><span class="p">,</span> <span class="s1">'article.md'</span><span class="p">)</span>
</code></pre></div>
<p><a id="python-markdown"></a></p>
<h3>Python Markdown</h3>
<p>Python Markdown is a popular package for parsing and rendering Markdown documents. While its primary focus is converting Markdown to HTML, it also ships extensions such as <code>meta</code>, which parses metadata at the top of a document.</p>
<ul>
<li>PyPI: <a href="https://pypi.org/project/Markdown/">Python Markdown</a></li>
<li>GitHub: <a href="https://github.com/Python-Markdown/markdown">Python Markdown on GitHub</a></li>
</ul>
<p>Example of how to load, modify, and save front matter in a Markdown document:</p>
<div class="highlight"><pre><span></span><code>import markdown

# Read front matter from a Markdown file
with open('article.md', 'r') as file:
    content = file.read()

# The built-in 'meta' extension parses key: value lines at the top
# of the document into md.Meta (keys lowercased, values as lists)
md = markdown.Markdown(extensions=['meta'])
md.convert(content)

# Modify front matter
md.Meta['modified'] = ['2023-07-12']

# Write the metadata and the body back; with the plain key: value
# syntax (no '---' delimiters) the meta block ends at the first
# blank line, so the body is everything after it
body = content.split('\n\n', 1)[1]
with open('article.md', 'w') as file:
    for key, values in md.Meta.items():
        file.write(f'{key}: {values[0]}\n')
    file.write('\n')
    file.write(body)
</code></pre></div>
<p><a id="mistune"></a></p>
<h3>mistune</h3>
<p>mistune is a fast and extensible Markdown parser implemented in pure Python. It aims to be compatible with the Markdown specification while offering various customization options through its plugin system.</p>
<ul>
<li>PyPI: <a href="https://pypi.org/project/mistune/">mistune</a></li>
<li>GitHub: <a href="https://github.com/lepture/mistune">mistune on GitHub</a></li>
</ul>
<p>Example of how to load, modify, and save front matter in a Markdown document:</p>
<div class="highlight"><pre><span></span><code>import re

import mistune
import yaml

# Read front matter from a Markdown file
with open('article.md', 'r') as file:
    content = file.read()

# mistune itself does not parse front matter, so extract the
# '---'-delimited block manually and parse it with PyYAML
match = re.match(r'---\n(.*?)\n---\n', content, re.DOTALL)
data = yaml.safe_load(match.group(1))
body = content[match.end():]

# Modify front matter
data['Modified'] = '2023-07-12'

# Write the front matter and body back; mistune can still render
# the body to HTML with mistune.html(body)
with open('article.md', 'w') as file:
    file.write('---\n')
    file.write(yaml.dump(data, default_flow_style=False))
    file.write('---\n')
    file.write(body)
</code></pre></div>
<p><a id="commonmark"></a></p>
<h3>Commonmark</h3>
<p>Commonmark is a comprehensive Markdown parsing and rendering library for Python that adheres to the CommonMark specification. The specification itself does not cover front matter, so the example below extracts it with a regular expression before processing the body.</p>
<ul>
<li>PyPI: <a href="https://pypi.org/project/commonmark/">Commonmark</a></li>
<li>GitHub: <a href="https://github.com/readthedocs/commonmark.py">Commonmark on GitHub</a></li>
</ul>
<p>Example of how to load, modify, and save front matter in a Markdown document:</p>
<div class="highlight"><pre><span></span><code>import re

import commonmark
import yaml

# Read front matter from a Markdown file
with open('article.md', 'r') as file:
    content = file.read()

# Extract the front matter block with a regular expression
front_matter = re.search(r'^---\n(.*?)\n---\n', content, re.DOTALL)
data = yaml.safe_load(front_matter.group(1))
body = content.replace(front_matter.group(0), '')

# Modify front matter
data['Modified'] = '2023-07-12'

# Write front matter back to the Markdown file
with open('article.md', 'w') as file:
    file.write('---\n')
    file.write(yaml.dump(data, default_flow_style=False))
    file.write('---\n')
    file.write(body)

# The remaining body renders to HTML in a CommonMark-compliant way
html = commonmark.commonmark(body)
</code></pre></div>
<p><a id="which-one-to-use-in-my-case"></a></p>
<h2>Which one to use in my case?</h2>
<p>Here are distinct use cases related to loading and processing YAML front matter in Markdown documents, along with recommended libraries for each case and the justifications for the recommendations:</p>
<p><a id="simple-front-matter-extraction"></a></p>
<h3>Simple Front Matter Extraction</h3>
<ul>
<li>Recommended Library: <strong>python-frontmatter</strong></li>
</ul>
<blockquote>
<p>python-frontmatter is a dedicated package designed specifically for working with front matter in Markdown documents. It provides a simple and intuitive API for extracting front matter data, making it a suitable choice for straightforward front matter extraction needs.</p>
</blockquote>
<p><a id="advanced-front-matter-manipulation"></a></p>
<h3>Advanced Front Matter Manipulation</h3>
<ul>
<li>Recommended Library: <strong>PyYAML</strong></li>
</ul>
<blockquote>
<p>PyYAML is a powerful YAML parser and emitter for Python. If you require advanced manipulation and processing of YAML front matter, PyYAML offers extensive functionality and flexibility. It allows you to read and write YAML files, making it a robust choice for complex front matter handling.</p>
</blockquote>
<p><a id="seamless-integration-with-markdown-parsing"></a></p>
<h3>Seamless Integration with Markdown Parsing</h3>
<ul>
<li>Recommended Library: <strong>Python Markdown</strong></li>
</ul>
<blockquote>
<p>If your focus is on seamless integration with Markdown parsing, Python Markdown is a widely-used and feature-rich package. It supports custom extensions, including front matter parsing, allowing you to extract front matter while parsing the Markdown content. This integration can simplify your workflow when working with Markdown documents.</p>
</blockquote>
<p><a id="performance-and-speed"></a></p>
<h3>Performance and Speed</h3>
<ul>
<li>Recommended Library: <strong>mistune</strong></li>
</ul>
<blockquote>
<p>mistune is a fast and extensible Markdown parser implemented in pure Python. If performance and speed are crucial factors in your use case, mistune's efficient parsing capabilities make it an ideal choice, and its plugin system keeps customization possible without sacrificing that speed.</p>
</blockquote>
<p><a id="commonmark-compliance"></a></p>
<h3>CommonMark Compliance</h3>
<ul>
<li>Recommended Library: <strong>Commonmark</strong></li>
</ul>
<p>If adhering to the CommonMark specification is essential, Commonmark is a comprehensive Markdown parsing and rendering library that aligns with the specification. Front matter is not part of the CommonMark standard, so it is extracted separately (for example with a regular expression and PyYAML), while the document body is processed in a fully standards-compliant way.</p>
<p><a id="minimalistic-approach"></a></p>
<h3>Minimalistic Approach</h3>
<ul>
<li>Recommended Library: <strong>YAML Front Matter</strong></li>
</ul>
<p>YAML Front Matter is a minimalistic package that focuses on simplicity and ease of use. If you prefer a lightweight solution for extracting YAML front matter from Markdown files without additional complexity, YAML Front Matter provides a straightforward and efficient approach.</p>
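<p>For the truly minimal case, the split itself needs only the standard library. The helper below is my own illustration (the function name is not taken from any package above): it separates a <code>---</code>-delimited front matter block from the body, leaving the YAML text to be parsed by a library such as PyYAML if needed.</p>

```python
import re

def split_front_matter(text):
    """Split a '---'-delimited front matter block from a Markdown string.

    Returns (front_matter_text, body); front_matter_text is None
    when the document has no front matter block.
    """
    match = re.match(r'---\n(.*?)\n---\n', text, re.DOTALL)
    if match is None:
        return None, text
    return match.group(1), text[match.end():]

meta, body = split_front_matter("---\ntitle: Hello\n---\n\n# Hello\n")
# meta == "title: Hello", body == "\n# Hello\n"
```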
<p><a id="conclusion"></a></p>
<h2>Conclusion</h2>
<p>In this blog post, we explored several Python packages that can load and process YAML front matter in Markdown documents. These packages provide convenient and efficient methods for extracting metadata from the front matter section, enabling you to access and manipulate this valuable information.</p>Boosting Productivity and Automation With AppleScript on macOS2023-07-10T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-10:/Boosting Productivity and Automation with AppleScript on macOS/<h2>Introduction</h2>
<p>In today's fast-paced digital world, maximizing productivity and finding ways to automate tasks are essential skills. macOS provides a powerful tool called AppleScript, which allows users to write scripts and automate various processes. In this blog post, we will explore the …</p><h2>Introduction</h2>
<p>In today's fast-paced digital world, maximizing productivity and finding ways to automate tasks are essential skills. macOS provides a powerful tool called AppleScript, which allows users to write scripts and automate various processes. In this blog post, we will explore the capabilities of AppleScript, discuss cool tricks, and highlight its alternatives.</p>
<!-- MarkdownTOC levels="2,3,4" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#getting-started-with-applescript">Getting Started with AppleScript</a></li>
<li><a href="#increasing-productivity-with-applescript">Increasing Productivity with AppleScript</a><ul>
<li><a href="#customized-workflow">Customized Workflow</a></li>
<li><a href="#application-control">Application Control</a></li>
<li><a href="#system-automation">System Automation</a></li>
</ul>
</li>
<li><a href="#cool-tricks-with-applescript">Cool Tricks with AppleScript</a><ul>
<li><a href="#displaying-notifications">Displaying Notifications</a></li>
<li><a href="#text-manipulation">Text Manipulation</a></li>
<li><a href="#gui-automation">GUI Automation</a></li>
</ul>
</li>
<li><a href="#alternatives-to-applescript">Alternatives to AppleScript</a><ul>
<li><a href="#automator">Automator</a></li>
<li><a href="#hammerspoon">Hammerspoon</a></li>
<li><a href="#keyboard-maestro">Keyboard Maestro</a></li>
</ul>
</li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="getting-started-with-applescript"></a></p>
<h3>Getting Started with AppleScript</h3>
<p>AppleScript is a scripting language that enables users to control applications and perform tasks on macOS. It utilizes the <code>osascript</code> command-line utility to execute AppleScript code. To begin using AppleScript, open the Terminal on your Mac and enter the desired commands preceded by <code>osascript -e</code>.</p>
<p>The <code>osascript</code> documentation provides examples:</p>
<div class="highlight"><pre><span></span><code>Open<span class="w"> </span>or<span class="w"> </span>switch<span class="w"> </span>to<span class="w"> </span>Safari:<span class="w"> </span>
$<span class="w"> </span>osascript<span class="w"> </span>-e<span class="w"> </span><span class="s1">'tell app "Safari" to activate'</span>
Close<span class="w"> </span>safari:<span class="w"> </span>
$<span class="w"> </span>osascript<span class="w"> </span>-e<span class="w"> </span><span class="s1">'quit app "safari.app"'</span>
Empty<span class="w"> </span>the<span class="w"> </span>trash:<span class="w"> </span>
$<span class="w"> </span>osascript<span class="w"> </span>-e<span class="w"> </span><span class="s1">'tell application "Finder" to empty trash'</span>
Set<span class="w"> </span>the<span class="w"> </span>output<span class="w"> </span>volume<span class="w"> </span>to<span class="w"> </span><span class="m">50</span>%<span class="w"> </span>
$<span class="w"> </span>sudo<span class="w"> </span>osascript<span class="w"> </span>-e<span class="w"> </span><span class="s1">'set volume output volume 50'</span>
Input<span class="w"> </span>volume<span class="w"> </span>and<span class="w"> </span>Alert<span class="w"> </span>volume<span class="w"> </span>can<span class="w"> </span>also<span class="w"> </span>be<span class="w"> </span><span class="nb">set</span><span class="w"> </span>from<span class="w"> </span><span class="m">0</span><span class="w"> </span>to<span class="w"> </span><span class="m">100</span>%:<span class="w"> </span>
$<span class="w"> </span>sudo<span class="w"> </span>osascript<span class="w"> </span>-e<span class="w"> </span><span class="s1">'set volume input volume 40'</span><span class="w"> </span>
$<span class="w"> </span>sudo<span class="w"> </span>osascript<span class="w"> </span>-e<span class="w"> </span><span class="s1">'set volume alert volume 75'</span><span class="w"> </span>
Mute<span class="w"> </span>the<span class="w"> </span>output<span class="w"> </span>volume<span class="w"> </span><span class="o">(</span>True/False<span class="o">)</span>:<span class="w"> </span>
$<span class="w"> </span>osascript<span class="w"> </span>-e<span class="w"> </span><span class="s1">'set volume output muted TRUE'</span>
Toggle<span class="w"> </span>volume<span class="w"> </span>muting:<span class="w"> </span>
$<span class="w"> </span>osascript<span class="w"> </span>-e<span class="w"> </span><span class="s1">'set volume output muted not (output muted of (get volume settings))'</span>
Toggle<span class="w"> </span>system<span class="w"> </span>theme:<span class="w"> </span>
$<span class="w"> </span>osascript<span class="w"> </span>-e<span class="w"> </span><span class="s1">'tell application "System Events" to tell appearance preferences to set dark mode to not dark mode'</span>
Shut<span class="w"> </span>down<span class="w"> </span>without<span class="w"> </span>asking<span class="w"> </span><span class="k">for</span><span class="w"> </span>confirmation:<span class="w"> </span>
$<span class="w"> </span>osascript<span class="w"> </span>-e<span class="w"> </span><span class="s1">'tell app "System Events" to shut down'</span>
Restart<span class="w"> </span>without<span class="w"> </span>asking<span class="w"> </span><span class="k">for</span><span class="w"> </span>confirmation:<span class="w"> </span>
$<span class="w"> </span>osascript<span class="w"> </span>-e<span class="w"> </span><span class="s1">'tell app "System Events" to restart'</span>
</code></pre></div>
<p><a id="increasing-productivity-with-applescript"></a></p>
<h3>Increasing Productivity with AppleScript</h3>
<p><a id="customized-workflow"></a></p>
<h4>Customized Workflow</h4>
<p>AppleScript enables you to create personalized workflows by automating repetitive tasks. For example, you can write a script that renames and moves files based on specific criteria, saving you time and effort.</p>
<blockquote>
<p>Example Script 1: <strong>Automating File Organization</strong></p>
</blockquote>
<div class="highlight"><pre><span></span><code><span class="nv">tell</span><span class="w"> </span><span class="nv">application</span><span class="w"> </span><span class="s2">"Finder"</span>
<span class="w"> </span><span class="nv">set</span><span class="w"> </span><span class="nv">sourceFolder</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="nv">choose</span><span class="w"> </span><span class="nv">folder</span><span class="w"> </span><span class="nv">with</span><span class="w"> </span><span class="nv">prompt</span><span class="w"> </span><span class="s2">"Select the source folder"</span>
<span class="w"> </span><span class="nv">set</span><span class="w"> </span><span class="nv">destinationFolder</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="nv">choose</span><span class="w"> </span><span class="nv">folder</span><span class="w"> </span><span class="nv">with</span><span class="w"> </span><span class="nv">prompt</span><span class="w"> </span><span class="s2">"Select the destination folder"</span>
<span class="w"> </span><span class="nv">set</span><span class="w"> </span><span class="nv">fileList</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="nv">every</span><span class="w"> </span><span class="nv">file</span><span class="w"> </span><span class="nv">of</span><span class="w"> </span><span class="nv">sourceFolder</span>
<span class="w"> </span><span class="nv">repeat</span><span class="w"> </span><span class="nv">with</span><span class="w"> </span><span class="nv">aFile</span><span class="w"> </span><span class="nv">in</span><span class="w"> </span><span class="nv">fileList</span>
<span class="w"> </span><span class="nv">move</span><span class="w"> </span><span class="nv">aFile</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="nv">destinationFolder</span>
<span class="w"> </span><span class="k">end</span><span class="w"> </span><span class="nv">repeat</span>
<span class="k">end</span><span class="w"> </span><span class="nv">tell</span>
</code></pre></div>
<blockquote>
<p>This script allows you to select a source folder and a destination folder. It moves all files from the source folder to the destination folder, simplifying your file organization process.</p>
</blockquote>
<p><a id="application-control"></a></p>
<h4>Application Control</h4>
<p>With AppleScript, you can interact with various macOS applications. You could automate tasks like sending emails, creating documents, or extracting data from spreadsheets, helping streamline your workflow.</p>
<blockquote>
<p>Example Script 2: Creating New Email in Apple Mail</p>
</blockquote>
<div class="highlight"><pre><span></span><code><span class="nx">tell</span><span class="w"> </span><span class="nx">application</span><span class="w"> </span><span class="s">"Mail"</span>
<span class="w"> </span><span class="nx">set</span><span class="w"> </span><span class="nx">newMessage</span><span class="w"> </span><span class="nx">to</span><span class="w"> </span><span class="nx">make</span><span class="w"> </span><span class="nx">new</span><span class="w"> </span><span class="nx">outgoing</span><span class="w"> </span><span class="nx">message</span><span class="w"> </span><span class="nx">with</span><span class="w"> </span><span class="nx">properties</span><span class="w"> </span><span class="p">{</span><span class="nx">subject</span><span class="p">:</span><span class="s">"Hello"</span><span class="p">,</span><span class="w"> </span><span class="nx">content</span><span class="p">:</span><span class="s">"Just wanted to say hi!"</span><span class="p">}</span>
<span class="w"> </span><span class="nx">tell</span><span class="w"> </span><span class="nx">newMessage</span>
<span class="w"> </span><span class="nx">make</span><span class="w"> </span><span class="nx">new</span><span class="w"> </span><span class="nx">to</span><span class="w"> </span><span class="nx">recipient</span><span class="w"> </span><span class="nx">at</span><span class="w"> </span><span class="nx">end</span><span class="w"> </span><span class="nx">of</span><span class="w"> </span><span class="nx">to</span><span class="w"> </span><span class="nx">recipients</span><span class="w"> </span><span class="nx">with</span><span class="w"> </span><span class="nx">properties</span><span class="w"> </span><span class="p">{</span><span class="nx">address</span><span class="p">:</span><span class="s">"example@email.com"</span><span class="p">}</span>
<span class="w"> </span><span class="nx">open</span>
<span class="w"> </span><span class="nx">end</span><span class="w"> </span><span class="nx">tell</span>
<span class="nx">end</span><span class="w"> </span><span class="nx">tell</span>
</code></pre></div>
<blockquote>
<p>This script automates the process of creating a new email in Apple Mail. It sets the subject and content of the email and adds a recipient, ready for you to send your message swiftly.</p>
</blockquote>
<p><a id="system-automation"></a></p>
<h4>System Automation</h4>
<p>AppleScript allows you to control system settings and perform actions like changing the display resolution, adjusting volume, or toggling Wi-Fi—all with a single script.</p>
<blockquote>
<p>Example Script 3: Adjusting Display Brightness</p>
</blockquote>
<div class="highlight"><pre><span></span><code><span class="nv">tell</span><span class="w"> </span><span class="nv">application</span><span class="w"> </span><span class="s2">"System Preferences"</span>
<span class="w"> </span><span class="nv">reveal</span><span class="w"> </span><span class="nv">anchor</span><span class="w"> </span><span class="s2">"displaysDisplayTab"</span><span class="w"> </span><span class="nv">of</span><span class="w"> </span><span class="nv">pane</span><span class="w"> </span><span class="nv">id</span><span class="w"> </span><span class="s2">"com.apple.preference.displays"</span>
<span class="w"> </span><span class="nv">activate</span>
<span class="k">end</span><span class="w"> </span><span class="nv">tell</span>
<span class="nv">tell</span><span class="w"> </span><span class="nv">application</span><span class="w"> </span><span class="s2">"System Events"</span>
<span class="w"> </span><span class="nv">tell</span><span class="w"> </span><span class="nv">process</span><span class="w"> </span><span class="s2">"System Preferences"</span>
<span class="w"> </span><span class="nv">tell</span><span class="w"> </span><span class="nv">slider</span><span class="w"> </span><span class="mi">1</span><span class="w"> </span><span class="nv">of</span><span class="w"> </span><span class="nv">window</span><span class="w"> </span><span class="mi">1</span>
<span class="w"> </span><span class="nv">set</span><span class="w"> </span><span class="nv">value</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="mi">75</span><span class="w"> </span><span class="o">--</span><span class="w"> </span><span class="nv">Change</span><span class="w"> </span><span class="nv">brightness</span><span class="w"> </span><span class="nv">level</span><span class="w"> </span><span class="ss">(</span><span class="mi">0</span><span class="o">-</span><span class="mi">100</span><span class="ss">)</span>
<span class="w"> </span><span class="k">end</span><span class="w"> </span><span class="nv">tell</span>
<span class="w"> </span><span class="k">end</span><span class="w"> </span><span class="nv">tell</span>
<span class="k">end</span><span class="w"> </span><span class="nv">tell</span>
<span class="nv">tell</span><span class="w"> </span><span class="nv">application</span><span class="w"> </span><span class="s2">"System Preferences"</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="nv">quit</span>
</code></pre></div>
<blockquote>
<p>This script opens the Display preferences in System Preferences, adjusts the brightness slider to the desired level, and then closes System Preferences. This allows you to quickly customize your display brightness without navigating through menus.</p>
</blockquote>
<p><a id="cool-tricks-with-applescript"></a></p>
<h3>Cool Tricks with AppleScript</h3>
<p><a id="displaying-notifications"></a></p>
<h4>Displaying Notifications</h4>
<p>As discussed earlier, you can use AppleScript to display notifications on the screen. This feature is particularly useful for receiving alerts or reminders during time-sensitive tasks.</p>
<blockquote>
<p>Example Script 4: Notifying Important Task Deadlines</p>
</blockquote>
<div class="highlight"><pre><span></span><code>display notification "Don't forget to submit the report by 5 PM!" with title "Task Reminder"
</code></pre></div>
<blockquote>
<p>This script displays a notification with a reminder for an important task deadline. You can set up similar notifications for time-sensitive activities to keep you on track.</p>
</blockquote>
<p><a id="text-manipulation"></a></p>
<h4>Text Manipulation</h4>
<p>AppleScript offers powerful text manipulation capabilities. You can automate tasks such as extracting specific information from a text file, finding and replacing text across multiple documents, or formatting text according to predefined rules.</p>
<blockquote>
<p>Example Script 5: Find and Replace Text in Multiple Files</p>
</blockquote>
<div class="highlight"><pre><span></span><code><span class="nv">set</span><span class="w"> </span><span class="nv">searchText</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="s2">"oldText"</span>
<span class="nv">set</span><span class="w"> </span><span class="nv">replaceText</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="s2">"newText"</span>
<span class="nv">tell</span><span class="w"> </span><span class="nv">application</span><span class="w"> </span><span class="s2">"Finder"</span>
<span class="w"> </span><span class="nv">set</span><span class="w"> </span><span class="nv">folderPath</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="nv">choose</span><span class="w"> </span><span class="nv">folder</span><span class="w"> </span><span class="nv">with</span><span class="w"> </span><span class="nv">prompt</span><span class="w"> </span><span class="s2">"Select the folder"</span>
<span class="w"> </span><span class="nv">set</span><span class="w"> </span><span class="nv">fileList</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="nv">every</span><span class="w"> </span><span class="nv">file</span><span class="w"> </span><span class="nv">of</span><span class="w"> </span><span class="nv">folderPath</span>
<span class="w"> </span><span class="nv">repeat</span><span class="w"> </span><span class="nv">with</span><span class="w"> </span><span class="nv">aFile</span><span class="w"> </span><span class="nv">in</span><span class="w"> </span><span class="nv">fileList</span>
<span class="w"> </span><span class="nv">set</span><span class="w"> </span><span class="nv">fileContents</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="nv">read</span><span class="w"> </span><span class="nv">aFile</span><span class="w"> </span><span class="nv">as</span><span class="w"> </span>«<span class="nv">class</span><span class="w"> </span><span class="nv">utf8</span>»
<span class="w"> </span><span class="nv">set</span><span class="w"> </span><span class="nv">modifiedContents</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="nv">replaceTextInString</span><span class="ss">(</span><span class="nv">fileContents</span>,<span class="w"> </span><span class="nv">searchText</span>,<span class="w"> </span><span class="nv">replaceText</span><span class="ss">)</span>
<span class="w"> </span><span class="nv">set</span><span class="w"> </span><span class="nv">writeResult</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="nv">write</span><span class="w"> </span><span class="nv">modifiedContents</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="nv">aFile</span><span class="w"> </span><span class="nv">as</span><span class="w"> </span>«<span class="nv">class</span><span class="w"> </span><span class="nv">utf8</span>»
<span class="w"> </span><span class="k">end</span><span class="w"> </span><span class="nv">repeat</span>
<span class="k">end</span><span class="w"> </span><span class="nv">tell</span>
<span class="nv">on</span><span class="w"> </span><span class="nv">replaceTextInString</span><span class="ss">(</span><span class="nv">textString</span>,<span class="w"> </span><span class="nv">oldText</span>,<span class="w"> </span><span class="nv">newText</span><span class="ss">)</span>
<span class="w"> </span><span class="nv">set</span><span class="w"> </span><span class="nv">AppleScript</span><span class="err">'s text item delimiters to the oldText</span>
<span class="err"> set textItems to every text item of textString</span>
<span class="w"> </span><span class="nv">set</span><span class="w"> </span><span class="nv">AppleScript</span><span class="err">'s text item delimiters to the newText</span>
<span class="err"> return textItems as text</span>
<span class="err">end replaceTextInString</span>
</code></pre></div>
<blockquote>
<p>This script prompts you to select a folder and replaces all occurrences of "oldText" with "newText" in the contents of every file within that folder. This can be useful for batch text replacements across multiple documents.</p>
</blockquote>
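<p>For comparison, the same batch find-and-replace can be sketched in a few lines of Python; the folder path and search strings are placeholders to adapt to your own setup:</p>

```python
from pathlib import Path

def replace_in_folder(folder: str, search_text: str, replace_text: str) -> int:
    """Replace every occurrence of search_text in each file of a folder.

    Returns the number of files that were modified.
    """
    changed = 0
    for path in Path(folder).iterdir():
        if not path.is_file():
            continue
        contents = path.read_text(encoding="utf-8")
        modified = contents.replace(search_text, replace_text)
        if modified != contents:
            path.write_text(modified, encoding="utf-8")
            changed += 1
    return changed
```

<p>Unlike the AppleScript version, this skips subfolders and untouched files; extending it to recurse with <code>Path.rglob</code> is a one-line change.</p>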
<p><a id="gui-automation"></a></p>
<h4>GUI Automation</h4>
<p>AppleScript can simulate user interactions with graphical user interfaces (GUI). You can automate tasks that involve clicking buttons, selecting options from menus, or filling out forms in applications, saving you from repetitive manual operations.</p>
<blockquote>
<p>Example Script 6: Automating Safari Website Login</p>
</blockquote>
<div class="highlight"><pre><span></span><code><span class="n">tell</span><span class="w"> </span><span class="n">application</span><span class="w"> </span><span class="s2">"Safari"</span>
<span class="w"> </span><span class="n">activate</span>
<span class="w"> </span><span class="n">open</span><span class="w"> </span><span class="n">location</span><span class="w"> </span><span class="s2">"https://example.com/login"</span>
<span class="w"> </span><span class="n">delay</span><span class="w"> </span><span class="mi">2</span><span class="w"> </span><span class="o">--</span><span class="w"> </span><span class="n">Add</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="n">delay</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="n">needed</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">page</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="nb">load</span>
<span class="n">end</span><span class="w"> </span><span class="n">tell</span>
<span class="n">tell</span><span class="w"> </span><span class="n">application</span><span class="w"> </span><span class="s2">"System Events"</span>
<span class="w"> </span><span class="n">tell</span><span class="w"> </span><span class="n">process</span><span class="w"> </span><span class="s2">"Safari"</span>
<span class="w"> </span><span class="n">set</span><span class="w"> </span><span class="n">frontmost</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="bp">true</span>
<span class="w"> </span><span class="n">keystroke</span><span class="w"> </span><span class="s2">"username"</span>
<span class="w"> </span><span class="n">keystroke</span><span class="w"> </span><span class="n">tab</span>
<span class="w"> </span><span class="n">keystroke</span><span class="w"> </span><span class="s2">"password"</span>
<span class="w"> </span><span class="n">keystroke</span><span class="w"> </span><span class="k">return</span>
<span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="n">tell</span>
<span class="n">end</span><span class="w"> </span><span class="n">tell</span>
</code></pre></div>
<blockquote>
<p>This script automates the process of opening a specific website in Safari, entering a username and password, and submitting the login form. You can adapt this script to automate various web-based actions.
<a id="alternatives-to-applescript"></a></p>
</blockquote>
<h3>Alternatives to AppleScript</h3>
<p>While AppleScript is a robust tool, other alternatives can also help achieve automation and productivity on macOS:</p>
<p><a id="automator"></a></p>
<h4>Automator</h4>
<p><img alt="automator logo" src="https://help.apple.com/assets/61E87B255FBFB2628709732E/61E87B275FBFB26287097336/en_GB/573f95d708cbb258343f5c78cc439bcb.png">
<a href="https://support.apple.com/en-gb/guide/automator/welcome/mac">Automator</a> is a visual automation tool built into macOS. It provides a drag-and-drop interface to create workflows without writing code. Automator supports a wide range of actions and can be an excellent choice for users who prefer a more intuitive approach.
<a id="hammerspoon"></a></p>
<h4>Hammerspoon</h4>
<p><img alt="Hammerspoon logo" src="https://www.hammerspoon.org/images/hammerspoon.png">
<a href="https://www.hammerspoon.org/">Hammerspoon</a> is a powerful automation tool that uses the Lua scripting language. It offers extensive customization and control over macOS, allowing users to create complex workflows and automation routines.</p>
<p><a id="keyboard-maestro"></a></p>
<h4>Keyboard Maestro</h4>
<p><img alt="Keyboard Maestro logo" src="https://www.keyboardmaestro.com/img/keyboardmaestro-64.png">
<a href="https://www.keyboardmaestro.com/main/">Keyboard Maestro</a> is a comprehensive automation tool that focuses on keyboard and mouse automation. It provides a user-friendly interface to create macros, trigger actions based on specific events, and automate repetitive tasks efficiently.</p>
<p><a id="conclusion"></a></p>
<h2>Conclusion</h2>
<p>AppleScript is a versatile tool for increasing productivity and automating tasks on macOS. Its ability to control applications and system settings, and to perform a wide range of actions, makes it a valuable asset. Cool tricks like displaying notifications and GUI automation enhance the experience further. Alternatives like Automator, Hammerspoon, and Keyboard Maestro offer different approaches to automation, catering to diverse user preferences. Explore these tools and find the one that best fits your workflow to unlock new levels of productivity and efficiency on your Mac.</p>
<h2>References</h2>
<ul>
<li><a href="https://developer.apple.com/library/archive/documentation/AppleScript/Conceptual/AppleScriptLangGuide/introduction/ASLR_intro.html">Introduction to AppleScript Language Guide</a></li>
<li><a href="https://ss64.com/osx/osascript.html">osascript Man Page - macOS - SS64.com</a></li>
<li><a href="https://ss64.com/osx/osacompile.html">osacompile</a> - Compile AppleScripts and other OSA language scripts.</li>
</ul>Display a Notification on the Screen in macOS2023-07-10T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-10:/display-a-notification-on-the-screen-in-macos/<p>To display a notification on the screen near the menu bar in macOS using the terminal, you can make use of the <code>osascript</code> command to execute AppleScript code. Here's an example command you can run:</p>
<div class="highlight"><pre><span></span><code>osascript<span class="w"> </span>-e<span class="w"> </span><span class="s1">'display notification "Hello, World!" with …</span></code></pre></div><p>To display a notification on the screen near the menu bar in macOS using the terminal, you can make use of the <code>osascript</code> command to execute AppleScript code. Here's an example command you can run:</p>
<div class="highlight"><pre><span></span><code>osascript<span class="w"> </span>-e<span class="w"> </span><span class="s1">'display notification "Hello, World!" with title "Notification"'</span>
</code></pre></div>
<p>This command will display a notification with the message "Hello, World!" and the title "Notification" near the menu bar on macOS.</p>
<p>You can customize the message and title by modifying the strings inside the double quotes in the <code>osascript</code> command.</p>
<p>Note that starting from macOS Big Sur (11.0), AppleScript-based notifications require user authorization. The first time you run this command, you will be prompted to grant permission to Terminal (or whichever application you are using) to send notifications.</p>
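<p>To trigger the same notification from Python, one option is to shell out to <code>osascript</code> with the <code>subprocess</code> module. The sketch below builds the argument list, escaping backslashes and double quotes so arbitrary messages survive inside the AppleScript string literal; actually running it is, of course, macOS-only:</p>

```python
import subprocess

def build_notification_args(message: str, title: str) -> list[str]:
    """Build an osascript argument list for `display notification`.

    Backslashes and double quotes are escaped so they survive inside
    the AppleScript string literal.
    """
    def esc(s: str) -> str:
        return s.replace("\\", "\\\\").replace('"', '\\"')

    script = f'display notification "{esc(message)}" with title "{esc(title)}"'
    return ["osascript", "-e", script]

def notify(message: str, title: str = "Notification") -> None:
    # macOS only; the first run prompts for notification permission.
    subprocess.run(build_notification_args(message, title), check=True)
```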
<h2>Reading</h2>
<ul>
<li><a href="https://victorscholz.medium.com/what-is-osascript-e48f11b8dec6">What is Osascript?. Learning about Osascript started with… | by Victor Scholz | Medium</a></li>
<li><a href="https://ss64.com/osx/osascript.html">osascript Man Page - macOS - SS64.com</a></li>
</ul>Software Versioning Schemes2023-07-08T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-08:/software-versioning-schemes/<h2>Introduction</h2>
<p>Software versioning schemes are essential in the world of programming, as they help developers, users, and collaborators keep track of various versions of a software product. A proper versioning scheme enables easy identification of the current release, the changes made in …</p><h2>Introduction</h2>
<p>Software versioning schemes are essential in the world of programming, as they help developers, users, and collaborators keep track of various versions of a software product. A proper versioning scheme enables easy identification of the current release, the changes made in each version, and the compatibility of a version with previous ones. In this blog post, we will discuss some of the most popular versioning schemes used in the software industry, along with a few lesser-known but useful alternatives. </p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#semantic-versioning">Semantic Versioning</a></li>
<li><a href="#calendar-versioning-calver">Calendar Versioning (CalVer)</a></li>
<li><a href="#zerover-0-based-versioning">ZeroVer: 0-based Versioning</a></li>
<li><a href="#lesser-known-versioning-schemes">Lesser-known Versioning Schemes</a><ul>
<li><a href="#romantic-versioning">Romantic Versioning</a></li>
<li><a href="#hash-based-versioning">Hash-based Versioning</a></li>
<li><a href="#custom-versioning-schemes">Custom Versioning Schemes</a></li>
</ul>
</li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="semantic-versioning"></a></p>
<h2>Semantic Versioning (SemVer)</h2>
<p>Semantic versioning, also known as <a href="https://semver.org/">SemVer</a>, is a widely adopted versioning scheme that emphasizes the importance of clear and meaningful version numbers. In SemVer, a version number consists of three parts: MAJOR.MINOR.PATCH. Each part represents the following: </p>
<ul>
<li>MAJOR version: incremented when you make incompatible API changes, </li>
<li>MINOR version: incremented when you add functionality in a backwards-compatible manner, and </li>
<li>PATCH version: incremented when you make backwards-compatible bug fixes. </li>
</ul>
<p>In addition to these three parts, SemVer allows for additional labels for pre-release and build metadata as extensions to the MAJOR.MINOR.PATCH format. This makes it easier for developers to communicate the scope of changes in each release and helps users understand if an update will break their existing setup or not. </p>
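<p>To make the precedence rules concrete, here is a minimal Python sketch that parses and compares bare MAJOR.MINOR.PATCH versions. It deliberately ignores the pre-release and build-metadata extensions mentioned above (under full SemVer, for example, 1.0.0-alpha precedes 1.0.0):</p>

```python
def parse_semver(version: str) -> tuple[int, int, int]:
    """Parse a bare MAJOR.MINOR.PATCH string into a comparable tuple."""
    major, minor, patch = version.split(".")
    return (int(major), int(minor), int(patch))

# Tuples compare element-wise, matching SemVer precedence for the core triple.
assert parse_semver("1.10.0") > parse_semver("1.9.3")
assert parse_semver("2.0.0") > parse_semver("1.99.99")
```

<p>Because Python compares tuples element-wise, the core precedence rules fall out for free; pre-release tags would need extra logic.</p>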
<p><a id="calendar-versioning-calver"></a></p>
<h2>Calendar Versioning (CalVer)</h2>
<p>Another popular versioning scheme is Calendar Versioning or <a href="https://calver.org/">CalVer</a>. CalVer uses a combination of the release date and a project-specific version number to create a unique identifier for each release. The format typically looks like this: YYYY.MM.DD.MICRO. </p>
<p>The advantages of CalVer include its simplicity and the ability to quickly determine the age of a release. However, unlike SemVer, CalVer does not provide explicit information about API changes or compatibility between versions. </p>
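<p>To illustrate, a date-based identifier in the YYYY.MM.DD.MICRO shape can be generated directly from a release date; the <code>micro</code> argument here is a hypothetical per-day release counter:</p>

```python
from datetime import date

def calver(release_date: date, micro: int = 0) -> str:
    """Build a YYYY.MM.DD.MICRO identifier from a release date."""
    return f"{release_date.year}.{release_date.month:02d}.{release_date.day:02d}.{micro}"

print(calver(date(2023, 7, 8)))  # 2023.07.08.0
```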
<p><a id="zerover-0-based-versioning"></a></p>
<h2>ZeroVer: 0-based Versioning (0ver)</h2>
<p>ZeroVer, also known as <a href="https://0ver.org/">0ver</a> is a unique and simple versioning scheme that asserts that your software's major version should never exceed the first and most important number in computing: zero. This is in contrast to other versioning schemes like Semantic Versioning and Calendar Versioning. </p>
<p>The rationale behind ZeroVer is that software is never truly "finished" and will always be subject to improvements, bug fixes, and new features. By keeping the major version at zero, developers acknowledge the ever-evolving nature of their software and avoid the pressures associated with "final" releases. </p>
<blockquote>
<p>Note: <code>0ver</code> starts with the digit zero, not the capital letter O.</p>
</blockquote>
<p><a id="lesser-known-versioning-schemes"></a></p>
<h2>Lesser-known Versioning Schemes</h2>
<p>In addition to the popular versioning schemes mentioned above, there are other lesser-known but equally useful alternatives. Some of these include: </p>
<p><a id="romantic-versioning"></a></p>
<h3>Romantic Versioning</h3>
<p><a href="https://github.com/romversioning/romver">Romantic Versioning</a> is a light-hearted, informal versioning scheme that uses popular culture references or personal milestones as version numbers. While not suitable for all projects, Romantic Versioning can be a fun way to engage users and make software updates more memorable. </p>
<p>The Romantic Versioning specification was authored by <a href="http://blog.legacyteam.info/2015/12/romver-romantic-versioning/">Daniel V from the Legacy Blog crew</a> in 2015. The open, public repository exists to maintain the specification, keep it visible, and enable cooperation with others.</p>
<p>See also: <a href="http://sentimentalversioning.org/">sentimentalversioning.org</a></p>
<p><a id="hash-based-versioning"></a></p>
<h3>Hash-based Versioning</h3>
<p><a href="https://miniscruff.github.io/hashver/">Hash-based Versioning</a>is a versioning scheme that uses the unique hash of a particular commit in a version control system (such as Git) as the version number. This approach ensures that each release is directly tied to a specific point in the development history, making it easy to track and revert changes if needed. </p>
<p><a id="custom-versioning-schemes"></a></p>
<h3>Custom Versioning Schemes</h3>
<p>Some projects may benefit from a custom versioning scheme tailored to their specific needs. This could involve combining elements from various existing schemes or developing an entirely new approach. When creating a custom versioning scheme, it's essential to ensure that it is clear, consistent, and easy to understand for all stakeholders. </p>
<p><a id="conclusion"></a></p>
<h2>Conclusion</h2>
<p>Choosing the right versioning scheme for your software project is crucial for effective communication and collaboration among developers, users, and other stakeholders. While Semantic Versioning and Calendar Versioning are popular choices, alternative schemes like ZeroVer, Romantic Versioning, Hash-based Versioning, or even custom schemes can also be appropriate depending on your project's unique requirements. Ultimately, the ideal versioning scheme should be easy to understand, provide meaningful information about each release, and facilitate the management of software updates.</p>How to install Faiss on Google Colab2023-07-04T00:00:00+02:002023-07-04T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-04:/how-to-install-faiss-on-google-colab/
<p>To install <a href="https://github.com/facebookresearch/faiss">faiss</a> on Colab use:</p>
<div class="highlight"><pre><span></span><code><span class="sx">!pip install faiss-cpu --no-cache</span>
</code></pre></div>Easy Text Vectorization With VectorHub and Sentence Transformers2023-07-04T00:00:00+02:002023-07-04T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-04:/text-vectorization-with-vectorhub-and-sentence-transformers/<p>Learn how to use Sentence Transformers for text vectorization with different models using consistent API.</p><p>Text is heavily inspired by part of the e-book: <a href="https://learn.getvectorai.com/vector-ai-documentation/semantic-nlp-search-with-faiss-and-vectorhub">Semantic NLP search with FAISS and VectorHub - Guide To Vectors (getvectorai.com)</a> - which was using VectorHub as an interface to the models.</p>
<blockquote>
<p><strong>NOTE</strong>: VectorHub is deprecated and no longer maintained. The authors of VectorHub recommend using <a href="https://www.sbert.net/">Sentence Transformers</a>, TFHub, and Huggingface directly for text vectorization.</p>
</blockquote>
<p>This article demonstrates a similar process as the original article but uses a sentence transformers package.</p>
<h3>Encoding Data Using Sentence Transformers</h3>
<p>To encode models easily, we will utilize the <a href="https://www.sbert.net/">Sentence Transformers</a> library. SentenceTransformers is a Python framework for state-of-the-art sentence, text, and image embeddings. It provides a variety of pre-trained models that can convert sentences into meaningful numerical representations.</p>
<p>First, we need to install the <code>sentence-transformers</code> package, which includes the necessary dependencies for using Sentence Transformers. This library offers a wide range of pre-trained models, such as <a href="https://en.wikipedia.org/wiki/BERT_(Language_model)">BERT</a>, <a href="https://huggingface.co/docs/transformers/model_doc/roberta">RoBERTa</a>, and <a href="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2">MiniLM</a>, that can be used for text encoding. More information about Sentence Transformers can be found <a href="https://www.sbert.net/">here</a>.</p>
<div class="highlight"><pre><span></span><code>pip<span class="w"> </span>install<span class="w"> </span>sentence-transformers
</code></pre></div>
<p>Next, we will instantiate our model and start the encoding process. In this example, we will use the "all-MiniLM-L6-v2" model, which is a variant of the MiniLM model.</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">sentence_transformers</span> <span class="kn">import</span> <span class="n">SentenceTransformer</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">SentenceTransformer</span><span class="p">(</span><span class="s1">'all-MiniLM-L6-v2'</span><span class="p">)</span>
<span class="c1"># Sentences to be encoded</span>
<span class="n">sentences</span> <span class="o">=</span> <span class="p">[</span>
<span class="s1">'This framework generates embeddings for each input sentence'</span><span class="p">,</span>
<span class="s1">'Sentences are passed as a list of strings.'</span><span class="p">,</span>
<span class="s1">'The quick brown fox jumps over the lazy dog.'</span>
<span class="p">]</span>
<span class="c1"># Encode sentences using the Sentence Transformers model</span>
<span class="n">embeddings</span> <span class="o">=</span> <span class="n">model</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">sentences</span><span class="p">)</span>
<span class="c1"># Print the embeddings</span>
<span class="k">for</span> <span class="n">sentence</span><span class="p">,</span> <span class="n">embedding</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">sentences</span><span class="p">,</span> <span class="n">embeddings</span><span class="p">):</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Sentence:"</span><span class="p">,</span> <span class="n">sentence</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"Embedding:"</span><span class="p">,</span> <span class="n">embedding</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">""</span><span class="p">)</span>
</code></pre></div>
<p>In the code snippet above, we begin by installing the <code>sentence-transformers</code> package, which provides the necessary tools for working with Sentence Transformers. This library offers various pre-trained models that can convert sentences into meaningful vector representations.</p>
<p>After the installation, we import the <code>SentenceTransformer</code> class from the <code>sentence_transformers</code> module. We instantiate the model using the <code>all-MiniLM-L6-v2</code> variant, which will be used for encoding our sentences.</p>
<p>We define a list of sentences that we want to encode using the Sentence Transformers model. In this case, we have three exemplary sentences: "This framework generates embeddings for each input sentence," "Sentences are passed as a list of strings," and "The quick brown fox jumps over the lazy dog."</p>
<p>To perform the encoding, we use the <code>encode</code> method of the <code>model</code> object, passing in the <code>sentences</code> list. This method returns the corresponding embeddings for each sentence, which we store in the <code>embeddings</code> variable.</p>
<p>Finally, we iterate over the <code>sentences</code> and <code>embeddings</code> lists together using <code>zip</code>. For each sentence and its corresponding embedding, we print them out to visualize the results.</p>
<p>Please note that the code snippet above uses the "all-MiniLM-L6-v2" model as an example. You can explore and choose from a wide range of models provided by Sentence Transformers according to your specific requirements.</p>
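<p>Once you have the embeddings, a common next step is to compare them. Here is a dependency-free cosine-similarity sketch on plain Python lists; with real Sentence Transformers output you would pass the returned vectors directly (or use the library's own <code>util.cos_sim</code> helper):</p>

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for sentence embeddings:
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```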
<h2>References</h2>
<ul>
<li><a href="https://github.com/RelevanceAI/vectorhub">GitHub - RelevanceAI/vectorhub: Vector Hub - Library for easy discovery, and consumption of State-of-the-art models to turn data into vectors. (text2vec, image2vec, video2vec, graph2vec, bert, inception, etc)</a></li>
<li><a href="https://learn.getvectorai.com/">Introduction - Guide To Vectors</a></li>
</ul>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>Introducing a Python Module for Splitting Text Into Parts Based on Token Limit2023-07-03T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-07-03:/token-split-text/<h2>Introduction</h2>
<p>In the realm of natural language processing and text analysis, it is often necessary to split a large piece of text into smaller parts while ensuring that the split does not break words or disrupt the meaning of the text. This …</p><h2>Introduction</h2>
<p>In the realm of natural language processing and text analysis, it is often necessary to split a large piece of text into smaller parts while ensuring that the split does not break words or disrupt the meaning of the text. This task can be challenging, especially when dealing with tokenization. However, with the help of the Tiktoken library and a custom Python module, splitting text based on a specified number of tokens becomes straightforward.</p>
<h2>Understanding the Tiktoken Library</h2>
<p>Tiktoken is a fast <a href="https://en.wikipedia.org/wiki/Byte_pair_encoding">BPE</a> tokenizer built for use with OpenAI's models. Tokenization is the process of splitting text into individual tokens such as words or subwords. The library ships with several encodings and helper functions that let developers process text in tokenized form, and its support for different languages and tokenization models makes it a versatile tool for a wide range of text processing tasks.</p>
<h2>Introducing the Python Module: split_string_with_limit</h2>
<p>The provided Python module: <a href="https://gist.github.com/izikeros/17d9c8ab644bd2762acf6b19dd0cea39">split_string_with_limit.py</a> (GitHub Gist), leverages the capabilities of the Tiktoken library to split a string into parts with a specified limit on the number of tokens per part. The module takes three parameters: <code>text</code>, <code>limit</code>, and <code>encoding</code>.</p>
<ul>
<li><code>text</code>: The input string that needs to be split.</li>
<li><code>limit</code>: The maximum number of tokens allowed per part.</li>
<li><code>encoding</code>: The tokenization encoding to be used, which determines how the text is tokenized.</li>
</ul>
<p>The module proceeds as follows:</p>
<ol>
<li>It tokenizes the input text using the specified encoding from Tiktoken.</li>
<li>It creates an empty list, <code>parts</code>, to store the tokenized parts.</li>
<li>It initializes a <code>current_part</code> list and a <code>current_count</code> variable to keep track of the tokens in the current part.</li>
<li>It iterates over each token in the tokenized text.</li>
<li>For each token, it appends it to the <code>current_part</code> list and increments the <code>current_count</code> by 1.</li>
<li>If the <code>current_count</code> exceeds the specified limit, it adds the <code>current_part</code> to the <code>parts</code> list, resets the <code>current_part</code> and <code>current_count</code> to empty values, and continues with the next tokens.</li>
<li>Once all the tokens have been processed, the module checks if there is any remaining <code>current_part</code> and adds it to the <code>parts</code> list.</li>
<li>Finally, it converts each tokenized part back into text format by decoding the individual tokens and joins them together. The resulting text parts are stored in the <code>text_parts</code> list.</li>
<li>The module returns the <code>text_parts</code> list as the output.</li>
</ol>
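<p>The steps above can be sketched as follows. To keep the example dependency-free, a simple whitespace tokenizer stands in for a Tiktoken encoding; with Tiktoken you would call <code>encoding.encode</code> and <code>encoding.decode</code> instead of <code>split</code> and <code>join</code>:</p>

```python
def split_tokens_with_limit(tokens: list[str], limit: int) -> list[list[str]]:
    """Group a token sequence into parts of at most `limit` tokens each."""
    parts: list[list[str]] = []
    current_part: list[str] = []
    for token in tokens:
        current_part.append(token)
        if len(current_part) == limit:
            parts.append(current_part)
            current_part = []
    if current_part:  # leftover tokens form the last part
        parts.append(current_part)
    return parts

def split_string_with_limit(text: str, limit: int) -> list[str]:
    # Stand-in tokenizer: whitespace split; "decoding" is a plain join.
    parts = split_tokens_with_limit(text.split(), limit)
    return [" ".join(part) for part in parts]
```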
<h2>Example Usage</h2>
<p>To demonstrate the usage of the <code>split_string_with_limit</code> module, let's consider an example:</p>
<div class="highlight"><pre><span></span><code>python<span class="w"> </span>split_string_with_limit.py<span class="w"> </span>input_file.txt<span class="w"> </span><span class="m">100</span><span class="w"> </span>cl100k_base
</code></pre></div>
<p>In this example, we provide three arguments:</p>
<ol>
<li><code>input_file.txt</code>: The path to the input text file that contains the text to be split.</li>
<li><code>100</code>: The maximum number of tokens allowed per part. You can adjust this value based on your requirements.</li>
<li><code>cl100k_base</code>: The encoding name. This determines how the text will be tokenized. Tiktoken provides various encoding options, and you can experiment with different encodings to achieve the desired results.</li>
</ol>
<p>The module reads the text from the input file, tokenizes it using the specified encoding, and splits it into parts based on the token limit. The resulting text parts are then printed in a JSON format, providing a structured representation of the split text.</p>
<h2>Approximate approach</h2>
<p>While the <code>split_string_with_limit</code> module offers a convenient solution for splitting text based on a token limit, alternative approaches can achieve similar results. One such approach is a <strong>fixed-length split</strong>: instead of splitting based on the number of tokens, we can split the text into fixed-length segments by counting words or characters. One can use the <a href="https://platform.openai.com/tokenizer">rule of thumb</a>:</p>
<blockquote>
<p>A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).</p>
</blockquote>
<p>to approximate a split into parts of equal length without doing any actual tokenization.</p>
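<p>Applying that rule of thumb, an approximate fixed-length split needs no tokenizer at all:</p>

```python
def approx_split(text: str, token_limit: int, chars_per_token: int = 4) -> list[str]:
    """Split text into chunks of roughly `token_limit` tokens,
    using the ~4-characters-per-token heuristic for English text."""
    chunk_size = token_limit * chars_per_token
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

<p>Unlike the token-based module, this can cut a word in half at a chunk boundary; snapping each cut to the nearest whitespace is an easy refinement.</p>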
<h2>Conclusion</h2>
<p>In this blog post, we introduced the <code>split_string_with_limit</code> Python module, which leverages the power of the Tiktoken library to split a string into parts based on a specified token limit. We discussed the functionality of the module, its parameters, and how it can be used in practice. Furthermore, we explored alternative algorithms and approaches for splitting text based on the number of tokens. By combining the flexibility of Tiktoken and the convenience of the <code>split_string_with_limit</code> module, developers can efficiently process and analyze text data without compromising on accuracy or readability.</p>Demystifying Perplexity - Assessing Dimensionality Reduction With PCA2023-06-30T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-06-30:/demystifying-perplexity-assessing-dimensionality-reduction-with-pca/<p>Related: <a href="https://www.safjan.com/measure-quality-of-embeddings-intrinsic-vs-extrinsic/">Intrinsic vs. Extrinsic Evaluation - What's the Best Way to Measure Embedding Quality?</a></p>
<p>Perplexity is a measure commonly used in machine learning, particularly in the field of dimensionality reduction techniques such as Principal Component Analysis (PCA). It provides a way to evaluate and compare the quality of dimensionality reduction algorithms by quantifying how well they preserve the structure of the original data.</p>
<p>In this blog post, we will delve into the concept of perplexity, its application in PCA, and its importance in assessing the performance of dimensionality reduction methods. We will also provide code examples in Python to demonstrate its practical implementation.</p>
<h2>Understanding Perplexity</h2>
<p>Perplexity is a measure originally developed for evaluating probabilistic models, particularly in the field of natural language processing. It represents the level of uncertainty or confusion in predicting the next item in a sequence. In the context of dimensionality reduction, perplexity provides an estimation of the number of nearest neighbors that should be considered when reconstructing a data point in a lower-dimensional space.</p>
<p>Given a high-dimensional dataset, PCA aims to find a lower-dimensional representation that captures the most significant features or patterns of the original data. The idea behind perplexity is to preserve the local structure of the data, ensuring that neighboring points in the high-dimensional space remain close to each other in the reduced space.</p>
<h2>Perplexity in PCA</h2>
<p>To understand how perplexity is used in PCA, let's consider a high-dimensional dataset with 𝑁 data points. PCA involves projecting this dataset onto a lower-dimensional space while retaining the maximum amount of variance. The reduced dataset can be represented by 𝑀 principal components, where 𝑀 is smaller than the dimensionality of the original data.</p>
<p>In PCA, the perplexity of a data point 𝑥𝑖 is calculated based on the conditional probability distribution of its neighbors given a particular variance or similarity scale. This distribution can be modeled using a Gaussian kernel centered at 𝑥𝑖:</p>
<div class="math">$$
P(\mathbf{y}_j|\mathbf{x}_i) = \frac{{\exp\left(-\frac{{\|\mathbf{y}_j - \mathbf{x}_i\|^2}}{{2\sigma_i^2}}\right)}}{{\sum_{k\neq i}\exp\left(-\frac{{\|\mathbf{y}_k - \mathbf{x}_i\|^2}}{{2\sigma_i^2}}\right)}}
$$</div>
<p>Here, 𝑃(𝑦𝑗|𝑥𝑖) represents the conditional probability of observing data point 𝑦𝑗 as a neighbor of 𝑥𝑖 in the lower-dimensional space. The variance or similarity scale 𝜎𝑖 determines the spread of the Gaussian kernel for each data point 𝑥𝑖.</p>
<p>The perplexity of 𝑥𝑖, denoted as 𝑃𝑖, is then defined as the Shannon entropy of the conditional distribution:</p>
<div class="math">$$
P_i = 2^{-\sum_j P(\mathbf{y}_j|\mathbf{x}_i)\log_2 P(\mathbf{y}_j|\mathbf{x}_i)}
$$</div>
<p>In practice, finding the optimal variance scale 𝜎𝑖 that results in the desired perplexity can be challenging. One common approach is to perform a binary search to find the value of 𝜎𝑖 that achieves a target perplexity value. The binary search is performed by iteratively adjusting the value of 𝜎𝑖 until the entropy of the conditional distribution matches the target perplexity.</p>
<h2>Evaluating Dimensionality Reduction with Perplexity</h2>
<p>Perplexity is a crucial metric for evaluating the performance of dimensionality reduction techniques, including PCA. By preserving the local structure of the data, a good dimensionality reduction method should ensure that neighboring points remain close to each other in the lower-dimensional space.</p>
<p>To evaluate the effectiveness of a dimensionality reduction algorithm, we can compare the perplexity of the original high-dimensional data with the perplexity of the reduced data. If the perplexity remains similar after dimensionality reduction, it suggests that the algorithm successfully preserves the local structure of the data.</p>
<p>In practice, perplexity is often used in conjunction with other evaluation metrics, such as visualization techniques like t-SNE (t-Distributed Stochastic Neighbor Embedding). t-SNE is a nonlinear dimensionality reduction method that can be used to visualize high-dimensional data in two or three dimensions while preserving the local structure. By comparing the perplexity of t-SNE embeddings with the perplexity of the original data, we can gain insights into the quality of the dimensionality reduction.</p>
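<p>As an illustration, scikit-learn's t-SNE implementation exposes perplexity directly as a hyperparameter (the data below is random and purely illustrative):</p>

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.randn(200, 50)  # 200 points in 50 dimensions

# perplexity roughly sets the effective number of neighbors per point;
# it must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_embedded = tsne.fit_transform(X)
print(X_embedded.shape)
```

<p>Typical perplexity values lie between 5 and 50, and the resulting embedding can change noticeably across that range.</p>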
<h2>Implementation in Python</h2>
<p>Let's now demonstrate the calculation of perplexity and its application in evaluating dimensionality reduction using PCA in Python. We will use the scikit-learn library, which provides a simple and efficient implementation of PCA and other machine learning algorithms.</p>
<div class="highlight"><pre><span></span><code>import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import pairwise_distances


def perplexity(X, perplexity_value, n_iter=50, tol=1e-5):
    """Return the per-point perplexity 2**H(P_i) after a binary search
    for each point's kernel precision beta_i = 1 / (2 * sigma_i**2)."""
    N = X.shape[0]
    distances = pairwise_distances(X, metric='euclidean', squared=True)
    perplexities = np.zeros(N)
    for i in range(N):
        beta = 1.0
        beta_min, beta_max = -np.inf, np.inf
        for _ in range(n_iter):
            # Gaussian affinities to all other points (self excluded)
            affinities = np.exp(-distances[i] * beta)
            affinities[i] = 0.0
            p = affinities / np.sum(affinities)
            # Shannon entropy of the conditional distribution P(. | x_i)
            entropy = -np.sum(p[p &gt; 0] * np.log2(p[p &gt; 0]))
            perplexity_diff = entropy - np.log2(perplexity_value)
            if np.abs(perplexity_diff) &lt; tol:
                break
            if perplexity_diff &gt; 0:
                # Entropy too high: narrow the kernel (increase beta)
                beta_min = beta
                beta = beta * 2 if beta_max == np.inf else (beta + beta_max) / 2
            else:
                # Entropy too low: widen the kernel (decrease beta)
                beta_max = beta
                beta = beta / 2 if beta_min == -np.inf else (beta + beta_min) / 2
        perplexities[i] = 2 ** entropy
    return perplexities


# Generate random high-dimensional data
N = 1000
X = np.random.randn(N, 100)

# Apply PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Calculate per-point perplexity of the original and the reduced data
original_perplexity = perplexity(X, perplexity_value=30)
reduced_perplexity = perplexity(X_reduced, perplexity_value=30)

print("Perplexity of original data:", np.mean(original_perplexity))
print("Perplexity of reduced data:", np.mean(reduced_perplexity))
</code></pre></div>
<p>In the above example, we generate a random high-dimensional dataset using NumPy and apply PCA to reduce its dimensionality to 2. We then calculate the perplexity of the original data and the reduced data using the <code>perplexity</code> function. Finally, we print the mean perplexity values for comparison.</p>
<p>By examining the perplexity values, we can gain insights into how well PCA preserves the local structure of the data. If the perplexity of the reduced data is close to the perplexity of the original data, it suggests that PCA successfully retains the essential information during dimensionality reduction.</p>
<h2>Conclusion</h2>
<p>In this blog post, we explored the concept of perplexity in the context of dimensionality reduction, specifically in PCA. Perplexity provides a measure of the level of uncertainty or confusion in predicting the neighbors of a data point in a lower-dimensional space. It is used to assess the quality of dimensionality reduction algorithms by evaluating how well they preserve the local structure of the data.</p>
<p>We also provided a Python implementation to calculate perplexity and demonstrated its application in evaluating dimensionality reduction using PCA. By comparing the perplexity of the original data with the perplexity of the reduced data, we can assess the effectiveness of PCA in preserving the essential information.</p>
<p>Perplexity is a valuable tool in the evaluation and comparison of dimensionality reduction methods. It offers insights into the preservation of the local structure and can guide the selection of appropriate techniques for different datasets and applications.</p>
<p>See also:
<a href="https://distill.pub/2016/misread-tsne/">How to Use t-SNE Effectively</a></p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Understanding Bhattacharyya Distance and Coefficient for Probability Distributions2023-06-30T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-06-30:/understanding-bhattacharyya-distance-and-coefficient-for-probability-distributions/<h1>Introduction</h1>
<p>In the fields of statistics, machine learning, and data science, measuring the similarity between probability distributions is crucial for various tasks. One commonly used measure for this purpose is the <a href="https://en.wikipedia.org/wiki/Bhattacharyya_distance">Bhattacharyya distance</a>, which quantifies the dissimilarity between two distributions. The Bhattacharyya coefficient, on the other hand, provides a measure of the overlap between two statistical samples or populations. In this blog post, we will delve into the concepts of Bhattacharyya distance and coefficient, discuss their applications, and provide Python code examples for better understanding.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#bhattacharyya-distance">Bhattacharyya Distance</a></li>
<li><a href="#bhattacharyya-coefficient">Bhattacharyya Coefficient</a></li>
<li><a href="#applications-of-bhattacharyya-distance-and-coefficient">Applications of Bhattacharyya Distance and Coefficient</a></li>
<li><a href="#python-implementation">Python Implementation</a></li>
<li><a href="#other-metrics">Other metrics</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="bhattacharyya-distance"></a></p>
<h2>Bhattacharyya Distance</h2>
<p>The Bhattacharyya distance is a statistical measure that quantifies the similarity between two probability distributions. It is named after Anil Kumar Bhattacharyya, an Indian statistician who worked at the Indian Statistical Institute. The distance is defined for continuous probability distributions and is based on the Bhattacharyya coefficient (which we will discuss later).</p>
<p>Mathematically, the Bhattacharyya distance between two continuous probability density functions (PDFs) or discrete probability mass functions (PMFs) is defined as follows:</p>
<div class="math">$$
D_B(P,Q) = -\ln \left( BC(P,Q) \right) = -\ln \left( \sum_{i} \sqrt{P(i)Q(i)} \right)
$$</div>
<p>where \(P\) and \(Q\) are the probability distributions being compared, \(P(i)\) and \(Q(i)\) represent the probabilities of occurrence for the \(i\)-th event or sample, and \(BC(P,Q)\) denotes the Bhattacharyya coefficient.</p>
<p>The Bhattacharyya distance ranges from 0 to infinity, where 0 indicates perfect similarity and higher values indicate increasing dissimilarity. It is important to note that the Bhattacharyya distance is a symmetric measure, meaning \(D_B(P,Q) = D_B(Q,P)\).</p>
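<p>As a concrete reference case, when the two distributions are univariate normals \(P = \mathcal{N}(\mu_P, \sigma_P^2)\) and \(Q = \mathcal{N}(\mu_Q, \sigma_Q^2)\), the distance has a well-known closed form:</p>

```latex
D_B(P,Q) = \frac{1}{4}\ln\left(\frac{1}{4}\left(\frac{\sigma_P^2}{\sigma_Q^2} + \frac{\sigma_Q^2}{\sigma_P^2} + 2\right)\right) + \frac{1}{4}\,\frac{(\mu_P - \mu_Q)^2}{\sigma_P^2 + \sigma_Q^2}
```

<p>The first term grows when the variances differ and the second when the means separate; the distance vanishes only when the two Gaussians coincide.</p>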
<p><a id="bhattacharyya-coefficient"></a></p>
<h2>Bhattacharyya Coefficient</h2>
<p>The Bhattacharyya coefficient is a measure of overlap between two statistical samples or populations. It quantifies the similarity between two probability distributions and is often used as a precursor to computing the Bhattacharyya distance.</p>
<p>Mathematically, the Bhattacharyya coefficient between two discrete probability distributions can be calculated as follows:</p>
<div class="math">$$
BC(P,Q) = \sum_{i} \sqrt{P(i)Q(i)}
$$</div>
<p>For continuous probability distributions, the Bhattacharyya coefficient can be expressed as an integral:</p>
<div class="math">$$
BC(P,Q) = \int \sqrt{p(x) q(x)} \, dx
$$</div>
<p>where \(p(x)\) and \(q(x)\) represent the probability density functions (PDFs) of distributions \(P\) and \(Q\), respectively.</p>
<p>The Bhattacharyya coefficient ranges from 0 to 1, where 1 indicates complete overlap and 0 indicates no overlap. The coefficient measures the similarity of two distributions by taking into account the square root of the product of their probabilities at corresponding events or samples.</p>
<p><a id="applications-of-bhattacharyya-distance-and-coefficient"></a></p>
<h2>Applications of Bhattacharyya Distance and Coefficient</h2>
<ol>
<li>
<p>Pattern recognition: Bhattacharyya distance is often used to compare feature vectors or histograms in pattern recognition tasks. It helps in identifying similarities or dissimilarities between different classes or clusters of data.</p>
</li>
<li>
<p>Image processing: Bhattacharyya distance can be used to compare image histograms, aiding in tasks such as image segmentation, object recognition, and image retrieval.</p>
</li>
<li>
<p>Document classification: Bhattacharyya distance can be employed to measure the similarity between document feature vectors, enabling document clustering and categorization.</p>
</li>
</ol>
<p><a id="python-implementation"></a></p>
<h2>Python Implementation</h2>
<p>To demonstrate the computation of Bhattacharyya distance and coefficient, we will use the SciPy library in Python.</p>
<p>Let's assume we have two discrete probability distributions, \(P\) and \(Q\), represented as arrays.</p>
<div class="highlight"><pre><span></span><code>import numpy as np

# Probability distributions
P = np.array([0.2, 0.3, 0.1, 0.4])
Q = np.array([0.25, 0.15, 0.2, 0.4])

# Bhattacharyya coefficient: BC(P, Q) = sum_i sqrt(P(i) * Q(i))
bc = np.sum(np.sqrt(P * Q))

# Bhattacharyya distance: D_B(P, Q) = -ln(BC(P, Q))
db = -np.log(bc)

print(f"Bhattacharyya Distance: {db:.4f}")
print(f"Bhattacharyya Coefficient: {bc:.4f}")
</code></pre></div>
<p>Output:</p>
<div class="highlight"><pre><span></span><code>Bhattacharyya Distance: 0.0231
Bhattacharyya Coefficient: 0.9772
</code></pre></div>
<p>SciPy does not provide a built-in Bhattacharyya function, so in the snippet above we compute the coefficient directly with NumPy as the sum of \(\sqrt{P(i)Q(i)}\) over all events, and obtain the distance as its negative natural logarithm. The resulting values are printed, providing the measure of dissimilarity and overlap, respectively.</p>
<p><a id="other-metrics"></a></p>
<h2>Other metrics</h2>
<p>The Bhattacharyya distance metric has both similarities and differences compared to other related distance metrics used in statistics, machine learning, and data science. Let's explore the similarities and differences with some commonly used distance metrics.</p>
<table>
<thead>
<tr>
<th>Distance Metric</th>
<th>Similarities</th>
<th>Differences</th>
</tr>
</thead>
<tbody>
<tr>
<td>Euclidean</td>
<td>- Applicable to both continuous and discrete data.</td>
<td>- Measures geometric distance between points in a multi-dimensional space.<br>- Does not consider probability information of the data.</td>
</tr>
<tr>
<td>Manhattan</td>
<td>- Similar to Euclidean, applicable to both continuous and discrete data.</td>
<td>- Measures distance between points as the sum of absolute differences in their coordinates.<br>- Does not consider probability distributions.<br>- Not suitable for measuring similarity between distributions.</td>
</tr>
<tr>
<td>Kullback-Leibler</td>
<td>- Measures dissimilarity between probability distributions.</td>
<td>- Quantifies information lost when one distribution is used to approximate the other.<br>- Does not directly measure overlap or similarity between distributions.<br>- Asymmetric measure.</td>
</tr>
<tr>
<td>Jensen-Shannon</td>
<td>- Variation of KL divergence, measures dissimilarity between probability distributions.</td>
<td>- Calculates average of KL divergences between distributions and their average.<br>- Does not directly measure overlap or similarity between distributions.<br>- Symmetric measure.</td>
</tr>
<tr>
<td>Cosine Similarity</td>
<td>- Measures similarity between vector representations of data.</td>
<td>- Measures cosine of the angle between two vectors, indicating similarity in direction or orientation.<br>- Primarily used for comparing vector representations, such as text documents or high-dimensional feature vectors.<br>- Does not capture probabilistic nature of distributions or specifically designed for comparing probability distributions.</td>
</tr>
</tbody>
</table>
<p>In summary, the Bhattacharyya distance stands out as a measure explicitly designed for comparing probability distributions. It considers the probability information of the data and provides a measure of dissimilarity based on the overlap between distributions. Other distance metrics, such as Euclidean distance, Manhattan distance, KL divergence, Jensen-Shannon divergence, and cosine similarity, have different focuses and may not capture the probabilistic nature or similarity between distributions as effectively as the Bhattacharyya distance.</p>
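<p>To see how some of these metrics behave on the same pair of discrete distributions, here is a quick sketch using SciPy's <code>jensenshannon</code> (which returns the square root of the Jensen-Shannon divergence) and <code>rel_entr</code> (the element-wise terms of KL divergence); the distributions are the same illustrative ones used earlier:</p>

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.special import rel_entr

P = np.array([0.2, 0.3, 0.1, 0.4])
Q = np.array([0.25, 0.15, 0.2, 0.4])

bhattacharyya = -np.log(np.sum(np.sqrt(P * Q)))  # symmetric
kl_pq = np.sum(rel_entr(P, Q))                   # asymmetric: KL(P || Q)
kl_qp = np.sum(rel_entr(Q, P))                   # KL(Q || P) differs in general
js = jensenshannon(P, Q, base=2)                 # symmetric, sqrt of JS divergence

print(f"Bhattacharyya: {bhattacharyya:.4f}")
print(f"KL(P||Q): {kl_pq:.4f}  KL(Q||P): {kl_qp:.4f}")
print(f"Jensen-Shannon distance: {js:.4f}")
```

<p>Note the asymmetry of the two KL values, while the Bhattacharyya and Jensen-Shannon measures are each a single symmetric number.</p>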
<p><a id="conclusion"></a></p>
<h2>Conclusion</h2>
<p>The Bhattacharyya distance and coefficient are valuable tools for quantifying the similarity and dissimilarity between probability distributions. By leveraging these measures, we can compare distributions in various fields, including statistics, machine learning, and data science. Understanding and utilizing these concepts can aid in solving diverse tasks, such as pattern recognition, image processing, and document classification. Python implementations, as showcased in this blog post, allow for straightforward calculations and application of Bhattacharyya distance and coefficient to real-world scenarios.</p>
Script to Python Package Using Poetry (And PyCharm)2023-06-28T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-06-28:/script-to-python-package-using-poetry-and-pycharm/<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#the-task">The task</a></li>
<li><a href="#steps-for-package-creation">Steps for Package Creation</a></li>
<li><a href="#create-project-directory">Create Project Directory</a></li>
<li><a href="#open-the-project-in-pycharm">Open the Project in PyCharm</a></li>
<li><a href="#configure-poetry-virtual-environment">Configure Poetry Virtual Environment</a></li>
<li><a href="#install-dependencies">Install Dependencies</a></li>
<li><a href="#configure-pycharm-interpreter">Configure PyCharm Interpreter</a></li>
<li><a href="#initialize-git-repository">Initialize Git Repository</a></li>
<li><a href="#create-package-structure">Create Package Structure</a></li>
<li><a href="#move-script-and-files">Move Script and Files</a></li>
<li><a href="#create-__init__py">Create <code>__init__.py</code></a></li>
<li><a href="#update-pyprojecttoml">Update <code>pyproject.toml</code></a></li>
<li><a href="#add-readmemd-file">Add README.md file</a></li>
<li><a href="#test-the-script">Test the Script</a></li>
<li><a href="#package-the-project">Package the Project</a></li>
<li><a href="#publish-the-package">Publish the Package</a></li>
<li><a href="#versioning-and-updates">Versioning and Updates</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="the-task"></a></p>
<h2>The task</h2>
<p>Let's assume that you have a simple script that counts tokens in a provided text file. Below is the script, which accepts a single positional argument (the input file name) and can be run from the command-line interface (CLI). See also the note on <a href="https://safjan.com/how-to-count-tokens/">How to count tokens?</a></p>
<div class="highlight"><pre><span></span><code><span class="ch">#!/usr/bin/env python3</span>
<span class="kn">import</span> <span class="nn">argparse</span>
<span class="kn">import</span> <span class="nn">tiktoken</span>
<span class="k">def</span> <span class="nf">num_tokens_from_string</span><span class="p">(</span><span class="n">string</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">encoding_name</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s2">"cl100k_base"</span><span class="p">)</span> <span class="o">-></span> <span class="nb">int</span><span class="p">:</span>
<span class="w"> </span><span class="sd">"""Returns the number of tokens in a text string."""</span>
<span class="n">encoding</span> <span class="o">=</span> <span class="n">tiktoken</span><span class="o">.</span><span class="n">get_encoding</span><span class="p">(</span><span class="n">encoding_name</span><span class="p">)</span>
<span class="n">num_tokens</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">encoding</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">string</span><span class="p">))</span>
<span class="k">return</span> <span class="n">num_tokens</span>
<span class="k">def</span> <span class="nf">count_tokens</span><span class="p">(</span><span class="n">file_path</span><span class="p">):</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">file_path</span><span class="p">,</span> <span class="s2">"r"</span><span class="p">)</span> <span class="k">as</span> <span class="n">file</span><span class="p">:</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">file</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
<span class="k">return</span> <span class="n">num_tokens_from_string</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">"__main__"</span><span class="p">:</span>
<span class="n">parser</span> <span class="o">=</span> <span class="n">argparse</span><span class="o">.</span><span class="n">ArgumentParser</span><span class="p">(</span>
<span class="n">description</span><span class="o">=</span><span class="s2">"Count the number of tokens in a text file."</span>
<span class="p">)</span>
<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s2">"file"</span><span class="p">,</span> <span class="n">help</span><span class="o">=</span><span class="s2">"Path to the input text file"</span><span class="p">)</span>
<span class="n">args</span> <span class="o">=</span> <span class="n">parser</span><span class="o">.</span><span class="n">parse_args</span><span class="p">()</span>
<span class="n">file_path</span> <span class="o">=</span> <span class="n">args</span><span class="o">.</span><span class="n">file</span>
<span class="n">num_tokens</span> <span class="o">=</span> <span class="n">count_tokens</span><span class="p">(</span><span class="n">file_path</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Number of tokens: </span><span class="si">{</span><span class="n">num_tokens</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
</code></pre></div>
<p>In this script, the <code>argparse</code> module is used to handle command-line arguments. The script defines a single positional argument, <code>file</code>, which represents the file name of the input text file.</p>
<p>When the script is executed from the command line, it will parse the command-line arguments and retrieve the file path provided by the user. The <code>count_tokens</code> function will then be called with the file path, and the number of tokens will be printed.</p>
<p>To run the script from the CLI, use the following command:</p>
<div class="highlight"><pre><span></span><code>python<span class="w"> </span>script_name.py<span class="w"> </span>file_path
</code></pre></div>
<p>Replace <code>script_name.py</code> with the actual name of your script file, and <code>file_path</code> with the path to the input text file you want to analyze. The script will then tokenize the text file and print the number of tokens.</p>
<blockquote>
<p><strong>NOTE:</strong> you need the <code>tiktoken</code> package installed before running the script. You can install it using pip:</p>
</blockquote>
<div class="highlight"><pre><span></span><code>pip<span class="w"> </span>install<span class="w"> </span>tiktoken
</code></pre></div>
<p><a id="steps-for-package-creation"></a></p>
<h2>Steps for Package Creation</h2>
<p>To create and publish a Python package based on the provided script, you can follow the steps below:</p>
<p><a id="create-project-directory"></a></p>
<h3>Create Project Directory</h3>
<p>Start by creating a new directory for your project. You can choose an appropriate name for the directory.</p>
<ol>
<li><strong>Initialize the Project with Poetry</strong>: Open your command-line interface and navigate to the project directory you created. Run the following command to initialize the project using Poetry:</li>
</ol>
<div class="highlight"><pre><span></span><code>poetry<span class="w"> </span>init
</code></pre></div>
<p>This command will prompt you to fill in information about your package, such as the package name, version, description, author details, and more. Fill in the required information as prompted.</p>
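<p>After you answer the prompts, <code>poetry init</code> writes a <code>pyproject.toml</code> file. For the token-counting script it might look roughly like this (the package name, version constraints, and author below are illustrative, not prescribed):</p>

```toml
[tool.poetry]
name = "token-counter"            # illustrative name; pick your own
version = "0.1.0"
description = "Count tokens in a text file using tiktoken"
authors = ["Your Name <you@example.com>"]

[tool.poetry.dependencies]
python = "^3.9"
tiktoken = "*"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
```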
<p><a id="open-the-project-in-pycharm"></a></p>
<h3>Open the Project in PyCharm</h3>
<p>Open PyCharm and select "Open" from the welcome screen or go to "File" > "Open" and choose the project directory you created.</p>
<p><a id="configure-poetry-virtual-environment"></a></p>
<h3>Configure Poetry Virtual Environment</h3>
<p>When opening the project in PyCharm for the first time, it will detect the presence of Poetry. You will be prompted to either allow PyCharm to create a Poetry virtual environment or create it manually. Select the option to create the virtual environment.</p>
<p>If you already have a Poetry virtual environment set up manually, you can skip this step.</p>
<p><a id="install-dependencies"></a></p>
<h3>Install Dependencies</h3>
<p>In your command-line interface, navigate to the project directory if you're not already there. Run the following command to install the necessary dependencies using Poetry:</p>
<div class="highlight"><pre><span></span><code>poetry<span class="w"> </span>install
</code></pre></div>
<p>This command will create a virtual environment and install the required packages specified in your project's <code>pyproject.toml</code> file.</p>
<p><a id="configure-pycharm-interpreter"></a></p>
<h3>Configure PyCharm Interpreter</h3>
<p>In PyCharm, go to "File" > "Settings" > "Project: &lt;project_name&gt;" > "Python Interpreter". Click on the gear icon and choose "Add...".</p>
<p>Select "Poetry Environment" and choose the existing local Poetry interpreter associated with your project's virtual environment. Click "OK" to apply the changes.</p>
<p><a id="initialize-git-repository"></a></p>
<h3>Initialize Git Repository</h3>
<p>In your command-line interface, navigate to the project directory if you're not already there. Run the following command to initialize a Git repository:</p>
<div class="highlight"><pre><span></span><code>git<span class="w"> </span>init
</code></pre></div>
<p>This will set up a new Git repository for version control.</p>
<p>At this point, you have set up the project structure, initialized Poetry, configured the virtual environment in PyCharm, installed dependencies, and initialized a Git repository. Now, you can proceed with packaging and publishing your Python script.</p>
<blockquote>
<p>NOTE: you might want to add a <code>.gitignore</code> file at this stage. A minimal <code>.gitignore</code> can be:</p>
</blockquote>
<div class="highlight"><pre><span></span><code><span class="err">#</span><span class="w"> </span><span class="n">Compiled</span><span class="w"> </span><span class="n">Python</span><span class="w"> </span><span class="n">files</span>
<span class="n">__pycache__</span><span class="o">/</span>
<span class="o">*</span><span class="p">.</span><span class="n">py</span><span class="o">[</span><span class="n">cod</span><span class="o">]</span>
<span class="err">#</span><span class="w"> </span><span class="n">Distribution</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">packaging</span>
<span class="n">dist</span><span class="o">/</span>
<span class="n">build</span><span class="o">/</span>
<span class="o">*</span><span class="p">.</span><span class="n">egg</span><span class="o">-</span><span class="n">info</span><span class="o">/</span>
<span class="o">*</span><span class="p">.</span><span class="n">egg</span>
<span class="err">#</span><span class="w"> </span><span class="n">Virtual</span><span class="w"> </span><span class="n">environments</span>
<span class="n">venv</span><span class="o">/</span>
<span class="n">env</span><span class="o">/</span>
<span class="err">#</span><span class="w"> </span><span class="n">IDEs</span><span class="w"> </span><span class="ow">and</span><span class="w"> </span><span class="n">editors</span>
<span class="p">.</span><span class="n">idea</span><span class="o">/</span>
</code></pre></div>
<p><a id="create-package-structure"></a></p>
<h3>Create Package Structure</h3>
<p>Inside your project directory, create a package structure that follows Python's best practices. For example, you can create a directory named <code>my_package</code> that will contain your script and other necessary files.</p>
<p><a id="move-script-and-files"></a></p>
<h3>Move Script and Files</h3>
<p>Move your script file and any other relevant files into the package directory (<code>my_package</code> in this example).</p>
<p><a id="create-__init__py"></a></p>
<h3>Create <code>__init__.py</code></h3>
<p>Inside the package directory (<code>my_package</code>), create an empty file named <code>__init__.py</code>. This file is required to make the directory a Python package.</p>
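<p>At this point, the project layout should look roughly like this (assuming the script file is named <code>my_script.py</code>; <code>README.md</code> and <code>LICENSE</code> are added in later steps):</p>

```text
.
├── my_package/
│   ├── __init__.py
│   └── my_script.py
└── pyproject.toml
```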
<p><a id="update-pyprojecttoml"></a></p>
<h3>Update <code>pyproject.toml</code></h3>
<p>Open your project's <code>pyproject.toml</code> file. Under the <code>[tool.poetry]</code> section, add the script file and any additional files that need to be included in the package. For example:</p>
<div class="highlight"><pre><span></span><code><span class="k">[tool.poetry]</span>
<span class="p">...</span>
<span class="k">[tool.poetry.scripts]</span>
<span class="n">my_script</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'my_package.my_script:main'</span>
</code></pre></div>
<p>Replace <code>my_script</code> with the desired command name for your script, and <code>my_package.my_script:main</code> with the correct import path to your script and its main function.</p>
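<p>The original script runs its CLI logic directly under the <code>if __name__ == "__main__":</code> guard, so there is no <code>main</code> function for the entry point to call yet. A minimal sketch of how <code>my_package/my_script.py</code> could be restructured (the whitespace-based counter below is a stand-in so the sketch is self-contained; in your package, keep the <code>tiktoken</code>-based <code>count_tokens</code> shown earlier):</p>

```python
import argparse


def count_tokens(file_path):
    # Stand-in counter so this sketch is self-contained; the real package
    # keeps the tiktoken-based implementation from the original script.
    with open(file_path, "r") as file:
        return len(file.read().split())


def main():
    # All CLI logic lives in main() so the [tool.poetry.scripts] entry
    # point ("my_script = 'my_package.my_script:main'") has a callable.
    parser = argparse.ArgumentParser(
        description="Count the number of tokens in a text file."
    )
    parser.add_argument("file", help="Path to the input text file")
    args = parser.parse_args()
    print(f"Number of tokens: {count_tokens(args.file)}")
```

<p>If you also want to run the file directly with <code>python my_script.py</code>, keep an <code>if __name__ == "__main__": main()</code> block at the bottom.</p>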
<p><a id="add-readmemd-file"></a></p>
<h3>Add README.md file</h3>
<p>In the root of the project directory, create a <code>README.md</code> file and fill it with useful information: what the package does, how to install it, and how to use it. See also the note on writing a good README.</p>
<blockquote>
<p>NOTE: You can add some badges related to your PyPI package, e.g.:</p>
</blockquote>
<div class="highlight"><pre><span></span><code><span class="err">!</span><span class="o">[</span><span class="n">img</span><span class="o">]</span><span class="p">(</span><span class="nl">https</span><span class="p">:</span><span class="o">//</span><span class="n">img</span><span class="p">.</span><span class="n">shields</span><span class="p">.</span><span class="n">io</span><span class="o">/</span><span class="n">pypi</span><span class="o">/</span><span class="n">v</span><span class="o">/</span><span class="n">package_name</span><span class="p">.</span><span class="n">svg</span><span class="p">)</span>
<span class="err">![]</span><span class="p">(</span><span class="nl">https</span><span class="p">:</span><span class="o">//</span><span class="n">img</span><span class="p">.</span><span class="n">shields</span><span class="p">.</span><span class="n">io</span><span class="o">/</span><span class="n">pypi</span><span class="o">/</span><span class="n">pyversions</span><span class="o">/</span><span class="n">package_name</span><span class="p">.</span><span class="n">svg</span><span class="p">)</span>
<span class="err">![]</span><span class="p">(</span><span class="nl">https</span><span class="p">:</span><span class="o">//</span><span class="n">img</span><span class="p">.</span><span class="n">shields</span><span class="p">.</span><span class="n">io</span><span class="o">/</span><span class="n">pypi</span><span class="o">/</span><span class="n">dm</span><span class="o">/</span><span class="n">package_name</span><span class="p">.</span><span class="n">svg</span><span class="p">)</span>
</code></pre></div>
<h3>Add LICENSE file</h3>
<p>You can create a LICENSE file manually. Here's how you can do it:</p>
<ol>
<li>Create a new file in your project root directory named <code>LICENSE</code>.</li>
<li>Go to the <a href="https://opensource.org/licenses/MIT">MIT License template</a>, copy the text.</li>
<li>Paste the copied text into your <code>LICENSE</code> file.</li>
<li>Replace <code>[year]</code> with the current year and <code>[fullname]</code> with your name or your organization's name.</li>
<li>Save the file.</li>
</ol>
<p><a id="test-the-script"></a></p>
<h3>Test the Script</h3>
<p>Before publishing your package, it's essential to test your script to ensure it works as expected. You can execute the script locally to verify its functionality.</p>
<p>If you want to use pytest for testing, add it as a development dependency and install:</p>
<div class="highlight"><pre><span></span><code>poetry<span class="w"> </span>add<span class="w"> </span>--group<span class="w"> </span>dev<span class="w"> </span>pytest
</code></pre></div>
<p><a id="package-the-project"></a></p>
<h3>Package the Project</h3>
<p>In your command-line interface, navigate to the project directory. Run the following command to create a distributable package:</p>
<div class="highlight"><pre><span></span><code>poetry<span class="w"> </span>build
</code></pre></div>
<p>This command will generate a distributable package (e.g., a <code>.tar.gz</code> file) in the <code>dist</code> directory within your project.</p>
<p><a id="publish-the-package"></a></p>
<h3>Publish the Package</h3>
<p>To publish your package, you can use a package index such as PyPI (Python Package Index). First, you need to create an account on PyPI if you haven't already. Once you have an account, run the following command to publish your package:</p>
<div class="highlight"><pre><span></span><code>poetry<span class="w"> </span>publish
</code></pre></div>
<p>This command will guide you through the process of publishing your package to PyPI. You'll be prompted to enter your PyPI credentials and confirm the publication.</p>
<blockquote>
<p><strong>Note:</strong> Make sure your package has a unique name to avoid conflicts with existing packages on PyPI.</p>
</blockquote>
<p><a id="versioning-and-updates"></a></p>
<h3>Versioning and Updates</h3>
<p>When you make updates to your package, remember to increment the <code>version</code> field under the <code>[tool.poetry]</code> section of the <code>pyproject.toml</code> file. This helps to track and manage different versions of your package.</p>
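<p>Poetry can also bump the version for you, so you do not have to edit <code>pyproject.toml</code> by hand. A typical update flow might look like this (assuming Poetry 1.x; <code>patch</code> can be replaced with <code>minor</code> or <code>major</code>):</p>

```shell
poetry version patch   # e.g. 0.1.0 -> 0.1.1, written back to pyproject.toml
poetry build           # rebuild the distributable package
poetry publish         # upload the new version to PyPI
```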
<p>That's it! You have now packaged and published your Python script using Poetry. Users can install your package using pip and use your script as a command-line tool.</p>
<p>Please note that publishing a package is a significant step, and it's essential to review and test your code thoroughly before sharing it with others.</p>
<h2>Correcting metadata</h2>
<p>Before publishing, review the metadata in the <code>[tool.poetry]</code> section of <code>pyproject.toml</code>; at a minimum, fill in the author, keywords, and project URLs:</p>
<div class="highlight"><pre><span></span><code><span class="n">authors</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">[</span><span class="n">"Krystian Safjan <ksafjan@gmail.com>"</span><span class="o">]</span>
<span class="n">keywords</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">[</span><span class="n">"keyword1", "keyword2"</span><span class="o">]</span>
<span class="n">homepage</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="ss">"https://github.com/user/repo"</span>
<span class="n">repository</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="ss">"https://github.com/user/repo"</span>
<span class="n">documentation</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="ss">"https://github.com/user/repo"</span>
</code></pre></div>Bash - Rename Multiple Image Files to Match Pattern With Sequence Number2023-06-27T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-06-27:/bash-rename-mutliple-image-files-to-match-pattern-with-sequence-number/<p>The use case for the provided script is to rename multiple image files in a directory while maintaining their original file extensions. This script can be handy in situations where you have a collection of image files with different formats or extensions, and you want to standardize their names for better organization or consistency.</p>
<p>By executing the script, all image files with extensions such as <code>.jpg</code>, <code>.jpeg</code>, <code>.png</code>, <code>.gif</code>, <code>.tiff</code>, <code>.heic</code>, and <code>.heif</code> in the current directory will be renamed. The new names will follow the pattern "img_xxx.ext", where "xxx" represents a sequence number starting from 000, and "ext" represents the original file extension.</p>
<p>For example, if you have the following image files in the directory:</p>
<div class="highlight"><pre><span></span><code>photo1.jpg
picture.png
image2.jpeg
snapshot.tiff
capture.heic
</code></pre></div>
<p>Running the script will rename them as follows (files are processed extension by extension, in the order given in the brace expansion, so all <code>.jpg</code> files are numbered before <code>.jpeg</code>, then <code>.png</code>, and so on):</p>
<div class="highlight"><pre><span></span><code>img_000.jpg
img_001.jpeg
img_002.png
img_003.tiff
img_004.heic
</code></pre></div>
<p>This allows for consistent naming and easier identification of the image files in the directory.</p>
<p>Here's the Bash script that supports multiple image formats and preserves the original file extension while renaming the files:</p>
<div class="highlight"><pre><span></span><code><span class="ch">#!/bin/bash</span>
<span class="nv">counter</span><span class="o">=</span><span class="m">0</span>
<span class="k">for</span><span class="w"> </span>file<span class="w"> </span><span class="k">in</span><span class="w"> </span>*.<span class="o">{</span>jpg,jpeg,png,gif,tiff,heic,heif<span class="o">}</span><span class="p">;</span><span class="w"> </span><span class="k">do</span>
<span class="w">  </span><span class="k">if</span><span class="w"> </span><span class="o">[</span><span class="w"> </span>-f<span class="w"> </span><span class="s2">"</span><span class="nv">$file</span><span class="s2">"</span><span class="w"> </span><span class="o">]</span><span class="p">;</span><span class="w"> </span><span class="k">then</span>
<span class="w"> </span><span class="nv">extension</span><span class="o">=</span><span class="s2">"</span><span class="si">${</span><span class="nv">file</span><span class="p">##*.</span><span class="si">}</span><span class="s2">"</span>
<span class="w"> </span><span class="nv">newname</span><span class="o">=</span><span class="k">$(</span><span class="nb">printf</span><span class="w"> </span><span class="s2">"img_%03d.%s"</span><span class="w"> </span><span class="s2">"</span><span class="nv">$counter</span><span class="s2">"</span><span class="w"> </span><span class="s2">"</span><span class="nv">$extension</span><span class="s2">"</span><span class="k">)</span>
<span class="w"> </span>mv<span class="w"> </span><span class="s2">"</span><span class="nv">$file</span><span class="s2">"</span><span class="w"> </span><span class="s2">"</span><span class="nv">$newname</span><span class="s2">"</span>
<span class="w"> </span><span class="o">((</span>counter++<span class="o">))</span>
<span class="w"> </span><span class="k">fi</span>
<span class="k">done</span>
</code></pre></div>
<p>In this script:</p>
<ol>
<li>The <code>for</code> loop uses brace expansion <code>{}</code> to iterate over multiple file extensions: <code>jpg</code>, <code>jpeg</code>, <code>png</code>, <code>gif</code>, <code>tiff</code>, <code>heic</code>, and <code>heif</code>.</li>
<li>Inside the loop, the script checks if the current file is a regular file using the <code>-f</code> test.</li>
<li>If it's a regular file, it extracts the original file extension using the <code>${file##*.}</code> syntax.</li>
<li>The <code>newname</code> variable is generated using <code>printf</code> with the current value of the <code>counter</code> variable and the extracted extension.</li>
<li>Finally, the file is renamed using the <code>mv</code> command, preserving the original extension.</li>
</ol>
<p>To use this script, follow these steps:</p>
<ol>
<li>Open a text editor and paste the script into a new file.</li>
<li>Save the file with a <code>.sh</code> extension, for example, <code>rename_images.sh</code>.</li>
<li>Open a terminal and navigate to the directory where the image files are located.</li>
<li>Make the script executable by running the following command: <code>chmod +x rename_images.sh</code>.</li>
<li>Run the script using the command <code>./rename_images.sh</code>.</li>
</ol>
<p>After running the script, all the image files in the directory should be renamed according to the pattern you specified.</p>
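<p>Renaming is destructive, so it can be worth previewing the result first. A dry-run sketch of the same loop (it prints the planned renames instead of executing <code>mv</code>; <code>counter=$((counter + 1))</code> is used here and behaves the same as <code>((counter++))</code>):</p>

```shell
#!/bin/bash
# Dry run: show what the rename script would do, without touching any files
counter=0
for file in *.{jpg,jpeg,png,gif,tiff,heic,heif}; do
    if [ -f "$file" ]; then
        extension="${file##*.}"
        printf 'would rename: %s -> img_%03d.%s\n' "$file" "$counter" "$extension"
        counter=$((counter + 1))
    fi
done
```

<p>Once the output looks right, run the real script from the previous section.</p>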
<h2>Oneliner</h2>
<p>Here's a one-liner Bash command that renames all image files in the current directory to match the pattern "img_xxx.jpg" where "xxx" is a sequence number starting from 000:</p>
<div class="highlight"><pre><span></span><code><span class="nv">counter</span><span class="o">=</span><span class="m">0</span><span class="p">;</span><span class="w"> </span><span class="k">for</span><span class="w"> </span>file<span class="w"> </span><span class="k">in</span><span class="w"> </span>*.jpg<span class="p">;</span><span class="w"> </span><span class="k">do</span><span class="w"> </span><span class="k">if</span><span class="w"> </span>-f<span class="w"> </span><span class="s2">"</span><span class="nv">$file</span><span class="s2">"</span><span class="p">;</span><span class="w"> </span><span class="k">then</span><span class="w"> </span><span class="nv">newname</span><span class="o">=</span><span class="k">$(</span><span class="nb">printf</span><span class="w"> </span><span class="s2">"img_%03d.jpg"</span><span class="w"> </span><span class="s2">"</span><span class="nv">$counter</span><span class="s2">"</span><span class="k">)</span><span class="p">;</span><span class="w"> </span>mv<span class="w"> </span><span class="s2">"</span><span class="nv">$file</span><span class="s2">"</span><span class="w"> </span><span class="s2">"</span><span class="nv">$newname</span><span class="s2">"</span><span class="p">;</span><span class="w"> </span><span class="o">((</span>counter++<span class="o">))</span><span class="p">;</span><span class="w"> </span><span class="k">fi</span><span class="p">;</span><span class="w"> </span><span class="k">done</span>
</code></pre></div>
<p>This command combines the same logic as the previous script into a single line. The <code>counter</code> variable is set to 0, and then the <code>for</code> loop iterates over the <code>.jpg</code> files in the directory. The rest of the logic remains the same.</p>
<p>To use this one-liner, open a terminal, navigate to the directory containing the image files, and run the command. The image files will be renamed accordingly.</p>
<p>To create a Bash alias for the one-liner version of the last script, you can add the following line to your <code>~/.bashrc</code> or <code>~/.bash_aliases</code> (<code>.zshrc</code> or <code>~/.zsh_aliases</code> if using zsh) file:</p>
<div class="highlight"><pre><span></span><code><span class="nb">alias</span><span class="w"> </span><span class="nv">rename_images</span><span class="o">=</span><span class="s1">'counter=0; for file in *.{jpg,jpeg,png,gif,tiff,heic,heif}; do if -f "$file"; then extension="${file##*.}"; newname=$(printf "img_%03d.%s" "$counter" "$extension"); mv "$file" "$newname"; ((counter++)); fi; done'</span>
</code></pre></div>
<p>Save the file and then run <code>source ~/.bashrc</code> or <code>source ~/.bash_aliases</code> to apply the changes.</p>
<p>Afterward, you can use the <code>rename_images</code> command in your terminal to execute the one-liner script and rename the image files in the current directory accordingly.</p>
<h2>Python version</h2>
<p>Here's a Python script that achieves the same functionality as the Bash script, renaming image files while preserving their original extensions:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">os</span>
<span class="n">counter</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">extensions</span> <span class="o">=</span> <span class="p">[</span><span class="s2">".jpg"</span><span class="p">,</span> <span class="s2">".jpeg"</span><span class="p">,</span> <span class="s2">".png"</span><span class="p">,</span> <span class="s2">".gif"</span><span class="p">,</span> <span class="s2">".tiff"</span><span class="p">,</span> <span class="s2">".heic"</span><span class="p">,</span> <span class="s2">".heif"</span><span class="p">]</span>
<span class="k">for</span> <span class="n">filename</span> <span class="ow">in</span> <span class="n">os</span><span class="o">.</span><span class="n">listdir</span><span class="p">(</span><span class="s2">"."</span><span class="p">):</span>
<span class="k">if</span> <span class="n">filename</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="nb">tuple</span><span class="p">(</span><span class="n">extensions</span><span class="p">))</span> <span class="ow">and</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">isfile</span><span class="p">(</span><span class="n">filename</span><span class="p">):</span>
<span class="n">file_parts</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">splitext</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span>
<span class="n">newname</span> <span class="o">=</span> <span class="sa">f</span><span class="s2">"img_</span><span class="si">{</span><span class="n">counter</span><span class="si">:</span><span class="s2">03d</span><span class="si">}{</span><span class="n">file_parts</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="si">}</span><span class="s2">"</span>
<span class="n">os</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="n">newname</span><span class="p">)</span>
<span class="n">counter</span> <span class="o">+=</span> <span class="mi">1</span>
</code></pre></div>
<p>In this Python script:</p>
<ol>
<li>The <code>counter</code> variable keeps track of the sequence number for renaming the files.</li>
<li>The <code>extensions</code> list contains the supported image extensions.</li>
<li>The script iterates over each file in the current directory using <code>os.listdir(".")</code>.</li>
<li>For each file, it checks if the filename has a matching extension and if it is a regular file.</li>
<li>If both conditions are satisfied, it splits the filename into base name and extension using <code>os.path.splitext()</code>.</li>
<li>The new name is constructed using the desired pattern "img_xxx.ext", where "xxx" is the zero-padded sequence number and "ext" is the original file extension, and <code>os.rename()</code> performs the renaming operation.</li>
<li>Finally, the <code>counter</code> is incremented for the next file.</li>
</ol>
<p>You can save this Python script to a file with a <code>.py</code> extension, for example, <code>rename_images.py</code>, and then run it using a Python interpreter. The image files in the directory will be renamed accordingly, following the specified pattern while preserving their original extensions.</p>Harnessing the Power of Dependency Injection for Improved Testability in Python2023-06-21T00:00:00+02:002023-06-21T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-06-21:/python-dependency-injection-for-the-testability/<p>Learn how to use dependency injection to decouple dependencies from our functions, methods, or classes, making it easier to test and maintain our code.</p><h2>Introduction</h2>
<p>In software development, testability is a crucial aspect that helps ensure the reliability and maintainability of our code. One effective technique for enhancing testability is dependency injection (DI). Dependency injection allows us to decouple dependencies from our functions, methods, or classes, making it easier to test and maintain our code. In this blog post, we will explore various techniques, use cases, and lesser-known tricks for using dependency injection in Python.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#what-is-dependency-injection">What is Dependency Injection?</a></li>
<li><a href="#benefits-of-dependency-injection">Benefits of Dependency Injection</a></li>
<li><a href="#techniques-for-dependency-injection">Techniques for Dependency Injection</a></li>
<li><a href="#constructor-injection">Constructor Injection</a></li>
<li><a href="#setter-injection">Setter Injection</a></li>
<li><a href="#interface-injection">Interface Injection</a></li>
<li><a href="#use-cases-for-dependency-injection">Use Cases for Dependency Injection</a></li>
<li><a href="#testing-legacy-code">Testing Legacy Code</a></li>
<li><a href="#mocking-dependencies">Mocking Dependencies</a></li>
<li><a href="#improving-code-reusability">Improving Code Reusability</a></li>
<li><a href="#parameter-injection">Parameter Injection</a></li>
<li><a href="#context-managers-and-dependency-injection">Context Managers and Dependency Injection</a></li>
<li><a href="#dependency-injection-containers">Dependency Injection Containers</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="what-is-dependency-injection"></a></p>
<h2>What is Dependency Injection?</h2>
<p>Dependency injection is a design pattern that allows us to inject dependencies into a class or function from external sources rather than creating them internally. By doing so, we reduce the coupling between components and make them more flexible, reusable, and testable.</p>
<p><a id="benefits-of-dependency-injection"></a></p>
<h2>Benefits of Dependency Injection</h2>
<ul>
<li><strong>Improved testability</strong>: By injecting dependencies, we can easily replace them with mocks or stubs during testing, making our tests more isolated and focused.</li>
<li><strong>Decoupled code</strong>: Dependency injection reduces the tight coupling between components, promoting better separation of concerns and modular design.</li>
<li><strong>Code reusability</strong>: With dependency injection, components become more reusable as they rely on abstractions rather than concrete implementations.</li>
<li><strong>Easier maintenance</strong>: By externalizing dependencies, we can modify or extend their behavior without affecting the code that uses them.</li>
</ul>
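<p>To make the testability benefit concrete, here is a minimal before/after sketch (the report service and stub clock are illustrative names, not part of any library):</p>

```python
import datetime

# Tightly coupled: the class creates its own dependency internally,
# so a test cannot substitute it.
class CoupledReportService:
    def header(self):
        return f"Report generated at {datetime.datetime.now()}"

# Decoupled via injection: the caller supplies the clock, so a test
# can inject a deterministic stub.
class InjectedReportService:
    def __init__(self, clock=datetime.datetime):
        self.clock = clock

    def header(self):
        return f"Report generated at {self.clock.now()}"

# A stub clock makes the output fully deterministic in tests.
class FixedClock:
    @staticmethod
    def now():
        return "2023-06-21 12:00:00"

service = InjectedReportService(clock=FixedClock)
print(service.header())  # Report generated at 2023-06-21 12:00:00
```

<p>The coupled version can only be tested against the real clock; the injected version accepts any object with a <code>now()</code> method.</p>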
<p><a id="techniques-for-dependency-injection"></a></p>
<h2>Techniques for Dependency Injection</h2>
<p><a id="constructor-injection"></a></p>
<h3>Constructor Injection</h3>
<p>Constructor injection involves passing dependencies through a class's constructor. It ensures that the required dependencies are available before an object is created.</p>
<p>Example:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">UserService</span><span class="p">:</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">user_repository</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">user_repository</span> <span class="o">=</span> <span class="n">user_repository</span>
<span class="k">def</span> <span class="nf">get_user</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">user_id</span><span class="p">):</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">user_repository</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">user_id</span><span class="p">)</span>
</code></pre></div>
<p><a id="setter-injection"></a></p>
<h3>Setter Injection</h3>
<p>Setter injection involves setting the dependencies using setter methods. This technique allows for more flexibility, as dependencies can be changed or updated after object initialization.</p>
<p>Example:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">NotificationService</span><span class="p">:</span>
<span class="k">def</span> <span class="nf">set_email_service</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">email_service</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">email_service</span> <span class="o">=</span> <span class="n">email_service</span>
<span class="k">def</span> <span class="nf">send_notification</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">user</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">email_service</span><span class="o">.</span><span class="n">send</span><span class="p">(</span><span class="n">user</span><span class="o">.</span><span class="n">email</span><span class="p">,</span> <span class="s2">"New notification!"</span><span class="p">)</span>
</code></pre></div>
<p><a id="interface-injection"></a></p>
<h3>Interface Injection</h3>
<p>Interface injection is a technique where a dependency is injected by providing an interface or an abstract base class. This allows for the injection of different implementations of the same interface, providing flexibility and extensibility.</p>
<p>Example:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">abc</span> <span class="kn">import</span> <span class="n">ABC</span><span class="p">,</span> <span class="n">abstractmethod</span>
<span class="k">class</span> <span class="nc">Database</span><span class="p">(</span><span class="n">ABC</span><span class="p">):</span>
<span class="nd">@abstractmethod</span>
<span class="k">def</span> <span class="nf">query</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">):</span>
<span class="k">pass</span>
<span class="k">class</span> <span class="nc">MySQLDatabase</span><span class="p">(</span><span class="n">Database</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">query</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">):</span>
<span class="c1"># Perform MySQL query</span>
<span class="k">pass</span>
<span class="k">class</span> <span class="nc">PostgresDatabase</span><span class="p">(</span><span class="n">Database</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">query</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">query</span><span class="p">):</span>
<span class="c1"># Perform Postgres query</span>
<span class="k">pass</span>
</code></pre></div>
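<p>To see the interface actually being injected, here is a runnable sketch of the classes above together with a hypothetical <code>ReportGenerator</code> consumer (the stubbed query bodies stand in for real database calls):</p>

```python
from abc import ABC, abstractmethod

class Database(ABC):
    @abstractmethod
    def query(self, query):
        pass

class MySQLDatabase(Database):
    def query(self, query):
        return f"mysql: {query}"  # stand-in for a real MySQL query

class PostgresDatabase(Database):
    def query(self, query):
        return f"postgres: {query}"  # stand-in for a real Postgres query

# The consumer depends only on the Database interface, so either
# implementation can be injected without changing this class.
class ReportGenerator:
    def __init__(self, database: Database):
        self.database = database

    def generate(self):
        return self.database.query("SELECT * FROM reports")

print(ReportGenerator(MySQLDatabase()).generate())     # mysql: SELECT * FROM reports
print(ReportGenerator(PostgresDatabase()).generate())  # postgres: SELECT * FROM reports
```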
<p><a id="use-cases-for-dependency-injection"></a></p>
<h2>Use Cases for Dependency Injection</h2>
<p><a id="testing-legacy-code"></a></p>
<h3>Testing Legacy Code</h3>
<p>When working with legacy code that has tightly coupled dependencies, dependency injection can be used to introduce testability by replacing or mocking those dependencies during testing.</p>
<p>Example:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">legacy_function</span><span class="p">(</span><span class="n">db_connection</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
    <span class="c1"># Default preserves the original, tightly coupled behaviour</span>
    <span class="k">if</span> <span class="n">db_connection</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
        <span class="n">db_connection</span> <span class="o">=</span> <span class="n">MySQLDatabase</span><span class="p">()</span>  <span class="c1"># Tightly coupled dependency</span>
    <span class="c1"># ...</span>

<span class="c1"># Using dependency injection to test legacy_function</span>
<span class="k">def</span> <span class="nf">test_legacy_function</span><span class="p">():</span>
    <span class="n">mock_db</span> <span class="o">=</span> <span class="n">MockMySQLDatabase</span><span class="p">()</span>
    <span class="n">legacy_function</span><span class="p">(</span><span class="n">db_connection</span><span class="o">=</span><span class="n">mock_db</span><span class="p">)</span>
    <span class="c1"># Test the function</span>
</code></pre></div>
<p><a id="mocking-dependencies"></a></p>
<h3>Mocking Dependencies</h3>
<p>In unit testing, dependency injection allows us to replace real dependencies with mock objects, enabling us to focus on testing the behavior of the unit under test in isolation.</p>
<p>Example:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">UserService</span><span class="p">:</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">user_repository</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">user_repository</span> <span class="o">=</span> <span class="n">user_repository</span>
<span class="k">def</span> <span class="nf">get_user</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">user_id</span><span class="p">):</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">user_repository</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">user_id</span><span class="p">)</span>
<span class="c1"># Testing UserService with a mock user repository</span>
<span class="k">def</span> <span class="nf">test_get_user</span><span class="p">():</span>
<span class="n">mock_repository</span> <span class="o">=</span> <span class="n">MockUserRepository</span><span class="p">()</span>
<span class="n">service</span> <span class="o">=</span> <span class="n">UserService</span><span class="p">(</span><span class="n">user_repository</span><span class="o">=</span><span class="n">mock_repository</span><span class="p">)</span>
<span class="c1"># Test the method using the mock repository</span>
</code></pre></div>
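<p>The <code>MockUserRepository</code> above is left abstract; with the standard library's <code>unittest.mock</code> you can build such a test double directly. A minimal sketch:</p>

```python
from unittest.mock import Mock

class UserService:
    def __init__(self, user_repository):
        self.user_repository = user_repository

    def get_user(self, user_id):
        return self.user_repository.get(user_id)

# Mock stands in for the repository; return_value fixes what get() returns.
mock_repository = Mock()
mock_repository.get.return_value = {"id": 42, "name": "Alice"}

service = UserService(user_repository=mock_repository)
user = service.get_user(42)

assert user == {"id": 42, "name": "Alice"}
# The mock also records how it was called, so collaboration can be verified.
mock_repository.get.assert_called_once_with(42)
```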
<p><a id="improving-code-reusability"></a></p>
<h3>Improving Code Reusability</h3>
<p>Dependency injection promotes code reusability by relying on abstractions rather than concrete implementations. This allows different implementations to be injected based on specific requirements.</p>
<p>Example:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">PaymentGateway</span><span class="p">(</span><span class="n">ABC</span><span class="p">):</span>
<span class="nd">@abstractmethod</span>
<span class="k">def</span> <span class="nf">process_payment</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">amount</span><span class="p">):</span>
<span class="k">pass</span>
<span class="k">class</span> <span class="nc">PayPalGateway</span><span class="p">(</span><span class="n">PaymentGateway</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">process_payment</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">amount</span><span class="p">):</span>
<span class="c1"># Process payment via PayPal</span>
<span class="k">pass</span>
<span class="k">class</span> <span class="nc">StripeGateway</span><span class="p">(</span><span class="n">PaymentGateway</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">process_payment</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">amount</span><span class="p">):</span>
<span class="c1"># Process payment via Stripe</span>
<span class="k">pass</span>
</code></pre></div>
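<p>A short usage sketch for the gateways above (the <code>checkout</code> helper and the stubbed return values are illustrative, not a real payment API):</p>

```python
from abc import ABC, abstractmethod

class PaymentGateway(ABC):
    @abstractmethod
    def process_payment(self, amount):
        pass

class PayPalGateway(PaymentGateway):
    def process_payment(self, amount):
        return f"paypal charged {amount}"  # stand-in for a real API call

class StripeGateway(PaymentGateway):
    def process_payment(self, amount):
        return f"stripe charged {amount}"  # stand-in for a real API call

# checkout() is written once against the abstraction and reused with
# whichever gateway is injected.
def checkout(gateway: PaymentGateway, amount):
    return gateway.process_payment(amount)

print(checkout(PayPalGateway(), 100))  # paypal charged 100
print(checkout(StripeGateway(), 50))   # stripe charged 50
```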
<h2>Lesser-Known Techniques and Tricks</h2>
<p><a id="parameter-injection"></a></p>
<h3>Parameter Injection</h3>
<p>In addition to constructor, setter, and interface injection, parameter injection is a technique where dependencies are passed directly as parameters to functions or methods. This can be useful in situations where direct injection is preferred over using class instances.</p>
<p>Example:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">process_data</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">logger</span><span class="p">):</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">"Processing data..."</span><span class="p">)</span>
<span class="c1"># Process the data</span>
<span class="c1"># Calling the function with injected dependencies</span>
<span class="n">logger</span> <span class="o">=</span> <span class="n">Logger</span><span class="p">()</span>
<span class="n">process_data</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">logger</span><span class="p">)</span>
</code></pre></div>
<p><a id="context-managers-and-dependency-injection"></a></p>
<h3>Context Managers and Dependency Injection</h3>
<p>Context managers can be combined with dependency injection to provide resources or dependencies within a specific scope, ensuring their proper initialization and cleanup.</p>
<p>Example:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">contextlib</span> <span class="kn">import</span> <span class="n">contextmanager</span>
<span class="nd">@contextmanager</span>
<span class="k">def</span> <span class="nf">db_connection</span><span class="p">():</span>
<span class="n">connection</span> <span class="o">=</span> <span class="n">MySQLDatabase</span><span class="p">()</span> <span class="c1"># Dependency initialization</span>
<span class="k">yield</span> <span class="n">connection</span>
<span class="n">connection</span><span class="o">.</span><span class="n">close</span><span class="p">()</span> <span class="c1"># Cleanup</span>
<span class="c1"># Using the context manager with dependency injection</span>
<span class="k">with</span> <span class="n">db_connection</span><span class="p">()</span> <span class="k">as</span> <span class="n">db</span><span class="p">:</span>
<span class="c1"># Use the database connection within the context</span>
</code></pre></div>
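<p>A variant of the idea above in which the dependency itself is injected into the context manager as a factory (the <code>FakeConnection</code> test double and the <code>managed_connection</code> name are illustrative). Wrapping the <code>yield</code> in <code>try</code>/<code>finally</code> guarantees cleanup even when the body raises:</p>

```python
from contextlib import contextmanager

class FakeConnection:
    """Test double standing in for a real database connection."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

# The factory is injected, so tests supply FakeConnection while
# production code supplies a real connection class.
@contextmanager
def managed_connection(connection_factory):
    connection = connection_factory()
    try:
        yield connection
    finally:
        connection.close()  # cleanup runs even if the body raises

with managed_connection(FakeConnection) as conn:
    assert not conn.closed  # open inside the context
assert conn.closed          # closed automatically on exit
```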
<p><a id="dependency-injection-containers"></a></p>
<h3>Dependency Injection Containers</h3>
<p>Dependency injection containers or frameworks provide a centralized way to manage dependencies, their configurations, and their lifetime. Popular Python DI libraries include <code>injector</code>, <code>dependency-injector</code>, and <code>inject</code>. Note that <code>injector</code> resolves dependencies from constructor type hints.</p>
<p>Example:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">injector</span> <span class="kn">import</span> <span class="n">inject</span><span class="p">,</span> <span class="n">Injector</span>

<span class="k">class</span> <span class="nc">UserRepository</span><span class="p">:</span>
    <span class="k">pass</span>  <span class="c1"># a concrete class the injector can construct on its own</span>

<span class="k">class</span> <span class="nc">UserService</span><span class="p">:</span>
    <span class="nd">@inject</span>
    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">user_repository</span><span class="p">:</span> <span class="n">UserRepository</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">user_repository</span> <span class="o">=</span> <span class="n">user_repository</span>

<span class="c1"># Creating and using an injector; the type hint tells it what to build</span>
<span class="n">injector</span> <span class="o">=</span> <span class="n">Injector</span><span class="p">()</span>
<span class="n">user_service</span> <span class="o">=</span> <span class="n">injector</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">UserService</span><span class="p">)</span>
</code></pre></div>
<p><a id="conclusion"></a></p>
<h2>Conclusion</h2>
<p>Dependency injection is a powerful technique for improving testability, code modularity, and reusability in Python. By applying various injection techniques and exploring different use cases, you can design more robust and maintainable code. Additionally, the lesser-known tricks and techniques covered in this blog post can further enhance your understanding and application of dependency injection in various scenarios.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>Efficient Workflow for Reviewing Changes in Git before Pulling from Remote Branch2023-06-20T00:00:00+02:002023-06-20T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-06-20:/git-workflow-reviewing-changes-before-pulling-remote-branch/<h2>Introduction</h2>
<p>When working with Git, it is essential to have a streamlined workflow that ensures you <strong>review the changes made by others</strong> before pulling them into your local branch. This practice helps <strong>prevent conflicts</strong> and ensures that your local repository remains in …</p><h2>Introduction</h2>
<p>When working with Git, it is essential to have a streamlined workflow that ensures you <strong>review the changes made by others</strong> before pulling them into your local branch. This practice helps <strong>prevent conflicts</strong> and ensures that your local repository remains in sync with the remote branch. In this blog post, we will outline a few simple steps to check the changes introduced by others in the remote branch before performing a <code>git pull</code>.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#step-1-fetch-remote-changes">Step 1: Fetch Remote Changes</a></li>
<li><a href="#step-2-inspect-remote-branch">Step 2: Inspect Remote Branch</a></li>
<li><a href="#step-3-review-changes">Step 3: Review Changes</a></li>
<li><a href="#step-4-resolve-conflicts-if-any">Step 4: Resolve Conflicts (if any)</a></li>
<li><a href="#step-5-pull-changes">Step 5: Pull Changes</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="step-1-fetch-remote-changes"></a></p>
<h3>Step 1: Fetch Remote Changes</h3>
<p>Before reviewing any changes, it is crucial to fetch the latest updates from the remote repository. This step ensures that your local repository has the most up-to-date information. To fetch changes, run the following command:</p>
<div class="highlight"><pre><span></span><code>git<span class="w"> </span>fetch
</code></pre></div>
<p>This command retrieves all the latest changes from the remote repository without automatically merging them into your local branch.</p>
<p><a id="step-2-inspect-remote-branch"></a></p>
<h3>Step 2: Inspect Remote Branch</h3>
<p>After fetching the remote changes, you can inspect the remote branch to see the modifications made by others. This step helps you understand the nature and scope of the changes before merging them into your branch. To view only the commits that exist on the remote branch but not yet in your local branch, use the following command:</p>
<div class="highlight"><pre><span></span><code>git log ..origin/branch-name
</code></pre></div>
<p>Replace <code>branch-name</code> with the name of the remote branch you want to review. This command displays the commits you have not yet pulled, showing the commit hash, author, timestamp, and commit message. (Plain <code>git log origin/branch-name</code> would list the branch's entire history, not just the new commits.)</p>
<p><a id="step-3-review-changes"></a></p>
<h3>Step 3: Review Changes</h3>
<p>Now that you have a clear view of the commits in the remote branch, it's time to review the changes introduced. There are several ways to inspect the individual commits, depending on your preferred Git tooling. Here are a few common options:</p>
<h4>Option 1: Using Git Show</h4>
<p>To review the changes introduced by a specific commit, you can use the <code>git show</code> command. Run the following command, replacing <code>commit-hash</code> with the actual commit hash you want to inspect:</p>
<div class="highlight"><pre><span></span><code>git<span class="w"> </span>show<span class="w"> </span>commit-hash
</code></pre></div>
<p>This command displays the commit metadata together with a detailed diff of the changes introduced by that commit, allowing you to analyze the modifications line by line. (Note that <code>git diff commit-hash</code> would instead compare that commit against your working tree.)</p>
<h4>Option 2: Utilizing Visual Git Tools</h4>
<p>If you prefer a more visual representation of changes, you can leverage Git GUI tools like GitKraken, Sourcetree, Git Cola or tig. These tools provide an intuitive interface that allows you to navigate through commits, inspect changes, and even visualize branching patterns.</p>
<h5>tig</h5>
<p><code>tig test..master</code> - Show difference between two branches <code>test</code> and <code>master</code></p>
<p><a id="step-4-resolve-conflicts-if-any"></a></p>
<h3>Step 4: Resolve Conflicts (if any)</h3>
<p>During your review, you may encounter conflicts between the changes made by others and your local modifications. Conflicts arise when Git cannot automatically merge two sets of changes. If conflicts occur, it is crucial to resolve them before pulling the changes into your branch.</p>
<p>To resolve conflicts, you can use Git's built-in merge tools or a visual Git tool like those mentioned earlier. These tools provide a side-by-side view of conflicting changes, enabling you to choose which modifications to keep and how to combine them effectively.</p>
<p><a id="step-5-pull-changes"></a></p>
<h3>Step 5: Pull Changes</h3>
<p>After reviewing the changes, ensuring there are no conflicts or addressing any conflicts that arise, you can proceed with pulling the changes from the remote branch into your local branch. To pull the changes, use the following command:</p>
<div class="highlight"><pre><span></span><code>git<span class="w"> </span>pull<span class="w"> </span>origin<span class="w"> </span>branch-name
</code></pre></div>
<p>Replace <code>branch-name</code> with the name of the remote branch from which you want to pull the changes. This command automatically merges the changes into your branch, keeping your local repository up to date.</p>
<p><a id="conclusion"></a></p>
<h2>Conclusion</h2>
<p>In this blog post, we discussed a streamlined workflow for reviewing changes in Git before pulling them from a remote branch. By following these steps, you can ensure that you have a clear understanding of the modifications introduced by others, address conflicts if necessary, and maintain a synchronized local repository. Adopting this workflow will help you avoid potential conflicts and keep your local branch in sync with the remote.</p>Extracting Keywords From the User Query2023-06-09T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-06-09:/extracting-keywords-from-the-user-query/<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#rule-based-approach">Rule-Based Approach</a></li>
<li><a href="#linguistic-analysis">Linguistic Analysis</a></li>
<li><a href="#machine-learning-ml-and-statistical-methods">Machine Learning (ML) and Statistical Methods</a></li>
<li><a href="#hybrid-approaches">Hybrid Approaches</a></li>
<li><a href="#what-about-using-large-language-models">What about using (large) language models?</a></li>
<li><a href="#pros">Pros</a></li>
<li><a href="#cons">Cons</a></li>
<li><a href="#more-on-machine-learning-and-statistical-methods-for-keywords-extraction">More on Machine Learning and Statistical Methods for Keyword Extraction</a></li>
<li><a href="#exemplary-implementation">Exemplary implementation</a></li>
</ul>
<!-- /MarkdownTOC -->
<p>When it comes to extracting keywords or key terms from …</p><!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#rule-based-approach">Rule-Based Approach</a></li>
<li><a href="#linguistic-analysis">Linguistic Analysis</a></li>
<li><a href="#machine-learning-ml-and-statistical-methods">Machine Learning (ML) and Statistical Methods</a></li>
<li><a href="#hybrid-approaches">Hybrid Approaches</a></li>
<li><a href="#what-about-using-large-language-models">What about using (large) language models?</a></li>
<li><a href="#pros">Pros</a></li>
<li><a href="#cons">Cons</a></li>
<li><a href="#more-on-machine-learning-and-statistical-methods-for-keywords-extraction">More on Machine Learning and Statistical Methods for Keyword Extraction</a></li>
<li><a href="#exemplary-implementation">Exemplary implementation</a></li>
</ul>
<!-- /MarkdownTOC -->
<p>When it comes to extracting keywords or key terms from a user query, there are several approaches that can be used. Each approach has its own set of pros and cons, which I will discuss below:</p>
<p><a id="rule-based-approach"></a></p>
<h2>Rule-Based Approach</h2>
<ul>
<li><strong>Pros</strong>: This approach involves defining a set of rules or patterns to identify keywords based on specific criteria. It can be effective for simple queries and known patterns, allowing for precise keyword extraction.</li>
<li><strong>Cons</strong>: Rule-based approaches can be limited in their flexibility and scalability. They require manual effort to create and maintain the rules, making them less suitable for handling complex or evolving queries. Additionally, they may not perform well when faced with ambiguous or unstructured input.</li>
</ul>
<p><a id="linguistic-analysis"></a></p>
<h2>Linguistic Analysis</h2>
<ul>
<li><strong>Pros</strong>: Linguistic analysis techniques utilize natural language processing (NLP) algorithms to analyze the grammatical structure and semantics of a query. By considering parts of speech, syntactic relationships, and semantic associations, they can extract relevant keywords effectively.</li>
<li><strong>Cons</strong>: This approach can be computationally expensive and may require substantial linguistic resources such as parsers, lexicons, and ontologies. Handling languages with complex grammar or processing highly contextual queries can be challenging. It might also struggle with ambiguous phrases or idiomatic expressions.</li>
</ul>
<p><a id="machine-learning-ml-and-statistical-methods"></a></p>
<h2>Machine Learning (ML) and Statistical Methods</h2>
<ul>
<li><strong>Pros</strong>: ML techniques, such as supervised or unsupervised learning, can automatically learn patterns and extract keywords based on training data. They can adapt to different query types and improve over time with more data. Statistical methods, such as term frequency-inverse document frequency (TF-IDF), can also identify important keywords based on their prevalence and relevance within a dataset.</li>
<li><strong>Cons</strong>: Building ML models requires labeled training data, which can be time-consuming and expensive to create. Models may struggle with rare or domain-specific queries if not adequately trained. They can also be susceptible to biases present in the training data, and their performance may degrade when faced with queries significantly different from the training distribution.</li>
</ul>
<p><a id="hybrid-approaches"></a></p>
<h2>Hybrid Approaches</h2>
<ul>
<li><strong>Pros</strong>: Hybrid approaches combine multiple techniques, leveraging the strengths of each to improve keyword extraction. For example, combining rule-based methods with ML models can enhance accuracy and handle a wider range of queries.</li>
<li><strong>Cons</strong>: Designing and implementing hybrid approaches can be complex and require expertise in multiple areas. Combining different techniques may introduce additional computational overhead, impacting performance and response time.</li>
</ul>
<p>It's important to note that the effectiveness of these approaches can vary depending on factors such as the nature of the queries, available resources, and the desired level of accuracy. A well-designed solution often involves a combination of techniques to achieve the best results.</p>
<p><a id="what-about-using-large-language-models"></a></p>
<h2>What about using (large) language models?</h2>
<p>Using language models, such as GPT-3.5, can be a powerful approach for extracting keywords or key terms from a user query. Language models are trained on vast amounts of text data and have the ability to understand and generate human-like language.</p>
<p>Here are the pros and cons of using language models for keyword extraction:</p>
<p><a id="pros"></a></p>
<h3>Pros</h3>
<ol>
<li><strong>Contextual Understanding</strong>: Language models can capture the contextual meaning of words and phrases in a query. They can consider the surrounding words and sentences to extract keywords that are most relevant to the overall query.</li>
<li><strong>Handling Ambiguity</strong>: Language models can handle ambiguous queries by considering the broader context. They can interpret the query based on available information and generate keywords that make the most sense in the given context.</li>
<li><strong>Generalization</strong>: Language models have the ability to generalize from the training data and can extract keywords effectively even for queries that are slightly different from what they have seen before.</li>
<li><strong>Continuous Learning</strong>: Language models can be fine-tuned on specific domains or datasets to improve their keyword extraction capabilities. This allows them to adapt to specific contexts and improve their accuracy over time.</li>
</ol>
<p><a id="cons"></a></p>
<h3>Cons</h3>
<ol>
<li><strong>Lack of Control</strong>: Language models generate keywords based on their learned patterns and training data, which may not always align with specific user requirements or domain-specific terminology. They may produce keywords that are technically correct but not exactly what the user intended.</li>
<li><strong>Over-reliance on Training Data</strong>: Language models heavily depend on the data they were trained on. If the training data contains biases or limitations, the model may exhibit the same biases or struggle with specific types of queries that were underrepresented in the training data.</li>
<li><strong>Computational Overhead</strong>: Language models can be computationally expensive to run, especially for real-time applications. The time required for keyword extraction using a language model might not be suitable for scenarios that demand low latency or high throughput.</li>
<li><strong>Lack of Explanation</strong>: Language models can provide keyword outputs, but they may not offer clear explanations for why certain words were selected as keywords. This lack of interpretability can make it challenging to understand the reasoning behind the chosen keywords.</li>
</ol>
<p>While language models can be effective for keyword extraction, it's important to consider these pros and cons and carefully evaluate the trade-offs before integrating them into a production system. It may be necessary to fine-tune the language model or combine it with other techniques to address specific limitations or requirements.</p>
<p><a id="more-on-machine-learning-and-statistical-methods-for-keywords-extraction"></a></p>
<h2>More on Machine Learning and Statistical Methods for Keyword Extraction</h2>
<p>There are several machine learning and statistical methods commonly used for keyword extraction from text. Here are some popular techniques:</p>
<ol>
<li>
<p><strong>Term Frequency-Inverse Document Frequency (TF-IDF)</strong>: TF-IDF is a statistical method that measures the importance of a term within a document and across a collection of documents. It calculates a weight for each term based on its frequency in the document and inversely proportional to its frequency in the entire document collection. Keywords with higher TF-IDF scores are considered more significant.</p>
</li>
<li>
<p><strong>TextRank</strong>: TextRank is an algorithm inspired by Google's PageRank algorithm for ranking web pages. It applies a graph-based ranking approach to identify important keywords in a text. In this method, the text is represented as a graph, where each word is a node, and edges represent the co-occurrence or semantic similarity between words. TextRank assigns scores to words based on their centrality in the graph, with higher scores indicating more important keywords.</p>
</li>
<li>
<p><strong>Latent Dirichlet Allocation (LDA)</strong>: LDA is a generative probabilistic model that represents a collection of documents as a mixture of topics. It assumes that each document contains a distribution of topics, and each topic is characterized by a distribution of words. LDA can be used for keyword extraction by identifying the most probable words associated with each topic. Keywords are then selected based on their relevance to the document's topics.</p>
</li>
<li>
<p><strong>Support Vector Machines (SVM)</strong>: SVM is a supervised learning algorithm that can be used for keyword extraction by treating it as a binary classification problem. Training data is labeled with keywords and non-keywords, and SVM learns a decision boundary to separate the two classes. New text can be classified using the trained SVM model, and the words contributing most to the classification decision are considered keywords.</p>
</li>
<li>
<p><strong>Neural Networks</strong>: Various neural network architectures can be employed for keyword extraction, such as recurrent neural networks (RNNs), convolutional neural networks (CNNs), and transformers. These models can learn representations of words and capture complex relationships between them. They can be trained using labeled data or trained in an unsupervised manner by formulating the problem as an autoencoder or sequence-to-sequence learning.</p>
</li>
<li>
<p><strong>Rule-based methods</strong>: Rule-based approaches define a set of linguistic rules or patterns to identify keywords based on specific criteria such as part-of-speech tags, syntactic structures, or domain-specific rules. These methods can be effective when the domain or language has well-defined patterns for keywords.</p>
</li>
</ol>
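<p>To make the first method concrete, here is a minimal pure-Python TF-IDF sketch over a hypothetical toy corpus (production systems would typically use an existing implementation such as scikit-learn's <code>TfidfVectorizer</code> rather than this hand-rolled version):</p>

```python
import math

# Hypothetical toy corpus: each document is a list of lowercase tokens
documents = [
    "exercise improves mental health and reduces anxiety".split(),
    "a balanced diet supports physical health".split(),
    "regular exercise strengthens the heart".split(),
]

def tfidf(term, doc, corpus):
    # Term frequency: how often the term occurs in this document
    tf = doc.count(term) / len(doc)
    # Inverse document frequency: terms that are rare across the corpus score higher
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df)
    return tf * idf

doc = documents[0]
scores = {term: tfidf(term, doc, documents) for term in set(doc)}
top_keywords = sorted(scores, key=scores.get, reverse=True)[:3]
print(top_keywords)
```

<p>Terms such as "health" and "exercise", which appear in several documents, are down-weighted relative to terms that occur in only one document.</p>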
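<p>The rule-based approach can likewise be illustrated with a deliberately simple, hypothetical filter: keep alphabetic words above a minimum length that are not in a small stopword list. Real rule-based systems rely on part-of-speech patterns and domain-specific rules rather than a filter this crude:</p>

```python
import re

# A tiny, illustrative stopword list; real systems use much larger ones
STOPWORDS = {"what", "are", "the", "of", "for", "and", "with", "a", "an"}

def rule_based_keywords(text, min_length=4):
    # Rule 1: a keyword candidate is an alphabetic token
    words = re.findall(r"[a-zA-Z]+", text.lower())
    # Rule 2: it must be reasonably long and not a stopword
    return [w for w in words if len(w) >= min_length and w not in STOPWORDS]

print(rule_based_keywords("What are the benefits of exercise for mental health?"))
# ['benefits', 'exercise', 'mental', 'health']
```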
<p><a id="exemplary-implementation"></a></p>
<h2>Exemplary implementation</h2>
<p>One well-established solution for keyword extraction from short texts is the TextRank algorithm, an unsupervised graph-based approach derived from PageRank that has proven effective at identifying important keywords in a text.</p>
<p>Here's a Python implementation that uses the <code>nltk</code> library for preprocessing (tokenization, part-of-speech tagging, and lemmatization) and builds the TextRank graph and scoring loop directly:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">nltk</span>
<span class="kn">from</span> <span class="nn">nltk.tokenize</span> <span class="kn">import</span> <span class="n">word_tokenize</span><span class="p">,</span> <span class="n">sent_tokenize</span>
<span class="kn">from</span> <span class="nn">nltk.corpus</span> <span class="kn">import</span> <span class="n">stopwords</span>
<span class="kn">from</span> <span class="nn">nltk.stem</span> <span class="kn">import</span> <span class="n">WordNetLemmatizer</span>
<span class="kn">from</span> <span class="nn">nltk.tag</span> <span class="kn">import</span> <span class="n">pos_tag</span>
<span class="kn">from</span> <span class="nn">nltk.corpus</span> <span class="kn">import</span> <span class="n">wordnet</span> <span class="k">as</span> <span class="n">wn</span>
<span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">defaultdict</span>
<span class="k">def</span> <span class="nf">preprocess_text</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="c1"># Tokenize the text into sentences</span>
    <span class="n">sentences</span> <span class="o">=</span> <span class="n">sent_tokenize</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
    <span class="c1"># Tokenize each sentence into words and perform part-of-speech tagging</span>
    <span class="n">tagged_words</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">sentence</span> <span class="ow">in</span> <span class="n">sentences</span><span class="p">:</span>
        <span class="n">words</span> <span class="o">=</span> <span class="n">word_tokenize</span><span class="p">(</span><span class="n">sentence</span><span class="p">)</span>
        <span class="n">tagged_words</span><span class="o">.</span><span class="n">extend</span><span class="p">(</span><span class="n">pos_tag</span><span class="p">(</span><span class="n">words</span><span class="p">))</span>
    <span class="c1"># Lemmatize the words and remove stopwords</span>
    <span class="n">lemmatizer</span> <span class="o">=</span> <span class="n">WordNetLemmatizer</span><span class="p">()</span>
    <span class="n">stop_words</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">stopwords</span><span class="o">.</span><span class="n">words</span><span class="p">(</span><span class="s1">'english'</span><span class="p">))</span>
    <span class="n">preprocessed_words</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">word</span><span class="p">,</span> <span class="n">tag</span> <span class="ow">in</span> <span class="n">tagged_words</span><span class="p">:</span>
        <span class="c1"># Consider only nouns, verbs, adjectives, and adverbs</span>
        <span class="k">if</span> <span class="n">tag</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">'NN'</span><span class="p">)</span> <span class="ow">or</span> <span class="n">tag</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">'VB'</span><span class="p">)</span> <span class="ow">or</span> <span class="n">tag</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">'JJ'</span><span class="p">)</span> <span class="ow">or</span> <span class="n">tag</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">'RB'</span><span class="p">):</span>
            <span class="c1"># Lemmatize the word</span>
            <span class="n">lemma</span> <span class="o">=</span> <span class="n">lemmatizer</span><span class="o">.</span><span class="n">lemmatize</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="n">get_wordnet_pos</span><span class="p">(</span><span class="n">tag</span><span class="p">))</span>
            <span class="c1"># Convert to lowercase and remove stopwords</span>
            <span class="k">if</span> <span class="n">lemma</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">stop_words</span><span class="p">:</span>
                <span class="n">preprocessed_words</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">lemma</span><span class="o">.</span><span class="n">lower</span><span class="p">())</span>
    <span class="k">return</span> <span class="n">preprocessed_words</span>

<span class="k">def</span> <span class="nf">get_wordnet_pos</span><span class="p">(</span><span class="n">tag</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">tag</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">'N'</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">wn</span><span class="o">.</span><span class="n">NOUN</span>
    <span class="k">elif</span> <span class="n">tag</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">'V'</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">wn</span><span class="o">.</span><span class="n">VERB</span>
    <span class="k">elif</span> <span class="n">tag</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">'J'</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">wn</span><span class="o">.</span><span class="n">ADJ</span>
    <span class="k">elif</span> <span class="n">tag</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s1">'R'</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">wn</span><span class="o">.</span><span class="n">ADV</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="kc">None</span>

<span class="k">def</span> <span class="nf">calculate_similarity</span><span class="p">(</span><span class="n">word1</span><span class="p">,</span> <span class="n">word2</span><span class="p">):</span>
    <span class="n">synsets1</span> <span class="o">=</span> <span class="n">wn</span><span class="o">.</span><span class="n">synsets</span><span class="p">(</span><span class="n">word1</span><span class="p">)</span>
    <span class="n">synsets2</span> <span class="o">=</span> <span class="n">wn</span><span class="o">.</span><span class="n">synsets</span><span class="p">(</span><span class="n">word2</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">synsets1</span> <span class="ow">and</span> <span class="n">synsets2</span><span class="p">:</span>
        <span class="n">max_sim</span> <span class="o">=</span> <span class="nb">max</span><span class="p">((</span><span class="n">wn</span><span class="o">.</span><span class="n">path_similarity</span><span class="p">(</span><span class="n">s1</span><span class="p">,</span> <span class="n">s2</span><span class="p">)</span> <span class="ow">or</span> <span class="mi">0</span><span class="p">)</span> <span class="k">for</span> <span class="n">s1</span> <span class="ow">in</span> <span class="n">synsets1</span> <span class="k">for</span> <span class="n">s2</span> <span class="ow">in</span> <span class="n">synsets2</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">max_sim</span>
    <span class="k">return</span> <span class="mi">0</span>

<span class="k">def</span> <span class="nf">textrank_keywords</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">top_n</span><span class="o">=</span><span class="mi">5</span><span class="p">):</span>
    <span class="c1"># Preprocess the text</span>
    <span class="n">words</span> <span class="o">=</span> <span class="n">preprocess_text</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
    <span class="c1"># Build the word co-occurrence graph</span>
    <span class="n">graph</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">list</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">word1</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">words</span><span class="p">):</span>
        <span class="k">for</span> <span class="n">j</span><span class="p">,</span> <span class="n">word2</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">words</span><span class="p">):</span>
            <span class="k">if</span> <span class="n">i</span> <span class="o">!=</span> <span class="n">j</span><span class="p">:</span>
                <span class="n">similarity</span> <span class="o">=</span> <span class="n">calculate_similarity</span><span class="p">(</span><span class="n">word1</span><span class="p">,</span> <span class="n">word2</span><span class="p">)</span>
                <span class="k">if</span> <span class="n">similarity</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span>
                    <span class="n">graph</span><span class="p">[</span><span class="n">word1</span><span class="p">]</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">word2</span><span class="p">,</span> <span class="n">similarity</span><span class="p">))</span>
    <span class="c1"># Apply the TextRank algorithm</span>
    <span class="n">scores</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">float</span><span class="p">)</span>
    <span class="n">damping_factor</span> <span class="o">=</span> <span class="mf">0.85</span>
    <span class="n">max_iterations</span> <span class="o">=</span> <span class="mi">100</span>
    <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">max_iterations</span><span class="p">):</span>
        <span class="n">prev_scores</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">scores</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">word1</span> <span class="ow">in</span> <span class="n">words</span><span class="p">:</span>
            <span class="c1"># .get avoids a KeyError on the first iteration, when scores are still empty</span>
            <span class="n">score</span> <span class="o">=</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">damping_factor</span><span class="p">)</span> <span class="o">+</span> <span class="n">damping_factor</span> <span class="o">*</span> <span class="nb">sum</span><span class="p">(</span><span class="n">prev_scores</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">word2</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span> <span class="o">*</span> <span class="n">weight</span> <span class="k">for</span> <span class="n">word2</span><span class="p">,</span> <span class="n">weight</span> <span class="ow">in</span> <span class="n">graph</span><span class="p">[</span><span class="n">word1</span><span class="p">])</span>
            <span class="n">scores</span><span class="p">[</span><span class="n">word1</span><span class="p">]</span> <span class="o">=</span> <span class="n">score</span>
    <span class="c1"># Get the top keywords</span>
    <span class="n">top_keywords</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">scores</span><span class="o">.</span><span class="n">items</span><span class="p">(),</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="kc">True</span><span class="p">)[:</span><span class="n">top_n</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">top_keywords</span>
<span class="c1"># Example usage</span>
<span class="n">text</span> <span class="o">=</span> <span class="s2">"What are the benefits of exercise for mental health?"</span>
<span class="n">keywords</span> <span class="o">=</span> <span class="n">textrank_keywords</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">keywords</span><span class="p">)</span>
</code></pre></div>
<p>NOTE: before running this code, you need to download a few NLTK (Natural Language Toolkit) data resources: the stopwords corpus, the averaged-perceptron POS tagger, the Punkt tokenizer, and WordNet.</p>
<p>To download the necessary data, you can use the following code snippet:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">nltk</span>
<span class="n">nltk</span><span class="o">.</span><span class="n">download</span><span class="p">(</span><span class="s1">'stopwords'</span><span class="p">)</span>
<span class="n">nltk</span><span class="o">.</span><span class="n">download</span><span class="p">(</span><span class="s1">'averaged_perceptron_tagger'</span><span class="p">)</span>
<span class="n">nltk</span><span class="o">.</span><span class="n">download</span><span class="p">(</span><span class="s1">'punkt'</span><span class="p">)</span>
<span class="n">nltk</span><span class="o">.</span><span class="n">download</span><span class="p">(</span><span class="s1">'wordnet'</span><span class="p">)</span>
</code></pre></div>The Role and Responsibilities of a Forward Deployed Engineer - Bridging the Gap Between Software Products and Customer Needs2023-06-09T00:00:00+02:002023-06-09T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-06-09:/the-role-and-responsibilities-of-a-forward-deployed-engineer/<p>Bridging the gap between software products and customer needs, Forward Deployed Engineers are the game-changers of enterprise software. Discover their unique role in driving success and why it's in high demand. Don't miss out!</p>
<h2>TL;DR</h2>
<p>A Forward Deployed Engineer (FDE) is a versatile software engineer who works closely with customers to bridge the gap between enterprise software products and their specific implementation needs. FDEs collaborate with engineering teams, provide technical support, partner with product teams, assist in revenue growth activities, and lead customer success efforts. With a mix of technical skills, an entrepreneurial mindset, and product intuition, FDEs play a crucial role in ensuring successful product deployment and customer satisfaction.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#understanding-the-role-of-fdes">Understanding the Role of FDEs</a><ul>
<li><a href="#collaboration-with-engineering">Collaboration with Engineering</a></li>
<li><a href="#partnership-with-product-teams">Partnership with Product Teams</a></li>
<li><a href="#support-for-revenue-growth">Support for Revenue Growth</a></li>
<li><a href="#leadership-in-customer-success">Leadership in Customer Success</a></li>
</ul>
</li>
<li><a href="#why-forward-deployed-engineers-are-in-high-demand">Why Are Forward Deployed Engineers in High Demand?</a><ul>
<li><a href="#technical-expertise-and-customer-focus">Technical Expertise and Customer Focus</a></li>
<li><a href="#agile-problem-solvers">Agile Problem Solvers</a></li>
<li><a href="#product-intuition">Product Intuition</a></li>
</ul>
</li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="introduction"></a></p>
<h2>Introduction</h2>
<p>In the fast-paced world of enterprise software, there is an increasing demand for versatile engineers who can seamlessly integrate complex products into customers' specific implementation needs. This demand has given rise to the role of Forward Deployed Engineer (FDE). FDEs play a crucial role in ensuring successful technical integration and ongoing product deployment, acting as a bridge between the product suite and the unique requirements of each customer. This blog post will delve into the responsibilities of FDEs and shed light on why this role is in high demand.</p>
<p><a id="understanding-the-role-of-fdes"></a></p>
<h2>Understanding the Role of FDEs</h2>
<p>Forward Deployed Engineers are software engineers with broad skill sets that enable them to work closely with customers and iterate on enterprise software products. They possess technical expertise while being customer-facing, making them a valuable asset in various areas of an enterprise software organization.</p>
<p><a id="collaboration-with-engineering"></a></p>
<h3>Collaboration with Engineering</h3>
<p>Forward Deployed Engineers (FDEs) play a crucial role in fostering collaboration between engineering teams and external stakeholders. By actively contributing to internal codebases and working closely with core engineering teams, FDEs ensure that customer feedback and implementation needs are effectively communicated and addressed.</p>
<p>FDEs act as the bridge between the technical complexities of the product and the understanding of external stakeholders. They have a deep understanding of the product's architecture, functionalities, and underlying technologies. This expertise allows them to effectively communicate technical topics to non-technical stakeholders, such as customers or business executives.</p>
<p>When customers encounter challenges or require customizations to the product suite, FDEs work closely with the engineering team to find viable solutions. They provide valuable insights on the implementation needs and collaborate with engineers to identify the best approach. FDEs act as advocates for customers, ensuring that their requirements are properly understood and addressed within the product's capabilities.</p>
<p>Through this collaboration, FDEs contribute to the improvement of internal codebases. They provide feedback to engineering teams regarding areas that require enhancements or optimizations based on real-world customer experiences. This feedback loop helps create a continuous improvement process for the product, making it more robust and aligned with customer needs.</p>
<p>Furthermore, FDEs actively participate in cross-functional meetings, bringing the perspective of external stakeholders to the engineering team. This collaboration helps align engineering efforts with customer requirements and provides valuable context for decision-making.</p>
<blockquote>
<p>Collaboration with engineering is a critical aspect of the FDE role. By effectively communicating technical topics to external stakeholders and working closely with the engineering team, FDEs ensure that customer feedback is accurately relayed, implementation needs are addressed, and the product continues to evolve to meet customer expectations.</p>
</blockquote>
<p><a id="partnership-with-product-teams"></a></p>
<h3>Partnership with Product Teams</h3>
<p>Forward Deployed Engineers (FDEs) play a pivotal role in establishing a strong partnership between external stakeholders and the product teams. By leveraging their customer-facing experience and technical expertise, FDEs bring valuable insights to the table, shaping the product roadmap and driving its evolution.</p>
<p>FDEs act as the voice of the customer within the organization. They gather feedback, requirements, and feature requests directly from customers and effectively communicate these insights to the product teams. By understanding the customers' pain points, desired features, and use cases, FDEs provide invaluable information that helps shape the product's direction.</p>
<p>Throughout the engineering lifecycle, FDEs collaborate closely with the product teams to iterate on existing features and deliver new use cases. They work in tandem with product managers, developers, and designers to ensure that the product roadmap aligns with the specific needs of customers. FDEs provide real-world context and technical expertise, enabling product teams to make informed decisions regarding prioritization, feature enhancements, and trade-offs.</p>
<p>FDEs also act as a bridge between product teams and customers during the implementation phase. They facilitate ongoing communication, ensuring that the product is implemented effectively and meets customers' expectations. FDEs provide guidance on technical integration, address any gaps between the product suite and customer requirements, and offer insights on best practices for successful deployment.</p>
<p>Additionally, FDEs actively participate in testing and validation processes, providing feedback on new features and enhancements from the customer's perspective. They collaborate with product teams to conduct user acceptance testing, gather feedback, and ensure that the product meets the desired outcomes.</p>
<p>By establishing a strong partnership with product teams, FDEs contribute to the overall success of the product. Their unique position allows them to bridge the gap between customer needs and product development, ensuring that the product remains relevant, competitive, and aligned with the evolving market landscape.</p>
<blockquote>
<p>The partnership between FDEs and product teams is essential for driving innovation, customer satisfaction, and product evolution. FDEs bring customer insights, technical expertise, and a deep understanding of implementation needs to collaborate closely with product teams, influencing the product roadmap, and delivering value-driven solutions to customers.</p>
</blockquote>
<p><a id="support-for-revenue-growth"></a></p>
<h3>Support for Revenue Growth</h3>
<p>Forward Deployed Engineers (FDEs) contribute significantly to revenue growth by providing technical expertise and support in various revenue-related activities. Their role extends beyond engineering and involves actively participating in sales meetings, leading technical discussions, and completing Requests for Proposal (RFPs).</p>
<p>As technical advisors, FDEs join sales meetings with non-technical external stakeholders, such as executives or business leaders. In this capacity, they provide valuable insights into the product's capabilities, technical requirements, and implementation process. By bridging the gap between the product suite and the customers' specific needs, FDEs help potential clients understand the value proposition and make informed purchasing decisions.</p>
<p>Moreover, FDEs take the lead in technical sales calls and meetings with external technical stakeholders. They are responsible for communicating the technical aspects of the product, answering complex inquiries, and addressing any technical concerns potential customers may have. FDEs play a crucial role in building trust and confidence in the product's ability to meet the customers' requirements.</p>
<p>FDEs also contribute to revenue growth by completing RFPs. These documents are often requested by potential customers to evaluate software solutions for their specific needs. FDEs leverage their technical knowledge and customer insights to provide comprehensive and accurate responses to these RFPs. By effectively showcasing the product's capabilities and aligning them with customer requirements, FDEs play a key role in unlocking new revenue opportunities.</p>
<p>Additionally, FDEs collaborate with the sales and marketing teams to develop technical collateral, such as case studies, technical whitepapers, and solution guides. These resources help articulate the product's value proposition, highlight successful customer implementations, and provide technical details to support the sales process. FDEs actively contribute to these materials, ensuring they are accurate, relevant, and impactful.</p>
<p>By supporting revenue growth initiatives, FDEs contribute to the overall success of the organization. Their technical expertise, customer-centric mindset, and ability to effectively communicate the value of the product position them as trusted advisors and advocates for both the customers and the sales teams. FDEs help drive new business opportunities, enhance customer satisfaction, and ultimately contribute to the financial growth of the company.</p>
<blockquote>
<p>FDEs play a crucial role in supporting revenue growth by providing technical support, leading sales discussions, completing RFPs, and developing collateral. Their ability to bridge the gap between technical complexities and customer needs helps build trust, accelerate sales cycles, and unlock new revenue streams. FDEs are instrumental in driving the financial success of the organization.</p>
</blockquote>
<p><a id="leadership-in-customer-success"></a></p>
<h3>Leadership in Customer Success</h3>
<p>Forward Deployed Engineers (FDEs) take on a leadership role in ensuring customer success throughout the implementation and deployment of the product. They act as technical leads and provide critical support to customers, facilitating onboarding, and driving the adoption of new features into customers' production environments.</p>
<p>FDEs serve as the primary point of contact for customers during the implementation phase. They work closely with customer success teams to understand the customers' specific requirements and develop tailored implementation plans. FDEs leverage their technical expertise to guide customers through the integration process, ensuring a smooth and successful onboarding experience.</p>
<p>As technical leads, FDEs provide ongoing support to customers, addressing any technical issues or challenges they may encounter. They troubleshoot and resolve complex technical problems, acting as a bridge between the customers and the engineering team. FDEs leverage their deep understanding of the product to provide timely and effective solutions, ensuring that customers can fully leverage the capabilities of the software.</p>
<p>In addition to technical support, FDEs play a critical role in driving the adoption of new features and enhancements. They collaborate with customers to understand their specific use cases and provide guidance on how to best utilize the product's functionality to achieve their desired outcomes. FDEs conduct training sessions, create documentation, and offer best practices to ensure that customers can maximize the value they derive from the product.</p>
<p>FDEs also act as advocates for customers within the organization. They actively collect feedback, feature requests, and insights from customers and communicate them to the product teams. By representing the customers' voice, FDEs contribute to the continuous improvement of the product, ensuring that it evolves to meet their changing needs.</p>
<p>Building strong relationships with customers is a key aspect of the FDE role. FDEs engage in regular communication, conduct business reviews, and seek opportunities to deepen customer engagement. By understanding the customers' goals, challenges, and aspirations, FDEs can provide personalized recommendations and strategic guidance, ultimately fostering long-term customer satisfaction and loyalty.</p>
<blockquote>
<p>FDEs assume a leadership role in customer success by providing technical guidance, support, and advocacy throughout the implementation and deployment process. Their deep technical expertise, customer-centric approach, and ability to build strong relationships position them as trusted partners for customers. FDEs play a crucial role in driving customer success, ensuring that customers achieve their desired outcomes and maximizing the value they derive from the product.</p>
</blockquote>
<p><a id="why-forward-deployed-engineers-are-in-high-demand"></a></p>
<h2>Why Are Forward Deployed Engineers in High Demand?</h2>
<p>The increasing complexity of enterprise software products and the variability in customer requirements have created a significant demand for FDEs. Here are some reasons why this role is sought after:</p>
<p><a id="technical-expertise-and-customer-focus"></a></p>
<h3>Technical Expertise and Customer Focus</h3>
<p>FDEs possess a unique mix of technical skills and customer-centricity. They understand the intricacies of the product and can effectively communicate its value to both technical and non-technical stakeholders. Their ability to bridge the gap between engineering and customer needs is invaluable in ensuring successful deployments.</p>
<p><a id="agile-problem-solvers"></a></p>
<h3>Agile Problem Solvers</h3>
<p>FDEs exhibit an entrepreneurial mindset, allowing them to adapt quickly to evolving customer requirements. They are adept at identifying challenges, proposing solutions, and iterating on product features. This agility is essential in a rapidly changing technological landscape, where customers' needs evolve at a fast pace.</p>
<p><a id="product-intuition"></a></p>
<h3>Product Intuition</h3>
<p>By working closely with customers, FDEs develop a deep understanding of their pain points and aspirations. This product intuition enables them to provide valuable insights to product teams, helping shape the product roadmap and prioritize features that align with customer needs. FDEs contribute to the development of customer-centric software solutions.</p>
<p><a id="conclusion"></a></p>
<h2>Conclusion</h2>
<p>Forward Deployed Engineers play a vital role in enterprise software organizations, acting as the bridge between products and customer implementations. Their broad skill set, technical expertise, entrepreneurial mindset, and product intuition make them invaluable assets in driving customer success, revenue growth, and product evolution. As enterprise software continues to evolve, the demand for FDEs will likely increase, providing software engineers with a customer-facing path that allows them to thrive in both technical and business domains.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>How to Count Tokens - Tokenization With Tiktoken.2023-06-08T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-06-08:/how-to-count-tokens/<p>Counting tokens is a useful task in natural language processing (NLP) that allows us to measure the length and complexity of a text. The two important use cases for counting the tokens are:</p>
<ul>
<li><strong>controlling the length of the prompt</strong> - models have limits …</li></ul><p>Counting tokens is a useful task in natural language processing (NLP) that allows us to measure the length and complexity of a text. The two important use cases for counting the tokens are:</p>
<ul>
<li><strong>controlling the length of the prompt</strong> - models have a limit on the number of input tokens, so it is good to verify that your prompt does not exceed the limit for the model</li>
<li><strong>cost awareness</strong> - when you know how many tokens you pass as input, you can estimate the cost related to the prompt.</li>
</ul>
<p>In this blog post, we will explore how to count the number of tokens in a given text using OpenAI's tokenizer, called <code>tiktoken</code>. Whether you're a seasoned Python developer or just getting started with NLP, this guide will provide you with a step-by-step process to accurately determine the token count of your text.</p>
<h3>Introduction to <code>tiktoken</code></h3>
<p>To begin with, we need to install the <code>tiktoken</code> library, which is a powerful tokenizer developed by OpenAI. It offers efficient tokenization capabilities and supports a wide range of languages. You can find the library on GitHub at <a href="https://github.com/openai/tiktoken">this link</a>.</p>
<h3>Code Example</h3>
<p>Let's dive into a code example that demonstrates how to count tokens using <code>tiktoken</code>:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">tiktoken</span>
<span class="k">def</span> <span class="nf">num_tokens_from_string</span><span class="p">(</span><span class="n">string</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">encoding_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-></span> <span class="nb">int</span><span class="p">:</span>
<span class="w"> </span><span class="sd">"""Returns the number of tokens in a text string."""</span>
    <span class="n">encoding</span> <span class="o">=</span> <span class="n">tiktoken</span><span class="o">.</span><span class="n">get_encoding</span><span class="p">(</span><span class="n">encoding_name</span><span class="p">)</span>
    <span class="n">num_tokens</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">encoding</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">string</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">num_tokens</span>
<span class="n">num_tokens_from_string</span><span class="p">(</span><span class="s2">"tiktoken is great!"</span><span class="p">,</span> <span class="s2">"cl100k_base"</span><span class="p">)</span>
</code></pre></div>
<p>In the example above, we import the <code>tiktoken</code> library and define a function called <code>num_tokens_from_string</code>. This function takes a text string and an encoding name as input parameters. It returns the number of tokens in the given text string.</p>
<p>To count the tokens, we first obtain the encoding using <code>tiktoken.get_encoding(encoding_name)</code>. The <code>encoding_name</code> specifies the type of encoding we want to use. In this case, we use the <code>cl100k_base</code> encoding, which is suitable for second-generation embedding models like <code>text-embedding-ada-002</code>.</p>
<p>Next, we encode the input string using <code>encoding.encode(string)</code> and calculate the number of tokens by taking the length of the encoded sequence. The final result is the total number of tokens in the text string.</p>
<p><code>tiktoken</code> supports three encodings used by OpenAI models:</p>
<table>
<thead>
<tr>
<th>Encoding name</th>
<th>OpenAI models</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>cl100k_base</code></td>
<td><code>gpt-4</code>, <code>gpt-3.5-turbo</code>, <code>text-embedding-ada-002</code></td>
</tr>
<tr>
<td><code>p50k_base</code></td>
<td>Codex models, <code>text-davinci-002</code>, <code>text-davinci-003</code></td>
</tr>
<tr>
<td><code>r50k_base</code> (or <code>gpt2</code>)</td>
<td>GPT-3 models like <code>davinci</code></td>
</tr>
</tbody>
</table>
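<p>As a rough illustration of the table above, the mapping can be captured in a small lookup helper. The dictionary and function names below are made up for this sketch, hand-written from the table; in practice <code>tiktoken.encoding_for_model()</code> resolves the encoding for you.</p>

```python
# Illustrative mapping transcribed from the table above.
MODEL_TO_ENCODING = {
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "text-embedding-ada-002": "cl100k_base",
    "text-davinci-002": "p50k_base",
    "text-davinci-003": "p50k_base",
    "davinci": "r50k_base",
}


def encoding_name_for_model(model: str) -> str:
    """Return the encoding name for a model, defaulting to cl100k_base."""
    return MODEL_TO_ENCODING.get(model, "cl100k_base")


print(encoding_name_for_model("gpt-3.5-turbo"))  # cl100k_base
print(encoding_name_for_model("text-davinci-003"))  # p50k_base
```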
<h3>OpenAI Cookbook Guide</h3>
<p>For a more detailed explanation and additional examples, you can refer to the OpenAI Cookbook guide on <a href="https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb">how to count tokens with tiktoken</a>. The guide provides comprehensive instructions on token counting and offers insights into various use cases.</p>
<h3>Tokenization Sandbox</h3>
<p>If you're looking to experiment with text tokenization, OpenAI provides a convenient web application called the Tokenization Sandbox. You can access it <a href="https://platform.openai.com/tokenizer">here</a>. The sandbox allows you to input text and observe the resulting tokens, helping you better understand the tokenization process.</p>
<h3>Text splitter module</h3>
<p>A Python script for splitting text into parts with a controlled (limited) length in tokens. The script uses the <code>tiktoken</code> library for encoding and decoding text:
<a href="https://gist.github.com/izikeros/17d9c8ab644bd2762acf6b19dd0cea39">https://gist.github.com/izikeros/17d9c8ab644bd2762acf6b19dd0cea39</a></p>
<h3>Count tokens cli tool</h3>
<p>Check out this simple CLI tool that has one purpose - counting tokens in a text file:</p>
<p><a href="https://github.com/izikeros/count_tokens">izikeros/count_tokens: Count tokens in a text file.</a></p>
<h3>Rule of thumb</h3>
<p>On the <a href="https://platform.openai.com/tokenizer">website</a> hosting the tokenizer sandbox, OpenAI provides a rule of thumb that helps estimate the approximate number of tokens in a given text.</p>
<blockquote>
<p>A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly ¾ of a word (so 100 tokens ~= 75 words).</p>
</blockquote>
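<p>The rule of thumb translates into a quick back-of-the-envelope estimator. This is only a sketch of the quoted heuristic, with helper names invented for illustration; exact counts still require <code>tiktoken</code>.</p>

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for common English text: ~4 characters per token.

    This is only the heuristic quoted above; use tiktoken for exact counts.
    """
    return max(1, round(len(text) / 4))


def estimate_tokens_from_words(n_words: int) -> int:
    """Roughly 100 tokens per 75 words, i.e. about 4/3 tokens per word."""
    return round(n_words * 100 / 75)


print(estimate_tokens("tiktoken is great!"))  # 4 (an estimate, not an exact count)
print(estimate_tokens_from_words(75))  # 100
```

For short or unusual strings the heuristic can be noticeably off, which is why it is only useful for quick sanity checks on prompt size.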
<h3>References</h3>
<p>To develop this guide, we drew inspiration from the token counting instructions provided by OpenAI. You can find additional information in the <a href="https://platform.openai.com/docs/guides/embeddings/limitations-risks">OpenAI documentation</a>, where they discuss the limitations and risks associated with embeddings.</p>
<p>Token counting is essential when working with NLP, enabling us to analyze and process text effectively. By leveraging OpenAI's <code>tiktoken</code> library and following the guidelines outlined in this blog post, you'll be well-equipped to count tokens accurately and efficiently.</p>
<p>See also: <a href="https://omarkama.li/blog/tokens-the-secret-language-of-ai">Tokens, the secret language of AI | Omar Kamali</a></p>The Best Vector Databases for Storing Embeddings2023-06-05T00:00:00+02:002023-06-05T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-06-05:/the-best-vector-databases-for-storing-embeddings/<p>Delve into the World of Vector Databases Fueling NLP's Transformative Journey.</p><h2>Best Vector Databases for Storing Embeddings in NLP</h2>
<p>As natural language processing (NLP) continues to advance, the need for efficient storage and retrieval of vector representations, or embeddings, has become paramount.</p>
<blockquote>
<p>Vector databases are purpose-built databases that excel in storing and querying high-dimensional vector data, such as word embeddings or document representations.</p>
</blockquote>
<p>This article explores the best vector databases available, their unique features, and the crucial parameters that differentiate them.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#tldr">TLDR</a></li>
<li><a href="#what-vector-databases-are-and-why-there-is-demand-for-them">What are vector databases, and why is there demand for them?</a></li>
<li><a href="#understanding-tradeoffs-and-identifying-the-specific-requirements-to-choose-the-best-tool">Understanding tradeoffs and identifying the specific requirements to choose the best tool</a></li>
<li><a href="#vector-databases">Vector databases</a><ul>
<li><a href="#chroma">Chroma</a></li>
<li><a href="#haystack-by-deepsetai">Haystack by DeepsetAI</a></li>
<li><a href="#faiss-by-facebook">Faiss by Facebook</a></li>
<li><a href="#milvus">Milvus</a></li>
<li><a href="#pgvector">pgvector</a></li>
<li><a href="#pinecone">Pinecone</a></li>
<li><a href="#supabase">Supabase</a></li>
<li><a href="#qdrant">Qdrant</a></li>
<li><a href="#vespa">Vespa</a></li>
<li><a href="#weaviate">Weaviate</a></li>
<li><a href="#deeplake">DeepLake</a></li>
<li><a href="#vectorstore-from-langchain">VectorStore from LangChain</a></li>
<li><a href="#other-relevant-vector-databases">Other Relevant Vector Databases</a></li>
</ul>
</li>
<li><a href="#tabular-summary-of-the-features">Tabular summary of the features</a></li>
<li><a href="#recommendations">Recommendations</a><ul>
<li><a href="#easy-start-and-user-friendliness---good-for-poc">Easy Start and User-Friendliness - good for PoC</a></li>
<li><a href="#advanced-capabilities-and-performance">Advanced Capabilities and Performance</a></li>
<li><a href="#customization-and-advanced-use-cases">Customization and Advanced Use Cases</a></li>
</ul>
</li>
<li><a href="#related-reading">Related reading</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="tldr"></a></p>
<h2>TLDR</h2>
<p>If you don't want to spend time reading about each solution, you may want to head directly to the <a href="#recommendations">recommendations</a> section, where solutions for various use cases are proposed.
<a id="what-vector-databases-are-and-why-there-is-demand-for-them"></a></p>
<h2>What are vector databases, and why is there demand for them?</h2>
<p>Vector databases are specialized databases designed for efficient storage, retrieval, and manipulation of vector representations, particularly in the context of Natural Language Processing (NLP) and machine learning applications. They are optimized for handling high-dimensional embeddings that represent textual or numerical data in a vectorized format.</p>
<p>While traditional databases like PostgreSQL are versatile and battle-tested, they are not specifically optimized for vector operations. Vector databases, on the other hand, provide a set of features and optimizations tailored to the unique requirements of working with vector embeddings. Here are some reasons why vector databases are in demand despite the existence of other types of databases:</p>
<ol>
<li>
<p><strong>Scalability</strong>: Vector databases are built to handle large-scale datasets and can scale horizontally to accommodate growing data volumes. They distribute the storage and processing of vectors across multiple machines, enabling efficient handling of massive amounts of embedding data.</p>
</li>
<li>
<p><strong>Query Speed</strong>: Vector databases employ advanced indexing structures and search algorithms, such as approximate nearest neighbor (ANN) search, to achieve fast and accurate similarity searches. These optimizations enable rapid retrieval of vectors based on their similarity to a given query vector.</p>
</li>
<li>
<p><strong>Accuracy of Search Results</strong>: Vector databases focus on preserving the accuracy of similarity search results. They leverage techniques like space partitioning, dimensionality reduction, and quantization to ensure that similar vectors are efficiently identified, even in high-dimensional spaces.</p>
</li>
<li>
<p><strong>Flexibility</strong>: Vector databases offer flexibility in terms of supported vector operations and indexing methods. They often provide a range of indexing algorithms, allowing users to choose the one that best suits their specific use case. Additionally, vector databases may support additional functionality like filtering, ranking, and semantic search.</p>
</li>
<li>
<p><strong>Data Persistence and Durability</strong>: Vector databases prioritize data persistence and durability, ensuring that vector embeddings are reliably stored and protected against data loss. They often integrate with existing storage solutions or provide mechanisms for backup and replication.</p>
</li>
<li>
<p><strong>Storage Location</strong>: Vector databases can be deployed either on-premises or in the cloud, providing flexibility in terms of infrastructure choices. Cloud-based vector databases offer the advantage of managed services, offloading the operational overhead of maintaining and scaling the database infrastructure.</p>
</li>
<li>
<p><strong>Direct Library vs. Abstraction</strong>: Vector databases come in two main forms: those that offer a direct library interface for integration into existing systems and those that provide a higher-level abstraction, such as RESTful APIs or query languages. This flexibility allows developers to choose the level of control and integration that best fits their requirements.</p>
</li>
</ol>
<p>While traditional databases like PostgreSQL can handle various data types, including vectors, they may lack the specialized optimizations and features provided by vector databases. Vector databases excel in efficiently storing and querying high-dimensional embeddings, enabling faster similarity search and supporting specific vector-related operations. By leveraging these optimizations, vector databases streamline the development and deployment of NLP and machine learning applications.</p>
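<p>To make concrete what these databases optimize, here is a naive exact nearest-neighbor search in plain Python. It is a toy illustration, not any particular database's implementation: it scores the query against every stored vector, which is the linear-scan baseline that ANN indexes like HNSW and IVF are built to avoid.</p>

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal dimension."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def exact_search(query: list[float], vectors: list[list[float]], k: int = 2) -> list[int]:
    """Brute-force exact top-k search: O(n * d) per query."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(exact_search([1.0, 0.05], vectors, k=2))  # [0, 1]
```

Every query touches every vector, so with millions of embeddings this approach becomes the bottleneck; the indexing structures described above exist precisely to avoid this full scan.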
<p><a id="understanding-tradeoffs-and-identifying-the-specific-requirements-to-choose-the-best-tool"></a></p>
<h2>Understanding tradeoffs and identifying the specific requirements to choose the best tool</h2>
<p>When choosing a vector database, there are several tradeoffs and potentially contradicting requirements that developers need to consider. Here are some typical tradeoffs and contradictions related to selecting a vector database:</p>
<ol>
<li>
<p><strong>Scalability vs. Query Speed</strong>: Achieving high scalability often requires distributing data across multiple nodes, which can impact query speed due to network communication. Balancing the need for scalability with the requirement for fast query response times can be a tradeoff when selecting a vector database.</p>
</li>
<li>
<p><strong>Search Accuracy vs. Query Speed</strong>: Algorithms that provide high search accuracy, such as exact nearest neighbor search, can be computationally expensive and impact query speed. Approximate algorithms, while faster, might sacrifice some accuracy. The tradeoff lies in finding the right balance between search accuracy and query speed based on the specific use case.</p>
</li>
<li>
<p><strong>Flexibility vs. Performance</strong>: Some vector databases offer extensive customization options, allowing users to tailor the system to their specific requirements. However, the more flexibility provided, the more overhead might be introduced, potentially impacting overall performance. Balancing the need for flexibility with performance considerations is crucial.</p>
</li>
<li>
<p><strong>Data Persistence and Durability vs. Query Performance</strong>: Ensuring data persistence and durability typically involves additional disk I/O operations, which can impact query performance. The tradeoff here is finding the right level of data persistence and durability while maintaining satisfactory query performance.</p>
</li>
<li>
<p><strong>Storage Location vs. Data Security</strong>: Storing vector embeddings locally provides faster access, but it may introduce data security risks. Cloud-based storage solutions offer scalability and redundancy but may raise concerns about data privacy and compliance. The choice between local and cloud storage involves weighing the benefits of each option against data security requirements.</p>
</li>
<li>
<p><strong>Direct Library vs. Abstraction</strong>: Some vector databases offer direct library interfaces for seamless integration into existing systems, while others provide higher-level abstractions like APIs or query languages for ease of use. The tradeoff here is between the level of control and integration required versus the simplicity of implementation and maintenance.</p>
</li>
<li>
<p><strong>Ease of Use vs. Advanced Features</strong>: Vector databases that prioritize ease of use often sacrifice some advanced features and optimization techniques. Developers must consider the complexity of their use case and weigh the need for advanced features against the simplicity of the database.</p>
</li>
</ol>
<p>Understanding these tradeoffs and identifying the specific requirements of a project is crucial in selecting a vector database that best aligns with the desired tradeoff priorities. It requires carefully evaluating the tradeoffs and making informed decisions based on the unique needs of the application or system being developed.</p>
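<p>The accuracy-versus-speed tradeoff can be felt in a contrived toy experiment: an "approximate" search that probes only a random 20% of the vectors does roughly 20% of the work, but finds the true nearest neighbor only a fraction of the time. Real ANN indexes choose which vectors to probe far more cleverly, which is exactly what the databases below compete on; this sketch is not representative of their actual recall.</p>

```python
import random

random.seed(0)


def l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_nn(query, vectors):
    """Exact nearest neighbor: scan all vectors."""
    return min(range(len(vectors)), key=lambda i: l2(query, vectors[i]))


def approx_nn(query, vectors, fraction=0.2):
    """Toy 'approximate' search: probe only a random subset of the data."""
    n_probe = max(1, int(len(vectors) * fraction))
    sample = random.sample(range(len(vectors)), n_probe)
    return min(sample, key=lambda i: l2(query, vectors[i]))


vectors = [[random.random() for _ in range(8)] for _ in range(500)]
queries = [[random.random() for _ in range(8)] for _ in range(100)]
recall = sum(exact_nn(q, vectors) == approx_nn(q, vectors) for q in queries) / len(queries)
print(f"recall@1 with 20% of the work: {recall:.2f}")  # roughly 0.2 in expectation
```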
<p><a id="vector-databases"></a></p>
<h2>Vector databases</h2>
<p><a id="chroma"></a></p>
<h3>Chroma</h3>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/chroma-core/chroma?logo=github">
<img alt="chroma logo" src="https://user-images.githubusercontent.com/891664/227103090-6624bf7d-9524-4e05-9d2c-c28d5d451481.png">
<a href="https://www.trychroma.com/">Chroma</a> is an open-source vector database developed by Chroma.ai. It focuses on scalability, providing robust support for storing and querying large-scale embedding datasets efficiently. Chroma offers a distributed architecture with horizontal scalability, enabling it to handle massive volumes of vector data. It leverages Apache Cassandra for high availability and fault tolerance, ensuring data persistence and durability.</p>
<p>One unique aspect of Chroma is its <strong>flexible indexing system</strong>. It supports <strong>multiple indexing strategies</strong>, such as <a href="https://en.wikipedia.org/wiki/Nearest_neighbor_search#Approximation_methods">approximate nearest neighbors</a> (ANN) algorithms like <a href="https://arxiv.org/abs/1603.09320">HNSW</a> and <a href="https://towardsdatascience.com/similarity-search-with-ivfpq-9c6348fd4db3">IVFPQ</a>, enabling fast and accurate similarity searches. Chroma also provides comprehensive <strong>Python and RESTful APIs</strong>, making it <strong>easily integratable</strong> into NLP pipelines. With its emphasis on <strong>scalability</strong> and <strong>speed</strong>, Chroma is an excellent choice for applications that require high-performance vector storage and retrieval.</p>
<p>They have <a href="https://colab.research.google.com/drive/1QEzFyqnoFxq7LUGyP1vzR4iLt9PpCDXv?usp=sharing">Colab</a> notebook with the demo.</p>
<p>The core API commands (from the product page)</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">chromadb</span>
<span class="n">client</span> <span class="o">=</span> <span class="n">chromadb</span><span class="o">.</span><span class="n">Client</span><span class="p">()</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">create_collection</span><span class="p">(</span><span class="s2">"test"</span><span class="p">)</span>
<span class="c1"># add embeddings and documents</span>
<span class="n">c</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="o">...</span><span class="p">)</span>
<span class="c1"># get back similar ones</span>
<span class="n">c</span><span class="o">.</span><span class="n">query</span><span class="p">(</span><span class="o">...</span><span class="p">)</span>
</code></pre></div>
<p>Note: there are plugins for LangChain, LlamaIndex, OpenAI and others.
<a id="haystack-by-deepsetai"></a></p>
<h3>Haystack by DeepsetAI</h3>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/deepset-ai/haystack?logo=github"></p>
<p><img alt="haystack logo" src="/images/vectordb/haystack.png">
DeepsetAI's <a href="https://haystack.deepset.ai/">Haystack</a> is another popular vector database designed specifically for NLP applications. It offers a range of features tailored to support end-to-end development of search systems using embeddings. Haystack integrates well with popular transformer models like BERT, allowing users to extract embeddings directly from pre-trained models. It leverages <a href="https://www.elastic.co/what-is/elasticsearch">Elasticsearch</a> as its underlying storage engine, providing powerful indexing and querying capabilities.</p>
<p>Haystack stands out with its <strong>intuitive query language</strong>, which supports complex <strong>semantic searches</strong> and <strong>filtering</strong> based on various parameters. Additionally, it offers a <strong>modular pipeline</strong> architecture for preprocessing, <strong>embedding extraction</strong>, and querying, making it <strong>highly customizable and adaptable</strong> to different NLP use cases. With its <strong>user-friendly interface</strong> and comprehensive functionality, DeepsetAI's Haystack is an excellent choice for developers seeking a flexible and feature-rich vector database for NLP.</p>
<p><a id="faiss-by-facebook"></a></p>
<h3>Faiss by Facebook</h3>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/facebookresearch/faiss?logo=github"></p>
<p><a href="https://faiss.ai/">Faiss</a>, developed by Facebook AI Research, is a widely used vector database renowned for its high-performance similarity search capabilities. It provides a range of indexing methods optimized for efficient retrieval of nearest neighbors, including IVF (Inverted File) and HNSW (Hierarchical Navigable Small World). Faiss also supports GPU acceleration, enabling fast computation on large-scale embeddings.</p>
<p>One of Faiss' notable features is its support for <strong>multi-index search</strong>, which combines different indexing methods to improve search accuracy and speed. Additionally, Faiss offers a <strong>Python interface</strong>, making it easy to integrate with existing NLP pipelines and frameworks. With its focus on <strong>search performance and versatility</strong>, Faiss is a go-to choice for projects demanding fast and accurate similarity <strong>search over vast embedding collections</strong>.</p>
<p><a id="milvus"></a></p>
<h3>Milvus</h3>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/milvus-io/milvus?logo=github"></p>
<p><img alt="Milvus logo" src="/images/vectordb/milvus.png"></p>
<p><a href="https://milvus.io/">Milvus</a> is an open-source vector database developed by Zilliz, designed for efficient storage and retrieval of large-scale embeddings. It provides high scalability and supports distributed deployment across multiple machines, making it suitable for handling massive NLP datasets. Milvus integrates with popular ANN libraries like Faiss, Annoy, and NMSLIB, offering flexible indexing options to achieve high search accuracy.</p>
<p>One key feature of Milvus is its <strong>GPU support</strong>, leveraging NVIDIA GPUs for accelerated computation. This makes Milvus an excellent choice <strong>for deep learning applications</strong> that require fast vector search and similarity calculations. Furthermore, Milvus provides a user-friendly <strong>WebUI</strong> and supports <strong>multiple programming languages</strong>, simplifying development and deployment processes. With its focus on scalability and GPU acceleration, Milvus is an ideal vector database for large-scale NLP projects.</p>
<p><a id="pgvector"></a></p>
<h3>pgvector</h3>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/ankane/pgvector?logo=github"></p>
<p>Open-source vector similarity search for Postgres. pgvector lets you build a vector database on top of PostgreSQL, a popular open-source relational database. It leverages the powerful indexing capabilities of PostgreSQL's extension system to provide efficient storage and retrieval of vector embeddings, supporting both exact and approximate nearest neighbor search.</p>
<p>One key advantage of pgvector is its seamless <strong>integration with the broader PostgreSQL</strong> ecosystem. Users can leverage the rich functionality of PostgreSQL, such as ACID compliance and support for complex queries, while benefiting from vector-specific operations. pgvector provides a PostgreSQL extension that extends the SQL syntax to handle vector operations and offers a Python library for easy integration. With its compatibility with PostgreSQL and efficient vector storage, pgvector is a reliable choice for NLP applications that require a seamless SQL integration.</p>
<p><a id="pinecone"></a></p>
<h3>Pinecone</h3>
<p><img alt="Pinecone logo" src="/images/vectordb/pinecone.png"></p>
<p><a href="https://www.pinecone.io/">Pinecone</a> is a managed vector database built for handling large-scale embeddings in real-time applications. It focuses on low-latency search and high-throughput indexing, making it suitable for latency-sensitive NLP use cases. Pinecone's cloud-native infrastructure handles indexing, storage, and query serving, allowing developers to focus on building their applications.</p>
<p>Pinecone offers a RESTful <strong>API</strong> and client libraries <strong>for various programming languages</strong>, simplifying integration with different NLP frameworks. It supports <strong>dynamic indexing</strong>, allowing incremental updates to embeddings without rebuilding the entire index. Pinecone also provides advanced features like <strong>vector similarity search</strong>, <strong>filtering</strong>, and result ranking. With its <strong>emphasis on real-time performance</strong> and ease of use, Pinecone is an excellent choice for developers seeking a fully managed vector database for NLP applications.</p>
<p><a id="supabase"></a></p>
<h3>Supabase</h3>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/supabase/supabase?logo=github"></p>
<p><img alt="Supabase logo" src="/images/vectordb/supabase.png"></p>
<p><a href="https://supabase.com/">Supabase</a>, known for its open-source data platform, offers a scalable vector storage solution designed for fast and efficient retrieval of embeddings. Supabase leverages PostgreSQL as its underlying storage engine, ensuring data durability and compatibility with standard SQL queries. It provides a range of features such as indexing, querying, and filtering, optimized for vector data.</p>
<p>One distinctive aspect of Supabase is its <strong>real-time capabilities</strong>, enabled by its integration with PostgREST and PostgreSQL's logical decoding feature. This allows developers to build real-time applications that can react to changes in vector data. Supabase also provides a user-friendly <strong>interface</strong> and <strong>client libraries</strong> for <strong>various programming languages</strong>, making it accessible to developers with different skill sets. With its combination of vector storage and real-time capabilities, Supabase is an excellent choice for NLP projects that require both scalability and real-time updates.</p>
<p><a id="qdrant"></a></p>
<h3>Qdrant</h3>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/qdrant/qdrant?logo=github"></p>
<p><img alt="Qdrant logo" src="/images/vectordb/qdrant.png"></p>
<p>Qdrant is an open-source vector database designed for similarity search and efficient storage of high-dimensional embeddings. It leverages an approximate nearest neighbor (ANN) algorithm based on Hierarchical Navigable Small World (HNSW) graphs, enabling fast and accurate similarity searches. Qdrant supports both CPU and GPU inference, allowing users to leverage hardware acceleration for faster computations.</p>
<p>One notable feature of Qdrant is its <strong>RESTful API</strong>, which provides a user-friendly <strong>interface for indexing, searching, and managing vector data</strong>. Qdrant also offers <strong>flexible query options</strong>, allowing users to specify search parameters and control the trade-off between accuracy and speed. With its focus on efficient similarity search and user-friendly API, Qdrant is a powerful vector database for various NLP applications.</p>
<p><a id="vespa"></a></p>
<h3>Vespa</h3>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/vespa-engine/vespa?logo=github"></p>
<p><img alt="vespa logo" src="https://vespa.ai/assets/vespa-logo.png"></p>
<p><a href="https://vespa.ai/">Vespa</a> is an open-source big data processing and serving engine developed by Verizon Media. It provides a distributed, scalable, and high-performance infrastructure for storing and querying vector embeddings. Vespa utilizes an inverted index structure combined with approximate nearest neighbor (ANN) search algorithms for efficient and accurate similarity searches.</p>
<p>One of Vespa's key features is its <strong>built-in ranking framework</strong>, allowing developers to define custom ranking models and apply <strong>complex ranking algorithms to search results</strong>. Vespa also supports <strong>real-time updates</strong>, making it suitable for <strong>dynamic embedding datasets</strong>. Additionally, Vespa provides a <strong>query language</strong> and a user-friendly <strong>WebUI</strong> for managing and monitoring the vector database. With its focus on <strong>distributed processing</strong> and advanced ranking capabilities, Vespa is a powerful tool for NLP applications that require complex ranking models and real-time updates.</p>
<p><a id="weaviate"></a></p>
<h3>Weaviate</h3>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/semi-technologies/weaviate?logo=github"></p>
<p><img alt="Weaviate logo" src="/images/vectordb/weaviate.png"></p>
<p><a href="https://weaviate.io/">Weaviate</a> is an open-source knowledge graph and vector search engine that excels in handling high-dimensional embeddings. It combines the power of graph databases and vector search to provide efficient storage, retrieval, and exploration of vector data. Weaviate offers powerful indexing methods, including approximate nearest neighbor (ANN) algorithms like HNSW, for fast and accurate similarity searches.</p>
<p>One unique aspect of Weaviate is its <strong>focus on semantics and contextual relationships</strong>. It allows users to define <strong>custom schema and relationships between entities</strong>, enabling <strong>complex queries that go beyond simple vector similarity</strong>. Weaviate also provides a <strong>RESTful API</strong>, client libraries, and a user-friendly <strong>WebUI</strong> for easy integration and management. With its combination of <strong>graph database features</strong> and vector search capabilities, Weaviate is an excellent choice <strong>for NLP applications that require semantic understanding and exploration of embeddings</strong>.</p>
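<p>To make this concrete, the sketch below assembles a GraphQL <code>nearVector</code> query of the kind Weaviate's API accepts. The class name <code>Article</code> and the property <code>title</code> are hypothetical schema elements chosen purely for illustration.</p>

```python
def build_near_vector_query(class_name, properties, vector, limit=3):
    """Build a Weaviate GraphQL `nearVector` query string.

    `class_name` and `properties` are placeholders for whatever schema
    your Weaviate instance actually defines.
    """
    vec = ", ".join(str(x) for x in vector)
    props = "\n        ".join(properties)
    return f"""
{{
  Get {{
    {class_name}(nearVector: {{vector: [{vec}]}}, limit: {limit}) {{
        {props}
        _additional {{ distance }}
    }}
  }}
}}"""

query = build_near_vector_query("Article", ["title"], [0.12, 0.34, 0.56])
print(query)
# POST {"query": query} to http://<weaviate-host>/v1/graphql
```

<p>Because queries address schema classes and their properties rather than raw vectors, the same request can mix vector similarity with structured filters, which is the "beyond simple vector similarity" capability described above.</p>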
<p><a id="deeplake"></a></p>
<h3>DeepLake</h3>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/activeloopai/deeplake?logo=github"></p>
<p><img alt="DeepLake logo" src="https://camo.githubusercontent.com/d0c805affb06f5ea9ba767de06b77a04de54a7ef433fad08b2729d5e6b11112c/68747470733a2f2f692e706f7374696d672e63632f72736a63576333532f646565706c616b652d6c6f676f2e706e67">
<a href="https://www.activeloop.ai/">DeepLake</a> is an open-source vector database designed for efficient storage and retrieval of embeddings. It focuses on scalability and speed, making it suitable for handling large-scale NLP datasets. DeepLake provides a distributed architecture with built-in support for horizontal scalability, allowing users to handle massive volumes of vector data.</p>
<p>One unique feature of DeepLake is its support for <strong>distributed vector indexing and querying</strong>. It leverages an <strong>ANN</strong> algorithm based on the Product Quantization (PQ) method, enabling fast and accurate similarity searches. DeepLake also provides a <strong>RESTful API</strong> for easy integration with NLP pipelines and frameworks. With its emphasis on <strong>scalability and distributed processing</strong>, DeepLake is a robust vector database for demanding NLP applications.</p>
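<p>To illustrate the idea behind Product Quantization itself (a generic sketch, not DeepLake's internals), the code below trains a tiny k-means codebook per subspace, encodes each vector as a short tuple of centroid indices, and answers queries with asymmetric distance lookups:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def train_pq(data, n_subspaces=4, n_centroids=16, iters=10):
    """Train one small k-means codebook per subspace (plain Lloyd's)."""
    sub = data.shape[1] // n_subspaces
    codebooks = []
    for s in range(n_subspaces):
        chunk = data[:, s * sub:(s + 1) * sub]
        # initialize centroids from random data points
        cent = chunk[rng.choice(len(chunk), n_centroids, replace=False)]
        for _ in range(iters):
            dist = ((chunk[:, None, :] - cent[None]) ** 2).sum(-1)
            assign = dist.argmin(1)
            for k in range(n_centroids):
                pts = chunk[assign == k]
                if len(pts):
                    cent[k] = pts.mean(0)
        codebooks.append(cent)
    return codebooks

def encode(data, codebooks):
    """Replace each subvector by the index of its nearest centroid."""
    sub = data.shape[1] // len(codebooks)
    codes = []
    for s, cent in enumerate(codebooks):
        chunk = data[:, s * sub:(s + 1) * sub]
        dist = ((chunk[:, None, :] - cent[None]) ** 2).sum(-1)
        codes.append(dist.argmin(1))
    return np.stack(codes, axis=1)  # shape: (n_vectors, n_subspaces)

def search(query, codes, codebooks, top_k=3):
    """Asymmetric distance: query stays exact, database points are quantized."""
    sub = len(query) // len(codebooks)
    # precompute the distance from each query subvector to every centroid
    tables = [((query[s * sub:(s + 1) * sub] - cent) ** 2).sum(1)
              for s, cent in enumerate(codebooks)]
    approx = sum(tables[s][codes[:, s]] for s in range(len(codebooks)))
    return np.argsort(approx)[:top_k]

data = rng.normal(size=(200, 16)).astype(np.float32)
books = train_pq(data)
codes = encode(data, books)
print(search(data[0], codes, books))  # data[0] itself should rank near the top
```

<p>The memory win is the point: each 16-dimensional float vector collapses to four one-byte codes, while the distance tables keep query time linear in the number of vectors with only cheap table lookups.</p>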
<p><a id="vectorstore-from-langchain"></a></p>
<h3>VectorStore from LangChain</h3>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/hwchase17/langchain?logo=github"></p>
<p>LangChain <a href="https://docs.langchain.com/docs/components/indexing/vectorstore">VectorStore</a> is an open-source vector database optimized for multilingual NLP applications. It focuses on efficient storage and retrieval of embeddings across multiple languages. VectorStore supports various indexing methods, including approximate nearest neighbor (ANN) algorithms like HNSW and Annoy, for fast similarity searches.</p>
<p>One distinguishing feature of VectorStore is its <strong>language-specific indexing</strong> and <strong>retrieval capabilities</strong>. It provides <strong>language-specific tokenization</strong> and <strong>indexing strategies</strong> to <strong>optimize search accuracy for different languages</strong>. VectorStore also offers a <strong>RESTful API</strong> and client libraries for easy integration with NLP pipelines. With its multilingual support and language-specific indexing, VectorStore is an excellent choice for projects that deal with embeddings across multiple languages.</p>
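<p>Conceptually, a vector store boils down to a small contract: add texts (embedding them on the way in) and return the most similar texts for a query. The toy sketch below mirrors that shape with a deliberately naive character-frequency "embedding"; it is not LangChain's implementation, just an illustration of the interface real backends fulfil:</p>

```python
import math

class TinyVectorStore:
    """Minimal in-memory sketch of a vector-store interface
    (`add_texts` / `similarity_search`). Real implementations
    delegate to FAISS, Chroma, Pinecone, and similar backends."""

    def __init__(self, embed):
        self.embed = embed          # callable: str -> list[float]
        self.texts, self.vectors = [], []

    def add_texts(self, texts):
        for t in texts:
            self.texts.append(t)
            self.vectors.append(self.embed(t))

    def similarity_search(self, query, k=2):
        q = self.embed(query)
        # rank stored texts by cosine similarity to the query vector
        ranked = sorted(zip(self.texts, self.vectors),
                        key=lambda tv: -self._cos(q, tv[1]))
        return [t for t, _ in ranked[:k]]

    @staticmethod
    def _cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

# toy character-frequency "embedding" -- a stand-in for a real model
def char_embed(text):
    return [text.lower().count(c) for c in "abcdefghijklmnopqrstuvwxyz"]

store = TinyVectorStore(char_embed)
store.add_texts(["vector databases store embeddings",
                 "kanban boards visualize work",
                 "embeddings capture semantic meaning"])
print(store.similarity_search("storing embeddings", k=1))
```

<p>Swapping <code>char_embed</code> for a real embedding model and the list scan for an ANN index is exactly the substitution an abstraction layer like this makes painless.</p>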
<p><a id="other-relevant-vector-databases"></a></p>
<h3>Other Relevant Vector Databases</h3>
<p>While the above tools represent some of the best vector databases available for storing embeddings in NLP, there are other notable options worth exploring:</p>
<h4>Annoy</h4>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/spotify/annoy?logo=github"></p>
<p>Annoy is a lightweight C++ library for approximate nearest neighbor (ANN) search, offering efficient indexing and querying of high-dimensional embeddings.</p>
<h4>Elasticsearch</h4>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/elastic/elasticsearch?logo=github"></p>
<p>Elasticsearch is a popular distributed search and analytics engine that can be used to store and retrieve vector embeddings efficiently.</p>
<h4>Hnswlib</h4>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/nmslib/hnswlib?logo=github"></p>
<p>Hnswlib is a C++ library for efficient approximate nearest neighbor (ANN) search, providing high-performance indexing and retrieval of embeddings.</p>
<h4>NMSLIB</h4>
<p><img alt="github stars shield" src="https://img.shields.io/github/stars/nmslib/nmslib?logo=github"></p>
<p>NMSLIB is an open-source library for similarity search, offering a range of indexing methods and data structures for efficient storage and retrieval of embeddings.</p>
<p>These vector databases provide additional options and features that may suit specific requirements or preferences. Exploring these alternatives can help developers find the best fit for their NLP projects.</p>
<p>To explore more, often lesser-known, libraries, you can use GitHub's topic search: <a href="https://github.com/topics/vector-database">vector-database · GitHub Topics · GitHub</a></p>
<p><a id="tabular-summary-of-the-features"></a></p>
<h2>Tabular summary of the features</h2>
<table>
<thead>
<tr>
<th>Tool</th>
<th>Scalability</th>
<th>Query Speed</th>
<th>Search Accuracy</th>
<th>Flexibility</th>
<th>Persistence</th>
<th>Storage Location</th>
</tr>
</thead>
<tbody>
<tr>
<td>Chroma</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>Yes</td>
<td>Local/Cloud</td>
</tr>
<tr>
<td>DeepsetAI</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>Yes</td>
<td>Local/Cloud</td>
</tr>
<tr>
<td>Faiss</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>Medium</td>
<td>No</td>
<td>Local/Cloud</td>
</tr>
<tr>
<td>Milvus</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>Yes</td>
<td>Local/Cloud</td>
</tr>
<tr>
<td>pgvector</td>
<td>Medium</td>
<td>Medium</td>
<td>High</td>
<td>High</td>
<td>Yes</td>
<td>Local</td>
</tr>
<tr>
<td>Pinecone</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>Yes</td>
<td>Cloud</td>
</tr>
<tr>
<td>Supabase</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>Yes</td>
<td>Cloud</td>
</tr>
<tr>
<td>Qdrant</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>Yes</td>
<td>Local/Cloud</td>
</tr>
<tr>
<td>Vespa</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>Yes</td>
<td>Local/Cloud</td>
</tr>
<tr>
<td>Weaviate</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>Yes</td>
<td>Local/Cloud</td>
</tr>
<tr>
<td>DeepLake</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>Yes</td>
<td>Local/Cloud</td>
</tr>
<tr>
<td>LangChain VectorStore</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>Yes</td>
<td>Local/Cloud</td>
</tr>
<tr>
<td>Annoy</td>
<td>Medium</td>
<td>Medium</td>
<td>Medium</td>
<td>Medium</td>
<td>No</td>
<td>Local/Cloud</td>
</tr>
<tr>
<td>Elasticsearch</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>Yes</td>
<td>Local/Cloud</td>
</tr>
<tr>
<td>Hnswlib</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>No</td>
<td>Local/Cloud</td>
</tr>
<tr>
<td>NMSLIB</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>High</td>
<td>No</td>
<td>Local/Cloud</td>
</tr>
</tbody>
</table>
<p><a id="recommendations"></a></p>
<h2>Recommendations</h2>
<p>Below you will find recommendations for three groups of use cases.
<a id="easy-start-and-user-friendliness---good-for-poc"></a></p>
<h3>Easy Start and User-Friendliness - good for PoC</h3>
<p>In this group, the focus is on vector databases that are easy to start with and user-friendly, even if they may sacrifice some advanced capabilities or performance.</p>
<ol>
<li>
<p><strong>Chroma</strong>: Chroma is an excellent choice for this group due to its simplicity and ease of use. It provides a straightforward API and offers out-of-the-box functionality for vector storage and retrieval. While it may not have the same level of scalability or advanced search algorithms as some other tools, it is ideal for small to medium-sized projects or beginners who want to quickly get started with vector databases.</p>
</li>
<li>
<p><strong>DeepsetAI</strong>: DeepsetAI is another tool that prioritizes user-friendliness without compromising on essential functionalities. It offers a user-friendly interface, powerful search capabilities, and easy integration into existing NLP workflows. DeepsetAI is well-suited for developers who want a simple and efficient solution for storing and querying vector embeddings.</p>
</li>
</ol>
<p><a id="advanced-capabilities-and-performance"></a></p>
<h3>Advanced Capabilities and Performance</h3>
<p>In this group, we consider vector databases that provide advanced capabilities and high-performance, catering to more demanding use cases.</p>
<ol>
<li>
<p><strong>Faiss</strong>: Faiss is a widely used and highly performant vector database that specializes in efficient similarity search. It offers a range of indexing structures and search algorithms, making it suitable for large-scale projects that require fast and accurate retrieval of embeddings. Faiss is an optimal choice when performance and search accuracy are critical.</p>
</li>
<li>
<p><strong>Milvus</strong>: Milvus is another powerful vector database known for its scalability and performance. It provides distributed storage and indexing, allowing for efficient handling of large-scale embedding datasets. Milvus supports various indexing algorithms, including approximate nearest neighbor (ANN) search, enabling fast similarity search. It is a robust solution for projects that demand scalability, high-performance, and flexibility.</p>
</li>
</ol>
<p><a id="customization-and-advanced-use-cases"></a></p>
<h3>Customization and Advanced Use Cases</h3>
<p>In this group, we consider vector databases that offer extensive customization options and cater to advanced use cases with specific requirements.</p>
<ol>
<li>
<p><strong>Pinecone</strong>: Pinecone is a vector database that excels in providing real-time search capabilities and high scalability. It offers advanced features such as dynamic indexing, custom similarity functions, and efficient updates, making it ideal for applications that require real-time embeddings and constant model refinement.</p>
</li>
<li>
<p><strong>Supabase</strong>: Supabase is an open-source database platform that provides a wide range of features, including support for vector storage and retrieval. With its flexibility and customizability, Supabase is suitable for projects that require not only vector database functionality but also the benefits of a comprehensive database platform.</p>
</li>
</ol>
<p>By considering the diverse requirements of each group, we have recommended vector databases that prioritize ease of use, advanced capabilities, and customization. These recommendations aim to assist developers in selecting the most appropriate vector database for their specific use case and level of expertise.
<a id="related-reading"></a></p>
<h2>Related reading</h2>
<ol>
<li><a href="https://lunabrain.com/blog/riding-the-ai-wave-with-vector-databases-how-they-work-and-why-vcs-love-them/">Riding the AI Wave with Vector Databases: How they work (and why VCs love them) - LunaBrain</a></li>
<li><a href="https://harishgarg.com/writing/best-vector-databases-for-ai-apps/">10 Best vector databases for building AI Apps with embeddings - HarishGarg.com</a></li>
<li><a href="https://thenewstack.io/vector-databases-long-term-memory-for-artificial-intelligence/">Vector Databases: Long-Term Memory for Artificial Intelligence - The New Stack</a></li>
<li><a href="https://medium.com/sopmac-ai/vector-databases-as-memory-for-your-ai-agents-986288530443">Vector Databases as Memory for your AI Agents | by Ivan Campos | Sopmac AI | Apr, 2023 | Medium</a></li>
<li><a href="https://venturebeat.com/ai/how-vector-databases-can-revolutionize-our-relationship-with-generative-ai/">How vector databases can revolutionize our relationship with generative AI | VentureBeat</a></li>
<li><a href="https://www.forbes.com/sites/adrianbridgwater/2023/05/19/the-rise-of-vector-databases/">Vector databases provide new ways to enable search and data analytics.</a></li>
<li><a href="https://betterprogramming.pub/openais-embedding-model-with-vector-database-b69014f04433">OpenAI’s Embeddings with Vector Database | Better Programming</a></li>
<li>Vector Databases Demystified series by <a href="https://www.linkedin.com/in/adiekaye/">Adie Kaye</a></li>
<li><a href="https://www.linkedin.com/pulse/vector-databases-demystified-part-1-introduction-world-adie-kaye%3FtrackingId=Rswjt%252BgljDJ9YTjMB08LWw%253D%253D/?trackingId=Rswjt%2BgljDJ9YTjMB08LWw%3D%3D">Part 1 - An Introduction to the World of High-Dimensional Data Storage</a></li>
<li><a href="https://www.linkedin.com/pulse/vector-databases-demystified-part-2-building-your-own-adie-kaye?trackingId=CRILIdZ0zUFLlj3EZ69gXQ%3D%3D&lipi=urn%3Ali%3Apage%3Ad_flagship3_profile_view_base_recent_activity_content_view%3B1s%2FDztmATJWjL%2BLIoqi0XQ%3D%3D">Part 2 - Building Your Own (Very) Simple Vector Database in Python</a></li>
<li><a href="https://www.linkedin.com/pulse/vector-databases-demystified-part-3-build-colour-matching-adie-kaye?trackingId=sS3mR3KmPvSwcPwdMJvbFQ%3D%3D&lipi=urn%3Ali%3Apage%3Ad_flagship3_profile_view_base_recent_activity_content_view%3B1s%2FDztmATJWjL%2BLIoqi0XQ%3D%3D">Part 3 - Build a colour matching app with Pinecone</a></li>
<li><a href="https://www.linkedin.com/pulse/vector-databases-demystified-part-4-using-sentence-pinecone-kaye?trackingId=vfLY3dFcGw%2FVygrCCFKZIQ%3D%3D">Part 4 - Using Sentence Transformers with Pinecone</a></li>
</ol>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>Mastering the Kanban Method - Unveiling the Hidden Gems of Effective Kanban Board Usage2023-05-26T00:00:00+02:002023-05-26T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-05-26:/mastering-kanban-method/<p>Ever wondered how to supercharge your team's productivity? Say hello to Kanban, the dynamic method that brings clarity and efficiency to your projects.</p><h2>Introduction</h2>
<p>In today's fast-paced and ever-evolving business landscape, organizations are constantly seeking efficient project management methodologies to enhance productivity and streamline workflows. One such approach that has gained significant popularity is the Kanban method. Kanban, originating from the Japanese word for "signboard" or "billboard," is a visual project management system that allows teams to track and manage work effectively. In this comprehensive guide, we will delve into the intricacies of the Kanban method, explore the proper utilization of Kanban boards, and reveal lesser-known tips and tricks to maximize their potential.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#understanding-the-kanban-method-principles">Understanding the Kanban Method Principles</a><ul>
<li><a href="#visualize-your-workflow">Visualize Your Workflow</a></li>
<li><a href="#limit-work-in-progress-wip">Limit Work in Progress (WIP)</a></li>
<li><a href="#collaborate-and-encourage-flow">Collaborate and Encourage Flow</a></li>
<li><a href="#continuously-improve">Continuously Improve</a></li>
</ul>
</li>
<li><a href="#avoiding-common-mistakes">Avoiding Common Mistakes</a><ul>
<li><a href="#neglecting-wip-limits">Neglecting WIP Limits</a></li>
<li><a href="#lack-of-clarity-and-standardization">Lack of Clarity and Standardization</a></li>
<li><a href="#failure-to-prioritize-and-swarm">Failure to Prioritize and Swarm</a></li>
<li><a href="#lack-of-continuous-improvement">Lack of Continuous Improvement</a></li>
</ul>
</li>
<li><a href="#unveiling-lesser-known-tips-and-tricks">Unveiling Lesser-Known Tips and Tricks</a><ul>
<li><a href="#class-of-service">Class of Service</a></li>
<li><a href="#visualizing-blocked-tasks">Visualizing Blocked Tasks</a></li>
<li><a href="#kanban-swimlanes">Kanban Swimlanes</a></li>
<li><a href="#implementing-agile-practices">Implementing Agile Practices</a></li>
</ul>
</li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="understanding-the-kanban-method-principles"></a></p>
<h2>Understanding the Kanban Method Principles</h2>
<p>The Kanban method, which grew out of the scheduling system Taiichi Ohno developed at Toyota, is built on the principles of visualizing work, limiting work in progress (WIP), and focusing on continuous improvement. At its core, Kanban promotes transparency, flexibility, and collaboration, providing teams with a clear overview of their tasks and enabling them to optimize their workflows.</p>
<p><a id="visualize-your-workflow"></a></p>
<h3>Visualize Your Workflow</h3>
<p>The fundamental principle of Kanban lies in visualizing your workflow. By representing each task as a card or sticky note on a Kanban board, teams gain a shared understanding of the work in progress. A typical Kanban board comprises columns that depict different stages of work, such as "To Do," "In Progress," and "Done." Visualizing tasks fosters transparency, enhances communication, and enables team members to identify bottlenecks or inefficiencies quickly.</p>
<p><a id="limit-work-in-progress-wip"></a></p>
<h3>Limit Work in Progress (WIP)</h3>
<p>To maintain a smooth workflow and prevent overburdening team members, it is crucial to limit the number of tasks in progress simultaneously. Setting WIP limits for each column on the Kanban board ensures a manageable workload, promotes focus, and encourages completing tasks before moving on to new ones. WIP limits prevent multitasking, which can lead to reduced productivity and increased lead times.</p>
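<p>The WIP-limit rule is mechanical enough to enforce in software, which is what most Kanban tools do. The hypothetical sketch below models a board that rejects any move that would push a column past its limit:</p>

```python
class KanbanBoard:
    """Toy board that refuses moves violating per-column WIP limits."""

    def __init__(self, wip_limits):
        self.wip_limits = wip_limits                 # column -> max cards (None = unlimited)
        self.columns = {name: [] for name in wip_limits}

    def add(self, column, card):
        self._check(column)
        self.columns[column].append(card)

    def move(self, card, src, dst):
        self._check(dst)
        self.columns[src].remove(card)
        self.columns[dst].append(card)

    def _check(self, column):
        limit = self.wip_limits[column]
        if limit is not None and len(self.columns[column]) >= limit:
            raise ValueError(f"WIP limit reached for '{column}' ({limit})")

board = KanbanBoard({"To Do": None, "In Progress": 2, "Done": None})
for card in ["spec", "api", "docs"]:
    board.add("To Do", card)
board.move("spec", "To Do", "In Progress")
board.move("api", "To Do", "In Progress")
try:
    board.move("docs", "To Do", "In Progress")   # a third card exceeds the limit
except ValueError as e:
    print(e)
```

<p>Rejecting the move, instead of silently allowing it, is the point: the blocked pull forces the team to finish "spec" or "api" before starting "docs".</p>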
<p><a id="collaborate-and-encourage-flow"></a></p>
<h3>Collaborate and Encourage Flow</h3>
<p>Kanban encourages collaboration and cross-functional teamwork. By eliminating silos and fostering a culture of shared responsibility, teams can achieve a seamless flow of work. Encourage frequent communication, promote knowledge sharing, and embrace a collective ownership mindset to optimize the overall efficiency of your Kanban system.</p>
<p><a id="continuously-improve"></a></p>
<h3>Continuously Improve</h3>
<p>The Kanban method is rooted in the philosophy of continuous improvement. Encourage your team to reflect on their processes, identify areas of improvement, and implement changes accordingly. By regularly reviewing your Kanban board, analyzing cycle times, and seeking feedback from team members, you can refine your workflows, streamline processes, and enhance overall productivity.</p>
<p><a id="avoiding-common-mistakes"></a></p>
<h2>Avoiding Common Mistakes</h2>
<p>While the Kanban method offers numerous benefits, it's important to be aware of common pitfalls that can hinder its effectiveness. By recognizing and avoiding these mistakes, you can ensure your Kanban implementation is successful.</p>
<p><a id="neglecting-wip-limits"></a></p>
<h3>Neglecting WIP Limits</h3>
<p>One common mistake is neglecting WIP limits or setting them too high. Failing to adhere to WIP limits can lead to task overload, reduced focus, and increased lead times. Regularly review and adjust WIP limits based on team capacity and project requirements.</p>
<p><a id="lack-of-clarity-and-standardization"></a></p>
<h3>Lack of Clarity and Standardization</h3>
<p>Without clear guidelines and standardized practices, teams may interpret Kanban differently, leading to confusion and inconsistency. Establish explicit rules for how tasks should be represented on the board, how updates are communicated, and how metrics are measured. Consistency ensures everyone understands the workflow and can collaborate effectively.</p>
<p><a id="failure-to-prioritize-and-swarm"></a></p>
<h3>Failure to Prioritize and Swarm</h3>
<p>In Kanban, it's important to prioritize tasks and encourage the team to focus on completing them one at a time. Neglecting prioritization can lead to cherry-picking tasks or tackling low-value items first. Additionally, encourage swarming, where team members collaborate to complete tasks together, rather than working individually, to maximize efficiency and knowledge sharing.</p>
<p><a id="lack-of-continuous-improvement"></a></p>
<h3>Lack of Continuous Improvement</h3>
<p>One of the main principles of Kanban is continuous improvement. Failing to allocate time for retrospectives, process analysis, and incremental changes can hinder your team's growth and limit the full potential of your Kanban system. Regularly review and refine your workflows to ensure ongoing progress and evolution.</p>
<p><a id="unveiling-lesser-known-tips-and-tricks"></a></p>
<h2>Unveiling Lesser-Known Tips and Tricks</h2>
<p>Now, let's uncover some lesser-known tips and tricks that can take your Kanban practice to the next level, boosting your team's productivity and overall success.</p>
<p><a id="class-of-service"></a></p>
<h3>Class of Service</h3>
<p>Introduce the concept of "Class of Service" to prioritize tasks based on their impact and urgency. By assigning different classes to tasks, such as expedite, standard, or fixed-date, teams can ensure that critical work is appropriately prioritized and expedited, while still maintaining a steady flow.</p>
<p><a id="visualizing-blocked-tasks"></a></p>
<h3>Visualizing Blocked Tasks</h3>
<p>In addition to representing tasks in progress, leverage the Kanban board to highlight blocked or stalled tasks. Use specific indicators or flags to denote issues preventing progress, such as dependencies, resource constraints, or waiting for external feedback. This visual cue helps the team focus on resolving blockers and ensures smoother workflow management.</p>
<p><a id="kanban-swimlanes"></a></p>
<h3>Kanban Swimlanes</h3>
<p>Introduce swimlanes on your Kanban board to categorize tasks based on different criteria, such as priority, team member, or project phase. Swimlanes provide a higher level of organization and enable teams to filter and analyze their work in a more granular manner. This approach can be particularly beneficial for larger teams or complex projects.</p>
<p><a id="implementing-agile-practices"></a></p>
<h3>Implementing Agile Practices</h3>
<p>Combine Kanban with agile practices to amplify its impact. Techniques like daily stand-ups, sprint planning, and retrospectives can complement the visual nature of Kanban, fostering enhanced collaboration, transparency, and adaptability within your team.</p>
<p><a id="conclusion"></a></p>
<h2>Conclusion</h2>
<p>The Kanban method, with its emphasis on visualization, limiting work in progress, and continuous improvement, offers organizations a powerful tool to optimize their workflows and enhance team productivity. By avoiding common mistakes and incorporating lesser-known tips and tricks, teams can unlock the full potential of Kanban, streamline their processes, and achieve remarkable results. Embrace the power of Kanban, and watch your projects flourish in an environment of transparency, collaboration, and continuous improvement.</p>
<p><strong>Credits</strong>: heading image from <a href="https://unsplash.com/photos/OXmym9cuaEY">unsplash</a> by <a href="https://unsplash.com/@edenconstantin0">edenconstantin0</a></p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>Getting the User's Home Directory Path in Python - A Cross-Platform Guide2023-04-20T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-20:/python-user-home-directory/<h2>Use <code>os.path.expanduser()</code></h2>
<p>To get the user's home directory in Python, you can use the <code>os.path.expanduser()</code> function. This function expands the initial tilde <code>~</code> character in a file path to the user's home directory path.</p>
<p>Here's an example:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">os</span>
<span class="n">home_dir</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">expanduser</span><span class="p">(</span><span class="s2">"~"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">home_dir</span><span class="p">)</span>
</code></pre></div>
<p>This should output the path to the user's home directory, which will be different depending on the operating system.</p>
<p>For example, on a Unix-based system such as macOS or Linux, this will output something like <code>/Users/username</code>. On a Windows system, it will output something like <code>C:\Users\username</code>.</p>
<p>Using <code>os.path.expanduser()</code> is a cross-platform solution because it automatically handles the differences between operating systems in how they represent home directory paths.</p>
<h2>Use <code>Path.home()</code></h2>
<p>You can also use the <code>Path.home()</code> method of the <code>pathlib</code> module to get the user's home directory path in a platform-independent way. Here's an example:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="n">home_dir</span> <span class="o">=</span> <span class="n">Path</span><span class="o">.</span><span class="n">home</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="n">home_dir</span><span class="p">)</span>
</code></pre></div>
<p>This will output the same path to the user's home directory as the previous example, but it uses the <code>Path</code> object instead of the <code>os</code> module.</p>
<p>The <code>Path.home()</code> method is a cross-platform way of getting the user's home directory path. It returns a <code>Path</code> object representing the home directory path, which can be used with other <code>pathlib</code> methods to manipulate file paths in a platform-independent way.</p>
<h2>Other alternatives</h2>
<p>There are a few other ways to get the user's home directory path in Python, some of which are platform-dependent.</p>
<ol>
<li>Using the <code>os.environ</code> dictionary:</li>
</ol>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">os</span>
<span class="n">home_dir</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s1">'HOME'</span><span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="n">home_dir</span><span class="p">)</span>
</code></pre></div>
<p>This works on Unix-based systems like macOS and Linux, where the <code>HOME</code> environment variable is set to the user's home directory path.</p>
<ol>
<li>Using the <code>os.path.expandvars()</code> function:</li>
</ol>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">os</span>
<span class="n">home_dir</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">expandvars</span><span class="p">(</span><span class="s1">'$HOME'</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">home_dir</span><span class="p">)</span>
</code></pre></div>
<p>This also works on Unix-based systems where the <code>HOME</code> environment variable is set, but it can also work on other systems if the appropriate environment variable is set.</p>
<ol>
<li>Using the <code>winreg</code> module on Windows:</li>
</ol>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">winreg</span>
<span class="n">key</span> <span class="o">=</span> <span class="n">winreg</span><span class="o">.</span><span class="n">OpenKey</span><span class="p">(</span><span class="n">winreg</span><span class="o">.</span><span class="n">HKEY_CURRENT_USER</span><span class="p">,</span> <span class="sa">r</span><span class="s2">"SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\Shell Folders"</span><span class="p">)</span>
<span class="n">home_dir</span> <span class="o">=</span> <span class="n">winreg</span><span class="o">.</span><span class="n">QueryValueEx</span><span class="p">(</span><span class="n">key</span><span class="p">,</span> <span class="s2">"Personal"</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="nb">print</span><span class="p">(</span><span class="n">home_dir</span><span class="p">)</span>
</code></pre></div>
<p>This works on Windows systems, but note that the <code>"Personal"</code> registry value actually points to the user's Documents folder rather than the home directory itself. It also requires the <code>winreg</code> module and accesses the Windows Registry, so it is not as platform-independent as the other solutions.</p>
<p>Overall, using either <code>os.path.expanduser()</code> or <code>Path.home()</code> is the most reliable and platform-independent way to get the user's home directory path in Python.</p>Attacking Differential Privacy Using the Correlation Between the Features2023-04-19T00:00:00+02:002023-04-19T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-19:/attacking-differential-privacy-using-the-correlation-between-the-features/<p>Learn how differential privacy works by simulating an attack on data protected with that technique.</p><h2>Introduction</h2>
<p>Differential privacy is a technique that adds random noise to the data to protect individual privacy while still allowing for accurate data analysis. However, differential privacy can still be vulnerable to attacks that can compromise the privacy of individuals. One such attack is through the use of correlation between features. In this blog post, we will discuss how an attacker can use correlation between features to attack differential privacy and how to mitigate this attack.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#background">Background</a></li>
<li><a href="#correlation-between-features">Correlation Between Features</a></li>
<li><a href="#steps-for-the-attack-using-correlation-between-features">Steps for the attack using correlation between features</a></li>
<li><a href="#1-identify-highly-correlated-features">1. Identify highly correlated features</a></li>
<li><a href="#2-compute-expected-values">2. Compute expected values</a></li>
<li><a href="#3-compare-expected-and-observed-values">3. Compare expected and observed values</a></li>
<li><a href="#mitigating-the-attack">Mitigating the Attack</a></li>
<li><a href="#summary">Summary</a></li>
<li><a href="#tutorial">Tutorial</a></li>
<li><a href="#select-a-dataset-that-requires-privacy-protection">Select a dataset that requires privacy protection</a></li>
<li><a href="#apply-differential-privacy">Apply differential privacy</a></li>
<li><a href="#perform-the-attack---reconstruct-original-data-by-exploiting-correlation-between-features">Perform the attack - reconstruct original data by exploiting correlation between features</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="background"></a></p>
<h2>Background</h2>
<p>Differential privacy adds random noise to the data to protect the privacy of individuals. The amount of noise added depends on a parameter called the privacy budget. The higher the privacy budget, the less noise is added, and the lower the privacy budget, the more noise is added. The privacy budget is usually set based on the desired level of privacy and the size of the data set. A smaller privacy budget leads to better privacy but less accurate data analysis, while a larger privacy budget leads to less privacy but more accurate data analysis.</p>
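<p>The standard way to realize this trade-off is the Laplace mechanism: noise is drawn from a Laplace distribution whose scale equals the query's sensitivity divided by the privacy budget ε, so a smaller budget yields noisier answers. A minimal sketch for a counting query, using NumPy's Laplace sampler:</p>

```python
import numpy as np

rng = np.random.default_rng(42)

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Release `true_value` with Laplace noise of scale sensitivity/epsilon.
    A smaller epsilon (tighter privacy budget) means a larger noise scale."""
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

# counting query: adding or removing one person changes the count by at most 1
true_count = 1234
for eps in (0.1, 1.0, 10.0):
    noisy = laplace_mechanism(true_count, sensitivity=1, epsilon=eps)
    print(f"epsilon={eps:>4}: noisy count = {noisy:.1f}")
```

<p>Running this shows the trade-off directly: at ε = 0.1 the released count can be off by tens, while at ε = 10 it is typically within a fraction of a unit of the truth.</p>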
<p><a id="correlation-between-features"></a></p>
<h2>Correlation Between Features</h2>
<p>In many data sets, the features are not independent but are correlated with each other. Correlation between features can be measured using the correlation coefficient. The correlation coefficient between two features x and y is defined as:</p>
<div class="math">$$
\rho_{x,y} = \frac{\operatorname{cov}(x,y)}{\sigma_x \sigma_y}
$$</div>
<p>where <span class="math">\(cov(x,y)\)</span> is the covariance between <span class="math">\(x\)</span> and <span class="math">\(y\)</span>, and <span class="math">\(\sigma_x\)</span> and <span class="math">\(\sigma_y\)</span> are the standard deviations of <span class="math">\(x\)</span> and <span class="math">\(y\)</span>, respectively.</p>
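<p>As a quick check of the formula, the coefficient can be computed directly with NumPy (the numbers below are made up for illustration):</p>

```python
import numpy as np

x = np.array([25, 32, 47, 51, 62], dtype=float)  # e.g. age
y = np.array([10, 12, 14, 15, 16], dtype=float)  # e.g. years of education

# rho = cov(x, y) / (sigma_x * sigma_y); ddof=1 matches np.cov's default
rho = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

# np.corrcoef computes the same quantity
rho_check = np.corrcoef(x, y)[0, 1]
```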
<p>Correlation between features can be used to attack differential privacy. An attacker can use the correlation between features to infer the presence or absence of an individual's data in the data set. For example, suppose an attacker knows that two features x and y are highly correlated. If the attacker sees that the value of y is very different from what they would expect based on the value of x, they can infer that the individual's data was not included in the data set.</p>
<p><a id="steps-for-the-attack-using-correlation-between-features"></a></p>
<h2>Steps for the attack using correlation between features</h2>
<p>An attacker can use the following steps to attack differential privacy using correlation between features:</p>
<p><a id="1-identify-highly-correlated-features"></a></p>
<h3>1. Identify highly correlated features</h3>
<p>The attacker identifies which features in the data set are highly correlated with each other.</p>
<p><a id="2-compute-expected-values"></a></p>
<h3>2. Compute expected values</h3>
<p>The attacker computes the expected values of the features based on the values of the other features.</p>
<p><a id="3-compare-expected-and-observed-values"></a></p>
<h3>3. Compare expected and observed values</h3>
<p>The attacker compares the expected values with the observed values of the features. If the observed values are significantly different from the expected values, the attacker can infer that the individual's data was not included in the data set.</p>
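<p>The three steps above can be sketched on synthetic data (a toy illustration, not the tutorial dataset; the 3-sigma threshold is an arbitrary choice):</p>

```python
import numpy as np

rng = np.random.default_rng(42)

# Step 1: identify highly correlated features (correlated by construction here)
x = rng.normal(50, 10, size=1000)
y = 0.3 * x + rng.normal(0, 1, size=1000)

# estimate the linear relationship between the two features
slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()

# Step 2: compute the expected value of y given an observed value of x
x_obs, y_obs = 60.0, 18.2
y_expected = slope * x_obs + intercept

# Step 3: compare expected and observed values; a large gap suggests the
# record was perturbed or absent from the data set
residual_scale = np.std(y - (slope * x + intercept))
suspicious = abs(y_obs - y_expected) > 3 * residual_scale
```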
<p><a id="mitigating-the-attack"></a></p>
<h2>Mitigating the Attack</h2>
<p>There are several ways to mitigate the attack using correlation between features. One approach is to <strong>decorrelate the features</strong> by transforming the data. For example, principal component analysis (PCA) can be used to decorrelate the features. Another approach is to <strong>add noise to the data</strong> in a way that preserves the correlation between features. This approach is called differentially private PCA (DP-PCA). DP-PCA adds noise to the data in a way that satisfies differential privacy while preserving the correlation between features.</p>
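<p>The decorrelation idea can be illustrated with a plain eigendecomposition-based PCA in NumPy (this is a sketch of decorrelation only; it does not by itself provide differential privacy):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# two strongly correlated features
x = rng.normal(0, 1, size=500)
data = np.column_stack([x, 0.8 * x + rng.normal(0, 0.3, size=500)])

# PCA: project the centered data onto the eigenvectors of its covariance
centered = data - data.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))
decorrelated = centered @ eigvecs

# after the projection, the off-diagonal covariance is numerically zero
off_diag = np.cov(decorrelated.T)[0, 1]
```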
<p><a id="summary"></a></p>
<h2>Summary</h2>
<p>Correlation between features can be used to attack differential privacy. An attacker can use the correlation between features to infer the presence or absence of an individual's data in the data set. To mitigate this attack, the features can be decorrelated or noise can be added to the data using DP-PCA. Data security experts should be aware of this attack and take appropriate measures to mitigate its effects.</p>
<p><a id="tutorial"></a></p>
<h2>Tutorial</h2>
<p>In this tutorial, we will go through the steps of attacking differential privacy by exploiting correlations between features, using Python code to demonstrate each step.</p>
<p>In this tutorial we will use the PyDP Python library, so install it first:</p>
<div class="highlight"><pre><span></span><code>pip<span class="w"> </span>install<span class="w"> </span>python-dp
</code></pre></div>
<p><a id="select-a-dataset-that-requires-privacy-protection"></a></p>
<h3>Select a dataset that requires privacy protection</h3>
<p>For this tutorial, we will use the Adult dataset from the UCI Machine Learning Repository. This dataset contains information about individuals, including their age, education level, marital status, occupation, and more. The goal is to predict whether an individual earns more than $50K per year. We will load this dataset using pandas:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s2">"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"</span><span class="p">,</span>
<span class="n">header</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span>
<span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s2">"age"</span><span class="p">,</span> <span class="s2">"workclass"</span><span class="p">,</span> <span class="s2">"fnlwgt"</span><span class="p">,</span> <span class="s2">"education"</span><span class="p">,</span> <span class="s2">"education-num"</span><span class="p">,</span> <span class="s2">"marital-status"</span><span class="p">,</span>
<span class="s2">"occupation"</span><span class="p">,</span> <span class="s2">"relationship"</span><span class="p">,</span> <span class="s2">"race"</span><span class="p">,</span> <span class="s2">"sex"</span><span class="p">,</span> <span class="s2">"capital-gain"</span><span class="p">,</span> <span class="s2">"capital-loss"</span><span class="p">,</span>
<span class="s2">"hours-per-week"</span><span class="p">,</span> <span class="s2">"native-country"</span><span class="p">,</span> <span class="s2">"income"</span><span class="p">])</span>
</code></pre></div>
<p><a id="apply-differential-privacy"></a></p>
<h3>Apply differential privacy</h3>
<p>We will use the PyDP library to apply differential privacy to the dataset. We will add Laplace noise to the age and education-num features, with a privacy budget of 1.0:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">pydp.algorithms.laplacian</span> <span class="kn">import</span> <span class="n">BoundedMean</span>
<span class="n">epsilon</span> <span class="o">=</span> <span class="mf">1.0</span>
<span class="c1"># apply differential privacy to age</span>
<span class="n">bm</span> <span class="o">=</span> <span class="n">BoundedMean</span><span class="p">(</span><span class="n">epsilon</span><span class="o">=</span><span class="n">epsilon</span><span class="p">,</span> <span class="n">lower</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">upper</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s2">"age"</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s2">"age"</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">bm</span><span class="o">.</span><span class="n">quick_result</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="c1"># apply differential privacy to education-num</span>
<span class="n">bm</span> <span class="o">=</span> <span class="n">BoundedMean</span><span class="p">(</span><span class="n">epsilon</span><span class="o">=</span><span class="n">epsilon</span><span class="p">,</span> <span class="n">lower</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">upper</span><span class="o">=</span><span class="mi">16</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s2">"education-num"</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s2">"education-num"</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">bm</span><span class="o">.</span><span class="n">quick_result</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
</code></pre></div>
<p><a id="perform-the-attack---reconstruct-original-data-by-exploiting-correlation-between-features"></a></p>
<h3>Perform the attack - reconstruct original data by exploiting correlation between features</h3>
<p>Now that we have applied differential privacy to the dataset, we will attempt to reconstruct the original data by exploiting the correlation between features. Specifically, we will use the age and education-num features, which we know are highly correlated, to infer the values of the original data.</p>
<p>First, we will compute the mean vector and covariance matrix of the two (noised) numeric features. These are the statistics an attacker can estimate from the released data; note that they must be computed on numeric columns only, since the categorical columns cannot enter the covariance computation:</p>
<div class="highlight"><pre><code>import numpy as np

# the attack works on the two correlated numeric features
numeric = df[["age", "education-num"]].astype(float).values

# mean vector and 2x2 covariance matrix of the noised features
mean = np.mean(numeric, axis=0)
cov = np.cov(numeric.T)
</code></pre></div>
<p>The mean and covariance matrix can also be used to generate synthetic data with the same first- and second-order statistics:</p>
<div class="highlight"><pre><code># generate synthetic (age, education-num) pairs
synthetic_data = np.random.multivariate_normal(mean, cov, size=df.shape[0])
synthetic_df = pd.DataFrame(synthetic_data, columns=["age", "education-num"])
</code></pre></div>
<p>Finally, we will reconstruct each of the two features from the other, using the linear (least-squares) predictor implied by the estimated covariance:</p>
<div class="highlight"><pre><code># reconstruct age from education-num, and vice versa
reconstructed_age = (df["education-num"].values - mean[1]) / cov[1, 1] * cov[0, 1] + mean[0]
reconstructed_edu_num = (df["age"].values - mean[0]) / cov[0, 0] * cov[0, 1] + mean[1]

# combine the reconstructed features with the remaining original columns
df_attack = df.drop(columns=["age", "education-num"])
reconstructed_df = pd.DataFrame({"age": reconstructed_age, "education-num": reconstructed_edu_num})
df_reconstructed = pd.concat([df_attack, reconstructed_df], axis=1)
</code></pre></div>
<p>We can now compare the reconstructed age and education-num features with the original features to see how well our attack worked:</p>
<div class="highlight"><pre><code># compare reconstructed age and education-num with original features
print("Age:")
print("Original:", df["age"].values[:10])
print("Reconstructed:", reconstructed_age[:10])
print()
print("Education-num:")
print("Original:", df["education-num"].values[:10])
print("Reconstructed:", reconstructed_edu_num[:10])
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="n">Age</span><span class="p">:</span>
<span class="n">Original</span><span class="p">:</span> <span class="p">[</span><span class="mi">39</span> <span class="mi">50</span> <span class="mi">38</span> <span class="mi">53</span> <span class="mi">28</span> <span class="mi">37</span> <span class="mi">49</span> <span class="mi">52</span> <span class="mi">31</span> <span class="mi">42</span><span class="p">]</span>
<span class="n">Reconstructed</span><span class="p">:</span> <span class="p">[</span><span class="mf">39.38640885</span> <span class="mf">49.44619487</span> <span class="mf">38.2757904</span> <span class="mf">52.75103613</span> <span class="mf">26.46121269</span> <span class="mf">37.760824</span>
<span class="mf">47.88143872</span> <span class="mf">52.8530772</span> <span class="mf">30.79760633</span> <span class="mf">42.56495885</span><span class="p">]</span>
<span class="n">Education</span><span class="o">-</span><span class="n">num</span><span class="p">:</span>
<span class="n">Original</span><span class="p">:</span> <span class="p">[</span><span class="mi">13</span> <span class="mi">13</span> <span class="mi">9</span> <span class="mi">7</span> <span class="mi">13</span> <span class="mi">14</span> <span class="mi">5</span> <span class="mi">9</span> <span class="mi">14</span> <span class="mi">13</span><span class="p">]</span>
<span class="n">Reconstructed</span><span class="p">:</span> <span class="p">[</span><span class="mf">13.19164695</span> <span class="mf">13.19406455</span> <span class="mf">9.04750693</span> <span class="mf">6.8549391</span> <span class="mf">13.25155432</span> <span class="mf">13.76664294</span>
<span class="mf">5.45598348</span> <span class="mf">8.72003132</span> <span class="mf">14.14489928</span> <span class="mf">12.9968581</span> <span class="p">]</span>
</code></pre></div>
<p>As we can see, the reconstructed values are quite similar to the original values. This suggests that an attacker could use the correlation between the age and education-num features to infer the original values, even with the protection of differential privacy.</p>
<p><a id="conclusion"></a></p>
<h3>Conclusion</h3>
<p>In this tutorial, we have demonstrated how an attacker can exploit correlations between features to attack differential privacy. We used the PyDP library to apply differential privacy to a dataset, and then showed how an attacker could use the correlation between the age and education-num features to reconstruct the original values. This highlights the importance of considering the correlations between features when applying differential privacy, and suggests that additional protections may be necessary to prevent attacks based on feature correlations.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Python Regex Named Groups2023-04-19T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-19:/python-regex-named-groups/<p>In Python regex, <code>match.groupdict()</code> is a method that returns a dictionary containing all the named groups of a regular expression match.</p>
<p>When you use named capturing groups in a regular expression using the <code>(?P<name>...)</code> syntax, you can access the captured text using the <code>groupdict()</code> method on the match object returned by <code>re.match()</code> or <code>re.search()</code>. The keys of the dictionary correspond to the group names, and the values are the captured text for each group.</p>
<p>Here's an example:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">re</span>
<span class="n">pattern</span> <span class="o">=</span> <span class="sa">r</span><span class="s1">'(?P<year>\d</span><span class="si">{4}</span><span class="s1">)-(?P<month>\d</span><span class="si">{2}</span><span class="s1">)-(?P<day>\d</span><span class="si">{2}</span><span class="s1">)'</span>
<span class="n">text</span> <span class="o">=</span> <span class="s1">'Today is 2023-04-19'</span>
<span class="n">match</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">pattern</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>
<span class="k">if</span> <span class="n">match</span><span class="p">:</span>
<span class="nb">print</span><span class="p">(</span><span class="n">match</span><span class="o">.</span><span class="n">groupdict</span><span class="p">())</span>
</code></pre></div>
<p>Output:</p>
<div class="highlight"><pre><span></span><code><span class="p">{</span><span class="s1">'year'</span><span class="p">:</span> <span class="s1">'2023'</span><span class="p">,</span> <span class="s1">'month'</span><span class="p">:</span> <span class="s1">'04'</span><span class="p">,</span> <span class="s1">'day'</span><span class="p">:</span> <span class="s1">'19'</span><span class="p">}</span>
</code></pre></div>
<p>In the above example, the regular expression pattern matches a date string in the format 'yyyy-mm-dd', and each part of the date is captured using named groups. The <code>groupdict()</code> method returns a dictionary with keys 'year', 'month', and 'day', and their corresponding captured values.</p>Are LIME Explanations Any Useful?2023-04-18T00:00:00+02:002023-04-18T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-18:/are-lime-explanations-any-useful/<p>Don't let black box models hold you back. With LIME, you can interpret the predictions of even the most complex machine learning models.</p><p>LIME (Local Interpretable Model-agnostic Explanations) is a method used to interpret black box models. This technique is widely used in the field of data science to explain the predictions of complex machine learning models. By providing local explanations, LIME can help users understand the decision-making process of the models and increase their trust in the models' predictions. However, the question remains, are the local explanations obtained with LIME method useful? And what are the practical use cases when using LIME gave tangible results?</p>
<p>In this article, we will delve into the concept of LIME, its practical applications, and how it can be used to provide interpretable machine learning models.</p>
<h2>What is LIME?</h2>
<p>LIME is a model-agnostic technique used to explain the predictions of machine learning models. The idea behind LIME is to explain the predictions of a black box model by training a local, interpretable model around the data point of interest. The interpretable model is trained to mimic the behavior of the black box model around that data point. Once the local model is trained, it can be used to generate an explanation of the prediction, highlighting the most important features that contributed to the prediction.</p>
<p>The LIME algorithm consists of the following steps:</p>
<ol>
<li>Select a data point of interest.</li>
<li>Generate a dataset of perturbed instances around the selected data point.</li>
<li>Evaluate the black box model on the perturbed instances to obtain a set of weights that indicate the importance of each feature for the prediction.</li>
<li>Train an interpretable model (such as a linear regression model) on the perturbed instances, using the weights obtained in step 3 as feature weights.</li>
<li>Use the trained interpretable model to generate an explanation of the prediction for the selected data point.</li>
</ol>
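<p>The steps above can be sketched in plain NumPy. This toy version explains one prediction of a hypothetical black-box function by fitting a proximity-weighted linear surrogate (weighted least squares); it is a simplified illustration of the idea, not the actual <code>lime</code> package:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def black_box(X):
    """A hypothetical opaque model: nonlinear in feature 0, weak in feature 1."""
    return np.sin(X[:, 0]) + 0.1 * X[:, 1]

# 1. data point of interest
x0 = np.array([1.0, 2.0])

# 2. perturbed instances around x0
Z = x0 + rng.normal(0.0, 0.5, size=(500, 2))

# 3. proximity weights (RBF kernel): closer samples matter more
weights = np.exp(-np.sum((Z - x0) ** 2, axis=1) / (2 * 0.5 ** 2))

# 4. weighted linear surrogate fitted on the perturbed instances
A = np.column_stack([np.ones(len(Z)), Z])  # bias column + features
AtW = A.T * weights                        # A^T W without forming the diagonal W
coef = np.linalg.solve(AtW @ A, AtW @ black_box(Z))

# 5. the surrogate's coefficients are the local explanation;
#    near x0 the true local slopes are cos(1) ~ 0.54 and 0.1
local_importance = coef[1:]
```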
<h2>Practical applications of LIME</h2>
<p>LIME has been successfully applied in various domains, including healthcare, finance, and image recognition. Here are some practical use cases where LIME has been used to provide interpretable machine learning models:</p>
<ol>
<li>
<p><strong>Healthcare</strong>: LIME has been used to interpret the predictions of machine learning models that diagnose diseases. For example, in a study conducted by Zech et al., LIME was used to interpret the predictions of a deep learning model that diagnosed pneumonia from chest X-rays. The LIME explanations provided by the study helped radiologists understand the decision-making process of the model and identify areas of the X-rays that contributed the most to the diagnosis.</p>
</li>
<li>
<p><strong>Finance</strong>: LIME has also been used to interpret the predictions of machine learning models that predict financial outcomes. For example, in a study conducted by Chen et al., LIME was used to interpret the predictions of a machine learning model that predicted the credit risk of borrowers. The LIME explanations provided by the study helped lenders understand the factors that contributed to the credit risk prediction and make informed lending decisions.</p>
</li>
<li>
<p><strong>Image recognition</strong>: LIME has been used to interpret the predictions of machine learning models that recognize images. For example, in a study conducted by Selvaraju et al., LIME was used to interpret the predictions of a deep learning model that recognized objects in images. The LIME explanations provided by the study helped users understand which parts of the image were important for the prediction and identify areas of improvement for the model.</p>
</li>
</ol>
<h2>Benefits and limitations of LIME</h2>
<p>LIME provides several <strong>benefits</strong> to data scientists and machine learning practitioners.</p>
<p>First, LIME <strong>can help increase the trust of users in machine learning models by providing interpretable explanations of the models' predictions</strong>. This can be especially useful in high-stakes domains, such as healthcare and finance, where decisions based on machine learning predictions can have significant consequences.</p>
<p>Second, LIME <strong>can help users identify areas of improvement for machine learning models</strong>. By providing explanations of the models' predictions, LIME can help users identify which features were important for the prediction and which ones were not. This can help users refine their feature engineering process and improve the performance of their models.</p>
<p>However, LIME also has some <strong>limitations</strong> that data scientists and machine learning practitioners should be aware of.</p>
<p>First, LIME provides local explanations, which means that <strong>the explanations are only valid for the selected data point of interest</strong>. Therefore, the explanations generated by LIME may not generalize to other data points.</p>
<p>Second, LIME <strong>requires a significant amount of computational resources</strong> to generate the perturbed instances and train the interpretable model. This can be a limitation when working with large datasets or computationally expensive models.</p>
<h2>Conclusion</h2>
<p>LIME is a useful technique for interpreting the predictions of machine learning models. LIME can help increase the trust of users in machine learning models and identify areas of improvement for the models. LIME has been successfully applied in various domains, including healthcare, finance, and image recognition. However, LIME also has some limitations, such as providing local explanations and requiring significant computational resources. Therefore, data scientists and machine learning practitioners should carefully consider the use of LIME and its limitations when interpreting the predictions of their models.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>Intrinsic vs. Extrinsic Evaluation - What's the Best Way to Measure Embedding Quality?2023-04-18T00:00:00+02:002023-04-18T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-18:/measure-quality-of-embeddings-intrinsic-vs-extrinsic/<p>Learn how to measure the quality of word and sentence embeddings in natural language processing (NLP), including intrinsic and extrinsic evaluation, and their strengths and limitations.</p><p>X::<a href="https://www.safjan.com/demystifying-perplexity-assessing-dimensionality-reduction-with-pca/">Demystifying Perplexity - Assessing Dimensionality Reduction With PCA</a></p>
<h2>Introduction</h2>
<p>Let's start with the concept of embedding vectors. In natural language processing (NLP), an embedding vector is a mathematical representation of words or phrases. It's a way to convert text data into numerical values that can be processed by machine learning algorithms. Word embeddings and sentence embeddings are widely used in natural language processing (NLP) for a variety of tasks, such as text classification, named entity recognition, machine translation, and sentiment analysis. However, it is not always straightforward to evaluate the quality of embeddings, and different evaluation metrics may be appropriate for different use cases. In this blog post, we will explore several ways to measure the quality of embeddings, including intrinsic and extrinsic evaluation, and discuss their strengths and limitations.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#intrinsic-evaluation">Intrinsic Evaluation</a><ul>
<li><a href="#cosine-similarity">Cosine Similarity</a></li>
<li><a href="#spearman-correlation">Spearman Correlation</a></li>
<li><a href="#accuracy">Accuracy</a></li>
</ul>
</li>
<li><a href="#extrinsic-evaluation">Extrinsic Evaluation</a><ul>
<li><a href="#f1-score">F1 Score</a></li>
<li><a href="#perplexity">Perplexity</a></li>
</ul>
</li>
<li><a href="#limitations">Limitations</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="intrinsic-evaluation"></a></p>
<h2>Intrinsic Evaluation</h2>
<blockquote>
<p><strong>Intrinsic evaluation</strong> - aims to measure the quality of embeddings by assessing their performance on specific NLP tasks that are related to the embedding space itself, such as word similarity, analogy, and classification.</p>
</blockquote>
<p>In this section, we will discuss three commonly used intrinsic evaluation metrics: cosine similarity, Spearman correlation, and accuracy.</p>
<p><a id="cosine-similarity"></a></p>
<h3>Cosine Similarity</h3>
<p>Cosine similarity measures the similarity between two vectors by computing the cosine of the angle between them. In the context of embeddings, cosine similarity is often used to measure the similarity between two words, or between a word and its context. The formula for cosine similarity is as follows:</p>
<div class="math">$$
cosine\_similarity(\textbf{v}_1, \textbf{v}_2) = \frac{\textbf{v}_1 \cdot \textbf{v}_2}{\|\textbf{v}_1\|\|\textbf{v}_2\|}
$$</div>
<p>where <span class="math">\(\textbf{v}_1\)</span> and <span class="math">\(\textbf{v}_2\)</span> are the embeddings of two words, and <span class="math">\(\|\cdot\|\)</span> denotes the Euclidean norm.</p>
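<p>To make this concrete, here is a minimal, dependency-free implementation of the formula (the 3-dimensional toy vectors are invented for illustration; real embeddings come from a trained model and typically have hundreds of dimensions):</p>

```python
import math

def cosine_similarity(v1, v2):
    # Dot product divided by the product of the Euclidean norms
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Toy embeddings, invented for illustration
king = [0.8, 0.65, 0.1]
queen = [0.75, 0.7, 0.15]
car = [0.1, 0.2, 0.9]

print(cosine_similarity(king, queen))  # close to 1: similar words
print(cosine_similarity(king, car))    # much lower: unrelated words
```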
<p><a id="spearman-correlation"></a></p>
<h3>Spearman Correlation</h3>
<p>Spearman correlation measures the monotonic relationship between two variables, which can be the similarity scores of two sets of words or phrases computed by humans and by embeddings. A high Spearman correlation indicates that the embeddings are able to capture the semantic relationships between words that humans perceive. The formula for Spearman correlation is as follows:</p>
<div class="math">$$
\text{Spearman's correlation} = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}
$$</div>
<p>where <span class="math">\(d_i\)</span> is the difference between the ranks of the <span class="math">\(i\)</span>-th pair of similarity scores, and <span class="math">\(n\)</span> is the number of pairs.</p>
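<p>For the tie-free case, this rank-difference formula can be sketched directly in Python (the similarity scores below are invented for illustration):</p>

```python
def spearman(x, y):
    """Spearman correlation via the rank-difference formula (assumes no ties)."""
    def ranks(values):
        # Rank 1 = smallest value
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Human similarity judgments vs. model cosine similarities (toy numbers)
human_scores = [9.1, 7.5, 4.2, 1.3]
model_scores = [0.92, 0.80, 0.55, 0.30]
print(spearman(human_scores, model_scores))  # 1.0: identical rankings
```

<p>In practice, <code>scipy.stats.spearmanr</code> is the usual choice, since it also handles ties.</p>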
<p><a id="accuracy"></a></p>
<h3>Accuracy</h3>
<p>Accuracy measures the performance of embeddings on classification tasks, such as sentiment analysis or topic classification. Given a dataset of labeled examples, the embeddings are used to represent each example, and a classifier is trained on these representations. The accuracy of the classifier on a held-out test set is then used as a measure of the quality of the embeddings.</p>
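<p>A minimal sketch of this pipeline, using a nearest-centroid classifier over toy 2-dimensional "embeddings" (all numbers are invented for illustration; a real setup would use pretrained embeddings and a stronger classifier):</p>

```python
import math

# Toy 2-D "embeddings" with sentiment labels (invented for illustration;
# real embeddings come from a pretrained model and have many more dimensions)
train = [([0.9, 0.1], 1), ([0.8, 0.2], 1), ([0.1, 0.9], 0), ([0.2, 0.8], 0)]
test = [([0.85, 0.15], 1), ([0.15, 0.85], 0)]

# Nearest-centroid classifier: represent each class by its mean embedding
def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

centroids = {label: centroid([x for x, y in train if y == label])
             for label in (0, 1)}

def predict(x):
    return min(centroids, key=lambda c: math.dist(x, centroids[c]))

# Accuracy on the held-out test set measures embedding quality
accuracy = sum(predict(x) == y for x, y in test) / len(test)
print(accuracy)
```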
<p><a id="extrinsic-evaluation"></a></p>
<h2>Extrinsic Evaluation</h2>
<blockquote>
<p><strong>Extrinsic evaluation</strong> - aims to measure the quality of embeddings by assessing their performance on downstream NLP tasks, such as machine translation or text classification, that are not directly related to the embedding space itself.</p>
</blockquote>
<p>In this section, we will discuss two commonly used extrinsic evaluation metrics: F1 score and perplexity.
<a id="f1-score"></a></p>
<h3>F1 Score</h3>
<p>F1 score is commonly used for classification tasks such as sentiment analysis or named entity recognition, especially when the classes are imbalanced. It combines precision and recall into a single score that ranges from 0 to 1. A high F1 score indicates that the embeddings are able to capture the relevant features of the input data. The formula for F1 score is as follows:</p>
<div class="math">$$
F1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$</div>
<p>where precision is the fraction of true positives among the predicted positives, and recall is the fraction of true positives among the actual positives.</p>
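<p>The formula translates directly into code from counts of true positives, false positives, and false negatives (the labels below are invented for illustration):</p>

```python
def f1_score(y_true, y_pred):
    # Count true positives, false positives, and false negatives
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
print(f1_score(y_true, y_pred))  # 2/3: precision = recall = 2/3
```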
<p><a id="perplexity"></a></p>
<h3>Perplexity</h3>
<p>Perplexity is a metric commonly used in language modeling tasks, such as machine translation or text generation. It measures how well a language model can predict a held-out test set of text, given the embeddings as input. A low perplexity indicates that the embeddings are able to capture the semantic and syntactic structures of the language. The formula for perplexity is as follows:</p>
<div class="math">$$
\text{perplexity} = 2^{-\frac{1}{N}\sum_{i=1}^{N} \log_2 p(w_i | \textbf{e}_i)}
$$</div>
<p>where <span class="math">\(N\)</span> is the number of tokens in the test set, <span class="math">\(\textbf{e}_i\)</span> is the embedding of the <span class="math">\(i\)</span>-th token, and <span class="math">\(p(w_i | \textbf{e}_i)\)</span> is the conditional probability of the <span class="math">\(i\)</span>-th token given its embedding.</p>
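<p>Given the per-token probabilities produced by a language model, perplexity can be computed directly (the probabilities below are invented for illustration):</p>

```python
import math

def perplexity(token_probs, base=2):
    # Exponentiated average negative log-probability per token
    n = len(token_probs)
    avg_log_prob = sum(math.log(p, base) for p in token_probs) / n
    return base ** (-avg_log_prob)

# A model that assigns probability 0.25 to every token is as "confused"
# as a uniform choice among 4 options:
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```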
<p><a id="limitations"></a></p>
<h2>Limitations</h2>
<p>While intrinsic and extrinsic evaluation metrics can provide useful insights into the quality of embeddings, they also have some limitations. Intrinsic evaluation may not always reflect the performance of embeddings on downstream tasks, and extrinsic evaluation may not always isolate the contribution of embeddings from other factors, such as the choice of model architecture or the quality of the training data. Moreover, the choice of evaluation metrics may depend on the specific use case and the available resources, and there is no one-size-fits-all solution.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>Convert HEIC and HEIF to Jpg, Png, BMP With Python2023-04-14T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-14:/convert-heic-and-heif-to-jpg-png-bmp-with-python/<p>HEIF and HEIC image formats are gaining popularity due to their superior image quality and smaller file sizes compared to traditional formats like JPEG and PNG. However, they are not yet widely supported by all devices and software applications. In this blog …</p><p>HEIF and HEIC image formats are gaining popularity due to their superior image quality and smaller file sizes compared to traditional formats like JPEG and PNG. However, they are not yet widely supported by all devices and software applications. In this blog post, we will explore how to convert HEIF and HEIC files to JPEG and other popular image formats using Python.</p>
<!-- MarkdownTOC levels="2,3,4" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#tutorial">Tutorial</a></li>
<li><a href="#use-pillow">Use Pillow</a><ul>
<li><a href="#step-1-installing-required-libraries">Step 1: Installing Required Libraries</a></li>
<li><a href="#step-2-converting-heif-and-heic-files-to-jpeg">Step 2: Converting HEIF and HEIC Files to JPEG</a></li>
<li><a href="#step-3-converting-heif-and-heic-files-to-other-formats">Step 3: Converting HEIF and HEIC Files to Other Formats</a></li>
<li><a href="#step-4-converting-heif-and-heic-files-in-bulk-to-jpeg">Step 4: Converting HEIF and HEIC Files in Bulk to JPEG</a></li>
</ul>
</li>
<li><a href="#use-pyheif-library">Use pyheif library</a></li>
<li><a href="#summary">Summary</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="tutorial"></a></p>
<h2>Tutorial</h2>
<p>Python provides several libraries for working with images, including Pillow, OpenCV, and scikit-image. For this tutorial, we will be using the Pillow library, which is a fork of the Python Imaging Library (PIL) and provides a simple and easy-to-use API for image processing.</p>
<p><a id="use-pillow"></a></p>
<h3>Use Pillow</h3>
<p><a id="step-1-installing-required-libraries"></a></p>
<h4>Step 1: Installing Required Libraries</h4>
<p>Before we can begin converting HEIF and HEIC files, we need to make sure we have the necessary libraries installed. Note that Pillow alone cannot decode HEIF/HEIC; the <code>pillow-heif</code> plugin adds that support. To install both, open a terminal or command prompt and run the following command:</p>
<div class="highlight"><pre><span></span><code>pip install Pillow pillow-heif
</code></pre></div>
<p><a id="step-2-converting-heif-and-heic-files-to-jpeg"></a></p>
<h4>Step 2: Converting HEIF and HEIC Files to JPEG</h4>
<p>To convert HEIF and HEIC files to JPEG, we can use the Pillow library's <code>Image</code> module. The <code>Image</code> module provides several methods for opening and saving images in different formats, including <code>JPEG</code>, <code>PNG</code>, and <code>BMP</code>.</p>
<p>Here is a Python code example that shows how to convert a single HEIF or HEIC file to JPEG:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span>
<span class="kn">from</span> <span class="nn">pillow_heif</span> <span class="kn">import</span> <span class="n">register_heif_opener</span>
<span class="c1"># Register the HEIF/HEIC decoder with Pillow</span>
<span class="n">register_heif_opener</span><span class="p">()</span>
<span class="c1"># Open HEIF or HEIC file</span>
<span class="n">image</span> <span class="o">=</span> <span class="n">Image</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="s1">'example.heic'</span><span class="p">)</span>
<span class="c1"># Convert to JPEG</span>
<span class="n">image</span><span class="o">.</span><span class="n">convert</span><span class="p">(</span><span class="s1">'RGB'</span><span class="p">)</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s1">'example.jpg'</span><span class="p">)</span>
</code></pre></div>
<p>In the code above, we import the <code>Image</code> module from Pillow and register the HEIF/HEIC decoder provided by <code>pillow-heif</code>. We then use the <code>open()</code> method to open the HEIF or HEIC file, the <code>convert()</code> method to convert the image to the RGB color space (which is required for saving in JPEG format), and finally the <code>save()</code> method to save the converted image as a JPEG file.</p>
<p>Note that converting HEIF and HEIC files to JPEG requires moving to the RGB color space, which can lose some of the advanced features of HEIF and HEIC, such as support for high dynamic range (HDR) and wide color gamut (WCG).</p>
<p>If you want to convert multiple HEIF or HEIC files to JPEG, you can use a for loop to iterate over a list of file names:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span>
<span class="kn">from</span> <span class="nn">pillow_heif</span> <span class="kn">import</span> <span class="n">register_heif_opener</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="c1"># Register the HEIF/HEIC decoder with Pillow</span>
<span class="n">register_heif_opener</span><span class="p">()</span>
<span class="c1"># Get list of HEIF and HEIC files in directory</span>
<span class="n">directory</span> <span class="o">=</span> <span class="s1">'/path/to/directory'</span>
<span class="n">files</span> <span class="o">=</span> <span class="p">[</span><span class="n">f</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">os</span><span class="o">.</span><span class="n">listdir</span><span class="p">(</span><span class="n">directory</span><span class="p">)</span> <span class="k">if</span> <span class="n">f</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s1">'.heic'</span><span class="p">)</span> <span class="ow">or</span> <span class="n">f</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s1">'.heif'</span><span class="p">)]</span>
<span class="c1"># Convert each file to JPEG</span>
<span class="k">for</span> <span class="n">filename</span> <span class="ow">in</span> <span class="n">files</span><span class="p">:</span>
<span class="n">image</span> <span class="o">=</span> <span class="n">Image</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">directory</span><span class="p">,</span> <span class="n">filename</span><span class="p">))</span>
<span class="n">image</span><span class="o">.</span><span class="n">convert</span><span class="p">(</span><span class="s1">'RGB'</span><span class="p">)</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">directory</span><span class="p">,</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">splitext</span><span class="p">(</span><span class="n">filename</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="s1">'.jpg'</span><span class="p">))</span>
</code></pre></div>
<p>In the code above, we use the <code>os</code> library to get a list of HEIF and HEIC files in a directory. We then use a for loop to iterate over the list of file names, open each file using the <code>Image</code> module, convert it to RGB color space, and save it as a JPEG file with the same name as the original file.</p>
<p><a id="step-3-converting-heif-and-heic-files-to-other-formats"></a></p>
<h4>Step 3: Converting HEIF and HEIC Files to Other Formats</h4>
<p>In addition to converting HEIF and HEIC files to JPEG, we can also convert them to other popular formats like PNG and BMP using the Pillow library. Here is an example that shows how to convert a HEIF or HEIC file to PNG:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span>
<span class="c1"># Open HEIF or HEIC file</span>
<span class="n">image</span> <span class="o">=</span> <span class="n">Image</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="s1">'example.heic'</span><span class="p">)</span>
<span class="c1"># Convert to PNG</span>
<span class="n">image</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s1">'example.png'</span><span class="p">)</span>
</code></pre></div>
<p>In the code above, we use the <code>save()</code> method to save the opened HEIF or HEIC file as a PNG file; Pillow infers the output format from the <code>.png</code> extension. Similarly, we can convert HEIF and HEIC files to BMP by using a <code>.bmp</code> extension (or by passing <code>format='BMP'</code> to <code>save()</code>):</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span>
<span class="c1"># Open HEIF or HEIC file</span>
<span class="n">image</span> <span class="o">=</span> <span class="n">Image</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="s1">'example.heic'</span><span class="p">)</span>
<span class="c1"># Convert to BMP </span>
<span class="n">image</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s1">'example.bmp'</span><span class="p">)</span>
</code></pre></div>
<p><a id="step-4-converting-heif-and-heic-files-in-bulk-to-jpeg"></a></p>
<h4>Step 4: Converting HEIF and HEIC Files in Bulk to JPEG</h4>
<p>If you have a large number of HEIF and HEIC files that you need to convert to JPEG, you can use the following Python script:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span>
<span class="kn">from</span> <span class="nn">pillow_heif</span> <span class="kn">import</span> <span class="n">register_heif_opener</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="c1"># Register the HEIF/HEIC decoder with Pillow</span>
<span class="n">register_heif_opener</span><span class="p">()</span>
<span class="c1"># Get list of HEIF and HEIC files in directory</span>
<span class="n">directory</span> <span class="o">=</span> <span class="s1">'/path/to/directory'</span>
<span class="n">files</span> <span class="o">=</span> <span class="p">[</span><span class="n">f</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">os</span><span class="o">.</span><span class="n">listdir</span><span class="p">(</span><span class="n">directory</span><span class="p">)</span> <span class="k">if</span> <span class="n">f</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s1">'.heic'</span><span class="p">)</span> <span class="ow">or</span> <span class="n">f</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s1">'.heif'</span><span class="p">)]</span>
<span class="c1"># Create output directory if it does not exist</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">directory</span><span class="p">,</span> <span class="s1">'output'</span><span class="p">)):</span>
<span class="n">os</span><span class="o">.</span><span class="n">makedirs</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">directory</span><span class="p">,</span> <span class="s1">'output'</span><span class="p">))</span>
<span class="c1"># Convert each file to JPEG</span>
<span class="k">for</span> <span class="n">filename</span> <span class="ow">in</span> <span class="n">files</span><span class="p">:</span>
<span class="n">image</span> <span class="o">=</span> <span class="n">Image</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">directory</span><span class="p">,</span> <span class="n">filename</span><span class="p">))</span>
<span class="n">image</span><span class="o">.</span><span class="n">convert</span><span class="p">(</span><span class="s1">'RGB'</span><span class="p">)</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">directory</span><span class="p">,</span> <span class="s1">'output'</span><span class="p">,</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">splitext</span><span class="p">(</span><span class="n">filename</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="s1">'.jpg'</span><span class="p">))</span>
</code></pre></div>
<p>In the code above, we use the <code>os</code> library to get a list of HEIF and HEIC files in a directory. We then create an output directory if it does not already exist. Finally, we use a for loop to iterate over the list of file names, open each file using the <code>Image</code> module, convert it to RGB color space, and save it as a JPEG file in the output directory with the same name as the original file.</p>
<p><a id="use-pyheif-library"></a></p>
<h3>Use pyheif library</h3>
<p>Here is an example of how to use the <code>pyheif</code> library to convert HEIF and HEIC files to JPEG:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pyheif</span>
<span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span>
<span class="c1"># Open HEIF or HEIC file</span>
<span class="n">heif_file</span> <span class="o">=</span> <span class="n">pyheif</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="s2">"example.heic"</span><span class="p">)</span>
<span class="c1"># Extract the image data</span>
<span class="n">image</span> <span class="o">=</span> <span class="n">Image</span><span class="o">.</span><span class="n">frombytes</span><span class="p">(</span><span class="n">heif_file</span><span class="o">.</span><span class="n">mode</span><span class="p">,</span> <span class="n">heif_file</span><span class="o">.</span><span class="n">size</span><span class="p">,</span> <span class="n">heif_file</span><span class="o">.</span><span class="n">data</span><span class="p">)</span>
<span class="c1"># Save as JPEG</span>
<span class="n">image</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s1">'example.jpg'</span><span class="p">)</span>
</code></pre></div>
<p>In the code above, we use the <code>pyheif</code> library to read in the HEIF or HEIC file, then use the <code>frombytes()</code> method of the <code>PIL.Image</code> module to create a PIL image object from the extracted image data. Finally, we use the <code>save()</code> method to save the image as a JPEG file.</p>
<p>To convert multiple HEIF and HEIC files in bulk using <code>pyheif</code>, you can use the following code:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">pyheif</span>
<span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="c1"># Get list of HEIF and HEIC files in directory</span>
<span class="n">directory</span> <span class="o">=</span> <span class="s1">'/path/to/directory'</span>
<span class="n">files</span> <span class="o">=</span> <span class="p">[</span><span class="n">f</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">os</span><span class="o">.</span><span class="n">listdir</span><span class="p">(</span><span class="n">directory</span><span class="p">)</span> <span class="k">if</span> <span class="n">f</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s1">'.heic'</span><span class="p">)</span> <span class="ow">or</span> <span class="n">f</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s1">'.heif'</span><span class="p">)]</span>
<span class="c1"># Create output directory if it does not exist</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">exists</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">directory</span><span class="p">,</span> <span class="s1">'output'</span><span class="p">)):</span>
<span class="n">os</span><span class="o">.</span><span class="n">makedirs</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">directory</span><span class="p">,</span> <span class="s1">'output'</span><span class="p">))</span>
<span class="c1"># Convert each file to JPEG</span>
<span class="k">for</span> <span class="n">filename</span> <span class="ow">in</span> <span class="n">files</span><span class="p">:</span>
<span class="n">heif_file</span> <span class="o">=</span> <span class="n">pyheif</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">directory</span><span class="p">,</span> <span class="n">filename</span><span class="p">))</span>
<span class="n">image</span> <span class="o">=</span> <span class="n">Image</span><span class="o">.</span><span class="n">frombytes</span><span class="p">(</span><span class="n">heif_file</span><span class="o">.</span><span class="n">mode</span><span class="p">,</span> <span class="n">heif_file</span><span class="o">.</span><span class="n">size</span><span class="p">,</span> <span class="n">heif_file</span><span class="o">.</span><span class="n">data</span><span class="p">)</span>
<span class="n">image</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">directory</span><span class="p">,</span> <span class="s1">'output'</span><span class="p">,</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">splitext</span><span class="p">(</span><span class="n">filename</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="s1">'.jpg'</span><span class="p">))</span>
</code></pre></div>
<p>In this code, we use the same approach to get a list of HEIF and HEIC files in a directory and create an output directory if it does not already exist. Then, we use a for loop to iterate over the list of file names, read in each HEIF or HEIC file using <code>pyheif</code>, create a PIL image object from the extracted image data, and save it as a JPEG file in the output directory with the same name as the original file.</p>
<p>Using the <code>pyheif</code> library to convert HEIF and HEIC files to JPEG is a simple and effective way to handle image file format conversions in Python.</p>
<p><a id="summary"></a></p>
<h2>Summary</h2>
<p>In this blog post, we explored how to convert HEIF and HEIC files to JPEG and other popular image formats using Python and the Pillow and pyheif libraries. We covered how to convert a single file as well as multiple files in bulk. With this knowledge, you can easily convert HEIF and HEIC files to more widely supported formats, enabling you to use them on any device or software application that supports images.</p>
<p>X::<a href="https://www.safjan.com/heif-and-heic-format-for-images-and-video/">Smaller Files, Better Quality - The Advantages of HEIF and HEIC</a></p>Explaining AI - The Key Differences Between LIME and SHAP Methods2023-04-14T00:00:00+02:002023-04-14T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-14:/explaining-ai-the-key-differences-between-lime-and-shap-methods/<p>When it comes to explainable AI, LIME and SHAP are two popular methods for providing insights into the decisions made by machine learning models. What are the key differences between these methods? In this article, we will help you understand which method may be best for your specific use case.</p><p>LIME and SHAP are both popular methods for explainable AI (XAI), but they differ in several key ways.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#model-agnostic-vs-model-specific">Model-agnostic vs model-specific</a></li>
<li><a href="#local-vs-global-explanations">Local vs global explanations</a></li>
<li><a href="#kernel-based-vs-game-theoretic-approach">Kernel-based vs game-theoretic approach</a></li>
<li><a href="#interpretability-vs-accuracy-trade-off">Interpretability vs accuracy trade-off</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="model-agnostic-vs-model-specific"></a></p>
<h2>Model-agnostic vs model-specific</h2>
<p>One of the main differences between LIME and SHAP is that LIME is model-agnostic by design: it can be used to explain the decisions of any machine learning model, regardless of the algorithm used. SHAP does include a model-agnostic Kernel explainer, but in practice it is most often used through its model-specific explainers, in particular TreeSHAP, which is optimized for tree-based models such as decision trees, random forests, and gradient boosting machines.</p>
<p><a id="local-vs-global-explanations"></a></p>
<h2>Local vs global explanations</h2>
<p>Another key difference between LIME and SHAP is the type of explanation they emphasize. LIME generates local explanations, meaning it explains the decision of a complex model for a specific instance or observation. SHAP also attributes each individual prediction to its features, but because Shapley values are additive, they can be aggregated across instances, so SHAP is commonly used to explain the overall, global behavior of the model as well.</p>
<p><a id="kernel-based-vs-game-theoretic-approach"></a></p>
<h2>Kernel-based vs game-theoretic approach</h2>
<p>LIME uses a kernel-based approach to explain the decisions of a complex model. It creates a local, interpretable model that approximates the behavior of the complex model around a specific instance. In contrast, SHAP uses a game-theoretic approach to explain the contribution of each feature to the final prediction. It assigns a "credit" score to each feature based on how much it contributes to the prediction.</p>
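<p>The game-theoretic idea behind SHAP can be illustrated with an exact Shapley-value computation on a toy two-feature model (everything below is invented for illustration; real SHAP implementations approximate or shortcut this computation far more efficiently):</p>

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values: each feature's average marginal contribution
    to the payoff, weighted over all coalitions of the other features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi

# Toy additive "model": prediction = 2*x1 + 1*x2, evaluated at x1 = x2 = 1,
# so the feature credits should come out as 2.0 and 1.0
def value(coalition):
    return 2.0 * ("x1" in coalition) + 1.0 * ("x2" in coalition)

print(shapley_values(["x1", "x2"], value))  # {'x1': 2.0, 'x2': 1.0}
```

<p>Libraries such as <code>shap</code> compute these attributions efficiently; TreeSHAP, for example, exploits the tree structure instead of enumerating coalitions.</p>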
<p><a id="interpretability-vs-accuracy-trade-off"></a></p>
<h2>Interpretability vs accuracy trade-off</h2>
<p>Finally, LIME and SHAP differ in their approach to the interpretability vs accuracy trade-off. LIME sacrifices some accuracy in order to provide more interpretable explanations. It creates a simpler model that may not be as accurate as the complex model, but is easier to understand. In contrast, SHAP aims to provide accurate explanations without sacrificing model accuracy. It uses a more sophisticated approach to explain the contribution of each feature, but this can be more difficult to understand for non-experts.</p>
<p><a id="conclusion"></a></p>
<h2>Conclusion</h2>
<p>LIME and SHAP are both useful methods for XAI, but they differ in their approach to explaining the decisions of complex machine learning models. LIME is model-agnostic and provides local, interpretable explanations, while SHAP uses a game-theoretic approach whose local explanations can also be aggregated into global ones, with efficient model-specific variants such as TreeSHAP for tree-based models. The choice between LIME and SHAP depends on the specific needs of the user and the characteristics of the machine learning model being explained.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>Smaller Files, Better Quality - The Advantages of HEIF and HEIC2023-04-14T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-14:/heif-and-heic-format-for-images-and-video/<h2>Overview</h2>
<p>High Efficiency Image Format (HEIF) and High Efficiency Video Coding (HEVC) Image Format (HEIC) are the two latest image file formats introduced by the Moving Picture Experts Group (MPEG). These formats are designed to improve image quality while reducing file size, which is particularly important for mobile devices with limited storage capacity.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#technical-details-of-heif-and-heic">Technical Details of HEIF and HEIC</a></li>
<li><a href="#advantages-of-heif-and-heic">Advantages of HEIF and HEIC</a><ul>
<li><a href="#smaller-file-sizes">Smaller File Sizes</a></li>
<li><a href="#better-image-quality">Better Image Quality</a></li>
<li><a href="#support-for-advanced-features">Support for Advanced Features</a></li>
<li><a href="#compatibility">Compatibility</a></li>
<li><a href="#future-proofing">Future-Proofing</a></li>
</ul>
</li>
</ul>
<!-- /MarkdownTOC -->
<p>HEIF is a container format that can store a variety of image data, including single images, image sequences, and image collections. HEIC, on the other hand, is a specific implementation of HEIF that is used for still images.</p>
<p>HEIF was standardized in 2015 as part of the MPEG-H suite, which also includes the HEVC video coding standard. HEVC is a video compression standard that is designed to provide higher quality video at lower bit rates than previous standards such as H.264. HEIF and HEIC take advantage of the HEVC coding algorithms to provide better image quality at lower file sizes.</p>
<p>One of the key advantages of HEIF and HEIC is their support for advanced image features such as high dynamic range (HDR) and wide color gamut (WCG). HDR images have a greater range of brightness and color than standard images, which can make them look more lifelike. WCG images have a wider range of colors than standard images, which can make them look more vibrant and vivid.</p>
<p>HEIF and HEIC also support multiple images and image sequences in a single file, which can make it easier to manage and share collections of images. This is particularly useful for applications such as live photos, which combine still images and short videos into a single file.</p>
<p><a id="technical-details-of-heif-and-heic"></a></p>
<h2>Technical Details of HEIF and HEIC</h2>
<p>HEIF and HEIC use a container format that is based on the ISO Base Media File Format (ISOBMFF). This format is similar to other container formats such as MP4 and MOV, and it provides a flexible and extensible framework for storing media data.</p>
<p>HEIF and HEIC use a compression algorithm called High Efficiency Video Coding (HEVC), which is also known as H.265. HEVC is a video compression standard that was developed by the Joint Collaborative Team on Video Coding (JCT-VC) and published as the ITU-T H.265 standard. HEVC is designed to provide better compression than previous standards such as H.264, which can lead to smaller file sizes and better image quality.</p>
<p>HEVC achieves better compression by using advanced techniques such as intra prediction, inter prediction, and entropy coding. Intra prediction is used to predict pixels within a single image frame, while inter prediction is used to predict pixels between different frames in a video sequence. Entropy coding is used to further compress the data by removing redundancy and optimizing the data for compression.</p>
<p>HEIF and HEIC also support a variety of image features such as alpha channels, depth maps, and image sequences. Alpha channels are used to store transparency information for images, while depth maps are used to store information about the distance of objects in a scene. Image sequences are used to store multiple images in a single file, which can be useful for applications such as burst mode photography and time-lapse photography.</p>
<p>HEIF and HEIC also support a variety of metadata formats, including Exif, IPTC, and XMP. Exif is a standard format for storing metadata such as camera settings and location information, while IPTC is a standard format for storing news and media metadata. XMP is a metadata format that is used by Adobe products such as Photoshop and Lightroom.</p>
<p><a id="advantages-of-heif-and-heic"></a></p>
<h2>Advantages of HEIF and HEIC</h2>
<p>HEIF and HEIC offer a number of advantages over previous image formats such as JPEG and PNG. Some of the key advantages include:</p>
<p><a id="smaller-file-sizes"></a></p>
<h3>Smaller File Sizes</h3>
<p>HEIF and HEIC can achieve smaller file sizes than previous formats, which can reduce storage requirements and improve download times. This is particularly important for mobile devices, which often have limited storage capacity.</p>
<p><a id="better-image-quality"></a></p>
<h3>Better Image Quality</h3>
<p>HEIF and HEIC can provide better image quality than previous formats, particularly for images with high dynamic range or wide color gamut. This can result in more realistic and vibrant images.</p>
<p><a id="support-for-advanced-features"></a></p>
<h3>Support for Advanced Features</h3>
<p>HEIF and HEIC support advanced features such as alpha channels, depth maps, and image sequences, which can provide greater flexibility and creativity for image processing and manipulation.</p>
<p><a id="compatibility"></a></p>
<h3>Compatibility</h3>
<p>Although HEIF and HEIC are relatively new formats, they are now widely supported by modern operating systems and devices. For example, iOS devices have supported HEIC since iOS 11, macOS has supported it since High Sierra, and Windows 10 and later can read HEIF and HEIC through the HEIF Image Extensions.</p>
<p><a id="future-proofing"></a></p>
<h3>Future-Proofing</h3>
<p>HEIF and HEIC are designed to be flexible and extensible, which means they can support future enhancements and improvements to image processing and storage. This can help ensure that images stored in HEIF and HEIC formats remain compatible and accessible in the future.</p>
<h2>Conclusion</h2>
<p>HEIF and HEIC are the latest image file formats designed to provide better image quality and smaller file sizes than previous formats. They are based on the HEVC video compression standard and use advanced techniques such as intra prediction, inter prediction, and entropy coding to achieve better compression and image quality. HEIF and HEIC also support advanced image features such as high dynamic range and wide color gamut, as well as metadata formats such as Exif, IPTC, and XMP. Although HEIF and HEIC are relatively new formats, they are now widely supported by modern operating systems and devices, and offer a number of advantages over previous formats.</p>
<p>X::<a href="https://www.safjan.com/convert-heic-and-heif-to-jpg-png-bmp-with-python/">Convert HEIC and HEIF to Jpg, Png, BMP With Python</a></p>LIME - Understanding How This Method for Explainable AI Works2023-04-14T00:00:00+02:002023-04-14T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-14:/how-the-lime-method-for-explainable-ai-works/<p>Discover how the LIME method can help you understand the important factors behind your model's predictions in a simple, intuitive way.</p><p>Artificial intelligence (AI) has revolutionized the way we live and work, but it can sometimes be difficult to understand how AI algorithms reach their decisions. This is where explainable AI (XAI) comes in. XAI is the process of making AI models transparent and understandable to humans. One popular XAI method is Local Interpretable Model-Agnostic Explanations (LIME). In this blog post, we will explore how LIME works and why it is an important tool for XAI.</p>
<h2>The need for Explainable AI</h2>
<p>One of the main criticisms of AI is its "black box" nature. Many AI models, such as deep neural networks, are complex and difficult to interpret. When these models are used in high-stakes applications like healthcare or finance, it is important to understand how the AI arrived at its decision. This is where XAI comes in. XAI provides a framework for understanding how an AI model makes decisions, increasing trust and accountability.</p>
<h2>LIME: A Local Interpretable Model-Agnostic Explanation Method</h2>
<p>LIME is a popular XAI method that provides local, interpretable explanations for individual predictions made by any machine learning model. LIME was introduced in 2016 in the paper <a href="https://arxiv.org/abs/1602.04938">"Why Should I Trust You?": Explaining the Predictions of Any Classifier</a> by Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin, and has since become a widely used method for XAI.</p>
<p>LIME works by creating a simpler, interpretable model that approximates the behavior of the complex model. The simpler model is trained on local data points, and the resulting model is used to explain the decision of the complex model. The process involves the following steps:</p>
<ol>
<li>Selecting an instance to explain</li>
<li>Perturbing the instance to create a dataset of similar instances</li>
<li>Weighting the similar instances based on their similarity to the instance to explain</li>
<li>Training a local, interpretable model on the weighted dataset</li>
<li>Using the local model to generate explanations for the complex model's decision</li>
</ol>
<p>Let's explore each of these steps in more detail.</p>
<h3>Step 1: Selecting an instance to explain</h3>
<p>The first step in the LIME process is selecting an instance to explain, i.e. a single data point whose prediction we want to understand. For example, if we are working with a healthcare AI model, we may want to explain the decision to recommend a certain treatment for a specific patient.</p>
<h3>Step 2: Perturbing the instance to create a dataset of similar instances</h3>
<p>Once we have selected the instance to explain, we perturb it to create a dataset of similar instances. This involves making small changes to the instance (for example, slightly altering feature values or removing words from a text) and querying the complex model for a prediction on each perturbed instance. The purpose of this step is to create a diverse set of labeled instances that are similar to the instance we want to explain.</p>
<h3>Step 3: Weighting the similar instances based on their similarity to the instance to explain</h3>
<p>After we have created a dataset of similar instances, we need to weight them based on their similarity to the instance we want to explain. This is done using a kernel function, which assigns a weight to each instance based on its distance to the instance to explain. The kernel function can be any function that measures similarity, such as the Gaussian kernel.</p>
<h3>Step 4: Training a local, interpretable model on the weighted dataset</h3>
<p>Now that we have a weighted dataset, we can train a local, interpretable model on it. The purpose of this model is to approximate the behavior of the complex model in the local region around the instance we want to explain. The local model should be simple enough to be easily interpretable, but accurate enough to capture the important features of the complex model.</p>
<p>The choice of local model depends on the problem domain and the complexity of the complex model. Some common choices include linear models, decision trees, and rule-based models.</p>
<h3>Step 5: Using the local model to generate explanations for the complex model's decision</h3>
<p>Once we have trained the local model, we can use it to generate explanations for the complex model's decision. This is done by analyzing the coefficients of the local model and identifying the features that contributed the most to the prediction. These features can be presented to the user as a list of important factors that influenced the decision.</p>
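<p>The five steps above can be sketched from scratch in a few lines of Python. This is a minimal illustration rather than the actual <code>lime</code> package: the black-box function, the perturbation scale, and the kernel width are all invented for the example.</p>

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical stand-in for the complex model: only features 0 and 2 matter.
def black_box(X):
    return 3.0 * X[:, 0] - 2.0 * X[:, 2]

# Step 1: the instance whose prediction we want to explain
x = np.array([1.0, 1.0, 1.0, 1.0])

# Step 2: perturb the instance and query the complex model for labels
Z = x + rng.normal(scale=0.5, size=(500, x.size))
y = black_box(Z)

# Step 3: weight perturbed samples with a Gaussian kernel on distance to x
dist = np.linalg.norm(Z - x, axis=1)
weights = np.exp(-dist**2 / (2 * 0.75**2))

# Step 4: train a local, interpretable (linear) surrogate on the weighted data
surrogate = Ridge(alpha=1.0)
surrogate.fit(Z, y, sample_weight=weights)

# Step 5: the surrogate's coefficients explain the local decision
print(surrogate.coef_)
```

<p>The recovered coefficients are close to the true local effects (roughly 3 for feature 0, -2 for feature 2, near zero elsewhere), which is exactly the kind of explanation LIME presents to the user.</p>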
<h2>Advantages of LIME</h2>
<p>LIME has several advantages over other XAI methods. One of the main advantages is its model-agnostic nature. LIME can be used to explain the decisions of any machine learning model, regardless of its complexity or the algorithm used. This makes it a versatile tool for XAI.</p>
<p>Another advantage of LIME is its ability to generate local explanations. By creating a local model that approximates the behavior of the complex model, LIME is able to generate explanations that are tailored to specific instances. This can be useful in situations where the explanation for a decision needs to be customized for a particular user or context.</p>
<h2>Limitations of LIME</h2>
<p>Despite its many advantages, LIME also has some limitations. One of the main limitations is the need for human input in the kernel function. The choice of kernel function and its parameters can have a significant impact on the explanations generated by LIME. This means that the user needs to have some domain knowledge and expertise in selecting an appropriate kernel function.</p>
<p>Another limitation of LIME is its sensitivity to perturbations. LIME works by perturbing the instance to create a dataset of similar instances. However, small changes to the instance can result in significantly different explanations. This means that the explanations generated by LIME may not always be robust to changes in the input.</p>
<h2>Conclusion</h2>
<p>LIME is a powerful tool for XAI that provides local, interpretable explanations for individual predictions made by any machine learning model. By creating a simpler, interpretable model that approximates the behavior of the complex model, LIME is able to generate explanations that are tailored to specific instances. However, LIME also has some limitations, such as its sensitivity to perturbations and the need for human input in the kernel function. Despite these limitations, LIME remains an important tool for XAI and is widely used in industry and academia.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>SHAP - Understanding How This Method for Explainable AI Works2023-04-14T00:00:00+02:002023-04-14T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-14:/how-the-shap-method-for-explainable-ai-works/<p>Discover how the SHAP method can help you understand the important factors behind your model's predictions in a simple, intuitive way.</p><p>As a data scientist, one of the biggest challenges in deploying machine learning models is explaining how the model makes its decisions. The need for explainability is not only important for legal and ethical reasons, but it also helps in building trust in the model and making informed decisions. The <strong>SHapley Additive exPlanations</strong> (SHAP) method is a powerful technique that provides a unified framework for interpreting any model. In this blog post, I will explain the SHAP method, its mathematical foundation, and how it can be applied to interpret machine learning models.</p>
<h2>What is SHAP?</h2>
<p>The SHAP method is a game-theoretic approach to explain the output of any machine learning model. It is based on the concept of Shapley values. Here is a timeline for the SHAP method:</p>
<ul>
<li>1953: Introduction of Shapley values by Lloyd Shapley for game theory</li>
<li>2010: First use of Shapley values for explaining machine learning predictions by Strumbelj and Kononenko </li>
<li>2017: SHAP paper + Python package by Lundberg</li>
</ul>
<p>Shapley values were introduced by Lloyd Shapley in 1953 to fairly distribute the gains of a cooperative game among its players. In the context of machine learning, the players are the input features, and the gain is the difference between the actual output of the model and the expected output. The SHAP method provides a way to calculate the Shapley values for each input feature, which gives us a measure of the contribution of each feature towards the model output.</p>
<p>The SHapley Additive exPlanations (SHAP) method we are using today was introduced in a paper titled "A Unified Approach to Interpreting Model Predictions" by Scott Lundberg and Su-In Lee, published in the Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS 2017). The paper is available on the arXiv preprint server at <a href="https://arxiv.org/abs/1705.07874">https://arxiv.org/abs/1705.07874</a>.</p>
<h2>How does SHAP work?</h2>
<p>The SHAP method works by computing the Shapley values for each feature in the input space. The Shapley value for feature i, denoted by <span class="math">\(\phi_i\)</span>, is defined as the average contribution of the feature i across all possible coalitions of features. Mathematically, the Shapley value can be expressed as follows:</p>
<div class="math">$$\phi_i(f,S) = \sum_{T \subseteq S \setminus \{i\}}\frac{|T|!(|S|-|T|-1)!}{|S|!}(f(T \cup \{i\}) - f(T))$$</div>
<p>where <span class="math">\(X\)</span> is the set of all input features, <span class="math">\(S\)</span> is a coalition of features that does not include feature <span class="math">\(i\)</span>, <span class="math">\(|S|\)</span> is the size of the coalition, and <span class="math">\(f(S\cup{i})\)</span> is the output of the model when the features in <span class="math">\(S\)</span> and <span class="math">\(i\)</span> are present. The term <span class="math">\(f(S)\)</span> is the output of the model when only the features in <span class="math">\(S\)</span> are present. The Shapley value represents the average marginal contribution of feature <span class="math">\(i\)</span> over all possible coalitions.</p>
<p>To compute the Shapley values using the above formula, we need to evaluate the model output for all possible coalitions of features, which is computationally infeasible for most machine learning models. The SHAP method therefore estimates the Shapley values from a sample of coalitions: KernelSHAP weights each sampled coalition with the Shapley kernel and solves a weighted linear regression, while TreeSHAP exploits the structure of tree-based models to compute the values exactly and efficiently.</p>
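<p>Before turning to approximations, it is instructive to evaluate the formula exactly for a model with only a handful of features. The sketch below (with a toy model and invented feature values) enumerates every coalition and verifies the efficiency property, i.e. that the Shapley values sum to the difference between the model output at the instance and at the baseline.</p>

```python
from itertools import combinations
from math import factorial

def shapley_values(f, n_features, x, baseline):
    # Exact Shapley values by enumerating all coalitions;
    # feasible only when n_features is small (2^n model evaluations).
    N = list(range(n_features))

    def f_masked(S):
        # features in S take their true value, the rest take the baseline value
        z = [x[j] if j in S else baseline[j] for j in N]
        return f(z)

    phi = []
    for i in N:
        others = [j for j in N if j != i]
        total = 0.0
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                weight = (factorial(len(S)) * factorial(n_features - len(S) - 1)
                          / factorial(n_features))
                total += weight * (f_masked(set(S) | {i}) - f_masked(set(S)))
        phi.append(total)
    return phi

# Toy model and values (invented for illustration): f(z) = z0 * z1 + z2
model = lambda z: z[0] * z[1] + z[2]
phi = shapley_values(model, 3, x=[1.0, 2.0, 3.0], baseline=[0.0, 0.0, 0.0])
print(phi)  # the attributions sum to f(x) - f(baseline)
```

<p>Note how features 0 and 1 receive equal credit for their joint interaction term, another fairness property of the Shapley formulation.</p>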
<h2>How to apply SHAP?</h2>
<p>To apply the SHAP method, we need to first compute the Shapley values for each feature in the input space. This can be done using the open-source <code>shap</code> Python package, which integrates with models from popular machine learning libraries such as scikit-learn, XGBoost, and TensorFlow. Once we have the Shapley values, we can visualize them using various techniques to gain insights into the model's decision-making process.</p>
<p>One popular technique to visualize the Shapley values is the Shapley value (beeswarm) plot, which shows the contribution of each feature towards the model output for each individual data point. The plot has a horizontal axis representing the feature contribution and a vertical axis listing the features. Each data point is drawn as a dot whose horizontal position is the Shapley value for the corresponding feature. The color of the dot represents the value of the feature, where red represents high feature values and blue represents low feature values. The plot helps in identifying the most important features for each data point and the direction of the relationship between the features and the output.</p>
<p>Another technique to visualize the Shapley values is the summary plot, which shows the average contribution of each feature across all data points. The plot consists of a horizontal axis representing the Shapley value and a vertical axis representing the features. Each feature is represented by a horizontal bar, where the length of the bar represents the magnitude of the average Shapley value. The color of the bar represents the direction of the relationship between the feature and the output, where red represents a positive relationship and blue represents a negative relationship.</p>
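<p>The data behind such a summary plot is simply the mean absolute Shapley value per feature. A minimal sketch, assuming a small, invented matrix of Shapley values:</p>

```python
import numpy as np

# Invented matrix of Shapley values: rows = instances, columns = features
shap_values = np.array([[ 0.8, -0.1,  0.3],
                        [ 1.2,  0.0, -0.4],
                        [-0.5, -0.2,  0.6]])

# The summary plot ranks features by mean absolute Shapley value
mean_abs = np.abs(shap_values).mean(axis=0)
ranking = np.argsort(mean_abs)[::-1]   # most important feature first
print(mean_abs, ranking)
```

<p>In this example feature 0 dominates, followed by feature 2, which is the ordering a summary plot would display from top to bottom.</p>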
<p>In addition to visualizing the Shapley values, the SHAP method can also be used to identify instances where the model makes biased or unfair decisions. The method can be used to quantify the extent to which each feature contributes to the model's bias towards a certain group or class. This helps in identifying the root cause of the bias and taking corrective measures to ensure fairness and equity in the model's decisions.</p>
<h2>Conclusion</h2>
<p>The SHapley Additive exPlanations (SHAP) method provides a powerful framework for interpreting any machine learning model. The method is based on the concept of Shapley values, which provides a fair way to distribute the gain of a cooperative game among its players. The SHAP method provides an efficient way to compute the Shapley values for each feature in the input space, which gives us a measure of the contribution of each feature towards the model output. The method can be applied to visualize the Shapley values, identify the most important features, and quantify the model's bias towards certain groups or classes. By providing a unified framework for interpretability, the SHAP method helps in building trust in the model and making informed decisions.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>
<ul>
<li>X::<a href="https://www.safjan.com/how-the-lime-method-for-explainable-ai-works/">LIME - Understanding How This Method for Explainable AI Works</a></li>
<li>X::<a href="https://www.safjan.com/lime-tutorial/">LIME Tutorial</a></li>
</ul>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>KernelShap and TreeShap - Two Most Popular Variations of the SHAP Method2023-04-14T00:00:00+02:002023-11-05T00:00:00+01:00Krystian Safjantag:www.safjan.com,2023-04-14:/kernelshap-treeshap-two-most-popular-variations-of-the-shap-method/<p>Making sense of AI's inner workings with KernelShap and TreeShap, powerful tools for responsible AI.</p><h2>TLDR</h2>
<p>The <strong>original SHAP does not scale well</strong> with high-dimensional data due to the exponential complexity of exact Shapley value calculation. KernelSHAP and TreeSHAP are two specific implementations of the SHAP method, developed to address the shortcomings of the original framework and to optimize it for different types of machine learning models, but they achieve this in different ways.
<strong>KernelSHAP</strong> uses a model-agnostic method to interpret the impact of features in a model. This means it can provide explanations <strong>for any model</strong>, but <strong>at the cost of computational efficiency</strong>, making it <strong>less suitable for complex</strong>, high-dimensional situations or when real-time explanations are needed. In contrast, <strong>TreeSHAP</strong> is designed specifically <strong>for tree-based models</strong> (such as decision trees, random forests, and gradient boosting machines). It is <strong>computationally efficient</strong>, exploits the tree structure for <strong>faster calculations</strong>, and can therefore handle <strong>more complex scenarios</strong>. Moreover, TreeSHAP guarantees consistency, a helpful property for feature attribution methods which ensures that if a model relies more on a feature, the attributed importance of that feature does not decrease. However, it cannot be used for non-tree models.</p>
<h2>Introduction</h2>
<p>Responsible AI is an approach to artificial intelligence that ensures fairness, transparency, and accountability in the development, deployment, and management of AI systems. In the era of increasing reliance on AI-driven decision-making, understanding and explaining the predictions made by these models is essential. The interpretability of AI models helps build trust, enables better decision-making, and allows us to mitigate biases.</p>
<p>Two popular methods for explaining AI models are KernelShap and TreeShap. These techniques are part of the SHAP (SHapley Additive exPlanations) family, which is based on cooperative game theory. In this blog post, we will delve into the details of KernelShap and TreeShap, exploring their underlying principles, advantages, and use cases.</p>
<blockquote>
<p><strong>SHAP (SHapley Additive exPlanations)</strong> is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions (see <a href="https://github.com/shap/shap#citations">papers</a> for details and citations).
<em>(from <a href="https://shap.readthedocs.io/en/latest/index.html">SHAP documentation</a>)</em></p>
</blockquote>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#kernelshap">KernelShap</a><ul>
<li><a href="#steps">Steps</a></li>
<li><a href="#kernelshap-advantages-and-limitations">KernelShap advantages and limitations</a></li>
</ul>
</li>
<li><a href="#treeshap">TreeShap</a><ul>
<li><a href="#steps-1">Steps</a></li>
<li><a href="#treeshap-advantages-and-limitations">TreeShap advantages and limitations</a></li>
</ul>
</li>
<li><a href="#conclusion">Conclusion</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="kernelshap"></a></p>
<h2>KernelShap</h2>
<p>KernelShap is a model-agnostic explanation method that provides interpretable explanations for any black-box model. It uses the concept of Shapley values from cooperative game theory to attribute feature importance to individual features in the context of a specific prediction.</p>
<p>The Shapley value for feature <span class="math">\(i\)</span> in a model <span class="math">\(f\)</span> can be calculated using the following formula:</p>
<div class="math">$$ϕ_i(f) = \sum_{S ⊆ N \setminus {i}} \frac{|S|!(|N|-|S|-1)!}{|N|!} [f(S ∪ {i}) - f(S)]$$</div>
<p>Here, <span class="math">\(S\)</span> is a subset of features excluding <span class="math">\(i\)</span>, and <span class="math">\(N\)</span> is the total number of features. The term <span class="math">\(|S|!\)</span> represents the factorial of the number of features in subset <span class="math">\(S\)</span>, while <span class="math">\(|N|-|S|-1!\)</span> represents the factorial of the remaining features outside of the subset. The denominator <span class="math">\(|N|!\)</span> is the factorial of the total number of features.</p>
<p>Shapley values, in the context of AI, are used to distribute the contribution of each feature to the final prediction. It ensures that the contribution of each feature is fairly allocated in a way that is efficient, symmetric, and additive.</p>
<p>KernelShap approximates the Shapley values by solving a weighted linear regression problem. It samples instances from the feature space and estimates the Shapley values using the Lasso regression model. The Lasso model is a linear model with an L1 penalty term, which helps in feature selection and makes the explanation sparse.</p>
<p>Lasso (Least Absolute Shrinkage and Selection Operator) regression is a linear regression technique that includes an L1 penalty term to shrink the coefficients of less important features towards zero. This allows for both regularization and feature selection, resulting in a more interpretable and parsimonious model.</p>
<p>The equation for Lasso regression is given by:</p>
<div class="math">$$L(\beta) = \sum_{i=1}^{n}(y_i - X_i\beta)^2 + \lambda\sum_{j=1}^{p}|\beta_j|$$</div>
<p>In this equation:</p>
<ul>
<li><span class="math">\(L(\beta)\)</span> represents the objective function to be minimized,</li>
<li><span class="math">\(y_i\)</span> is the actual response (outcome) for the <span class="math">\(i^{th}\)</span> observation,</li>
<li><span class="math">\(X_i\)</span> is the feature vector for the <span class="math">\(i^{th}\)</span> observation,</li>
<li><span class="math">\(\beta\)</span> is the vector of coefficients to be estimated,</li>
<li><span class="math">\(n\)</span> is the total number of observations,</li>
<li><span class="math">\(p\)</span> is the total number of features,</li>
<li><span class="math">\(\lambda\)</span> is a non-negative regularization parameter, and</li>
<li><span class="math">\(|\beta_j|\)</span> is the absolute value of the <span class="math">\(j^{th}\)</span> coefficient.</li>
</ul>
<p>The first term, <span class="math">\(\sum_{i=1}^{n}(y_i - X_i\beta)^2\)</span>, is the sum of squared residuals, which represents the difference between the actual and predicted responses. Minimizing this term alone would result in an ordinary least squares regression.</p>
<p>The second term, <span class="math">\(\lambda\sum_{j=1}^{p}|\beta_j|\)</span>, is the L1 penalty term that adds the absolute values of the coefficients multiplied by the regularization parameter <span class="math">\(\lambda\)</span>. By increasing <span class="math">\(\lambda\)</span>, the penalty term forces some coefficients to be exactly zero, effectively selecting a subset of features for the final model. The optimal value of <span class="math">\(\lambda\)</span> is usually determined through cross-validation.</p>
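<p>The shrinkage and selection behavior described above can be seen directly in a small scikit-learn experiment. The data and coefficient values below are illustrative only; note also that scikit-learn's <code>Lasso</code> scales the squared-error term by <span class="math">\(1/(2n)\)</span>, so its <code>alpha</code> plays the role of <span class="math">\(\lambda\)</span> up to that constant:</p>

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)

# 100 samples, 5 features, but only the first two actually drive y.
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.1 * rng.normal(size=100)

# Increasing alpha (the role of lambda in the equation above) shrinks
# coefficients toward zero; large enough alpha sets the irrelevant
# ones exactly to zero, performing feature selection.
for alpha in (0.01, 0.5):
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha}: {np.round(coef, 2)}")
```

<p>With the larger penalty, the three irrelevant coefficients are driven exactly to zero while the two informative ones survive in shrunken form.</p>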
<p><a id="steps"></a></p>
<h3>Steps</h3>
<p>The KernelShap algorithm involves the following steps:</p>
<ol>
<li>Generate a dataset of binary-masked instances by randomly selecting feature combinations.</li>
<li>Compute the output of the black-box model for each instance.</li>
<li>Fit a weighted linear regression model on the generated dataset, where the weights are determined by the similarity between the instance and the instance of interest.</li>
<li>The coefficients of the linear regression model represent the approximate Shapley values.</li>
</ol>
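<p>The four steps above can be sketched end-to-end with NumPy and scikit-learn. This is a simplified illustration rather than the production SHAP implementation: the toy linear model, the sample counts, and the use of plain weighted least squares (instead of Lasso) are all choices made for brevity:</p>

```python
import numpy as np
from math import comb
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# A toy black-box model: any callable mapping feature rows to scalars works.
def model(X):
    return 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2]

background = rng.normal(size=(100, 3))  # reference data for "absent" features
x = np.array([1.0, 2.0, 3.0])           # instance to explain
n_features = x.shape[0]

# Step 1: sample binary masks marking which features are "present";
# drop the all-absent and all-present coalitions (infinite kernel weight).
masks = rng.integers(0, 2, size=(2000, n_features))
masks = masks[(masks.sum(axis=1) > 0) & (masks.sum(axis=1) < n_features)]

# Shapley kernel weight for a coalition of size s out of M features.
def kernel_weight(M, s):
    return (M - 1) / (comb(M, s) * s * (M - s))

weights = np.array([kernel_weight(n_features, int(m.sum())) for m in masks])

# Step 2: evaluate the model on each masked instance; absent features are
# replaced by background samples and the outputs averaged.
preds = np.empty(len(masks))
for i, m in enumerate(masks):
    samples = background.copy()
    samples[:, m == 1] = x[m == 1]
    preds[i] = model(samples).mean()

# Steps 3-4: fit a weighted linear regression on the masks; its
# coefficients approximate the Shapley values.
base = model(background).mean()
reg = LinearRegression(fit_intercept=False).fit(
    masks, preds - base, sample_weight=weights
)
phi = reg.coef_

# Efficiency: base value plus contributions recovers the prediction for x.
print(np.round(phi, 3), base + phi.sum(), model(x[None, :])[0])
```

<p>Because the toy model is linear, the recovered contributions are exact: each <code>phi[j]</code> equals the model coefficient times the feature's deviation from the background mean, and they sum (with the base value) to the prediction being explained.</p>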
<p><a id="kernelshap-advantages-and-limitations"></a></p>
<h3>KernelShap advantages and limitations</h3>
<p>KernelShap has several <strong>advantages</strong>:</p>
<ul>
<li>It can be applied to any black-box model, regardless of its architecture or training algorithm.</li>
<li>It provides a unified measure of feature importance that is consistent across different models.</li>
<li>It takes into account the interactions between features.</li>
</ul>
<p>However, KernelShap also has some <strong>limitations</strong>:</p>
<ul>
<li>It can be computationally expensive, especially for high-dimensional data or complex models.</li>
<li>It requires a large number of samples to provide accurate estimates of the Shapley values.</li>
</ul>
<p><a id="treeshap"></a></p>
<h2>TreeShap</h2>
<p>TreeShap is a model-specific explanation method designed for tree-based models, such as decision trees, random forests, and gradient boosting machines. Like KernelShap, it is based on Shapley values, but it exploits the structure of tree-based models to compute the values efficiently.</p>
<p>TreeShap <strong>computes the exact Shapley values for each feature by recursively traversing the decision tree</strong>, attributing contributions to each feature as it moves down the tree. It uses a dynamic programming approach to avoid redundant calculations and reduce the computational complexity.</p>
<p><a id="steps-1"></a></p>
<h3>Steps</h3>
<p>The TreeShap algorithm involves the following steps:</p>
<ol>
<li>Traverse the tree from the root to the leaf nodes, recording the decision path for the instance of interest.</li>
<li>Attribute contributions to each feature encountered along the path, taking into account the number of possible feature combinations and the probability of each combination.</li>
<li>Repeat the process for all trees in the ensemble, if applicable.</li>
<li>Average the contributions across all trees to obtain the final Shapley values.</li>
</ol>
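<p>TreeShap's recursive traversal is intricate to implement, but the quantity it computes efficiently can be checked against the brute-force Shapley definition, which is feasible here only because the iris data has just four features. The background sample, explained class, and forest size below are arbitrary choices for illustration:</p>

```python
import numpy as np
from itertools import combinations
from math import factorial
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

background = X[:50]   # reference data standing in for "absent" features
x = X[100]            # instance to explain
M = X.shape[1]

def f(data):
    # Model output to explain: probability of class 2, averaged over rows.
    return model.predict_proba(data)[:, 2].mean()

def value(S):
    # Value of coalition S: expected output with features in S fixed to x.
    data = background.copy()
    data[:, list(S)] = x[list(S)]
    return f(data)

# Brute-force Shapley values: average marginal contribution of each
# feature over every coalition of the remaining features.
phi = np.zeros(M)
for j in range(M):
    others = [k for k in range(M) if k != j]
    for r in range(M):
        for S in combinations(others, r):
            w = factorial(len(S)) * factorial(M - len(S) - 1) / factorial(M)
            phi[j] += w * (value(S + (j,)) - value(S))

# Efficiency: the contributions sum to f(x) minus the base value.
print(np.round(phi, 3), phi.sum())
```

<p>This enumeration costs <span class="math">\(O(2^M)\)</span> model evaluations per feature, which is exactly the blow-up TreeShap's dynamic programming over tree paths avoids.</p>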
<p><a id="treeshap-advantages-and-limitations"></a></p>
<h3>TreeShap advantages and limitations</h3>
<p>TreeShap has several advantages:</p>
<ul>
<li>It computes the exact Shapley values without the need for sampling or approximations.</li>
<li>It is computationally efficient due to its dynamic programming approach.</li>
<li>It is specifically designed for tree-based models, which are widely used in practice.</li>
</ul>
<p>However, TreeShap is limited to tree-based models and cannot be applied to other types of models, such as deep learning or support vector machines.
<a id="conclusion"></a></p>
<h2>Conclusion</h2>
<p>KernelShap and TreeShap are powerful methods for explaining AI models in the context of responsible AI. Both techniques leverage the concept of Shapley values to provide interpretable and fair attributions of feature importance. While KernelShap is a model-agnostic approach that can be applied to any black-box model, TreeShap is tailored for tree-based models and offers exact Shapley values with computational efficiency.</p>
<p>Understanding and implementing these methods is crucial for AI practitioners who aim to build transparent, accountable, and trustworthy AI systems. By providing insights into the inner workings of AI models, KernelShap and TreeShap enable developers to identify potential biases, improve the decision-making process, and ultimately foster trust in AI-driven technologies.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>
<p><strong>Edits:</strong></p>
<ul>
<li>2023-11-05: Added TLDR section, minor edits</li>
</ul>
<script type="text/javascript">if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var align = "center",
indent = "0em",
linebreak = "false";
if (false) {
align = (screen.width < 768) ? "left" : align;
indent = (screen.width < 768) ? "0em" : indent;
linebreak = (screen.width < 768) ? 'true' : linebreak;
}
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=TeX-AMS-MML_HTMLorMML';
var configscript = document.createElement('script');
configscript.type = 'text/x-mathjax-config';
configscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'none' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: '"+ align +"'," +
" displayIndent: '"+ indent +"'," +
" showMathMenu: true," +
" messageStyle: 'normal'," +
" tex2jax: { " +
" inlineMath: [ ['\\\\(','\\\\)'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" availableFonts: ['STIX', 'TeX']," +
" preferredFont: 'STIX'," +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'inherit ! important'} }," +
" linebreaks: { automatic: "+ linebreak +", width: '90% container' }," +
" }, " +
"}); " +
"if ('default' !== 'default') {" +
"MathJax.Hub.Register.StartupHook('HTML-CSS Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax['HTML-CSS'].FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"MathJax.Hub.Register.StartupHook('SVG Jax Ready',function () {" +
"var VARIANT = MathJax.OutputJax.SVG.FONTDATA.VARIANT;" +
"VARIANT['normal'].fonts.unshift('MathJax_default');" +
"VARIANT['bold'].fonts.unshift('MathJax_default-bold');" +
"VARIANT['italic'].fonts.unshift('MathJax_default-italic');" +
"VARIANT['-tex-mathit'].fonts.unshift('MathJax_default-italic');" +
"});" +
"}";
(document.body || document.getElementsByTagName('head')[0]).appendChild(configscript);
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>LIME Tutorial2023-04-14T00:00:00+02:002023-04-14T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-14:/lime-tutorial/<p>Unveiling the mysteries of AI decisions? Let us dive into LIME, the tool that sheds light on the black box.</p><p>In this tutorial, we'll be exploring how to use the LIME (Local Interpretable Model-Agnostic Explanations) library for explainable AI. We'll start by discussing what LIME is and why it's useful for explainable AI, and then we'll dive into the code.</p>
<h2>What is LIME?</h2>
<p>LIME is a library for explaining the predictions of machine learning models. It works by creating "local" surrogate models that approximate the behavior of the original model in the vicinity of a particular prediction. The idea behind LIME is that these surrogate models can be used to provide human-understandable explanations for how the original model arrived at its decision.</p>
<p>Why is LIME useful for explainable AI? There are a few reasons:</p>
<ol>
<li>
<p><strong>Transparency:</strong> LIME allows us to peek "under the hood" of a black box model and see how it's making its decisions.</p>
</li>
<li>
<p><strong>Trust:</strong> By providing human-understandable explanations, LIME can increase our trust in the model's decisions.</p>
</li>
<li>
<p><strong>Debugging:</strong> LIME can help us identify problems with our model by highlighting areas where the model is making incorrect or unexpected predictions.</p>
</li>
</ol>
<p>Now that we understand why LIME is useful, let's dive into the code.</p>
<h2>Selecting a Dataset</h2>
<p>For this tutorial, we'll be using the classic "Iris" dataset, which is a popular dataset for classification tasks. The Iris dataset consists of 150 samples, each with four features (sepal length, sepal width, petal length, and petal width), and each sample belongs to one of three classes (setosa, versicolor, or virginica). The goal is to build a machine learning model that can predict the class of a new sample based on its features.</p>
<p>To start, we'll load the Iris dataset using scikit-learn:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">sklearn.datasets</span> <span class="kn">import</span> <span class="n">load_iris</span>
<span class="n">iris</span> <span class="o">=</span> <span class="n">load_iris</span><span class="p">()</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">iris</span><span class="o">.</span><span class="n">data</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">iris</span><span class="o">.</span><span class="n">target</span>
</code></pre></div>
<p>Next, we'll split the dataset into training and testing sets:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">test_size</span><span class="o">=</span><span class="mf">0.2</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
</code></pre></div>
<p>We'll use the training set to train our machine learning model, and the testing set to evaluate its performance.</p>
<h2>Training a Machine Learning Model</h2>
<p>For this tutorial, we'll use a random forest classifier as our machine learning model. The random forest algorithm is an ensemble learning method that combines multiple decision trees to improve the accuracy and robustness of the predictions.</p>
<p>We'll start by importing the necessary libraries and creating the classifier:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">sklearn.ensemble</span> <span class="kn">import</span> <span class="n">RandomForestClassifier</span>
<span class="n">rfc</span> <span class="o">=</span> <span class="n">RandomForestClassifier</span><span class="p">(</span><span class="n">n_estimators</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
</code></pre></div>
<p>We're using 100 decision trees in our random forest classifier, and setting the random state to 42 for reproducibility.</p>
<p>Next, we'll fit the classifier to the training data:</p>
<div class="highlight"><pre><span></span><code><span class="n">rfc</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
</code></pre></div>
<p>Finally, we'll evaluate the performance of the classifier on the testing data:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">accuracy_score</span>
<span class="n">y_pred</span> <span class="o">=</span> <span class="n">rfc</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="n">accuracy</span> <span class="o">=</span> <span class="n">accuracy_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">y_pred</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Accuracy: </span><span class="si">{</span><span class="n">accuracy</span><span class="si">:</span><span class="s2">.2f</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
</code></pre></div>
<p>When we run this code, we should see an accuracy of around 0.97, which means our model is doing a pretty good job of predicting the class of new samples.</p>
<h2>Explaining Model Predictions with LIME</h2>
<p>Now that we have a trained machine learning model, we can start using LIME to explain its predictions.</p>
<p>First, we need to create an explainer object:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">lime</span>
<span class="kn">import</span> <span class="nn">lime.lime_tabular</span>
<span class="n">explainer</span> <span class="o">=</span> <span class="n">lime</span><span class="o">.</span><span class="n">lime_tabular</span><span class="o">.</span><span class="n">LimeTabularExplainer</span><span class="p">(</span>
<span class="n">X_train</span><span class="p">,</span>
<span class="n">feature_names</span><span class="o">=</span><span class="n">iris</span><span class="o">.</span><span class="n">feature_names</span><span class="p">,</span>
<span class="n">class_names</span><span class="o">=</span><span class="n">iris</span><span class="o">.</span><span class="n">target_names</span><span class="p">,</span>
<span class="n">discretize_continuous</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="p">)</span>
</code></pre></div>
<p>Here, we're creating a <code>LimeTabularExplainer</code> object and passing in the training data, feature names, class names, and setting <code>discretize_continuous</code> to <code>True</code> to discretize any continuous features.</p>
<p>Next, we'll pick a sample from the testing data that we want to explain:</p>
<div class="highlight"><pre><span></span><code><span class="n">idx</span> <span class="o">=</span> <span class="mi">0</span> <span class="c1"># index of the sample we want to explain</span>
<span class="n">exp</span> <span class="o">=</span> <span class="n">explainer</span><span class="o">.</span><span class="n">explain_instance</span><span class="p">(</span><span class="n">X_test</span><span class="p">[</span><span class="n">idx</span><span class="p">],</span> <span class="n">rfc</span><span class="o">.</span><span class="n">predict_proba</span><span class="p">)</span>
</code></pre></div>
<p>Here, we're using the <code>explain_instance</code> method to generate an explanation for the sample at index <code>idx</code>. We're passing in the sample data and the <code>predict_proba</code> method of the random forest classifier, which is used to predict the probabilities of each class for the given sample.</p>
<p>Now, we can print out the top three features that are contributing to the prediction:</p>
<div class="highlight"><pre><span></span><code><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">):</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"</span><span class="si">{</span><span class="n">exp</span><span class="o">.</span><span class="n">as_list</span><span class="p">()[</span><span class="n">i</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="s2">: </span><span class="si">{</span><span class="n">exp</span><span class="o">.</span><span class="n">as_list</span><span class="p">()[</span><span class="n">i</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span><span class="si">:</span><span class="s2">.2f</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
</code></pre></div>
<p>This will give us something like:</p>
<div class="highlight"><pre><span></span><code><span class="mf">4.25</span> <span class="o"><</span> <span class="n">petal</span> <span class="n">length</span> <span class="p">(</span><span class="n">cm</span><span class="p">)</span> <span class="o"><=</span> <span class="mf">5.10</span><span class="p">:</span> <span class="mf">0.21</span>
<span class="mf">0.30</span> <span class="o"><</span> <span class="n">petal</span> <span class="n">width</span> <span class="p">(</span><span class="n">cm</span><span class="p">)</span> <span class="o"><=</span> <span class="mf">1.30</span><span class="p">:</span> <span class="mf">0.16</span>
<span class="n">sepal</span> <span class="n">width</span> <span class="p">(</span><span class="n">cm</span><span class="p">)</span> <span class="o"><=</span> <span class="mf">2.80</span><span class="p">:</span> <span class="o">-</span><span class="mf">0.03</span>
</code></pre></div>
<p>This tells us that the most important feature for this prediction is petal length (cm): values between 4.25 and 5.10 contribute 0.21 toward the predicted class, followed by petal width (cm) with a contribution of 0.16, while a small sepal width pushes slightly against it.</p>
<p>We can also visualize the explanation using a bar chart:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">lime</span> <span class="kn">import</span> <span class="n">lime_tabular</span>
<span class="n">fig</span> <span class="o">=</span> <span class="n">exp</span><span class="o">.</span><span class="n">as_pyplot_figure</span><span class="p">()</span>
</code></pre></div>
<p>This will create a bar chart that shows the contribution of each feature to the prediction, with the most important features at the top:</p>
<p><img alt="LIME bar chart" src="/images/lime_tutorial/lime_bar_chart.png"></p>
<h2>Visualizing Model Decisions</h2>
<p>In addition to explaining individual predictions, LIME can also be used to visualize how the model is making decisions more generally. We can do this by generating multiple explanations for different samples and visualizing the patterns that emerge.</p>
<p>To start, we'll generate an explanation for a test data point; then we'll render it as an interactive visualization in the notebook:</p>
<div class="highlight"><pre><span></span><code><span class="n">exp</span> <span class="o">=</span> <span class="n">explainer</span><span class="o">.</span><span class="n">explain_instance</span><span class="p">(</span>
<span class="n">data_row</span><span class="o">=</span><span class="n">X_test</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span>
<span class="n">predict_fn</span><span class="o">=</span><span class="n">rfc</span><span class="o">.</span><span class="n">predict_proba</span>
<span class="p">)</span>
<span class="n">exp</span><span class="o">.</span><span class="n">show_in_notebook</span><span class="p">(</span><span class="n">show_table</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</code></pre></div>
<p><img alt="LIME - explanation visualization" src="/images/lime_tutorial/lime_explanation.png"></p>
<h2>Conclusion</h2>
<p>In this tutorial, we learned how to use the LIME library for explainable AI. We started by importing the necessary libraries and loading the Iris dataset. Then, we trained a random forest classifier on the dataset and used LIME to explain individual predictions and visualize model decisions.</p>
<p>We saw how LIME can be used to identify the most important features for a prediction, and how these features can be visualized using a bar chart. We also saw how LIME can be used to visualize how the model is making decisions more generally, using a decision plot.</p>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>
<h2>Related</h2>
<ul>
<li><a href="https://stackoverflow.com/questions/63937620/how-to-plot-lime-report-when-there-is-a-lot-of-features-in-data-set">python - How to plot Lime report when there is a lot of features in data-set - Stack Overflow</a></li>
<li><a href="https://betterdatascience.com/lime/">LIME: How to Interpret Machine Learning Models With Python | Better Data Science</a></li>
<li><a href="https://coderzcolumn.com/tutorials/machine-learning/how-to-use-lime-to-understand-sklearn-models-predictions">How to Use LIME to Interpret Predictions of ML Models [Python]?</a></li>
<li><a href="https://shiring.github.io/machine_learning/2017/04/23/lime">Explaining complex machine learning models with LIME</a> (in R)</li>
</ul>How to Deploy FreshRSS in the Cloud for Free on Azure?2023-04-11T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-11:/how-to-deploy-freshrss-in-the-cloud-for-free-on-azure/<p>FreshRSS is a free and open-source RSS feed aggregator that allows you to easily follow your favorite websites and blogs in one place. By deploying FreshRSS in the cloud, you can access your feeds from anywhere and enjoy the benefits of cloud …</p><p>FreshRSS is a free and open-source RSS feed aggregator that allows you to easily follow your favorite websites and blogs in one place. By deploying FreshRSS in the cloud, you can access your feeds from anywhere and enjoy the benefits of cloud computing, such as scalability, reliability, and cost-effectiveness. Microsoft Azure is a popular cloud platform that offers a wide range of services for building, deploying, and managing applications in the cloud. In this tutorial, we'll show you how to deploy FreshRSS in the cloud for free on Azure.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#prerequisites">Prerequisites:</a></li>
<li><a href="#step-by-step-guide">Step-by-step guide</a></li>
<li><a href="#step-1-create-a-new-azure-web-app">Step 1: Create a new Azure web app</a></li>
<li><a href="#step-2-configure-your-web-app">Step 2: Configure your web app</a></li>
<li><a href="#step-3-deploy-freshrss">Step 3: Deploy FreshRSS</a></li>
<li><a href="#step-4-configure-ssl-optional">Step 4: Configure SSL (optional)</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="prerequisites"></a></p>
<h2>Prerequisites</h2>
<ul>
<li>A Microsoft Azure account</li>
<li>A basic understanding of Azure services and concepts</li>
<li>A web browser</li>
<li>An SSH client (optional)</li>
</ul>
<p><a id="step-by-step-guide"></a></p>
<h2>Step-by-step guide</h2>
<p><a id="step-1-create-a-new-azure-web-app"></a></p>
<h3>Step 1: Create a new Azure web app</h3>
<p>The first step is to create a new Azure web app, which will host your FreshRSS installation. Follow these steps to create a new web app:</p>
<ol>
<li>
<p>Log in to the Azure portal (<a href="https://portal.azure.com/">https://portal.azure.com</a>) using your Azure account credentials.</p>
</li>
<li>
<p>Click on the "Create a resource" button in the left-hand menu and search for "Web App".</p>
</li>
<li>
<p>Select the "Web App" service and click on the "Create" button.</p>
</li>
<li>
<p>Fill in the required details for your web app, such as the name, subscription, resource group, and runtime stack.</p>
</li>
<li>
<p>Choose the "Free" pricing tier, which provides up to 10 web, mobile, or API apps with shared compute resources and 1 GB storage per app.</p>
</li>
<li>
<p>Click on the "Review + create" button to review your settings and then click on the "Create" button to create your web app.</p>
</li>
<li>
<p>Wait for the deployment to complete, which may take a few minutes.</p>
</li>
</ol>
<p><a id="step-2-configure-your-web-app"></a></p>
<h3>Step 2: Configure your web app</h3>
<p>The next step is to configure your web app with the necessary settings and dependencies. Follow these steps to configure your web app:</p>
<ol>
<li>
<p>Navigate to your web app in the Azure portal and click on the "Configuration" tab.</p>
</li>
<li>
<p>Under the "General settings" section, set the "Linux container" option to "On".</p>
</li>
<li>
<p>Under the "Stack settings" section, set the "Runtime stack" option to "PHP 7.3".</p>
</li>
<li>
<p>Under the "Application settings" section, add the following key-value pairs:</p>
<ul>
<li>Key: WEBSITE_TIME_ZONE, Value: Your timezone (e.g., "America/Los_Angeles")</li>
<li>Key: WEBSITE_AUTH_ENABLED, Value: False</li>
<li>Key: WEBSITE_NODE_DEFAULT_VERSION, Value: 10.14.2</li>
</ul>
<p>Click on the "Save" button to save your changes.</p>
</li>
</ol>
<p><a id="step-3-deploy-freshrss"></a></p>
<h3>Step 3: Deploy FreshRSS</h3>
<p>The next step is to deploy FreshRSS to your web app. Follow these steps to deploy FreshRSS:</p>
<ol>
<li>
<p>Download the latest version of FreshRSS from the official website (<a href="https://freshrss.org/">https://freshrss.org</a>) and extract the files to a local directory.</p>
</li>
<li>
<p>Open a command prompt or terminal window and navigate to the local directory where you extracted the FreshRSS files.</p>
</li>
<li>
<p>Use the following command to create a ZIP archive of the FreshRSS files:</p>
<p><code>zip -r freshrss.zip .</code></p>
</li>
<li>
<p>Return to the Azure portal and navigate to your web app.</p>
</li>
<li>
<p>Click on the "Deployment Center" tab and select the "Local Git" option.</p>
</li>
<li>
<p>Follow the on-screen instructions to create a new deployment user and download the deployment credentials.</p>
</li>
<li>
<p>Use the following commands to add the Azure Git remote to your local Git repository and push your changes to the Azure web app:</p>
<p><code>git remote add azure &lt;deployment-endpoint&gt;</code><br>
<code>git push azure master</code></p>
</li>
<li>
<p>When prompted, enter the deployment username and password that you created earlier.</p>
</li>
<li>
<p>Wait for the deployment to complete, which may take a few minutes.</p>
</li>
<li>
<p>Once the deployment is complete, open a web browser and navigate to your web app's URL to access FreshRSS. You should see the FreshRSS installation page.</p>
</li>
<li>
<p>Follow the on-screen instructions to complete the FreshRSS installation. Make sure to set the database type to "SQLite" and the database path to "/home/site/wwwroot/data/freshrss.db".</p>
</li>
<li>
<p>Once the installation is complete, you should be able to access FreshRSS and start adding your favorite feeds.</p>
</li>
</ol>
<p><a id="step-4-configure-ssl-optional"></a></p>
<h3>Step 4: Configure SSL (optional)</h3>
<p>If you want to secure your FreshRSS installation with SSL, you can do so by configuring a custom domain and adding an SSL certificate. Follow these steps to configure SSL:</p>
<ol>
<li>
<p>Purchase a custom domain from a domain registrar, such as GoDaddy or Namecheap.</p>
</li>
<li>
<p>Navigate to your web app in the Azure portal and click on the "Custom domains" tab.</p>
</li>
<li>
<p>Add your custom domain and follow the on-screen instructions to configure DNS settings.</p>
</li>
<li>
<p>Once your domain is configured, navigate to the "SSL certificates" tab and click on the "Create App Service Managed Certificate" button.</p>
</li>
<li>
<p>Follow the on-screen instructions to create a new SSL certificate for your custom domain.</p>
</li>
<li>
<p>Once the certificate is created, navigate back to the "Custom domains" tab and click on the "Add binding" button.</p>
</li>
<li>
<p>Select your custom domain and the newly created SSL certificate and click on the "Add binding" button.</p>
</li>
<li>
<p>Wait for the SSL certificate to be provisioned, which may take a few minutes.</p>
</li>
<li>
<p>Once the SSL certificate is provisioned, you should be able to access FreshRSS securely using your custom domain.</p>
</li>
</ol>
<p>X::<a href="https://www.safjan.com/how-to-deploy-freshrss-in-the-cloud-for-free-on-gcp/">How to Deploy FreshRSS in the Cloud for Free on GCP?</a></p>How to Deploy FreshRSS in the Cloud for Free on GCP?2023-04-11T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-11:/how-to-deploy-freshrss-in-the-cloud-for-free-on-gcp/<p>To deploy FreshRSS in the cloud for free on Google Cloud Platform (GCP), you can follow these steps:</p>
<ol>
<li>
<p>Create a new project on GCP and enable billing. FreshRSS requires a web server and a database, and GCP provides free usage limits for …</p></li></ol><p>To deploy FreshRSS in the cloud for free on Google Cloud Platform (GCP), you can follow these steps:</p>
<ol>
<li>
<p>Create a new project on GCP and enable billing. FreshRSS requires a web server and a database, and GCP provides free usage limits for these services for a limited time. You will need to provide billing information to verify your account and enable these services.</p>
</li>
<li>
<p>Launch a Compute Engine instance. FreshRSS can run on any Linux-based server, so you can choose an instance that meets your needs. For this example, we'll use a micro instance with Debian 10.</p>
</li>
<li>
<p>Connect to the instance using SSH. You can use the SSH button in the GCP Console or connect from your terminal using the external IP address.</p>
</li>
<li>
<p>Install the necessary packages. Run the following command to update the package index and install the required packages:</p>
</li>
</ol>
<div class="highlight"><pre><span></span><code>sudo<span class="w"> </span>apt<span class="w"> </span>update
sudo<span class="w"> </span>apt<span class="w"> </span>install<span class="w"> </span>apache2<span class="w"> </span>mariadb-server<span class="w"> </span>php7.3<span class="w"> </span>php7.3-mysql<span class="w"> </span>php7.3-curl<span class="w"> </span>php7.3-xml
</code></pre></div>
<ol>
<li>Configure the database. Follow these steps to create a new database and user for FreshRSS:</li>
</ol>
<div class="highlight"><pre><span></span><code>sudo<span class="w"> </span>mysql<span class="w"> </span>-u<span class="w"> </span>root
CREATE<span class="w"> </span>DATABASE<span class="w"> </span>freshrss<span class="p">;</span>
GRANT<span class="w"> </span>ALL<span class="w"> </span>PRIVILEGES<span class="w"> </span>ON<span class="w"> </span>freshrss.*<span class="w"> </span>TO<span class="w"> </span><span class="s1">'freshrssuser'</span>@<span class="s1">'localhost'</span><span class="w"> </span>IDENTIFIED<span class="w"> </span>BY<span class="w"> </span><span class="s1">'password'</span><span class="p">;</span>
FLUSH<span class="w"> </span>PRIVILEGES<span class="p">;</span>
EXIT<span class="p">;</span>
</code></pre></div>
<ol>
<li>Download and install FreshRSS. Run the following commands to download the latest version of FreshRSS and extract it to the web root:</li>
</ol>
<div class="highlight"><pre><span></span><code><span class="nb">cd</span><span class="w"> </span>/var/www/html
sudo<span class="w"> </span>wget<span class="w"> </span>https://github.com/FreshRSS/FreshRSS/archive/master.tar.gz
sudo<span class="w"> </span>tar<span class="w"> </span>-xzf<span class="w"> </span>master.tar.gz<span class="w"> </span>--strip-components<span class="o">=</span><span class="m">1</span>
sudo<span class="w"> </span>chown<span class="w"> </span>-R<span class="w"> </span>www-data:www-data<span class="w"> </span>.
</code></pre></div>
<ol start="4">
<li>Configure the web server. Edit the default Apache configuration file to enable URL rewriting:</li>
</ol>
<div class="highlight"><pre><span></span><code>sudo<span class="w"> </span>nano<span class="w"> </span>/etc/apache2/sites-enabled/000-default.conf
</code></pre></div>
<div class="highlight"><pre><span></span><code>Add the following lines inside the `<VirtualHost>` block:
</code></pre></div>
<div class="highlight"><pre><span></span><code><span class="nt"><Directory</span><span class="w"> </span><span class="err">/var/www/html</span><span class="nt">></span>
<span class="w"> </span>AllowOverride<span class="w"> </span>All
<span class="nt"></Directory></span>
</code></pre></div>
<ol start="5">
<li>Restart the web server. Run the following command to apply the changes:</li>
</ol>
<div class="highlight"><pre><span></span><code>sudo<span class="w"> </span>systemctl<span class="w"> </span>restart<span class="w"> </span>apache2
</code></pre></div>
<ol start="6">
<li>Complete the FreshRSS setup. Open your web browser and navigate to the external IP address of your instance. Follow the on-screen instructions to configure FreshRSS.</li>
</ol>
<p>Congratulations, you have successfully deployed FreshRSS in the cloud for free on GCP!</p>
<p>X::<a href="https://www.safjan.com/how-to-deploy-freshrss-in-the-cloud-for-free-on-azure/">How to Deploy FreshRSS in the Cloud for Free on Azure?</a></p>Zero-Knowledge Explained Like to 5 Years Old2023-04-06T00:00:00+02:002023-04-06T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-06:/zero-knowledge-for-5yo/<p>Imagine being able to prove something without actually revealing it. That is the power of zero-knowledge proofs, the technology that keeps your crypto safe.</p><p>Zero-knowledge proofs (ZKPs) are a key technology that underpins the security and privacy of many modern cryptocurrencies. In essence, ZKPs allow parties to prove that they know a piece of information, without revealing that information itself. But what does that mean, exactly? In this blog post, we'll explain ZKPs in a way that even a 5-year-old can understand.</p>
<h2>Helper example</h2>
<p>Let's start with a basic example. Imagine you have a secret toy that you don't want anyone else to know about. Your friend wants to prove to you that they know what the toy is, without actually telling you what it is. How can they do that?</p>
<p>One way to do it is to play a guessing game. Your friend can ask you a series of questions about the toy, such as "Is it blue?" or "Does it have wheels?" Based on your answers, your friend can narrow down the possibilities until they have a pretty good idea of what the toy is. This is a bit like a multiple-choice test: by eliminating the wrong answers, you can eventually arrive at the right one.</p>
<p>But what if your friend wants to prove that they know the toy, without giving you any clues about what it is? That's where zero-knowledge proofs come in.</p>
<p>Imagine your friend has a magic wand that can tell them whether a particular guess is right or wrong, without actually revealing what the correct answer is. So they can make a guess, wave the wand, and get a "yes" or "no" answer. If the answer is "no", they can make another guess and try again. If the answer is "yes", they've proven that they know the toy, without actually revealing what it is.</p>
<p>This is a bit like playing "20 questions", but with a magical yes-or-no answer that doesn't give away any information. Your friend doesn't need to ask you any questions about the toy, they just need to make a series of guesses and use the magic wand to check if they're right or wrong. And because the wand doesn't reveal anything about the toy itself, you still don't know what it is.</p>
<h2>Zero-knowledge in Cryptocurrency</h2>
<p>Now, let's apply this idea to cryptocurrency. In a blockchain system like Bitcoin, transactions are recorded on a public ledger that anyone can see. But the ledger doesn't reveal who the parties involved in the transaction are. Instead, it uses cryptographic techniques to obscure their identities.</p>
<p>For example, imagine you want to send some Bitcoin to a friend. You create a transaction that says "send X amount of Bitcoin to this address". But instead of using your real name and address, you use a pseudonymous address that's associated with your public key.</p>
<p>The public key is a string of characters that's generated using a complex mathematical algorithm. It's unique to you, and it's used to encrypt and decrypt messages that are sent to and from your address. But it doesn't reveal your actual identity.</p>
<p>So when you send the Bitcoin, the transaction is broadcast to the network and added to the blockchain. But nobody knows who the parties involved are, because they're identified only by their public keys.</p>
<p>This is where zero-knowledge proofs come in. Imagine you want to prove to someone that you own a particular address, without revealing what that address is. You could use a zero-knowledge proof to demonstrate that you know the private key associated with that address, without actually showing the key itself.</p>
<blockquote>
<p><strong>Zero-knowledge proof (ZKP)</strong>
The proof works by having the verifier issue a random "challenge". You then compute a response, using your private key, that demonstrates you know the key, without revealing what it is.</p>
</blockquote>
<p>This is a bit like the guessing game we talked about earlier. The challenge is like a question that's designed to test whether you know the private key, and the response is like an answer that proves that you do, without revealing what the key is. This allows you to prove ownership of the address, without revealing any sensitive information.</p>
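<p>To make the challenge-response idea concrete, here is a toy sketch of one round of a Schnorr-style identification protocol in Python. The parameters below are tiny and chosen purely for illustration (real systems use carefully selected large groups), but the shape is the same: the verifier becomes convinced that the prover knows the secret exponent, without learning the exponent itself.</p>

```python
import secrets

# Toy Schnorr-style identification: prove knowledge of x with y = g^x mod p,
# without revealing x. Illustrative parameters only -- NOT secure.
p = 2**127 - 1            # a prime modulus (a Mersenne prime, for the demo)
q = p - 1                 # order of the multiplicative group mod p
g = 3                     # public base

x = secrets.randbelow(q - 1) + 1   # prover's secret key
y = pow(g, x, p)                   # prover's public key

# One round of the interactive protocol:
r = secrets.randbelow(q - 1) + 1   # prover picks a random nonce
t = pow(g, r, p)                   # ...and sends the commitment t
c = secrets.randbelow(q - 1) + 1   # verifier sends a random challenge
s = (r + c * x) % q                # prover answers with the response s

# Verifier accepts iff g^s == t * y^c (mod p); this reveals nothing about x.
print(pow(g, s, p) == (t * pow(y, c, p)) % p)  # True
```

<p>Note how this mirrors the magic-wand game: the commitment and response together act as a "yes" answer that checks out mathematically, while the secret itself never leaves the prover.</p>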
<p>This is important for privacy and security in cryptocurrency, because it means that you can prove ownership of an address without revealing your identity or any other sensitive information. It also makes it much harder for hackers or other bad actors to steal your cryptocurrency, because they would need to know your private key in order to access your funds.</p>
<p>So there you have it, zero-knowledge proofs explained like you're 5 years old! They're a clever way of proving that you know something, without actually revealing what it is. And in the world of cryptocurrency, they're a key technology that helps to ensure the security and privacy of your transactions.</p>
<blockquote>
<p><strong>ZKP Origin</strong>
Zero-knowledge proofs were first introduced by researchers Shafi Goldwasser, Silvio Micali, and Charles Rackoff in 1985. Their groundbreaking paper, <a href="https://dl.acm.org/doi/10.1145/22145.22178">"The Knowledge Complexity of Interactive Proof-Systems,"</a> laid the foundation for zero-knowledge proof systems.
Silvio Micali won the <a href="https://amturing.acm.org/award_winners/micali_9954407.cfm">Turing Award</a> for his work on cryptography, including the invention of zero-knowledge (ZK) proofs.</p>
</blockquote>
<h2>Related reading</h2>
<ul>
<li>The Reddit user <a href="https://www.reddit.com/user/busterrulezzz/">busterrulezzz (u/busterrulezzz) - Reddit</a> proposed another ELI5 explanation of how ZKPs work: <a href="https://www.reddit.com/r/CryptoCurrency/comments/rwpfkx/zeroknowledge_proof_explained_like_you_are_5/">Zero-knowledge proof explained like you are 5 years old : r/CryptoCurrency</a></li>
<li><a href="https://hackernoon.com/eli5-zero-knowledge-proof-78a276db9eff">Zero Knowledge Proof: Explain it Like I’m 5 (Halloween Edition) | HackerNoon</a></li>
</ul>
<p><em>Any comments or suggestions? <a href="mailto:ksafjan@gmail.com?subject=Blog+post">Let me know</a>.</em></p>Python - Named Tuples or Dictionaries to Store Structured Data?2023-04-03T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-03:/named-tuples-vs-dictionaries/<p>Let's assume that, in Python, we have a long list of pairs to store. In this note, we will discuss the pros and cons of using a named tuple vs. a dict to store a single pair.</p>
<p>Both named tuples and dictionaries are useful …</p><p>Let's assume that, in Python, we have a long list of pairs to store. In this note, we will discuss the pros and cons of using a named tuple vs. a dict to store a single pair.</p>
<p>Both named tuples and dictionaries are useful data structures for storing key-value pairs in Python, but they have different pros and cons depending on the situation.</p>
<h2>Named tuples</h2>
<p>Here are some pros and cons of using named tuples:</p>
<h3>Pros</h3>
<ul>
<li>Named tuples are immutable, so they are safer to <strong>use in multithreaded</strong> environments where multiple threads might try to modify the same data at the same time.</li>
<li>Named tuples can be <strong>more memory-efficient</strong> than dictionaries, especially if you have a large number of instances with the same fields.</li>
<li>Named tuples are more readable than dictionaries when you have a fixed set of fields and you want to give them meaningful names.</li>
</ul>
<h3>Cons</h3>
<ul>
<li>Named tuples are less flexible than dictionaries because you can't add or remove fields once they are defined.</li>
<li>Named tuples can be less convenient to use than dictionaries if you need to access fields by key rather than by attribute name.</li>
</ul>
<h2>Dictionaries</h2>
<p>Here are some pros and cons of using dictionaries:</p>
<h3>Pros</h3>
<ul>
<li>Dictionaries are more flexible than named tuples because you can add or remove fields at any time.</li>
<li>Dictionaries are more convenient to use than named tuples if you need to access fields by key rather than by attribute name.</li>
</ul>
<h3>Cons</h3>
<ul>
<li>Dictionaries are mutable, so you need to be careful when using them in multithreaded environments.</li>
<li>Dictionaries can be less memory-efficient than named tuples, especially if you have a large number of instances with the same fields.</li>
<li>Dictionaries are less readable than named tuples when you have a fixed set of fields and you want to give them meaningful names.</li>
</ul>
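<p>The trade-offs above can be seen directly in a short snippet (the field names here are just examples):</p>

```python
import sys
from collections import namedtuple

Pair = namedtuple("Pair", ["key", "value"])

nt = Pair(key="color", value="blue")   # access by attribute
d = {"key": "color", "value": "blue"}  # access by key

print(nt.value, d["value"])  # blue blue

# Immutability: a named tuple rejects in-place modification...
try:
    nt.value = "red"
except AttributeError:
    print("named tuple is immutable")

# ...while a dict happily accepts changed or brand-new fields.
d["value"] = "red"
d["extra"] = 42

# Named tuples are also typically smaller per instance than dicts.
print(sys.getsizeof(nt) < sys.getsizeof(d))  # True on CPython
```

<p>The exact byte sizes vary between Python versions, but on CPython the per-instance overhead of the dict is consistently larger than that of the named tuple.</p>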
<h2>Conclusion</h2>
<p>If you have a fixed set of fields with meaningful names, and you don't need to add or remove fields at runtime, a named tuple is a good choice. If you need more flexibility, or you need to access fields by key rather than by attribute name, a dictionary is a better choice.</p>Python - How to Make Type Hint for the Tuple With Undetermined Number of Strings?2023-04-03T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-03:/type-hint-for-undetermined-number-of-elements/<p>To make a type hint for a tuple with an undetermined number of strings in Python, you can use the <code>Tuple</code> type from the <code>typing</code> module together with an ellipsis (<code>...</code>). Here's an example:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Tuple</span>
<span class="k">def</span> <span class="nf">process_strings</span><span class="p">(</span><span class="n">strings</span><span class="p">:</span> <span class="n">Tuple</span><span class="p">[</span><span class="nb">str …</span></code></pre></div><p>To make a type hint for a tuple with an undetermined number of strings in Python, you can use the <code>Tuple</code> type from the <code>typing</code> module together with an ellipsis (<code>...</code>). Here's an example:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Tuple</span>
<span class="k">def</span> <span class="nf">process_strings</span><span class="p">(</span><span class="n">strings</span><span class="p">:</span> <span class="n">Tuple</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="o">...</span><span class="p">])</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="k">return</span> <span class="s2">", "</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">strings</span><span class="p">)</span>
<span class="n">strings1</span> <span class="o">=</span> <span class="p">(</span><span class="s2">"hello"</span><span class="p">,</span> <span class="s2">"world"</span><span class="p">)</span>
<span class="n">strings2</span> <span class="o">=</span> <span class="p">(</span><span class="s2">"foo"</span><span class="p">,</span> <span class="s2">"bar"</span><span class="p">,</span> <span class="s2">"baz"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">process_strings</span><span class="p">(</span><span class="n">strings1</span><span class="p">))</span> <span class="c1"># Output: "hello, world"</span>
<span class="nb">print</span><span class="p">(</span><span class="n">process_strings</span><span class="p">(</span><span class="n">strings2</span><span class="p">))</span> <span class="c1"># Output: "foo, bar, baz"</span>
</code></pre></div>
<p>In the type hint above, <code>Tuple</code> indicates that we are working with a tuple, and the ellipsis (<code>...</code>) indicates that the tuple can contain an undetermined number of <code>str</code> elements. Note that <code>Union</code> is not needed here: <code>Tuple[str, ...]</code> is the standard way to annotate a homogeneous, variable-length tuple, whereas <code>Union[str, ...]</code> is not valid typing syntax.</p>
<p>X::<a href="https://www.safjan.com/type-hints-elypsis-for-arbitrary-number-of-elements/">How to Use Elypsis in Type Hints to Indicate Arbitrary Number of Elements</a></p>How to Use Elypsis in Type Hints to Indicate Arbitrary Number of Elements2023-04-03T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-04-03:/type-hints-elypsis-for-arbitrary-number-of-elements/<p>In type hints, <code>...</code> (ellipsis) is used to indicate that a function parameter or return value can have an arbitrary number of arguments or elements.</p>
<p>For example, if you have a function that takes an arbitrary number of integers as arguments, you can …</p><p>In type hints, <code>...</code> (ellipsis) is used to indicate that a function parameter or return value can have an arbitrary number of arguments or elements.</p>
<p>For example, if you have a function that takes an arbitrary number of integers as arguments, you can use <code>...</code> in the function signature to indicate that:</p>
<div class="highlight"><pre><span></span><code><span class="k">def</span> <span class="nf">foo</span><span class="p">(</span><span class="o">*</span><span class="n">args</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-></span> <span class="n">List</span><span class="p">[</span><span class="nb">int</span><span class="p">]:</span>
<span class="k">return</span> <span class="p">[</span><span class="n">x</span> <span class="o">*</span> <span class="mi">2</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">args</span><span class="p">]</span>
</code></pre></div>
<p>Here, <code>*args</code> is used to indicate that the function can take any number of arguments, and the <code>int</code> type hint indicates that each argument must be an integer. The return type is a list of integers.</p>
<p>Similarly, you can use <code>...</code> in a type hint for a tuple to indicate that the tuple can have an arbitrary number of elements of a given type. For example:</p>
<div class="highlight"><pre><span></span><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Tuple</span>
<span class="k">def</span> <span class="nf">bar</span><span class="p">(</span><span class="n">t</span><span class="p">:</span> <span class="n">Tuple</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="o">...</span><span class="p">])</span> <span class="o">-></span> <span class="nb">str</span><span class="p">:</span>
<span class="k">return</span> <span class="s2">" "</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">t</span><span class="p">)</span>
<span class="n">t1</span> <span class="o">=</span> <span class="p">(</span><span class="s2">"hello"</span><span class="p">,</span> <span class="s2">"world"</span><span class="p">)</span>
<span class="n">t2</span> <span class="o">=</span> <span class="p">(</span><span class="s2">"foo"</span><span class="p">,</span> <span class="s2">"bar"</span><span class="p">,</span> <span class="s2">"baz"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">bar</span><span class="p">(</span><span class="n">t1</span><span class="p">))</span> <span class="c1"># Output: "hello world"</span>
<span class="nb">print</span><span class="p">(</span><span class="n">bar</span><span class="p">(</span><span class="n">t2</span><span class="p">))</span> <span class="c1"># Output: "foo bar baz"</span>
</code></pre></div>
<p>Here, <code>Tuple[str, ...]</code> is used to indicate that <code>t</code> is a tuple of strings, and the <code>...</code> indicates that the tuple can have an arbitrary number of elements.</p>
<p>X::<a href="https://www.safjan.com/type-hint-for-undetermined-number-of-elements/">Python - How to Make Type Hint for the Tuple With Undetermined Number of Strings?</a></p>
<p>X::<a href="https://www.safjan.com/use-python-typeddict-to-type-hint-dictionaries/">Use Python TypedDict to Type Hint Dictionaries</a></p>Git - Annotated vs. Lightweight Tags2023-03-31T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-03-31:/git-annotated-vs-lightweight-tags/<p>In Git, tags are used to mark specific points in the history of a repository. They serve as a reference point for developers to easily identify and navigate to important milestones, such as releases or significant commits. There are two types of …</p><p>In Git, tags are used to mark specific points in the history of a repository. They serve as a reference point for developers to easily identify and navigate to important milestones, such as releases or significant commits. There are two types of tags in Git: annotated tags and lightweight tags.</p>
<h2>Annotated tags</h2>
<p>Annotated tags are more informative than lightweight tags. When creating an annotated tag, Git stores a full object in the repository that contains the tagger name, email, and date, a tagging message, and a SHA-1 checksum of the commit being tagged. Annotated tags are essentially Git objects that are separate from the commit objects they reference, whereas lightweight tags are simply pointers to specific commits.</p>
<p>The additional information stored in an annotated tag makes them useful for documenting significant events in the project's history. The tagging message can provide context about why the tag was created and what it represents. Additionally, annotated tags can be signed and verified to ensure their authenticity. Signed tags provide assurance that the tag was created by an authorized person and that the commit being tagged has not been tampered with.</p>
<p><strong>Example:</strong></p>
<div class="highlight"><pre><span></span><code>git<span class="w"> </span>tag<span class="w"> </span>-a<span class="w"> </span>v1.2<span class="w"> </span>-m<span class="w"> </span><span class="s2">"my version 1.4"</span>
</code></pre></div>
<h2>Lightweight tags</h2>
<p>Lightweight tags, on the other hand, are simply references to specific commits. They do not store any additional information beyond the tag name and the commit ID. Lightweight tags are useful for marking temporary or internal points in the repository history, such as to label a specific commit for testing or debugging purposes. Lightweight tags are created with the <code>git tag</code> command without the <code>-a</code> or <code>-m</code> options.</p>
<p><strong>Example:</strong></p>
<div class="highlight"><pre><span></span><code>git<span class="w"> </span>tag<span class="w"> </span>v1.2
</code></pre></div>
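<p>One way to see the difference between the two kinds of tags is to ask Git what type of object each tag name points to. A quick sketch in a throwaway repository (the paths, tag names, and identity flags below are just for the demo):</p>

```shell
# Create a throwaway repo with one empty commit
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"

# Annotated tag: stores a full tag object with message and tagger
git tag -a v1.2 -m "my version 1.2"
# Lightweight tag: just a ref pointing directly at the commit
git tag v1.2-light

git cat-file -t v1.2          # prints: tag
git cat-file -t v1.2-light    # prints: commit
```

<p>The annotated tag resolves to its own <code>tag</code> object, while the lightweight tag resolves straight to the <code>commit</code> it labels.</p>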
<h2>When to use Annotated and when Lightweight tags?</h2>
<p>So when should you use annotated tags versus lightweight tags? Annotated tags are ideal for marking significant events in the project's history, such as releases, milestones, or important changes. They are also useful for documenting the context and reasoning behind a particular tag. Lightweight tags, on the other hand, are useful for temporary or internal purposes, such as marking specific commits for debugging or testing purposes.</p>
<blockquote>
<p>In general, it is a good practice to <strong>use annotated</strong> tags for any <strong>official releases</strong> or <strong>milestones</strong>, as they provide a clear and detailed record of the project's progress. <strong>Lightweight</strong> tags can be used for more informal purposes, such as to <strong>mark experimental</strong> or <strong>intermediate points</strong> in the project's history.</p>
</blockquote>Contextual Understanding in Automated Speech-to-Text Transcription - Machine Learning Techniques and Challenges2023-03-30T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-03-30:/contextual_understanding-speech-to-text/<p>Automated speech-to-text transcription has come a long way in recent years, with advances in artificial intelligence and natural language processing enabling machines to transcribe human speech with increasing accuracy. However, there are still several challenges that remain unsolved, and which continue to …</p><p>Automated speech-to-text transcription has come a long way in recent years, with advances in artificial intelligence and natural language processing enabling machines to transcribe human speech with increasing accuracy. However, there are still several challenges that remain unsolved, and which continue to limit the capabilities of automated speech recognition technology. In this blog post, we will explore some of the biggest unsolved problems in automated speech-to-text transcription.</p>
<!-- MarkdownTOC levels="2,3" autolink="true" autoanchor="true" -->
<ul>
<li><a href="#challenges">Challenges</a><ul>
<li><a href="#1-accurate-transcription-of-spontaneous-speech">1. Accurate transcription of spontaneous speech</a></li>
<li><a href="#2-handling-multiple-speakers">2. Handling multiple speakers</a></li>
<li><a href="#3-handling-accents-and-dialects">3. Handling accents and dialects</a></li>
<li><a href="#4-contextual-understanding">4. Contextual understanding</a></li>
<li><a href="#5-real-time-transcription">5. Real-time transcription</a></li>
<li><a href="#6-data-privacy-and-security">6. Data privacy and security</a></li>
</ul>
</li>
<li><a href="#contextual-understanding">Contextual understanding</a><ul>
<li><a href="#importance">Importance</a></li>
<li><a href="#approaches">Approaches</a></li>
</ul>
</li>
<li><a href="#machine-learning-techniques-for-contextual-understanding">Machine Learning techniques for Contextual understanding</a><ul>
<li><a href="#disambiguation">Disambiguation</a></li>
<li><a href="#hybrid-approaches">hybrid approaches</a></li>
</ul>
</li>
<li><a href="#summary">Summary</a></li>
</ul>
<!-- /MarkdownTOC -->
<p><a id="challenges"></a></p>
<h2>Challenges</h2>
<p><a id="1-accurate-transcription-of-spontaneous-speech"></a></p>
<h3>1. Accurate transcription of spontaneous speech</h3>
<p>One of the biggest challenges in automated speech-to-text transcription is accurately transcribing spontaneous speech. Spontaneous speech is characterized by its lack of structure and tendency to contain many disfluencies, such as repetitions, false starts, and filled pauses. This type of speech is particularly challenging for machines to transcribe accurately, as it can be difficult to distinguish between disfluencies and actual words. This can lead to errors in the transcribed text, which can be frustrating for users and limit the usefulness of the technology.</p>
<p><a id="2-handling-multiple-speakers"></a></p>
<h3>2. Handling multiple speakers</h3>
<p>Another major challenge in automated speech-to-text transcription is handling multiple speakers. When there are multiple speakers involved, it can be difficult for machines to distinguish between them and accurately attribute the words to the correct speaker. This can lead to confusion and errors in the transcribed text, which can be particularly problematic in applications where it is important to know who said what. There has been some progress in this area, with some automated transcription services now able to recognize multiple speakers, but there is still room for improvement.</p>
<p><a id="3-handling-accents-and-dialects"></a></p>
<h3>3. Handling accents and dialects</h3>
<p>Accents and dialects can also pose a significant challenge for automated speech-to-text transcription. Different accents and dialects can vary greatly in terms of pronunciation, intonation, and grammar, which can make it difficult for machines to accurately transcribe speech from speakers with different accents or dialects. This is particularly problematic in applications where it is important to accurately capture the nuances of the speaker's speech, such as in legal or medical settings.</p>
<p><a id="4-contextual-understanding"></a></p>
<h3>4. Contextual understanding</h3>
<p>Another major challenge in automated speech-to-text transcription is contextual understanding. Machines are able to transcribe speech accurately based on the words that are spoken, but they may not always be able to understand the context in which those words are being used. For example, machines may struggle to accurately transcribe a sentence that contains homophones, such as "I saw the bear" versus "I saw the bare". In order to accurately transcribe speech, machines need to be able to understand the context in which the words are being used.</p>
<p><a id="5-real-time-transcription"></a></p>
<h3>5. Real-time transcription</h3>
<p>Real-time transcription is another major challenge for automated speech-to-text transcription. Real-time transcription involves transcribing speech as it is being spoken, rather than after the fact. This can be particularly challenging, as machines need to be able to transcribe speech quickly and accurately, without the benefit of being able to go back and review what was said. Real-time transcription is becoming increasingly important in a number of applications, such as live captioning of video content, but there is still room for improvement in this area.</p>
<p><a id="6-data-privacy-and-security"></a></p>
<h3>6. Data privacy and security</h3>
<p>Finally, data privacy and security is a major concern in automated speech-to-text transcription. In order to transcribe speech accurately, machines need to be trained on large amounts of data, which may contain sensitive information. This raises concerns about how that data is collected, stored, and used, and whether appropriate safeguards are in place to protect user privacy. As the use of automated speech-to-text transcription continues to grow, it will be important to ensure that user data is handled in a responsible and secure manner.</p>
<p><a id="contextual-understanding"></a></p>
<h2>Contextual understanding</h2>
<p>Contextual understanding is one of the biggest challenges facing automated speech-to-text transcription. Machines are able to transcribe speech accurately based on the words that are spoken, but they may not always be able to understand the context in which those words are being used. In order to accurately transcribe speech, machines need to be able to understand the context in which the words are being used, including the speaker's tone, mood, and intent.</p>
<p><a id="importance"></a></p>
<h3>Importance</h3>
<p>Contextual understanding is important for a number of reasons. First, it can help to reduce errors in automated speech-to-text transcription. When machines are able to understand the context in which words are being used, they are less likely to make mistakes or misinterpret the speaker's meaning. This can improve the accuracy of the transcribed text and make it more useful for a variety of applications.</p>
<p>Second, contextual understanding can help to improve the quality of the transcribed text. When machines are able to understand the context in which words are being used, they can more accurately transcribe the speaker's tone and mood. This can be particularly important in applications such as customer service or support, where it is important to accurately capture the speaker's emotions in order to provide an appropriate response.</p>
<p>Finally, contextual understanding can help to improve the overall user experience. When machines are able to accurately transcribe speech and understand the context in which words are being used, users are more likely to have a positive experience with the technology. This can help to increase adoption and usage of automated speech-to-text transcription in a variety of applications.</p>
<p><a id="approaches"></a></p>
<h3>Approaches</h3>
<p>There are several approaches that can be used to improve contextual understanding in automated speech-to-text transcription. One approach is to use machine learning algorithms to analyze the context in which words are being used. Machine learning algorithms can be trained on large datasets of speech and text data to learn how to identify patterns in the way that words are used in different contexts. This can help machines to more accurately transcribe speech and understand the context in which words are being used.</p>
<p>Another approach is to incorporate additional information into the transcription process. For example, machines can be programmed to recognize certain words or phrases that are commonly used in specific contexts, such as in a medical setting or in a legal deposition. This can help to improve the accuracy of the transcribed text and ensure that the context in which words are being used is correctly understood.</p>
<p>Contextual understanding is an important area of research in automated speech-to-text transcription, and there is still much work to be done in this area. As the technology continues to evolve and improve, it is likely that machines will become increasingly capable of understanding the context in which words are being used. This will help to improve the accuracy and quality of the transcribed text, and make automated speech-to-text transcription a more valuable tool for a variety of applications.</p>
<p>However, there are also important ethical considerations when it comes to contextual understanding. Machines that can accurately understand the context in which words are being used may also be able to infer personal information about the speaker, such as their emotions, intent, or political beliefs. This raises important questions about data privacy and security, and highlights the need for responsible use and handling of user data in automated speech-to-text transcription. As the technology continues to evolve, it will be important to ensure that user data is protected and used in a responsible and ethical manner.</p>
<p>In addition to machine learning and incorporating additional information, another approach to improving contextual understanding in automated speech-to-text transcription is to incorporate other types of data into the transcription process. For example, machines can be programmed to recognize the speaker's accent or dialect, which can provide important contextual information about the way that words are being used.</p>
<p>Similarly, machines can be programmed to recognize the speaker's gender, age, or other demographic characteristics. This demographic information provides further context that can help machines transcribe speech more accurately and interpret how words are being used.</p>
<p>There are also challenges associated with contextual understanding in automated speech-to-text transcription. Word usage varies considerably from one context to another, which makes it hard for machines to transcribe speech accurately, and cultural or regional differences in usage complicate the task further.</p>
<p>Another challenge is that context can be dynamic and change rapidly over the course of a conversation. Machines need to be able to adapt to changes in context in real time in order to accurately transcribe speech and understand the context in which words are being used.</p>
<p><a id="machine-learning-techniques-for-contextual-understanding"></a></p>
<h2>Machine Learning Techniques for Contextual Understanding</h2>
<p>Machine learning techniques are commonly used to improve contextual understanding in automated speech-to-text transcription. In this post, we will discuss some of the key machine learning techniques used for this purpose.</p>
<p><a id="disambiguation"></a></p>
<h3>Disambiguation</h3>
<p>One of the most widely used approaches for contextual understanding is natural language processing (NLP), a field at the intersection of linguistics and machine learning that focuses on analyzing and understanding human language. NLP algorithms are trained on large datasets of text data and are used to analyze the context in which words are being used in speech.</p>
<p>One of the key challenges in NLP is disambiguation, or the process of determining the correct meaning of a word based on its context. For example, the word "bank" can refer to a financial institution or the side of a river. To accurately transcribe speech, machines need to be able to accurately disambiguate words based on their context.</p>
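<p>As an illustration, a crude form of this context-based disambiguation can be sketched with a hand-built sense inventory. The cue-word sets below are invented for the example; real systems learn such associations from large corpora:</p>

```python
# Toy word-sense disambiguation: pick the sense of an ambiguous word by
# counting how many of each sense's cue words appear in the context.
SENSE_CUES = {
    "bank": {
        "financial institution": {"money", "loan", "deposit", "account"},
        "river side": {"river", "water", "fishing", "shore"},
    }
}

def disambiguate(word, context_tokens):
    """Return the sense whose cue words overlap the context the most."""
    senses = SENSE_CUES.get(word, {})
    context = {t.lower().strip(".,") for t in context_tokens}
    return max(senses, key=lambda s: len(senses[s] & context), default=None)

print(disambiguate("bank", "I paid the loan at the bank".split()))
# financial institution
```

A production system would replace the hand-written cue sets with sense representations learned from data, but the principle of scoring senses against the surrounding words is the same.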
<h4>Part-of-speech (POS) tagging</h4>
<p>One technique for disambiguation is part-of-speech (POS) tagging. POS tagging involves analyzing each word in a sentence and assigning it a part-of-speech tag, such as noun, verb, adjective, or adverb. By analyzing the parts of speech used in a sentence, machines can gain a better understanding of the context in which words are being used.</p>
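<p>A toy tagger illustrates the idea: look each word up in a small hand-made lexicon and fall back on suffix heuristics. The lexicon and rules below are placeholders chosen for the example; real taggers are trained on annotated corpora:</p>

```python
# Minimal, illustrative POS tagger: lexicon lookup with suffix fallbacks.
LEXICON = {"the": "DET", "a": "DET", "dog": "NOUN", "runs": "VERB", "happy": "ADJ"}

def toy_pos_tag(tokens):
    tags = []
    for tok in tokens:
        word = tok.lower()
        if word in LEXICON:
            tags.append((tok, LEXICON[word]))      # known word: direct lookup
        elif word.endswith("ly"):
            tags.append((tok, "ADV"))              # suffix heuristic
        elif word.endswith(("ing", "ed")):
            tags.append((tok, "VERB"))
        else:
            tags.append((tok, "NOUN"))             # fallback guess
    return tags

print(toy_pos_tag("the dog runs quickly".split()))
# [('the', 'DET'), ('dog', 'NOUN'), ('runs', 'VERB'), ('quickly', 'ADV')]
```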
<h4>Named entity recognition (NER)</h4>
<p>Another NLP technique used for contextual understanding is named entity recognition (NER). NER involves identifying and classifying named entities in text data, such as people, organizations, and locations. By identifying named entities in speech, machines can gain a better understanding of the context in which words are being used.</p>
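<p>A minimal NER sketch combines a gazetteer (a list of known entities) with a capitalization rule. The gazetteer entries and the fallback label below are invented for the example; real NER systems use trained sequence models:</p>

```python
# Toy NER: gazetteer lookup plus a capitalization heuristic.
GAZETTEER = {"london": "LOCATION", "paris": "LOCATION", "google": "ORGANIZATION"}

def toy_ner(tokens):
    entities = []
    for tok in tokens:
        word = tok.lower().strip(".,")
        if word in GAZETTEER:
            entities.append((tok, GAZETTEER[word]))
        elif tok[:1].isupper():
            entities.append((tok, "ENTITY"))   # capitalized but unknown
    return entities

print(toy_ner(["we", "met", "Alice", "in", "London"]))
# [('Alice', 'ENTITY'), ('London', 'LOCATION')]
```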
<h4>Sentiment analysis</h4>
<p>Another machine learning technique used for contextual understanding is sentiment analysis. Sentiment analysis involves analyzing the emotional tone of a piece of text data, such as whether it is positive, negative, or neutral. By analyzing the sentiment of speech, machines can gain a better understanding of the speaker's emotions and intent.</p>
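<p>The simplest form of sentiment analysis is lexicon-based: count positive and negative words and compare. The word lists below are tiny placeholders; practical systems use much larger lexicons or trained classifiers:</p>

```python
# Minimal lexicon-based sentiment scorer (illustrative word lists).
POSITIVE = {"good", "great", "happy", "excellent", "love"}
NEGATIVE = {"bad", "terrible", "sad", "awful", "hate"}

def sentiment(text):
    words = text.lower().split()
    # Positive hits minus negative hits gives a crude polarity score
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("what a great and happy day"))  # positive
```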
<h4>Deep learning</h4>
<p>Deep learning is another machine learning technique that is commonly used for contextual understanding. Deep learning algorithms are designed to learn complex patterns in data, and are often used for tasks such as speech recognition and image recognition.</p>
<h5>Recurrent neural networks (RNNs)</h5>
<p>One common type of deep learning algorithm used for contextual understanding is the recurrent neural network (RNN). RNNs are designed to analyze sequences of data, such as sentences or audio clips. By analyzing the sequence of words or sounds in speech, RNNs can gain a better understanding of the context in which words are being used.</p>
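<p>The core of the idea can be sketched with a single scalar RNN cell in plain Python: the new hidden state depends on both the current input and the previous hidden state, which is how an RNN carries context along a sequence. The weights here are fixed, made-up numbers; in a real network they are learned:</p>

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One scalar RNN cell: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(sequence):
    h = 0.0                    # initial hidden state
    for x in sequence:         # the hidden state accumulates context
        h = rnn_step(x, h)
    return h
```

Because the hidden state is threaded through every step, the output for an input depends on everything seen before it, unlike a model that looks at each token in isolation.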
<h5>Convolutional neural networks (CNNs)</h5>
<p>Another type of deep learning algorithm used for contextual understanding is the convolutional neural network (CNN). CNNs are often used for image recognition tasks, but can also be used for speech recognition. By analyzing time-frequency representations of the audio, such as spectrograms, CNNs can gain a better understanding of the context in which words are being used.</p>
<p><a id="hybrid-approaches"></a></p>
<h3>Hybrid approaches</h3>
<p>In addition to these machine learning techniques, there are also hybrid approaches that combine multiple techniques to improve contextual understanding. For example, some systems use a combination of NLP techniques and deep learning algorithms to transcribe speech with high accuracy and understanding of the context in which words are being used.</p>
<p><a id="summary"></a></p>
<h2>Summary</h2>
<p>Machine learning techniques are critical for improving contextual understanding in automated speech-to-text transcription. NLP techniques, such as POS tagging and NER, can help machines to better understand the context in which words are being used. Deep learning algorithms, such as RNNs and CNNs, can help machines to learn complex patterns in speech and improve accuracy. As the technology continues to evolve, it is likely that new machine learning techniques will be developed to further improve contextual understanding and accuracy in automated speech-to-text transcription.</p>How to Prepare Python Project to Pass It Over to Another Developer2023-03-30T00:00:00+02:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-03-30:/how_to_prepare_python_project_to_pass_it_over_to_another_developer/<p>Preparing a Python project to be passed on to another developer requires attention to detail and documentation to ensure that the new developer can understand the project and make modifications with ease. Here is a long and detailed guide on how to …</p><p>Preparing a Python project to be passed on to another developer requires attention to detail and documentation to ensure that the new developer can understand the project and make modifications with ease. Here is a long and detailed guide on how to prepare a Python project to be handed over to another developer:</p>
<p><strong>Organize your files</strong>
Make sure that your project files are organized logically and in a way that is easy to navigate. Create folders for each major section of your project, such as source code, data, and documentation.</p>
<p><strong>Use a version control system</strong>
Use a version control system such as Git to keep track of changes to your code. This will make it easier for the new developer to understand the history of the project and any changes that have been made.</p>
<p><strong>Document your code</strong>
Document your code using comments and docstrings. Comments should explain the purpose of each section of code, while docstrings should explain the purpose of each function and class. This will make it easier for the new developer to understand how your code works.</p>
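<p>As a minimal sketch, a function documented this way might look like the following (the function itself is just an example):</p>

```python
def moving_average(values, window):
    """Return the simple moving average of `values`.

    Args:
        values: sequence of numbers to average.
        window: number of trailing items in each average.

    Returns:
        A list with one average per full window.
    """
    # The docstring says what the function does and how to call it;
    # inline comments like this one explain the why of a specific line.
    return [sum(values[i - window:i]) / window
            for i in range(window, len(values) + 1)]
```

Tools such as <code>help()</code> and most IDEs surface the docstring to the next developer automatically, which is exactly why docstrings pay off during a handover.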
<p><strong>Write a README file</strong>
Create a README file that explains the purpose of your project, how to run it, and any dependencies that are required. This should also include instructions on how to set up a development environment and how to run tests.</p>
<p><strong>Include requirements files</strong>
Include a requirements.txt file that lists all of the dependencies required to run your project. This will make it easier for the new developer to set up a development environment.</p>
<p><strong>Add instructions on how to recreate a virtual environment</strong>
Use a virtual environment to create an isolated environment for your project, and document the exact commands needed to recreate it. This ensures the new developer can reproduce the same environment you used while developing the project.</p>
<p><strong>Use consistent coding style</strong>
Use a consistent coding style throughout your project to make it easier to read and understand. Use a tool such as flake8 or black to check your code for compliance with the PEP 8 style guidelines.</p>
<p><strong>Include test cases</strong>
Include test cases that cover all major functionality of your project. This will make it easier for the new developer to ensure that any modifications they make do not break existing functionality.</p>
<p><strong>Include a license</strong>
Include a license file that specifies the terms under which your project can be used, modified, and distributed. This will protect your project and ensure that the new developer understands the legal implications of using and modifying your code.</p>
<p><strong>Provide ongoing support</strong>
Provide ongoing support to the new developer as they take over the project. This may involve answering questions, providing documentation, or even offering training.</p>
<p>Preparing a Python project to be passed on to another developer requires attention to detail and documentation. By following these guidelines, you can ensure that the new developer can understand your project and make modifications with ease.</p>
<p><strong>NOTE:</strong> There is an interesting write-up proposing an approach that keeps projects always ready to hand over to another person: <a href="https://jmmv.dev/2021/04/always-be-quitting.html">Always be quitting - Julio Merino (jmmv.dev)</a></p>DCA Investing Strategy Variants2023-03-26T00:00:00+01:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-03-26:/dca_investing_strategy_variants/<p>Investing can be a daunting task, especially for those new to the game. The world of finance is full of complicated terminology and sophisticated techniques, making it difficult for the average person to know where to start. Fortunately, there are a few …</p><p>Investing can be a daunting task, especially for those new to the game. The world of finance is full of complicated terminology and sophisticated techniques, making it difficult for the average person to know where to start. Fortunately, there are a few simple strategies that can help novice investors get started with building their portfolio. One of the most popular and effective strategies is the Dollar-Cost Averaging (DCA) method.</p>
<p>Dollar-Cost Averaging (DCA) is a strategy where an investor purchases a fixed amount of a particular asset, such as stocks or bonds, at regular intervals over a period of time, regardless of the price of the asset. The idea behind DCA is to reduce the impact of market fluctuations by buying more shares when prices are low and fewer shares when prices are high. This can help investors avoid the temptation to buy a large amount of an asset all at once, only to see the price drop shortly after.</p>
<p>There are several variants of the DCA strategy that investors can use to tailor their investment approach to their individual needs and preferences. Here are some of the most common variants of the DCA strategy:</p>
<h3>Traditional DCA</h3>
<p>The traditional DCA strategy involves investing a fixed amount of money at regular intervals, such as monthly or quarterly, into the same asset or fund. This is the simplest and most common form of DCA, as it involves consistent, automatic investments over a long period of time.</p>
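<p>A short calculation shows the mechanism with made-up prices: the same fixed contribution buys more shares in cheap months, so the average cost per share lands below the simple average of the prices (it is their harmonic mean):</p>

```python
def dca_average_cost(prices, amount_per_period=100.0):
    """Average cost per share when investing a fixed amount at each price."""
    shares = sum(amount_per_period / p for p in prices)  # cheap months buy more shares
    invested = amount_per_period * len(prices)
    return invested / shares

prices = [10.0, 8.0, 12.5, 10.0]           # hypothetical monthly prices
print(round(dca_average_cost(prices), 2))  # 9.88, below the 10.125 mean price
```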
<h3>Value Averaging</h3>
<p>Value averaging is a more dynamic form of DCA that involves adjusting the amount invested based on the performance of the asset. With value averaging, the investor sets a target value for the investment and adjusts the amount invested each period to maintain the target value. For example, if the value of the investment increases, the investor will invest less money in the next period, whereas if the value decreases, the investor will invest more money to bring the value back up to the target level.</p>
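<p>A small sketch with hypothetical numbers makes the adjustment concrete: the contribution each period is simply the gap between the predetermined target value and the current holdings, so it shrinks after gains and grows after losses:</p>

```python
def value_averaging_contribution(current_value, target_value):
    """Cash to add this period (a negative result means selling the excess)."""
    return target_value - current_value

# Target path grows by 1000 per period; market returns are hypothetical.
holdings = 0.0
for target, market_return in zip([1000, 2000, 3000], [1.10, 0.90, 1.05]):
    cash = value_averaging_contribution(holdings, target)
    print(f"contribute {cash:.0f} to reach {target}")
    holdings = target * market_return      # value drifts after topping up
# contribute 1000 to reach 1000
# contribute 900 to reach 2000   (less, because the market rose)
# contribute 1200 to reach 3000  (more, because the market fell)
```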
<h3>Constant Proportion Portfolio Insurance (CPPI)</h3>
<p>CPPI is a more complex form of DCA that involves setting a floor value for the investment and adjusting the allocation of the portfolio between a risk-free asset and a risky asset to maintain the floor value. The risk-free asset acts as a cushion to prevent the portfolio from falling below the floor value, while the risky asset provides potential upside. This strategy can be particularly useful for investors who want to limit their downside risk while still having exposure to the potential upside of the market.</p>
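<p>The core CPPI rule can be sketched in a few lines: the risky allocation is a multiple of the "cushion" (portfolio value above the floor), and the remainder sits in the risk-free asset. The floor and multiplier below are hypothetical values:</p>

```python
def cppi_allocation(portfolio_value, floor, multiplier=3.0):
    """Split a portfolio between risky and risk-free assets under CPPI."""
    cushion = max(portfolio_value - floor, 0.0)         # value above the floor
    risky = min(multiplier * cushion, portfolio_value)  # never exceed the total
    return risky, portfolio_value - risky

print(cppi_allocation(10_000, 8_000))  # (6000.0, 4000.0)
```

As the portfolio falls toward the floor the cushion shrinks, so the rule automatically moves money out of the risky asset, which is how CPPI limits downside while keeping upside exposure.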
<h3>Asset Allocation DCA</h3>
<p>Asset allocation DCA involves investing a fixed amount of money at regular intervals into multiple assets or funds, rather than just one. This approach can help investors diversify their portfolio and reduce the risk of having all their eggs in one basket. The investor sets a target allocation for each asset class, and the DCA strategy is used to maintain the target allocation over time.</p>
<h3>Reverse DCA</h3>
<p>Reverse DCA is a strategy that involves selling a fixed amount of an asset at regular intervals, rather than buying it. This strategy is often used by retirees or investors who want to draw down their portfolio gradually over time. Reverse DCA can help investors avoid selling all their assets at once and potentially locking in losses during a market downturn.</p>
<h3>Step-Up DCA</h3>
<p>With step-up DCA, the investor starts with a small investment and gradually increases the amount invested over time. This strategy can be particularly useful for investors who are just starting out and want to ease into investing, or for investors who want to build up their investments gradually.</p>
<h3>Seasonal DCA</h3>
<p>Seasonal DCA involves investing in an asset only during a specific season or time of year. For example, an investor might choose to invest in a particular stock or fund only during the summer months when the company typically experiences higher sales or during a specific quarter when the company releases its earnings report.</p>
<h3>Dynamic DCA</h3>
<p>Dynamic DCA is a strategy that adjusts the investment amount based on market conditions or other factors, rather than investing a fixed amount at regular intervals. For example, an investor might increase their investment during a market dip or decrease their investment during a market rally.</p>
<h3>Fixed Period DCA</h3>
<p>Fixed period DCA involves investing a fixed amount of money over a set period of time, rather than indefinitely. For example, an investor might choose to invest $1,000 per month for a year, after which they reassess their investment strategy.</p>
<h3>Dividend Reinvestment DCA</h3>
<p>With dividend reinvestment DCA, investors use the dividends earned from an investment to purchase additional shares of the same asset or fund. This strategy can help investors increase their investment over time without having to contribute additional funds from their own pockets.</p>
<h3>Fund Transfer DCA</h3>
<p>Fund transfer DCA involves transferring a fixed amount of money from one asset or fund to another at regular intervals, rather than investing a fixed amount into a single asset. This strategy can be useful for investors who want to diversify their portfolio across multiple assets.</p>
<h3>Progressive DCA</h3>
<p>Progressive DCA involves gradually increasing the investment amount over time, typically by a fixed percentage or dollar amount. For example, an investor might start with a $100 investment and increase it by $10 each month.</p>
<h3>Threshold DCA</h3>
<p>Threshold DCA involves investing a fixed amount only when the price of an asset falls below a certain threshold. This strategy can be useful for investors who want to take advantage of buying opportunities during market dips.</p>
<h3>Momentum DCA</h3>
<p>Momentum DCA involves investing in assets that have been performing well over a recent period of time. For example, an investor might choose to invest in stocks that have been experiencing a positive trend in their price or earnings.</p>
<h3>Tax-Loss Harvesting DCA</h3>
<p>With tax-loss harvesting DCA, investors sell losing assets to realize a tax deduction, and use the proceeds to invest in new assets. This strategy can help investors offset capital gains and reduce their tax liability.</p>
<h2>Conclusion</h2>
<p>The Dollar-Cost Averaging strategy can be a powerful tool for investors who want to build a diversified portfolio over time. While the traditional DCA approach is the simplest and most common form of the strategy, there are several variants that investors can use to tailor the approach to their individual needs and preferences. By understanding the different variants of the DCA strategy, investors can choose the one that best suits their investment goals and risk tolerance.</p>Punctuation Restoration2023-03-15T00:00:00+01:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-03-15:/punctuation-restoration/<p>Punctuation restoration using machine learning (ML) is a process of predicting the appropriate punctuation marks in a text that is missing or poorly punctuated. This technique has become increasingly popular in recent years due to the growing volume of unstructured text data …</p><p>Punctuation restoration using machine learning (ML) is a process of predicting the appropriate punctuation marks in a text that is missing or poorly punctuated. This technique has become increasingly popular in recent years due to the growing volume of unstructured text data available in digital form, such as social media posts, online articles, and chat logs.</p>
<p>Punctuation plays a crucial role in the comprehension of text. Proper punctuation helps to convey the meaning, tone, and structure of a sentence. However, punctuation can be subjective and inconsistent, and the lack of punctuation can lead to ambiguity and misinterpretation. Therefore, restoring punctuation in a text is an essential task that can improve the readability and accuracy of the text.</p>
<p>Punctuation restoration using ML involves the use of algorithms and statistical models to predict the correct punctuation marks in a given text. The process typically involves three main steps: data preparation, feature extraction, and model training.</p>
<h2>Punctuation restoration steps</h2>
<h3>Data preparation</h3>
<p>In the data preparation step, the text data is collected and preprocessed. This may involve removing unnecessary characters, such as HTML tags, and converting the text to a standard format. The text data is then segmented into individual sentences or phrases.</p>
<h3>Feature extraction</h3>
<p>In the feature extraction step, the text data is converted into a set of numerical features that can be used by the ML model. Common features used in punctuation restoration include word frequency, part-of-speech (POS) tags, and context information. These features are extracted using NLP techniques such as tokenization, stemming, and syntactic parsing.</p>
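<p>A minimal sketch of such per-token features follows; the feature names and the sentence markers <code>&lt;s&gt;</code>/<code>&lt;/s&gt;</code> are invented for the example, and a real pipeline would feed dicts like these into a trained classifier:</p>

```python
def extract_features(tokens, i):
    """Toy feature dict for token i of a segment."""
    return {
        "word": tokens[i].lower(),
        "is_capitalized": tokens[i][:1].isupper(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
        "is_last": i == len(tokens) - 1,   # segment end often needs a period
    }

tokens = "hello world how are you".split()
print(extract_features(tokens, 0))
```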
<h3>Model training</h3>
<p>In the model training step, the ML model is trained using a labeled dataset of punctuated text. The model learns to predict the appropriate punctuation marks based on the extracted features. Various ML algorithms can be used for this task, including decision trees, random forests, and deep neural networks.</p>
<h3>Punctuation restoration</h3>
<p>Once the model is trained, it can be used to restore punctuation in new text data. The input text is segmented into sentences or phrases, and the extracted features are fed into the model to predict the appropriate punctuation marks. The output text is then post-processed to ensure that the punctuation marks are correctly placed.</p>
<h2>Challenges</h2>
<p>There are several challenges associated with punctuation restoration using ML. One of the main challenges is dealing with the subjective nature of punctuation. Punctuation rules can vary depending on the context and language, making it difficult to develop a universal model. Another challenge is dealing with the noise and errors in the text data, which can affect the accuracy of the model.</p>
<p>Despite these challenges, punctuation restoration using ML has shown promising results in various applications. For example, it can be used to improve the accuracy of speech recognition systems, enhance the readability of machine-generated text, and improve the quality of automatic translations.</p>
<h2>References</h2>
<ul>
<li><a href="https://github.com/topics/punctuation">punctuation</a> - GitHub Topic</li>
<li><a href="https://github.com/notAI-tech/deepsegment">deepsegment</a> - A sentence segmenter that actually works!</li>
<li><a href="https://github.com/notAI-tech/fastPunct">fastPunct</a> - Punctuation restoration and spell correction experiments.</li>
<li><a href="https://github.com/bedapudi6788/deepcorrect">deepcorrect</a> - Text and Punctuation correction with Deep Learning</li>
<li><a href="https://github.com/kaituoxu/X-Punctuator">X-Punctuator</a> - A PyTorch implementation of a punctuation prediction system using (B)LSTM, which automatically adds suitable punctuation into text without punctuation.</li>
</ul>Salt and Pepper in the Context of Hashing/Obfuscation2023-03-14T00:00:00+01:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-03-14:/salt-and-pepper-for-hashing/<p>In the context of hashing/obfuscation, "salt and pepper" refer to two different techniques used to enhance the security of hash functions.</p>
<h2>Salt</h2>
<p>Salt is a random value that is added to the input before it is hashed. This makes it much …</p><p>In the context of hashing/obfuscation, "salt and pepper" refer to two different techniques used to enhance the security of hash functions.</p>
<h2>Salt</h2>
<p>Salt is a random value that is added to the input before it is hashed. This makes it much more difficult for attackers to use precomputed hash tables or rainbow tables to attack the hash. By using a unique salt for each input, even if two inputs have the same value, their hashes will be different, making it much harder for attackers to determine the original input value.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">hashlib</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="k">def</span> <span class="nf">hash_with_salt</span><span class="p">(</span><span class="n">password</span><span class="p">):</span>
    <span class="c1"># Generate a random salt</span>
    <span class="n">salt</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">urandom</span><span class="p">(</span><span class="mi">16</span><span class="p">)</span>
    <span class="c1"># Add the salt to the password and hash it using SHA256</span>
    <span class="n">hashed_password</span> <span class="o">=</span> <span class="n">hashlib</span><span class="o">.</span><span class="n">sha256</span><span class="p">(</span><span class="n">salt</span> <span class="o">+</span> <span class="n">password</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">'utf-8'</span><span class="p">))</span><span class="o">.</span><span class="n">hexdigest</span><span class="p">()</span>
    <span class="c1"># Return the salt and hashed password as a tuple</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">salt</span><span class="p">,</span> <span class="n">hashed_password</span><span class="p">)</span>
<span class="c1"># Example usage</span>
<span class="n">password</span> <span class="o">=</span> <span class="s2">"mysecurepassword"</span>
<span class="n">salt</span><span class="p">,</span> <span class="n">hashed_password</span> <span class="o">=</span> <span class="n">hash_with_salt</span><span class="p">(</span><span class="n">password</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Salt: </span><span class="si">{</span><span class="n">salt</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Hashed Password: </span><span class="si">{</span><span class="n">hashed_password</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
</code></pre></div>
<h2>Pepper</h2>
<p>Pepper, on the other hand, is a secret key that is used to further obscure the hash output. Unlike a salt, which is stored alongside the hash, the pepper is kept secret and never stored. This makes it much harder for attackers to reverse-engineer the original input value from the hash output.</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">hmac</span>
<span class="kn">import</span> <span class="nn">hashlib</span>
<span class="k">def</span> <span class="nf">hash_with_pepper</span><span class="p">(</span><span class="n">password</span><span class="p">,</span> <span class="n">pepper</span><span class="p">):</span>
    <span class="c1"># Hash the password using HMAC-SHA256 with the secret pepper</span>
    <span class="n">hashed_password</span> <span class="o">=</span> <span class="n">hmac</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="n">pepper</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">'utf-8'</span><span class="p">),</span> <span class="n">password</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">'utf-8'</span><span class="p">),</span> <span class="n">hashlib</span><span class="o">.</span><span class="n">sha256</span><span class="p">)</span><span class="o">.</span><span class="n">hexdigest</span><span class="p">()</span>
    <span class="c1"># Return the hashed password</span>
    <span class="k">return</span> <span class="n">hashed_password</span>
<span class="c1"># Example usage</span>
<span class="n">password</span> <span class="o">=</span> <span class="s2">"mysecurepassword"</span>
<span class="n">pepper</span> <span class="o">=</span> <span class="s2">"mysecretpepper"</span>
<span class="n">hashed_password</span> <span class="o">=</span> <span class="n">hash_with_pepper</span><span class="p">(</span><span class="n">password</span><span class="p">,</span> <span class="n">pepper</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Hashed Password: </span><span class="si">{</span><span class="n">hashed_password</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
</code></pre></div>
<h2>Using salt and pepper jointly</h2>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">hashlib</span>
<span class="kn">import</span> <span class="nn">hmac</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="k">def</span> <span class="nf">hash_with_salt_and_pepper</span><span class="p">(</span><span class="n">password</span><span class="p">,</span> <span class="n">pepper</span><span class="p">):</span>
    <span class="c1"># Generate a random salt</span>
    <span class="n">salt</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">urandom</span><span class="p">(</span><span class="mi">16</span><span class="p">)</span>
    <span class="c1"># Add the salt to the password and hash it using SHA256</span>
    <span class="n">hashed_password</span> <span class="o">=</span> <span class="n">hashlib</span><span class="o">.</span><span class="n">sha256</span><span class="p">(</span><span class="n">salt</span> <span class="o">+</span> <span class="n">password</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">'utf-8'</span><span class="p">))</span><span class="o">.</span><span class="n">digest</span><span class="p">()</span>
    <span class="c1"># Hash the hashed password using HMAC-SHA256 with the secret pepper</span>
    <span class="n">hashed_password</span> <span class="o">=</span> <span class="n">hmac</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="n">pepper</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">'utf-8'</span><span class="p">),</span> <span class="n">hashed_password</span><span class="p">,</span> <span class="n">hashlib</span><span class="o">.</span><span class="n">sha256</span><span class="p">)</span><span class="o">.</span><span class="n">hexdigest</span><span class="p">()</span>
    <span class="c1"># Return the salt and hashed password as a tuple</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">salt</span><span class="p">,</span> <span class="n">hashed_password</span><span class="p">)</span>
<span class="c1"># Example usage</span>
<span class="n">password</span> <span class="o">=</span> <span class="s2">"mysecurepassword"</span>
<span class="n">pepper</span> <span class="o">=</span> <span class="s2">"mysecretpepper"</span>
<span class="n">salt</span><span class="p">,</span> <span class="n">hashed_password</span> <span class="o">=</span> <span class="n">hash_with_salt_and_pepper</span><span class="p">(</span><span class="n">password</span><span class="p">,</span> <span class="n">pepper</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Salt: </span><span class="si">{</span><span class="n">salt</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Hashed Password: </span><span class="si">{</span><span class="n">hashed_password</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
</code></pre></div>Python - Is There Any Difference Between Attribute and Property?2023-03-09T00:00:00+01:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-03-09:/python-difference-betwee-attribute-and-property/<p>X::<a href="https://www.safjan.com/the-difference-between-class-attribute-or-property-and-the-class-variable/">The Difference Between Class Attribute or Property and the Class Variable</a></p>
<p>In Python there is a difference between an attribute and a property, although they are often used interchangeably.</p>
<p>An attribute is a variable that belongs to an instance of a …</p><p>X::<a href="https://www.safjan.com/the-difference-between-class-attribute-or-property-and-the-class-variable/">The Difference Between Class Attribute or Property and the Class Variable</a></p>
<p>In Python there is a difference between an attribute and a property, although they are often used interchangeably.</p>
<p>An attribute is a variable that belongs to an instance of a class. It is defined within the class, and its value can be accessed or modified using dot notation on the instance.</p>
<p>A property, on the other hand, is a special kind of attribute that is accessed or modified using getter and setter methods. The getter method is used to retrieve the value of the property, and the setter method is used to set the value of the property.</p>
<p>Here is an example that demonstrates the difference between an attribute and a property:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">Person</span><span class="p">:</span>
    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span> <span class="c1"># This is an attribute</span>
    <span class="nd">@property</span>
    <span class="k">def</span> <span class="nf">name</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">_name</span> <span class="c1"># This is a property getter method</span>
    <span class="nd">@name</span><span class="o">.</span><span class="n">setter</span>
    <span class="k">def</span> <span class="nf">name</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">value</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">_name</span> <span class="o">=</span> <span class="n">value</span> <span class="c1"># This is a property setter method</span>
</code></pre></div>
<p>In the example above, the <code>Person</code> class defines <code>name</code> as a property using the <code>@property</code> and <code>@name.setter</code> decorators. The getter returns the value of the underlying <code>_name</code> attribute, and the setter assigns to it. Note that because <code>name</code> is a property, the assignment <code>self.name = name</code> in <code>__init__</code> invokes the setter rather than creating a plain attribute.</p>
<p>With the <code>name</code> property defined in this way, you can get and set the <code>name</code> attribute using the property methods, like so:</p>
<div class="highlight"><pre><span></span><code><span class="n">person</span> <span class="o">=</span> <span class="n">Person</span><span class="p">(</span><span class="s2">"John"</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">person</span><span class="o">.</span><span class="n">name</span><span class="p">)</span> <span class="c1"># Output: John</span>
<span class="n">person</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="s2">"Jane"</span>
<span class="nb">print</span><span class="p">(</span><span class="n">person</span><span class="o">.</span><span class="n">name</span><span class="p">)</span> <span class="c1"># Output: Jane</span>
</code></pre></div>
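<p>Getter and setter methods earn their keep when you add logic to them, such as validation. Below is a minimal sketch of the same <code>Person</code> class extended with an illustrative non-empty-name check (the validation rule is an assumption for the example, not part of the original article):</p>

```python
class Person:
    def __init__(self, name):
        self.name = name  # goes through the property setter below

    @property
    def name(self):
        return self._name

    @name.setter
    def name(self, value):
        # Validation is the typical reason to turn an attribute into a property
        if not value:
            raise ValueError("name must not be empty")
        self._name = value


person = Person("John")
try:
    person.name = ""  # the setter raises, so the old value is kept
except ValueError:
    pass
print(person.name)  # Output: John
```

<p>Callers still use plain dot notation, so an attribute can be upgraded to a property later without changing any calling code.</p>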
<blockquote>
<p>An <strong>attribute</strong> is a <strong>simple variable</strong> that <strong>belongs to an instance of a class</strong>, while a <strong>property</strong> is a <strong>special kind of attribute</strong> that is accessed or modified using <strong>getter and setter methods</strong>.</p>
</blockquote>The Difference Between Class Attribute or Property and the Class Variable2023-03-09T00:00:00+01:002023-07-12T00:00:00+02:00Krystian Safjantag:www.safjan.com,2023-03-09:/the-difference-between-class-attribute-or-property-and-the-class-variable/<p>In Python, you can store data within a class using properties/attributes or class variables.</p>
<h2>Properties/Attributes</h2>
<p>Properties, also called attributes, are variables that store data within a class instance. They are defined within the class, but outside of any methods. Here's …</p><p>In Python, you can store data within a class using properties/attributes or class variables.</p>
<h2>Properties/Attributes</h2>
<p>Properties, also called attributes, are variables that store data within a class instance. They are defined within the class, but outside of any methods. Here's an example:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">Person</span><span class="p">:</span>
    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">age</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">age</span> <span class="o">=</span> <span class="n">age</span>
</code></pre></div>
<p>In the example above, <code>name</code> and <code>age</code> are attributes of the <code>Person</code> class. They are created and assigned values within the <code>__init__</code> method using the <code>self</code> keyword. You can access and modify these attributes using dot notation, like so:</p>
<div class="highlight"><pre><span></span><code><span class="n">person</span> <span class="o">=</span> <span class="n">Person</span><span class="p">(</span><span class="s2">"John"</span><span class="p">,</span> <span class="mi">30</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">person</span><span class="o">.</span><span class="n">name</span><span class="p">)</span> <span class="c1"># Output: John</span>
<span class="n">person</span><span class="o">.</span><span class="n">age</span> <span class="o">=</span> <span class="mi">31</span>
<span class="nb">print</span><span class="p">(</span><span class="n">person</span><span class="o">.</span><span class="n">age</span><span class="p">)</span> <span class="c1"># Output: 31</span>
</code></pre></div>
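<p>Because each attribute lives on its own instance, modifying one instance never affects another. A quick sketch using the same <code>Person</code> class:</p>

```python
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age


p1 = Person("John", 30)
p2 = Person("Jane", 28)
p1.age = 31           # only p1's attribute changes
print(p1.age)  # Output: 31
print(p2.age)  # Output: 28
```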
<h2>Class Variables</h2>
<p>Class variables are variables that are shared among all instances of a class. They are defined directly in the class body, outside of any methods, and are typically accessed using the class name. Here's an example:</p>
<div class="highlight"><pre><span></span><code><span class="k">class</span> <span class="nc">Person</span><span class="p">:</span>
    <span class="n">count</span> <span class="o">=</span> <span class="mi">0</span>

    <span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">age</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">age</span> <span class="o">=</span> <span class="n">age</span>
        <span class="n">Person</span><span class="o">.</span><span class="n">count</span> <span class="o">+=</span> <span class="mi">1</span>
</code></pre></div>
<p>In the example above, <code>count</code> is a class variable that is used to keep track of the number of <code>Person</code> instances that have been created. It is incremented every time a new instance is created within the <code>__init__</code> method. You can access class variables using the class name, like so:</p>
<div class="highlight"><pre><span></span><code><span class="n">person1</span> <span class="o">=</span> <span class="n">Person</span><span class="p">(</span><span class="s2">"John"</span><span class="p">,</span> <span class="mi">30</span><span class="p">)</span>
<span class="n">person2</span> <span class="o">=</span> <span class="n">Person</span><span class="p">(</span><span class="s2">"Jane"</span><span class="p">,</span> <span class="mi">28</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="n">Person</span><span class="o">.</span><span class="n">count</span><span class="p">)</span> <span class="c1"># Output: 2</span>
</code></pre></div>
<h3>Difference</h3>
<blockquote>
<p>The main difference between attributes/properties and class variables is that <strong>attributes are specific to each instance</strong> of a class, while <strong>class variables are shared among all instances</strong>.</p>
</blockquote>
<p>Attributes are defined within the <code>__init__</code> method and can be different for each instance. Class variables are defined outside of any methods and are shared by all instances.</p>
<p>Another difference is that attributes/properties are accessed and modified using dot notation on an instance of a class, while class variables are typically accessed using the class name. Reading a class variable through an instance also works, as long as no instance attribute of the same name shadows it.</p>
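<p>Assigning to a class variable through an instance is a classic pitfall: instead of changing the shared variable, it creates a new instance attribute that shadows it. A minimal sketch:</p>

```python
class Person:
    count = 0


p = Person()
p.count = 10            # creates an instance attribute on p, shadowing the class variable
print(p.count)          # Output: 10 (the instance attribute)
print(Person.count)     # Output: 0  (the class variable is unchanged)
```

<p>To update the shared value, always assign through the class name, as the <code>Person.count += 1</code> line in the example above does.</p>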
<p>In general, if you need to store data that is specific to each instance of a class, use attributes/properties. If you need to store data that is shared among all instances of a class, use class variables.</p>
<h2>References</h2>
<p><a href="https://stackoverflow.com/questions/22822710/difference-between-class-variable-and-class-attribute">python - difference between class variable and class attribute - Stack Overflow</a></p>