LLM Sampling Techniques: Minimum Probability and Temperature

Minimum Probability Sampling

Definition

Minimum probability sampling is a technique used in language model APIs to balance between diversity and coherence in the model’s output.

How it works

  • Sets a dynamic threshold for token selection based on the probability of the most likely token.
  • The threshold is a fraction (determined by the min_p value) of the top token’s probability.

Example explanation

Let’s say min_p = 0.1, and we’re generating the next token:

Scenario A:

  • Most likely token probability: 95%
  • Threshold: 95% * 0.1 = 9.5%
  • Only tokens with probabilities ≥ 9.5% are considered

Scenario B:

  • Most likely token probability: 10%
  • Threshold: 10% * 0.1 = 1%
  • Tokens with probabilities ≥ 1% are considered

Adaptive nature

  • When the model is very confident (high top probability), the threshold is higher, limiting options to maintain coherence.
  • When the model is less certain (lower top probability), the threshold lowers, allowing more diverse options.

Benefits

  • Preserves diversity for open-ended choices
  • Maintains coherence for deterministic choices (e.g., programming syntax)
  • Allows higher temperatures without losing coherence
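
As a minimal sketch, the filtering step from Scenario A might look like this (token names and probabilities are made up for illustration):

# Minimal sketch: min_p filtering over a toy next-token distribution.
min_p = 0.1
probs = {"the": 0.95, "a": 0.03, "zebra": 0.001}  # hypothetical values

threshold = min_p * max(probs.values())  # 0.1 * 0.95 = 0.095
kept = {tok: p for tok, p in probs.items() if p >= threshold}

# Renormalize the surviving tokens before sampling.
total = sum(kept.values())
kept = {tok: p / total for tok, p in kept.items()}
print(kept)  # only "the" survives in this high-confidence case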

Temperature in LLM Sampling

Definition

Temperature controls the randomness in token selection during text generation.

Effects of Higher Temperature

  1. Increased diversity in outputs
  2. Exploration of less likely options
  3. Reduced repetitiveness
  4. Better performance on open-ended tasks
  5. Potential mitigation of model biases
  6. Improved resilience to prompt engineering

Challenges

  • Maintaining coherence and relevance at higher temperatures

Optimal Use

  • Lower temperatures: Tasks requiring high accuracy or factual correctness
  • Higher temperatures: Creative or exploratory tasks

Synergy: min_p and Temperature

Combining min_p sampling with higher temperatures allows for:

  • Increased creativity and diversity in outputs
  • Maintained coherence by filtering out extremely improbable tokens
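
As a rough, self-contained sketch (not any particular API’s implementation), the two settings could compose like this: temperature reshapes the distribution first, then min_p prunes the improbable tail before sampling.

import numpy as np

def sample_token(logits, temperature=1.5, min_p=0.05):
    # Temperature scaling: higher temperature flattens the distribution.
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()

    # min_p filtering: drop tokens far less likely than the top token.
    probs[probs < min_p * probs.max()] = 0.0
    probs /= probs.sum()

    return np.random.default_rng().choice(len(probs), p=probs)

logits = np.array([5.0, 4.2, 3.9, 0.5, -2.0])  # toy next-token logits
print(sample_token(logits))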

Key Takeaways

  1. min_p sampling adapts token selection threshold based on the model’s confidence.
  2. Higher temperatures increase output diversity but risk coherence.
  3. Combining min_p with higher temperatures balances creativity and coherence.
  4. The optimal sampling strategy depends on the specific task and desired outcome.

Python is a great language but not perfect.

There are some common pitfalls, many of these are legacy issues retained for backward compatibility.

I want to share some of them.

Global Interpreter Lock (GIL)

It’s 2024, but Python still struggles with multi-core utilization due to the Global Interpreter Lock (GIL).

  • The GIL prevents multiple native threads from executing Python bytecode simultaneously.
  • This significantly limits the effectiveness of multi-threading for CPU-bound tasks in CPython.
  • While technically a CPython implementation detail, Python’s lack of a formal language specification means CPython’s behavior is often duplicated in other implementations.

Historically, when Python was created, multi-core CPUs did not exist. As operating systems added thread support, Python’s author added a thread interface as well, but the implementation was essentially single-core. The intention was to add a real multi-threaded implementation later, but 30 years on, Python still grapples with this issue.

The GIL’s persistence is largely due to backward compatibility concerns and the fundamental changes removing it would require in the language and its ecosystem.
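
A quick way to observe this (a rough sketch; exact numbers vary by machine): the same CPU-bound work barely speeds up across threads, while processes side-step the GIL.

# Minimal sketch: CPU-bound work does not speed up with threads under the GIL.
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def count_down(n):
    while n:
        n -= 1

def timed(executor_cls):
    start = time.perf_counter()
    with executor_cls(max_workers=4) as pool:
        list(pool.map(count_down, [10_000_000] * 4))
    return time.perf_counter() - start

if __name__ == "__main__":
    print("threads:  ", timed(ThreadPoolExecutor))   # roughly serial time
    print("processes:", timed(ProcessPoolExecutor))  # closer to a 4x speedup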

Lack of Block Scope

Unlike many languages, Python doesn’t have true block scope. It uses function scope and module scope instead.

def example_function():
    if True:
        x = 10  # This variable is not block-scoped
    print(x)  # This works in Python, x is still accessible

example_function()  # Outputs: 10

Implications:

  1. Loop Variable Leakage:

    for i in range(5):
        pass
    print(i)  # This prints 4, the last value of i
    
  2. Unexpected Variable Overwrites:

    x = 10
    if True:
        x = 20  # This rebinds the outer x; it does not create a new one
    print(x)  # Prints 20
    
  3. Difficulty in Creating Temporary Variables: It’s harder to create variables that are guaranteed to be cleaned up after a block ends.

  4. List Comprehension Exception: Interestingly, list comprehensions do create their own scope in Python 3.x.

    [x for x in range(5)]
    print(x)  # NameError in Python 3.x (assuming x isn't already defined in this scope)
    

Best practices:

  • Use functions to simulate block scope when needed (see the sketch after this list).
  • Be mindful of variable names to avoid accidental overwrites.
  • Be cautious of the risk of using incorrect variable names in large functions.
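
A small sketch of the first point: wrapping a block in a helper function gives its temporaries a scope that actually ends (names here are made up for illustration).

def compute_totals(prices):
    # The helper simulates a block: tax_rate and taxed exist only inside it.
    def with_tax(price):
        tax_rate = 0.07
        taxed = price * (1 + tax_rate)
        return round(taxed, 2)

    return [with_tax(p) for p in prices]

print(compute_totals([10.0, 25.0]))  # [10.7, 26.75]
# print(tax_rate)  # NameError: tax_rate never leaked into this scope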

Mutable Objects as Default Arguments

This is a particularly tricky pitfall:

def surprise(my_list = []):
    print(my_list)
    my_list.append('x')

surprise()  # Output: []
surprise()  # Output: ['x']

Why this happens:

  • Default arguments are evaluated when the function is defined, not when it’s called.
  • The same list object is used for all calls to the function.

This behavior:

  • Dates back to Python’s early days, possibly for performance reasons or implementation simplicity.
  • Goes against the “Principle of Least Astonishment”.
  • Has very few practical use cases and often leads to bugs.

Best practice: Use None as a default for mutable arguments and initialize inside the function:

def better_surprise(my_list=None):
    if my_list is None:
        my_list = []
    print(my_list)
    my_list.append('x')

Late Binding Closures

This issue is particularly tricky in loops:

def create_multipliers():
    return [lambda x: i * x for i in range(4)]

multipliers = create_multipliers()
print([m(2) for m in multipliers])  # Outputs: [6, 6, 6, 6]

Explanation:

  • The lambda functions capture the variable i itself, not its value at creation time.
  • By the time these lambda functions are called, the loop has completed, and i has the final value of 3.

Fix: Use a default argument to capture the current value of i:

def create_multipliers():
    return [lambda x, i=i: i * x for i in range(4)]

This behavior is particularly confusing because it goes against the intuitive understanding of how closures should work in many other languages.

The __init__.py Requirement

In Python 2 and early versions of Python 3, a directory had to contain an __init__.py file to be treated as a package.

  • This requirement often confused beginners and led to subtle bugs when forgotten.
  • It provided a clear, explicit way to define package boundaries and behavior.

Evolution:

  • Python 3.3 introduced PEP 420, allowing for implicit namespace packages.
  • Directories without __init__.py can now be treated as packages under certain conditions.

Modern best practices:

  1. Use __init__.py when you need initialization code or to control package exports.
  2. For simple packages or namespace packages, you can often omit __init__.py in Python 3.
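
For example, a hypothetical package that keeps an __init__.py only to define its public surface might look like this (module and function names are invented):

# mypackage/__init__.py
# Initialization code and explicit re-exports live here.
from mypackage.loader import load_dataset
from mypackage.writer import save_dataset

__all__ = ["load_dataset", "save_dataset"]  # controls `from mypackage import *`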

Understanding these pitfalls is crucial for writing efficient, bug-free Python code. While they can be frustrating, they’re part of Python’s evolution and often retained for backward compatibility. Being aware of them will help you navigate Python development more effectively.

Anthropic just announced prompt caching

What

Prompt caching is a feature that allows developers to efficiently use large amounts of context or instructions in repeated API calls to Claude. While the entire prompt (including cached content) is sent with each request, the cached portion is processed more efficiently and charged at a reduced rate after the initial setup.

Key benefits:

  • Reduced costs: Cached content is charged at only 10% of the base input token price in subsequent requests.
  • Improved performance: Potentially faster processing times for large, repeated contexts.
  • Enhanced capabilities: You can include more examples, instructions, or background information cost-effectively, leveraging Claude’s in-context learning abilities.

Use cases:

  • Chatbots that need consistent, complex instructions
  • Coding assistants that reference large codebases
  • Q&A systems working with entire books or documents
  • Any application requiring extensive, consistent context across multiple interactions

How

  1. Initial cache setup: The first request to set up the cache is charged at 125% of the base input token price for the cached portion of the prompt (cache write).
  2. Subsequent requests: The cached portion of the prompt is charged at 10% of the base input token price (cached read).
  3. The entire prompt, including cached content, is sent with each request but processed more efficiently.
  4. Cache has a 5-minute lifetime, refreshed each time the cached content is used.
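
A rough sketch of what a cached request looked like with the Python SDK at launch (the beta header and field names below follow the announcement-era docs and may have changed since; treat them as assumptions to verify):

import anthropic

# Placeholder for the large, reused context (book text, codebase, instructions).
LONG_CONTEXT = "..."

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
    system=[
        {
            "type": "text",
            "text": LONG_CONTEXT,
            "cache_control": {"type": "ephemeral"},  # mark this block for caching
        }
    ],
    messages=[{"role": "user", "content": "What does chapter 3 argue?"}],
)
print(response.content[0].text)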

Important notes

  • Prompt caching doesn’t reduce data transfer; the full prompt is sent each time.
  • It’s not traditional fine-tuning, but a way to efficiently leverage Claude’s long context window (200k tokens) and in-context learning capabilities.
  • Enables customization of model behavior for specific tasks without changing model parameters.

I use Emacs with Vertico for completion and want to center the minibuffer content on my large monitor to avoid constantly looking at the left corner. After trying various solutions unsuccessfully, I finally found mpwang/perfect-margin on GitHub (auto-center Emacs windows, works with minimap and/or linum-mode), which lets me set the minibuffer margin effectively.

(add-to-list 'perfect-margin-force-regexps "*Minibuf")
(add-to-list 'perfect-margin-force-regexps "*which-key")

Nowadays I start researching a topic by writing a note in [[Obsidian]]. I use this Templater template to insert a link that starts the search in the Google app.

[google search](google://search?q=<% tp.user.z_encode(tp.file.title) %>)

This requires a JavaScript file for the user function; set Templater’s “Script file folder location” to templates/js.

templates/js/z_encode.js

function encode (msg) {
    return encodeURIComponent(msg);
}
module.exports = encode;

I want to share what I’ve learned while porting The Little Learner from Racket to Gleam. I will compare some Racket code with its Gleam equivalent.

Gleam is a simple functional programming language. Check out this great article on why a simple programming language matters.

The language is so straightforward that an experienced programmer can learn it in just a day or two. However, to truly appreciate its simplicity and constraints, one must have prior experience with complex programming languages and substantial coding practice.

Focus on concrete problems

It is hard for less experienced developers to appreciate how rarely architecting for future requirements / applications turns out net-positive.

John Carmack

Gleam’s philosophy is to focus on concrete problems rather than building abstractions.

Consider this Racket code I encountered; it is used in different places for different things.

(λ (l . r) l)

Although it looks beautiful, this function is overly complex, making it difficult to understand.

The function behaves as follows:

  • If called with a single list argument, it returns the first element of that list.
  • If called with multiple arguments, it returns the first argument.

However, it is never called with more than three arguments in the code base.

Translating this to Gleam results in more understandable code without much added verbosity.

pub type Shape = List(Int)

pub fn default_shape_fn_1(shape: Shape) -> Shape {
  case shape {
    [] -> []
    [head, ..] -> [head]
  }
}

pub fn default_shape_fn_2(shape: Shape, _: Shape) -> Shape {
  shape
}

[[no hidden complexity]]

Racket provides layers of abstraction like rename-out to override operators such as + - * / > < = (literally anything) when exporting a module. This is great for building abstractions to teach concepts or to build another new language/DSL, but such flexibility often comes with maintainability costs and cognitive burden.

The Little Learner provides different implementations for tensors. Functions and operators that appear identical have different meanings across the different tensor implementations. Operators are also overridden; when reading the codebase, I often got confused by operators like <, which sometimes compare numbers and other times compare scalars or tensors. Racket is a dynamic language, and without type annotations, checking those functions and operations can be really frustrating and confusing.

While this uniform abstraction layer is beneficial for teaching machine learning concepts, it can be challenging when examining the actual code.

In contrast, Gleam shines with its simplicity and lack of hidden complexity. Everything must be stated explicitly, making the code clean and readable. Additionally, the compiler is smart enough to perform type inference, so you usually don’t need to add type annotations everywhere.

After reading When An Alias Should Actually Be An Abbr | sean henderson, I completely abandoned aliases in fish shell and have used abbr ever since.

abbr expands the command into what it is short for before you press enter to execute it. For me, the biggest advantage is that there is no hidden complexity when searching the shell history: you don’t have to remember all the black magic that a short command name stands for.

abbr is the new alias; functions for everything else.

Records in Gleam: Comparison and Uniqueness

Record Comparison

In [[Gleam]], records are compared by value (deep nested comparison), which can present challenges when using them as dictionary keys, unlike in some other functional languages.

Records are Compared by Value

It’s important to note that Gleam doesn’t have objects in the traditional sense. All data structures, including records, are compared by value. This means that two records with identical field values will be considered equal.

To make a record unique, an additional identifier field is necessary. This approach allows for distinguishing between records that might otherwise have the same content but represent different entities.

Ensuring Uniqueness

Simple Approach: UUID Field

One straightforward method to ensure record uniqueness is to add a UUID field. However, UUID strings can be memory-intensive and CPU-costly.

Improved Approach: Erlang Reference

A more efficient alternative is to use an [[erlang reference]] as a unique identifier for records.

Erlang references are unique identifiers created by the Erlang runtime system. They have several important properties:

  1. Uniqueness: Each reference is guaranteed to be unique within an Erlang node (and even across connected nodes).
  2. Lightweight: References are very memory-efficient.
  3. Unguessable: They can’t be forged or guessed, which can be useful for security in some contexts.
  4. Erlang-specific: They are native to the BEAM VM, so they work well with Gleam, which runs on this VM.

It’s important to note that:

  • Erlang references are not persistent across program runs. If you need to save and reload your records, you’ll need to implement a serialization strategy.
  • References are not garbage collected until the object they’re associated with is no longer referenced.

Example

import gleam/erlang

pub type TensorId =
  erlang.Reference

pub type Tensor {
  ScalarTensor(value: Float, id: TensorId)
  ListTensor(List(Tensor))
}

pub fn create_scalar_tensor(value: Float) -> Tensor {
  ScalarTensor(value, erlang.make_reference())
}

pub fn create_list_tensor(tensors: List(Tensor)) -> Tensor {
  ListTensor(tensors)
}

pub fn tensor_id(tensor: Tensor) -> TensorId {
  case tensor {
    ScalarTensor(_, id) -> id
    ListTensor(_) -> erlang.make_reference()
  }
}

pub fn tensor_equal(a: Tensor, b: Tensor) -> Bool {
  tensor_id(a) == tensor_id(b)
}

import gleam/dict

pub type GradientMap =
  dict.Dict(TensorId, Float)

pub fn create_gradient_map() -> GradientMap {
  dict.new()
}

pub fn set_gradient(map: GradientMap, tensor: Tensor, gradient: Float) -> GradientMap {
  dict.insert(map, tensor_id(tensor), gradient)
}

pub fn get_gradient(map: GradientMap, tensor: Tensor) -> Result(Float, Nil) {
  dict.get(map, tensor_id(tensor))
}

This is my thought about the layers of computation in Machine Learning while porting The Little Learner to Gleam .

  1. Bit-Level Representation: At the lowest level, data is stored as bits in memory. However, direct manipulation at this level is rarely necessary in modern ML. I learned how flat binary tensors are represented in memory and how to manipulate bit arrays for performing arithmetic operations on tensors.
  2. Arithmetic Operations: Higher-level abstractions allow operations on tensor objects without manual bit manipulation. These tensors represent multi-dimensional arrays of numbers, which can be integers or floating-point values. An important concept is how operations are extended to apply to tensors of a given shape rather than just to numbers.
  3. Automatic Differentiation: This layer uses a dual data structure #(tensor, link) to connect a tensor with its generating function. This structure is crucial for backpropagation, where gradients are computed via the chain rule. Nodes in the graph represent operations, while edges represent tensors flowing between operations.
  4. Differentiable Data and Model Parameters: A tensor, by nature of its operations and structure, is differentiable. Any list or sequence of tensors adhering to the operations defined is also differentiable. At the top of the computation tier, differentiable data (which generally represent model parameters) are passed to gradient descent or other optimization functions. These are updated during training using optimization algorithms based on gradients calculated through the lower layers.
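
To make layer 3 concrete, here is a minimal reverse-mode sketch in Python (not the Gleam port): each value carries a link back to its inputs and their local gradients, and the chain rule walks those links backwards.

# Minimal sketch of the dual idea: pair a value with a link to how it was made.
class Dual:
    def __init__(self, value, parents=()):
        self.value = value        # forward result
        self.parents = parents    # (input Dual, local gradient) pairs
        self.grad = 0.0

def add(a, b):
    return Dual(a.value + b.value, parents=((a, 1.0), (b, 1.0)))

def mul(a, b):
    return Dual(a.value * b.value, parents=((a, b.value), (b, a.value)))

def backward(out):
    # Simplified: a full implementation would visit nodes in reverse
    # topological order; this suffices for the small example below.
    out.grad = 1.0
    stack = [out]
    while stack:
        node = stack.pop()
        for parent, local_grad in node.parents:
            parent.grad += node.grad * local_grad
            stack.append(parent)

x, y = Dual(3.0), Dual(2.0)
z = add(mul(x, y), y)     # z = x*y + y
backward(z)
print(x.grad, y.grad)     # 2.0 4.0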

I use a cron job to sync my [[Obsidian]] notes to a private GitHub repository every day. It’s just a simple script that does a git commit && git push, and it works mostly fine. However, the .git directory synced to iCloud significantly slows down Obsidian’s startup on mobile.

I just discovered icloud-nosync (nicolodiamante/icloud-nosync on GitHub), which can prevent the .git folder from syncing. It works by replacing the actual folder with a symbolic link and renaming the folder with a .tmp extension, which stops it from syncing to iCloud.