Rewrite of parser: Julia type<->JSON and back #143
Here's an extremely hacky way to do this. I'm going to re-do this, I think, because it relies on some weird metaprogramming and there has to be a much cleaner way.

```julia
primitive_types = Union{
    Integer, Real, AbstractString, Bool, Nothing, Missing, AbstractArray}

function Base.dump(arg; maxdepth = 10)
    # this is typically used interactively, so default to being in Main (or current active module)
    # mod = get(io, :module, PromptingTools)
    dump(IOContext(stdout), arg; maxdepth = maxdepth)
end

# Example `dump` output for a struct:
# Monster <: Any
#   name::String
#   age::Int64
#   height::Float64
#   friends::Vector{String}
```
"""
Convert a Julia type to a typed JSON schema.
https://github.com/svilupp/PromptingTools.jl/issues/143
https://www.boundaryml.com/blog/type-definition-prompting-baml
"""
function typed_json_schema(x::Type{T}) where {T}
@info "Typed JSON schema for $T" propertynames(T) fieldnames(T) T.types
# If there are no fields, return the type
if isempty(fieldnames(T))
return to_json_schema(T)
end
dump(T)
# Preallocate a mapping
mapping = Dict()
for (type, field) in zip(T.types, fieldnames(T))
mapping[field] = typed_json_schema(type)
end
# Get property names
return mapping
end
```julia
function type2dict(T)
    buffer = IOBuffer()
    dump(buffer, T)
    dumpstring = String(take!(buffer))
    lines = filter(!isempty, split(dumpstring, "\n"))
    main_type = string(T)
    mapping = Dict()
    for line in lines
        is_space = map(==(' '), collect(line))
        first_nonspace_index = findfirst(l -> !l, is_space)
        if first_nonspace_index == 1
            # This is the main type, skip it.
            continue
        elseif first_nonspace_index == 3
            # This is a field, add it to the dict. These are formatted as name::Type
            splitted = split(strip(line), "::")
            field_name = splitted[1]
            field_type = String(splitted[2])
            field_type_expr = eval(Meta.parse(field_type))
            # Lastly, check if the type is a non-primitive. If so, recursively call type2dict.
            if !(field_type_expr <: primitive_types)
                mapping[field_name] = type2dict(field_type_expr)
            else
                mapping[field_name] = to_json_type(field_type_expr)
            end
        end
    end
    return mapping
end
```
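As an aside, the field names and types that `dump` prints can also be obtained directly via reflection, without string parsing or `eval`. A minimal sketch (the helper name `fields_of` is invented for illustration, not part of the package):

```julia
# Sketch: recover field name -> field type pairs via reflection instead of
# parsing `dump` output. `fields_of` is a hypothetical helper name.
fields_of(T::Type) = Dict(string(f) => fieldtype(T, f) for f in fieldnames(T))

struct Point2D
    x::Float64
    y::Float64
end

fields_of(Point2D)  # Dict("x" => Float64, "y" => Float64)
```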
```julia
struct SimpleSingleton
    singleton_value::Int
end

struct Nested
    inside_element::SimpleSingleton
end

struct IntFloatFlat
    int_value::Int
    float_value::Float64
end

struct Monster
    name::String
    age::Int
    height::Float64
    friends::Vector{String}
    nested::Nested
    flat::IntFloatFlat
end
```
```julia
res = type2dict(Monster) |> JSON3.write |> JSON3.read |> println
```

which gets you

```json
{
    "nested": {
        "inside_element": {
            "singleton_value": "integer"
        }
    },
    "name": "string",
    "height": "number",
    "flat": {
        "float_value": "number",
        "int_value": "integer"
    },
    "age": "integer",
    "friends": "string[]"
}
```

This also required changing the `to_json_type` methods:

```julia
to_json_type(s::Type{<:AbstractString}) = "string"
to_json_type(n::Type{<:Real}) = "number"
to_json_type(n::Type{<:Integer}) = "integer"
to_json_type(b::Type{Bool}) = "boolean"
to_json_type(t::Type{<:Union{Missing, Nothing}}) = "null"
to_json_type(t::Type{T}) where {T <: AbstractArray} = to_json_type(eltype(t)) * "[]"
```

Note here that I removed the catch-all fallback. I'm not sure how we handle array types for custom structs, however; I couldn't find a quick answer in the document linked above. I.e.

```julia
struct ABunchOfVectors
    strings::Vector{String}
    ints::Vector{Int}
    floats::Vector{Float64}
    nested_vector::Vector{Nested}
end
```

does not have a well-defined type definition. Should this be

```json
{
    "strings": "string[]",
    "ints": "number[]",
    "floats": "number[]",
    "nested_vector": "??????????????????????"
}
```

Perhaps this should be something like

```json
{
    "nested_vector": [
        {
            "inside_element": {
                "singleton_value": "integer"
            }
        }
    ]
}
```

---
By chance, would you like to get a Julia integration with BAML? Then you would be able to take advantage of not just the type-schemas, but also the deserializer we have. It's built in Rust and we are just about to add support for Ruby (and soon Java). We found it can be a bit more tricky, especially with nested types; for example, there are scenarios where one wants inline type definitions, and other times when one wants types defined up top. https://www.github.com/boundaryml/baml

For some context, I am one of the authors of BAML.

---
Honestly, yes. That sounds cool to me.

---
Hi @hellovai, that sounds cool!

@cpfiffer, I had to handle the nested schemas in the Anthropic XML case (it's very similar to the "typed JSON"). I passed it as an … So we can probably just indicate the object structure and that it's a list of them. I'd try brackets around it rather than behind it, because that's how it would look if there was actually data. In general, we'll need to give guidance to people to keep their structs as flat as possible.

Btw, I think we need to have first-class support for parsing: … That should be sufficient for 99% of cases. WDYT?

---
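The "brackets around it" idea could be sketched like this (hypothetical helpers, not the package's API): render a vector-of-structs field as a one-element array wrapping the element's schema, so it looks the way real data would.

```julia
# Hypothetical sketch of "brackets around it": a Vector of structs is
# rendered as a one-element array containing the element's schema.
# `element_schema` and `array_schema` are invented names for illustration.
element_schema(::Type{T}) where {T} =
    Dict(string(f) => string(fieldtype(T, f)) for f in fieldnames(T))

array_schema(::Type{Vector{T}}) where {T} = [element_schema(T)]

struct Item
    qty::Int
end

array_schema(Vector{Item})  # [Dict("qty" => "Int64")]
```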
Shall we create some parsing "flavors", so we can support different schemas/parsing engines in parallel?

---
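One hypothetical shape for such flavors (every name below is invented, not an actual PromptingTools API): make each schema style a type and dispatch on it, so multiple engines can coexist side by side.

```julia
# Hypothetical sketch of "parsing flavors": each schema/parsing engine is a
# type, and schema generation dispatches on the flavor instance.
abstract type AbstractFlavor end
struct TypedJSONFlavor <: AbstractFlavor end
struct JSONSchemaFlavor <: AbstractFlavor end

render_schema(::TypedJSONFlavor, ::Type{T}) where {T} = "typed JSON for $(nameof(T))"
render_schema(::JSONSchemaFlavor, ::Type{T}) where {T} = "JSON Schema for $(nameof(T))"

struct Pet
    name::String
end

render_schema(TypedJSONFlavor(), Pet)   # "typed JSON for Pet"
render_schema(JSONSchemaFlavor(), Pet)  # "JSON Schema for Pet"
```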
Btw, clever to use `dump`! Is there a public interface to the functionality that it relies on?

---
Alright, sounds reasonable!

Yeah -- agreed.

Yes! Great idea.

Yeah, I'm not the biggest fan of using it either.

---
Here is another option for the parser that does not use metaprogramming:

```julia
function typed_json_schema(x::Type{T}) where {T}
    @info "Typed JSON schema for $T" propertynames(T) fieldnames(T) T.types
    # If there are no fields, return the type
    if isempty(fieldnames(T))
        # Check if this is a vector type. If so, return the type of the elements.
        if T <: AbstractArray
            # Now check if the element type is a non-primitive. If so, recursively call typed_json_schema.
            if eltype(T) <: primitive_types
                # NB: this drops the "[]" suffix, so e.g. Vector{String} renders as "string"
                return to_json_type(eltype(T))
            else
                return "List[" * JSON3.write(typed_json_schema(eltype(T))) * "]"
            end
        end
        # Check if the type is a non-primitive.
        if T <: primitive_types
            @info "Type is a primitive: $T"
            return to_json_type(T)
        else
            return typed_json_schema(T)
        end
    end
    # Preallocate a mapping
    mapping = Dict()
    for (type, field) in zip(T.types, fieldnames(T))
        mapping[field] = typed_json_schema(type)
    end
    return mapping
end
```

This works pretty well:

```julia
struct ABunchOfVectors
    strings::Vector{String}
    ints::Vector{Int}
    floats::Vector{Float64}
    nested_vector::Vector{Nested}
end

res = typed_json_schema(ABunchOfVectors) |> JSON3.write
res |> JSON3.pretty
```

has

```json
{
    "strings": "string",
    "ints": "integer",
    "nested_vector": "List[{\"inside_element\":{\"singleton_value\":\"integer\"}}]",
    "floats": "number"
}
```

Here I have opted to put the object type inside the list type, but I'm not sure how well that works empirically. An alternative is to put the type on the value side:

```julia
function typed_json_schema(x::Type{T}) where {T}
    @info "Typed JSON schema for $T" propertynames(T) fieldnames(T) T.types
    # If there are no fields, return the type
    if isempty(fieldnames(T))
        # Check if this is a vector type. If so, return the type of the elements.
        if T <: AbstractArray
            # Now check if the element type is a non-primitive. If so, recursively call typed_json_schema.
            if eltype(T) <: primitive_types
                return to_json_type(eltype(T))
            else
                return Dict("list[Object]" => JSON3.write(typed_json_schema(eltype(T))))
            end
        end
        # Check if the type is a non-primitive.
        if T <: primitive_types
            @info "Type is a primitive: $T"
            return to_json_type(T)
        else
            return typed_json_schema(T)
        end
    end
    # Preallocate a mapping
    mapping = Dict()
    for (type, field) in zip(T.types, fieldnames(T))
        mapping[field] = typed_json_schema(type)
    end
    return mapping
end
```

which gives you

```json
{
    "strings": "string",
    "ints": "integer",
    "nested_vector": {
        "list[Object]": "{\"inside_element\":{\"singleton_value\":\"integer\"}}"
    },
    "floats": "number"
}
```

---
It would be excellent if anyone wanted to look into improving the current parsing engine for structured extraction!

Currently:

Desired state:

Why JSON type (i.e., JSON structure and types instead of values, all pretty-printed)?
For open-source / smaller models, it seems to be a much better option - read more here.

Why the rewrite?
Because the current code is not very reliable, and we're desperately missing something like Pydantic in Julia.
Moreover, it's WAY TOO COMPLICATED NOW...

Look how simple producing a JSON schema can be in Python - and Julia has way better reflection and analysis tools!
(Source: https://github.com/fastai/lm-hackers/blob/main/lm-hackers.ipynb)

I'd propose splitting it into a few PRs:

- `type_representation` (open to other ideas) to allow users to use `aiextract` with JSON mode (for open-source models) and with the "type representation" of their choosing
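For comparison with the Python snippet referenced above, here is a hedged sketch of how compact a purely reflection-based Julia version could be (the helper names `to_json_type` signatures here mirror the thread; `json_repr` is invented, not a proposed API):

```julia
# Map primitive Julia types to JSON type names, as discussed in the thread.
to_json_type(::Type{<:AbstractString}) = "string"
to_json_type(::Type{<:Integer}) = "integer"
to_json_type(::Type{<:Real}) = "number"
to_json_type(::Type{Bool}) = "boolean"  # more specific than <:Integer, so it wins dispatch

# Hypothetical recursive schema builder: leaves become JSON type names,
# structs become Dicts of field name => schema.
json_repr(::Type{T}) where {T} =
    isempty(fieldnames(T)) ? to_json_type(T) :
    Dict(string(f) => json_repr(fieldtype(T, f)) for f in fieldnames(T))

struct Point
    x::Float64
    label::String
end

json_repr(Point)  # Dict("x" => "number", "label" => "string")
```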